On the biases in data

Whether we’re developing statistical models, training machine learning recognizers, or developing AI systems, we start with data. And while the suitability of a data set is, lamentably, sometimes measured by its size, it’s always important to reflect on where those data come from. Data are not neutral: the data we choose to use have a profound impact on the systems we develop. A recent article in Microsoft’s AI Blog discusses the inherent biases found in many data sets: “The people who are collecting the datasets decide that, ‘Oh this represents what men and women do, or this represents all human actions or human faces.’ These are types of decisions that are made when we create what are called datasets,” she said. “What is interesting about training datasets is that they will always bear the marks of history, that history will…
Original Post: On the biases in data

Understanding Bias in Peer Review

Posted by Andrew Tomkins, Director of Engineering and William D. Heavlin, Statistician, Google Research
In the 1600s, a series of practices came into being known collectively as the “scientific method.” These practices encoded verifiable experimentation as a path to establishing scientific fact. Scientific literature arose as a mechanism to validate and disseminate findings, and standards of scientific peer review developed as a means to control the quality of entrants into this literature. Over the course of the development of peer review, one key structural question has remained unresolved to the present day: should the reviewers of a piece of scientific work be made aware of the identity of the authors? Those in favor argue that such additional knowledge may allow the reviewer to set the work in perspective and evaluate it more completely. Those opposed argue instead that the reviewer may form an…
Original Post: Understanding Bias in Peer Review

KDnuggets™ News 17:n45, Nov 29: New Poll: Data Science Methods Used? Deep Learning Specialization: 21 Lessons Learned

Also: The 10 Statistical Techniques Data Scientists Need to Master; Did Spark Really Kill Hadoop?; A Framework for Textual Data Science.
Original Post: KDnuggets™ News 17:n45, Nov 29: New Poll: Data Science Methods Used? Deep Learning Specialization: 21 Lessons Learned

You have created your first Linear Regression Model. Have you validated the assumptions?

Linear Regression is an excellent starting point for Machine Learning, but it is a common mistake to focus only on the p-values and R-squared values when judging the validity of a model. Here we examine the underlying assumptions of Linear Regression, which need to be validated before applying the model.
Original Post: You have created your first Linear Regression Model. Have you validated the assumptions?

The 10 Statistical Techniques Data Scientists Need to Master

The author presents 10 statistical techniques which a data scientist needs to master. Build up your toolbox of data science tools by having a look at this great overview post.
Original Post: The 10 Statistical Techniques Data Scientists Need to Master

How Bayesian Networks Are Superior in Understanding Effects of Variables

Bayes Nets have remarkable properties that make them better than many traditional methods in determining variables’ effects. This article explains the principal advantages.
Original Post: How Bayesian Networks Are Superior in Understanding Effects of Variables

Calculating the house edge of a slot machine, with R

Modern slot machines (fruit machines, pokies, or whatever those electronic gambling devices are called in your part of the world) are designed to be addictive. They’re also usually quite complicated, with a bunch of features that affect the payout of a spin: multiple symbols with different pay scales, wildcards, scatter symbols, free spins, jackpots … the list goes on. Many machines also let you play multiple combinations at the same time (20 lines, or 80, or even more with just one spin). All of this complexity is designed to make it hard for you, the player, to judge the real odds of success. But rest assured: in the long run, you always lose. All slot machines are designed to have a “house edge” (the percentage of player bets retained by the machine in the long run) greater than…
Original Post: Calculating the house edge of a slot machine, with R
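
To make the "house edge" definition above concrete, here is a minimal sketch in Python (not the R code from the original post, which is not reproduced in this excerpt). The reel strip and paytable are hypothetical and far simpler than a real machine: three identical reels, one payline, and a payout only for three matching symbols. Under those assumptions the edge can be computed exactly by enumerating every outcome.

```python
from fractions import Fraction
from itertools import product

# Hypothetical reel strip: each reel stops on one of these positions with equal probability.
REEL = ["cherry", "cherry", "bar", "bar", "bar", "seven",
        "blank", "blank", "blank", "blank"]

# Hypothetical paytable: payout (in multiples of the bet) for three matching symbols.
PAYTABLE = {"cherry": 25, "bar": 20, "seven": 150}


def payout(stops):
    """Return the payout for one spin, given the symbol shown on each reel."""
    first = stops[0]
    if all(symbol == first for symbol in stops):
        return PAYTABLE.get(first, 0)  # three blanks pay nothing
    return 0


def house_edge(reel, n_reels=3):
    """Exact house edge for a 1-unit bet: 1 minus the expected payout per spin."""
    outcomes = list(product(reel, repeat=n_reels))
    expected_return = Fraction(sum(payout(o) for o in outcomes), len(outcomes))
    return 1 - expected_return


edge = house_edge(REEL)
print(f"House edge: {float(edge):.1%}")  # prints "House edge: 11.0%"
```

With these made-up numbers the machine returns 89% of every unit bet on average and keeps the remaining 11%. Real machines add wildcards, scatter symbols, free spins, and multiple lines, which is why the original post turns to a fuller calculation.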