Understanding overfitting: an inaccurate meme in supervised learning

“Applying cross-validation prevents overfitting” is a popular meme, but it is not actually true; it is more of an urban legend. We examine what is true and how overfitting differs from overtraining.
Original Post: Understanding overfitting: an inaccurate meme in supervised learning

Making Predictive Models Robust: Holdout vs Cross-Validation

The validation step helps you find the best parameters for your predictive model and helps prevent overfitting. We examine the pros and cons of two popular validation strategies: the hold-out strategy and k-fold cross-validation.
Original Post: Making Predictive Models Robust: Holdout vs Cross-Validation
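
As a rough illustration of the two strategies named in the excerpt above, here is a minimal sketch using only the Python standard library. The function names `holdout_split` and `k_fold_splits`, the sample count, and the split sizes are assumptions chosen for illustration; they are not taken from the original post.

```python
import random


def holdout_split(n_samples, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) for one random hold-out split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(round(n_samples * test_fraction)))
    # The first n_test shuffled indices are held out; the rest are for training.
    return indices[n_test:], indices[:n_test]


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx


# Hypothetical usage on a dataset of 10 samples.
train_idx, test_idx = holdout_split(10, test_fraction=0.2)
folds = list(k_fold_splits(10, k=5))
```

The hold-out split scores the model once on a single held-out subset, while k-fold rotates the validation role across all folds, trading extra computation for an estimate that depends less on one particular split.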

Sound Data Science: Avoiding the Most Pernicious Prediction Pitfall

In this excerpt from the updated edition of Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die, Revised and Updated Edition, I show that, although data science and predictive analytics’ explosive popularity promises meteoric value, a common misapplication readily backfires. The number crunching only delivers if a fundamental, yet often omitted, failsafe is applied.
Prediction is booming. Data scientists have the “sexiest job of the 21st century” (as Professor Thomas Davenport and US Chief Data Scientist D.J. Patil declared in 2012). Fueled by the data tsunami, we’ve entered a golden age of predictive discoveries. A frenzy of analysis churns out a bonanza of colorful, valuable, and sometimes surprising insights:[i]
• People who “like” curly fries on Facebook are more intelligent.
• Typing with proper capitalization indicates creditworthiness.
• Users of the Chrome and Firefox browsers make better employees.…
Original Post: Sound Data Science: Avoiding the Most Pernicious Prediction Pitfall

4 Reasons Your Machine Learning Model is Wrong (and How to Fix It)

By Bilal Mahmood, Bolt. There are a number of machine learning models to choose from. We can use Linear Regression to predict a value, Logistic Regression to classify distinct outcomes, and Neural Networks to model non-linear behaviors. When we build these models, we always use a set of historical data to help our machine learning algorithms learn the relationship between a set of input features and a predicted output. But even if a model can accurately predict a value from historical data, how do we know it will work as well on new data? Or more plainly, how do we evaluate whether a machine learning model is actually “good”? In this post we’ll walk through some common scenarios where a seemingly good machine learning model may still be wrong. We’ll show how you can evaluate these issues by assessing metrics of bias vs.…
Original Post: 4 Reasons Your Machine Learning Model is Wrong (and How to Fix It)

Data Science Basics: 3 Insights for Beginners

For data science beginners, 3 elementary issues are given overview treatment: supervised vs. unsupervised learning, decision tree pruning, and training vs. testing datasets. 1. What is the difference between supervised and unsupervised learning? In supervised learning, the learning algorithm is provided outcome data in advance, in the form of a pre-labeled set of instances. It is from this set that…
Original Post: Data Science Basics: 3 Insights for Beginners

A Neat Trick to Increase Robustness of Regression Models

Tags: CleverTap, Linear Regression, Outliers, Overfitting, Regression
Read this take on the validity of choosing a different approach to regression modeling. Why isn’t the L1 norm used more often? By Jacob Joseph, CleverTap. The first predictive model that an analyst encounters is Linear Regression. A linear regression line has an equation of the form, where X =…
Original Post: A Neat Trick to Increase Robustness of Regression Models

The Fallacy of Seeing Patterns

Tags: Analysis, Bias, CleverTap, Correlation, Overfitting, Sampling
Analysts are often on the lookout for patterns, and often end up relying on spurious ones. This post looks at some spurious patterns in univariate, bivariate & multivariate analysis. By Jacob Joseph, CleverTap. Human beings try to find patterns to explain the reason behind almost every phenomenon, but…
Original Post: The Fallacy of Seeing Patterns

Data Mining Most Vexing Problem Solved, or is this drug REALLY working?

Tags: Bonferroni, Francois Petitjean, KDD-2016, Overfitting
This is a summary of the basic principle behind a new paper on multiple test correction for streams and cascades of statistical hypothesis tests, showing how to strictly control the risk of making a mistake over a series of tests and draw appropriate conclusions. François…
Original Post: Data Mining Most Vexing Problem Solved, or is this drug REALLY working?
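
To make the idea of multiple test correction concrete, here is a minimal sketch of the classic Bonferroni correction for a fixed batch of tests. This is a standard baseline, not the streaming and cascade method described in the paper summarized above. The function name `bonferroni_reject` and the p-values are hypothetical, chosen only for illustration.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Flag which hypotheses are rejected under the Bonferroni correction.

    Each p-value is compared against alpha / m, where m is the number of
    tests, which bounds the chance of any false rejection at alpha.
    """
    m = len(p_values)
    threshold = alpha / m
    return [p <= threshold for p in p_values]


# Hypothetical p-values from four separate tests (illustration only).
p_values = [0.001, 0.02, 0.04, 0.3]

# Without correction, every p-value at or below alpha counts as significant.
naive = [p <= 0.05 for p in p_values]

# With correction, the per-test threshold is 0.05 / 4 = 0.0125.
corrected = bonferroni_reject(p_values, alpha=0.05)
```

With these made-up numbers, three tests look significant without correction but only the first survives it, which is the kind of over-optimistic conclusion a correction guards against.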

The “Thinking” Part of “Thinking Like A Data Scientist”

Tags: Data Scientist, Overfitting, P-value
People have a tendency to blindly trust claims from any source that they deem credible, whether or not it conflicts with their own experiences or common sense. Basic stats – common sense = dangerous conclusions viewed as fact. By William Schmarzo, EMC. Imagine my surprise when reading the…
Original Post: The “Thinking” Part of “Thinking Like A Data Scientist”