Editor’s note: For the last few years I have made a list of awesome things that other people did (2015, 2014, 2013). Like in previous years I’m making a list, again right off the top of my head. If you know of some, you should make your own list or add it to the comments! I have also avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people’s awesome stuff. I write this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data. Thomas Lin Pedersen created the tweenr package for interpolating graphs in animations. Check out this awesome logo he made with it.…

Original Post: A Non-comprehensive List of Awesome Things Other People Did in 2016

# Statistics

## 3 methods to deal with outliers

By Alberto Quesada, Artelnics. An outlier is a data point that is distant from other similar points. Outliers may be due to variability in the measurement or may indicate experimental errors. If possible, outliers should be excluded from the data set. However, detecting those anomalous instances can be very difficult and is not always possible.

Introduction: Machine learning algorithms are very sensitive to the range and distribution of attribute values. Outliers can spoil and mislead the training process, resulting in longer training times, less accurate models and ultimately poorer results. In this article, we are going to talk about 3 different methods of dealing with outliers:

- Univariate method: This method looks for data points with extreme values on one variable.
- Multivariate method: Here we look for unusual combinations on all the variables.
- Minkowski error: This method reduces the contribution of potential outliers…

Original Post: 3 methods to deal with outliers
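As a rough illustration of two of the ideas named in the excerpt, here is a minimal Python sketch. It is not taken from the original article: the Tukey fence rule (1.5 times the interquartile range) is one common way to flag extreme values on a single variable, and the Minkowski error is written here as the mean of absolute errors raised to a power `p`, with `p = 1.5` chosen as an assumption. The function names and the toy data are hypothetical.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return the values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def minkowski_error(targets, outputs, p=1.5):
    """Mean of |target - output| ** p; p = 2 gives the mean squared error."""
    return sum(abs(t - o) ** p for t, o in zip(targets, outputs)) / len(targets)


# Toy data (made up for illustration): seven similar readings and one extreme value.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.3, 9.7, 25.0]
flagged = tukey_outliers(data)
print("flagged as outliers:", flagged)

# Compare how strongly the extreme point is penalized under each error measure
# when a hypothetical model predicts 10.0 for every instance.
predictions = [10.0] * len(data)
error_p2 = minkowski_error(data, predictions, p=2.0)
error_p15 = minkowski_error(data, predictions, p=1.5)
print("squared error:", error_p2, "Minkowski error (p=1.5):", error_p15)
```

With an exponent below 2, a large residual grows more slowly than under the squared error, so a single extreme point pulls the fit less. Flagging a point this way does not by itself show that it is an error; that still needs a look at the data.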

## The Basics of Bayesian Statistics

Bayesian Inference is a way of combining information from data with things we think we already know. For example, if we wanted to get an estimate of the mean height of people, we could use our prior knowledge that people are generally between 5 and 6 feet tall to inform the results from the data we collect. If our prior is informative and we don’t have much data, this will help us to get a better estimate. If we have a lot of data, even if the prior is wrong (say, our population is NBA players), the prior won’t change the estimate much. You might say that including such “subjective” information in a statistical model isn’t right, but there’s subjectivity in the selection of any statistical model. Bayesian Inference makes that subjectivity explicit. Bayesian Inference can seem complicated, but as…

Original Post: The Basics of Bayesian Statistics
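The height example in the excerpt can be sketched with a conjugate normal model. This is an illustrative assumption, not the method of the original post: heights are treated as normally distributed with a known spread, and the belief that people are "between 5 and 6 feet tall" is encoded as a normal prior with mean 5.5 ft and standard deviation 0.25 ft. The function name, the noise level, and the sample values are all hypothetical.

```python
def posterior_for_mean(prior_mean, prior_sd, sample, noise_sd):
    """Normal prior + normal likelihood with known noise_sd -> normal posterior.

    Returns (posterior mean, posterior standard deviation) for the unknown mean.
    """
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = len(sample) / noise_sd ** 2
    sample_mean = sum(sample) / len(sample)
    post_var = 1.0 / (prior_precision + data_precision)
    # The posterior mean is a precision-weighted average of prior and data.
    post_mean = post_var * (prior_precision * prior_mean + data_precision * sample_mean)
    return post_mean, post_var ** 0.5


# Made-up heights in feet for a population the prior gets wrong (tall players).
few_players = [6.5, 6.7, 6.6]
many_players = few_players * 100

small_mean, small_sd = posterior_for_mean(5.5, 0.25, few_players, noise_sd=0.3)
large_mean, large_sd = posterior_for_mean(5.5, 0.25, many_players, noise_sd=0.3)
print("3 observations:  ", round(small_mean, 3), "+/-", round(small_sd, 3))
print("300 observations:", round(large_mean, 3), "+/-", round(large_sd, 3))
```

With three observations the estimate sits between the prior mean and the sample mean; with three hundred it lands very close to the sample mean, which is the behavior the excerpt describes.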

## Top KDnuggets tweets, Dec 14-20: False positives versus false negatives: Best explanation ever

Most popular @KDnuggets tweets for Dec 14-20 were:

- Most Retweeted: False positives versus false negatives: Best explanation ever https://t.co/YAJW01V6ro https://t.co/4V3eILADD2
- Most Favorited: False positives versus false negatives: Best explanation ever https://t.co/YAJW01V6ro https://t.co/4V3eILADD2
- Most Viewed: #MachineLearning & #AI experts: Main Developments 2016, Key Trends 2017 https://t.co/ZccewwG8dV @AjitJaokar @dtunkelang @randal_olson https://t.co/kH2HPP6Wyl
- Most Clicked: #MachineLearning & #AI experts: Main Developments 2016, Key Trends 2017 https://t.co/GI0GLkxAY4 @xamat @pmddomingos @_brohrer_ @etzioni https://t.co/X7FpbHne4E

Top 10 most engaging Tweets

- False positives versus false negatives: Best explanation ever https://t.co/YAJW01V6ro https://t.co/4V3eILADD2
- #MachineLearning & #AI experts: Main Developments 2016, Key Trends 2017 https://t.co/GI0GLkxAY4 @xamat @pmddomingos @_brohrer_ @etzioni https://t.co/X7FpbHne4E
- Official code repository for #MachineLearning with #TensorFlow book https://t.co/P24WagesuX https://t.co/b1BFNhXcHq
- Top 10 Essential Books for the #Data Enthusiast https://t.co/MroBU0npwl https://t.co/o8vtgola37
- #DeepLearning Works Great Because the Universe, Physics and the Game of Go are Vastly Simpler than Prior Models https://t.co/KvrYHnrnCM https://t.co/qMzK0Fe8Ne
- Building Jarvis by some guy named Mark Zuckerberg via @Facebook #AI…

Original Post: Top KDnuggets tweets, Dec 14-20: False positives versus false negatives: Best explanation ever

## Predicting flu deaths with R

As Google learned, predicting the spread of influenza, even with mountains of data, is notoriously difficult. Nonetheless, bioinformatician and R user Shirin Glander has created a two-part tutorial about predicting flu deaths with R (part 2 here). The analysis is based on just 136 cases of influenza A H7N9 in China in 2013 (data provided in the outbreaks package), so the intent was not to create a generally predictive model. But by providing all of the R code and graphics, Shirin has created a useful example of real-world predictive modeling with R. The tutorial covers loading and cleaning the data (including a nice example of using the mice package to impute missing values) and begins with some exploratory data visualizations. I was particularly impressed by the use of density charts (using the stat_density2d ggplot2 function) to highlight differences in the scatterplots of flu cases ending…

Original Post: Predicting flu deaths with R

## Machine Learning vs Statistics

Machine learning is all about predictions, supervised learning, and unsupervised learning, while statistics is about samples, populations, and hypotheses. But are they actually that different? By Aatash Shah, CEO of Edvancer Eduventures. Many people wonder: what’s the difference between statistics and machine learning? Is there such a thing as machine learning vs. statistics? From a traditional data analytics standpoint, the answer to…

Original Post: Machine Learning vs Statistics

## Calculating AUC: the area under a ROC Curve

Receiver Operating Characteristic (ROC) curves are a popular way to visualize the tradeoffs between sensitivity and specificity in a binary classifier. In an earlier post, I described a simple “turtle’s eye view” of these plots: a classifier is used to sort cases in order from most to least likely to be positive, and a Logo-like turtle marches along this string…

Original Post: Calculating AUC: the area under a ROC Curve
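The “turtle’s eye view” can be sketched in a few lines of Python. This is one reading of the idea in the excerpt, not the original post’s code (which is in R): cases are sorted from highest to lowest score, the turtle steps up for each positive and right for each negative, and the area under its path, rescaled to the unit square, is the AUC. The function name and the toy scores are hypothetical, and the sketch assumes no tied scores.

```python
def auc_by_walking(labels, scores):
    """AUC from a walk over cases sorted by descending score.

    labels: 1 for positive, 0 for negative. Assumes no tied scores.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, label in ranked if label == 1)
    n_neg = len(ranked) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative case")

    height = 0  # positives seen so far (steps up)
    area = 0    # unit squares under the path so far
    for _, label in ranked:
        if label == 1:
            height += 1     # step up: one more true positive
        else:
            area += height  # step right: add a column of the current height
    return area / (n_pos * n_neg)


# Made-up classifier output for six cases.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc_by_walking(labels, scores))
```

Each step to the right adds one column whose height is the number of positives ranked above that negative, so the total equals the number of positive-negative pairs the classifier orders correctly. Dividing by the number of such pairs gives the AUC; here 8 of the 9 pairs are ordered correctly.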

## How Bayesian Inference Works

Bayesian inference isn’t magic or mystical; the concepts behind it are completely accessible. In brief, Bayesian inference lets you draw stronger conclusions from your data by folding in what you already know about the answer. Read an in-depth overview here. Get the slides. Bayesian inference is a way to get sharper predictions from your data. It’s particularly useful…

Original Post: How Bayesian Inference Works

## Trump, The Statistics of Polling, and Forecasting Home Prices

Why did polling fail in the US presidential election? The home price index offers an apt comparison inasmuch as sample selection is problematic, snagging election predictions and home price futures alike. By Joseph R. Barr, Polemicist & Statistician. Full of emotional turmoil, the ups and the downs, the topsy and the turvy, this has been an interesting election year –…

Original Post: Trump, The Statistics of Polling, and Forecasting Home Prices

## How Can Lean Six Sigma Help Machine Learning?

The data cleansing phase alone is not sufficient to ensure the accuracy of machine learning when noise or bias exists in the input data. Lean Six Sigma variance reduction can improve the accuracy of machine learning results. By Joseph Chen, Senior Management and Architect in BI, Data Warehouse, Six Sigma, and Operations Research. Introduction: I have been using Lean…

Original Post: How Can Lean Six Sigma Help Machine Learning?