Is Regression Analysis Really Machine Learning?

What separates “traditional” applied statistics from machine learning? Is statistics the foundation on top of which machine learning is built? Is machine learning a superset of “traditional” statistics? Do these two concepts have a third unifying concept in common? So, in that vein… is regression analysis actually a form of machine learning?
Original Post: Is Regression Analysis Really Machine Learning?

Who is the caretaker? Evidence-based probability estimation with the bnlearn package

by Juan M. Lavista Ferres, Senior Director of Data Science at Microsoft. In what was one of the most viral episodes of 2017, political science Professor Robert E. Kelly was live on BBC World News talking about the South Korean president being forced out of office when both his kids decided to take an easy path to fame by showing up in their dad’s interview. The video immediately went viral, and the BBC reported that within five days more than 100 million people from all over the world had watched it. Many people around the globe on Facebook and Twitter, as well as reporters from reliable sources like Time.com, thought the woman who went after the children was their nanny, when in fact the woman in the video was Robert’s wife, Jung-a Kim, who is Korean. The confusion over this episode caused…
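The bnlearn package (for R) lets you encode this kind of prior knowledge and update it with evidence. Here is a minimal sketch of that workflow; the two-node structure and all probabilities below are illustrative assumptions made for this digest, not the model or figures from the original article:

```r
library(bnlearn)

# Toy network: the woman's Role (wife or nanny) influences whether she
# appears Asian on camera. All numbers are made-up priors for illustration.
dag <- model2network("[Role][Asian|Role]")

cpt_role  <- matrix(c(0.90, 0.10), ncol = 2,
                    dimnames = list(NULL, c("wife", "nanny")))
cpt_asian <- matrix(c(0.95, 0.05, 0.60, 0.40), ncol = 2,
                    dimnames = list(Asian = c("yes", "no"),
                                    Role  = c("wife", "nanny")))

fit <- custom.fit(dag, dist = list(Role = cpt_role, Asian = cpt_asian))

# Posterior probability that the woman is the nanny,
# given the evidence that she appears Asian
set.seed(1)
cpquery(fit, event = (Role == "nanny"), evidence = (Asian == "yes"))
```

Because cpquery estimates the probability by sampling, the returned value will vary slightly from run to run.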
Original Post: Who is the caretaker? Evidence-based probability estimation with the bnlearn package

An Introduction to Spatial Data Analysis and Visualization in R

The Consumer Data Research Centre, the UK-based organization that works with consumer-related organisations to open up their data resources, recently published a new course online: An Introduction to Spatial Data Analysis and Visualization in R. Created by James Cheshire (whose blog Spatial.ly regularly features interesting R-based data visualizations) and Guy Lansley, both of the University College London Department of Geography, this practical series is designed to provide an accessible introduction to techniques for handling, analysing and visualising spatial data in R. In addition to a basic introduction to R, the course covers specialized topics around handling spatial and geographic data in R, including:

- Making maps in R
- Mapping point data in R
- Using R to create, explore and interact with data maps
- Performing statistical analysis on spatial data: interpolation and kriging, spatial autocorrelation, geographically weighted regression and more.

The course, tutorials…
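To give a flavour of the kind of task the course starts from, here is a minimal sketch of mapping point data in R with ggplot2. This is not taken from the CDRC course materials; it assumes the ggplot2 and maps packages are installed, and the point locations are purely illustrative:

```r
library(ggplot2)

# Base map polygons from the maps package, via ggplot2's map_data() helper
world <- map_data("world")

# A couple of illustrative point locations to plot on top of the base map
cities <- data.frame(name = c("London", "Manchester"),
                     lon  = c(-0.1276, -2.2426),
                     lat  = c(51.5072, 53.4808))

ggplot() +
  geom_polygon(data = world, aes(x = long, y = lat, group = group),
               fill = "grey90", colour = "grey60") +
  geom_point(data = cities, aes(x = lon, y = lat), colour = "red") +
  coord_quickmap() +
  theme_minimal()
```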
Original Post: An Introduction to Spatial Data Analysis and Visualization in R

Madrid UPM Advanced Statistics and Data Mining Summer School, June 26 – July 7

The courses cover topics such as Neural Networks and Deep Learning, Bayesian Networks, Big Data with Apache Spark, Bayesian Inference, Text Mining and Time Series, and each has theoretical as well as practical classes, done with R or Python. Early bird till June 5.
Original Post: Madrid UPM Advanced Statistics and Data Mining Summer School, June 26 – July 7

Because it's Friday: Bayesian Trap

If you get a blood test to diagnose a rare disease, and the test (which is very accurate) comes back positive, what’s the chance you have the disease? Well, if “rare” means only 1 in a thousand people have the disease, and “very accurate” means the test returns the correct result 99% of the time, the answer is … just 9%. There’s less than a 1 in 10 chance you actually have the disease (which is why your doctor will likely have you tested a second time). Now that result might seem surprising, but it makes sense if you apply Bayes’ Theorem. (A simple way to think of it is that in a population of 1000 people, about 10 of the 999 healthy people will get a false positive result, plus the one person who actually has the disease will test positive. One in eleven of the positive results, or 9%,…
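The arithmetic behind that 9% is a one-line application of Bayes’ Theorem. Here is a quick sketch in R using the numbers quoted above, assuming (as the example does) that the 99% accuracy applies to both diseased and healthy patients:

```r
# Bayes' Theorem for the rare-disease example
prevalence  <- 1 / 1000
sensitivity <- 0.99   # P(test positive | disease)
specificity <- 0.99   # P(test negative | no disease)

# Total probability of a positive test: true positives plus false positives
p_positive <- sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Posterior probability of disease given a positive test
sensitivity * prevalence / p_positive   # ~0.09, i.e. roughly 9%
```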
Original Post: Because it's Friday: Bayesian Trap

Statistician ranked as best job in 2017 by CareerCast

According to job hunting site CareerCast, the best job to have in 2017 is: Statistician. This is according to their 2017 Jobs Rated report, based on an evaluation of Bureau of Labor Statistics metrics including environment, income, employment and income growth, and stress factors. In their rankings, Statistician is the role that took the top spot, with a “work environment” score of 4 out of 199 (lower is better), a stress factors score of 39, and a projected growth score of 3. The median salary is reported at USD$80,110 and projected to grow at 34% (over the 10-year horizon of the BLS projections). Also in the top ten: Data Scientist, in fifth place. This role scored similarly in work environment (12 out of 199) and stress factors (37), but had slightly lower prospects for growth (37). Here, the median salary is reported at $111,267 and…
Original Post: Statistician ranked as best job in 2017 by CareerCast

A Non-comprehensive List of Awesome Things Other People Did in 2016

Editor’s note: For the last few years I have made a list of awesome things that other people did (2015, 2014, 2013). Like in previous years, I’m making a list again, right off the top of my head. If you know of some, you should make your own list or add them to the comments! I have also avoided talking about stuff I worked on or that people here at Hopkins are doing, because this post is supposed to be about other people’s awesome stuff. I write this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data. Thomas Lin Pedersen created the tweenr package for interpolating graphs in animations. Check out this awesome logo he made with it.…
Original Post: A Non-comprehensive List of Awesome Things Other People Did in 2016

3 methods to deal with outliers

By Alberto Quesada, Artelnics. An outlier is a data point that is distant from other similar points. Outliers may be due to variability in the measurement or may indicate experimental errors. If possible, outliers should be excluded from the data set. However, detecting those anomalous instances can be very difficult, and is not always possible. Machine learning algorithms are very sensitive to the range and distribution of attribute values: data outliers can spoil and mislead the training process, resulting in longer training times, less accurate models and ultimately poorer results. In this article, we are going to talk about 3 different methods of dealing with outliers:

- Univariate method: looks for data points with extreme values on one variable (a small sketch of this idea follows below).
- Multivariate method: looks for unusual combinations across all the variables.
- Minkowski error: reduces the contribution of potential outliers…
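As a concrete illustration of the univariate method, here is a minimal sketch in R that flags points lying beyond Tukey’s fences (quartiles ± 1.5 × IQR). The 1.5 multiplier is a common convention chosen for this example, not a threshold prescribed by the article:

```r
# Univariate outlier screening with Tukey's fences
set.seed(42)
x <- c(rnorm(100), 8, -7)          # mostly well-behaved data plus two planted outliers

q   <- quantile(x, c(0.25, 0.75))  # first and third quartiles
iqr <- diff(q)                     # interquartile range

is_outlier <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
x[is_outlier]                      # the flagged values; the planted 8 and -7 will be among them
```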
Original Post: 3 methods to deal with outliers