By OMKAR MURALIDHARAN, NIALL CARDIN, TODD PHILLIPS, AMIR NAJMI

Given recent advances and interest in machine learning, those of us with traditional statistical training have had occasion to ponder the similarities and differences between the fields. Many of the distinctions are due to culture and tooling, but there are also differences in thinking which run deeper. Take, for instance, how each field views the provenance of the training data when building predictive models. For most of ML, the training data is a given, often presumed to be representative of the data against which the prediction model will be deployed, but not much else. With a few notable exceptions, ML abstracts away from the data generating mechanism, and hence sees the data as raw material from which predictions are to be extracted. Indeed, machine learning generally lacks the vocabulary to capture the…

Original Post: Causality in machine learning

# Posts for January 2017

## Data Science for Doctors – Part 1 : Data Display

Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so they need a solid working knowledge of data science. This series aims to help people in and around the medical field improve their data science skills. We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here. This is the first part of the series, and it covers data display. Before proceeding, it might be helpful to look over the help pages for table, pie, geom_bar, coord_polar, barplot, stripchart, geom_jitter, density, geom_density, hist, geom_histogram, boxplot, geom_boxplot, qqnorm, qqline, geom_point, and plot. install.packages("ggplot2"); library(ggplot2) I have also changed the values of…

Original Post: Data Science for Doctors – Part 1 : Data Display

## Upcoming Win-Vector LLC public speaking engagements

I am happy to announce a couple of exciting upcoming Win-Vector LLC public speaking engagements.

BARUG Meetup, Tuesday February 7, 2017 ~7:50pm, Intuit, Building 20, 2600 Marine Way, Mountain View, CA. Win-Vector LLC’s John Mount will be giving a “lightning talk” (15 minutes) on R calling conventions (standard versus non-standard) and showing how to use our replyr package to greatly improve scripting or programming over dplyr. Some articles on replyr can be found here.

Strata & Hadoop World West, Tuesday March 14, 2017 1:30pm–5:00pm, San Jose Convention Center, CA, Location: LL21 C/D. Win-Vector LLC’s John Mount will teach how to use R to control big data analytics and modeling: in-depth training to prepare you to use R, Spark, sparklyr, h2o, and rsparkling. In partnership with RStudio.

Hope to see you there!

Original Post: Upcoming Win-Vector LLC public speaking engagements

## This is just a test page for forecastersblog.org

This is just a test page for forecastersblog.org. Ignore it, please. es() allows selecting between AIC (Akaike Information Criterion), AICc (Akaike Information Criterion corrected) and BIC (Bayesian Information Criterion, also known as Schwarz IC). The most basic information criterion is AIC. It is calculated for a chosen model using the formula:

\begin{equation} \label{eq:AIC}
\text{AIC} = -2 \ell \left(\theta, \hat{\sigma}^2 | Y \right) + 2k,
\end{equation}

where \(k\) is the number of parameters of the model. Without going into too much detail, the model with the smallest AIC is considered to be the closest to the true model. Obviously an IC cannot be calculated without model fitting, which implies that a human being needs to form a pool of models, then fit each of them to the data, calculate an information criterion for each of them and after that select the one model that has the lowest…
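The select-by-lowest-AIC procedure described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: it uses lm() on simulated data as a stand-in for the models that es() would fit, and the data, the candidate pool, and the helper name manual_aic are all hypothetical.

```r
# Illustrative data (hypothetical): a quadratic trend plus noise.
set.seed(42)
x <- seq(-2, 2, length.out = 100)
y <- 1 + 0.5 * x + 2 * x^2 + rnorm(100, sd = 0.5)

# Pool of candidate models, formed by hand (lm() stands in for es() here).
pool <- list(
  constant  = lm(y ~ 1),
  linear    = lm(y ~ x),
  quadratic = lm(y ~ x + I(x^2)),
  cubic     = lm(y ~ x + I(x^2) + I(x^3))
)

# AIC = -2 * log-likelihood + 2 * k, where k counts every estimated
# parameter (for lm, the coefficients plus the error variance).
manual_aic <- function(fit) {
  ll <- logLik(fit)
  -2 * as.numeric(ll) + 2 * attr(ll, "df")
}

# Compute the criterion for each fitted model and pick the smallest.
aic_values <- sapply(pool, manual_aic)
best_model <- names(which.min(aic_values))
```

The hand-computed values should agree with R's built-in AIC() for the same fits, which is a quick way to confirm that the parameter count k is what you expect.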

Original Post: This is just a test page for forecastersblog.org

## On occasion of the 10,000th R package: The eoda Top 10

R just passed another milestone: 10,000 packages on CRAN. From A3 to zyp, from ABC analysis to zero-inflated models, 10,000 R packages mean great variety and methods for almost every use case. To mark the occasion, we collected the top 10 R packages in collaboration with the ones who should know best: our data scientists. Our top 10 R packages: Hmisc: This was one of the first R packages to be used at eoda on a regular basis. Today we barely use Hmisc anymore, but it nonetheless had to be part of our top 10 simply for nostalgic reasons. data.table: R as an in-memory database with data.table? Who said R was slow? TraceR: Excellent profiling package. Will find every bottleneck. dplyr: Not only fast when it comes to evaluation but also easy to…

Original Post: On occasion of the 10,000th R package: The eoda Top 10

## Reinforcement Learning and Language Support

What is the right way to specify a program that learns from experience? Existing general-purpose programming languages are designed to facilitate the specification of any piece of software. So we can just use these programming languages for reinforcement learning, right? Sort of.

### Abstractions matter

An analogy with high performance serving might be helpful. An early influential page on high performance serving (the C10K problem by Dan Kegel) outlines several I/O strategies. I’ve tried many of them. One strategy is event-driven programming, where a core event loop monitors file descriptors for events, and then dispatches handlers. This style yields high performance servers, but is difficult to program and sensitive to programmer error. In addition to fault isolation issues (if all events are running in the same address space), this style is sensitive to whenever any event handler takes too long to execute…
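To make the event-driven style concrete, here is a toy sketch in R. It is not taken from the post: a hard-coded list of events stands in for ready file descriptors, and the names handlers, events, and run_event_loop are hypothetical. It only shows the dispatch pattern and why a slow handler stalls everything behind it.

```r
# Handlers keyed by event type; each returns a short log message.
handlers <- list(
  read  = function(payload) paste("read:", payload),
  write = function(payload) paste("wrote:", payload)
)

# A fixed queue of events standing in for descriptors reported as ready.
events <- list(
  list(type = "read",  payload = "GET /"),
  list(type = "write", payload = "200 OK"),
  list(type = "read",  payload = "GET /about")
)

# The core loop: take the next event, look up its handler, run it.
run_event_loop <- function(events, handlers) {
  log <- character(0)
  for (ev in events) {
    handler <- handlers[[ev$type]]
    # Handlers run one at a time in the same process, so a slow or
    # failing handler delays (or takes down) every later event.
    log <- c(log, handler(ev$payload))
  }
  log
}

event_log <- run_event_loop(events, handlers)
```

Because every handler shares the loop, the programmer has to keep each one short by hand, which is the sensitivity to programmer error that the paragraph above describes.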

Original Post: Reinforcement Learning and Language Support

## The “Ten Simple Rules for Reproducible Computational Research” are easy to reach for R users

“Ten Simple Rules for Reproducible Computational Research” is a freely available paper in PLOS Computational Biology. As I’m currently very interested in the subject of reproducible data analysis, I will review these ten rules and their possible implementation in R from my point of view as an epidemiologist interested in healthcare data reuse. I will also check whether my workflow complies with these rules. For those who are in a hurry, I summarised these rules and their possible implementation in R in a table at the end of this post. The author of this paper, Geir Kjetil Sandve, is an assistant professor in a Norwegian biomedical informatics lab. It is possible he wrote these rules with R in mind because he cites R in rule 7. However, biomedical informatics is quite far from my practice, and he describes other integrated frameworks allowing reproducibility. Two successive…
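The excerpt stops before the rules themselves, so the following is only a hedged sketch of how two pieces of the paper's advice (noting random seeds, and recording the versions of the software used) might look in base R. The seed value and variable names are hypothetical, and this is not the author's table of implementations.

```r
# Record the seed so the random draws can be regenerated exactly.
seed <- 20170130  # hypothetical value; store it alongside the results
set.seed(seed)
first_run <- rnorm(5)

# Re-running with the same seed reproduces the same numbers.
set.seed(seed)
second_run <- rnorm(5)
same_draws <- identical(first_run, second_run)

# Capture the R version and attached packages for the analysis record.
session <- sessionInfo()
r_version <- session$R.version$version.string
```

Writing the output of sessionInfo() to a file next to the results is one simple way to keep that record with the analysis.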

Original Post: The “Ten Simple Rules for Reproducible Computational Research” are easy to reach for R users

## CRAN now has 10,000 R packages. Here’s how to find the ones you need.

CRAN, the global repository of open-source packages that extend the capabilities of R, reached a milestone today. There are now more than 10,000 R packages available for download*. (Incidentally, that count doesn’t even include all the R packages out there. There are also another 1294 packages for genomic analysis in the Bioconductor repository, hundreds of R packages published only on GitHub, commercial R packages from vendors such as Microsoft and Oracle, and an unknowable number of private, unpublished packages.) Why so many packages? R has a very active developer community, whose members contribute new packages to CRAN on a daily basis. As a result, R is unparalleled in its capabilities for statistical computing, data science, and data visualization: almost anything you might care to do with data has likely already been implemented for you and released as an R package. There are several…

Original Post: CRAN now has 10,000 R packages. Here’s how to find the ones you need.

## January ’17 Tips and Tricks

by Sean Lopp

This month’s collection of Tips and Tricks comes from an excellent talk given at the 2017 RStudio::Conf in Orlando by RStudio Software Engineer Kevin Ushey. The slides from his talk cover features from autocompletion to R Markdown shortcuts. Enjoy!

Original Post: January ’17 Tips and Tricks

## Simulating from a specified seasonal ARIMA model

From my email today:

“You use an illustration of a seasonal arima model: ARIMA(1,1,1)(1,1,1)4 I would like to simulate data from this process then fit a model… but I am unable to find any information as to how this can be conducted… if I set phi1, Phi1, theta1, and Theta1 it would be reassuring that for large n the parameters returned by Arima(foo,order=c(1,1,1),seasonal=c(1,1,1)) are in agreement…”

My answer: Unfortunately arima.sim() won’t handle seasonal ARIMA models. I wrote simulate.Arima() to handle them, but it is designed to simulate from a fitted model rather than a specified model. However, you can use the following code to do it. It first “estimates” an ARIMA model with specified coefficients, then simulates from it.

```r
library(forecast)

# "Estimate" a model in which every coefficient is fixed at its specified value
model <- Arima(ts(rnorm(100), frequency = 4), order = c(1, 1, 1), seasonal = c(1, 1, 1),
               fixed = c(phi = 0.5, theta = -0.4, Phi = 0.3, Theta = -0.2))

# Simulate a long series from the specified model
foo <- simulate(model, nsim = 1000)

# Refit the same model structure to compare estimates with the specified values
fit <- Arima(foo, order = c(1, 1, 1), seasonal = c(1, 1, 1))
```
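For readers without the forecast package, one possible base-R workaround (not the approach given in the answer above, and offered only as a sketch) is to multiply out the seasonal and non-seasonal polynomials by hand, simulate the resulting stationary ARMA part with arima.sim(), and then undo the two differences with diffinv(). It assumes the usual R sign convention, (1 - phi B)(1 - Phi B^4) on the AR side and (1 + theta B)(1 + Theta B^4) on the MA side, with zero starting values; the variable names are hypothetical.

```r
set.seed(1)
phi <- 0.5; theta <- -0.4; Phi <- 0.3; Theta <- -0.2
s <- 4      # seasonal period
n <- 1000   # length of the stationary (differenced) series

# (1 - phi B)(1 - Phi B^s) = 1 - phi B - Phi B^s + phi*Phi B^(s+1)
ar <- c(phi, rep(0, s - 2), Phi, -phi * Phi)
# (1 + theta B)(1 + Theta B^s) = 1 + theta B + Theta B^s + theta*Theta B^(s+1)
ma <- c(theta, rep(0, s - 2), Theta, theta * Theta)

# Simulate the doubly differenced series w = (1 - B)(1 - B^s) y
w <- as.numeric(arima.sim(model = list(ar = ar, ma = ma), n = n))

# Invert the seasonal difference, then the first difference (zero initial values)
y <- diffinv(diffinv(w, lag = s), lag = 1)
sim <- ts(y, frequency = s)
```

Differencing sim once at lag 4 and once at lag 1 recovers w exactly, which is a cheap sanity check before fitting a seasonal model to the simulated series.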

Original Post: Simulating from a specified seasonal ARIMA model