Our quest for robust time series forecasting at scale

by ERIC TASSONE, FARZAN ROHANIWe were part of a team of data scientists in Search Infrastructure at Google that took on the task of developing robust and automatic large-scale time series forecasting for our organization. In this post, we recount how we approached the task, describing initial stakeholder needs, the business and engineering contexts in which the challenge arose, and theoretical and pragmatic choices we made to implement our solution.Introduction Time series forecasting enjoys a rich and luminous history, and today is an essential element of most any business operation. So it should come as no surprise that Google has compiled and forecast time series for a long time. For instance, the image below from the Google Visitors Center in Mountain View, California, shows hand-drawn time series of “Results Pages” (essentially search query volume) dating back nearly to the founding…
Original Post: Our quest for robust time series forecasting at scale

Attributing a deep network’s prediction to its input features

Editor’s note: Causal inference is central to answering questions in science, engineering and business and hence the topic has received particular attention on this blog. Typically, causal inference in data science is framed in probabilistic terms, where there is statistical uncertainty in the outcomes as well as model uncertainty about the true causal mechanism connecting inputs and outputs. And yet even when the relationship between inputs and outputs is fully known and entirely deterministic, causal inference is far from obvious for a complex system. In this post, we explore causal inference in this setting via the problem of attribution in deep networks. This investigation has practical as well as philosophical implications for causal inference. On the other hand, if you just care about understanding what a deep network is doing, this post is for you too. Deep networks have had…
Original Post: Attributing a deep network’s prediction to its input features

Causality in machine learning

By OMKAR MURALIDHARAN, NIALL CARDIN, TODD PHILLIPS, AMIR NAJMIGiven recent advances and interest in machine learning, those of us with traditional statistical training have had occasion to ponder the similarities and differences between the fields. Many of the distinctions are due to culture and tooling, but there are also differences in thinking which run deeper. Take, for instance, how each field views the provenance of the training data when building predictive models. For most of ML, the training data is a given, often presumed to be representative of the data against which the prediction model will be deployed, but not much else. With a few notable exceptions, ML abstracts away from the data generating mechanism, and hence sees the data as raw material from which predictions are to be extracted. Indeed, machine learning generally lacks the vocabulary to capture the…
Original Post: Causality in machine learning

Data Science for Doctors – Part 1 : Data Display

Data science enhances people’s decision making. Doctors and researchers are making critical decisions every day, so it is obvious that those people must have a substantial knowledge of data science. This series aims to help people that are around medical field to enhance their data science skills. We will work with a health related database the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here. This is the first part of the series, it is going to be about data display. Before proceeding, it might be helpful to look over the help pages for the table, pie, geom_bar , coord_polar, barplot, stripchart, geom_jitter, density, geom_density, hist, geom_histogram, boxplot, geom_boxplot, qqnorm, qqline, geom_point, plot, qqline, geom_point . install.packages(‘ggplot2’)library(ggplot) I have also changed the values of…
Original Post: Data Science for Doctors – Part 1 : Data Display

Upcoming Win-Vector LLC public speaking engagements

I am happy to announce a couple of exciting upcoming Win-Vector LLC public speaking engagements. BARUG Meetup Tuesday, Tuesday February 7, 2017 ~7:50pm, Intuit, Building 20, 2600 Marine Way, Mountain View, CA. Win-Vector LLC’s John Mount will be giving a “lightning talk” (15 minutes) on R calling conventions (standard versus non-standard) and showing how to use our replyr package to greatly improve scripting or programming over dplyr. Some articles on replyr can be found here. Strata & Hadoop World West, Tuesday March 14, 2017 1:30pm–5:00pm, San Jose Convention Center, CA, Location: LL21 C/D. Win-Vector LLC’s John Mount will teach how to use R to control big data analytics and modeling. In depth training to prepare you to use R, Spark, sparklyr, h2o, and rsparkling. In partnership with RStudio. Hope to see you there! Related To leave a comment…
Original Post: Upcoming Win-Vector LLC public speaking engagements

This is just a test page for forecastersblog.org

This is just a test page for forecastersblog.org. Ignore it, please. es() allows selecting between AIC (Akaike Information Criterion), AICc (Akaike Information Criterion corrected) and BIC (Bayesian Information Criterion, also known as Schwarz IC). The very basic information criterion is AIC. It is calculated for a chosen model using formula:begin{equation} label{eq:AIC}text{AIC} = -2 ell left(theta, hat{sigma}^2 | Y right) + 2k,end{equation}where (k) is number of parameters of the model. Not going too much into details, the model with the smallest AIC is considered to be the closest to the true model. Obviously IC cannot be calculated without model fitting, which implies that a human being needs to form a pool of models, then fit each of them to the data, calculate an information criterion for each of them and after that select the one model that has the lowest…
Original Post: This is just a test page for forecastersblog.org

On occasion of the 10,000th R package: The eoda Top 10

R just passed another milestone: 10,000 packages on CRAN. From A3 to zyp, from ABC analysis to zero-inflated models – 10,000 R packages mean great variety and methods for almost every use case. On occasion of this event we collected the top 10 R packages in collaboration with the ones who should know best: our data scientists. The eoda Top 10 R-PackagesOur Top 10 R packages Hmisc: This was one of the first R packages to be used at eoda on a regular basis. Today we barely use Hmisc anymore but nonetheless it had to be part of our top 10 simply for nostalgic reasons. data.table: R as an in-memory database with data.table? Who said R was slow? TraceR: Excellent profiling package. Will find every bottleneck. dplyR: Not only fast when it comes to evaluation but also easy to…
Original Post: On occasion of the 10,000th R package: The eoda Top 10

Reinforcement Learning and Language Support

What is the right way to specify a program that learns from experience? Existing general-purpose programming languages are designed to facilitate the specification of any piece of software. So we can just use these programming languages for reinforcement learning, right? Sort of.Abstractions matter An analogy with high performance serving might be helpful. An early influential page on high performance serving (the C10K problem by Dan Kegel) outlines several I/O strategies. I’ve tried many of them. One strategy is event-driven programming, where a core event loop monitors file descriptors for events, and then dispatches handlers. This style yields high performance servers, but is difficult to program and sensitive to programmer error. In addition to fault isolation issues (if all event are running in the same address space), this style is sensitive to whenever any event handler takes too long to execute…
Original Post: Reinforcement Learning and Language Support

The “Ten Simple Rules for Reproducible Computational Research” are easy to reach for R users

“Ten Simple Rules for Reproducible Computational Research” is afreely available paper on PLOS computational biology. As I’m currently very interested on the subject of reproducible data analysis, I will these ten rules and the possible implementation in R with my point of view of epidemiologist interested in healthcare data reuse. I will also check if my workflow comply with these rules. For those who are in a hurry, I summarised these rules and possible implementation in R in a table at the end of this post. The author of this paper, Geir Kjetil Sandve, is assistant-professor in a Norwegian biomedical informatics lab. It is possible he wrote these rules with R in mind because he cites R in rule 7. However, biomedical informatics is quite far from my practice and he describes other integrated framework allowing reproducibility. Two successive…
Original Post: The “Ten Simple Rules for Reproducible Computational Research” are easy to reach for R users

CRAN now has 10,000 R packages. Here’s how to find the ones you need.

CRAN, the global repository of open-source packages that extend the capabiltiies of R, reached a milestone today. There are now more than 10,000 R packages available for download*. (Incidentally, that count doesn’t even include all the R packages out there. There are also another 1294 packages for genomic analysis in the BioConductor repository, hundreds of R packages published only on GitHub, commercial R packages from vendors such as Microsoft and Oracle, and an unknowable number of private, unpublished packages.) Why so many packages? R has a very active developer community, who contribute new packages to CRAN on a daily basis. As a result, R is unparalleled in its capabilities for statistical computing, data science, and data visualization: almost anything you might care to do with data has likely already been implemented for you and released as an R package. There are several…
Original Post: CRAN now has 10,000 R packages. Here’s how to find the ones you need.