# Unofficial Google Data Science

## Our quest for robust time series forecasting at scale

By ERIC TASSONE, FARZAN ROHANI

We were part of a team of data scientists in Search Infrastructure at Google that took on the task of developing robust and automatic large-scale time series forecasting for our organization. In this post, we recount how we approached the task, describing initial stakeholder needs, the business and engineering contexts in which the challenge arose, and the theoretical and pragmatic choices we made to implement our solution.

Time series forecasting enjoys a rich and luminous history, and today it is an essential element of most any business operation. So it should come as no surprise that Google has compiled and forecast time series for a long time. For instance, the image below from the Google Visitors Center in Mountain View, California, shows hand-drawn time series of “Results Pages” (essentially search query volume) dating back nearly to the founding…

## Attributing a deep network’s prediction to its input features

Editor’s note: Causal inference is central to answering questions in science, engineering and business and hence the topic has received particular attention on this blog. Typically, causal inference in data science is framed in probabilistic terms, where there is statistical uncertainty in the outcomes as well as model uncertainty about the true causal mechanism connecting inputs and outputs. And yet even when the relationship between inputs and outputs is fully known and entirely deterministic, causal inference is far from obvious for a complex system. In this post, we explore causal inference in this setting via the problem of attribution in deep networks. This investigation has practical as well as philosophical implications for causal inference. On the other hand, if you just care about understanding what a deep network is doing, this post is for you too. Deep networks have had…
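The full post develops attribution for deep networks; as a hedged illustration of the general idea only, here is a minimal sketch of one gradient-based attribution scheme: integrating gradients along a path from a baseline input to the actual input, approximated by a midpoint Riemann sum. The toy function, the helper names, and the step count are all illustrative assumptions, not code or choices taken from the post.

```python
# Hedged sketch: attribute a prediction to input features by integrating
# gradients along the straight-line path from a baseline to the input.
def integrated_gradients(f, grad_f, x, baseline, steps=100):
    attr = [0.0] * len(x)
    for k in range(1, steps + 1):
        t = (k - 0.5) / steps  # midpoint of the k-th path segment
        point = [b + t * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(len(x)):
            attr[i] += g[i] * (x[i] - baseline[i]) / steps
    return attr

f = lambda p: p[0] * p[1]        # toy "network": f(x1, x2) = x1 * x2
grad_f = lambda p: [p[1], p[0]]  # its exact gradient
attr = integrated_gradients(f, grad_f, [3.0, 2.0], [0.0, 0.0])
# Completeness check: attributions should sum (up to discretization)
# to f(x) - f(baseline).
print(attr, sum(attr), f([3.0, 2.0]))
```

A useful sanity property of this construction is "completeness": the per-feature attributions add up to the difference between the prediction at the input and at the baseline.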

Original Post: Attributing a deep network’s prediction to its input features

## Causality in machine learning

By OMKAR MURALIDHARAN, NIALL CARDIN, TODD PHILLIPS, AMIR NAJMI

Given recent advances and interest in machine learning, those of us with traditional statistical training have had occasion to ponder the similarities and differences between the fields. Many of the distinctions are due to culture and tooling, but there are also differences in thinking which run deeper. Take, for instance, how each field views the provenance of the training data when building predictive models. For most of ML, the training data is a given, often presumed to be representative of the data against which the prediction model will be deployed, but not much else. With a few notable exceptions, ML abstracts away from the data generating mechanism, and hence sees the data as raw material from which predictions are to be extracted. Indeed, machine learning generally lacks the vocabulary to capture the…

Original Post: Causality in machine learning

## Practical advice for analysis of large, complex data sets

By PATRICK RILEY

For a number of years, I led the data science team for Google Search logs. We were often asked to make sense of confusing results, measure new phenomena from logged behavior, validate analyses done by others, and interpret metrics of user behavior. Some people seemed to be naturally good at doing this kind of high quality data…

Original Post: Practical advice for analysis of large, complex data sets

## Statistics for Google Sheets

By STEVEN L. SCOTT

Big data is new and exciting, but there are still lots of small data problems in the world. Many people who are just becoming aware that they need to work with data are finding that they lack the tools to do so. The statistics app for Google Sheets hopes to change that. Editor’s note: We’ve mostly portrayed…

Original Post: Statistics for Google Sheets

## Next generation tools for data science

By DAVID ADAMS

Since inception, this blog has defined “data science” as inference derived from data too big to fit on a single computer. Thus the ability to manipulate big data is essential to our notion of data science. While MapReduce remains a fundamental tool, many interesting analyses require more than it can offer. For instance, the well-known Mantel-Haenszel estimator cannot…
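The Mantel-Haenszel estimator mentioned above pools 2x2 contingency tables across strata into a single common odds ratio, which is awkward to express as a plain MapReduce because every stratum contributes to both the numerator and the denominator of one ratio. As a hedged illustration (not code from the post, and with illustrative names and counts), a minimal sketch:

```python
# Hedged sketch: Mantel-Haenszel common odds ratio across strata.
def mantel_haenszel_or(tables):
    """tables: list of (a, b, c, d) counts per stratum, where the 2x2
    table is [[a, b], [c, d]] (exposed/unexposed x case/control)."""
    num = 0.0
    den = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        num += a * d / n  # weighted evidence for odds ratio > 1
        den += b * c / n  # weighted evidence for odds ratio < 1
    return num / den

# Two illustrative strata constructed to share a common odds ratio of 2.
tables = [(10, 5, 5, 5), (4, 2, 8, 8)]
print(mantel_haenszel_or(tables))  # -> 2.0
```

The per-stratum terms `a*d/n` and `b*c/n` are simple enough to sum in parallel, but the final division couples them, which is one flavor of the "more than MapReduce can offer" point the excerpt gestures at.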

Original Post: Next generation tools for data science

## Mind Your Units

By JEAN STEINER

Randomized A/B experiments are the gold standard for estimating causal effects. The analysis can be straightforward, especially when it’s safe to assume that individual observations of an outcome measure are independent. However, this is not always the case. When observations are not independent, an analysis that assumes independence can lead us to believe that effects are significant when…
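To make the independence point concrete, here is a hedged sketch (not from the post; the data and helper names are illustrative) contrasting a naive i.i.d. standard error with one computed at the level of the independent unit. When many correlated observations come from the same unit (say, many queries per user), the naive standard error is too small:

```python
# Hedged sketch: naive vs. unit-level standard error of a mean.
import statistics

def naive_se(observations):
    # Treats every observation as independent.
    n = len(observations)
    return statistics.stdev(observations) / n ** 0.5

def unit_level_se(groups):
    # Aggregates each unit (e.g. user) to one number first, then
    # computes the standard error over units.
    means = [statistics.mean(g) for g in groups]
    return statistics.stdev(means) / len(means) ** 0.5

# Three users, five nearly identical observations each: effectively
# three independent data points, not fifteen.
groups = [[1.0, 1.1, 0.9, 1.0, 1.0],
          [3.0, 3.1, 2.9, 3.0, 3.0],
          [5.0, 5.1, 4.9, 5.0, 5.0]]
flat = [x for g in groups for x in g]
print(naive_se(flat), unit_level_se(groups))
```

The naive standard error comes out markedly smaller than the unit-level one, which is exactly how an analysis that assumes independence overstates significance.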

Original Post: Mind Your Units

## To Balance or Not to Balance?

By IVAN DIAZ & JOSEPH KELLY

Determining the causal effects of an action (which we call treatment) on an outcome of interest is at the heart of many data analysis efforts. In an ideal world, experimentation through randomization of the treatment assignment allows the identification and consistent estimation of causal effects. In observational studies treatment is assigned by nature, therefore its mechanism is…

Original Post: To Balance or Not to Balance?

## Estimating causal effects using geo experiments

By JOUNI KERMAN, JON VAVER, and JIM KOEHLER

Randomized experiments represent the gold standard for determining the causal effects of app or website design decisions on user behavior. We might be interested in comparing, for example, different subscription offers, different versions of terms and conditions, or different user interfaces. When it comes to online ads, there is also a fundamental…

Original Post: Estimating causal effects using geo experiments

## Using Random Effects Models in Prediction Problems

By NICHOLAS A. JOHNSON, ALAN ZHAO, KAI YANG, SHENG WU, FRANK O. KUEHNEL, ALI NASIRI AMINI

In this post, we give a brief introduction to random effects models, and discuss some of their uses. Through simulation we illustrate issues with model-fitting techniques that depend on matrix factorization. Far from hypothetical, we have…
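As a hedged illustration of what a random effects model does (not code from the post; names, numbers, and the assumption of known variance components are all illustrative), the defining behavior is shrinkage: per-group means are pulled toward the grand mean, more strongly for groups with less data.

```python
# Hedged sketch: random-intercept shrinkage with known variances.
def shrunken_means(groups, tau2, sigma2):
    """groups: list of lists of observations per group.
    tau2: between-group variance; sigma2: within-group noise variance.
    Both are assumed known here; in practice they are estimated."""
    all_obs = [x for g in groups for x in g]
    grand = sum(all_obs) / len(all_obs)
    out = []
    for g in groups:
        n = len(g)
        gbar = sum(g) / n
        w = tau2 / (tau2 + sigma2 / n)  # shrinkage weight in [0, 1)
        out.append(w * gbar + (1 - w) * grand)
    return out

# A singleton group is shrunk hard toward the grand mean; a group with
# four observations keeps most of its own mean.
print(shrunken_means([[10.0], [0.0, 0.0, 0.0, 0.0]], tau2=1.0, sigma2=1.0))
```

This closed form is easy for two groups, but fitting real random effects models with many crossed groupings leads to the large sparse matrix factorizations whose pitfalls the post explores.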

Original Post: Using Random Effects Models in Prediction Problems