A large part of the data we data scientists are asked to analyze was not collected with the specific analysis in mind, or perhaps any particular analysis. In this space, many assumptions of classical statistics no longer hold. The data scientist working today lives in what Brad Efron has termed the “era of scientific mass production,” of which he remarks, “But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind. [1]”Statistics, as a discipline, was largely developed in a small data world. Data was expensive to gather, and therefore decisions to collect data were generally well-considered. Implicitly, there was a prior belief about some interesting causal mechanism or an underlying hypothesis motivating…

Original Post: Unintentional data

# Unofficial Google Data Science

## Fitting Bayesian structural time series with the bsts R package

by STEVEN L. SCOTTTime series data are everywhere, but time series modeling is a fairly specialized area within statistics and data science. This post describes the bsts software package, which makes it easy to fit some fairly sophisticated time series models with just a few lines of R code.Introduction Time series data appear in a surprising number of applications, ranging from business, to the physical and social sciences, to health, medicine, and engineering. Forecasting (e.g. next month’s sales) is common in problems involving time series data, but explanatory models (e.g. finding drivers of sales) are also important. Time series data are having something of a moment in the tech blogs right now, with Facebook announcing their “Prophet” system for time series forecasting (Taylor and Letham 2017), and Google posting about its forecasting system in this blog (Tassone and Rohani 2017).This…

Original Post: Fitting Bayesian structural time series with the bsts R package

## Our quest for robust time series forecasting at scale

by ERIC TASSONE, FARZAN ROHANIWe were part of a team of data scientists in Search Infrastructure at Google that took on the task of developing robust and automatic large-scale time series forecasting for our organization. In this post, we recount how we approached the task, describing initial stakeholder needs, the business and engineering contexts in which the challenge arose, and theoretical and pragmatic choices we made to implement our solution.Introduction Time series forecasting enjoys a rich and luminous history, and today is an essential element of most any business operation. So it should come as no surprise that Google has compiled and forecast time series for a long time. For instance, the image below from the Google Visitors Center in Mountain View, California, shows hand-drawn time series of “Results Pages” (essentially search query volume) dating back nearly to the founding…

Original Post: Our quest for robust time series forecasting at scale

## Attributing a deep network’s prediction to its input features

Editor’s note: Causal inference is central to answering questions in science, engineering and business and hence the topic has received particular attention on this blog. Typically, causal inference in data science is framed in probabilistic terms, where there is statistical uncertainty in the outcomes as well as model uncertainty about the true causal mechanism connecting inputs and outputs. And yet even when the relationship between inputs and outputs is fully known and entirely deterministic, causal inference is far from obvious for a complex system. In this post, we explore causal inference in this setting via the problem of attribution in deep networks. This investigation has practical as well as philosophical implications for causal inference. On the other hand, if you just care about understanding what a deep network is doing, this post is for you too. Deep networks have had…

Original Post: Attributing a deep network’s prediction to its input features

## Causality in machine learning

By OMKAR MURALIDHARAN, NIALL CARDIN, TODD PHILLIPS, AMIR NAJMIGiven recent advances and interest in machine learning, those of us with traditional statistical training have had occasion to ponder the similarities and differences between the fields. Many of the distinctions are due to culture and tooling, but there are also differences in thinking which run deeper. Take, for instance, how each field views the provenance of the training data when building predictive models. For most of ML, the training data is a given, often presumed to be representative of the data against which the prediction model will be deployed, but not much else. With a few notable exceptions, ML abstracts away from the data generating mechanism, and hence sees the data as raw material from which predictions are to be extracted. Indeed, machine learning generally lacks the vocabulary to capture the…

Original Post: Causality in machine learning

## Practical advice for analysis of large, complex data sets

By PATRICK RILEY For a number of years, I led the data science team for Google Search logs. We were often asked to make sense of confusing results, measure new phenomena from logged behavior, validate analyses done by others, and interpret metrics of user behavior. Some people seemed to be naturally good at doing this kind of high quality data…

Original Post: Practical advice for analysis of large, complex data sets

## Statistics for Google Sheets

BY STEVEN L. SCOTTBig data is new and exciting, but there are still lots of small data problems in the world. Many people who are just becoming aware that they need to work with data are finding that they lack the tools to do so. The statistics app for Google Sheets hopes to change that. Editor’s note: We’ve mostly portrayed…

Original Post: Statistics for Google Sheets

## Next generation tools for data science

By DAVID ADAMSSince inception, this blog has defined “data science” as inference derived from data too big to fit on a single computer. Thus the ability to manipulate big data is essential to our notion of data science. While MapReduce remains a fundamental tool, many interesting analyses require more than it can offer. For instance, the well-known Mantel-Haenszel estimator cannot…

Original Post: Next generation tools for data science

## Mind Your Units

By JEAN STEINERRandomized A/B experiments are the gold standard for estimating causal effects. The analysis can be straightforward, especially when it’s safe to assume that individual observations of an outcome measure are independent. However, this is not always the case. When observations are not independent, an analysis that assumes independence can lead us to believe that effects are significant when…

Original Post: Mind Your Units

## To Balance or Not to Balance?

By IVAN DIAZ & JOSEPH KELLYDetermining the causal effects of an action—which we call treatment—on an outcome of interest is at the heart of many data analysis efforts. In an ideal world, experimentation through randomization of the treatment assignment allows the identification and consistent estimation of causal effects. In observational studies treatment is assigned by nature, therefore its mechanism is…

Original Post: To Balance or Not to Balance?