## Statistical Significance and the Dichotomization of Evidence (McShane and Gal’s paper, with discussions by Berry, Briggs, Gelman and Carlin, and Laber and Shedden)

Posted by Andrew on 1 November 2017, 4:05 pm

Blake McShane sent along this paper by himself and David Gal, which begins: In light of recent concerns about reproducibility and replicability, the ASA issued a Statement on Statistical Significance and p-values aimed at those who are not primarily statisticians. While the ASA Statement notes that statistical significance and p-values are “commonly misused and misinterpreted,” it does not discuss and document broader implications of these errors for the interpretation of evidence. In this article, we review research on how applied researchers who are not primarily statisticians misuse and misinterpret p-values in practice and how this can lead to errors in the interpretation of evidence. We also present new data…

Original Post: Statistical Significance and the Dichotomization of Evidence (McShane and Gal’s paper, with discussions by Berry, Briggs, Gelman and Carlin, and Laber and Shedden)

# Decision Theory

## “Quality control” (rather than “hypothesis testing” or “inference” or “discovery”) as a better metaphor for the statistical processes of science

I’ve been thinking for a while that the default ways in which statisticians think about science—and which scientists think about statistics—are seriously flawed, sometimes even crippling scientific inquiry in some subfields, in the way that bad philosophy can do. Here’s what I think are some of the default modes of thought:

- Hypothesis testing, in which the purpose of data collection and analysis is to rule out a null hypothesis (typically, zero effect and zero systematic error) that nobody believes in the first place;
- Inference, which can work in the context of some well-defined problems (for example, studying trends in public opinion or estimating parameters within an agreed-upon model in pharmacology), but which doesn’t capture the idea of learning from the unexpected;
- Discovery, which sounds great but which runs aground when thinking about science as a routine process: can…

Original Post: “Quality control” (rather than “hypothesis testing” or “inference” or “discovery”) as a better metaphor for the statistical processes of science

## An improved ending for The Martian

In this post from a couple years ago I discussed the unsatisfying end of The Martian. At the time, I wrote: The ending is not terrible—at a technical level it’s somewhat satisfying (I’m not enough of a physicist to say more than that), but at the level of construction of a story arc, it didn’t really work for me. Here’s what I think of the ending. The Martian is structured as a series of challenges: one at a time, there is a difficult or seemingly insurmountable problem that the character or characters solve, or try to solve, in some way. A lot of the fun comes when the solution of problem A leads to problem B later on. It’s an excellent metaphor for life (although not stated that way in the book; one of the strengths of The Martian is…

Original Post: An improved ending for The Martian

## Died in the Wool

Garrett M. writes: I’m an analyst at an investment management firm. I read your blog daily to improve my understanding of statistics, as it’s central to the work I do. I had two (hopefully straightforward) questions related to time series analysis that I was hoping I could get your thoughts on: First, much of the work I do involves “backtesting” investment strategies, where I simulate the performance of an investment portfolio using historical data on returns. The primary summary statistics I generate from this sort of analysis are mean return (both arithmetic and geometric) and standard deviation (called “volatility” in my industry). Basically the idea is to select strategies that are likely to generate high returns given the amount of volatility they experience. However, historical market data are very noisy, with stock portfolios generating an average monthly return of around…
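The summary statistics Garrett describes can be sketched in a few lines. This is a minimal illustration, not his actual workflow; the return and volatility figures below are hypothetical, chosen only to show how the arithmetic mean, geometric mean, and volatility relate on noisy data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical monthly returns: ~0.7% mean, 4% monthly volatility, 30 years
monthly_returns = rng.normal(0.007, 0.04, size=360)

arith_mean = monthly_returns.mean()
geo_mean = np.prod(1 + monthly_returns) ** (1 / len(monthly_returns)) - 1
volatility = monthly_returns.std(ddof=1)  # "volatility" in industry usage

# The geometric mean is always below the arithmetic mean on variable data;
# the gap is roughly half the variance ("volatility drag").
print(arith_mean, geo_mean, volatility)
```

The gap between the two means is one reason backtest summaries can mislead: the statistic reported determines how much of the noise shows up as apparent performance.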

Original Post: Died in the Wool

## Some natural solutions to the p-value communication problem—and why they won’t work.

John Carlin and I write: It is well known that even experienced scientists routinely misinterpret p-values in all sorts of ways, including confusion of statistical and practical significance, treating non-rejection as acceptance of the null hypothesis, and interpreting the p-value as some sort of replication probability or as the posterior probability that the null hypothesis is true. A common conceptual error is that researchers take the rejection of a straw-man null as evidence in favor of their preferred alternative. A standard mode of operation goes like this: p < 0.05 is taken as strong evidence against the null hypothesis, p > 0.15 is taken as evidence in favor of the null, and p near 0.10 is taken either as weak evidence for an effect or as evidence of a weak effect. Unfortunately, none of those inferences is generally appropriate: a…
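The problem with that standard mode of operation can be seen in a small worked example (the effect estimates and standard errors below are hypothetical): two studies with nearly identical estimates can fall on opposite sides of the 0.05 threshold, even though the difference between them is nowhere near significant.

```python
import math

def two_sided_p(z):
    """Two-sided normal p-value via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2))

# Two hypothetical studies with nearly identical estimates and equal
# standard errors; only one crosses the 0.05 threshold.
est1, se1 = 2.0, 1.0   # p ~ 0.046: "significant"
est2, se2 = 1.5, 1.0   # p ~ 0.134: "not significant"
p1 = two_sided_p(est1 / se1)
p2 = two_sided_p(est2 / se2)

# But the difference between the estimates is small relative to its own
# standard error, so the two studies are entirely compatible.
p_diff = two_sided_p((est1 - est2) / math.sqrt(se1**2 + se2**2))
```

Reading the first study as evidence for an effect and the second as evidence against one dichotomizes what is essentially the same result.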

Original Post: Some natural solutions to the p-value communication problem—and why they won’t work.

## How to interpret “p = .06” in situations where you really really want the treatment to work?

We’ve spent a lot of time during the past few years discussing the difficulty of interpreting “p less than .05” results from noisy studies. Standard practice is to just take the point estimate and confidence interval, but this is in general wrong in that it overestimates effect size (type M error) and can get the direction wrong (type S error). So what about noisy studies where the p-value is more than .05, that is, where the confidence interval includes zero? Standard practice here is to just declare this as a null effect, but of course that’s not right either, as the estimate of 0 is surely a negatively biased estimate of the magnitude of the effect. When the confidence interval includes 0, we can typically say that the data are consistent with no effect. But that doesn’t mean the true…
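The type M and type S errors mentioned above are easy to see by simulation. The numbers below are illustrative (a small true effect measured with a standard error four times its size), not from any particular study:

```python
import random

random.seed(1)
true_effect, se = 2.0, 8.0   # hypothetical: small true effect, noisy study
estimates = [random.gauss(true_effect, se) for _ in range(100_000)]

# Condition on statistical significance: keep estimates with |z| > 1.96.
significant = [est for est in estimates if abs(est) / se > 1.96]

# Type M (magnitude) error: significant estimates exaggerate the effect.
exaggeration = sum(abs(e) for e in significant) / len(significant) / true_effect

# Type S (sign) error: some significant estimates have the wrong sign.
sign_error_rate = sum(e < 0 for e in significant) / len(significant)
```

In this setting the statistically significant estimates overstate the true effect several-fold, and a nontrivial fraction point in the wrong direction, which is why taking the point estimate at face value after a significance filter is so hazardous.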

Original Post: How to interpret “p = .06” in situations where you really really want the treatment to work?

## “P-hacking” and the intention-to-cheat effect

Posted by Andrew on 10 May 2017, 5:53 pm

I’m a big fan of the work of Uri Simonsohn and his collaborators, but I don’t like the term “p-hacking” because it can be taken to imply an intention to cheat. The image of p-hacking is of a researcher trying test after test on the data until reaching the magic “p less than .05.” But, as Eric Loken and I discuss in our paper on the garden of forking paths, multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. I worry that the widespread use of the term “p-hacking” gives two wrong impressions: First, it implies that the many researchers who use p-values incorrectly are cheating or “hacking,” even though I suspect they’re mostly…
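The forking-paths point, that multiple comparisons inflate error rates even without deliberate fishing, can be sketched with a simulation. The setup is hypothetical: suppose the data could reasonably have been analyzed in any of four ways (which outcome to emphasize, which subgroup, which adjustment), and a single analysis is chosen in a data-dependent way.

```python
import math
import random

random.seed(3)

def two_sided_p(z):
    """Two-sided normal p-value via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2))

# Under a true null effect, a researcher who honestly runs one analysis,
# but whose choice of analysis depends on the data, effectively draws from
# several potential tests. Here: four equally reasonable analysis paths.
trials = 20_000
hits = 0
for _ in range(trials):
    z_choices = [random.gauss(0, 1) for _ in range(4)]
    if min(two_sided_p(z) for z in z_choices) < 0.05:
        hits += 1

false_positive_rate = hits / trials   # roughly 1 - 0.95**4, not 0.05
```

The nominal 5% error rate nearly quadruples, and no test-after-test fishing expedition was needed, only flexibility in which analysis gets reported.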

Original Post: “P-hacking” and the intention-to-cheat effect

## A completely reasonable-sounding statement with which I strongly disagree

From a couple years ago: In the context of a listserv discussion about replication in psychology experiments, someone wrote: The current best estimate of the effect size is somewhere in between the original study and the replication’s reported value. This conciliatory, split-the-difference statement sounds reasonable, and it might well represent good politics in the context of a war over replications—but from a statistical perspective I strongly disagree with it, for the following reason. The original study’s estimate typically has a huge bias (due to the statistical significance filter). The estimate from the replicated study, assuming it’s a preregistered replication, is unbiased. I think in such a setting the safest course is to use the replication’s reported value as our current best estimate. That doesn’t mean that the original study is “wrong,” but it is wrong to report a biased estimate…
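The bias from the statistical significance filter can be demonstrated directly. The numbers here are hypothetical (true effect equal to the standard error), but the mechanism is general: original studies that reach print only when significant overestimate the effect, while preregistered replications, reported regardless of outcome, do not.

```python
import random

random.seed(2)
true_effect, se = 1.0, 1.0   # hypothetical: effect and noise of equal size

# Original studies pass through a significance filter before publication.
published = []
while len(published) < 10_000:
    est = random.gauss(true_effect, se)
    if abs(est) / se > 1.96:
        published.append(est)

# Preregistered replications report their estimate no matter what.
replications = [random.gauss(true_effect, se) for _ in range(10_000)]

mean_published = sum(published) / len(published)          # biased upward
mean_replication = sum(replications) / len(replications)  # near the truth
```

Splitting the difference between the two averages would build half of the publication bias into the "best estimate," which is why the replication's value alone is the safer choice.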

Original Post: A completely reasonable-sounding statement with which I strongly disagree

## 7th graders trained to avoid Pizzagate-style data exploration—but is the training too rigid?

Posted by Andrew on 5 May 2017, 9:51 am

Laura Kapitula writes: I wanted to share a cute story that gave me a bit of hope. My daughter, who is in 7th grade, was doing her science project. She had designed an experiment comparing lemon batteries to potato batteries, a 2×4 design with lemons or potatoes as one factor and number of fruits/vegetables as the other factor (1, 2, 3, or 4). She had to “preregister” her experiment with her teacher and had basically designed her experiment herself and done her analysis plan without any help from her statistician mother. Typical scientist, not consulting the statistician until she was already collecting data. She was running the experiment and after she had done all her batteries and…

Original Post: 7th graders trained to avoid Pizzagate-style data exploration—but is the training too rigid?

## What hypothesis testing is all about. (Hint: It’s not what you think.)

The conventional view: Hyp testing is all about rejection. The idea is that if you reject the null hyp at the 5% level, you have a win, you have learned that a certain null model is false and science has progressed, either in the glamorous “scientific revolution” sense that you’ve rejected a central pillar of science-as-we-know-it and are forcing a radical re-evaluation of how we think about the world (those are the accomplishments of Kepler, Curie, Einstein, and . . . Daryl Bem), or in the more usual “normal science” sense in which a statistically significant finding is a small brick in the grand cathedral of science (or a stall in the scientific bazaar, whatever, I don’t give a damn what you call it), a three-yards-and-a-cloud-of-dust, all-in-a-day’s-work kind of thing, a “necessary murder” as Auden notoriously put it (and for…

Original Post: What hypothesis testing is all about. (Hint: It’s not what you think.)