In this post from a couple years ago I discussed the unsatisfying end of The Martian. At the time, I wrote: The ending is not terrible—at a technical level it’s somewhat satisfying (I’m not enough of a physicist to say more than that), but at the level of construction of a story arc, it didn’t really work for me. Here’s what I think of the ending. The Martian is structured as a series of challenges: one at a time, there is a difficult or seemingly insurmountable problem that the character or characters solve, or try to solve, in some way. A lot of the fun comes when the solution of problem A leads to problem B later on. It’s an excellent metaphor for life (although not stated that way in the book; one of the strengths of The Martian is…

Original Post: An improved ending for The Martian

# Decision Theory

## Died in the Wool

Garrett M. writes: I’m an analyst at an investment management firm. I read your blog daily to improve my understanding of statistics, as it’s central to the work I do. I had two (hopefully straightforward) questions related to time series analysis that I was hoping I could get your thoughts on: First, much of the work I do involves “backtesting” investment strategies, where I simulate the performance of an investment portfolio using historical data on returns. The primary summary statistics I generate from this sort of analysis are mean return (both arithmetic and geometric) and standard deviation (called “volatility” in my industry). Basically the idea is to select strategies that are likely to generate high returns given the amount of volatility they experience. However, historical market data are very noisy, with stock portfolios generating an average monthly return of around…

Original Post: Died in the Wool
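
For readers unfamiliar with the summary statistics Garrett mentions, here is a minimal sketch of how they could be computed from a series of monthly returns. The function name `summarize_returns` and the return values are made up for illustration; nothing here comes from his actual backtests.

```python
import math
import statistics


def summarize_returns(monthly_returns):
    """Return arithmetic mean, geometric mean, and volatility of simple returns.

    Hypothetical helper for illustration; returns are fractions (0.02 = 2%).
    """
    n = len(monthly_returns)
    # Arithmetic mean: plain average of the monthly returns.
    arithmetic_mean = statistics.fmean(monthly_returns)
    # Geometric mean: the constant monthly return that would give
    # the same compounded growth over the whole period.
    growth = math.prod(1.0 + r for r in monthly_returns)
    geometric_mean = growth ** (1.0 / n) - 1.0
    # "Volatility" here is the sample standard deviation of the returns.
    volatility = statistics.stdev(monthly_returns)
    return {
        "arithmetic_mean": arithmetic_mean,
        "geometric_mean": geometric_mean,
        "volatility": volatility,
    }


# Made-up monthly returns, not real market data.
example_returns = [0.02, -0.01, 0.03, -0.04, 0.05, 0.01]
summary = summarize_returns(example_returns)
print(summary)
```

The geometric mean is never larger than the arithmetic mean, and the gap grows with volatility, which is one reason both are usually reported.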

## Some natural solutions to the p-value communication problem—and why they won’t work.

John Carlin and I write: It is well known that even experienced scientists routinely misinterpret p-values in all sorts of ways, including confusion of statistical and practical significance, treating non-rejection as acceptance of the null hypothesis, and interpreting the p-value as some sort of replication probability or as the posterior probability that the null hypothesis is true. A common conceptual error is that researchers take the rejection of a straw-man null as evidence in favor of their preferred alternative. A standard mode of operation goes like this: p < 0.05 is taken as strong evidence against the null hypothesis, p > 0.15 is taken as evidence in favor of the null, and p near 0.10 is taken either as weak evidence for an effect or as evidence of a weak effect. Unfortunately, none of those inferences is generally appropriate: a…

Original Post: Some natural solutions to the p-value communication problem—and why they won’t work.

## How to interpret “p = .06” in situations where you really really want the treatment to work?

We’ve spent a lot of time during the past few years discussing the difficulty of interpreting “p less than .05” results from noisy studies. Standard practice is to just take the point estimate and confidence interval, but this is in general wrong in that it overestimates effect size (type M error) and can get the direction wrong (type S error). So what about noisy studies where the p-value is more than .05, that is, where the confidence interval includes zero? Standard practice here is to just declare this as a null effect, but of course that’s not right either, as the estimate of 0 is surely a negatively biased estimate of the magnitude of the effect. When the confidence interval includes 0, we can typically say that the data are consistent with no effect. But that doesn’t mean the true…

Original Post: How to interpret “p = .06” in situations where you really really want the treatment to work?
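
To make the type M and type S terminology concrete, here is a small simulation sketch. The true effect of 1 and standard error of 2 are made-up numbers, the function name is hypothetical, and the estimate is assumed to be normally distributed around the true effect. It shows what happens when we look only at the estimates that reach "p less than .05".

```python
import random
import statistics


def simulate_significance_filter(true_effect, std_error, n_sims=100_000, seed=1):
    """Simulate noisy estimates and keep only the statistically significant ones.

    Hypothetical sketch: estimates are drawn as Normal(true_effect, std_error).
    """
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, std_error)
        # Two-sided test at the 5% level using the normal cutoff 1.96.
        if abs(estimate) > 1.96 * std_error:
            significant.append(estimate)
    if not significant:
        raise ValueError("no significant estimates; increase n_sims")
    power = len(significant) / n_sims
    # Type M: how much the significant estimates overstate the magnitude.
    exaggeration = statistics.fmean([abs(e) for e in significant]) / abs(true_effect)
    # Type S: share of significant estimates with the wrong sign.
    type_s = sum(1 for e in significant if e * true_effect < 0) / len(significant)
    return power, exaggeration, type_s


power, exaggeration, type_s = simulate_significance_filter(true_effect=1.0, std_error=2.0)
print(f"power={power:.3f}  exaggeration ratio={exaggeration:.2f}  type S rate={type_s:.3f}")
```

With a noisy design like this one, every significant estimate is necessarily far from the true value, since it has to exceed 1.96 standard errors in absolute value just to be counted.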

## “P-hacking” and the intention-to-cheat effect

Posted by Andrew on 10 May 2017, 5:53 pm. I’m a big fan of the work of Uri Simonsohn and his collaborators, but I don’t like the term “p-hacking” because it can be taken to imply an intention to cheat. The image of p-hacking is of a researcher trying test after test on the data until reaching the magic “p less than .05.” But, as Eric Loken and I discuss in our paper on the garden of forking paths, multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. I worry that the widespread use of the term “p-hacking” gives two wrong impressions: First, it implies that the many researchers who use p-values incorrectly are cheating or “hacking,” even though I suspect they’re mostly…

Original Post: “P-hacking” and the intention-to-cheat effect

## A completely reasonable-sounding statement with which I strongly disagree

From a couple years ago: In the context of a listserv discussion about replication in psychology experiments, someone wrote: The current best estimate of the effect size is somewhere in between the original study and the replication’s reported value. This conciliatory, split-the-difference statement sounds reasonable, and it might well represent good politics in the context of a war over replications—but from a statistical perspective I strongly disagree with it, for the following reason. The original study’s estimate typically has a huge bias (due to the statistical significance filter). The estimate from the replicated study, assuming it’s a preregistered replication, is unbiased. I think in such a setting the safest course is to use the replication’s reported value as our current best estimate. That doesn’t mean that the original study is “wrong,” but it is wrong to report a biased estimate…

Original Post: A completely reasonable-sounding statement with which I strongly disagree
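
The argument above can be checked with a rough simulation. This is only a sketch under stated assumptions: the original study is "published" only if it is statistically significant, the replication is always reported and has the same standard error, and the numbers (true effect 1, standard error 2) and the function name are made up.

```python
import random
import statistics


def compare_estimates(true_effect=1.0, std_error=2.0, n_sims=100_000, seed=2):
    """Average the original and replication estimates across published studies.

    Hypothetical sketch: both estimates are Normal(true_effect, std_error),
    but the original is kept only when it passes the significance filter.
    """
    rng = random.Random(seed)
    originals, replications = [], []
    for _ in range(n_sims):
        original = rng.gauss(true_effect, std_error)
        if abs(original) > 1.96 * std_error:  # significance filter
            originals.append(original)
            # Preregistered replication: reported whatever it shows.
            replications.append(rng.gauss(true_effect, std_error))
    original_mean = statistics.fmean(originals)
    replication_mean = statistics.fmean(replications)
    midpoint = (original_mean + replication_mean) / 2.0
    return original_mean, replication_mean, midpoint


original_mean, replication_mean, midpoint = compare_estimates()
print(f"original={original_mean:.2f}  replication={replication_mean:.2f}  midpoint={midpoint:.2f}")
```

Under these assumptions the replication average sits near the true effect, while splitting the difference inherits half of the original study's bias.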

## 7th graders trained to avoid Pizzagate-style data exploration—but is the training too rigid?

Posted by Andrew on 5 May 2017, 9:51 am. Laura Kapitula writes: I wanted to share a cute story that gave me a bit of hope. My daughter who is in 7th grade was doing her science project. She had designed an experiment comparing lemon batteries to potato batteries, a 2×4 design with lemons or potatoes as one factor and number of fruits/vegetables as the other factor (1, 2, 3 or 4). She had to “preregister” her experiment with her teacher and had basically designed her experiment herself and done her analysis plan without any help from her statistician mother. Typical scientist not consulting the statistician until she was already collecting data. She was running the experiment and after she had done all her batteries and…

Original Post: 7th graders trained to avoid Pizzagate-style data exploration—but is the training too rigid?

## What hypothesis testing is all about. (Hint: It’s not what you think.)

The conventional view: Hyp testing is all about rejection. The idea is that if you reject the null hyp at the 5% level, you have a win, you have learned that a certain null model is false and science has progressed, either in the glamorous “scientific revolution” sense that you’ve rejected a central pillar of science-as-we-know-it and are forcing a radical re-evaluation of how we think about the world (those are the accomplishments of Kepler, Curie, Einstein, and . . . Daryl Bem), or in the more usual “normal science” sense in which a statistically significant finding is a small brick in the grand cathedral of science (or a stall in the scientific bazaar, whatever, I don’t give a damn what you call it), a three-yards-and-a-cloud-of-dust, all-in-a-day’s-work kind of thing, a “necessary murder” as Auden notoriously put it (and for…

Original Post: What hypothesis testing is all about. (Hint: It’s not what you think.)

## Time Inc. stoops to the level of the American Society of Human Genetics and PPNAS?

Does anyone out there know anyone at Time Inc.? If so, I have a question for you. But first the story: Mark Palko linked to an item from Barry Petchesky pointing out this article at the online site of Sports Illustrated Magazine. Here’s Petchesky: Over at Sports Illustrated, you can read an article about Tom Brady’s new line of sleepwear for A Company That Makes Stretchy Workout Stuff. The article contains the following lines: “The TB12 Sleepwear line includes full-length shirts and pants—and a short-sleeve and shorts version—with bioceramics printed on the inside.” “The print, sourced from natural minerals, activates the body’s natural heat and reflects it back as far infrared energy…” “The line, available in both men’s [link to store for purchase] and women’s [link to store for purchase] sizes, costs between $80 to $100 [link to store for…

Original Post: Time Inc. stoops to the level of the American Society of Human Genetics and PPNAS?

## We fiddle while Rome burns: p-value edition

Raghu Parthasarathy presents a wonderfully clear example of disastrous p-value-based reasoning that he saw in a conference presentation. Here’s Raghu: Consider, for example, some tumorous cells that we can treat with drugs 1 and 2, either alone or in combination. We can make measurements of growth under our various drug treatment conditions. Suppose our measurements give us the following graph: . . . from which we tell the following story: When administered on their own, drugs 1 and 2 are ineffective — tumor growth isn’t statistically different than the control cells (p > 0.05, 2 sample t-test). However, when the drugs are administered together, they clearly affect the cancer (p < 0.05); in fact, the p-value is very small (0.002!). This indicates a clear synergy between the two drugs: together they have a much stronger effect than each alone does.…

Original Post: We fiddle while Rome burns: p-value edition
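
Here is a sketch of one way the pattern in Raghu's example can arise with no synergy at all. The effect sizes and standard error are invented, and I use a normal approximation for the p-values rather than the two-sample t-test mentioned in the quote. The point is only that two purely additive effects can each be "not significant" alone and "significant" together.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a z-score under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Made-up numbers: reduction in growth relative to control, all with the
# same standard error for the difference from control.
se_diff = 0.7
effect_drug1 = 1.0
effect_drug2 = 1.0
effect_combined = effect_drug1 + effect_drug2  # purely additive, no synergy

p_drug1 = two_sided_p(effect_drug1 / se_diff)
p_drug2 = two_sided_p(effect_drug2 / se_diff)
p_combined = two_sided_p(effect_combined / se_diff)

# Synergy would show up as an interaction: combined minus the sum of the parts.
interaction = effect_combined - (effect_drug1 + effect_drug2)

print(f"drug 1 alone: p = {p_drug1:.3f}")
print(f"drug 2 alone: p = {p_drug2:.3f}")
print(f"combination:  p = {p_combined:.4f}")
print(f"interaction estimate = {interaction}")
```

To argue for synergy one would need to estimate the interaction directly, with its own uncertainty, instead of comparing which p-values fall on which side of 0.05.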