John Carlin and I write: It is well known that even experienced scientists routinely misinterpret p-values in all sorts of ways, including confusion of statistical and practical significance, treating non-rejection as acceptance of the null hypothesis, and interpreting the p-value as some sort of replication probability or as the posterior probability that the null hypothesis is true. A common conceptual error is that researchers take the rejection of a straw-man null as evidence in favor of their preferred alternative. A standard mode of operation goes like this: p < 0.05 is taken as strong evidence against the null hypothesis, p > 0.15 is taken as evidence in favor of the null, and p near 0.10 is taken either as weak evidence for an effect or as evidence of a weak effect. Unfortunately, none of those inferences is generally appropriate: a…

Original Post: Some natural solutions to the p-value communication problem—and why they won’t work.

# Decision Theory

## How to interpret “p = .06” in situations where you really really want the treatment to work?

We’ve spent a lot of time during the past few years discussing the difficulty of interpreting “p less than .05” results from noisy studies. Standard practice is to just take the point estimate and confidence interval, but this is in general wrong in that it overestimates effect size (type M error) and can get the direction wrong (type S error). So what about noisy studies where the p-value is more than .05, that is, where the confidence interval includes zero? Standard practice here is to just declare this as a null effect, but of course that’s not right either, as the estimate of 0 is surely a negatively biased estimate of the magnitude of the effect. When the confidence interval includes 0, we can typically say that the data are consistent with no effect. But that doesn’t mean the true…

Original Post: How to interpret “p = .06” in situations where you really really want the treatment to work?

## “P-hacking” and the intention-to-cheat effect

“P-hacking” and the intention-to-cheat effect Posted by Andrew on 10 May 2017, 5:53 pm I’m a big fan of the work of Uri Simonsohn and his collaborators, but I don’t like the term “p-hacking” because it can be taken to imply an intention to cheat. The image of p-hacking is of a researcher trying test after test on the data until reaching the magic “p less than .05.” But, as Eric Loken and I discuss in our paper on the garden of forking paths, multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. I worry that the widespread use term “p-hacking” gives two wrong impressions: First, it implies that the many researchers who use p-values incorrectly are cheating or “hacking,” even though I suspect they’re mostly…

Original Post: “P-hacking” and the intention-to-cheat effect

## A completely reasonable-sounding statement with which I strongly disagree

From a couple years ago: In the context of a listserv discussion about replication in psychology experiments, someone wrote: The current best estimate of the effect size is somewhere in between the original study and the replication’s reported value. This conciliatory, split-the-difference statement sounds reasonable, and it might well represent good politics in the context of a war over replications—but from a statistical perspective I strongly disagree with it, for the following reason. The original study’s estimate typically has a huge bias (due to the statistical significance filter). The estimate from the replicated study, assuming it’s a preregistered replication, is unbiased. I think in such a setting the safest course is to use the replication’s reported value as our current best estimate. That doesn’t mean that the original study is “wrong,” but it is wrong to report a biased estimate…

Original Post: A completely reasonable-sounding statement with which I strongly disagree

## 7th graders trained to avoid Pizzagate-style data exploration—but is the training too rigid?

7th graders trained to avoid Pizzagate-style data exploration—but is the training too rigid? Posted by Andrew on 5 May 2017, 9:51 am [cat picture] Laura Kapitula writes: I wanted to share a cute story that gave me a bit of hope. My daughter who is in 7th grade was doing her science project. She had designed an experiment comparing lemon batteries to potato batteries, a 2×4 design with lemons or potatoes as one factor and number of fruits/vegetables as the other factor (1, 2, 3 or 4). She had to “preregister” her experiment with her teacher and had basically designed her experiment herself and done her analysis plan without any help from her statistician mother. Typical scientist not consulting the statistician until she was already collecting data. She was running the experiment and after she had done all her batteries and…

Original Post: 7th graders trained to avoid Pizzagate-style data exploration—but is the training too rigid?

## What hypothesis testing is all about. (Hint: It’s not what you think.)

The conventional view: Hyp testing is all about rejection. The idea is that if you reject the null hyp at the 5% level, you have a win, you have learned that a certain null model is false and science has progressed, either in the glamorous “scientific revolution” sense that you’ve rejected a central pillar of science-as-we-know-it and are forcing a radical re-evaluation of how we think about the world (those are the accomplishments of Kepler, Curie, Einstein, and . . . Daryl Bem), or in the more usual “normal science” sense in which a statistically significant finding is a small brick in the grand cathedral of science (or a stall in the scientific bazaar, whatever, I don’t give a damn what you call it), a three-yards-and-a-cloud-of-dust, all-in-a-day’s-work kind of thing, a “necessary murder” as Auden notoriously put it (and for…

Original Post: What hypothesis testing is all about. (Hint: It’s not what you think.)

## Time Inc. stoops to the level of the American Society of Human Genetics and PPNAS?

Do anyone out there know anyone at Time Inc? If so, I have a question for you. But first the story: Mark Palko linked to an item from Barry Petchesky pointing out this article at the online site of Sports Illustrated Magazine. Here’s Petchesky: Over at Sports Illustrated, you can read an article about Tom Brady’s new line of sleepwear for A Company That Makes Stretchy Workout Stuff. The article contains the following lines: “The TB12 Sleepwear line includes full-length shirts and pants—and a short-sleeve and shorts version—with bioceramics printed on the inside.” “The print, sourced from natural minerals, activates the body’s natural heat and reflects it back as far infrared energy…” “The line, available in both men’s [link to store for purchase] and women’s [link to store for purchase] sizes, costs between $80 to $100 [link to store for…

Original Post: Time Inc. stoops to the level of the American Society of Human Genetics and PPNAS?

## We fiddle while Rome burns: p-value edition

Raghu Parthasarathy presents a wonderfully clear example of disastrous p-value-based reasoning that he saw in a conference presentation. Here’s Raghu: Consider, for example, some tumorous cells that we can treat with drugs 1 and 2, either alone or in combination. We can make measurements of growth under our various drug treatment conditions. Suppose our measurements give us the following graph: . . . from which we tell the following story: When administered on their own, drugs 1 and 2 are ineffective — tumor growth isn’t statistically different than the control cells (p > 0.05, 2 sample t-test). However, when the drugs are administered together, they clearly affect the cancer (p < 0.05); in fact, the p-value is very small (0.002!). This indicates a clear synergy between the two drugs: together they have a much stronger effect than each alone does.…

Original Post: We fiddle while Rome burns: p-value edition

## “Which curve fitting model should I use?”

Oswaldo Melo writes: I have learned many of curve fitting models in the past, including their technical and mathematical details. Now I have been working on real-world problems and I face a great shortcoming: which method to use. As an example, I have to predict the demand of a product. I have a time series collected over the last 8 years. A simple set of (x,y) data about the relationship between the demand of a product on a certain week. I have this for 9 products. And to continue the study, I must predict the demand of each product for the next years. Looks easy enough, right? Since I do not have the probability distribution of the data, just use a non-parametric curve fitting algorithm. But which one? Kernel smoothing? B-splines? Wavelets? Symbolic regression? What about Fourier analysis? Neural networks?…

Original Post: “Which curve fitting model should I use?”

## Two unrelated topics in one post: (1) Teaching useful algebra classes, and (2) doing more careful psychological measurements

Kevin Lewis and Paul Alper send me so much material, I think they need their own blogs. In the meantime, I keep posting the stuff they send me, as part of my desperate effort to empty my inbox. 1. From Lewis: “Should Students Assessed as Needing Remedial Mathematics Take College-Level Quantitative Courses Instead? A Randomized Controlled Trial,” by A. W. Logue, Mari Watanabe-Rose, and Daniel Douglas, which begins: Many college students never take, or do not pass, required remedial mathematics courses theorized to increase college-level performance. Some colleges and states are therefore instituting policies allowing students to take college-level courses without first taking remedial courses. However, no experiments have compared the effectiveness of these approaches, and other data are mixed. We randomly assigned 907 students to (a) remedial elementary algebra, (b) that course with workshops, or (c) college-level statistics with…

Original Post: Two unrelated topics in one post: (1) Teaching useful algebra classes, and (2) doing more careful psychological measurements