Saturday, May 28, 2016

The Great P Value Controversy

This quarter I have been part of the teaching team for Research Design and Quantitative Methods, a core class in Evergreen's Master of Environmental Studies program.  Naturally, I had to include a discussion of the debate that has been swirling around the use of P values as a "significance" filter and the role of null hypothesis statistical testing in general.  Because the students have very limited backgrounds in statistics and the course ventures only a little bit beyond the introductory level, I had to simplify the material as much as possible, but that may make it useful for readers who aren't very statsy, or who have to teach others who aren't.

As background reading for this topic, students were assigned the recent statement by the American Statistical Association, along with "P Values and Statistical Practice" by Andrew Gelman, whose blog ought to be on your regular itinerary if you care about these questions.  Here are the slides that accompanied my lecture.

UPDATE: I've had a couple of late-breaking thoughts that I've incorporated into the slides.  One is that the metaphor of bioaccumulation works nicely for the tendency for chance results to concentrate in peer-reviewed journals under p-value filtration (slide 22).  The other is a more precise statement of why p-values for different results shouldn't be compared (slide 25).
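
For anyone who wants to see the bioaccumulation idea as arithmetic, here is a minimal sketch. It is not taken from the slides, and the inputs are made up for illustration: I assume 10% of the hypotheses being tested are actually true, studies have 50% power to detect a true effect, and journals publish only results with p < 0.05. The function name and parameters are my own.

```python
def share_of_chance_findings(prior_true, power, alpha=0.05):
    """Fraction of 'significant' results that are false positives.

    prior_true: assumed share of tested hypotheses that are really true
    power:      assumed chance a study detects an effect that is real
    alpha:      significance cutoff used as the publication filter
    """
    false_positives = (1 - prior_true) * alpha  # nulls that pass the filter by chance
    true_positives = prior_true * power         # real effects that pass the filter
    return false_positives / (false_positives + true_positives)


# Hypothetical inputs: 10% of hypotheses true, 50% power, p < 0.05 filter.
filtered_share = share_of_chance_findings(prior_true=0.10, power=0.50)
print(f"Share of published 'significant' results due to chance: {filtered_share:.0%}")
```

Under these assumed inputs, chance results are far more concentrated among the filtered findings than the nominal 5% cutoff suggests. Change the assumed prior or power and the concentration changes with it, which is the point of the metaphor.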

5 comments:

MaxSpeak said...

This is informative. My 'metric prof warned us about this sort of thing way back when. It's why I've always been more interested in descriptive stats than dubious hypothesis testing. Is that wrong?

Peter Dorman said...

There are two problems with staying at the level of descriptive stats, Max. One is that they tell you only about the sample and not how well the sample generalizes to the underlying population. For policy purposes, it's future samples, altered by our policies, that we care about, and the generalizations we want to make extend over time. Another is that there are often important patterns in the data whose existence -- or limitations! -- aren't visible to the naked eye. You've got to model it to figure it out.

Null hypothesis statistical testing is just one technique in modeling, and the point is that it is being abused. Some argue it has no place at all; I'm not willing to go there (yet). I think if it's done in the spirit of an aggressive challenge to a claim about how the world works it can add some value. But I've come to the view that robustness and replication are more powerful criteria.

MaxSpeak said...

In practical work and real time, all you may have is that one sample, so rather than over-generalize on the basis of some hypothesis test, to me it's more reasonable to limit oneself to description. After all, the result of one hypothesis test on one sample is just description itself, no?

Peter Dorman said...

Well, let's take an example. Suppose you're a pollster for a politician. The campaign strategy depends on whether the candidate is ahead or behind. So you do a poll and have a sample. Your guy (could be female) scores a little higher. But how sure are you of this result? You surveyed 800 people out of an electorate of millions. Yes, your descriptive stats will tell you what percent of the people you polled are in favor/opposed/don't give a shit and you can even slice and dice your sample into demographics/geography/whatever. But how much credence should you give these numbers? How likely would you be to find your candidate behind if you took another sample? That's what significance testing is for.
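
To put rough numbers on this, here is a minimal sketch. The 52% figure is hypothetical (I haven't given you a real poll), and I'm assuming a two-way race, a simple random sample, and, for the resampling step, that the true share equals what the poll found. None of that holds exactly in practice.

```python
import math
import random


def margin_of_error(share, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion."""
    return z * math.sqrt(share * (1 - share) / n)


def prob_behind_in_resample(true_share, n, trials=2000, seed=0):
    """Simulate fresh polls of size n; return the fraction showing the candidate under 50%."""
    rng = random.Random(seed)
    behind = 0
    for _ in range(trials):
        # Each respondent backs the candidate with probability true_share.
        votes = sum(rng.random() < true_share for _ in range(n))
        if votes / n < 0.5:
            behind += 1
    return behind / trials


n = 800          # sample size from the example
observed = 0.52  # hypothetical share for "your guy"

moe = margin_of_error(observed, n)
p_behind = prob_behind_in_resample(observed, n)

print(f"95% interval: {observed - moe:.1%} to {observed + moe:.1%}")
print(f"Share of simulated repeat polls showing the candidate behind: {p_behind:.1%}")
```

With 800 respondents the margin of error comes to about 3.5 percentage points, so a 52% showing is compatible with the candidate actually being behind. The simulation asks the other question directly: how often a repeat poll would flip the lead.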

Of course, published polls always report their confidence intervals (a variation on significance testing) and many of them are garbage. There are lots of other factors to consider besides sampling uncertainty. This is part of the critique. But would you want to ignore sampling uncertainty altogether?

Meanwhile, there is an important difference between descriptive stats and statistical tests. The descriptives are dependent on the real world out there and your data collection and measurement methods. Statistical tests depend on both of those plus all the modeling choices you made and, for significance testing, the conditional assumption that the null hypothesis is correct. You're adding a lot of if's, so the interpretation has to be different.

Zachary Smithingell said...

As I've studied statistics through the lens of epidemiological data analysis, I've become far more skeptical of the (over) reliance on p-values as the linchpin of statistical significance.

Rothman, one of the authors of a well-known epi textbook, warns repeatedly against relying too heavily upon p-values as the sole (or most important) measure of significance. You can imagine how important it would be to keep this in mind when conducting a drug trial or investigating correlations between risk factors and specific negative health impacts. A test that fails to achieve a predetermined measure of statistical significance may very well hold some vital "real world" significance which could literally be a matter of life and death. Clearly, holding p-values in esteem above the other statistics you obtain, and above the inherent limitations of your model, can obscure important data points.

Rothman et al. encourage epidemiologists to use estimation (confidence intervals, p-value functions) and even push for Bayesian analysis in their research; if statistical significance is achieved, well, that's fine.

The overall message of the text, which should be explicit in all stats classes, is that statistical models should all be subject to healthy skepticism. Statistical analysis is one tool in the kit of scientific inquiry, and each model is more a tree in the forest than a forest by itself. That point, I think, is too frequently missed in frequentist stats books and classes.