Monday, April 22, 2019

Statistical Significance and the Sweet Siren of Self-Confirmation: A Reply to Taylor


Just as Ulysses had himself chained to the mast of his ship so he wouldn’t succumb to the lure of the Sirens, John Ioannidis and others have argued we must bind ourselves to the discipline of statistical significance lest we fall victim to confirmation bias.  Some researchers will want to proclaim they have found earth-shaking results even if they are enveloped in noise, and others will try to dismiss genuine findings of no effect, even if that is where the data point.  The only way through the choppy seas of statistical investigation (sorry!) is to adhere unstintingly to the decision rule that everything else first depends on whether p is less than or greater than .05.

So says Timothy Taylor, citing Ioannidis:
The case for not treating statistical significance as the primary goal of an analysis seems to me ironclad. The case is strong for putting less emphasis on statistical significance and correspondingly more emphasis on issues like what data is used, the accuracy of data measurement, how the measurement corresponds to theory, the potential importance of a result, what factors may be confounding the analysis, and others. But the case for eliminating statistical significance from the language of research altogether, with the possibility that it will be replaced by an even squishier and more subjective decision process, is a harder one to make.
I don’t think Taylor understands what the issue is.  The question raised by the critique of null hypothesis statistical testing and its centerpiece, the asterisk-earning designation of statistical significance, is not whether we should compute p-values—we should certainly continue to do this or something very similar—but whether a particular cutoff like .05 should be used as a lexicographic decision rule.  As it stands, that’s the role significance plays.  If a finding holds with p < .05 it can then be examined for its provenance (data, model selection) and magnitude; if not, continuing to regard it as evidence is considered an error, or at best a sign of dubious attachment.  First check a result’s significance, then look at the rest of the story.

The attack against significance testing is about this decision rule.  I won’t repeat all the arguments for why the rule is misguided; read the Nature comment.  The only point I want to make here is that the practical effect of first categorizing all results according to how many asterisks they receive is to make every other consideration secondary.  Really, is a result from a well-designed study with a highly plausible statistical model that comes in at p = .06 less constitutive of evidence than a result from a questionable study that comes in at p = .04?
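
To see just how little separates those two cutoff-straddling results, here is a minimal numerical sketch (mine, not from the post): for a two-sided z-test, the test statistics behind p = .04, .05, and .06 are nearly indistinguishable.

```python
# A minimal sketch (not from the post): the z-statistics behind p-values just
# above and just below the .05 cutoff are nearly identical.
from scipy.stats import norm

for p in (0.04, 0.05, 0.06):
    z = norm.ppf(1 - p / 2)  # |z| that produces this two-sided p-value
    print(f"p = {p:.2f}  ->  |z| = {z:.2f}")

# p = 0.04  ->  |z| = 2.05
# p = 0.05  ->  |z| = 1.96
# p = 0.06  ->  |z| = 1.88
```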

p-values are important!  The first thing I look at is the ratio of effect size to standard deviation, but there’s so much more.  What about the sampling strategy?  What about the measurements of key variables—are they really proxies for the true variable of interest (they often are), and if so how good are they?  How much confidence should I have in the statistical model?  Is this subjective, as Taylor claims?  Yes and no.  The evaluation I make is a matter of judgment, but it can be defended or challenged on the basis of objective aspects of the study, provided the research is sufficiently documented.

There might still be a case for significance as a sorting device if there were a requirement that each piece of research produce a determinate, yes-no verdict on the question of interest.  This is the classic argument, in fact.  It is up to this particular study to make a determination on whether a hypothesized effect exists, and any significant doubt is sufficient to require a “no”.  So we set up the no-effect null, and only if we get a low enough p for a deviation from it (a low enough proportion of times we would expect to get an effect at least this large on repeated samples from a population with this dispersion if the true effect were zero) will our finding have survived the first possible “no”.  The ritual around pre-selection of the null and the cutoff criterion (critical threshold) is about protecting this all-important first test from any contamination emanating from our self-interest.  That’s what Ioannidis and Taylor are appealing to.
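
For readers who want the parenthetical definition in concrete form, here is a small simulation sketch (my own, with made-up numbers for the sample size, dispersion, and observed estimate): draw repeated samples from a population with zero true effect and count how often the estimated effect comes out at least as large as the one actually observed.

```python
# Illustrative only: the p-value as the share of repeated no-effect samples whose
# estimated effect is at least as large (in magnitude) as the observed one.
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 50, 1.0           # hypothetical sample size and population dispersion
observed_effect = 0.30       # hypothetical estimate from the actual sample

# 100,000 replications of the study under the no-effect null
null_estimates = rng.normal(0.0, sigma, size=(100_000, n)).mean(axis=1)
p_value = np.mean(np.abs(null_estimates) >= observed_effect)   # two-sided
print(f"simulated p-value under the no-effect null: {p_value:.3f}")
```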

But the researcher does not have to make a determinate decision about the research question on the basis of a single study, or even a single variant of the same study.  The evidence for a potential effect of interest, to be convincing, should not only come from well-designed, well-analyzed work; it should also, as far as possible, come from a diversity of methods and sources.  We should have simulations, large-N observational studies, and lab-style or natural experiments, all utilizing a variety of samples and analytical methods.  Even in the limiting case of a single study, every attempt should be made to generate diversity within it: partitioning into sub-samples, trying out multiple estimation models.  In that case there is no need for a binary decision rule for a single finding; what matters is the constellation of evidence over the range of findings.  Of course, the individual researcher or research team does not have to be the locus of this judgment.  But even if they are, once we have dropped the requirement for a binary decision based on a single finding, lexicographic rules that require us to ignore whole swaths of our results can only weaken the evidentiary base we rely on.
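
As a concrete (and entirely hypothetical) sketch of diversity within a single study, the snippet below estimates the same coefficient across sub-samples and specifications and reports the whole constellation of estimates rather than a single yes/no verdict; the data, variable names, and specifications are invented for illustration.

```python
# A hypothetical example: one study, several sub-samples and specifications,
# judged by the pattern of estimates rather than a single significance test.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 400
x = rng.normal(size=n)
z = rng.normal(size=n)                  # hypothetical control variable
group = rng.integers(0, 2, size=n)      # hypothetical sub-sample indicator
y = 0.2 * x + 0.5 * z + rng.normal(size=n)

specs = {
    "full sample, no controls":   (np.ones(n, dtype=bool), [x]),
    "full sample, with control":  (np.ones(n, dtype=bool), [x, z]),
    "sub-sample A, with control": (group == 0,             [x, z]),
    "sub-sample B, with control": (group == 1,             [x, z]),
}

for label, (mask, cols) in specs.items():
    X = sm.add_constant(np.column_stack(cols)[mask])
    fit = sm.OLS(y[mask], X).fit()
    b, se = fit.params[1], fit.bse[1]   # coefficient on x and its standard error
    print(f"{label:27s}  b = {b:+.3f}  (se = {se:.3f})")
```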

In practice, the demand that a study generate magic asterisks in order to see the light of publication has led to lower quality, less credible, and less reproducible research in economics as in many other fields.  It has led to exaggerated, unwarranted confidence in dubious claims and steered the profession away from questions of high importance that are difficult to resolve using available data; that is what the significance filter amounts to when it is yoked to peer review.  There is an alternative: consider the evidence substantively, its quality and diversity.  If we don’t know how to do that as a research community, we aren’t going to be rescued by an arbitrary dichotomy of p > or < .05.
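
The exaggeration effect is easy to demonstrate with a toy simulation (assumed numbers, not data from any actual literature): when the true effect is small and the data are noisy, the subset of studies that clears p < .05 reports estimates several times larger than the truth.

```python
# A rough sketch of the significance filter: with a small true effect and noisy
# data, only inflated estimates clear p < .05, so the surviving record exaggerates.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(2)
true_effect, n, n_studies = 0.1, 30, 5_000

estimates = np.empty(n_studies)
significant = np.empty(n_studies, dtype=bool)
for i in range(n_studies):
    sample = rng.normal(true_effect, 1.0, size=n)
    stat, p = ttest_1samp(sample, 0.0)
    estimates[i] = sample.mean()
    significant[i] = p < 0.05

print(f"true effect:                 {true_effect:.2f}")
print(f"mean estimate, all studies:  {estimates.mean():.2f}")
print(f"mean estimate, p < .05 only: {estimates[significant].mean():.2f}")
```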

3 comments:

rosserjb@jmu.edu said...

Peter,

You briefly scarfed over what McCloskey and Ziliak have said about this, your "effect size." The strength of the relationship is more important than the p-value.

It must be noted that while some are joining the M-Z critique of stat significance, there are some outside of economics who are effectively doubling down and criticizing Fisher's focus on 5% because it is too large a number, that the asterisks should only be dragged out for much stronger levels of significance. However, most of those people are in hard sciences where data is not nearly as noisy as it is in economics and other social sciences.

Peter Dorman said...

Hi Barkley. Actually, I didn't mention the issue of effect size vs p-value at all. I largely agree with them, but the focus of this post, and the larger current debate over significance testing, is about what constitutes evidence. I am aware of the proposal to cut the cutoff down to .005, but that strikes me as utterly wrong-headed. It doubles down on the power of the lexicographic filter, as if the only problem is that we haven't been putting enough weight on that initial significance test. No, get rid of the filter and judge the weight of evidence substantively, putting p-values in the hopper along with the other factors. Do read the Nature piece.

Peter Dorman said...

ps: As a self-disclosure, I'm one of the signers of the Nature letter.