Tuesday, July 12, 2011

A Partial Cure for the “Is Consistent With” Syndrome

As I’ve argued in the past, economists often pretend to test their theories by identifying an outcome their model predicts and then analyzing a real-world dataset to see whether that outcome occurs. If it does, they crow about how the result “is consistent with” their theoretical musings. Of course, they use the latest and greatest econometric techniques, scrupulously avoiding Type I error (false positives) in that aspect of their work. This way they can claim that there is absolutely no chance the result they found was due to unobserved endogeneity, an inappropriate parametric assumption, or some other glitch.

Fine: but nothing in this approach addresses the far larger problem of how likely it is that this result would occur even if the theory were wrong. That’s the real issue for Type I error minimization. While there is no formal test for this problem, there is a procedure that can address it and even turn it to some advantage.

A researcher has a theory, call it X1, that can be expressed as a model of how some portion of the world works. Among other things, this theory predicts an outcome Y1 under a specified set of circumstances. There is a dataset that enables you to ascertain that these circumstances apply and to identify whether or not Y1 has arisen. How should this test be interpreted?

My proposal is simply this: the researcher should be expected to consider how many other plausible theories, X2, X3 and so on, also predict Y1. This should take the form of a section in the writeup, titled “How Unique Is This Prediction?” or something like that. If X1 is the only plausible theory that predicts, or better yet permits, Y1 (that is, if Y1 is inconsistent with every X except X1), the empirical test is critical: it puts X1 itself decisively on trial. If, however, there are other X’s that also yield Y1, the test is much weaker. A failure to observe Y1 still counts against X1, but observing Y1 supports X1 only to the extent that the rival X’s can be ruled out.
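A toy calculation makes the point concrete. This is my own illustrative sketch, not anything from the post: it assumes a uniform prior over some number of candidate theories and assumed likelihoods (0.9 for a theory that predicts Y1, 0.1 otherwise), and asks how much observing Y1 should raise our confidence in X1.

```python
# Toy Bayesian sketch (illustrative assumptions, not the post's model):
# how much does observing Y1 support X1 when rival theories also predict Y1?

def posterior_for_x1(n_theories, n_predicting_y1,
                     p_y1_if_predicted=0.9, p_y1_otherwise=0.1):
    """Posterior probability of X1 after observing Y1, with a uniform prior
    over n_theories candidates, n_predicting_y1 of which (including X1)
    predict Y1. The 0.9/0.1 likelihoods are illustrative assumptions."""
    prior = 1.0 / n_theories
    # Total probability of observing Y1, summed over all candidate theories
    evidence = (n_predicting_y1 * p_y1_if_predicted
                + (n_theories - n_predicting_y1) * p_y1_otherwise) * prior
    return prior * p_y1_if_predicted / evidence

# X1 is the only one of five theories that predicts Y1: a near-critical test.
print(round(posterior_for_x1(5, 1), 3))
# Four of the five theories predict Y1: passing the test barely moves belief.
print(round(posterior_for_x1(5, 4), 3))
```

The exact numbers matter less than the comparison: the same empirical result is strong or weak evidence for X1 depending entirely on how many rivals also permit Y1.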

The first point, then, is that this additional part of the writeup will indicate to the reader how much weight to place on the demonstration that X1 has passed the Y1 test.

The second point is perhaps even more valuable. By giving some thought to the alternative theories that also explain Y1, the researcher may notice other predictions that enable her to discriminate between them. It may be that X2 predicts Y1, but only X1 predicts both Y1 and Y2. This moves the test closer to criticality, depending on how many other X’s there are in the game. Getting into the habit of testing theories not in a vacuum but in relation to other, competing theories would be a huge advance. As a further bonus, it would push researchers in the direction of expanding their knowledge of competing theoretical traditions.

I’m going to begin making this suggestion in all theory-plus-empirical-test articles I review from now on.


Larry, The Barefoot Bum said...

Feynman beat you to it:

Cargo Cult Science

Eric Nilsson said...

I believe you're calling for a type of "non-nested hypothesis test," or something along those lines.

Such tests are occasionally done, as I've done here:

Such procedures are plausibly helpful, if one takes empirical work seriously, but few empirical workers really care about the big picture questions associated with empirical work.

Peter Dorman said...

Larry: It's OK to come in a distant second (or more) after Feynman, but in this case what Feynman is doing is expressing the view that the minimization of Type I error is at the core of science. I did not presume to be original when I wrote this same thing a while back -- quite the contrary. I was evoking, as Feynman was, the accumulated experience of how science works when it really works. If there is anything new about my post yesterday, it is that I am narrowing the discussion to just one topic that troubles economics and proposing an operationalized mitigation. Little stuff philosophically, but it would be big if economists would do it.

Eric: I think the spirit of non-nested hypothesis testing, as you've used it in your paper, looks in the direction I'm recommending. At the level of implementation, though, it's rather different. It asks whether X1 has a better fit with the data at hand than X2. This adds something to my proposal, but it leaves two things out. First, it doesn't formally consider all the potential explanations for the outcome you're interested in, and second, it doesn't search for a prediction of X2 that would be disallowed by X1 or vice versa. But if you repeat your exercise on lots of different data sets, you will have something useful, to be sure.