Wednesday, May 7, 2014

Regression Analysis and the Tyranny of Average Effects

What follows is a summary of a mini-lecture I gave to my statistics students this morning.  (I apologize for the unwillingness of Blogger to give me subscripts.)

You may feel a gnawing discomfort with the way economists use statistical techniques.  Ostensibly they focus on the differences between people, countries or whatever the units of observation happen to be, but they nevertheless seem to treat the population of cases as interchangeable, as homogeneous on some fundamental level.  As if people were replicants.

You are right, and this brief talk is about why and how you’re right, and what this implies for the questions people bring to statistical analysis and the methods they use.

Our point of departure will be a simple multiple regression model of the form

y = β0 + β1 x1 + β2 x2 + ... + ε

where y is an outcome variable, x1 is an explanatory variable of interest, the other x’s are control variables, β0 is a constant term, the other β’s are coefficients on these variables, and ε is the error term (estimated by the residuals).  We could apply the same analysis to more complex functional forms, and we would see the same things, so let’s stay simple.

What question does this model answer?  It tells us the average effect that variations in x1 have on the outcome y, controlling for the effects of other explanatory variables.  Repeat: it’s the average effect of x1 on y.
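To make "average effect" concrete, here is a minimal sketch in Python with made-up numbers.  The data, the two kinds of units and the helper `fit_line` are all hypothetical, invented for illustration.  Half the units respond strongly to x and half not at all; a single regression over the pooled sample reports one slope for everybody.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical data: both kinds of units take the same x values.
xs = list(range(10))
ys_responsive = [1 + 2 * x for x in xs]    # true effect of x is 2
ys_unresponsive = [1 + 0 * x for x in xs]  # true effect of x is 0

# One regression over the pooled sample, as in the model above.
b0, b1 = fit_line(xs + xs, ys_responsive + ys_unresponsive)
print(f"pooled slope = {b1:.2f}")  # prints "pooled slope = 1.00"
```

The estimate of 1 describes no unit in the sample.  It equals the simple average of the two true effects here only because both groups are the same size and share the same x values; in general the weighting is less transparent.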

This model is applied to a sample of observations.  What is assumed to be the same for these observations?  (1) The outcome variable y is meaningful for all of them.  (2) The list of potential explanatory factors, the x’s, is the same for all.  (3) The effects these factors have on the outcome, the β’s, are the same for all.  (4) The proper functional form that best explains the outcome is the same for all.  In these four respects all units of observation are regarded as essentially the same.

Now what is permitted to differ across these observations?  Simply the values of the x’s and therefore the values of y and ε.  That’s it.

Thus measures of the difference between individual people or other objects of study are purchased at the cost of immense assumptions of sameness.  It is these assumptions that both reflect and justify the search for average effects.

Well, this is a bit harsh.  In practice, one can relax these assumptions a bit.  The main way this is done is by interacting your x’s.  If x1 is years of education and x2 is gender (with male = 1), the variable x1 x2 equals years of education for men and zero for women, so its coefficient comes into play only if the observation is male.  Entered alongside x1 and x2, it lets the effect of education differ between men and women.  In this way the list of x’s and their associated β’s can be different for different subgroups.  That’s a step in the right direction, but one can go further.
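Here is a rough sketch of the education-and-gender interaction in Python.  Everything in it is an assumption made for illustration: the wage numbers are invented, and `solve` and `ols` are small hypothetical helpers that fit the regression through the normal equations so the example needs no outside library.  Education, gender and their product are all entered, so the coefficient on the product is the extra effect of a year of education for men.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, ys):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical wages (arbitrary units), built so that a year of education
# is worth 1 for women and 3 for men.
rows, ys = [], []
for male in (0, 1):
    for educ in range(8, 18):
        rows.append([1.0, educ, male, educ * male])  # constant, x1, x2, x1*x2
        ys.append(2 + 3 * educ if male else 1 + 1 * educ)

b0, b1, b2, b3 = ols(rows, ys)
print(f"effect of education, women: {b1:.2f}")
print(f"effect of education, men:   {b1 + b3:.2f}")
```

Note what stays fixed: both groups still share one outcome variable and one functional form.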

So what other methods are there that make fewer assumptions about the homogeneity of our study samples?  The simplest is partitioning subsamples.  Look at men and women, different racial groups or surplus and deficit countries separately.  Rather than search for an average effect for all observations, allow the effects to be different for different groups.

Interacting variables comes close to this if you interact group affiliation with every other explanatory variable.  It doesn't go all the way, however, because (1) it still requires the same outcome variable for each subgroup and (2) it imposes the same functional form.  Running separate models on subsamples gives you the freedom to vary everything.

When should you evaluate subsamples?  Whenever you can.  It is much better than just assuming that all factors, effects, and sensible regression choices are the same for everyone.
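A minimal sketch of partitioning, again with invented numbers: the "surplus" and "deficit" labels, the data and the helper `fit_line` are hypothetical.  The same regression is run once on the pooled sample and once on each subsample.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical sample of (group, x, y) observations.
sample = ([("surplus", x, 3 + 0.5 * x) for x in range(10)]
          + [("deficit", x, 10 - 1.5 * x) for x in range(10)])

# Pooled regression: one slope for everybody.
pooled = fit_line([x for _, x, _ in sample], [y for _, _, y in sample])

# Partitioned regressions: one slope per group.
by_group = {}
for group in sorted({g for g, _, _ in sample}):
    xs = [x for g, x, _ in sample if g == group]
    ys = [y for g, _, y in sample if g == group]
    by_group[group] = fit_line(xs, ys)

print(f"pooled slope: {pooled[1]:.2f}")
for group, (b0, b1) in by_group.items():
    print(f"{group} slope: {b1:.2f}")
```

In this made-up case the pooled slope is negative even though the effect is positive for one of the two groups.  The sketch varies only the coefficients; with separate models you are also free to change the outcome variable or the functional form for each subsample.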

A different approach is multilevel modeling.  Here you accept the assumption that y, the x’s and the functional form are the same for everyone, but you permit the β’s to be different for different groups.  Compared to flat-out sample partition, this forces much more homogeneity on your model, but in return you get to analyze the factors that cause these β’s to vary.  It is a way to get more insight into the diversity of effects you see in the world.
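The idea of treating the β’s themselves as something to be explained can be sketched in two steps: estimate a slope within each group, then regress those slopes on a group-level characteristic.  This is only a crude "slopes as outcomes" approximation, not a real multilevel estimator, which fits both levels jointly and pools information across groups.  The groups, the group-level variable `z`, the data and the helper `fit_line` are all hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical groups: z is a group-level characteristic.  By construction,
# the effect of x on y within a group is 1 + 0.5 * z.
group_z = {"g0": 0, "g1": 1, "g2": 2, "g3": 3}
xs = list(range(10))

# Step 1: estimate a separate slope within each group.
slopes = {}
for name, z in group_z.items():
    ys = [2 + (1 + 0.5 * z) * x for x in xs]
    _, slopes[name] = fit_line(xs, ys)

# Step 2: regress the estimated slopes on the group-level variable.
zs = [group_z[name] for name in group_z]
gamma0, gamma1 = fit_line(zs, [slopes[name] for name in group_z])
print(f"slope when z = 0: {gamma0:.2f}")
print(f"change in slope per unit of z: {gamma1:.2f}")
```

The second regression is where the payoff lies: it says how the effect of x changes as the group characteristic changes, which a single pooled coefficient cannot tell you.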

Third, you could get really radical and put aside the regression format altogether.  Consider principal components analysis, whose purpose is not hypothesis testing (measurement of effects, average or not), but describing the structure of diversity within your sample population.  What PCA does, roughly, is to find a cluster of correlations that appear among the variables you specify, making no distinction between explanatory and outcome variables.  That gives you a principal component, understood as a subpopulation with distinctive characteristics.  Then the procedure analyzes the remaining variation not accounted for in the first set of correlations; it comes up with a second cluster which describes a second subgroup with its own set of attributes.  It does this again and again until you stop, although, in social science data, you rarely get more than three significant principal components, and perhaps fewer than this.  PCA is all about identifying the “tribes” in your data sets: what makes them internally similar and externally different.
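The extract-then-repeat mechanics can be sketched as follows.  This is a bare-bones illustration under stated assumptions, not a production routine: it works on the covariance matrix of three invented variables, finds each component by power iteration, and removes it before looking for the next.  The data and the helper names are hypothetical.  Each component comes back as a share of total variance plus a set of loadings on the variables.

```python
import math


def covariance_matrix(data):
    """Sample covariance matrix of the columns of data."""
    n, k = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(k)]
    centered = [[row[j] - means[j] for j in range(k)] for row in data]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(k)]
            for i in range(k)]


def leading_component(cov, iterations=500):
    """Power iteration; returns (variance, unit-length loading vector)."""
    k = len(cov)
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(k) for j in range(k))
    return variance, v


def principal_components(data, how_many):
    """Return a list of (share of total variance, loadings) pairs."""
    cov = covariance_matrix(data)
    k = len(cov)
    total = sum(cov[i][i] for i in range(k))
    components = []
    for _ in range(how_many):
        variance, v = leading_component(cov)
        components.append((variance / total, v))
        # Remove the variation already accounted for, then look again.
        cov = [[cov[i][j] - variance * v[i] * v[j] for j in range(k)]
               for i in range(k)]
    return components


# Hypothetical data: the first two variables move together, the third does not.
observations = []
for i in range(10):
    t = i - 4.5
    s = 1 if i % 2 == 0 else -1
    u = (i % 3) - 1
    observations.append([t, t + s, u])

components = principal_components(observations, how_many=2)
for share, loadings in components:
    print(f"share of variance: {share:.2f}",
          "loadings:", [round(x, 2) for x in loadings])
```

In practice the variables are often standardized first, which amounts to working with correlations rather than covariances; that step is left out here to keep the sketch short.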

In the end, statistical analysis is about imposing a common structure on observations in order to understand differentiation.  Any structure requires assuming some kinds of sameness, but some approaches make much more sweeping assumptions than others.  An unfortunate symbiosis has arisen in economics between statistical methods that excessively rule out diversity and statistical questions that center on average (non-diverse) effects.  This is damaging in many contexts, including hypothesis testing, program evaluation, forecasting—you name it.

I will mention just one example from my own previous work.  There is a large empirical literature on whether and to what extent workers receive compensating differentials for dangerous work, a.k.a. hazard pay.  In nearly every instance the researcher wants to find “the” coefficient on risk in a wage regression.  But why assume such a thing?  Surely some workers receive ample, fully compensating hazard pay.  Some receive nothing.  Some, even if you control for everything you might throw in, have both lower wages and more dangerous jobs, because there is an irreducible element of luck in the labor market.  Surely a serious look at the issue would try to understand the variation in hazard pay: who gets it, who doesn't, and why.  But whole careers have been built on not doing this and assuming, instead, that the driving purpose is to isolate a single average effect, “the” willingness to pay for a unit of safety as a percent of the worker’s wage.  It’s beyond woozy; it’s completely wrongheaded.

The first step toward recovery is admitting you have a problem.  Every statistical analyst should come clean about what assumptions of homogeneity are being made, in light of their plausibility and the opportunities that exist for relaxing them.

UPDATE: I fixed a couple of bloopers in the original post (an inappropriate reference to IV and a misspelling of PCA).


Soccer Dad said...

you know what else would be nice ?
harmonize terminology across physical and social science; as a molecular biologist, I have seen many linear or non linear regressions (as well as PCA) and have never (ever) heard anyone use the term "regress"

even better, how about a rule that each paper has to have a .txt or excel file with the raw data; the habit of publishing only stat summary stuff leads to embarrassing errors like Reinhart-Rogoff (their error was not, as commonly assumed, a programming or stat error; it was a deeper error: the failure to take pen and pencil, graph out their points, and look at em for a few seconds; if they had done that they would have come to the depressing conclusion that all the work in gathering their data was useless)

John C. Pickett said...

Another monumental error is to apply OLS techniques to summarize time series data.

Unlearningecon said...

What about Matching Estimators? These try to recover the effects for people with similar values of X.

Peter Dorman said...

Unlearning, I don't think propensity matching gets away from average effects; on the contrary. The goal is to identify "the" treatment effect that best explains the outcome differences between matched pairs -- yes? (The statistical controls are introduced at different stages in the two techniques, but the conventional wisdom, which makes sense to me, is that this is minor.)

Unlearningecon said...

Well, it estimates average effects for people of particular characteristics. Which allows us to recover a distribution of effects from the whole sample, depending on what we adjust for. So it at least partially corrects for a completely "tyrannical" average, though of course the average still exists within matched groups.