Sunday, January 8, 2017

When is a Dummy Variable Not a Dummy Variable?

In econometrics we often use "dummy variables" to allow for changes in estimated coefficients when the data fall into one "regime" or another. An obvious example is when we use such variables to allow for the different "seasons" in quarterly time-series data.

I've posted about dummy variables several times in the past - e.g., here.

However, there's one important point that seems to come up from time to time in emails that I receive from readers of this blog. I thought that a few comments here might be helpful.

The following variable can legitimately be called a "dummy variable":

Di = 1   ;    if a certain condition holds
= 0   ;    otherwise.

The following variable is not a dummy variable:

Ni = 0    ;  if condition A holds
= 1    ;  if condition B holds (where A and B are mutually exclusive conditions)
= 2    ;  otherwise. (Call this condition C, say.)

Let's see what's different about Di and Ni, and then we can consider some further examples.

Let's add  Di as a regressor in a regression model. For simplicity I'll just add it (rather than interact it with another regressor) so that it just shifts the intercept. However, this doesn't affect any of the points that I make below.

So, our model is:

yi = α + β xi + γ Di + ui

where ui is the random error term.

If Di = 1, then the intercept is (α + γ); and if  Di = 0, then the intercept is just α. The estimated (positive or negative) "shift" in the intercept is just the estimate of γ that we obtain when we use (say) OLS. The data entirely determine the magnitude of this shift.
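To make this concrete, here is a minimal numerical sketch (the parameter values, sample size, and seed are arbitrary choices for illustration): with a 0/1 dummy, OLS is free to estimate the intercept shift γ directly from the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical true parameters: alpha = 1.0, beta = 2.0, gamma = 3.0.
x = rng.normal(size=n)
D = (rng.random(n) < 0.5).astype(float)      # 0/1 dummy variable
u = rng.normal(scale=0.5, size=n)
y = 1.0 + 2.0 * x + 3.0 * D + u

# OLS with columns [constant, x, D]; gamma_hat is the estimated intercept shift.
X = np.column_stack([np.ones(n), x, D])
alpha_hat, beta_hat, gamma_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(alpha_hat, beta_hat, gamma_hat)        # estimates close to 1.0, 2.0, 3.0
```

Nothing constrains gamma_hat here: whether the shift is positive, negative, large, or small is left entirely to the data.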

On the other hand, suppose that we replace  Di by  Ni in our model:

yi = α + β xi + γ Ni + ui

Now, if condition A holds, then the intercept is α; if condition B holds, then the intercept is (α + γ); and otherwise the intercept is (α + 2γ). Regardless of what the data tell us by way of an estimate for γ, the shift in the estimated intercept from condition A to condition C is constrained to be exactly twice the shift that we estimate from condition A to condition B.

We've essentially pre-judged part of the answer and imposed it before we even estimated the model! Generally, this is not something that we'd want to do.
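A quick simulation illustrates the constraint (the group means below are hypothetical, chosen so that they are deliberately not equally spaced): the single 0/1/2 regressor cannot recover them, whereas two separate 0/1 dummies can.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Hypothetical group means: A -> 1.0, B -> 2.5, C -> 2.0 (NOT equally spaced).
cond = rng.integers(0, 3, size=n)                     # 0 = A, 1 = B, 2 = C
y = np.array([1.0, 2.5, 2.0])[cond] + rng.normal(scale=0.3, size=n)

# (a) A single regressor N taking the values 0, 1, 2.
XN = np.column_stack([np.ones(n), cond.astype(float)])
aN, gN = np.linalg.lstsq(XN, y, rcond=None)[0]
fitN = np.array([aN, aN + gN, aN + 2.0 * gN])         # C's shift forced to be twice B's

# (b) Two separate 0/1 dummies for B and C (A is the base group).
XD = np.column_stack([np.ones(n),
                      (cond == 1).astype(float),
                      (cond == 2).astype(float)])
a, gB, gC = np.linalg.lstsq(XD, y, rcond=None)[0]
fitD = np.array([a, a + gB, a + gC])

print(fitN)   # equally spaced by construction; cannot match 1.0, 2.5, 2.0
print(fitD)   # close to the true group means
```

Whatever group means we simulate, the fitted intercepts under the 0/1/2 coding are forced onto a straight line; the separate-dummies specification lets each group have its own intercept.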

You might now ask yourself, does it make sense to use any of the following "dummy variables" as regressors?
• Di = 1, if condition A holds; Di = -1, if condition A does not hold.
• Di = 0, if condition A holds; Di = 1, if condition B holds; Di = -1, if condition C holds.
(In the second case, conditions A, B, and C are mutually exclusive and jointly exhaustive.)

1. My take is that the only possible values for a "qualitative" dummy variable are 1 or 0: if the condition holds the variable must be 1, so the estimated coefficient is not "scaled", and similarly for 0. If A and B are mutually exclusive, this must be represented implicitly in the estimation data: when A's dummy is 1, B's is 0, and vice versa. Mutual exclusion is a property of the data...

2. Neither of the two cases should work. In both cases (given the same regression model), the intercepts would be (α + γ) and (α - γ) in the first case, and α, (α + γ), and (α - γ) in the second case. Both codings assume in advance that the effect of the lack of condition A, or of condition C respectively, is equal and opposite in sign.

The first case has a whiff of plausibility, however. If there existed a condition such that effects were inherently positive when it holds and inherently negative when it does not (or vice versa), perhaps it would work. An example might be the existence of debt and its effect on one's credit score. It does still seem to be making assumptions before testing the data...
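As a numerical footnote to the question above (using made-up data): in the two-condition case, the 1/-1 coding spans the same column space as the 0/1 coding once an intercept is included, so the fitted values coincide and only the interpretation of γ changes; the three-level 0/1/-1 coding, by contrast, genuinely imposes the equal-and-opposite constraint noted in the comments.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300

# Made-up group means: 2.0 when condition A holds, 0.5 when it does not.
A = rng.random(n) < 0.5
y = np.where(A, 2.0, 0.5) + rng.normal(scale=0.3, size=n)

X01 = np.column_stack([np.ones(n), A.astype(float)])          # 0/1 coding
Xpm = np.column_stack([np.ones(n), np.where(A, 1.0, -1.0)])   # 1/-1 coding

fit01 = X01 @ np.linalg.lstsq(X01, y, rcond=None)[0]
fitpm = Xpm @ np.linalg.lstsq(Xpm, y, rcond=None)[0]

# The 1/-1 column equals 2*D - 1, a linear combination of the 0/1 column and
# the constant, so both designs span the same space and the fitted values
# are identical; only the coefficient interpretation changes (the estimated
# shift between regimes becomes 2*gamma rather than gamma).
print(np.allclose(fit01, fitpm))   # True
```

So the two-level 1/-1 coding is just a re-parameterization of the usual dummy, whereas the three-level version pre-judges part of the answer in the same way that Ni does.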