An Alternative to Using Dummy Variables

Researchers often use dummy variables to estimate how membership in different categorical groups affects the dependent variable of interest. One drawback of this approach is that using too many dummy variables can create small cell sizes, which leads to an identification problem. Conversely, using broad groupings for dummy variables may give the appearance that the effect of the covariate is homogeneous within each category when this is not the case.

An alternative to simple categorical dummy variables is to use overlap polynomials. For instance, Lakdawalla, Goldman, and Bhattacharya have a working paper in which they use differences of normal cumulative distribution functions (CDFs) to build these overlapping polynomials in a flexible form. In particular, they use the following specification:

g(age_i; β) = Σ_{j=0}^{K} {Φ[(age_i − k_{j+1})/σ] − Φ[(age_i − k_j)/σ]} · p_j(age_i; β)

Below I describe how this function works in practice.

I randomly generated 250 observations with ages uniformly distributed between 0 and 100. To compute the equation above, I assume that σ = 3 and that the knots are placed every 10 years between 0 and 100 (i.e., 0, 10, 20, 30, …, 100). The difference of normal CDFs for the first spline is shown in the next graph.
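As a rough illustration of this setup, the sketch below generates the ages and computes the difference-of-CDFs weight for the first knot interval (0 to 10). It is not the code behind the original graph: the function names, the seed, and the use of the standard library instead of a statistics package are all assumptions. One further assumption matters: the difference is taken as Φ[(age − k_j)/σ] − Φ[(age − k_{j+1})/σ], so that the weights come out positive, as the discussion of the graphs suggests.

```python
import math
import random


def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def overlap_weight(age, lower_knot, upper_knot, sigma):
    """Difference of normal CDFs for one knot interval.

    Assumption: the lower knot enters first so the weight is positive.
    """
    return normal_cdf((age - lower_knot) / sigma) - normal_cdf((age - upper_knot) / sigma)


random.seed(0)  # arbitrary seed, used only to make the sketch reproducible
ages = [random.uniform(0, 100) for _ in range(250)]  # 250 ages on [0, 100]
knots = list(range(0, 101, 10))                      # 0, 10, 20, ..., 100
sigma = 3.0

# Weight each observation places on the first spline (knots 0 and 10).
first_spline = [overlap_weight(a, knots[0], knots[1], sigma) for a in ages]

# Show the five youngest observations and their first-spline weights.
for age, weight in sorted(zip(ages, first_spline))[:5]:
    print(f"age = {age:6.2f}   weight on spline 1 = {weight:.4f}")
```

Plotting `first_spline` against `ages` should give a smooth bump that is high inside the first interval and falls toward zero beyond age 10.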

The second spline can be produced as follows:

And combining the two we get:

One can see that the effect of age on the dependent variable for children aged 1 would be informed mostly by Spline #1. For children aged 10, the effect of age would be weighted equally between the coefficients on Spline #1 and Spline #2. For children aged 5, however, the coefficients on Spline #1 would receive more weight than those on Spline #2.
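The weights behind these statements can be checked numerically. The following is a minimal sketch, assuming σ = 3, knots at 0, 10, and 20, and the positive ordering of the CDF difference (lower knot first); the helper names are illustrative and do not come from the paper.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def overlap_weight(age, lower_knot, upper_knot, sigma=3.0):
    """Difference of normal CDFs for one knot interval (assumed positive ordering)."""
    return normal_cdf((age - lower_knot) / sigma) - normal_cdf((age - upper_knot) / sigma)


weights = {}
for age in (1, 5, 10):
    w1 = overlap_weight(age, 0, 10)   # Spline #1: knots 0 and 10
    w2 = overlap_weight(age, 10, 20)  # Spline #2: knots 10 and 20
    weights[age] = (w1, w2)
    print(f"age {age:2d}: spline 1 = {w1:.3f}, spline 2 = {w2:.3f}")
```

Under these assumptions the weights are roughly 0.63 versus 0.00 at age 1, 0.90 versus 0.05 at age 5, and 0.50 versus 0.50 at age 10. At age 1 the weights do not sum to one because part of the normal mass falls below the first knot.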

Finally, we can create the function g(age_i; β).

The graph below shows each of the individual components of the spline. The top curve is the sum of these components, that is, the value of g(age) when p_j = 1. In practice, each p_j would contain its own coefficients, so that the effect of age on the dependent variable can differ across splines. One can see that with this flexible formulation, fewer dummy variables are needed and the effect across age groups is smoothed.
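To make the role of p_j concrete, here is a hedged sketch of g(age; β) under the same assumptions as before (σ = 3, knots every 10 years, CDF difference ordered so the weights are positive). The linear form of p_j and the coefficient values in `betas` are made up purely for illustration; they are not estimates from the paper.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def g(age, knots, sigma, polys):
    """Sum over knot intervals of (CDF-difference weight) * p_j(age)."""
    total = 0.0
    for j, p_j in enumerate(polys):
        weight = (normal_cdf((age - knots[j]) / sigma)
                  - normal_cdf((age - knots[j + 1]) / sigma))
        total += weight * p_j(age)
    return total


knots = list(range(0, 101, 10))  # 0, 10, ..., 100 gives ten intervals
sigma = 3.0
n_splines = len(knots) - 1

# Case 1: p_j = 1 for every spline, as in the top curve of the graph.
ones = [lambda age: 1.0 for _ in range(n_splines)]
print(g(50.0, knots, sigma, ones))  # close to 1 in the interior
print(g(0.0, knots, sigma, ones))   # about 0.5 at the boundary knot

# Case 2: hypothetical linear p_j(age) = b0_j + b1_j * age.
betas = [(0.1 * j, 0.01) for j in range(n_splines)]  # made-up coefficients
linear = [lambda age, b=b: b[0] + b[1] * age for b in betas]
print(g(45.0, knots, sigma, linear))
```

With p_j = 1 the sum telescopes to Φ[(age − 0)/σ] − Φ[(age − 100)/σ], which is why the top curve is flat near 1 away from the endpoints and drops toward 0.5 at ages 0 and 100.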

1 Comment

This method is motivated by the concern that including too many dummy variables can create small cell sizes. Are there any rules of thumb (ideally based on econometric theory and/or simulations) about how many dummy variables are ‘too many’ or how small a cell size is ‘too small’?
