Unbiased Analysis of Today's Healthcare Issues

An Alternative to Using Dummy Variables

Written By: Jason Shafrin - Feb• 21•11

Researchers often use dummy variables to estimate how membership in different categorical groups affects the dependent variable of interest. One drawback of this approach is that using too many dummy variables can create small cell sizes, which leads to an identification problem. Conversely, using broad groupings for dummy variables may give the appearance that the effect of the covariate is homogeneous within each category when this is not the case.

An alternative to simple categorical dummy variables is to use overlapping polynomials. For instance, Lakdawalla, Goldman, and Bhattacharya have a working paper in which they use the difference of normal cumulative distribution functions (CDFs) as a flexible way to build these overlapping polynomials. In particular, they use the following specification:

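  • $g(\text{age};\beta) = \sum_{j=0}^{K} \left\{ \Phi\left[\frac{\text{age}_i - k_{j+1}}{\sigma}\right] - \Phi\left[\frac{\text{age}_i - k_j}{\sigma}\right] \right\} \, p_j(\text{age}_i;\beta)$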

Here is the equation from the paper in larger type.

Below I describe how this function works in practice.


I randomly generated 250 observations with ages uniformly distributed between 0 and 100. To compute the equation above, I assume that σ = 3 and that the knots are placed every 10 years between 0 and 100 (i.e., 0, 10, 20, 30, …, 100). The difference-of-normal-CDFs variable for the first spline is shown in the next graph.
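For readers who want to reproduce this setup, here is a minimal Python sketch, not the code used for the original graphs. The function names, the random seed, and the use of `math.erf` for the normal CDF are all assumptions made for illustration. One more assumption matters: because each knot is larger than the one before it, the difference of CDFs taken in the order printed in the specification is never positive, so the sketch reverses the order to get positive weights.

```python
import math
import random


def normal_cdf(x):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def spline_weight(age, lower_knot, upper_knot, sigma=3.0):
    """Difference-of-normal-CDFs weight for the interval [lower_knot, upper_knot].

    Assumption: the difference is ordered so that the weight is positive.
    """
    return normal_cdf((age - lower_knot) / sigma) - normal_cdf((age - upper_knot) / sigma)


random.seed(0)  # arbitrary seed, only so the draw is reproducible
ages = [random.uniform(0, 100) for _ in range(250)]
knots = list(range(0, 101, 10))  # 0, 10, 20, ..., 100

# Weight on the first spline (knots 0 and 10) for every simulated observation
first_spline = [spline_weight(a, knots[0], knots[1]) for a in ages]

# Print the five youngest observations and their weight on the first spline
for a, w in sorted(zip(ages, first_spline))[:5]:
    print(f"age={a:6.2f}  weight on spline 1={w:.3f}")
```

Plotting `first_spline` against `ages` should give a smooth bump over the 0 to 10 range that tapers off on either side rather than dropping sharply at age 10.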

The second spline can be produced as follows:

And combining the two we get:

One can see that the effect of age on the dependent variable for children aged 1 would be informed mostly by Spline #1. For children aged 10, the effect of age on the dependent variable would be weighted equally between the coefficients on Spline #1 and Spline #2. For children aged 5, however, the coefficients on Spline #1 would receive more weight than the coefficients on Spline #2.
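The three cases above can be checked numerically. The sketch below assumes σ = 3, knots at 0, 10, and 20, and a difference of CDFs ordered so that the weights are positive. The helper names are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


SIGMA = 3.0
KNOTS = [0, 10, 20]  # Spline #1 spans 0-10, Spline #2 spans 10-20


def weight(age, j):
    """Weight on the spline between KNOTS[j] and KNOTS[j + 1] (positive by assumption)."""
    return normal_cdf((age - KNOTS[j]) / SIGMA) - normal_cdf((age - KNOTS[j + 1]) / SIGMA)


# Weights on Spline #1 and Spline #2 at the three ages discussed in the text
weights = {age: (weight(age, 0), weight(age, 1)) for age in (1, 5, 10)}

for age, (w1, w2) in weights.items():
    print(f"age {age:2d}: spline 1 = {w1:.4f}, spline 2 = {w2:.4f}")
```

Under these assumptions the two weights are equal at age 10 (just under 0.5 each), while at age 5 Spline #1 carries roughly 0.90 of the weight and Spline #2 roughly 0.05. At age 1 almost all of the weight that is assigned goes to Spline #1.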

Finally, we can create the equation $g(\text{age}_i;\beta)$.

The graph below shows each of the individual components of the spline. The top curve is the cumulative value of g(age) when p_j = 1 for every j. In practice, p_j would contain coefficients, so that the effect of age on the dependent variable can differ across these splines. With this flexible formulation, fewer dummy variables are needed and the effect across age groups is smoothed.
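A rough sketch of g(age) under the same assumptions (σ = 3, knots every 10 years, weights ordered to be positive) is given below. For simplicity each p_j is treated as a single constant coefficient, which is an illustrative simplification and not the specification in the working paper. The coefficient values are made up.

```python
import math


def normal_cdf(x):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def g(age, knots, coefs, sigma=3.0):
    """Weighted sum over splines; coefs[j] stands in for a constant p_j (assumption)."""
    total = 0.0
    for j in range(len(knots) - 1):
        w = normal_cdf((age - knots[j]) / sigma) - normal_cdf((age - knots[j + 1]) / sigma)
        total += w * coefs[j]
    return total


knots = list(range(0, 101, 10))   # 0, 10, ..., 100
ones = [1.0] * (len(knots) - 1)   # p_j = 1 for every spline

# With p_j = 1 the terms telescope to Phi((age - 0)/sigma) - Phi((age - 100)/sigma)
for age in (0, 5, 50, 95, 100):
    print(f"p_j = 1, age {age:3d}: g = {g(age, knots, ones):.3f}")

# Hypothetical coefficients 0, 1, ..., 9: a step function in age, smoothed at the knots
steps = [float(j) for j in range(len(knots) - 1)]
for age in (45, 48, 50, 52, 55):
    print(f"step coefs, age {age:3d}: g = {g(age, knots, steps):.3f}")
```

With p_j = 1 the weights sum to about 1 in the interior of the age range and fall to 0.5 at the two boundary knots. With the step coefficients, g moves gradually from about 4 at age 45 to about 5 at age 55 instead of jumping at age 50, which is the smoothing described above.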


One Comment

  1. Don Kenkel says:

    This method is motivated by the concern that including too many dummy variables can create small sizes. Are there any rules of thumb (ideally based on econometric theory and/or simulations) about how many dummy variables is ‘too many’ or how small a cell size is ‘too small’?
