Unbiased Analysis of Today's Healthcare Issues

Beta Binomial Regression

Written By: Jason Shafrin - Sep 17, 2013

Oftentimes, one will observe data that cluster around two different points.  Such a distribution is known as a bimodal distribution.  A bimodal distribution could arise, for instance, when patients have two choices of health care providers, and the data measure the share of times patients use one of the providers.

Standard methods often perform poorly when modeling the effect of covariates on such a variable.  Binomial and OLS regressions fit the distribution poorly because their fitted values are often concentrated around the mean, where there are relatively few observations in the data.

A paper by Liu et al. (2013) examines how a beta-binomial regression model can better capture the effect of covariates on a bimodally distributed random variable.

The binomial distribution is a discrete probability distribution for the number of successes in a fixed or known number of independent Bernoulli trials (n), each with the same probability of success (p).  The stereotypical use of a binomial distribution is a coin toss where the probability of heads is p and the number of coin tosses is n.  As applied to the choice of health care providers, p would be the probability of choosing provider #1 and n would be the number of times the patient visited any provider.

The beta distribution is a family of continuous probability distributions defined on the interval (0, 1) parameterized by two positive shape parameters, typically denoted by a and b. These shape parameters provide a tremendous amount of flexibility to model different empirical shapes over the (0, 1) interval.

The beta-binomial model combines the beta and binomial distributions.  The beta-binomial distribution is used to model the number of successes in n binomial trials when the probability of success p is not fixed but is itself random, following a beta distribution with parameters a and b.
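
To make the two-stage structure concrete, here is a minimal simulation sketch in Python using only the standard library. The function name, the seed, and the values n = 10 and a = b = 0.5 are illustrative assumptions, not taken from the paper.

```python
import random

def draw_beta_binomial(n, a, b, rng):
    """Draw one beta-binomial count: p ~ Beta(a, b), then k ~ Binomial(n, p)."""
    p = rng.betavariate(a, b)  # success probability drawn fresh for each unit
    return sum(rng.random() < p for _ in range(n))  # n Bernoulli trials with that p

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 10                  # illustrative number of trials (e.g. visits per patient)
draws = [draw_beta_binomial(n, 0.5, 0.5, rng) for _ in range(10_000)]

# Tabulate how often each count 0..n appears.
counts = [draws.count(k) for k in range(n + 1)]
for k, c in enumerate(counts):
    print(k, c)
```

With both shape parameters below 1, most simulated units end up with nearly all failures or nearly all successes, so the tabulated counts pile up at 0 and n rather than in the middle.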

The beta-binomial distribution can model a bimodal random variable because it is U-shaped when both a and b are less than 1.  Other values of a and b can generate shapes that rise monotonically toward either end or are flat. The beta-binomial is a uniform distribution if both a and b are equal to 1, and it approaches the binomial distribution as a and b grow large (much greater than 1) while the ratio a/(a+b) stays fixed.
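
The shapes described above can be checked directly from the beta-binomial probability mass function, C(n, k) B(k + a, n - k + b) / B(a, b), where B is the beta function. The sketch below is a plain-Python illustration; the helper names and the choices n = 10 and a = b = 500 for the "large" case are assumptions made for the example.

```python
from math import comb, exp, lgamma

def log_beta(x, y):
    """Log of the beta function B(x, y), computed via log-gamma for stability."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)

def beta_binomial_pmf(k, n, a, b):
    """P(K = k) for a beta-binomial with n trials and shape parameters a, b."""
    return comb(n, k) * exp(log_beta(k + a, n - k + b) - log_beta(a, b))

def pmf_vector(n, a, b):
    """Probabilities for every possible count 0..n."""
    return [beta_binomial_pmf(k, n, a, b) for k in range(n + 1)]

n = 10
u_shaped = pmf_vector(n, 0.5, 0.5)           # a, b < 1: mass piles up at 0 and n
flat = pmf_vector(n, 1.0, 1.0)               # a = b = 1: every count equally likely
near_binomial = pmf_vector(n, 500.0, 500.0)  # large a, b: close to Binomial(n, 0.5)
binomial = [comb(n, k) * 0.5 ** n for k in range(n + 1)]

for k in range(n + 1):
    print(k, round(u_shaped[k], 3), round(flat[k], 3),
          round(near_binomial[k], 3), round(binomial[k], 3))
```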

The figure below shows a variety of beta binomial parameterizations.

Authors’ Application

The authors describe how to use a beta-binomial regression to model Medicare-eligible veterans’ choice between VA and non-VA physicians.  Specifically, they use the fixed-effects negative binomial approach developed by Guimaraes (2005), implemented with Stata's xtnbreg command.

Their three-step process proceeds as follows.

  1. Structure the data with two records per year.  The first record indicates the number of visits (pc_enctr) that occurred in a Medicare outpatient primary care clinic (ilocation = 0), while the second record indicates the number of visits that occurred in a VA outpatient primary care clinic (ilocation = 1).
  2. Estimate the shape parameters, a and b, from a beta-binomial regression without including any covariates. In this simple model, the dependent variable is the number of visits (pc_enctr) and the only independent variable indicates whether the visits were at a VA location (ilocation).
  3. Reestimate the beta-binomial with covariates. To reestimate the model, one constructs interaction terms between the covariates and the variable indicating that the visits occurred in VA (ilocation). Then, one reestimates the beta-binomial regression with ilocation and the interaction terms between ilocation and the covariate categories as regressors. The shape parameters (a and b) and the mean VA reliance can be estimated for the model with covariates as well. Where it is of interest, the coefficients of the interaction terms can be exponentiated and interpreted as incidence rate ratios.
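
The data layout in steps 1 and 3 can be sketched as follows. This is a hypothetical Python illustration rather than the authors' code: the function name, the `year`, `medicare_visits`, and `va_visits` inputs, and the sample values are made up, and it assumes the two records are built per patient (studyid) per year. Only `studyid`, `pc_enctr`, `ilocation`, and the `_va` interaction naming come from the post.

```python
def to_long_records(studyid, year, medicare_visits, va_visits, covariates):
    """Return the two records (non-VA first, VA second) for one patient-year."""
    records = []
    for ilocation, visits in ((0, medicare_visits), (1, va_visits)):
        record = {
            "studyid": studyid,
            "year": year,
            "ilocation": ilocation,  # 0 = Medicare (non-VA) clinic, 1 = VA clinic
            "pc_enctr": visits,      # number of primary care visits at that location
        }
        # Interaction terms: each covariate multiplied by the VA indicator,
        # so they are zero on the non-VA record.
        for name, value in covariates.items():
            record[f"{name}_va"] = value * ilocation
        records.append(record)
    return records

# Made-up example: a patient aged 65-74 with 2 Medicare visits and 5 VA visits.
rows = to_long_records(studyid=1, year=2004, medicare_visits=2, va_visits=5,
                       covariates={"age_6574": 1})
for row in rows:
    print(row)
```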

In the authors’ example, the second step produces shape parameters for the VA reliance distribution of a = 0.517 and b = 0.305. Both a and b are less than 1, which indicates that the distribution is U-shaped. From this regression, the authors also predict an unadjusted mean VA reliance of 0.629, suggesting that 62.9% of total primary care visits occurred in VA.
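
As a quick arithmetic check, the reported mean follows from the shape parameters, since the mean of a beta distribution with parameters a and b is a / (a + b). The short Python sketch below (the function name is just for illustration) reproduces the 0.629 figure from the reported a and b.

```python
def beta_mean(a, b):
    """Mean of a beta distribution with shape parameters a and b."""
    return a / (a + b)

mu = beta_mean(0.517, 0.305)
print(round(mu, 3))  # 0.629
```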


The authors conclude: “Our study shows that binomial or OLS models may match the unadjusted means but poorly estimate the entire distribution when the outcome is bimodally distributed. The extreme flexibility of the shape parameters of the beta-binomial model allows us to estimate a regression that tracks closely to the underlying distribution of residuals. Our study shows that the shape of the distribution is critical, because significant shifts occurred at both extremes of VA reliance.”

This figure illustrates how the beta binomial model can better fit the distribution.


Stata Tutorial

In Stata, one can estimate the second step as:

  • xtnbreg pc_enctr ilocation, i(studyid) fe

One can estimate the value of the shape parameters a and b and the mean μ as follows:

  • local a exp(_b[_cons] + _b[ilocation])
  • local b exp(_b[_cons])
  • nlcom mu: `a' / (`a' + `b')
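
One way to read these three lines: because a and b share the constant term, the constant cancels in the ratio and the mean reduces to a logistic function of the ilocation coefficient alone. In the derivation below, \beta_0 and \beta_1 are shorthand introduced here for _b[_cons] and _b[ilocation]; they are not notation from the paper.

```latex
\mu = \frac{a}{a + b}
    = \frac{e^{\beta_0 + \beta_1}}{e^{\beta_0 + \beta_1} + e^{\beta_0}}
    = \frac{e^{\beta_1}}{1 + e^{\beta_1}}
    = \frac{1}{1 + e^{-\beta_1}}
```

So in the model without covariates, a positive ilocation coefficient corresponds to a mean VA reliance above 0.5, and a negative one to a mean below 0.5.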

One can estimate the beta binomial regression model as follows:

  • gen age_5564_va = age_5564 * ilocation
  • gen age_6574_va = age_6574 * ilocation
  • gen age_75plus_va = age_75plus * ilocation
  • xtnbreg pc_enctr ilocation age_5564_va age_6574_va age_75plus_va, i(studyid) fe

where the value of the shape parameters a and b and the mean μ are calculated as follows:

  • local a exp(_b[_cons] + _b[ilocation] + _b[age_5564_va] + _b[age_6574_va]+ _b[age_75plus_va] )
  • local b exp(_b[_cons])
  • nlcom mu: `a' / (`a' + `b')

One can calculate the incidence rate ratio (IRR) as the exponential of a given coefficient. For instance, in the paper, the coefficient on age_5564_va gives exp(-0.1881) = 0.83, which indicates that the expected proportion of primary care visits that occurred in VA was 17% lower than in the reference group, age < 55. One can also report the IRRs in the regression results using the irr option.

  • xtnbreg pc_enctr ilocation age_5564_va age_6574_va age_75plus_va, i(studyid) fe irr
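
The IRR arithmetic quoted above is easy to verify by hand. The Python sketch below exponentiates the coefficient reported in the paper (-0.1881) and converts the result to a percentage change; the variable names are illustrative.

```python
from math import exp

coef = -0.1881                # coefficient on age_5564_va reported in the paper
irr = exp(coef)               # incidence rate ratio
pct_change = (irr - 1) * 100  # percentage change relative to the reference group

print(f"IRR = {irr:.2f}, change = {pct_change:.0f}%")  # IRR = 0.83, change = -17%
```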


