Today I will review the insightful lecture of Willard Manning at European Science Days. Manning is most famous for his work with the RAND Health Insurance Experiment.

**Problems with Healthcare Data**

There are 4 major econometric problems one must consider when trying to analyze health care cost and utilization data:

- There is a **large mass of individuals with zero utilization** (or expenditures) during a given time period,
- Consumption among those with any care is very **skewed** (e.g.: visits, hospitalizations, expenditures),
- The dependent variable often responds in a **non-linear** manner to many covariates,
- Demand response to covariates may change by the level of demand (e.g.: outpatient to inpatient, or low to high levels).

**Log or Box-Cox Transformations**

While OLS is easy to use, it can often produce out-of-range predictions (i.e.: y_{hat}=xβ_{hat}<0). Since health care data are skewed, many researchers log the dependent variable in order to obtain a more symmetric distribution of errors. The tradeoff is that, although one gains precision and robustness, no one is interested in log-scale results *per se*, so the estimates have to be retransformed to the original scale.
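To make the out-of-range problem concrete, here is a minimal sketch with made-up numbers (the covariate and spending values are purely illustrative and do not come from the lecture). It fits a least-squares line by hand to a small right-skewed sample and inspects the fitted values.

```python
# Toy illustration: OLS fitted to right-skewed, non-negative data
# can predict negative values. The data below are made up.

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical covariate (say, a health-status index) and annual spending.
x = [0, 1, 2, 3, 4]
y = [0.0, 0.0, 1.0, 5.0, 40.0]

intercept, slope = ols_fit(x, y)
predictions = [intercept + slope * xi for xi in x]
print(predictions)  # the fitted value at x = 0 falls below zero
```

Spending can never be negative, yet the single large observation pulls the line steep enough that the fitted value at the low end of x is below zero.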

The Box-Cox transformation of y is as follows:

- [(y^{λ}-1)/λ]=xβ+ε, if λ≠0
- log(y)=xβ+ε, if λ=0

One estimates λ using MLE in order to minimize the skewness in the residuals.
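The transformation itself is easy to write down in code. The sketch below is only the transform and its inverse, not the MLE step for λ; the function names `box_cox` and `inverse_box_cox` are my own. It shows that the λ≠0 branch approaches log(y) as λ goes to zero, which is why the two cases fit together.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a positive value y for a given lambda."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam

def inverse_box_cox(z, lam):
    """Map a transformed value z back to the original scale."""
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)

print(box_cox(100.0, 1.0))    # lambda = 1: just a shift, y - 1
print(box_cox(100.0, 0.5))    # lambda = 0.5: square-root-like
print(box_cox(100.0, 1e-6))   # lambda near 0: close to log(y)
print(box_cox(100.0, 0.0))    # lambda = 0: exactly log(y)
```

Note that the transform is only defined for y>0, so the zeros in health care data have to be handled separately.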

**Log Example**

Using a log transformation implies that second moments often matter. For instance, let us assume log(y|g)~N(μ_{g},σ_{g}^{2}), where treatment g=A, B. Then we know

- E(y|g=A) = exp[μ_{a}+ 0.5(σ_{a})^{2}].
- E(y|g=A)/E(y|g=B) = exp[(μ_{a}-μ_{b})+ 0.5{(σ_{a})^{2}-(σ_{b})^{2}}]

We can see from the second equation above that the second moment of the distributions matters if there is heteroskedasticity, but not if there is homoskedasticity (i.e.: σ_{a}=σ_{b}=σ).
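A quick numerical check of the two formulas, using parameter values I made up for illustration (σ is treated as the log-scale standard deviation):

```python
import math

def lognormal_mean(mu, sigma):
    """E(y) when log(y) is normal with mean mu and standard deviation sigma."""
    return math.exp(mu + 0.5 * sigma ** 2)

def mean_ratio(mu_a, sigma_a, mu_b, sigma_b):
    """E(y|g=A) / E(y|g=B) on the original scale."""
    return lognormal_mean(mu_a, sigma_a) / lognormal_mean(mu_b, sigma_b)

# Made-up log-scale means for groups A and B.
mu_a, mu_b = 1.2, 1.0

naive = math.exp(mu_a - mu_b)               # uses the log-scale means only
homo = mean_ratio(mu_a, 0.8, mu_b, 0.8)     # equal variances
hetero = mean_ratio(mu_a, 0.8, mu_b, 1.1)   # group B has the larger variance

print(naive, homo, hetero)
```

With equal variances the ratio of means equals exp(μ_{a}-μ_{b}). With unequal variances it does not, and in this example the ranking of the two groups even flips: A has the higher log-scale mean but the lower mean on the original scale.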

**Marginal Effects with log transformation**

Calculating marginal effects with non-linear econometric formulations is often difficult. For instance, we know that E(y)= exp(xβ)E{exp(ε)|x}. This implies that the marginal effect is equal to:

- dE(y)/d(x_{k})=exp(xβ)[β_{k}E{exp(ε)|x}+ d E{exp(ε)|x}/d(x_{k})]

This is much more complicated than the *incorrect* formulation dE(y)/d(x_{k})=exp(xβ)β_{k}.
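To see how far apart the two formulas can be, here is a sketch under one specific assumption of mine that is not from the lecture: the log-scale error is normal with a variance that depends on x, σ²(x)=exp(G0+G1·x), so that E{exp(ε)|x}=exp(0.5σ²(x)). All parameter values are hypothetical. The code evaluates the full marginal effect, checks it against a finite-difference derivative of E(y), and compares it with the naive version.

```python
import math

B0, B1 = 1.0, 0.3    # hypothetical log-scale coefficients (intercept, slope)
G0, G1 = -1.0, 0.4   # hypothetical parameters of the error variance

def smear(x):
    """E{exp(eps)|x} when eps|x is normal with variance exp(G0 + G1*x)."""
    return math.exp(0.5 * math.exp(G0 + G1 * x))

def d_smear(x):
    """Derivative of smear(x) with respect to x (chain rule)."""
    sigma2 = math.exp(G0 + G1 * x)
    return smear(x) * 0.5 * G1 * sigma2

def expected_y(x):
    """E(y|x) = exp(x*beta) * E{exp(eps)|x}."""
    return math.exp(B0 + B1 * x) * smear(x)

def marginal_effect(x):
    """Full marginal effect: exp(x*beta) * [beta_k * smear + d smear / dx]."""
    return math.exp(B0 + B1 * x) * (B1 * smear(x) + d_smear(x))

def naive_marginal_effect(x):
    """The incorrect shortcut: exp(x*beta) * beta_k."""
    return math.exp(B0 + B1 * x) * B1

x0, h = 2.0, 1e-6
numeric = (expected_y(x0 + h) - expected_y(x0 - h)) / (2 * h)
print(marginal_effect(x0), numeric, naive_marginal_effect(x0))
```

The full formula agrees with the numerical derivative, while the naive one is less than half as large for these made-up parameters.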

**Generalized Linear Model Approach**

In this method, one searches for the β’s that solve the following estimating equation:

- Σ dμ(xβ)/dβ*V(x)^{-1}*(y-μ(xβ))=0

In practice, one usually assumes that μ(xβ)=exp[xβ]. A variance structure is assumed so that Var(y|x)=α[E(y|x)]^{γ}. The γ’s correspond to some standard parametric distributions:

- Gaussian NLS: γ=0
- Poisson: γ=1
- Gamma: γ=2
- Wald or inverse Gaussian: γ=3.
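The estimating equation can be solved with a few lines of Fisher scoring. What follows is a minimal sketch of my own, not Manning's code, and in practice one would use a statistics package. It assumes an intercept plus a single covariate, the log link μ=exp(b0+b1·x), and a variance proportional to μ^γ. The toy data are generated without noise from b0=0.5, b1=0.3, so every γ should recover the same coefficients.

```python
import math

def fit_log_link_glm(x, y, gamma, iterations=50):
    """Solve sum_i dmu_i/dbeta * V_i^{-1} * (y_i - mu_i) = 0 by Fisher scoring,
    with mu_i = exp(b0 + b1*x_i) and V_i proportional to mu_i**gamma."""
    b0 = math.log(sum(y) / len(y))  # start at the intercept-only fit
    b1 = 0.0
    for _ in range(iterations):
        # Accumulate the score vector (s0, s1) and the symmetric
        # 2x2 expected information matrix [[i00, i01], [i01, i11]].
        s0 = s1 = i00 = i01 = i11 = 0.0
        for xi, yi in zip(x, y):
            mu = math.exp(b0 + b1 * xi)
            score_w = mu ** (1.0 - gamma)  # (dmu/d(x*beta)) / V
            info_w = mu ** (2.0 - gamma)   # (dmu/d(x*beta))^2 / V
            s0 += score_w * (yi - mu)
            s1 += score_w * (yi - mu) * xi
            i00 += info_w
            i01 += info_w * xi
            i11 += info_w * xi * xi
        # Update beta by information^{-1} * score (explicit 2x2 inverse).
        det = i00 * i11 - i01 * i01
        b0 += (i11 * s0 - i01 * s1) / det
        b1 += (i00 * s1 - i01 * s0) / det
    return b0, b1

# Noise-free toy data generated from b0 = 0.5, b1 = 0.3.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [math.exp(0.5 + 0.3 * xi) for xi in x]

for gamma in (0, 1, 2, 3):
    print(gamma, fit_log_link_glm(x, y, gamma))
```

With real, noisy data the choice of γ changes how much weight high-mean observations receive, so the estimates would generally differ across the four variance assumptions.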

**Two Part Models**

To this point, we have focused on the skewness problem and ignored the fact that many of the observations also clump at zero. We can decompose the expected value as follows:

- E(y|x) = P(y>0)*E{y|y>0} + P(y=0)*0 = P(y>0)*E{y|y>0}
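The decomposition holds exactly for sample averages too, which is easy to verify. The spending figures below are made up for illustration.

```python
# Made-up annual expenditures with a mass at zero and a long right tail.
spending = [0, 0, 0, 0, 0, 0, 120, 300, 450, 9130]

n = len(spending)
positive = [s for s in spending if s > 0]

share_any_use = len(positive) / n                # sample analogue of P(y > 0)
mean_if_any_use = sum(positive) / len(positive)  # sample analogue of E(y | y > 0)
overall_mean = sum(spending) / n                 # sample analogue of E(y)

print(share_any_use, mean_if_any_use, overall_mean)
print(share_any_use * mean_if_any_use)  # matches overall_mean
```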

Now we must estimate P(y>0) and E(y|y>0) separately. The first part can be estimated with a probit model, P(y>0)=Φ(xα). In the second part, one can log y for the observations with y>0 to take the skewness into account.

If the log-scale error term is normally distributed, then:

- y_{hat}= Φ(xα)*exp(xβ + .5σ^{2}), where β, σ are estimated from the data.

If the log-scale error term is not normally distributed, then one can use the following formulation:

- y_{hat}= Φ(xα)*exp(xβ)*D
- D is Duan’s (JASA 1983) smearing estimator:
- D=N^{-1}Σexp[ε]=N^{-1}Σexp[ln(y|y>0)-xβ_{ols}]
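Both prediction formulas are simple to compute once the two parts have been estimated. The sketch below takes the index values xα and xβ and the log-scale residuals as given (the numbers are hypothetical, and the estimation of the probit and the log-scale OLS is not shown). The function names are mine.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def duan_smearing(log_residuals):
    """Smearing factor D: the average of exp(residual) over users with y > 0."""
    return sum(math.exp(e) for e in log_residuals) / len(log_residuals)

def predict_normal(x_alpha, x_beta, sigma):
    """Two-part prediction when the log-scale error is normal with sd sigma."""
    return normal_cdf(x_alpha) * math.exp(x_beta + 0.5 * sigma ** 2)

def predict_smearing(x_alpha, x_beta, smear):
    """Two-part prediction using a smearing factor instead of normality."""
    return normal_cdf(x_alpha) * math.exp(x_beta) * smear

# Hypothetical log-scale OLS residuals, ln(y) - x*beta_ols, for users with y > 0.
residuals = [-0.9, -0.4, -0.1, 0.0, 0.2, 1.2]
D = duan_smearing(residuals)

# Hypothetical index values for one individual.
x_alpha, x_beta = 0.25, 6.0
print(D)
print(predict_smearing(x_alpha, x_beta, D))
```

The residuals average to zero, yet D comes out above 1 because exp is convex. That is exactly the retransformation bias one would introduce by predicting with exp(xβ) alone.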

**Count Data**

Count data are very common in health economics. The number of doctor visits, hospitalizations and ER visits are all types of count data. Poisson and Negative Binomial regressions are frequently recommended for these types of data.
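One common way to choose between the two models, which I add here as background rather than as something from the lecture, is to check for overdispersion. The Poisson model implies a variance equal to the mean, whereas the usual (NB2) Negative Binomial allows Var(y)=μ+αμ². The sketch below computes the variance-to-mean ratio and a rough method-of-moments value for α on made-up visit counts, ignoring covariates.

```python
def mean_and_variance(counts):
    """Sample mean and (n-1)-denominator sample variance."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return mean, variance

# Made-up counts of doctor visits per person.
visits = [0, 0, 0, 1, 1, 2, 2, 3, 5, 16]

mean, variance = mean_and_variance(visits)
dispersion_ratio = variance / mean          # about 1 under a Poisson model
alpha_mom = (variance - mean) / mean ** 2   # rough NB2 dispersion parameter

print(mean, variance, dispersion_ratio, alpha_mom)
```

A ratio well above 1, as here, suggests that a plain Poisson variance assumption is too restrictive.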