**The Problem**

Researchers often transform the dependent variable of a regression before estimating its parameters. The transformation, however, complicates calculating the expected value of the dependent variable on the untransformed scale. Assume that $Y_i$ is the dependent variable and that the function *g* is used to transform it as follows:

- $\eta_i = g(Y_i)$
- $Y_i = h(\eta_i)$
- $h = g^{-1}$

The easiest way to picture these functions is to think of *g* as the *ln* function and *h* as the *exp* function. In health economics, researchers often use a log transformation to attenuate problems related to a heavily right-skewed distribution. In this case, one would estimate the following regression:

- $\eta_i = x_i\beta + \varepsilon_i$
- $\varepsilon_i \sim F$ (iid), $E(\varepsilon_i) = 0$, $\mathrm{Var}(\varepsilon_i) = \sigma^2$

One can estimate β consistently as follows:

- $\hat{\beta} = (X'X)^{-1}X'\eta$

What is the predicted value of the dependent variable? Calculating this is not as easy as it seems:

- $E(Y_0) = E[h(x_0\beta + \varepsilon)] \neq E[h(x_0\beta)]$
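For the log case, where *h* is the *exp* function, the direction of the gap can be made explicit. The short derivation below is a sketch that assumes $E[e^{\varepsilon}]$ exists; it uses only $E(\varepsilon) = 0$ and Jensen's inequality:

$$
E[\exp(x_0\beta + \varepsilon)] = \exp(x_0\beta)\,E[e^{\varepsilon}] \;\geq\; \exp(x_0\beta)\,e^{E(\varepsilon)} = \exp(x_0\beta)
$$

The inequality is strict unless the error term is degenerate, so simply exponentiating the linear predictor understates the mean on the untransformed scale.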

For instance, it is well known that under the log transformation with normally distributed errors, the expected value of the dependent variable is $\exp(x_0\beta + \sigma^2/2)$. When we do not know the true distribution of the error term, however, calculating the expected value on the untransformed scale is more difficult. The solution is Duan’s Smearing Estimate.
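The normal-theory correction can be checked numerically. The sketch below is only an illustration: the values of `x0_beta` and `sigma`, the sample size, and the random seed are hypothetical choices, not figures from the post. It simulates log-scale outcomes with normal errors and compares the naive retransformation with the $\exp(x_0\beta + \sigma^2/2)$ formula.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values, chosen only for illustration.
n = 200_000
x0_beta = 1.0   # linear predictor x_0 * beta at the point of interest
sigma = 0.8     # standard deviation of the normal error term

# Simulate the model on the log scale, then retransform with exp.
eps = rng.normal(0.0, sigma, size=n)
y = np.exp(x0_beta + eps)

naive = np.exp(x0_beta)                          # drops the error term
normal_theory = np.exp(x0_beta + sigma**2 / 2)   # valid for normal errors
simulated_mean = y.mean()

print(f"naive retransformation: {naive:.3f}")
print(f"normal-theory formula:  {normal_theory:.3f}")
print(f"simulated mean of Y:    {simulated_mean:.3f}")
```

With normal errors the formula should track the simulated mean closely, while the naive value falls short. The formula gives no such guarantee when the errors are not normal.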

**The Solution**

The goal is to estimate the following:

- $E(Y_0) = E[h(x_0\beta + \varepsilon)] = \int h(x_0\beta + \varepsilon)\,dF(\varepsilon)$

Duan states that: “Without knowing the error distribution function *F* or a reliable parametric form for it, we estimate *F* by the empirical cdf of the estimated residuals

- $F_n(e) = n^{-1}\sum_{i=1}^{n} I\{\varepsilon_i \leq e\}$

where $\varepsilon_i = \eta_i - x_i\hat{\beta}$ denotes the residual, I{⋅} denotes the indicator function of the event “⋅”.

By substituting the coefficients from the original regression, we can estimate the expected value of the untransformed dependent variable as:

- $\hat{E}(Y_0) = \int h(x_0\hat{\beta} + \varepsilon)\,dF_n(\varepsilon)$
- $\hat{E}(Y_0) = n^{-1}\sum_{i=1}^{n} h(x_0\hat{\beta} + \varepsilon_i)$
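The sum above translates almost directly into code. The following is a minimal sketch, not the post's own example: the function name `smearing_estimate`, the simulated design, the coefficient values, and the centered-exponential error distribution are all assumptions made for illustration. The errors are deliberately skewed and non-normal, so the normal-theory formula would not apply.

```python
import numpy as np

def smearing_estimate(x0, beta_hat, residuals, h=np.exp):
    """Average h(x0'beta_hat + e_i) over the estimated residuals e_i."""
    linear_pred = float(np.dot(x0, beta_hat))
    return float(np.mean(h(linear_pred + residuals)))

rng = np.random.default_rng(1)

# Hypothetical simulated data: intercept plus one covariate.
n = 50_000
X = np.column_stack([np.ones(n), rng.uniform(0.0, 1.0, size=n)])
beta_true = np.array([1.0, 0.5])

# Skewed, mean-zero errors: exponential with mean `scale`, then centered.
scale = 0.3
eps = rng.exponential(scale, size=n) - scale
eta = X @ beta_true + eps            # transformed (log-scale) outcome

# OLS on the transformed scale, then residuals.
beta_hat, *_ = np.linalg.lstsq(X, eta, rcond=None)
resid = eta - X @ beta_hat

x0 = np.array([1.0, 0.5])
naive = float(np.exp(x0 @ beta_hat))
smeared = smearing_estimate(x0, beta_hat, resid)

# Known mean for this simulated design: E[exp(eps)] = exp(-scale) / (1 - scale).
true_mean = float(np.exp(x0 @ beta_true - scale) / (1.0 - scale))

print(f"naive:    {naive:.3f}")
print(f"smearing: {smeared:.3f}")
print(f"true:     {true_mean:.3f}")
```

For the log transformation the estimate factors as $\exp(x_0\hat{\beta})$ times the mean of the exponentiated residuals, so the same multiplicative "smearing factor" applies at every $x_0$.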

To see how the smearing estimator works in practice, see this example. I examine (made-up) figures of how Medicare spending varies by age. The average spending level is $6,516. Retransforming the log-scale prediction with the naive estimate produces an estimated expected value of $4,853. The Duan smearing estimator gives an estimated average spending level of $6,725, much closer to the actual figure than the naive estimate.

This example assumes the function *g()* is the natural log function. You can see more detail on dealing with log estimation for homoskedastic and heteroskedastic errors here.

*Source*:

- Naihua Duan, “Smearing Estimate: A Nonparametric Retransformation Method,” *Journal of the American Statistical Association*, Vol. 78, No. 383 (Sep., 1983), pp. 605-610.

Note that the formula $\exp(x_0\beta + \sigma^2/2)$ for the expected value of the dependent variable holds only when the error term ε is normally distributed, that is, when Y is log-normally distributed.