Many data sets that social scientists come across use disproportionate stratified sampling. If a subpopulation is small, the survey designers may want to oversample this group. For example, the Survey of Income and Program Participation (SIPP) oversamples poor individuals and the Community Tracking Study (CTS) oversamples uninsured individuals, so that the estimates for these small groups are more precise. Below is a brief explanation of how to work with a disproportionate stratified data set.

**Simple Example** (from a Napier University website)

Let us imagine a town which has 1200 rich people and 2500 poor people. Due to budget constraints, the survey designer samples 100 people from each of the two strata (200 people total). The sampling fraction is .08333 (100/1200) for the rich and .04 (100/2500) for the poor. The weight placed on each observation is simply the inverse of its sampling fraction; thus the weights are 12 for the rich and 25 for the poor.

In the example above, suppose the mean household income was £12,000 in the poor areas and £25,000 in the rich areas. The weighted mean would then be

**[100 × £12,000 × (w=25) + 100 × £25,000 × (w=12)] ÷ (100 × 25 + 100 × 12) = £16,216.22**

An unweighted mean here would just be £18,500, so we can see that the weighting has corrected the fact that the sample has too many rich households.
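The arithmetic above can be reproduced with a short Python sketch. The dictionary layout and the helper name `sampling_weight` are illustrative choices, not part of the original example; the figures are the ones given in the text.

```python
# Strata from the town example: population size, sample size, sample mean income (GBP).
strata = {
    "rich": {"population": 1200, "sampled": 100, "mean_income": 25_000},
    "poor": {"population": 2500, "sampled": 100, "mean_income": 12_000},
}

def sampling_weight(population, sampled):
    """Inverse of the sampling fraction sampled / population."""
    return population / sampled

weighted_total = 0.0    # sum of weight * income over all sampled people
weight_sum = 0.0        # sum of weights over all sampled people
unweighted_total = 0.0  # sum of income over all sampled people
n_sampled = 0
for s in strata.values():
    w = sampling_weight(s["population"], s["sampled"])
    weighted_total += s["sampled"] * s["mean_income"] * w
    weight_sum += s["sampled"] * w
    unweighted_total += s["sampled"] * s["mean_income"]
    n_sampled += s["sampled"]

weighted_mean = weighted_total / weight_sum
unweighted_mean = unweighted_total / n_sampled
print(round(weighted_mean, 2), round(unweighted_mean, 2))
```

Note that the sum of the weights (3700) equals the town's population, which is a quick sanity check on weights built this way.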

**Econometrics** (see Wooldridge pp. 590-598)

Here is how Wooldridge explains variable probability sampling:

- Draw an observation **w_i** at random from the population.
- If **w_i** is in stratum *j*, toss a (biased) coin with probability *p_j* of turning up heads. Let *h_{ij}=1* if the coin turns up heads and zero otherwise.
- Keep observation *i* if *h_{ij}=1*; otherwise leave it out of the sample.
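The three sampling steps above can be simulated in a few lines of Python. This is only a sketch under stated assumptions: the population (1200 rich with income 25,000 and 2500 poor with income 12,000), the retention probabilities in `keep_prob`, the number of draws and the function name are all hypothetical and chosen for illustration.

```python
import random

def variable_probability_sample(population, keep_prob, n_draws, rng):
    """Return kept draws as (stratum, value, weight) with weight = 1 / p_j."""
    kept = []
    for _ in range(n_draws):
        stratum, value = rng.choice(population)              # step 1: random draw
        h = 1 if rng.random() < keep_prob[stratum] else 0    # step 2: biased coin
        if h == 1:                                           # step 3: keep on heads
            kept.append((stratum, value, 1.0 / keep_prob[stratum]))
    return kept

# Hypothetical population and retention probabilities (not from the source).
population = [("rich", 25_000.0)] * 1200 + [("poor", 12_000.0)] * 2500
keep_prob = {"rich": 0.5, "poor": 0.24}

sample = variable_probability_sample(population, keep_prob, 50_000, random.Random(0))

population_mean = sum(v for _, v in population) / len(population)
unweighted_mean = sum(v for _, v, _ in sample) / len(sample)
weighted_mean = sum(v * w for _, v, w in sample) / sum(w for _, _, w in sample)
print(round(population_mean, 2), round(unweighted_mean, 2), round(weighted_mean, 2))
```

Because the rich are retained with a higher probability, the unweighted sample mean overstates the population mean, while weighting each kept observation by the inverse of its retention probability should bring the estimate back close to it.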

A weighted M-estimator would be:

- min_{β} SUM_{i=1 to N} [p_{j_i}]^{-1} q(**w_i**, **β**)

Here q(**w**, **β**) is the objective function that is chosen to identify the population parameters **β_o**. In the OLS case, q(**w_i**, **β**) = (y_i − **x_i** **β**)^2, the squared residual. An estimate of the asymptotic variance matrix of the weighted estimator in the linear model is:

- [SUM_i p^{-1} **x**'**x**]^{-1} [SUM_i p^{-2} u^2 **x**'**x**] [SUM_i p^{-1} **x**'**x**]^{-1}

where u is the regression residual and every variable carries the subscript *i* except p, which is indexed by stratum: *p = p_{j_i}*.
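As a rough illustration of the weighted estimator and the sandwich variance above, here is a minimal pure-Python sketch for a regression with an intercept and one regressor. It takes the OLS objective to be the squared residual, so the weighted estimator solves [SUM_i p^{-1} **x**'**x**] **β** = SUM_i p^{-1} **x**'y. The function names and the small data set are hypothetical and not taken from Wooldridge.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as rows [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))

def matmul2(m, n):
    """Product of two 2x2 matrices."""
    return tuple(
        tuple(sum(m[i][k] * n[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def weighted_ols(data):
    """data: list of (x, y, p); regressors are (1, x), p is the retention probability."""
    A = [[0.0, 0.0], [0.0, 0.0]]   # SUM_i p^{-1} x'x
    b = [0.0, 0.0]                 # SUM_i p^{-1} x'y
    for x, y, p in data:
        row = (1.0, x)
        for r in range(2):
            b[r] += row[r] * y / p
            for c in range(2):
                A[r][c] += row[r] * row[c] / p
    A_inv = inv2(A)
    beta = [A_inv[r][0] * b[0] + A_inv[r][1] * b[1] for r in range(2)]

    B = [[0.0, 0.0], [0.0, 0.0]]   # SUM_i p^{-2} u^2 x'x
    for x, y, p in data:
        row = (1.0, x)
        u = y - (beta[0] + beta[1] * x)   # residual
        for r in range(2):
            for c in range(2):
                B[r][c] += (u ** 2) * row[r] * row[c] / p ** 2
    V = matmul2(matmul2(A_inv, B), A_inv)  # sandwich form
    return beta, V

# Hypothetical observations (x, y, p) from two strata with p = 0.5 and p = 0.25.
data = [(1.0, 5.1, 0.5), (2.0, 7.9, 0.5), (3.0, 11.2, 0.25),
        (4.0, 13.8, 0.25), (5.0, 17.1, 0.25)]
beta, V = weighted_ols(data)
print(beta, V)
```

The square roots of the diagonal entries of `V` would serve as standard errors for the intercept and slope; with more regressors a linear algebra library would replace the hand-written 2x2 helpers.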

**Stata**

A simple and clear example of how to use weights in a stratified sample can be found at the UCLA Academic Technology Services website (“Stata FAQ: How do I use the Stata survey (svy) commands?”). There are three main variables which need to be defined.

- The **primary sampling unit** (psu) is the unit that is sampled first. In a single-stage design such as the one above, it coincides with the unit of observation, usually an individual or a household identification number. In the econometric section above, psu’s are indexed by the letter *i*.
- The **strata** are the groups into which the data set is divided. The strata are indexed by the letter *j*. In the first example above, there are two strata: rich households and poor households.
- The **sampling weight** is defined as the inverse of the psu’s probability of selection.

To program this into Stata, we would write:

**svyset [pweight=wt], psu(house) strata(eth)**

Here the psu is the variable *house*, the strata are categorized by *eth* (a variable for the ethnic group) and the weight is the variable *wt*. To run a weighted least squares regression (WLS), you would simply type:

**svy: regress y x1 x2 x3**

and the appropriate weighting will occur.