Let us assume that our null hypothesis is that when someone is sick, it is not swine flu. A type I error is a false positive. That is, we claim that the person has swine flu when they actually do not. A type II error is a false negative. This means that the person has swine flu, but we erroneously conclude that they do not.
What is the probability that someone who has flu-like symptoms actually has swine flu? We can calculate this using Bayes' rule:
- P(H1N1|symptoms) = P(symptoms|H1N1)*P(H1N1)/P(symptoms)
Let us assume that all individuals with swine flu have symptoms, so that P(symptoms|H1N1)=1. Let us assume 2% of the population gets any type of flu each year and displays symptoms. Let us assume only 0.02% of the population gets H1N1. So, P(symptoms)=0.02 and P(H1N1)=0.0002. Thus we have:
- P(H1N1|symptoms) = 1*0.0002/0.02 = 0.01
This means that if we see a random person with flu-like symptoms, there is only a 1% chance that they actually have swine flu.
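The calculation can be checked with a few lines of Python. This is only a minimal sketch: the function name `posterior` is my own, and the inputs are the assumed rates from the text, not measured values. It applies Bayes' rule in the form P(H|E) = P(E|H)*P(H)/P(E).

```python
def posterior(p_evidence_given_h, p_h, p_evidence):
    """Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E)."""
    return p_evidence_given_h * p_h / p_evidence

# Assumed rates taken from the text (illustrative, not measured)
p_symptoms_given_h1n1 = 1.0   # everyone with H1N1 shows symptoms
p_h1n1 = 0.0002               # 0.02% of the population gets H1N1
p_symptoms = 0.02             # 2% of the population has flu symptoms

p_h1n1_given_symptoms = posterior(p_symptoms_given_h1n1, p_h1n1, p_symptoms)
print(f"P(H1N1 | symptoms) = {p_h1n1_given_symptoms:.2%}")
```

Changing `p_h1n1` shows how sensitive the answer is to the assumed prevalence: the posterior scales one-for-one with it.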
This may explain why the CDC and WHO ignored early warnings from a Washington-based biosurveillance company concerning a possible flu outbreak. Although there was an increase in the number of cases of influenza, the probability that it was an outbreak of H1N1 (or any type of outbreak) was low. Although the probability of a false positive was high, the cost of a false negative is also large. Ex post, it is obvious that the CDC and WHO should have acted more quickly to fight the spread of H1N1. Ex ante, these organizations likely receive numerous reports of potential outbreaks, and acting on every single one, most of which turn out to be false, would be very costly. Identifying the optimal time to initiate school closings and public health warnings is very difficult and must take into account both the probabilities and the costs of type I and type II errors.
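One simple way to think about this trade-off is to compare expected costs. The sketch below is a deliberately crude illustration, not how any health agency actually decides. It assumes that acting fully averts the cost of a missed outbreak, and the cost figures are hypothetical numbers in arbitrary units that I made up for the example.

```python
def break_even_probability(cost_false_positive, cost_false_negative):
    """Outbreak probability above which acting has the lower expected cost."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)

def should_act(p_outbreak, cost_false_positive, cost_false_negative):
    """Act when the expected cost of a miss exceeds that of a false alarm."""
    expected_cost_if_wait = p_outbreak * cost_false_negative       # type II error
    expected_cost_if_act = (1 - p_outbreak) * cost_false_positive  # type I error
    return expected_cost_if_wait > expected_cost_if_act

# Hypothetical inputs, chosen only for illustration
p_outbreak = 0.01
for cost_fn in (50, 200):
    decision = should_act(p_outbreak, cost_false_positive=1,
                          cost_false_negative=cost_fn)
    print(f"a miss costs {cost_fn}x a false alarm -> act: {decision}")
```

Under these assumptions, a 1% chance of an outbreak justifies action only when a miss is roughly a hundred times as costly as a false alarm.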
The History of Least Squares
March 23, 2009 in Books, Econometrics
Let us say you have 10 observations of 2 different variables. How do you determine which of the observations to use? Should you throw out the outliers? Should you only include the most similar values? Do more observations increase or decrease the amount of measurement error?
These questions can be answered by the discipline of statistics. An interesting book by Stigler recounts The History of Statistics. Astronomers led many of the statistical advances in the seventeenth and eighteenth centuries. Accurate measurement is very important to astronomers. Further, observations of the circumference and oblateness of the earth were made at different times and places throughout history. This leaves a conundrum: how best to combine these observations?
Mayer, Boscovich, and others contributed to the development of the idea of least squares, but Stigler credits Legendre with the invention of least squares. Legendre came up with the idea in his attempt to measure the length of the meridian quadrant (the distance from the equator to the North Pole) through Paris.
To demonstrate some of his ideas, I will use a simpler example. Let us assume that a drug can have a dosage level between 0 and 5 and we want to find its impact on health (measured on a 0-10 scale). Let us look at the following data. The goal is to find the parameters m (slope) and b (intercept) that accurately measure the relationship between drug dosage and health (ignore any questions of endogeneity). Should we include all 10 observations?
Although Euler recognized that including more observations increases the maximum possible error, Legendre realized that adding more observations also greatly increased the probability of getting close to the true value of the parameters of interest.
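A small simulation illustrates Legendre's side of the argument. It is a simplified sketch rather than a regression: it averages repeated noisy readings of a single quantity, and the true value, the Gaussian noise, the tolerance, and the function name are all assumptions of mine.

```python
import random

def share_of_close_estimates(n_obs, trials=2000, true_value=5.0,
                             noise_sd=1.0, tolerance=0.25, seed=0):
    """Fraction of trials whose sample mean lands within `tolerance` of the truth."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    close = 0
    for _ in range(trials):
        readings = [true_value + rng.gauss(0.0, noise_sd) for _ in range(n_obs)]
        estimate = sum(readings) / n_obs
        if abs(estimate - true_value) <= tolerance:
            close += 1
    return close / trials

share_few = share_of_close_estimates(n_obs=2)
share_many = share_of_close_estimates(n_obs=50)
print(f"2 observations:  {share_few:.0%} of estimates within 0.25")
print(f"50 observations: {share_many:.0%} of estimates within 0.25")
```

With more readings per estimate, a much larger share of the estimates falls close to the true value, even though any single reading can still be far off.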
In my example, we need to fit a line to measure the parameters m and b. How do we set up the errors so that we have the most accurate calculations? Laplace believed that the following two conditions would need to hold:
The first condition basically says that the errors are uncorrelated with the independent variables on average. The second condition hopes to minimize the errors. Legendre extended Laplace’s second condition to minimize the sum of the squared errors rather than just the absolute error level.
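To make the least-squares idea concrete, here is a minimal sketch of a one-variable fit in plain Python. The data are hypothetical (they are not the ten observations from my example), and the function names are my own. The slope and intercept use the standard closed-form solution for a single regressor.

```python
def fit_least_squares(xs, ys):
    """Return (m, b) minimizing the sum of squared errors of y = m*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    b = y_bar - m * x_bar
    return m, b

def sum_squared_errors(xs, ys, m, b):
    """Legendre's criterion: the sum of squared residuals."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical dosage/health data, made up for illustration
dosage = [0, 1, 2, 3, 4, 5]
health = [3, 5, 5, 7, 8, 9]

m_hat, b_hat = fit_least_squares(dosage, health)
residuals = [y - (m_hat * x + b_hat) for x, y in zip(dosage, health)]

print(f"m = {m_hat:.4f}, b = {b_hat:.4f}")
print(f"sum of errors       = {sum(residuals):.2e}")
print(f"sum of errors * x   = {sum(r * x for r, x in zip(residuals, dosage)):.2e}")
print(f"sum of squared errs = {sum_squared_errors(dosage, health, m_hat, b_hat):.4f}")
```

At the least-squares solution the errors sum to zero and are uncorrelated with the dosage (up to rounding), and nudging m or b away from the fitted values can only raise the sum of squared errors.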
Another key point is that this regression line must go through the “center of gravity.” In my example, the average dosage for the ten observations is 2.2 and the average health level is 5.9. This means the center of gravity is at the coordinates (2.2, 5.9). The solution in my example is to set m=1.1456 and b=3.3797. We see that if we plug 2.2 into the equation, the output is 5.9; thus, the regression line does indeed go through the center of gravity.
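This property follows from the least-squares intercept, which satisfies b = (mean of y) - m*(mean of x). A quick check in Python using the rounded values reported above (the variable names are my own):

```python
# Rounded estimates and sample means reported in the text
m, b = 1.1456, 3.3797
mean_dosage, mean_health = 2.2, 5.9

# Evaluate the fitted line at the average dosage
predicted = m * mean_dosage + b
gap = abs(predicted - mean_health)

print(f"predicted health at mean dosage: {predicted:.4f}")
print(f"gap from mean health: {gap:.5f}")
```

The small remaining gap comes from rounding m and b to four decimal places.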
Understanding the historical development of modern statistical techniques is an interesting task, and Stigler’s book enlightens the reader with much detail.