Errors and residuals in statistics

In statistics and optimization, statistical errors and residuals are two closely related and easily confused measures of the "deviation of a sample from the mean": the error of a sample is the deviation of the sample from the (unobservable) population mean or true underlying function, while the residual of a sample is the difference between the sample and either (1) the (observed) sample mean or (2) the regressed (fitted) function value. The fitted function value is the value that the statistical model says the sample "should" have. The distinction is most important in regression analysis, where the subtle behavior of residuals leads to the concept of studentized residuals.

Univariate explanation

For a univariate distribution, the distinction between errors and residuals is just the difference between deviations from the population mean versus the sample mean.

A statistical error is the amount by which an observation differs from its expected value, the latter being based on the whole population from which the statistical unit was chosen randomly. The expected value, being for instance the mean of the entire population, is typically unobservable. If the mean height in a population of 21-year-old men is 1.75 meters, and one randomly chosen man is 1.80 meters tall, then the "error" is 0.05 meters; if the randomly chosen man is 1.70 meters tall, then the "error" is −0.05 meters. The nomenclature arose from random measurement errors in astronomy: it is as if the measurement of the man's height were an attempt to measure the population mean, so that any difference between the man's height and the mean would be a measurement error.

A residual (or fitting error), on the other hand, is an observable estimate of the unobservable statistical error. The simplest case involves a random sample of n men whose heights are measured. The sample mean is used as an estimate of the population mean. Then we have:

  • The difference between the height of each man in the sample and the unobservable population mean is a statistical error, and
  • The difference between the height of each man in the sample and the observable sample mean is a residual.

Note that the sum of the residuals within a random sample is necessarily zero, and thus the residuals are necessarily not independent. The sum of the statistical errors within a random sample need not be zero; the statistical errors are independent random variables if the individuals are chosen from the population independently.
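
The point can be illustrated numerically. The following Python sketch (illustrative only; it relies on NumPy and made-up population values) simulates a sample of heights: the errors require the unobservable population mean, the residuals use only the observable sample mean, and the residuals sum to zero while the errors generally do not.

  import numpy as np

  # Illustrative sketch: mu and sigma are hypothetical population values.
  rng = np.random.default_rng(0)
  mu, sigma, n = 1.75, 0.07, 10            # hypothetical population mean/sd in meters
  heights = rng.normal(mu, sigma, n)       # a random sample of n men

  errors = heights - mu                    # statistical errors: need the unobservable mu
  residuals = heights - heights.mean()     # residuals: use the observable sample mean

  print(errors.sum())      # generally nonzero
  print(residuals.sum())   # zero up to floating-point rounding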

In sum:

  • Residuals are observable; statistical errors are not.
  • Statistical errors are often independent of each other; residuals are not (at least in the simple situation described above, and in most others).

One can standardize statistical errors (especially of a normal distribution) as z-scores (or "standard scores"), and standardize residuals as t-statistics, or more generally as studentized residuals.
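
As a brief sketch of the two standardizations (illustrative only, with assumed population values and NumPy): the z-score divides an error by the population standard deviation, whereas the t-statistic for a hypothesized mean replaces the population standard deviation by the sample standard deviation, a quantity computed from the residuals.

  import numpy as np

  rng = np.random.default_rng(1)
  mu, sigma, n = 1.75, 0.07, 10                 # assumed population values
  x = rng.normal(mu, sigma, n)

  z_scores = (x - mu) / sigma                   # standardized errors: require mu and sigma
  t_stat = (x.mean() - mu) / (x.std(ddof=1) / np.sqrt(n))   # t-statistic for the hypothesized mean mu
  print(z_scores.round(2))
  print(t_stat)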

Example with some mathematical theory

If we assume a normally distributed population with mean μ and standard deviation σ, and choose individuals independently, then we have

X_1, \dots, X_n \sim N(\mu, \sigma^2)

and the sample mean

\overline{X} = \frac{X_1 + \cdots + X_n}{n}

is a random variable distributed thus:

\overline{X}\sim N(\mu, \sigma^2/n).
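
A quick simulation (illustrative, with arbitrarily chosen values of μ, σ and n) confirms that the variance of the sample mean is close to σ²/n.

  import numpy as np

  # Assumed parameters; 100,000 replications of a sample of size n.
  rng = np.random.default_rng(2)
  mu, sigma, n, reps = 0.0, 2.0, 25, 100_000

  sample_means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
  print(sample_means.var())    # close to sigma**2 / n = 0.16
  print(sigma**2 / n)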

The statistical errors are then

\varepsilon_i = X_i - \mu,

whereas the residuals are

\widehat{\varepsilon}_i=X_i-\overline{X}.

(As is often done, the "hat" over the letter ε indicates an observable estimate of an unobservable quantity called ε.)

The sum of squares of the statistical errors, divided by σ², has a chi-square distribution with n degrees of freedom:

\sum_{i=1}^n \left(X_i-\mu\right)^2/\sigma^2\sim\chi^2_n.

This quantity, however, is not observable. The sum of squares of the residuals, on the other hand, is observable. The quotient of that sum by σ² has a chi-square distribution with only n − 1 degrees of freedom:

\sum_{i=1}^n \left(\,X_i-\overline{X}\,\right)^2/\sigma^2\sim\chi^2_{n-1}.
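
The loss of one degree of freedom can be checked numerically. In the sketch below (simulated data with arbitrary parameter values), the scaled sum of squared errors averages about n and the scaled sum of squared residuals averages about n − 1, matching the means of the two chi-square distributions.

  import numpy as np

  rng = np.random.default_rng(3)
  mu, sigma, n, reps = 5.0, 2.0, 10, 100_000
  x = rng.normal(mu, sigma, size=(reps, n))

  ss_errors = ((x - mu) ** 2).sum(axis=1) / sigma**2
  ss_residuals = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) / sigma**2

  print(ss_errors.mean())      # close to n = 10
  print(ss_residuals.mean())   # close to n - 1 = 9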

It is remarkable that the sum of squares of the residuals and the sample mean can be shown to be independent of each other. That fact and the normal and chi-square distributions given above form the basis of calculations involving the quotient

\frac{\overline{X}_n - \mu}{S_n/\sqrt{n}}.

Here S_n denotes the sample standard deviation, computed from the residuals. The probability distributions of the numerator and the denominator separately depend on the value of the unobservable population standard deviation σ, but σ appears in both the numerator and the denominator and cancels. That is fortunate because it means we know the probability distribution of this quotient: it has a Student's t-distribution with n − 1 degrees of freedom. We can therefore use this quotient to find a confidence interval for μ.
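
For instance, the sketch below (simulated heights with assumed parameter values, using SciPy's Student's t quantiles) computes a 95% confidence interval for μ from this quotient.

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(4)
  x = rng.normal(1.75, 0.07, size=10)       # hypothetical sample of 10 heights
  n, xbar, s = len(x), x.mean(), x.std(ddof=1)

  t_crit = stats.t.ppf(0.975, df=n - 1)     # quantile of Student's t with n - 1 degrees of freedom
  half_width = t_crit * s / np.sqrt(n)
  print(xbar - half_width, xbar + half_width)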

Regressions

In regression analysis, the distinction between errors and residuals is subtle and important, and leads to the concept of studentized residuals.

Given a function that relates the independent variable to the dependent variable (say, a line), the deviations of the observations from this function are the errors. If one runs a regression on some data, then the deviations of the observations from the fitted function are the residuals.

However, because of the behavior of the regression process, the distributions of the residuals at different data points (values of the input variable) may differ, even if the errors themselves are identically distributed. Concretely, in a linear regression where the errors are identically distributed, the residuals at inputs in the middle of the domain are more variable than the residuals at inputs near the ends of the domain: linear regressions fit the endpoints better than the middle. This is also reflected in the influence functions of the various data points on the regression coefficients: endpoints have more influence.

Thus, to compare residuals at different inputs, one needs to adjust the residuals by their expected variability, a process called studentizing. This is particularly important when detecting outliers: a residual of a given size may be expected in the middle of the domain, but be considered an outlier at the end of the domain.
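
The sketch below (simulated straight-line data, not an example from the article) fits a line by least squares, computes the leverages from the hat matrix, and forms internally studentized residuals by dividing each residual by its estimated standard deviation; the endpoints show the largest leverages, and the studentized residuals are on a comparable scale at every input.

  import numpy as np

  rng = np.random.default_rng(5)
  n = 50
  x = np.linspace(0.0, 10.0, n)
  y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, n)    # true line plus i.i.d. N(0, 1) errors

  X = np.column_stack([np.ones(n), x])           # design matrix for intercept + slope
  beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares fit
  residuals = y - X @ beta

  H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix; leverages on the diagonal
  leverage = np.diag(H)
  s2 = residuals @ residuals / (n - 2)           # residual variance estimate
  studentized = residuals / np.sqrt(s2 * (1.0 - leverage))

  print(leverage[[0, n // 2, n - 1]])            # endpoints have the larger leverages
  print(studentized[:5].round(2))                # roughly unit scale at every x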

See also

  • Absolute deviation
  • Deviation (statistics)
  • Error detection and correction
  • Margin of error
  • Mean absolute error
  • Propagation of error
  • Root mean square deviation
  • Sampling error
  • Studentized residual
