Pearson's chi-square test


Pearson's chi-square test is a statistical test used to assess goodness of fit or to test whether there is a difference between samples of data [1]. A one-sample chi-square test measures the goodness of fit between observed frequencies and theoretical expected frequencies [2]. An example of a one-sample chi-square is soil type on a farm: the farm is the single sample being tested, and the soil types are the categories. A sample can have multiple categories, but only one sample is tested. A two-or-more-samples chi-square test is used to test for differences between samples of data [3]. An example of a two-sample chi-square is testing for differences between the soil types on two separate farms; the two farms are the two samples.

The chi-square test can be used to assess geographic data. The one-sample test enables geographers to examine differences between observed and expected data, while the two-or-more-samples test enables them to examine differences between samples.
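As an illustration of the two-sample case, the chi-square statistic for a contingency table of soil-type counts can be computed directly from its definition. The function name, farms, and counts below are all hypothetical; in practice a library routine such as `scipy.stats.chi2_contingency` performs the same computation and also returns the p-value.

```python
def chi2_contingency_stat(table):
    """Chi-square statistic for a 2-D table of observed frequencies."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count under the hypothesis of no difference
            # between the samples (row/column independence).
            exp = row_totals[i] * col_totals[j] / grand
            stat += (obs - exp) ** 2 / exp
    return stat

# Hypothetical plot counts of three soil types on two farms.
observed = [
    [30, 14, 6],   # Farm A: clay, loam, sand
    [20, 22, 8],   # Farm B
]
print(round(chi2_contingency_stat(observed), 3))
```

The statistic is then compared against a chi-squared distribution whose degrees of freedom come from the table's dimensions, here (2 − 1) × (3 − 1) = 2.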

Criteria for a Chi-Square Test

-In terms of the scale of measurement, the data must be nominal.

-The data may also be "categorized" ordinal or interval data.

-The categories of data must be mutually exclusive.

-The "data must be in frequencies, i.e. the number of discrete objects occurring in different categories" [4]. The data cannot be in percentages or proportions.

One-Sample Chi-Square Test

The chi-square statistic is a sum of differences between observed and expected outcome frequencies, each squared and divided by the expectation:

 \chi^2 = \sum_{i=1}^n {\frac{(O_i - E_i)^2}{E_i}}

where:

O_i = an observed frequency for the i^{th} bin
E_i = an expected (theoretical) frequency for the i^{th} bin, asserted by the null hypothesis

The resulting value can be compared to the chi-square distribution to determine the goodness of fit.

To determine the degrees of freedom of the chi-squared distribution, one takes the total number of observed frequencies and subtracts one. For example, if there are eight different frequencies, one would compare against a chi-squared distribution with seven degrees of freedom.
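The one-sample statistic can be sketched directly from the formula above. The data here are hypothetical: 60 rolls of a die compared against a fair-die expectation of 10 per face.

```python
def chi_square_stat(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all bins."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts from 60 die rolls (one bin per face).
observed = [8, 9, 12, 11, 6, 14]
expected = [10] * 6            # fair die: 60 rolls / 6 faces

stat = chi_square_stat(observed, expected)
df = len(observed) - 1         # six bins -> five degrees of freedom
print(stat, df)
```

The resulting statistic would then be compared to the chi-squared distribution with five degrees of freedom to judge whether the die departs from fairness.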

Another way to describe the chi-squared statistic is with the differences weighted based on measurement error:

 \chi^2 = \sum {\frac{(O - E)^2}{\sigma^2}}

where \sigma^2 is the variance of the observation [5]. This definition is useful when one has estimates for the error on the measurements.

The reduced chi-squared statistic is simply the chi-squared divided by the number of degrees of freedom: [5] [6] [7] [8]

 \chi_{red}^2 = \frac{\chi^2}{\nu} = \frac{1}{\nu} \sum {\frac{(O - E)^2}{\sigma^2}}

where \nu is the number of degrees of freedom, usually given by N - n - 1, where N is the number of bins and n is the number of fit parameters. The advantage of the reduced chi-squared is that it normalizes for the number of data points and the model complexity. As a rule of thumb, a large \chi_{red}^2 indicates a poor model fit. A \chi_{red}^2 < 1 indicates that the model is over-fitting the data (either the model is improperly fitting noise, or the error bars have been over-estimated), while a \chi_{red}^2 > 1 indicates that the fit has not fully captured the data (or that the error bars have been under-estimated). In principle, \chi_{red}^2 = 1 is the best fit for the given data and error bars.
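A minimal sketch of the reduced chi-squared for a fitted model; the measurements, model values, error bars, and parameter count below are all made up for illustration.

```python
def reduced_chi2(observed, model, sigma, n_params):
    """Chi-squared weighted by measurement error, divided by nu = N - n - 1."""
    chi2 = sum((o - m) ** 2 / s ** 2
               for o, m, s in zip(observed, model, sigma))
    nu = len(observed) - n_params - 1   # N bins, n fit parameters
    return chi2 / nu

observed = [2.1, 3.9, 6.2, 7.8, 10.1]   # hypothetical measurements
model    = [2.0, 4.0, 6.0, 8.0, 10.0]   # e.g. a straight-line fit y = 2x
sigma    = [0.2] * 5                    # assumed uniform error bars

print(round(reduced_chi2(observed, model, sigma, 2), 3))
```

With a value above 1, the rule of thumb above would suggest the fit has not fully captured the data, or the error bars are under-estimated.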

Binomial Case

A binomial experiment is a sequence of independent trials, each of which results in one of two outcomes, success or failure. There are n trials, each with probability of success p; more generally, the outcome of each trial may fall into one of k categories with probabilities p_1, ..., p_k. Provided that np_i ≫ 1 for every i (where i = 1, 2, ..., k), then

 \chi^2 = \sum_{i=1}^{k} {\frac{(N_i - np_i)^2}{np_i}} = \sum_{\mathrm{all\ cells}}^{} {\frac{(\mathrm{O} - \mathrm{E})^2}{\mathrm{E}}}.

This has approximately a chi-squared distribution with k − 1 df. The fact that df = k − 1 is a consequence of the restriction \sum N_i = n: there are k observed cell counts, but once any k − 1 of them are known, the remaining one is uniquely determined. In other words, only k − 1 cell counts are freely determined, so df = k − 1.
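The binomial case with k = 2 cells can be sketched as follows, using a hypothetical run of 100 coin flips; because the two counts must sum to n, there is k − 1 = 1 degree of freedom.

```python
# Hypothetical binomial experiment: n = 100 coin flips, p = 0.5.
n, p = 100, 0.5
observed = [58, 42]                     # heads, tails (made-up counts)
expected = [n * p, n * (1 - p)]         # np_i for each of the k = 2 cells

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                  # k - 1 = 1, since counts sum to n
print(chi2, df)
```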


References

  1. Ebdon, David (1985). Statistics in Geography, page 66, 67. Oxford: Blackwell Publishing, 1985.
  2. Ebdon, David (1985). Statistics in Geography, page 66. Oxford: Blackwell Publishing, 1985.
  3. Ebdon, David (1985). Statistics in Geography, page 67, 71. Oxford: Blackwell Publishing, 1985.
  4. Ebdon, David (1985). Statistics in Geography, page 67. Oxford: Blackwell Publishing, 1985.
  5. Charlie Laub and Tonya L. Kuhl: Chi-Square Data Fitting. University of California, Davis.
  6. John Robert Taylor: An introduction to error analysis, page 268. University Science Books, 1997.
  7. Kirkman, T.W.: Chi-Square Curve Fitting.
  8. David M. Glover, William J. Jenkins, and Scott C. Doney: Least Squares and regression techniques, goodness of fit and tests, non-linear least squares techniques. Woods Hole Oceanographic Institution, 2008.

