Predictive model

From wiki.gis.com


A predictive model is a simulation of a hypothesized explanation of a process, used to predict the current state of a study area. Typically, the prediction is then compared statistically with the actual state of the study area to test the validity of the hypothesis. Predictive modeling is a common tool for scientific research in the field sciences, and GIS is well suited to implementing a variety of spatial predictive models.

Types of predictive models

There are two broad types of geospatial predictive models: deductive and inductive.

Crime Forecast of Washington DC. Red and orange colors indicate areas of high risk. The risk assessment was generated using an inductive predictive modeling tool called Signature Analyst. Signature Analyst is used to analyze past events and predict where subsequent events are most likely to occur.

Deductive Method

The deductive method relies on qualitative data or a subject matter expert (SME) to describe the relationship between event occurrences and the factors that describe the environment. As a result, the deductive process generally relies on more subjective information. This means the modeler may limit the model by supplying only as many factors as a person can reasonably comprehend.

An example of a deductive model is as follows. A set of events is typically found:

  • between 100 and 700 meters from airports
  • in the grassland land cover category
  • at elevations between 1000 and 1500 meters

In this deductive model, high-suitability locations for the set of events are constrained by non-empirically calculated spatial ranges for airports, land cover, and elevation; everywhere else is treated as lower suitability. The accuracy and detail of a deductive model are limited by the depth of the qualitative data supplied to it.
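The three expert-supplied rules in this example can be written down directly as a rule-based classifier. The following is a minimal sketch, not a reference implementation: the `Cell` record, its field names, and the sample values are hypothetical, and the range bounds are assumed to be inclusive, which the example does not specify.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """One location in the study area (field names are illustrative)."""
    dist_to_airport_m: float
    land_cover: str
    elevation_m: float


def deductive_suitability(cell: Cell) -> str:
    """Return 'high' only when all three expert-supplied rules hold."""
    near_airport = 100 <= cell.dist_to_airport_m <= 700
    is_grassland = cell.land_cover == "grassland"
    in_elevation_band = 1000 <= cell.elevation_m <= 1500
    if near_airport and is_grassland and in_elevation_band:
        return "high"
    return "lower"


# Hypothetical sample locations, not real data.
cells = [
    Cell(250.0, "grassland", 1200.0),   # satisfies all three rules
    Cell(250.0, "forest", 1200.0),      # wrong land cover
    Cell(900.0, "grassland", 1200.0),   # too far from an airport
]
ratings = [deductive_suitability(c) for c in cells]
print(ratings)  # ['high', 'lower', 'lower']
```

Note that nothing in the code is derived from observed events: every threshold comes from the expert, which is what makes the model deductive.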

Inductive Method

The inductive method relies on the empirically calculated spatial relationship between historical or known event occurrence locations and the factors that make up the environment (infrastructure, socio-cultural, topographic, etc.). Each event occurrence is plotted in geographic space, and a quantitative relationship is defined between the event occurrence and the environmental factors. The advantage of this method is that software can empirically discover both known and unknown correlations between factors and events, harnessing the speed of computers, which is crucial when hundreds of factors are involved. Those quantitative relationship values are then processed by a statistical function to find spatial patterns that define high- and low-suitability areas for event occurrence.
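One simple way to quantify such an empirical relationship for a single categorical factor is a frequency ratio: the share of events falling in each class divided by that class's share of the study area. The sketch below is only an illustration of the idea and is not the statistical function used by any particular tool; the land cover classes and counts are invented for the example.

```python
from collections import Counter

# Hypothetical data: land cover class of every cell in the study area,
# and the land cover class at each known event location.
study_area = ["grassland"] * 50 + ["forest"] * 30 + ["urban"] * 20
event_sites = ["grassland"] * 12 + ["forest"] * 2 + ["urban"] * 6


def frequency_ratios(events, background):
    """Share of events in each class divided by that class's share of the area.

    A ratio above 1 means events occur in the class more often than its
    extent alone would suggest; a ratio below 1 means less often.
    """
    event_counts = Counter(events)
    area_counts = Counter(background)
    ratios = {}
    for cls, area_n in area_counts.items():
        event_share = event_counts.get(cls, 0) / len(events)
        area_share = area_n / len(background)
        ratios[cls] = event_share / area_share
    return ratios


ratios = frequency_ratios(event_sites, study_area)
for cls in sorted(ratios, key=ratios.get, reverse=True):
    print(f"{cls}: {ratios[cls]:.2f}")
```

In this invented data set, urban cells hold 30% of events but only 20% of the area, so they score highest. Unlike the deductive case, the weights come from the event locations themselves, and the same calculation can be repeated over many factors.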

Presence-Absence Models

Modeling Africanized Honeybee Habitat near St. George, Utah. Observed presence and absence points allow the model to identify which environmental characteristics can be used to accurately predict suitable habitat of the africanized honeybee.

Presence-absence models are built from specific points that record the presence or absence of a binary variable, usually coded as a series of 0s and 1s. The variable must be completely inclusive or exclusive, meaning it needs to have an obvious opposite [1]. For example, a presence-absence model can use "yes" and "no" as its values: the two are complete opposites, and every observation can be categorized as one or the other. One example of presence-absence modeling is habitat modeling, in which researchers predict where the habitat of a species is or is not located. The "Modeling Africanized Honeybee Habitat" map, pictured on the right, is an example of habitat modeling.

The researcher combines different data sets to form a logistic model that predicts where the variable in question is or is not present. The predictions can then be compared with field (in-situ) data. According to Pearson [2], each point in the result falls into one of four outcomes:

  • true positive (the model predicts that the species is present and test data confirms this to be true)
  • false positive (the model predicts presence but test data show absence)
  • true negative (the model predicts absence and the test data show absence)
  • false negative (the model predicts absence but test data show presence)

This is a simple way to test the usefulness of the model. A model whose predictions are mostly true positives and true negatives, with few false positives and false negatives, is a suitable model. Summary statistics can then be used to describe the model's performance more precisely.

Examples of when a presence-absence model is used include the following: habitat models predicting biodiversity niches, business models predicting the best location for a new, successful business, or a model predicting an environmental response to climate change.
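The four outcomes can be tallied and summarized in a few lines. This is a minimal sketch assuming predictions and test observations are paired lists coded as 1 (presence) and 0 (absence); the sample values are invented, and accuracy, sensitivity, and specificity are shown as common summary statistics rather than ones prescribed by the source.

```python
def confusion_counts(predicted, observed):
    """Tally the four outcomes for paired presence (1) / absence (0) values."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for p, o in zip(predicted, observed):
        if p == 1 and o == 1:
            counts["TP"] += 1      # predicted present, observed present
        elif p == 1 and o == 0:
            counts["FP"] += 1      # predicted present, observed absent
        elif p == 0 and o == 0:
            counts["TN"] += 1      # predicted absent, observed absent
        else:
            counts["FN"] += 1      # predicted absent, observed present
    return counts


def summary_statistics(c):
    """Accuracy, sensitivity (true positive rate), specificity (true negative rate)."""
    total = c["TP"] + c["FP"] + c["TN"] + c["FN"]
    return {
        "accuracy": (c["TP"] + c["TN"]) / total,
        "sensitivity": c["TP"] / (c["TP"] + c["FN"]),
        "specificity": c["TN"] / (c["TN"] + c["FP"]),
    }


# Hypothetical model output and field observations for ten test points.
predicted = [1, 1, 1, 0, 0, 0, 1, 0, 0, 1]
observed = [1, 1, 0, 0, 0, 1, 1, 0, 0, 1]

counts = confusion_counts(predicted, observed)
stats = summary_statistics(counts)
print(counts)  # {'TP': 4, 'FP': 1, 'TN': 4, 'FN': 1}
print(stats)   # {'accuracy': 0.8, 'sensitivity': 0.8, 'specificity': 0.8}
```

Reporting sensitivity and specificity separately is useful because overall accuracy alone can look high when presences are rare and the model simply predicts absence everywhere.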

"Testing Hypotheses Suggested by Data" Fallacy

In predictive modeling, the results can be affected by what is known as the "testing hypotheses suggested by data" fallacy. In statistics, when a hypothesis suggested by a data set is tested using the same data set that suggested it, the hypothesis is often accepted as true even when in reality it is not. This typically happens when a hypothesis is formed from a limited data set and then tested on that same data set [3]. To be tested properly, the hypothesis must be evaluated against a new data set.

The general problem with testing hypotheses suggested by data is the strong risk of a false positive [3]. In predictive modeling, while researchers and scientists are usually looking for generalizations from a limited data set, they must take the "testing hypotheses suggested by data" fallacy into account in order to have a more accurate model.
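A small simulation can make the risk concrete. In the sketch below, which rests entirely on invented random data, events and 200 candidate factors are generated independently, so no factor has any real relationship with the events. The factor that looks best on a small data set is then re-tested on a large independent one. The sample sizes, factor count, and seed are arbitrary assumptions.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation; returns 0.0 if either series has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def noise_dataset(n_sites, n_factors, rng):
    """Events and factors drawn independently: no factor truly predicts events."""
    events = [rng.randint(0, 1) for _ in range(n_sites)]
    factors = [[rng.gauss(0, 1) for _ in range(n_sites)] for _ in range(n_factors)]
    return events, factors


rng = random.Random(42)
n_factors = 200

# Small "original" data set: pick the factor that looks best on it.
events_a, factors_a = noise_dataset(20, n_factors, rng)
best = max(range(n_factors), key=lambda j: abs(pearson_r(factors_a[j], events_a)))
r_same_data = pearson_r(factors_a[best], events_a)

# Large independent data set: re-test the same factor.
events_b, factors_b = noise_dataset(2000, n_factors, rng)
r_new_data = pearson_r(factors_b[best], events_b)

print(f"factor {best}: r on the data that suggested it = {r_same_data:+.2f}")
print(f"factor {best}: r on new data                   = {r_new_data:+.2f}")
```

The selected factor shows a sizeable correlation on the data that suggested it purely because it was chosen as the best of many, while the correlation on the new data falls back toward zero. That gap is the false positive the fallacy describes.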

Possible strategies for avoiding this fallacy include:

  • Collecting a high number of confirmation samples
  • Scheffé's Rule (statistical method)
  • Cross-validation
  • Using 70% of the original dataset for model calibration and a separate 30% for validation [4]
  • Bootstrapping (statistical method)
  • Other methods of compensation for multiple comparisons
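Two of these strategies, the 70/30 calibration and validation split and cross-validation, amount to partitioning the data before any model is fitted. The following is a minimal sketch using only the Python standard library; the function names, the `seed` parameter, and the choice of five folds are assumptions made for illustration.

```python
import random


def holdout_split(records, calibration_fraction=0.7, seed=0):
    """Shuffle a copy of the records and split into calibration / validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * calibration_fraction))
    return shuffled[:cut], shuffled[cut:]


def k_fold_splits(records, k=5, seed=0):
    """Yield (calibration, validation) pairs; each record is validated exactly once."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        calibration = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield calibration, validation


# Hypothetical identifiers for 100 observed sites.
site_ids = list(range(100))

calibration, validation = holdout_split(site_ids)
print(len(calibration), len(validation))  # 70 30

for fold_number, (cal, val) in enumerate(k_fold_splits(site_ids, k=5), start=1):
    print(f"fold {fold_number}: calibrate on {len(cal)}, validate on {len(val)}")
```

In both cases the model is calibrated on one portion and evaluated only on records it has not seen. For spatial data, a purely random split may still leave nearby, correlated points on both sides, so the partition may need to account for location.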

GIS&T Body of Knowledge

The GIS&T Body of Knowledge references this idea in sections GC6-1 and GC7-1.

References

  1. Stockwell D.R.B. The Power of Numeracy (2007) Niche Modeling
  2. Pearson, Richard G. Species' distribution modeling for conservation educators and practitioners. Center for Biodiversity and Conservation & Department of Herpetology American Museum of Natural History (2007) Distribution Modeling
  3. Testing hypotheses suggested by the data. Wikipedia, The Free Encyclopedia. Accessed 10 October 2012.
  4. Huberty, C.J. (1994) Applied discriminant analysis. Wiley Interscience, New York.