Data integrity

From wiki.gis.com

Data integrity is a term used in computer science and telecommunications that can mean ensuring data is "whole" or complete, the condition in which data is identically maintained during any operation (such as transfer, storage or retrieval), the preservation of data for its intended use, or, relative to specified operations, the a priori expectation of data quality. Put simply, data integrity is the assurance that data is consistent and correct.

Such integrity is often checked by means of a short value computed from the data, referred to as a Message Integrity Code (MIC) or Message Authentication Code (MAC).
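As a rough illustration of a message authentication code, the following Python sketch computes and verifies an HMAC-SHA256 tag using the standard library. The key, the messages and the function names make_tag and verify_tag are assumptions made for this example; they do not come from any particular protocol.

```python
import hashlib
import hmac


def make_tag(key: bytes, message: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over the message."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify_tag(key: bytes, message: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(make_tag(key, message), tag)


# Hypothetical key and message, for illustration only.
key = b"shared-secret-key"
message = b"transfer 100 to account 12345"
tag = make_tag(key, message)

# The unmodified message verifies; an altered account number does not.
intact = verify_tag(key, message, tag)
tampered = verify_tag(key, b"transfer 100 to account 99999", tag)
```

Here intact is True and tampered is False: without the key, a modified message cannot be given a matching tag, which is what separates a MAC from a plain checksum.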

In cryptography and information security in general, integrity refers to the validity of data. Integrity can be compromised through:

  • Malicious altering, such as an attacker altering an account number in a bank transaction, or forgery of an identity document
  • Accidental altering, such as a transmission error, or a hard disk crash
  • Programming errors that result in inconsistencies in the data

In terms of a database, data integrity refers to the process of ensuring that a database remains an accurate reflection of the universe of discourse it is modelling or representing. In other words, there is a close correspondence between the facts stored in the database and the real world it models [1].

Database data integrity models must not be confused with database consistency models, which do not focus on the integrity of the data but only on the consistency of the storage and retrieval mechanisms of the data. The database data integrity model ensures that the data is an accurate reflection of the entity, referential, and domain models, while the database consistency model ensures only that the storage and retrieval process functions properly, with no concern for the accuracy or usability of the actual data.

Types of integrity constraints

Data integrity is normally enforced in a database system by a series of integrity constraints or rules. Three types of integrity constraints are an inherent part of the relational data model: entity integrity, referential integrity and domain integrity.

Entity integrity concerns the concept of a primary key. Entity integrity is an integrity rule which states that every table must have a primary key and that the column or columns chosen to be the primary key should be unique and not null.
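A minimal sketch of entity integrity, using Python's built-in sqlite3 module and an in-memory database. The customer table, its columns and the helper try_insert are hypothetical names chosen for this example. NOT NULL is stated explicitly because SQLite does not imply it for every kind of primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The primary key must be unique and not null.
conn.execute(
    "CREATE TABLE customer ("
    "  customer_id TEXT PRIMARY KEY NOT NULL,"
    "  name TEXT"
    ")"
)
conn.execute("INSERT INTO customer VALUES ('C001', 'Alice')")


def try_insert(customer_id, name):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute("INSERT INTO customer VALUES (?, ?)", (customer_id, name))
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_ok = try_insert("C001", "Bob")  # reuses an existing key
null_ok = try_insert(None, "Carol")       # null key
new_ok = try_insert("C002", "Dave")       # fresh, non-null key
```

Only the last insert succeeds; the duplicate key and the null key are both rejected by the database itself.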

Referential integrity concerns the concept of a foreign key. The referential integrity rule states that any foreign key value can only be in one of two states. The usual state of affairs is that the foreign key value refers to a primary key value of some table in the database. Occasionally, and this will depend on the rules of the business, a foreign key value can be null. In this case we are explicitly saying that either there is no relationship between the objects represented in the database or that this relationship is unknown.
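The two permitted states of a foreign key can be sketched with sqlite3 as follows. The department and employee tables and the helper add_employee are assumptions for illustration. SQLite enforces foreign keys only after PRAGMA foreign_keys = ON has been issued on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite

conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        dept_id INTEGER REFERENCES department(dept_id)  -- nullable foreign key
    )"""
)
conn.execute("INSERT INTO department VALUES (1, 'Sales')")


def add_employee(emp_id, name, dept_id):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute("INSERT INTO employee VALUES (?, ?, ?)", (emp_id, name, dept_id))
        return True
    except sqlite3.IntegrityError:
        return False


refers_to_existing = add_employee(1, "Ann", 1)    # points at an existing primary key
null_reference = add_employee(2, "Ben", None)     # no relationship, or unknown
dangling_reference = add_employee(3, "Cy", 99)    # no department 99 exists
```

The first two inserts are accepted and the third is rejected, matching the rule that a foreign key value either refers to an existing primary key value or is null.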

Domain integrity specifies that all columns in a relational database must be declared on a defined domain. The primary unit of data in the relational data model is the data item. Such data items are said to be non-decomposable or atomic. A domain is a set of values of the same type. Domains are therefore pools of values from which actual values appearing in the columns of a table are drawn.
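SQLite has no separate statement for declaring a named domain, so the sketch below approximates domains with column types and CHECK constraints. This is a substitute technique, and the product table, its columns and the helper add_product are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each column draws its values from a restricted pool:
# price_cents from the non-negative integers, status from a fixed set of labels.
conn.execute(
    """CREATE TABLE product (
        product_id  INTEGER PRIMARY KEY,
        price_cents INTEGER NOT NULL CHECK (price_cents >= 0),
        status      TEXT NOT NULL CHECK (status IN ('active', 'discontinued'))
    )"""
)


def add_product(product_id, price_cents, status):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(
            "INSERT INTO product VALUES (?, ?, ?)",
            (product_id, price_cents, status),
        )
        return True
    except sqlite3.IntegrityError:
        return False


valid_row = add_product(1, 500, "active")
negative_price = add_product(2, -1, "active")     # outside the price domain
unknown_status = add_product(3, 500, "archived")  # outside the status domain
```

Only the first row is stored; the other two contain values that fall outside the declared pools and are rejected.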

If a database supports these features, it is the responsibility of the database to ensure data integrity as well as the consistency model for the data storage and retrieval. If a database does not support these features, it is the responsibility of the application to ensure data integrity, while the database supports the consistency model for the data storage and retrieval.

Having a single, well-controlled, and well-defined data integrity system increases stability (one centralized system performs all data integrity operations), performance (all data integrity operations are performed in the same tier as the consistency model), re-usability (all applications benefit from a single centralized data integrity system), and maintainability (one centralized system for all data integrity administration).

Today, since most modern relational databases support these features (see Comparison of relational database management systems), it has become the de facto responsibility of the database to ensure data integrity. Outdated and legacy systems that use file systems (text, spreadsheets, ISAM, flat files, etc.) for their consistency model lack any kind of data integrity model. This requires companies to invest a large amount of time, money, and personnel in the creation of data integrity systems on a per-application basis that effectively just duplicate the existing data integrity systems found in modern databases. Many companies, and indeed many database vendors themselves, offer products and services to migrate outdated and legacy systems to modern databases to provide these data integrity features. This offers companies substantial savings in time, money, and resources because they do not have to develop per-application data integrity systems that must be refactored each time business requirements change.

Examples

An example of a data integrity mechanism in cryptography is the use of SHA-256 hash values. A hash value is a fixed-size block of bytes (32 bytes for SHA-256) that acts as a digest of the content of a data item. Should the data change even slightly, the SHA-256 hash would yield a totally different result. MD5, which was formerly used to provide message integrity, has since been broken and is no longer recommended for that purpose.
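The effect of a small change on a SHA-256 hash can be sketched with Python's hashlib module. The sample records and the helper name sha256_hex are made up for this illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a 64-character hex string."""
    return hashlib.sha256(data).hexdigest()


# Two hypothetical records that differ in a single character.
original = b"account=12345;amount=100"
altered = b"account=12345;amount=900"

digest_original = sha256_hex(original)
digest_altered = sha256_hex(altered)

# A receiver recomputes the digest and compares it with the one it was given.
unchanged = sha256_hex(original) == digest_original
detected = digest_altered != digest_original
```

Both unchanged and detected are True. A bare hash only detects accidental changes; an attacker who can alter the data can also recompute the hash, which is why keyed codes such as a MAC are used against malicious alteration.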

Another example of a data integrity mechanism is the parent and child relationship of related records. If a parent record owns one or more related child records, all of the referential integrity processes are handled by the server itself, which automatically ensures the accuracy and integrity of the data so that no child record can exist without a parent (a condition known as being orphaned) and no parent loses its child records. It also ensures that no parent record can be deleted while it owns any child records. All of this is handled at the database level and does not require coding integrity checks into each application.
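The parent and child behaviour described above can be sketched with sqlite3 and an ON DELETE RESTRICT foreign key. The orders and order_line tables and the helper run are hypothetical, and other database systems may spell the same rule differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite

conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY)")
conn.execute(
    """CREATE TABLE order_line (
        line_id  INTEGER PRIMARY KEY,
        order_id INTEGER NOT NULL
                 REFERENCES orders(order_id) ON DELETE RESTRICT,
        item     TEXT NOT NULL
    )"""
)
conn.execute("INSERT INTO orders VALUES (1)")
conn.execute("INSERT INTO order_line VALUES (10, 1, 'widget')")


def run(sql):
    """Return True if the statement was accepted, False if a constraint rejected it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# A child cannot be created without a parent (no order 2 exists).
orphan_insert = run("INSERT INTO order_line VALUES (11, 2, 'gadget')")

# A parent cannot be deleted while it still owns child records.
parent_delete = run("DELETE FROM orders WHERE order_id = 1")

# Once the children are removed, the parent can be deleted.
run("DELETE FROM order_line WHERE order_id = 1")
parent_delete_after = run("DELETE FROM orders WHERE order_id = 1")
```

The orphan insert and the first delete are rejected, while the final delete succeeds. The application code contains no integrity checks of its own; it only reacts to the errors raised by the database.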

References

  1. Beynon-Davies, P. (2004). Database Systems, 3rd Edition. Palgrave, Basingstoke, UK. ISBN 1-4039-1601-2.
