ArtsAutosBooksBusinessEducationEntertainmentFamilyFashionFoodGamesGenderHealthHolidaysHomeHubPagesPersonal FinancePetsPoliticsReligionSportsTechnologyTravel

Database normalization

Updated on July 6, 2016


Database Normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency.

The central concept in this discussion is the notion of Functional Dependency (FDs), which depends on the semantics of the Data and which deals with what Information in a relation is dependent on what other information in the relation.

We perform normalization on database to avoid redundancy and associated problems and anomalies that may occur. The data redundancies yield the following anomalies

1. Update Anomaly If one copy of redundant data is updated, an inconsistency is created unless all copies are similarly updated.
2. Insertion Anomaly It may not be possible to store some information unless some other key information is stored as well.
3. Deletion Anomaly It may not be possible to delete some information without loosing some other related information as well.

Normalization decomposes a Relation into two or more smaller relations. Normalization removes the anomalies in the database, but the data that could be retrieved from one relation will require several relations to be joined after normalization.

Functional Dependency

There are several stages of the normalization process. They are called 1NF, 2NF, 3NF, EKNF, BCNF, 4NF, 5NF, DKNF, 6NF. The higher normal forms is cumulative of the lower normal forms; for example, BCNF is always in 3NF which also satisfies 2NF and 1NF.

A function or mapping ƒ from set A to a set B, denoted ƒ:A→B, is a correspondence in which to each element x of A (domain) corresponds exactly one element y=f(x) of B (range). We use the notation x-->y, saying x determines y or y is determined by x. This states that, In a legal relation state (relational instance) of R, for any value of x there relates to only a single definite value of y, this is known as Single Valued Dependency. Functional Dependencies define constraints on the set of legal relations.

A functional dependency is trivial if it is satisfied by all instances of a relation, this means that A --> B holds, whenever B is a subset of A;
e.g. FullName-->FirstName, where FullName=FirstName+LastName; or simply X-->X.

The Right Hand Side is dependent on the Left Hand Side (determinant) of the FD. Sets of functional dependencies may have redundant dependencies that can be inferred from the others. Intuitively, a canonical cover of F is a minimal set of FD equivalent to F+, having no redundant dependencies or redundant parts of dependencies. Clearly F+ is a superset of F.

all about Keys and non-keys

Super Key is the set of attributes which when taken collectively can uniquely identify a tuple in a relation. This means the super-key has the uniqueness property.
A super-key also having the irreducible property is a candidate key, which implies that any proper-subset of the candidate key no longer remains the candidate key (minimal superkey).
One of the candidate key is chosen by the DBA as the principal key called the Primary key, the rest if any are the alternate keys or secondary keys. Primary key enforces the Entity Integrity rule which states that no component of the primary key is allowed to accept NULL values.
Referential Integrity rule states that the database must not contain any unmatched foreign key. A Foreign key in a base relation R2, is a subset of the set of attributes of R2, say FK, such that:
There exists a base relation R1 not necessarily distinct with a candidate key CK and for every value of FK in the current value of R2 is identical to the value of CK in some tuple in the current value of R1.
Foreign keys may be referenced to any candidate keys either the primary key or unique keys.

The attributes which are a part of a key attribute is called the prime attribute, the rest which are not a part of any key are the non-prime attributes.

Determine keys in R with F

We say that an attribute Y is functionally determined by X if X→Y. To test whether a set X is a key, we must compute the set of attributes functionally determined by X. We call the set of all attributes functionally determined by X under a set F of FD the closure of X under F, denoting this by X+. If X+ contains all the attributes in R, or X→R, then X is a superkey in R. The steps in finding the keys in R with a given set of FD is :

Step 1: Draw a dependency graph of F, each vertices corresponds to an attribute, and dependency by directed edges.
Step 2: Identify the set of source-vertices Vn that have no incoming edges, and the set of sink-vertices Vo that have only incoming edges. (Vn≈»indegree=0; Vo≈»outdegree=0)
Step 3: Compute Vn+. A candidate key is a set of attributes that :
1. contains all the attributes in Vn (attributes present only in the LHS of a FD), Vn belongs to the set of prime attributes.
2. contains no attributes in Vo (attributes present only in the RHS of a FD), Vo belongs to the set of nonprime attributes.
3. it is the minimal superkey, confirmed by the closure of remaining attributes in R.

First Normal Form 1NF

A relation is in 1NF if all underlying domains contain (indivisible) atomic values only.

There should not be any duplicate rows in the table
Each cell is single-valued (no repeating groups or arrays)
Entries in a column are of the same kind.
All non-key attributes are dependent on the key attributes

Clearly, if we have defined the key attributes and all the attributes in R are single valued, the relation R satisfies 1NF.

Second Normal Form 2NF

A relation is in 2NF if it is in 1NF and every non-prime (non-key) attribute is fully dependent on the candidate keys of the relation. Clearly, a table is in 2NF if it satisfies 1NF and has no partial dependencies.

In this context, fully dependent means that, no proper-subset of the determinant can any longer determine the RHS of a FD; partial dependencies mean, any proper-subset of the determinant may determine the RHS of the FD.

A relation R which satisfies 1NF is not in 2NF only if a nonprime attribute is functionally dependent on any proper-subset of the composite key in R (partial dependency exists).

Third Normal Form 3NF

A relation R is in 3NF if R is in 2NF and every non-key attribute of R is non-transitively dependent on each candidate key of R. All non-prime attributes in R are required to be directly dependent on every candidate key of the relation.
A relation is in 3NF only if whenever a nontrivial FD A-->B holds in R, either A (LHS) is a candidate key or B (RHS) is a prime attribute (member of any candidate key) of R.

A relation R which is in 2NF is not in 3NF only if there exists any FD, where a non-key attribute functionally determines another non-key attribute in R.

Boyce Codd Normal Form BCNF

A relation is in BCNF if it is in 3NF and every determinant (LHS of a FD) is a candidate key. A relational schema R is in BCNF, if whenever a nontrivial functional dependency holds in R, then the determinant must be a key of R.
BCNF does not allow a non-prime attribute to functionally determine a prime attribute, which was relaxed in 3NF.

A relation in 3NF is not in BCNF, only if all the following satisfies
1. there are more than one candidate keys
2. the keys are composite and overlapping
3. there exists a FD from a non-overlapping attribute of one key to the non overlapping attribute of another key.

MultiValued Dependencies and 4NF

We have worked with single valued dependency; while Multivalued Dependency A-->>B defines a relationship in which a definite set of values of attribute B are determined by a single value of A. This means for any value of A instead of having a single value of B we have a definite set of values for B. Functional Dependencies sometimes are referred to as equality-generating dependencies, and Multivalued Dependencies are referred to as tuple-generating dependencies.
More formally, a MVD X-->>Y is called trivial if either Y is a subset of X or X union Y equals R. It is to be noted that every FD is also an MVD, since A-->B implies A-->{B,Φ}.

In 4NF we consider the non-trivial MVD; A relation R to be in 4NF, if whenever a MVD holds X-->>Y then either the dependency is trivial or X is a candidate key for R.

A relation which is in BCNF is not in 4NF, only if there exists two or more independent multivalued facts (MVD ≈ one-to-many or many-to-many relationships) about an entity. So, the relation has three or more attributes in R, based on the non-trivial multivalued dependency X-->>Y (X multidetermines Y) we decompose R into R1=(X U Y) and R2=(R-Y).

Fifth Normal Form 5NF

5NF sometimes called projection-join normal form (PJNF) where every non-trivial join dependency in the table is implied by the superkeys of the table.

Projection is the process of separating one relation into sub-relations and Join is the process of consolidating these sub-relations back into one relation. Sometimes, this join and projection operation produces spurious tuples because of join dependencies, we get more not less. The key objective of 5NF is to remove the join dependencies.

A Join Dependency JD denoted as JD(R1,R2) is equivalent to the MVD R1∩R2-->>(R1-R2) symmetrically R1∩R2-->>(R2-R1). Conversely, every MVD A-->>B is also a JD *[A U (R-B), A U B].

In 5NF we consider the non-additive join decomposition making sure that Lossless-Join decomposition can be achieved. 5NF does not differ from 4NF unless there exists a symmetric constraint such as a rule about the correspondence between the attributes in R (cyclic dependencies).

Domain/key normal form and 6NF

Domain/Key normal form requires every constraint on the table is a logical consequence of the table's domain constraints and key constraints. DKNF enforces database constraints by checking that each attribute in a tuple is of the appropriate domain and the key constraints are enforced.

DKNF defined by Ronald Fagin, theoretically known as the ultimate normal form; later C.J. Date with Hugh Darwen, and Nikos Lorentzos defined the Sixth Normal form which requires that the table features no non-trivial join dependencies at all (with reference to generalized join operator). This implies that a Relvar R of degree n is in 6NF, iff it is in 5NF and it has no key of degree less than n-1.

Goals of Normalization

Let R be a relation schema with the set F of functional dependencies.

  • each relation scheme should be in a good form.
  • the decomposition is a lossless-join decomposition.
  • Preferably, the decomposition is dependency preserving.

A relation R is said to be lossless decomposition into R1 and R2, if only the natural-join of these sub relations produce the original relation without creating additional spurious tuples.
Let, R be decomposed into R1 and R2 (binary decomposition),
If either R1 ∩ R2 → R1 or R1 ∩ R2 → R2 holds then
the decomposition is loss-less else it is undesired lossy.

It is not necessary nor suggested to normalize our database to the highest normal form possible, I consider 3NF sufficient for most database systems. Normalization has the disadvantages that it may require complex join to query the database and make the programmer task harder in extra coding and effort. Although, we have a solution of using materialized view, we may want to use non-normalized schema for performance. The main idea of normalization is to decompose the relation into a number of single-theme tables removing the database anomalies and problems of data redundancy.

© 2012 Dipankar Basu


Submit a Comment

No comments yet.


This website uses cookies

As a user in the EEA, your approval is needed on a few things. To provide a better website experience, uses cookies (and other similar technologies) and may collect, process, and share personal data. Please choose which areas of our service you consent to our doing so.

For more information on managing or withdrawing consents and how we handle data, visit our Privacy Policy at:

Show Details
HubPages Device IDThis is used to identify particular browsers or devices when the access the service, and is used for security reasons.
LoginThis is necessary to sign in to the HubPages Service.
Google RecaptchaThis is used to prevent bots and spam. (Privacy Policy)
AkismetThis is used to detect comment spam. (Privacy Policy)
HubPages Google AnalyticsThis is used to provide data on traffic to our website, all personally identifyable data is anonymized. (Privacy Policy)
HubPages Traffic PixelThis is used to collect data on traffic to articles and other pages on our site. Unless you are signed in to a HubPages account, all personally identifiable information is anonymized.
Amazon Web ServicesThis is a cloud services platform that we used to host our service. (Privacy Policy)
CloudflareThis is a cloud CDN service that we use to efficiently deliver files required for our service to operate such as javascript, cascading style sheets, images, and videos. (Privacy Policy)
Google Hosted LibrariesJavascript software libraries such as jQuery are loaded at endpoints on the or domains, for performance and efficiency reasons. (Privacy Policy)
Google Custom SearchThis is feature allows you to search the site. (Privacy Policy)
Google MapsSome articles have Google Maps embedded in them. (Privacy Policy)
Google ChartsThis is used to display charts and graphs on articles and the author center. (Privacy Policy)
Google AdSense Host APIThis service allows you to sign up for or associate a Google AdSense account with HubPages, so that you can earn money from ads on your articles. No data is shared unless you engage with this feature. (Privacy Policy)
Google YouTubeSome articles have YouTube videos embedded in them. (Privacy Policy)
VimeoSome articles have Vimeo videos embedded in them. (Privacy Policy)
PaypalThis is used for a registered author who enrolls in the HubPages Earnings program and requests to be paid via PayPal. No data is shared with Paypal unless you engage with this feature. (Privacy Policy)
Facebook LoginYou can use this to streamline signing up for, or signing in to your Hubpages account. No data is shared with Facebook unless you engage with this feature. (Privacy Policy)
MavenThis supports the Maven widget and search functionality. (Privacy Policy)
Google AdSenseThis is an ad network. (Privacy Policy)
Google DoubleClickGoogle provides ad serving technology and runs an ad network. (Privacy Policy)
Index ExchangeThis is an ad network. (Privacy Policy)
SovrnThis is an ad network. (Privacy Policy)
Facebook AdsThis is an ad network. (Privacy Policy)
Amazon Unified Ad MarketplaceThis is an ad network. (Privacy Policy)
AppNexusThis is an ad network. (Privacy Policy)
OpenxThis is an ad network. (Privacy Policy)
Rubicon ProjectThis is an ad network. (Privacy Policy)
TripleLiftThis is an ad network. (Privacy Policy)
Say MediaWe partner with Say Media to deliver ad campaigns on our sites. (Privacy Policy)
Remarketing PixelsWe may use remarketing pixels from advertising networks such as Google AdWords, Bing Ads, and Facebook in order to advertise the HubPages Service to people that have visited our sites.
Conversion Tracking PixelsWe may use conversion tracking pixels from advertising networks such as Google AdWords, Bing Ads, and Facebook in order to identify when an advertisement has successfully resulted in the desired action, such as signing up for the HubPages Service or publishing an article on the HubPages Service.
Author Google AnalyticsThis is used to provide traffic data and reports to the authors of articles on the HubPages Service. (Privacy Policy)
ComscoreComScore is a media measurement and analytics company providing marketing data and analytics to enterprises, media and advertising agencies, and publishers. Non-consent will result in ComScore only processing obfuscated personal data. (Privacy Policy)
Amazon Tracking PixelSome articles display amazon products as part of the Amazon Affiliate program, this pixel provides traffic statistics for those products (Privacy Policy)