Why do we #DataModel at all?

People in the database world take Normalization and Data Modeling as something that should be done without question. I compare it to best practices like versioning software: no one expects anyone to create software without version control anymore. But more and more often I am questioned and challenged on why we need to normalize and model data. Is it even required with the cheap disk space, memory, and server capacity available today?

According to Wikipedia and others, the objective of normalization is:

“Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.”

The rules of Normalization

| Normal form | Brief definition | Violation |
| --- | --- | --- |
| 1NF | A relation is in 1NF if and only if all underlying domains contain scalar values only. | First Normal Form is violated when the table contains repeating groups. |
| 2NF | A relation is in 2NF if and only if it is in 1NF and every non-key attribute is irreducibly dependent on the primary key (every column must depend solely on the primary key). | Second Normal Form is violated when a non-key field is a fact about a subset of a key. |
| 3NF | A relation is in 3NF if and only if it is in 2NF and every non-key attribute is non-transitively dependent on the primary key. | Third Normal Form is violated when a non-key field is a fact about another non-key field. |
| 4NF | A relation R is in 4NF if and only if, whenever there exists a nontrivial multi-valued dependency A ->> B between subsets A and B of the attributes of R, all attributes of R are also functionally dependent on A. | Fourth Normal Form is violated when a table contains two or more independent multi-valued facts about an entity (in addition, the record must satisfy Third Normal Form). |

In relational database theory, second and third normal forms are defined in terms of functional dependencies, which correspond approximately to single-valued facts. A field Y is “functionally dependent” on a field (or fields) X if it is invalid to have two records with the same X-value but different Y-values. That is, a given X-value must always occur with the same Y-value. When X is a key, all fields are by definition functionally dependent on X in a trivial way, since there cannot be two records with the same X-value.
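To make the functional dependency idea concrete, here is a minimal sketch using Python’s built-in sqlite3 module and a hypothetical Customer/City example (the table and column names are my own illustration, not from any real model). In the first design, city_population is a fact about city, a non-key field, so Third Normal Form is violated; the second design keeps that fact in exactly one place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Violates 3NF: city_population depends on city (a non-key field),
# not on the customer identified by the primary key.
conn.execute("""
    CREATE TABLE customer_denormalized (
        customer_id     INTEGER PRIMARY KEY,
        customer_name   TEXT,
        city            TEXT,
        city_population INTEGER   -- a fact about the city, duplicated per customer
    )
""")

# Normalized: the fact about the city lives in exactly one table.
conn.execute("""
    CREATE TABLE city (
        city_name  TEXT PRIMARY KEY,
        population INTEGER
    )
""")
conn.execute("""
    CREATE TABLE customer (
        customer_id   INTEGER PRIMARY KEY,
        customer_name TEXT,
        city_name     TEXT REFERENCES city(city_name)
    )
""")
```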

The Questions

Now that we have reviewed the objectives and rules of normalization, let us summarize. The objective of Normalization is to:

  1. Minimize Redundancy
  2. Minimize Dependency

But what if we have extra storage available and storing redundant copies of data is not a problem? In fact, it probably will speed up our query response time. What if we also don’t require frequent modification of data so that having de-normalized data won’t result in update or deletion anomalies caused by excessive data dependency? Why should we still model our data and normalize?

The three reasons to Data Model

Simplicity, Consistency and Integrity, and Future Flexibility.

Simplicity

Every one of the violations mentioned above requires extra code and extra unit tests to validate proper functioning. Depending on the number of violations, this can add a significant amount of needless technical debt to the software. There is an entire movement dedicated to the elimination of if statements (www.antiifcampaign.com). Software that is Data Driven rather than Condition Driven is simpler and easier to maintain over the life of the application.
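To make the distinction concrete, here is a minimal sketch in Python; the account types and fee amounts are made up for illustration. The same business rule is expressed first as condition-driven code and then as data.

```python
# Condition driven: every new account type means another branch,
# another code change, and another set of unit tests.
def monthly_fee_condition_driven(account_type: str) -> float:
    if account_type == "EQUITY":
        return 10.0
    elif account_type == "FIXED_INCOME":
        return 7.5
    elif account_type == "SAVINGS":
        return 2.0
    else:
        raise ValueError(f"Unknown account type: {account_type}")

# Data driven: the rule lives in data (ultimately a reference table),
# so adding or splitting an account type is a data change, not a code change.
MONTHLY_FEES = {
    "EQUITY": 10.0,
    "FIXED_INCOME": 7.5,
    "SAVINGS": 2.0,
}

def monthly_fee_data_driven(account_type: str) -> float:
    return MONTHLY_FEES[account_type]
```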

An application that is Data Driven can also automate the creation of its test cases to validate proper functioning of the application. Combined with the enhanced simplicity, this greatly adds to the quality of the application.
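For example, a sketch of what data-driven test generation might look like with pytest, again using the made-up fee data from the previous example:

```python
import pytest

# The same reference data that drives the application drives the tests.
MONTHLY_FEES = {"EQUITY": 10.0, "FIXED_INCOME": 7.5, "SAVINGS": 2.0}

def monthly_fee(account_type: str) -> float:
    return MONTHLY_FEES[account_type]

# One parametrized test covers every row of the data; adding an account
# type to the data automatically adds a test case, with no new test code.
@pytest.mark.parametrize("account_type", sorted(MONTHLY_FEES))
def test_every_account_type_has_a_valid_fee(account_type):
    assert monthly_fee(account_type) >= 0
```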

Consistency and Integrity

Even if the solution being modeled can accommodate redundant data and currently has little exposure to update and deletion anomalies, significant risk is being assumed by leaving these potential situations in your data model. How can you ensure that redundant data will be kept in sync and that update and deletion anomalies do not get introduced in the future as people and requirements change? The answer is either additional application code or additional processes and latent knowledge held by resident experts. Neither is a good use of time and energy.

Relying on additional application code is also an application-centric view of the data model. Unfortunately, not all activity on the data model can be guaranteed to go through the application. Data fixes, conversions, and enhancements all have the ability to bypass the application’s business logic and compromise the integrity of the data. All it takes is one high-value client with inaccurate or inconsistent data to irreparably harm a company’s reputation and revenue stream.
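One way to reduce that risk is to declare the integrity rules in the data model itself. A minimal sketch (table and column names are illustrative assumptions) using Python’s built-in sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# Integrity rules declared in the data model, so a direct data fix or
# conversion that bypasses the application still cannot break them.
conn.execute("""
    CREATE TABLE account (
        account_id   INTEGER PRIMARY KEY,
        account_type TEXT NOT NULL CHECK (account_type IN ('EQUITY', 'FIXED_INCOME'))
    )
""")
conn.execute("""
    CREATE TABLE account_transaction (
        transaction_id INTEGER PRIMARY KEY,
        account_id     INTEGER NOT NULL REFERENCES account(account_id),
        amount         NUMERIC NOT NULL
    )
""")

# This "data fix" is rejected at the database layer, with no application code involved.
try:
    conn.execute("INSERT INTO account_transaction VALUES (1, 999, 100.00)")
except sqlite3.IntegrityError as e:
    print("Rejected by the data model:", e)
```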

Future Flexibility

Solutions that are data driven and do not have excessive functional dependencies are much easier to evolve in the future. For example, I may have a business requirement to split one account type or to combine account types. This type of conversion will be quite routine if I have modeled my data properly and minimized dependencies. If not, the conversion can be quite convoluted, and I will probably need to evaluate code before I can determine the implications of making such a change. Then I have to be sure I address and update all the redundant code throughout the application. Just because update and deletion anomalies don’t exist today doesn’t mean they won’t appear in the future.

In addition, splitting or combining account types would probably also require code changes. If the solution were Data Driven, the probability of those code changes would be minimized (not to say they would never be required).
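As a sketch of how routine such a conversion can be when account types live in a reference table (the type codes here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account_type (
        type_code   TEXT PRIMARY KEY,
        description TEXT
    );
    CREATE TABLE account (
        account_id INTEGER PRIMARY KEY,
        type_code  TEXT REFERENCES account_type(type_code)
    );
    INSERT INTO account_type VALUES ('FIXED_INCOME', 'Fixed income account');
    INSERT INTO account VALUES (1, 'FIXED_INCOME'), (2, 'FIXED_INCOME');
""")

# Splitting one account type into two is a routine data conversion:
# add the new type, then move the affected accounts across.
conn.execute("INSERT INTO account_type VALUES ('BOND', 'Bond account')")
conn.execute("UPDATE account SET type_code = 'BOND' WHERE account_id = 2")
```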

Summary

A well designed application and user interface can be used with minimal training. It just makes sense and models the client’s current processes and tasks.

A well designed data model should have the same intuitive qualities. It also makes sense and models the business’s data: not how the application functions, but how the business exists. Modeling the data in this manner minimizes the work required now to use the data and the work required in the future to accommodate change.

In Object Oriented parlance, the Data Model itself should be loosely coupled with high cohesion. Both the Object Model and Data Model should share this characteristic. (Although they will implement it in quite distinct ways)

Object Model and Data Model differences – Embrace the diversity

In my experience there is a distinct difference in Data Models when they are created by developers with an Object Model mindset. Usually some work and negotiation is needed to properly overcome the Object Model/Data Model impedance mismatch. I believe that both extremes lead to less than optimal designs and architectures:

1) Implementing an Object Model directly into a Data Model

2) Implementing a Data Model directly into an Object Model

Both models have a rationale and logic for how they structure components, and sometimes that doesn’t translate well from one to the other. There are good reasons why they differ.

Recently I’ve come up against these differences in a situation involving how each camp sometimes models Super/Subtype tables (and, subsequently, reference tables).

Definitions

I am using the term Super/Subtype tables to refer to a collection of tables that represent a Generic/Specific relationship. Unlike Parent/Child relationships, which represent a hierarchy, Generic/Specific entities are at the same entity level. Perhaps a few examples will assist the discussion:

Parent/Child

Typical Parent/Child relationships represent ‘belonging’ or ‘ownership’. Examples of this type of relationship are:

  • Person/Address
  • Account/Transaction

In these examples, there is a clear understanding of precedence and hierarchy. You obviously need a Person before you can have a related Address. Because there is a Parent/Child relationship and hierarchy, there is also usually a clearer distinction between the types. In many cases this is because the two entities have different cardinality. Even the most fervent object modeller will create separate objects when the entities have different cardinality.
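As a concrete illustration, here is a minimal sketch of the Person/Address relationship (the column names are my own illustrative assumptions), using Python’s built-in sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Parent/Child: a Person can own many Addresses, and an Address cannot
# exist without its owning Person, so the hierarchy (and the differing
# cardinality) is explicit in the model.
conn.executescript("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    CREATE TABLE address (
        address_id INTEGER PRIMARY KEY,
        person_id  INTEGER NOT NULL REFERENCES person(person_id),
        line_1     TEXT NOT NULL,
        city       TEXT
    );
""")
```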

Generic/Specific

Generic/Specific relationships do not represent ‘belonging’ or ‘ownership’, and they typically do not have different cardinality. These relationships represent ‘refinement’, or a type of ‘Master-Detail’ relationship, with one Master entity and multiple Specific entities. To be a Generic/Specific relationship, there must be attributes at the Master level that apply to all Specific types, and attributes at each Specific type that apply only to that Specific type. (This is why the Specific entities exist; if those attributes applied to all Specific types, we would simply keep them at the Master level.)

Usually these relationships are created in the Data Model in one of two ways:

1) Multiple tables – one Master table with multiple Specific detail tables (and a type code on the generic entity to define which related Specific entity applies to each row)

2) One table with nullable fields that are populated based upon the type code of the row

Examples of this type of relationship are:

  • Financial Account – [Equity Account, Fixed Income Account]
  • Coverage – [Maximum, Deductible, Co-insurance]

In these examples, there is not a pronounced hierarchy. You obviously need a Financial Account before you can have a related Equity Account, but they are at the same level. As we mentioned before, these entities in the relationship do not have different cardinality.
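To make the two approaches above concrete, here is a minimal sketch using the Financial Account example and Python’s built-in sqlite3 module; the column names (opened_date, dividend_plan, maturity_date) are illustrative assumptions, not part of any real model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Option 1: one generic table plus one table per Specific type,
# with a type code on the generic row.
conn.executescript("""
    CREATE TABLE financial_account (
        account_id   INTEGER PRIMARY KEY,
        account_type TEXT NOT NULL CHECK (account_type IN ('EQUITY', 'FIXED_INCOME')),
        opened_date  TEXT NOT NULL          -- applies to every account
    );
    CREATE TABLE equity_account (
        account_id    INTEGER PRIMARY KEY REFERENCES financial_account(account_id),
        dividend_plan TEXT                  -- applies only to equity accounts
    );
    CREATE TABLE fixed_income_account (
        account_id    INTEGER PRIMARY KEY REFERENCES financial_account(account_id),
        maturity_date TEXT                  -- applies only to fixed income accounts
    );
""")

# Option 2: a single table with nullable columns populated according to the type code.
conn.execute("""
    CREATE TABLE financial_account_single (
        account_id    INTEGER PRIMARY KEY,
        account_type  TEXT NOT NULL,
        opened_date   TEXT NOT NULL,
        dividend_plan TEXT,     -- NULL unless account_type = 'EQUITY'
        maturity_date TEXT      -- NULL unless account_type = 'FIXED_INCOME'
    )
""")
```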

My Experience

My experience is that Data Models and Object Models are typically the same when we are discussing Parent/Child entities and relationships. Differences of opinion occur with Generic/Specific entities and relationships. Since they do not have different cardinality, it is very easy and natural to represent them as a single object.

  • From a data modeller point of view, this practice creates data entities with embedded data knowledge. No longer does the structure of the data represent the true nature of the data. Depending on how many Specific types and attributes exist, this can cause considerable complexity. The design also breaks the rule of Second Normal Form.
  • From an object modeller point of view, this practice creates an object that can be re-used and leveraged. Rather than needing to create separate objects, I create one object! Isn’t this just Polymorphism?

Isn’t this only a problem depending on what the database is being used for? Since I am a data modeller, I think it is a problem no matter what the database is used for. The issue is just how large a problem it is. It is less of an issue for an application database with a business layer that can incorporate this complexity and shield it from the consumers of the data. It may be a much larger issue if this design is for an Operational Data Store or Reporting Database.
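For completeness, the object modeller’s single polymorphic view of the same example might look like the following sketch; the class and attribute names are carried over from the illustrative schema above and are not from any real model.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class FinancialAccount:
    # Attributes that apply to every account live on the generic class.
    account_id: int
    opened_date: date

@dataclass
class EquityAccount(FinancialAccount):
    # Attributes that apply only to one Specific type live on its subclass.
    dividend_plan: str = ""

@dataclass
class FixedIncomeAccount(FinancialAccount):
    maturity_date: Optional[date] = None
```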

Summary

There are valid reasons why designs and models should differ between objects and tables. This is especially true when analysing Generic/Specific relationships. IMHO.