When Data Modeling goes too far

One thing I have struggled with when I have created Operational Data Stores is the tendency to create generic tables that promote re-use. I find these are usually tables like Address and Person. In an enterprise environment there may be many applications or Systems of Record that store Person or Address information. There is the tendency when we create the objects in an Operational Data Store to store like information in one table.

Why?

In most cases having multiple Address or Person tables don’t violate any rules of Normalization. It is just a habit of organization to try to have only one table for a certain entity. We probably don’t have a requirement to report on all addresses together? Maybe for Person, but not for Address.

So the question is, Is this over-modeling?

Early in my modeling career I would have thought the question ridiculous. But now, I’m not so sure. If I had it to do over, I think I would duplicate some fields on different tables and leave the objects separate . Is there a problem created by having clients and Sales Representatives in separate tables with the same columns? Certainly the Data Model has more tables but it many ways it is way easier to understand and query. One of the main reasons I would do this is how generic entities can complicate querying and reporting. Although it can provide a more aesthetically pleasing data model, the opposite could be said of the queries that are required to pull information out of the database. And if the queries and reports are more complicated, the same could be said of the Extract, Transform, and Load processes to load the data.

Going forward, I think I will resist the urge to combine similar entities when there isn’t a business reason to do so.

How about you?

How to create 10,000 Extract, Transform, and Load automated tests using 4 tables #agile #data

The thing I love about my chosen profession is the ability to learn new things and improve on lessons learned from past projects. Recently I was able to take on a problem that I have experienced on multiple past projects.

“How can we easily create automated tests for a Data Migration or Extract, Transform, and Load application?”

Recently I have been lucky to be on Agile projects where we were able to create a large number of automated tests. I was able to see the huge increase in quality that came with these automated tests. All of these projects were Web Applications developed in either C# or Java.

In my coding days, I was always either a Data Modeler or Database Programmer. I had been on three projects where I was the lead developer on the extract, transform, and load (ETL) application that was responsible for loading data into new enterprise databases. Sadly, we had absolutely no automated tests in any of these projects. As we developed our ETL application, we had to manually test the loads to ensure they were operating as intended. This became especially painful near the end of the project as a small insignificant change resulted in hours and hours of manual retesting. We soon discovered that we were more likely to make a mistake testing the changes than making the actual coding change. This situation was something that was not sustainable.

After working with Agile teams and seeing how they were able to easily create a large number of automated tests, I hoped I would get the chance to try to create them on the next Extract, Transform, and Load project I was on.

The Opportunity

My most recent project allowed me to again be part of a team that recreated the corporate Data Warehouse and also needed to create an entirely new ETL application to load the Data Warehouse. The database technology that was selected was the Microsoft stack. We used SQL Server 2012 as our database engine and SSIS as our technology to create the ETL application. The Data Warehouse we were loading had over 200+ main tables that the ETL process was loading.

One thing I wanted to ensure we did was to create a large number of automated tests for our ETL application. We investigated multiple frameworks that existed, but none of them seemed to allow us to easily create the number of automated tests we wanted. All of them seemed to still require a large amount of test set up and the tests themselves did not adapt easily to changes in the database schema. This had always been a problem in the past as I tried to created automated tests for ETL applications.

Serendipity

In my time as a Data Modeler/DBA I became very good at writing queries to read the Data Dictionary tables of the database itself to generate SQL statements to then be executed. Then was the serendipitous moment – could we also read the Data Dictionary tables to easily generate automated tests for the ETL application?

The answer was yes, with a small number of customized tables that contained the column to column data mapping information. (Since this information was not stored in the Data Dictionary). The Data Dictionary tables we accessed in SQL Server were the tables that are part of the INFORMATION_SCHEMA.

The Solution

Our solution contained the following elements:

  • tSQLt  source framework for the automated testing framework
  • Data Mapping Spreadsheet that defined the Column to Column mapping
  • 4 custom tables that contain information from the Data Mapping Spreadsheet
  • Stored Procedures that read the INFORMATION_SCHEMA and 4 custom tables to automatically generate the tests

4 Custom Tables

schema

ETL Automated Tests

With these four custom tables loaded from the Data Mapping Spreadsheet we created the following Stored Procedures to generate tests:

Table to Table – Every Table

  • TstTableCount: Compares record counts between source data and target data. This will be done on table to table basis. This will be done starting from the target table and comparing to the associated source table or tables.
  • TstTableColumnDistinct: Compares counts on distinct values of columns. This is a valuable technique that points out a variety of possible data errors without doing a full validation on all fields.
  • TstTableColumnNull: Generates a report of all columns where all the contents of a field is all null. This typically can highlight situations where the column was not assigned in the ETL process.

Column – Every Column

  •  TstColumn­DataMapping: Compares columns directly assigned from a source column on a field by field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.
  • TstColumn­ConstantMapping: Compares columns assigned a constant on a field by field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.
  • TstColumn­NullMapping: Compares columns assigned a Null value on a field by field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.
  • TstColumnTransformedMapping: Compares transformed columns on a field by field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.

The Results

By creating these 4 custom tables and stored procedures we are now able to generated 10’s of thousands tests nightly. More importantly, we are able to have these tests be flexible to schema changes as the tests are generated by reading the INFORMATION_SCHEMA  and 4 custom tables. A large part of generating our tests is now data driven.

 

Business Case for creating a Data Model and Data Normalization

I recently came across a great article by @datachick on “Why having a Data Model is important?” on DataVersity. Highly recommended read.

Karen Lopez provides a great list of items on why she finds a Data Model helpful. I do believe there is something missing from the list of items though. All of the items mostly describe reasons from an IT point of view.

Here is Karen’s list on why you should create a data model:

  • Technologies used to move and persist data come in many forms, over time and at the same time. I support the creation of XML Documents, multiple DBMSs, multiple versions of the same DBMS. By having a single logical data model of data requirements, I can separate the rules and definitions of the requirements from their implementation.
  • The same data exists in many platforms, in many locations.  I need to be able to map those sources to targets across platforms and systems.  Doing this mapping, or data lineage, with a data model is much easier.  It also helps me understand the implications of making a change in one system on another system.  I can’t do that by just looking at code.
  • Writing stuff down is often a good way to impress people that have to provide requirements, compliance issues, security requirements, etc. They like not having to answer the same question over and over for different people and roles.  I can’t tell you how many times a business person has asked me if IT ever writes anything down.  Imagine your frustration if you had someone working on your house and ever tradesperson started their day with “Tell me about what sort of house you want? How many people will live here? Do you want to be able go get to the second floor? Do you need a bathroom?  How about a shower?”  That’s how many business professionals see us in IT.
  • A data model is a great way of capturing rules, constraints, definitions in a method that is technology independent. I can capture those things once and reference them in many places.
  • We can measure databases against our data models to assess fit. This can be done for application packages and custom development.
  • Enterprise data is complex.  I work with a data model that has 32,000 objects (tables, columns, datatypes, constraints, etc.) in it.  There is no way I could professionally manage change by just trying to remember all this information.  Nor could anyone else.
  • Modeling helps you ask the right questions before a bunch of time is spent coding, creating screens, reports, etc. This reduces costs and the number of bug fixes required.
  • Like all models, the data model is a communication tool and is good for tying requirements directly to designs and implementations.  This is especially true when I generate data prototypes based on the model.
  • Data governance can’t easily be done via reverse-engineered pictures of databases.
  • Once you’ve worked with a great data model, you can’t go back.

I agree with all of Karen’s points and I’d even like to add a few of my favourite benefits from an IT point of view as well.

  • I find HUGE benefits in being able to abstract domains across my entire data model so that changes to domains in the future can be done easily and ripple through the entire data model. The use of domains also ensures that my design is consistent and coherent across all the hundreds and thousands of tables in my data models.
  • I find HUGE benefits in being able to version my data model alongside the application code so I can revert to any point during the entire project.
  • I find HUGE benefits in helping me think through the problem. The danger with documents is they don’t have to be holistic and complete. You can start a document and describe functionality and relationships but you could forget to describe a couple of relationships without people noticing. But once you transform that description into a diagram, be it a data model or another diagram, orphans and missed relationships become very apparent. I like data models because they help me ensure my solution is correct, consistent, and complete.

The Business Case

One thing I always find fascinating is how hard it is to discuss and gain agreement from clients on why you need a data model and need to have a nicely normalized data model. Everyone in the data world takes this as just a fact, but how can you convince a client that they should spend the months of person-time to create one? Why should they spend tens of thousands of dollars? What is the business benefit?

I believe it comes down to two basic business benefits

1) The Data Model enables communication between IT professionals and the clients using visual means – As we are learning in all areas of the business world, a picture is worth a thousand words. We could create paragraphs upon paragraphs on how we believe the client’s business is structured and we would be wrong every time. We will still be wrong with a data model, but we will be less wrong as clients find it easier to point out the relationships we have misunderstood.

2) Normalization promotes and enables future business flexibility – A normalized database promotes a design that is highly cohesive and loosely coupled. Why does this matter? To the business it matters greatly! It means we can adapt to new realities in the business world quickly and easily. That means quicker time to market and less budget required for new functionality.

  • Need to integrate a new line of business? No Problem.
  • Need to accommodate a new language in the application? No Problem.
  • Need to roll out enhancements to keep up with the competition? No Problem.

Summary

If you agree with the benefits above, a Data Model is just the most efficient way of enabling these benefits. Especially with free tools like Oracle’s SQL Developer Data Modeler. With these free tools, you no longer need a large investment to start using them.

So why do I create a Data Model and normalize my data designs? Because they make good business sense and return benefits far exceeding the cost of creating and maintaining them.

12 Rules to create a Dimensional Model from a Normalized Model in an #agile way

Recently I have been on a project where I have been fortunate enough to develop both a normalized Operational Data Store and a Dimensional Data Warehouse. In addition to this, I have been fortunate to be able to recreate the Dimensional Data Warehouse three times over. This has been because the project has been done in an Agile and iterative manner. I have been amazed how the Dimensional model changed and improved along the way. Although some aspects of the model have remained constant, I would estimate that a good 1/3 of the model changed and improved with each major revision. That by itself is a great testament for why Data Warehouse projects can and should be done in an Agile process.

In the process of having to recreate the Dimensional Model three times, I have followed a process that has proved beneficial and been proven out.

So here is my 12 step program with associated level of difficulty in my humble opinion.

1) Remove tables that you don’t need for analytical reporting or history – Difficulty:Easy

First step is to recognize that not all tables in an Operational Data Store make sense to store in a Data Warehouse. Some tables are truly operational and don’t make sense for analytical reporting or trend analysis. Some tables also don’t require that historical data be kept on them. Be brutal and get rid of these tables at the start. These are the tables that can clutter a Dimensional Model.

2) De-normalize reference tables on master tables – Difficulty:Easy

This next step is also pretty easy. De-normalize your reference tables onto the main tables. Remember to de-normalize shared reference tables to all the main tables that reference them. There is some discussion in the Data Warehouse world about whether you should just store the reference table description and not the reference table codes, but I like to store both as they allow for more efficient reporting and querying.

3) Prefix column names with the table names to assist in data lineage before further de-normalization – Difficulty:Easy

This was a lesson learned from the first time I created a Dimensional model. If you don’t prefix column names with the current table names, it becomes very difficult later to determine where the field initially came from. The second time I created abbreviations for the tables and prefixed the columns with those abbreviations. The third time I prefixed the column names with the full table names. The design this third time is optimized. I know have a self-documenting model that communicates the data lineage automatically. I ended up using a ‘#’ to separate the table name from the column name for each column.

4) Collapse sub-type onto super types – Difficulty:Easy

The last easy step in the process. This is the step where you collapse sub-types onto the super-type. This situation is where you may have a person entity for shared attributes and then have two sub-types like employee and manager for attributes specific for each type of person. It is relatively easy to de-normalize these sub-types attributes back onto the super-type – especially since we have prefixed the column names with the table names in step 3.

5) Duplicate and collapse shared main tables – like address/phone/email/bank accounts – Difficulty:Medium

This step is similar to step 2 except that some main tables are shared like reference tables. These main tables are usually real world objects like addresses, email addresses, phone number, and bank accounts. Usually these result in one main table in a normalized model that is then shared and link to multiple other main tables. Duplicate these shared main tables and de-normalize them onto the main tables that reference them. This is an easy step once you identify these shared main tables, but sometime it does take a little experience to identify them.

6) Remove non-enterprise entities – Difficulty:Easy

This is a second step to weed out tables that may over-complicate the model. After you have collapsed the sub/super types and shared main tables, you may have seen areas of the Dimensional model that highlight non-enterprise entities. If these areas complicate the model, it is good to ask if your Data Warehouse model really requires them.  If you are unsure, remove them for now. You can always add them back if needed.

The example I had was where we had a type of person that was a Client Administration clerk. This person was different enough to require specific tables and attributes, but was not really required for enterprise reporting on claims, premiums, and bills. For this reason, we removed the objects from our Data Warehouse until we know we have a firm requirement for this type of person and these attributes.

7) Categorize your tables as Facts, Dimensions, Bridges, and Outriggers

I found it very helpful to then categorize the tables into four main categories:

  • Facts – Tables that define events or transactions. These tables usually have the attributes you will sum up in reports. Example: Claim,Bill
  • Dimensions – Tables that describe the Facts. These tables usually provide the filter and grouping criteria for your reports. Example: Client, Subscriber
  • Bridges – Tables that resolve many to many links between Dimensions and Facts. Example: Client_Subscriber
  • Outriggers – Tables that represents one to many relationships to Facts and Dimensions where you do not want to collapse them onto the main tables. Example: Claim_request

There is much more theory behind these categories that would be the topic of an entire book. If you are interested, the best book I have found on the subject is Chris Adamson’s Star Schema – highly recommended.

8) Don’t over de-normalize – Difficulty:Hard

I’ve listed this step as hard because it does take some experience to not over de-normalize. One of the goals of a dimensional model is to simplify the data model so there is the tendency to view the objective as having as few tables as humanly possible. While this is somewhat true, you need to remember to let the data define the model and not to force the data into your view of what the model should be.

Sometimes forcing tables together will massively increase the row count or not allow for flexibility if the cardinality of the relationships change in the future. This is something that needs to be balanced. The next step of creating Natural Keys will assist in this analysis.

9) Create Natural Keys – Difficulty:Hard

Create Natural Keys for all tables. This means identifying the keys that will define record uniqueness. This is a critical step to validate your model. The Natural keys for some tables will be one field. This is a great indication that the table is cohesive and defined well. If many fields together define uniqueness, you may want to examine if you over de-normalized the table. It may turn out that you didn’t but this may be a cause of the complicated Natural Key. Sometimes, the data and Natural Keys are just complicated depending on the data.

This step will provide crucial feedback for your design.

10) Make all columns mandatory

These last two steps are important to improve the usability of the Data Warehouse.

First step is to make all the columns mandatory. This will address the bane of all Data Warehouse querying and having to specify the dreaded ‘IS NOT NULL’/’IS NULL’ as part of all the queries in a Data Warehouse.

11) Create dummy records for every Dimension, Bridge, and Outrigger table

Since all columns are now mandatory, you will also need to create a dummy record in every Dimension, Bridge, and Outrigger table. These dummy records are used when an invalid value doesn’t allow the load process to link a record to a valid Dimension, Bridge, or Outrigger. This design addresses the second bane of Data Warehouse querying – the dreaded outer join.

12) Iterate Immediately

As I mentioned, I was able to vastly improve my Dimensional Model on subsequent attempts. I would recommend that you create your model and plan on recreating it immediately after you are finished. You will learn enough during the process, that you will be able to create a better model immediately.

Summary

These rules have helped me to create a Dimensional Model that I have been able to improve and iterate on as the project has evolved. The 13th rule I didn’t mention was to trust your instinct. Trust your intuition if it feels wrong to combine two tables. You are more than likely sensing some issue that will cause problems in the future.

Adaptive Data Model – #Agile or Anathema?

I have seen the concept of an Adaptive Data Model proposed as an Agile method to Data Modelling lately. (Most recently in Ken Collier’s excellent book – “Agile Analytics”) The theory is that you can be more Agile using an  Adaptive Data Model instead of a traditional Data Model of the business domain.

Definition

An Adaptive Data Model is a Data Model that doesn’t model the business data. Rather it is a data model of the data model that models the business data. 🙂   The Adaptive Data Model describes tables that contains the meta-data that describes the data model. In this way, the entire Data Model is data driven and stored in a series of tables. The advantages of this approach is that changes to the model can be made by updating the meta-data in the tables. (as opposed to having to generate Alter statements to update the database structures in the database)

A sample Adaptive Data Model is shown below. (Copyright Ken Collier – Agile Analytics)

ADM

 

A subsequent layer would then need to be created to allow for the data to be extracted in a traditional sense by the application. (and to be made sense of by the business) Some suggestions for this layer have recommended that this layer could be created with a series of views or stored procedures.

Agile?

Although the Adaptive Data Model does allow for the easy modification of the Data Model, is it Agile?

I propose that an Adaptive Data Model is neither Agile or a Data Model.

According to Wikipedia a Data Model is:

“A data model is an abstract model that documents and organizes the business data for communication between team members and is used as a plan for developing applications, specifically how data are stored and accessed.”

An Adaptive Data Model does not meet this definition of a Data Model. It is a construct created to allow changes to be made to a data model.

In addition, I would propose that an Adaptive Data Model is also not Agile. It does not encourage frequent delivery and iterative development. An Adaptive Data Model is a complex solution that is not easily deployed in iterations to deliver value quickly and often to the business. If anything, it increases the technical debt of the project.

Summary

We as data professionals should be striving to make our processes more Agile and to be able to allow our processes to be refined iteratively like other areas of Software Development. Software Development proposed practices that allowed for iterative development. These didn’t include creating an Object Model of the Object model so that they didn’t have to fully embrace adaptive development practices.

An Adaptive Data Model feels like a short cut. It is hard to be Agile and iterative on data project, but trying to propose an Adaptive Data Model as a solution seems like a wrong turn. We need to find ways to allow our data designs to be change tolerant, adaptive, and test driven.

Memories of #ADABAS – My first database love

I recently was watching a DataVersity webinar on MongoDB schema design and I had flashbacks to my first DBA job at Investors Group in the early 90’s. You can find the webinar here.

MongoDB

MongoDB is one of the many NoSQL databases available. When I saw that DataVersity was going to have a webinar on MongoDB and how you define schemas in MongoDB I was very interested in learning about the topic. (It is a great presentation and introduction on the topic if you are interested)

One of the most interesting concepts was that you define documents which are like tables in SQL. These documents contain columns which are just name-value pairs of attributes when the data is stored. These documents can then be joined to other documents in one of two ways. Either they can be embedded into the main document or linked to the main document. This pattern is seen frequently is Object Oriented development but it is one that frequently causes issues in normalized databases.

These two methods also reminded me how I sometime like to organize tables in SQL databases along the same lines. I also like to identify which tables are truly independent (linked candidates) and which tables are dependent(embedded candidates) and need the context of other tables to typically be used. For example, Client and Product can be thought of as independent tables while the related tables of client_address and product_rate are dependent as they really only make sense when context is provided by Client and Product. I try to think of tables in these two ways as this categorization helps when we need to do Dimensional modeling.

I was intrigued with the functionality provided in MongoDB to allow for this embedding of other tables or documents.

But I did have concerns as all of the attributes to be saved in these documents are name-value pairs that are defined in each data command. There is also no consistency that is validated or verified on multiple data commands that operate the same documents. Yipe! Unlike SQL which separates Data Modification Language(DML) from Data Definition Language(DDL), in Mongo DB everything is a DML statement. As a result there is a lot of faith and confidence placed in the application to manage the integrity of the data. Although there is some additional work to define the data structure first and then operate on it, I feel that this structure is beneficial and has value.

Love at First Sight

As I was listening to the Webinar, I though back to my experience with ADABAS in the 90’s. After working with Oracle, Informix, Sybase, and SQL Server – I still think about the functionality I had with ADABAS. Unlike the other relational SQL databases, ADABAS did provide functionality that allowed you to embed tables/files in other tables/files through multi-value fields and periodic groups.(Tables are called Files and Indexes are called Descriptors in ADABAS, but the functionality is the same) Unlike MongoDB, ADABAS did provide functionality that required the creation of the tables/files first with their own DDL language and then modification of the data with their own DML.

In addition, ADABAS was able to provide performance throughput similar to what I have seen in any of the top-flight relational database engines. It certainly was high-performing. Now some of that high performance may have been helped by the lack of referential integrity provided by ADABAS. (That was the one drawback always mentioned when ADABAS was compared to other relational DBMSs) Any referential integrity must be maintained by the applications.

In retrospect, I don’t think I realized what a good DBMS I was using at the time.

I noticed that ADABAS now has a community edition that I am currently downloading… Maybe I will see what the technology looks like today. 🙂

Summary

Although MongoDb looks promising, I think ADABAS should try to inject their name and product into the NoSQL discussion. They have been providing NoSQL databases for over 40 years. They were cool before it was cool to be cool. 🙂

If I’ve peaked your curiosity, here are a couple of ADABAS links:

ADABAS Wikpedia Page

ADABAS Home Page

The state of #Agile according to Data Modellers

DataVersity released their snapshot survey on Data Modelling and the results aren’t good for Data Modellers who want more adoption of Agile by the Data Modelling community.

Before we get into the details, DataVersity is a great source of references and webinars for all things data. You can find them by clicking on this link: DataVersity

The Results

1) How important is it to know how to work with Agile Teams? (this was graded on a scale of 1-10. I’ve tried to group them into categories)

  • 27% thought it was important (1-3 on a scale of 10)
  • 27% were neutral (4-6 on a scale of 10)
  • 46% thought it was not important (7-10 on a scale of 10)

2) How much experience do you have with Agile or Scrum projects?

  • 12% – 10+ projects
  • 9% – 5-10 projects
  • 20% – 3-4 projects
  • 16% – 2 projects
  • 8% – 1 project
  • 33% – no experience

3) If you have been the data architect or data modeller on an Agile project, how satisfied were you with the results?

  • 22% – very satisfied
  • 46% – fairly satisfied
  • 32% – not satisfied

Analysis

What I took from these responses are that the majority of people feel that Agile isn’t that important to them. But I hope there is some light at the end of the tunnel as a good percentage of Data Modellers have not been on an Agile project or have only been on 1 Agile project. (41%) In addition, of the ones that were on Agile projects, 68% reported that they were either very satisfied or fairly satisfied. Those are encouraging numbers. Although some Data Modellers don’t think Agile is important to them, there seems to be a correlation that once they are exposed to the methods, they view them as positive.

Dimensional Modelling

Those of you that have read my blogs have seen my positive comments on Dimensional Modelling. Another concern I had with the results of the survey are the percentage of Data Modellers that feel understanding Dimensional Modelling concepts are not important to them.

  • 28% thought it was important (1-3 on a scale of 10)
  • 34% were neutral (4-6 on a scale of 10)
  • 40% thought it was not important (7-10 on a scale of 10)

Given that all Data Modellers have had to model the data to support reporting, this is a somewhat concerning statistic. Dimensional Modelling is not something new as well, not like XML and No SQL data stores. Shockingly 62% felt that knowledge of XML and No SQL Data Stores are not important to them.

Summary

I think we have a ways to go to encourage the adoption of new processes, methods, and technologies in the Data Modelling and Database professions. Even within our own profession, items like Dimensional Modelling still lack full acceptance.

I would encourage every Data Modeller out there to learn about Agile and Agile methods. As we move into the second decade of Agile, its adoption is increasing and it will affect every type of project in the future. Understanding the methods will help the Data Modeller determine how best to integrate with Agile methods and practices.

I feel that Data Modellers must remember that we are providing a service to projects and the business. If the projects and the business are becoming more Agile, we also must become Agile. If not, the projects and businesses will be drawn towards other solutions and services that do align with their methods. This could possibly drive more projects toward No SQL and XML that are not good candidates for those projects.