#SAP, Breaking Data, and Re-enabling #SQLServer Database Referential Integrity Constraints #Microsoft #FTW

Many times as Data professionals we no longer have full control over the quality of data in the source systems. I am discussing SAP in my example, but I could have easily mentioned PeopleSoft, SalesForce, or a number of other purchased solutions. Usually those solutions are purchased and then we are tasked with maintaining those environments and also extracting data from those environments to be incorporated into a Business Intelligence corporate solution.

Our issue is one somewhat of our own choosing as well. We want to enforce integrity and constraints at a greater level than what was intended and specified in the purchased applications. This may be for a variety of reasons including that the business never specified it as a requirement. It may also be that the purchased application was never built to handle that level of integrity.

To be clear, this isn’t a complaint but more a reflection of reality. We as Data professionals are going to receive data that is not as consistent and complete as we as Data professionals want it to be. (I purposely did not state ‘require’ as there could be a discussion of what is truly required) So what are we to do?

The Problem

Typically we end up extracting data from these purchased applications and load them into a consolidated database. This database can be either a relational or dimensional database. We also typically need to cleanse the data we are loading so load the business can report on the data in a clear and consistent manner.

The challenge is what we do with data that we cannot load in a consistent manner. We really have two options; modify the data or reject it outright. Although there are many types of inconsistent data we may need to correct, I will limit my discussion to data that links tables together. Typically we define Referential Integrity or Foreign Keys constraints to ensure that the data to link tables are valid so that reports and queries return correct results.

Possible Solution

When we have more control over the quality of source systems, I usually see the solution embedded in the Extract, Transform, and Load (ETL) solution that extracts and loads the data into a corporate database.  This is because the data issues will be more known, of lesser frequency, and the data issues are things we can correct ourselves. In this type of solution, the Foreign Key constraints are always enabled and the  ETL solution validates all the data values before trying to insert the data in the database. Any errors that are encountered will result in the data being changed or rejected and an error written to a log file.

There are two majors issue with this approach:

1) Performance – The look-up to validate all Foreign Keys row by row can cause the process to run slower. It can eliminate a performant two step approach where some of the fields can be set in a subsequent SQL Update statement. (Depending on the column’s Nullability) It can also prevent the use of some bulk load methods in SQL Server Integration Services.

2) Availability – If major data issues are encountered, the data issues may prevent the data load from continuing and may affect the availability of the database.

Our Solution

Since we are loading data from multiple external providers, we designed a different solution.

Although we have Foreign Key constraints on the entire database, they will be disabled during the load. (and during the week) We will enable them every Sunday to validate the data loaded has not broken integrity rules. If we find we cannot re-enable any constraint, we will email the Data Team informing them of the offending constraint for investigation. If all Foreign Key constraints can be re-enabled, we will inform the Data Team of the success and disable them again.

We could also do this re-enabling nightly if we start to encountered more frequent data errors.

In this manner, we are in a better position to react to data outside of our control and load the data as quickly as possible.

Our SQL Server Solution

A couple of things to note about our SQL Server solution. Frequently I see the solution to re-enable all constraints use the sp_msforeachtable stored procedure. A sample of how to do this is listed below:

EXEC sp_msforeachtable “ALTER TABLE ? NOCHECK CONSTRAINT all”

This solution is virtually useless you can guarantee all your constraints can be re-enabled without failure. If one constraint fails, it will stop the process. Not good.

To accommodate the ability to re-enable all constraints even when errors are encountered we created our own processes to disable and re-enable our constraints using a cursor.

Here is the disable constraints SQL

DECLARE @disable_sql NVARCHAR(255)

SELECT ROW_NUMBER() OVER (ORDER BY o.[schema_id]) AS RowID,
QUOTENAME(o.name) AS CONSTRAINT_NAME,
QUOTENAME(SCHEMA_NAME(po.[schema_id])) AS FOREIGN_TABLE_SCHEMA,
QUOTENAME(po.name) AS FOREIGN_TABLE_NAME,
QUOTENAME(rccu.COLUMN_NAME) AS FOREIGN_COLUMN_NAME,
QUOTENAME(SCHEMA_NAME(ro.[schema_id])) AS PRIMARY_TABLE_SCHEMA,
QUOTENAME(ro.name) AS PRIMARY_TABLE_NAME,
QUOTENAME(rc.name) AS PRIMARY_COLUMN_NAME,
CASE fk.is_disabled
WHEN 0 THEN ‘CHECK’
ELSE ‘NOCHECK’
END AS [ENABLED]
INTO temp_disable_constraints
FROM sys.foreign_keys AS fk
INNER JOIN sys.objects AS o ON o.[object_id] = fk.[object_id]
INNER JOIN sys.objects AS po ON po.[object_id] = fk.parent_object_id
INNER JOIN sys.objects AS ro ON ro.[object_id] = fk.referenced_object_id
INNER JOIN INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE AS rccu ON rccu.CONSTRAINT_SCHEMA = SCHEMA_NAME(o.[schema_id])
AND rccu.CONSTRAINT_NAME = o.name
AND rccu.TABLE_SCHEMA = SCHEMA_NAME(po.[schema_id])
AND rccu.TABLE_NAME = po.name
INNER JOIN sys.index_columns AS ric ON ric.[object_id] = fk.referenced_object_id
AND ric.index_id = fk.key_index_id
AND ric.is_included_column = 0
INNER JOIN sys.columns AS rc ON rc.[object_id] = fk.referenced_object_id
AND rc.column_id = ric.column_id

DECLARE disable_cursor CURSOR for
SELECT ‘ALTER TABLE ‘ + FOREIGN_TABLE_SCHEMA + ‘.’ + FOREIGN_TABLE_NAME
+ ‘ ‘ + ‘ NOCHECK CONSTRAINT ‘ + CONSTRAINT_NAME
FROM temp_disable_constraints

OPEN disable_cursor
FETCH NEXT FROM disable_cursor INTO @disable_sql

WHILE @@FETCH_STATUS = 0
BEGIN

PRINT @disable_sql

EXEC sp_executesql @disable_sql
FETCH NEXT FROM disable_cursor INTO @disable_sql

END

CLOSE disable_cursor
DEALLOCATE disable_cursor
DROP TABLE temp_disable_constraints

And our re-enable constraint SQL:

DECLARE @enable_sql NVARCHAR(255)

SELECT ROW_NUMBER() OVER (ORDER BY o.[schema_id]) AS RowID,
QUOTENAME(o.name) AS CONSTRAINT_NAME,
QUOTENAME(SCHEMA_NAME(po.[schema_id])) AS FOREIGN_TABLE_SCHEMA,
QUOTENAME(po.name) AS FOREIGN_TABLE_NAME,
QUOTENAME(rccu.COLUMN_NAME) AS FOREIGN_COLUMN_NAME,
QUOTENAME(SCHEMA_NAME(ro.[schema_id])) AS PRIMARY_TABLE_SCHEMA,
QUOTENAME(ro.name) AS PRIMARY_TABLE_NAME,
QUOTENAME(rc.name) AS PRIMARY_COLUMN_NAME,
CASE fk.is_disabled
WHEN 0 THEN ‘CHECK’
ELSE ‘NOCHECK’
END AS [ENABLED]
INTO temp_enable_constraints
FROM sys.foreign_keys AS fk
INNER JOIN sys.objects AS o ON o.[object_id] = fk.[object_id]
INNER JOIN sys.objects AS po ON po.[object_id] = fk.parent_object_id
INNER JOIN sys.objects AS ro ON ro.[object_id] = fk.referenced_object_id
INNER JOIN INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE AS rccu ON rccu.CONSTRAINT_SCHEMA = SCHEMA_NAME(o.[schema_id])
AND rccu.CONSTRAINT_NAME = o.name
AND rccu.TABLE_SCHEMA = SCHEMA_NAME(po.[schema_id])
AND rccu.TABLE_NAME = po.name
INNER JOIN sys.index_columns AS ric ON ric.[object_id] = fk.referenced_object_id
AND ric.index_id = fk.key_index_id
AND ric.is_included_column = 0
INNER JOIN sys.columns AS rc ON rc.[object_id] = fk.referenced_object_id
AND rc.column_id = ric.column_id

DECLARE enable_cursor CURSOR for
SELECT ‘ALTER TABLE ‘ + FOREIGN_TABLE_SCHEMA + ‘.’ + FOREIGN_TABLE_NAME
+ ‘ ‘ + ‘ WITH CHECK CHECK CONSTRAINT ‘ + CONSTRAINT_NAME
FROM temp_enable_constraints

OPEN enable_cursor
FETCH NEXT FROM enable_cursor INTO @enable_sql

WHILE @@FETCH_STATUS = 0
BEGIN

BEGIN TRY
EXEC sp_executesql @enable_sql
END TRY

BEGIN CATCH
PRINT ‘ERROR–>’ + @enable_sql
FETCH NEXT FROM enable_cursor INTO @enable_sql
CONTINUE
END CATCH

FETCH NEXT FROM enable_cursor INTO @enable_sql

END

CLOSE enable_cursor
DEALLOCATE enable_cursor
DROP TABLE temp_enable_constraints

Conclusion

This solution has provided us the flexibility to load our data as efficiently as possible and validate our Foreign Key relationships on a recurring basis. It also minimizes the chance that our load process will stop mid-stream. Did I mentioned this is a key requirements as we are loading data into the Data Warehouse every 60 minutes? 🙂

I was initially concerned with how long it would take to re-enable the constraints, but it only takes 75 minutes to re-enable 616 Foreign Key constraints on a 1.1 Terabyte database. Thanks Microsoft!

Now that we have this process we also plan to use it on large software deployments just to ensure to major data issues were introduced with the deployment as well.

Creating my own #ETL data validation #FTW

Recently on the same project I created an Agile Data Warehouse and Extract, Transform, and Load automated test suite, I was tasked to create a data validation process. We need to create a foundational process that could be leveraged to provide ongoing data validation for the data load process. We were responsible for loading data into the Operational Data Store, but we were the last step in a lengthy process. Previously, the data made hops from Legacy to web applications, to SAP, and then finally to the Operational Data store. Further complicating the issue was that at each step, there was a different data technology used:

  • Legacy – ISAM
  • Web Applications – Sybase
  • SAP – Oracle
  • Operational Data Store – SQL Server 2012

Aggregation is your friend!

What we designed was a process where we created a set of the same aggregation reports at each of the four points along the data migration. Typically people create aggregation reports for the obvious financial transactions with numeric totals. These are reports that will sum the transactions across dates, clients, or accounts. We decided we needed to create aggregate reports across all tables and objects in the databases whether they were transactional or not.

Once we analyzed our Operational Data Store with 380+ tables, we determined that we required 77 aggregate reports to provide coverage across the data model.

For example, the aggregate reports that reported on agreements/contracts were created with the following SQL:

— contracts/agreements by client

SELECT client_number, COUNT(*) AS total
INTO etl_validation_temp.dbo.validation_agreement_count_client
FROM ods.dbo.agreement, ods.dbo.client
WHERE ods.dbo.agreement.client_id = ods.dbo.client.client_id
GROUP BY client_number;

— sum contract/agreement numbers by client

SELECT client_number, sum(coalesce(agreement_number,0)) AS total_agreement_number
INTO etl_validation_temp.dbo.validation_agreement_number_client
FROM ods.dbo.agreement, ods.dbo.client
WHERE ods.dbo.agreement.client_id = ods.dbo.client.client_id
GROUP BY client_number;

— contract/agreement numbers by renewal month

SELECT renewal_month, COUNT(*) AS total
INTO etl_validation_temp.dbo.validation_agreement_count_renewal_month
FROM ods.dbo.agreement
GROUP BY renewal_month;

— sum contract/agreement numbers by renewal month

SELECT renewal_month, sum(coalesce(agreement_number,0)) AS total_agreement_number
INTO etl_validation_temp.dbo.validation_agreement_number_renewal_month
FROM ods.dbo.agreement
GROUP BY renewal_month;

The Process

The most interesting thing about the process was how I was able to create a copy of each report in a separate database. The solution is designed so that the process can run every day and create 77 reports in a ‘etl_validation’ database suffixed with the date. The solution is also configurable so that databases older than 90 days are automatically deleted.

The challenge I had was that the ‘INTO’ statement couldn’t have a variable that named a unique database suffixed with the date. So you will see in the SQL that the reports are created in a “etl_validation_temp” database that is later renamed to a database name suffixed with the date. I thought about creating dynamic SQL where I could have specified the database name in the INTO clause, but I thought that solution would not have been as maintainable.

The Problem

So all was good right? Not so fast. It worked for the first database, but then I ran into a problem when I created the second database. Although I could rename the database, I couldn’t programmatically rename the files the database uses. So then second database couldn’t be created as it tried to use the same file name.

Fair enough…

So I determined I can create the databases initially with the database name suffixed with the date, rename the database to be suffixed with “temp” to allow the creation of the aggregate reports, and then rename it back to the original database name suffixed with the date. After this I thought about whether the dynamic SQL solution was cleaner, but I decided to stick with this one.

Summary

Regardless, the process works and has been running to create these reports in under 20 minutes every day. Not too bad since some of the tables have over a 100 million records and the database is larger than 80 gigs…

Here is the SQL is created to initially create the database:

SELECT ‘started at ‘ + CAST(GETDATE() AS NVARCHAR(30))

DECLARE @currentdate NVARCHAR(20)
DECLARE @newname sysname

SET @currentdate = CONVERT(VARCHAR(20),GETDATE(),112)

set @newname = ‘etl_validation_’ + @currentdate

exec(‘create database ‘ + @newname)

USE master

EXEC(‘ALTER DATABASE ‘ + @newname +’ Modify Name = etl_validation_temp;’)
GO

And to rename it at the end..

DECLARE @currentdate NVARCHAR(20)
DECLARE @newname sysname

SET @currentdate = CONVERT(VARCHAR(20),GETDATE(),112)

set @newname = ‘etl_validation_’ + @currentdate

USE master;

EXEC(‘ALTER DATABASE etl_validation_temp Modify Name = ‘ + @newname + ‘;’)
GO

SELECT ‘ended at ‘ + CAST(GETDATE() AS NVARCHAR(30))
GO

And finally, the SQL to clean up database older than 90 days:

USE master
go

DECLARE @currentdate DATETIME
DECLARE @limitdate DATETIME

DECLARE @currentdatetxt NVARCHAR(20)
DECLARE @limitdatetxt NVARCHAR(20)

DECLARE @maxcounter INTEGER
DECLARE @counter INTEGER

DECLARE @database_name NVARCHAR(80)
DECLARE @sql NVARCHAR(255)

SET @currentdate = GETDATE()
SET @limitdate = @currentdate – 90

SET @currentdatetxt = CONVERT(VARCHAR(20),@currentdate,112)
SET @limitdatetxt = CONVERT(VARCHAR(20),@limitdate,112)

DECLARE @olddbs TABLE
(
id INT IDENTITY(1, 1) PRIMARY KEY WITH FILLFACTOR = 100
,[dbname] nvarchar(255)
)
INSERT INTO @olddbs ( [dbname])
SELECT name FROM sys.databases WHERE name LIKE ‘etl_validation%’ AND create_date < @limitdate

SET @MaxCounter = @@ROWCOUNT

SET @counter = 1
WHILE @counter <= @MaxCounter
BEGIN

——Execution

SELECT @database_name = [dbname]
FROM @olddbs
WHERE id = @counter

SET @sql = ‘drop database ‘ + @database_name + ‘;’

PRINT @sql

exec sp_executesql @SQL

SET @Counter = @counter + 1;

END;
go

 

 

How to create 10,000 Extract, Transform, and Load automated tests using 4 tables #agile #data

The thing I love about my chosen profession is the ability to learn new things and improve on lessons learned from past projects. Recently I was able to take on a problem that I have experienced on multiple past projects.

“How can we easily create automated tests for a Data Migration or Extract, Transform, and Load application?”

Recently I have been lucky to be on Agile projects where we were able to create a large number of automated tests. I was able to see the huge increase in quality that came with these automated tests. All of these projects were Web Applications developed in either C# or Java.

In my coding days, I was always either a Data Modeler or Database Programmer. I had been on three projects where I was the lead developer on the extract, transform, and load (ETL) application that was responsible for loading data into new enterprise databases. Sadly, we had absolutely no automated tests in any of these projects. As we developed our ETL application, we had to manually test the loads to ensure they were operating as intended. This became especially painful near the end of the project as a small insignificant change resulted in hours and hours of manual retesting. We soon discovered that we were more likely to make a mistake testing the changes than making the actual coding change. This situation was something that was not sustainable.

After working with Agile teams and seeing how they were able to easily create a large number of automated tests, I hoped I would get the chance to try to create them on the next Extract, Transform, and Load project I was on.

The Opportunity

My most recent project allowed me to again be part of a team that recreated the corporate Data Warehouse and also needed to create an entirely new ETL application to load the Data Warehouse. The database technology that was selected was the Microsoft stack. We used SQL Server 2012 as our database engine and SSIS as our technology to create the ETL application. The Data Warehouse we were loading had over 200+ main tables that the ETL process was loading.

One thing I wanted to ensure we did was to create a large number of automated tests for our ETL application. We investigated multiple frameworks that existed, but none of them seemed to allow us to easily create the number of automated tests we wanted. All of them seemed to still require a large amount of test set up and the tests themselves did not adapt easily to changes in the database schema. This had always been a problem in the past as I tried to created automated tests for ETL applications.

Serendipity

In my time as a Data Modeler/DBA I became very good at writing queries to read the Data Dictionary tables of the database itself to generate SQL statements to then be executed. Then was the serendipitous moment – could we also read the Data Dictionary tables to easily generate automated tests for the ETL application?

The answer was yes, with a small number of customized tables that contained the column to column data mapping information. (Since this information was not stored in the Data Dictionary). The Data Dictionary tables we accessed in SQL Server were the tables that are part of the INFORMATION_SCHEMA.

The Solution

Our solution contained the following elements:

  • tSQLt  source framework for the automated testing framework
  • Data Mapping Spreadsheet that defined the Column to Column mapping
  • 4 custom tables that contain information from the Data Mapping Spreadsheet
  • Stored Procedures that read the INFORMATION_SCHEMA and 4 custom tables to automatically generate the tests

4 Custom Tables

schema

ETL Automated Tests

With these four custom tables loaded from the Data Mapping Spreadsheet we created the following Stored Procedures to generate tests:

Table to Table – Every Table

  • TstTableCount: Compares record counts between source data and target data. This will be done on table to table basis. This will be done starting from the target table and comparing to the associated source table or tables.
  • TstTableColumnDistinct: Compares counts on distinct values of columns. This is a valuable technique that points out a variety of possible data errors without doing a full validation on all fields.
  • TstTableColumnNull: Generates a report of all columns where all the contents of a field is all null. This typically can highlight situations where the column was not assigned in the ETL process.

Column – Every Column

  •  TstColumn­DataMapping: Compares columns directly assigned from a source column on a field by field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.
  • TstColumn­ConstantMapping: Compares columns assigned a constant on a field by field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.
  • TstColumn­NullMapping: Compares columns assigned a Null value on a field by field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.
  • TstColumnTransformedMapping: Compares transformed columns on a field by field basis for 5-10 rows in the target table. More rows can be selected depending on the risk and complexity of the data transformation.

The Results

By creating these 4 custom tables and stored procedures we are now able to generated 10’s of thousands tests nightly. More importantly, we are able to have these tests be flexible to schema changes as the tests are generated by reading the INFORMATION_SCHEMA  and 4 custom tables. A large part of generating our tests is now data driven.

 

12 Rules to create a Dimensional Model from a Normalized Model in an #agile way

Recently I have been on a project where I have been fortunate enough to develop both a normalized Operational Data Store and a Dimensional Data Warehouse. In addition to this, I have been fortunate to be able to recreate the Dimensional Data Warehouse three times over. This has been because the project has been done in an Agile and iterative manner. I have been amazed how the Dimensional model changed and improved along the way. Although some aspects of the model have remained constant, I would estimate that a good 1/3 of the model changed and improved with each major revision. That by itself is a great testament for why Data Warehouse projects can and should be done in an Agile process.

In the process of having to recreate the Dimensional Model three times, I have followed a process that has proved beneficial and been proven out.

So here is my 12 step program with associated level of difficulty in my humble opinion.

1) Remove tables that you don’t need for analytical reporting or history – Difficulty:Easy

First step is to recognize that not all tables in an Operational Data Store make sense to store in a Data Warehouse. Some tables are truly operational and don’t make sense for analytical reporting or trend analysis. Some tables also don’t require that historical data be kept on them. Be brutal and get rid of these tables at the start. These are the tables that can clutter a Dimensional Model.

2) De-normalize reference tables on master tables – Difficulty:Easy

This next step is also pretty easy. De-normalize your reference tables onto the main tables. Remember to de-normalize shared reference tables to all the main tables that reference them. There is some discussion in the Data Warehouse world about whether you should just store the reference table description and not the reference table codes, but I like to store both as they allow for more efficient reporting and querying.

3) Prefix column names with the table names to assist in data lineage before further de-normalization – Difficulty:Easy

This was a lesson learned from the first time I created a Dimensional model. If you don’t prefix column names with the current table names, it becomes very difficult later to determine where the field initially came from. The second time I created abbreviations for the tables and prefixed the columns with those abbreviations. The third time I prefixed the column names with the full table names. The design this third time is optimized. I know have a self-documenting model that communicates the data lineage automatically. I ended up using a ‘#’ to separate the table name from the column name for each column.

4) Collapse sub-type onto super types – Difficulty:Easy

The last easy step in the process. This is the step where you collapse sub-types onto the super-type. This situation is where you may have a person entity for shared attributes and then have two sub-types like employee and manager for attributes specific for each type of person. It is relatively easy to de-normalize these sub-types attributes back onto the super-type – especially since we have prefixed the column names with the table names in step 3.

5) Duplicate and collapse shared main tables – like address/phone/email/bank accounts – Difficulty:Medium

This step is similar to step 2 except that some main tables are shared like reference tables. These main tables are usually real world objects like addresses, email addresses, phone number, and bank accounts. Usually these result in one main table in a normalized model that is then shared and link to multiple other main tables. Duplicate these shared main tables and de-normalize them onto the main tables that reference them. This is an easy step once you identify these shared main tables, but sometime it does take a little experience to identify them.

6) Remove non-enterprise entities – Difficulty:Easy

This is a second step to weed out tables that may over-complicate the model. After you have collapsed the sub/super types and shared main tables, you may have seen areas of the Dimensional model that highlight non-enterprise entities. If these areas complicate the model, it is good to ask if your Data Warehouse model really requires them.  If you are unsure, remove them for now. You can always add them back if needed.

The example I had was where we had a type of person that was a Client Administration clerk. This person was different enough to require specific tables and attributes, but was not really required for enterprise reporting on claims, premiums, and bills. For this reason, we removed the objects from our Data Warehouse until we know we have a firm requirement for this type of person and these attributes.

7) Categorize your tables as Facts, Dimensions, Bridges, and Outriggers

I found it very helpful to then categorize the tables into four main categories:

  • Facts – Tables that define events or transactions. These tables usually have the attributes you will sum up in reports. Example: Claim,Bill
  • Dimensions – Tables that describe the Facts. These tables usually provide the filter and grouping criteria for your reports. Example: Client, Subscriber
  • Bridges – Tables that resolve many to many links between Dimensions and Facts. Example: Client_Subscriber
  • Outriggers – Tables that represents one to many relationships to Facts and Dimensions where you do not want to collapse them onto the main tables. Example: Claim_request

There is much more theory behind these categories that would be the topic of an entire book. If you are interested, the best book I have found on the subject is Chris Adamson’s Star Schema – highly recommended.

8) Don’t over de-normalize – Difficulty:Hard

I’ve listed this step as hard because it does take some experience to not over de-normalize. One of the goals of a dimensional model is to simplify the data model so there is the tendency to view the objective as having as few tables as humanly possible. While this is somewhat true, you need to remember to let the data define the model and not to force the data into your view of what the model should be.

Sometimes forcing tables together will massively increase the row count or not allow for flexibility if the cardinality of the relationships change in the future. This is something that needs to be balanced. The next step of creating Natural Keys will assist in this analysis.

9) Create Natural Keys – Difficulty:Hard

Create Natural Keys for all tables. This means identifying the keys that will define record uniqueness. This is a critical step to validate your model. The Natural keys for some tables will be one field. This is a great indication that the table is cohesive and defined well. If many fields together define uniqueness, you may want to examine if you over de-normalized the table. It may turn out that you didn’t but this may be a cause of the complicated Natural Key. Sometimes, the data and Natural Keys are just complicated depending on the data.

This step will provide crucial feedback for your design.

10) Make all columns mandatory

These last two steps are important to improve the usability of the Data Warehouse.

First step is to make all the columns mandatory. This will address the bane of all Data Warehouse querying and having to specify the dreaded ‘IS NOT NULL’/’IS NULL’ as part of all the queries in a Data Warehouse.

11) Create dummy records for every Dimension, Bridge, and Outrigger table

Since all columns are now mandatory, you will also need to create a dummy record in every Dimension, Bridge, and Outrigger table. These dummy records are used when an invalid value doesn’t allow the load process to link a record to a valid Dimension, Bridge, or Outrigger. This design addresses the second bane of Data Warehouse querying – the dreaded outer join.

12) Iterate Immediately

As I mentioned, I was able to vastly improve my Dimensional Model on subsequent attempts. I would recommend that you create your model and plan on recreating it immediately after you are finished. You will learn enough during the process, that you will be able to create a better model immediately.

Summary

These rules have helped me to create a Dimensional Model that I have been able to improve and iterate on as the project has evolved. The 13th rule I didn’t mention was to trust your instinct. Trust your intuition if it feels wrong to combine two tables. You are more than likely sensing some issue that will cause problems in the future.

Adaptive Data Model – #Agile or Anathema?

I have seen the concept of an Adaptive Data Model proposed as an Agile method to Data Modelling lately. (Most recently in Ken Collier’s excellent book – “Agile Analytics”) The theory is that you can be more Agile using an  Adaptive Data Model instead of a traditional Data Model of the business domain.

Definition

An Adaptive Data Model is a Data Model that doesn’t model the business data. Rather it is a data model of the data model that models the business data. 🙂   The Adaptive Data Model describes tables that contains the meta-data that describes the data model. In this way, the entire Data Model is data driven and stored in a series of tables. The advantages of this approach is that changes to the model can be made by updating the meta-data in the tables. (as opposed to having to generate Alter statements to update the database structures in the database)

A sample Adaptive Data Model is shown below. (Copyright Ken Collier – Agile Analytics)

ADM

 

A subsequent layer would then need to be created to allow for the data to be extracted in a traditional sense by the application. (and to be made sense of by the business) Some suggestions for this layer have recommended that this layer could be created with a series of views or stored procedures.

Agile?

Although the Adaptive Data Model does allow for the easy modification of the Data Model, is it Agile?

I propose that an Adaptive Data Model is neither Agile or a Data Model.

According to Wikipedia a Data Model is:

“A data model is an abstract model that documents and organizes the business data for communication between team members and is used as a plan for developing applications, specifically how data are stored and accessed.”

An Adaptive Data Model does not meet this definition of a Data Model. It is a construct created to allow changes to be made to a data model.

In addition, I would propose that an Adaptive Data Model is also not Agile. It does not encourage frequent delivery and iterative development. An Adaptive Data Model is a complex solution that is not easily deployed in iterations to deliver value quickly and often to the business. If anything, it increases the technical debt of the project.

Summary

We as data professionals should be striving to make our processes more Agile and to be able to allow our processes to be refined iteratively like other areas of Software Development. Software Development proposed practices that allowed for iterative development. These didn’t include creating an Object Model of the Object model so that they didn’t have to fully embrace adaptive development practices.

An Adaptive Data Model feels like a short cut. It is hard to be Agile and iterative on data project, but trying to propose an Adaptive Data Model as a solution seems like a wrong turn. We need to find ways to allow our data designs to be change tolerant, adaptive, and test driven.

Why #Dimensional Modeling matters

I’ve recently completed a data modeling initiative on a major project. After doing this I’ve come to two major conclusions:

  1. The coverage area in Insurance is probably the most devious and twisted area of data that I have ever modeled.
  2. Dimensional Modeling should be done on every model to ensure you can simply model the domain.

Why Dimensional Model?

For the model I worked on, there are over 261 tables in the Relational Model. In the Dimensional Model, there are only 25. The process to distill a Relational Model to a Dimensional one is not easy, but the process highlighted problems in my Relational Model that I was unaware of. These issues would have been exposed later during testing or post-production, but being able to identify them in development was extremely valuable.

So what is Dimensional Modeling? although most people have been exposed to the rules of Data Normalization, not as many people have been exposed to Dimensional Modeling. I will try to explain my opinion of Dimensional Modeling, but if you are interested I would highly recommend ‘The Star Schema Reference’ by Chris Adamson. Simply put, it is the most complete and concise book ever written on Dimensional Modeling. Dimensional Modeling is usually done to create a model for a Data Warehouse. This Data Warehouse can then be used for Operational and Analytic reporting. But the process of Dimensional Modeling does not need to be limited to a Data Warehouse. The process itself has value to validate the Relational Model.

The term that is commonly used to refer to a Dimensional Model is a ‘Star Schema’ model. This is due to the fact that most Dimensional Models look like a star. (With the Facts in the centre and the Dimensions around the Fact)

Dimensional Modeling

Dimensional Modeling is the process of taking a Relational Model in some normal form and being able to distill the many tables down to the business objects that the business works with. In most cases this would result in single tables that correspond to objects like:

  • Client
  • Product
  • Sales Rep
  • Store
  • Sales Transaction

These business objects fall into two categories:

Facts – These are Facts about the business. They typically correspond to events or transactions and have metrics or measures that the business is concerned about. In many cases these fact tables are what the business wants to report on.

Dimensions – These are Dimensions to the Facts. They typically correspond to business objects that interact with Facts. These Dimensions represent how the business describes these objects and how they like to slice, dice, filter, sort, group, and order the Facts. A standard Dimension that occurs in all Dimensional Models that does not correspond to a business object is Time. Time is a dimension for almost all business events and transactions.

Initially, the act of creating these Facts and Dimensions is a large task of de-normalization. Although this sounds like a simply task, it is anything but. A Relational Model with hundreds of tables must be distilled down to a handful of tables. Many to Many relationships must be transformed into simple relationships that are easy to understand and report against. A natural key must also be defined that uniquely identifies a row in the Fact and Dimension. Although this sounds easy, you may find that some Dimensions don’t actually have a natural key. This is a clue that the Dimensional Model needs more attention.

A Dimensional Model also places the modeling of history at the forefront. A lot of attention is spent ensuring that data can accurately represent the changes that occur over time. Most of the time, a Relational Model is primarily concerned with current state.

Note: This is a gross over-simplification, if this is interesting to you I would recommend the book by Chris Adamson on all of the theory and complexity behind Dimensional Modeling. Or you can just start Googling!

Conclusion

The act of creating a Dimensional Model and creating the 25 Facts and Dimensions highlighted inconsistencies and duplication in my model. It also challenged my understanding of the data domain. It raised questions that I realized I didn’t have the answer for. It is easy for inconsistency and errors to hide in data model with 200+ tables. A data model that has 10-30 tables has nowhere to hide. It has the brutal transparency so valuable in Agile.

#SDEC12 Conference Review #Agile

Well another Software Development and Evolution conference has come and gone. (You’ve always wondered what SDEC stood for didn’t you?) It was a lot of work and effort to make it all happen, but in the end it was very enjoyable. I learned an immense amount and cant’t wait until next year.

My Highlights

      • The Joe Justice/Wikispeed Keynote on day 1 was entertaining and inspiring. If you aren’t familiar with the Joe Justice and Wikispeed story, I highly recommend you doing a search on YouTube or Google. Inspirational stuff on what can be accomplished when you ask why not? WikiSpeed
      • The Luke Hohmann/Innovation Games keynote on day 2 was energizing. I have been a fan of Innovation Games for a long time and it was energizing to hear Luke speak and provide the context on how and why Innovation Games are successful. InnovationGames
      • Adam Yuret @AdamYuret brought Lean Coffee to SDEC12. It was a highlight of mine to attend his session on Lean Coffee and learn how we can have our own Lean Coffee discussions. Although I must admit, I would prefer an afternoon coffee instead of an early morning one.
      • Chris Dagenais @MDChris had a couple of engaging and informative sessions on team building and peer feedback. Great sessions and audience was very engaged and interactive.
      • Lightning Talks made their first appearance at SDEC and were very well attended. There were great talks and tons of practical information compressed in 5 minute chunks.
      • Best presentation I attended was presented by Mark Kulchycki and Alyson Teterenko of Manitoba Hydro International. It was a real life tale from the trenches on how their team evolved and incorporated Agile principles into their PSCAD product development team. Awesome presentation, pragmatic approaches that everyone can use.
      • My personal highlight of the conference was the Innovation Games workshop with Luke Hohmann after the conference. It was an excellent session where Luke not only covered the Innovation Games themselves, but also the science and psychology of the games and the art of facilitation. Probably learned more in one day than I have for a long time.
      • I loved presenting my Agile Data Warehouse talk. I’m hopeful that I can have a follow-up presentation at SDEC13 that illustrates more how we used Innovation Games and show the actual models that were created.
      • It was great being able to just talk and share with everyone at the conference on what worked for them and what people are still struggling with.

Summary

It was a great conference with over 200 attendees. This was a new record for SDEC and caused us to be flexible to modify the lunch process for Day 2 to be more efficient. 🙂

I can’t wait for next year. We are gathering the feedback forms and listening to our Advisory council to assure the content and structure is even better next year! Thanks for your support and see you at SDEC13!