#SAP, Breaking Data, and Re-enabling #SQLServer Database Referential Integrity Constraints #Microsoft #FTW

Many times as Data professionals we no longer have full control over the quality of data in the source systems. I am discussing SAP in my example, but I could have easily mentioned PeopleSoft, Salesforce, or a number of other purchased solutions. Usually those solutions are purchased and then we are tasked with maintaining those environments and extracting data from them to be incorporated into a corporate Business Intelligence solution.

Our issue is somewhat of our own choosing as well. We want to enforce integrity and constraints at a greater level than what was intended and specified in the purchased applications. This may be for a variety of reasons, including that the business never specified it as a requirement. It may also be that the purchased application was never built to handle that level of integrity.

To be clear, this isn’t a complaint but more a reflection of reality. We as Data professionals are going to receive data that is not as consistent and complete as we want it to be. (I purposely did not state ‘require’ as there could be a discussion of what is truly required.) So what are we to do?

The Problem

Typically we end up extracting data from these purchased applications and loading it into a consolidated database. This database can be either a relational or dimensional database. We also typically need to cleanse the data we are loading so the business can report on the data in a clear and consistent manner.

The challenge is what we do with data that we cannot load in a consistent manner. We really have two options: modify the data or reject it outright. Although there are many types of inconsistent data we may need to correct, I will limit my discussion to data that links tables together. Typically we define Referential Integrity (Foreign Key) constraints to ensure that the data that links tables is valid so that reports and queries return correct results.

Possible Solution

When we have more control over the quality of source systems, I usually see the solution embedded in the Extract, Transform, and Load (ETL) solution that extracts and loads the data into a corporate database. This is because the data issues are better known, less frequent, and things we can correct ourselves. In this type of solution, the Foreign Key constraints are always enabled and the ETL solution validates all the data values before trying to insert the data into the database. Any errors that are encountered result in the data being changed or rejected and an error being written to a log file.

There are two major issues with this approach:

1) Performance – The look-up to validate all Foreign Keys row by row can cause the process to run slower. It can eliminate a performant two-step approach where some of the fields are set in a subsequent SQL Update statement (depending on the column’s nullability). It can also prevent the use of some bulk load methods in SQL Server Integration Services.

2) Availability – If major data issues are encountered, the data issues may prevent the data load from continuing and may affect the availability of the database.
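
To illustrate the two-step approach mentioned under Performance, here is a minimal sketch. The table and column names (stg.SalesOrder, dbo.SalesOrder, dbo.Customer, CustomerKey and so on) are hypothetical, and it assumes the Foreign Key column is nullable. It is not the load we actually run.

```sql
-- Hypothetical tables: stg.SalesOrder (staging), dbo.SalesOrder (target), dbo.Customer (look-up).
-- Step 1: load the rows without resolving the Foreign Key (CustomerKey is assumed to be nullable).
INSERT INTO dbo.SalesOrder (OrderNumber, SourceCustomerCode, CustomerKey)
SELECT s.OrderNumber, s.CustomerCode, NULL
FROM stg.SalesOrder AS s;

-- Step 2: resolve the Foreign Key for all rows in one set-based UPDATE.
UPDATE so
SET so.CustomerKey = c.CustomerKey
FROM dbo.SalesOrder AS so
INNER JOIN dbo.Customer AS c ON c.CustomerCode = so.SourceCustomerCode
WHERE so.CustomerKey IS NULL;
```

Rows that still have a NULL CustomerKey after step 2 are the ones with no matching look-up value.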

Our Solution

Since we are loading data from multiple external providers, we designed a different solution.

Although we have Foreign Key constraints on the entire database, they will be disabled during the load (and during the week). We will enable them every Sunday to validate that the data loaded has not broken integrity rules. If we find we cannot re-enable any constraint, we will email the Data Team informing them of the offending constraint for investigation. If all Foreign Key constraints can be re-enabled, we will inform the Data Team of the success and disable them again.

We could also do this re-enabling nightly if we start to encounter more frequent data errors.
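
A rough sketch of the check and notification step could look like the following. It assumes Database Mail is configured and that it runs right after the re-enable pass, before the constraints are disabled again. The profile name and email address are placeholders, not our real settings.

```sql
-- Run after the re-enable pass and before the constraints are disabled again.
DECLARE @db SYSNAME = DB_NAME();

IF EXISTS (SELECT 1 FROM sys.foreign_keys WHERE is_disabled = 1 OR is_not_trusted = 1)
BEGIN
    -- At least one Foreign Key could not be re-enabled: send the list to the team.
    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = N'DataTeamProfile',        -- hypothetical Database Mail profile
        @recipients   = N'data-team@example.com',  -- hypothetical address
        @subject      = N'Foreign Key constraints could not be re-enabled',
        @body         = N'The constraints listed below are still disabled or not trusted.',
        @query        = N'SELECT OBJECT_SCHEMA_NAME(parent_object_id) AS table_schema,
                                 OBJECT_NAME(parent_object_id) AS table_name,
                                 name AS constraint_name
                          FROM sys.foreign_keys
                          WHERE is_disabled = 1 OR is_not_trusted = 1',
        @execute_query_database = @db;
END
ELSE
BEGIN
    -- Every Foreign Key is enabled and trusted: report the success.
    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = N'DataTeamProfile',        -- hypothetical Database Mail profile
        @recipients   = N'data-team@example.com',  -- hypothetical address
        @subject      = N'All Foreign Key constraints re-enabled successfully',
        @body         = N'All Foreign Key constraints were re-enabled and validated.';
END
```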

In this manner, we are in a better position to react to data outside of our control and load the data as quickly as possible.

Our SQL Server Solution

A couple of things to note about our SQL Server solution. Frequently I see the solution to re-enable all constraints use the sp_msforeachtable stored procedure. A sample of how to do this is listed below:

EXEC sp_msforeachtable 'ALTER TABLE ? WITH CHECK CHECK CONSTRAINT ALL'

This solution is virtually useless unless you can guarantee all your constraints can be re-enabled without failure. If one constraint fails, it will stop the process. Not good.

To accommodate the ability to re-enable all constraints even when errors are encountered we created our own processes to disable and re-enable our constraints using a cursor.

Here is the disable constraints SQL:

-- Holds each generated ALTER TABLE statement; NVARCHAR(MAX) avoids truncating long names
DECLARE @disable_sql NVARCHAR(MAX)

-- Snapshot every Foreign Key with its tables and columns into a work table
SELECT ROW_NUMBER() OVER (ORDER BY o.[schema_id]) AS RowID,
       QUOTENAME(o.name) AS CONSTRAINT_NAME,
       QUOTENAME(SCHEMA_NAME(po.[schema_id])) AS FOREIGN_TABLE_SCHEMA,
       QUOTENAME(po.name) AS FOREIGN_TABLE_NAME,
       QUOTENAME(rccu.COLUMN_NAME) AS FOREIGN_COLUMN_NAME,
       QUOTENAME(SCHEMA_NAME(ro.[schema_id])) AS PRIMARY_TABLE_SCHEMA,
       QUOTENAME(ro.name) AS PRIMARY_TABLE_NAME,
       QUOTENAME(rc.name) AS PRIMARY_COLUMN_NAME,
       CASE fk.is_disabled
            WHEN 0 THEN 'CHECK'
            ELSE 'NOCHECK'
       END AS [ENABLED]
INTO temp_disable_constraints
FROM sys.foreign_keys AS fk
INNER JOIN sys.objects AS o ON o.[object_id] = fk.[object_id]
INNER JOIN sys.objects AS po ON po.[object_id] = fk.parent_object_id
INNER JOIN sys.objects AS ro ON ro.[object_id] = fk.referenced_object_id
INNER JOIN INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE AS rccu ON rccu.CONSTRAINT_SCHEMA = SCHEMA_NAME(o.[schema_id])
    AND rccu.CONSTRAINT_NAME = o.name
    AND rccu.TABLE_SCHEMA = SCHEMA_NAME(po.[schema_id])
    AND rccu.TABLE_NAME = po.name
INNER JOIN sys.index_columns AS ric ON ric.[object_id] = fk.referenced_object_id
    AND ric.index_id = fk.key_index_id
    AND ric.is_included_column = 0
INNER JOIN sys.columns AS rc ON rc.[object_id] = fk.referenced_object_id
    AND rc.column_id = ric.column_id

-- One statement per constraint; DISTINCT removes the repeats produced by multi-column keys
DECLARE disable_cursor CURSOR FOR
SELECT DISTINCT 'ALTER TABLE ' + FOREIGN_TABLE_SCHEMA + '.' + FOREIGN_TABLE_NAME
     + ' NOCHECK CONSTRAINT ' + CONSTRAINT_NAME
FROM temp_disable_constraints

OPEN disable_cursor
FETCH NEXT FROM disable_cursor INTO @disable_sql

WHILE @@FETCH_STATUS = 0
BEGIN

    -- Log the statement, then run it
    PRINT @disable_sql

    EXEC sp_executesql @disable_sql
    FETCH NEXT FROM disable_cursor INTO @disable_sql

END

CLOSE disable_cursor
DEALLOCATE disable_cursor
DROP TABLE temp_disable_constraints

And our re-enable constraint SQL:

-- Holds each generated ALTER TABLE statement; NVARCHAR(MAX) avoids truncating long names
DECLARE @enable_sql NVARCHAR(MAX)

-- Snapshot every Foreign Key with its tables and columns into a work table
SELECT ROW_NUMBER() OVER (ORDER BY o.[schema_id]) AS RowID,
       QUOTENAME(o.name) AS CONSTRAINT_NAME,
       QUOTENAME(SCHEMA_NAME(po.[schema_id])) AS FOREIGN_TABLE_SCHEMA,
       QUOTENAME(po.name) AS FOREIGN_TABLE_NAME,
       QUOTENAME(rccu.COLUMN_NAME) AS FOREIGN_COLUMN_NAME,
       QUOTENAME(SCHEMA_NAME(ro.[schema_id])) AS PRIMARY_TABLE_SCHEMA,
       QUOTENAME(ro.name) AS PRIMARY_TABLE_NAME,
       QUOTENAME(rc.name) AS PRIMARY_COLUMN_NAME,
       CASE fk.is_disabled
            WHEN 0 THEN 'CHECK'
            ELSE 'NOCHECK'
       END AS [ENABLED]
INTO temp_enable_constraints
FROM sys.foreign_keys AS fk
INNER JOIN sys.objects AS o ON o.[object_id] = fk.[object_id]
INNER JOIN sys.objects AS po ON po.[object_id] = fk.parent_object_id
INNER JOIN sys.objects AS ro ON ro.[object_id] = fk.referenced_object_id
INNER JOIN INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE AS rccu ON rccu.CONSTRAINT_SCHEMA = SCHEMA_NAME(o.[schema_id])
    AND rccu.CONSTRAINT_NAME = o.name
    AND rccu.TABLE_SCHEMA = SCHEMA_NAME(po.[schema_id])
    AND rccu.TABLE_NAME = po.name
INNER JOIN sys.index_columns AS ric ON ric.[object_id] = fk.referenced_object_id
    AND ric.index_id = fk.key_index_id
    AND ric.is_included_column = 0
INNER JOIN sys.columns AS rc ON rc.[object_id] = fk.referenced_object_id
    AND rc.column_id = ric.column_id

-- WITH CHECK validates the existing rows, so the constraint is trusted once re-enabled
DECLARE enable_cursor CURSOR FOR
SELECT DISTINCT 'ALTER TABLE ' + FOREIGN_TABLE_SCHEMA + '.' + FOREIGN_TABLE_NAME
     + ' WITH CHECK CHECK CONSTRAINT ' + CONSTRAINT_NAME
FROM temp_enable_constraints

OPEN enable_cursor
FETCH NEXT FROM enable_cursor INTO @enable_sql

WHILE @@FETCH_STATUS = 0
BEGIN

    BEGIN TRY
        EXEC sp_executesql @enable_sql
    END TRY

    BEGIN CATCH
        -- Report the failing statement and carry on with the remaining constraints
        PRINT 'ERROR-->' + @enable_sql
    END CATCH

    FETCH NEXT FROM enable_cursor INTO @enable_sql

END

CLOSE enable_cursor
DEALLOCATE enable_cursor
DROP TABLE temp_enable_constraints
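
When a constraint is reported as failing, one way to find the offending rows is DBCC CHECKCONSTRAINTS. This is a sketch only and not part of our scheduled process; the table name below is hypothetical and should be replaced with the table named in the error output.

```sql
-- Hypothetical table name: substitute the table reported in the ERROR--> output.
-- ALL_CONSTRAINTS includes disabled constraints, which is the state a failed
-- Foreign Key is left in after the re-enable attempt.
DBCC CHECKCONSTRAINTS ('dbo.SalesOrder') WITH ALL_CONSTRAINTS;
```

Each row returned names the table, the constraint, and a WHERE clause that identifies the violating data.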

Conclusion

This solution has provided us the flexibility to load our data as efficiently as possible and validate our Foreign Key relationships on a recurring basis. It also minimizes the chance that our load process will stop mid-stream. Did I mention this is a key requirement as we are loading data into the Data Warehouse every 60 minutes? 🙂

I was initially concerned with how long it would take to re-enable the constraints, but it only takes 75 minutes to re-enable 616 Foreign Key constraints on a 1.1 Terabyte database. Thanks Microsoft!

Now that we have this process, we also plan to use it on large software deployments to ensure that no major data issues were introduced with the deployment.

The state of #Agile according to Data Modellers

DataVersity released their snapshot survey on Data Modelling and the results aren’t good for Data Modellers who want more adoption of Agile by the Data Modelling community.

Before we get into the details, DataVersity is a great source of references and webinars for all things data. You can find them on the DataVersity website.

The Results

1) How important is it to know how to work with Agile Teams? (This was graded on a scale of 1-10; I’ve tried to group the responses into categories.)

  • 27% thought it was important (1-3 on a scale of 10)
  • 27% were neutral (4-6 on a scale of 10)
  • 46% thought it was not important (7-10 on a scale of 10)

2) How much experience do you have with Agile or Scrum projects?

  • 12% – 10+ projects
  • 9% – 5-10 projects
  • 20% – 3-4 projects
  • 16% – 2 projects
  • 8% – 1 project
  • 33% – no experience

3) If you have been the data architect or data modeller on an Agile project, how satisfied were you with the results?

  • 22% – very satisfied
  • 46% – fairly satisfied
  • 32% – not satisfied

Analysis

What I took from these responses is that the largest group of respondents feels that Agile isn’t that important to them. But I hope there is some light at the end of the tunnel, as a good percentage of Data Modellers (41%) have not been on an Agile project or have only been on 1 Agile project. In addition, of the ones that were on Agile projects, 68% reported that they were either very satisfied or fairly satisfied. Those are encouraging numbers. Although some Data Modellers don’t think Agile is important to them, there seems to be a correlation: once they are exposed to the methods, they view them as positive.

Dimensional Modelling

Those of you who have read my blogs have seen my positive comments on Dimensional Modelling. Another concern I had with the results of the survey is the percentage of Data Modellers who feel that understanding Dimensional Modelling concepts is not important to them.

  • 28% thought it was important (1-3 on a scale of 10)
  • 34% were neutral (4-6 on a scale of 10)
  • 40% thought it was not important (7-10 on a scale of 10)

Given that all Data Modellers have had to model the data to support reporting, this is a somewhat concerning statistic. Dimensional Modelling is also not something new, unlike XML and NoSQL data stores. Shockingly, 62% felt that knowledge of XML and NoSQL data stores is not important to them.

Summary

I think we have a ways to go to encourage the adoption of new processes, methods, and technologies in the Data Modelling and Database professions. Even within our own profession, items like Dimensional Modelling still lack full acceptance.

I would encourage every Data Modeller out there to learn about Agile and Agile methods. As we move into the second decade of Agile, its adoption is increasing and it will affect every type of project in the future. Understanding the methods will help the Data Modeller determine how best to integrate with Agile methods and practices.

I feel that Data Modellers must remember that we are providing a service to projects and the business. If the projects and the business are becoming more Agile, we also must become Agile. If not, the projects and businesses will be drawn towards other solutions and services that do align with their methods. This could possibly drive more projects toward NoSQL and XML even when they are not good candidates for those technologies.