#SAP, Breaking Data, and Re-enabling #SQLServer Database Referential Integrity Constraints #Microsoft #FTW

Many times as Data professionals we no longer have full control over the quality of data in the source systems. I am discussing SAP in my example, but I could have easily mentioned PeopleSoft, SalesForce, or a number of other purchased solutions. Usually those solutions are purchased and then we are tasked with maintaining those environments and also extracting data from those environments to be incorporated into a Business Intelligence corporate solution.

Our issue is one somewhat of our own choosing as well. We want to enforce integrity and constraints at a greater level than what was intended and specified in the purchased applications. This may be for a variety of reasons including that the business never specified it as a requirement. It may also be that the purchased application was never built to handle that level of integrity.

To be clear, this isn’t a complaint but more a reflection of reality. We as Data professionals are going to receive data that is not as consistent and complete as we as Data professionals want it to be. (I purposely did not state ‘require’ as there could be a discussion of what is truly required) So what are we to do?

The Problem

Typically we end up extracting data from these purchased applications and load them into a consolidated database. This database can be either a relational or dimensional database. We also typically need to cleanse the data we are loading so load the business can report on the data in a clear and consistent manner.

The challenge is what we do with data that we cannot load in a consistent manner. We really have two options; modify the data or reject it outright. Although there are many types of inconsistent data we may need to correct, I will limit my discussion to data that links tables together. Typically we define Referential Integrity or Foreign Keys constraints to ensure that the data to link tables are valid so that reports and queries return correct results.

Possible Solution

When we have more control over the quality of source systems, I usually see the solution embedded in the Extract, Transform, and Load (ETL) solution that extracts and loads the data into a corporate database.  This is because the data issues will be more known, of lesser frequency, and the data issues are things we can correct ourselves. In this type of solution, the Foreign Key constraints are always enabled and the  ETL solution validates all the data values before trying to insert the data in the database. Any errors that are encountered will result in the data being changed or rejected and an error written to a log file.

There are two majors issue with this approach:

1) Performance – The look-up to validate all Foreign Keys row by row can cause the process to run slower. It can eliminate a performant two step approach where some of the fields can be set in a subsequent SQL Update statement. (Depending on the column’s Nullability) It can also prevent the use of some bulk load methods in SQL Server Integration Services.

2) Availability – If major data issues are encountered, the data issues may prevent the data load from continuing and may affect the availability of the database.

Our Solution

Since we are loading data from multiple external providers, we designed a different solution.

Although we have Foreign Key constraints on the entire database, they will be disabled during the load. (and during the week) We will enable them every Sunday to validate the data loaded has not broken integrity rules. If we find we cannot re-enable any constraint, we will email the Data Team informing them of the offending constraint for investigation. If all Foreign Key constraints can be re-enabled, we will inform the Data Team of the success and disable them again.

We could also do this re-enabling nightly if we start to encountered more frequent data errors.

In this manner, we are in a better position to react to data outside of our control and load the data as quickly as possible.

Our SQL Server Solution

A couple of things to note about our SQL Server solution. Frequently I see the solution to re-enable all constraints use the sp_msforeachtable stored procedure. A sample of how to do this is listed below:

EXEC sp_msforeachtable “ALTER TABLE ? NOCHECK CONSTRAINT all”

This solution is virtually useless you can guarantee all your constraints can be re-enabled without failure. If one constraint fails, it will stop the process. Not good.

To accommodate the ability to re-enable all constraints even when errors are encountered we created our own processes to disable and re-enable our constraints using a cursor.

Here is the disable constraints SQL

DECLARE @disable_sql NVARCHAR(255)

SELECT ROW_NUMBER() OVER (ORDER BY o.[schema_id]) AS RowID,
QUOTENAME(o.name) AS CONSTRAINT_NAME,
QUOTENAME(SCHEMA_NAME(po.[schema_id])) AS FOREIGN_TABLE_SCHEMA,
QUOTENAME(po.name) AS FOREIGN_TABLE_NAME,
QUOTENAME(rccu.COLUMN_NAME) AS FOREIGN_COLUMN_NAME,
QUOTENAME(SCHEMA_NAME(ro.[schema_id])) AS PRIMARY_TABLE_SCHEMA,
QUOTENAME(ro.name) AS PRIMARY_TABLE_NAME,
QUOTENAME(rc.name) AS PRIMARY_COLUMN_NAME,
CASE fk.is_disabled
WHEN 0 THEN ‘CHECK’
ELSE ‘NOCHECK’
END AS [ENABLED]
INTO temp_disable_constraints
FROM sys.foreign_keys AS fk
INNER JOIN sys.objects AS o ON o.[object_id] = fk.[object_id]
INNER JOIN sys.objects AS po ON po.[object_id] = fk.parent_object_id
INNER JOIN sys.objects AS ro ON ro.[object_id] = fk.referenced_object_id
INNER JOIN INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE AS rccu ON rccu.CONSTRAINT_SCHEMA = SCHEMA_NAME(o.[schema_id])
AND rccu.CONSTRAINT_NAME = o.name
AND rccu.TABLE_SCHEMA = SCHEMA_NAME(po.[schema_id])
AND rccu.TABLE_NAME = po.name
INNER JOIN sys.index_columns AS ric ON ric.[object_id] = fk.referenced_object_id
AND ric.index_id = fk.key_index_id
AND ric.is_included_column = 0
INNER JOIN sys.columns AS rc ON rc.[object_id] = fk.referenced_object_id
AND rc.column_id = ric.column_id

DECLARE disable_cursor CURSOR for
SELECT ‘ALTER TABLE ‘ + FOREIGN_TABLE_SCHEMA + ‘.’ + FOREIGN_TABLE_NAME
+ ‘ ‘ + ‘ NOCHECK CONSTRAINT ‘ + CONSTRAINT_NAME
FROM temp_disable_constraints

OPEN disable_cursor
FETCH NEXT FROM disable_cursor INTO @disable_sql

WHILE @@FETCH_STATUS = 0
BEGIN

PRINT @disable_sql

EXEC sp_executesql @disable_sql
FETCH NEXT FROM disable_cursor INTO @disable_sql

END

CLOSE disable_cursor
DEALLOCATE disable_cursor
DROP TABLE temp_disable_constraints

And our re-enable constraint SQL:

DECLARE @enable_sql NVARCHAR(255)

SELECT ROW_NUMBER() OVER (ORDER BY o.[schema_id]) AS RowID,
QUOTENAME(o.name) AS CONSTRAINT_NAME,
QUOTENAME(SCHEMA_NAME(po.[schema_id])) AS FOREIGN_TABLE_SCHEMA,
QUOTENAME(po.name) AS FOREIGN_TABLE_NAME,
QUOTENAME(rccu.COLUMN_NAME) AS FOREIGN_COLUMN_NAME,
QUOTENAME(SCHEMA_NAME(ro.[schema_id])) AS PRIMARY_TABLE_SCHEMA,
QUOTENAME(ro.name) AS PRIMARY_TABLE_NAME,
QUOTENAME(rc.name) AS PRIMARY_COLUMN_NAME,
CASE fk.is_disabled
WHEN 0 THEN ‘CHECK’
ELSE ‘NOCHECK’
END AS [ENABLED]
INTO temp_enable_constraints
FROM sys.foreign_keys AS fk
INNER JOIN sys.objects AS o ON o.[object_id] = fk.[object_id]
INNER JOIN sys.objects AS po ON po.[object_id] = fk.parent_object_id
INNER JOIN sys.objects AS ro ON ro.[object_id] = fk.referenced_object_id
INNER JOIN INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE AS rccu ON rccu.CONSTRAINT_SCHEMA = SCHEMA_NAME(o.[schema_id])
AND rccu.CONSTRAINT_NAME = o.name
AND rccu.TABLE_SCHEMA = SCHEMA_NAME(po.[schema_id])
AND rccu.TABLE_NAME = po.name
INNER JOIN sys.index_columns AS ric ON ric.[object_id] = fk.referenced_object_id
AND ric.index_id = fk.key_index_id
AND ric.is_included_column = 0
INNER JOIN sys.columns AS rc ON rc.[object_id] = fk.referenced_object_id
AND rc.column_id = ric.column_id

DECLARE enable_cursor CURSOR for
SELECT ‘ALTER TABLE ‘ + FOREIGN_TABLE_SCHEMA + ‘.’ + FOREIGN_TABLE_NAME
+ ‘ ‘ + ‘ WITH CHECK CHECK CONSTRAINT ‘ + CONSTRAINT_NAME
FROM temp_enable_constraints

OPEN enable_cursor
FETCH NEXT FROM enable_cursor INTO @enable_sql

WHILE @@FETCH_STATUS = 0
BEGIN

BEGIN TRY
EXEC sp_executesql @enable_sql
END TRY

BEGIN CATCH
PRINT ‘ERROR–>’ + @enable_sql
FETCH NEXT FROM enable_cursor INTO @enable_sql
CONTINUE
END CATCH

FETCH NEXT FROM enable_cursor INTO @enable_sql

END

CLOSE enable_cursor
DEALLOCATE enable_cursor
DROP TABLE temp_enable_constraints

Conclusion

This solution has provided us the flexibility to load our data as efficiently as possible and validate our Foreign Key relationships on a recurring basis. It also minimizes the chance that our load process will stop mid-stream. Did I mentioned this is a key requirements as we are loading data into the Data Warehouse every 60 minutes? 🙂

I was initially concerned with how long it would take to re-enable the constraints, but it only takes 75 minutes to re-enable 616 Foreign Key constraints on a 1.1 Terabyte database. Thanks Microsoft!

Now that we have this process we also plan to use it on large software deployments just to ensure to major data issues were introduced with the deployment as well.

#Innovation Debt and the Four Fences of Software Development

I was looking for a new picture for the Blog and I thought about an interesting Blog post. As you can see now, I chose the image of a gate after much searching on various image search engines. (let me tell you there are some very interesting people out there with cameras.) 🙂

Gates and Fences

I chose a gate as I think it is a very interesting analogy that can be used in the Software Development industry. In the Agile community we are so focused on tearing down fences that we have to be careful we don’t use the remnants of the Waterfall Fence to build the Agile Methodology fence. I loved the analogy of a gate in conjunction with the fences. We need to ensure that every fence we build also has at least one gate. The fences exist for the purpose of providing structure and restrictions for predictability, but there always needs to at least one way to break free when the situation calls for it.  (hopefully multiple gates)

I thought of 4 separate fences that are quite common in the Software Development industry. They are:

1. Process Fence : Gate leads to greater value

I’ve alluded to this fence already. As I have mentioned, we in the Agile space need to be extremely careful that we don’t construct an Agile fence out of the broken boards of the Waterfall fence. If we start being equally as stringent and demanding, we are equally doomed to failure. Don’t get me wrong, I think the Agile practices are a great fit for the vast majority of projects and an improved over the Waterfall methodology. But we need to be careful that we don’t start to be overly prescriptive and cookie-cutter. It would be incorrect to say all Software Development projects require pair-programming, two-week iterations, and daily stand ups. Just like it was incorrect to say all Software Development Projects required Functional Specifications, Work Breakdown Structures, and Use Cases. Do these Agile practices fit better than Waterfall practices? Usually. But the team still needs to determine what practices best apply and to what extent.

Sometimes you have to open the gate and incorporate all the different practices that deliver the most value to your client. It is likely that these practices will be from many different methodologies. Can an Agile project benefit from a Work Breakdown Structure? It is possible.

2. Technology/Vendor Fence : Gate leads to better solutions

A second fence we can find ourselves in is this Technology or Vendor fence. This is the fence that we typically build around the technology we use and the vendor for that technology. We typically built these fences for very good reasons. Simply put we are more familiar with the technology we use the most and we there just isn’t enough time to learn all of the technologies that are out there. There are just simply too many. So what can we do?

I think similar to the Agile principle about trying one new thing every iteration, Software Development technical professionals should try one new technology every project. (preferably from different vendors) If the project doesn’t allow for this, then we should as Software Development professionals commit to reading one new book and playing with one new technology in our own time for every new project.

If we don’t do this continuous learning and strive to open the gate in the technology fence, how do we know we are providing the client the best solution? Of course we can’t know all technologies, but isn’t it our professional responsibility to know more than one group of them?

3. People/Employer : Gate leads to enhanced knowledge and competencies

The third fence is the people or employer fence. This fence is very similar to the last fence except that it deals with people instead of technology. It is very natural to again build a natural affinity to the people we primarily work with. But it is also important to realize that one company can’t be perfect in everything. (just like one person can’t be the best at everything) We all have our strengths and weaknesses both individually and corporate-wide.

Some of the most valuable lessons learned I have had over the years has been when I have worked with people from other companies and they have shared with me their practices and methods. Now those of us who have worked for a company for a longer duration obviously believe our company has more strengths than weaknesses. (I know this is something I believe 100% about Protegra.)

That said, I look forward to being able to work with new Protegrans and with new partners and clients because I know I am going to learn new things and be the better for it. Opening the gate in the People and Employee fence is one of the most rewarding.

4. Experience/Safety : Gate leads to innovation

The fourth fence is one we build ourselves and it is something I’ve noticed more in myself as I’ve gained experience. I think sometimes when we have gathered more experience, it is easier to just do what we have done before. Developing using a known process, technology or team is the safe route and something we feel more comfortable with. The decision between introducing new items and doing what has been done before is a fine line as we can’t take on too many new things and risk the project, but if we don’t take on any new items we are building what I like to term Innovation Debt.

Like Technical Debt, it is sitting there and charging interest. Innovation Debt will also needs to be paid sooner or later and it is better to pay it off bit by bit on projects rather that having a large payment at the end. The real problem is that too much Innovation Debt can result in a compromised company that is passed by their competitors. Too much innovation on projects can result in compromised projects. It is a very fine line to walk.

But not opening the Experience gate is actually more damaging that opening it. It is just a little unnerving at first and requires an atmosphere at work that encourages innovation and rewards fast failure.

Summary

Those are the four fences and gates that I try to keep in mind as I go about my projects. Does anyone know of any more?