Zeros, bloody zeros! (Data Vault Series)

Data Warehouse Automation Guide > Data Vault Series  > Zeros, bloody zeros! (Data Vault Series)

Zeros, bloody zeros! (Data Vault Series)


I must admit I have a weakness for British humour.
When I had to cope with leading zeros in business keys some time ago, I spontaneously came up with the title of this post, not knowing that it would serve as such.
For those who do not know, “Meetings, bloody meetings” is a British comedy training film in which John Cleese plays a main role. It was made in 1976, and a remake was made in 2012.
It tells in a funny way what can go wrong at meetings and how you can do better, check it out if you can.

DV-S01E05-meetingsMr John Cleese

But, obviously, this post is not about meetings but about zeros.


I can be short about that: leading zeros in business key values.
For instance a customer number is delivered to the data warehouse as 0001806 (instead of 1806).
This would not be a problem it is would always be delivered exactly like that. But to be honest, you can and will not know that upfront. Even this might be the case now, it might not be in the future.
When other tools are used, leading zeros could suddenly disappear (for instance when a csv file is modified using Excel), or (more rarely) the number of leading zeros could change (01806, 00001806). When this happens you have a problem, because for the data warehouse 01806, 0001806, 00001806 and 1806 are all different business keys! Even if you have only two variants, it is already a problem.
Because every business key gets a different row in the hub, and this customer now exists multiple times!

DV-S01E05-zeros(No acting here, this is how I look sometimes)


If you are familiar with Data Vault, you might already think of same-as-links to solve this.
But I think the solution should be implemented earlier, to avoid having multiple hub rows.
Simply always remove leading zeros when the sourcecolumn is (part of) a business key (either primary or foreign key) and seems a number or ID but is delivered as a string/varchar. In this way 1806 will always be 1806! And I think it is pretty impossible that 001806 and 1806 would refer to two different customers.
Unless, of course, they would come from different source systems. But in that situation, depending on leading zeros would be a bad thing to do, because when then leading zeros dropped off, satellite rows of different customers (in different source systems) could end up as connected to the same hub row! In this situation, in a non-integrated Raw Vault, it would be better to prefix the business key with the source system code and remove the leading zeros, for instance, CRM.1806 and ERP.1806.
In all cases, you can still store the original value (with leading zeros) as an ordinary attribute in a satellite for auditing reasons.

How to implement the solution

There are many ways to remove leading zeros. When I was searching for this I had two requirements:

  • No casting from and to an integer may take place, otherwise all business keys need to be numeric, so this would make the solution less reliable.
  • No function, routine or assembly may be called, this could negatively impact performance. I was looking for an “inline” conversion.

After some research I found an expression that was the same for SQL and SSIS and quite okay (T-SQL version by Robin Hames, my credits for his work), but appeared to change a string with only one or more zeros to an empty string. And because a 0 can have a meaning – and is certainly different from an empty string – this is undesired behavior, IMHO.
So I had to add some logic to it: a SELECT CASE in T-SQL and an inline condition (format {condition} ? {true part} : {false part} ) to the SSIS expression.
Furthermore I came on a different method for T-SQL as well, using the PATINDEX function, which is more compact than the other solution.
For SSIS I still use the ‘Robin Hames’ method, because the PATINDEX function is not available in SSIS Expressions.
So .. this is what it has become:



      WHEN LTRIM(example.[id_with_leading_zeros]) = '' THEN ''
      WHEN PATINDEX( '%[^0 ]%', example.[id_with_leading_zeros]) = 0 THEN '0'
      ELSE SUBSTRING(example.[id_with_leading_zeros], PATINDEX('%[^0 ]%', example.[id_with_leading_zeros]), LEN(example.[id_with_leading_zeros]))
   END AS [id_without_zeros_method1],

      WHEN LTRIM(example.[id_with_leading_zeros]) = '' THEN ''
      WHEN PATINDEX( '%[^0 ]%', example.[id_with_leading_zeros]) = 0 THEN '0'
            REPLACE(LTRIM(example.[id_with_leading_zeros]), ' ', '!#!') -- replace existing spaces with a string that does not occur in the column value, I have chosen '!#!'
            , '0', ' ') -- replace '0' with ' '
            ) -- end of LTRIM to remove leading '0's that have been changed to ' 's
            , ' ', '0') -- change ' ' back to '0'
            , '!#!', ' ') -- change '!#!' back to ' '
   END AS [id_without_zeros_method2]
        TOP 1000 RIGHT('00000000000' + CONVERT(NVARCHAR(12), object_id), 14) AS [id_with_leading_zeros]
    SELECT N' 00000 '
    SELECT N'00'
    SELECT N' '
    SELECT ' 0099990 A '
    SELECT '-5550'
    ) example

SSIS Expression (can be used in Derived Column)

(LTRIM(REPLACE(id_with_leading_zeros,"0", "")) == "" && LTRIM(id_with_leading_zeros) != "") ? "0" : REPLACE(REPLACE(LTRIM(REPLACE(REPLACE(LTRIM(id_with_leading_zeros)," ","!#!"),"0"," "))," ","0"),"!#!"," ")

DV-S01E05-151In a Derived Column Transformation this looks for instance like this

Conclusion / Wrap up

In this post I have motivated why I think you should remove leading zeros from business keys when data is loaded from source systems to a data warehouse.
This post also contains different ways to remove leading zeros, two for T-SQL and one for a SSIS expression.

(c) 2016 – Do not steal the contents – spread the link instead – thank you.

Hans Michiels
No Comments

Post a Comment