Dan Linstedt, the inventor of Data Vault, has written a lot about it: hashkeys.
For instance, one of his latest blog posts:
#datavault 2.0, Hashes, one more time.
I will not list all other sources, as you can use Google yourself.
A few comments on hash keys:
- You need them for scalability. Using sequence numbers is taking the risk that your data warehouse does not scale well later when the amount of data grows.
- They can collide: two different business keys can produce the same hash key. However the chance that this happens is very small. For instance when using SHA-1 (which produces a hash value of 160 bits) you will have a
1 in 1018 chanceon a hash collision when having 1.71 * 1015hash values (read: hub rows) according to this blog post.
- If collisions are unacceptable you need a hash key collision strategy.
Hash collision detection
If you take the possibility of a hash collision seriously, you’ll need a detection strategy. The easiest way to implement this is to make sure that a primary key violation will occur when a hash key collision occurs. When your ETL processes are set up following best practices your processes are restartable after fixing the error, or can be rolled back, so you can continue the load or reload later.
A hash key collision will be detected when you will check in the hub if a business key (single-column or composite) already exists, and a primary key constraint is enabled on the hash key.
For instance the contents of a hub table (simplified example):
Let’s assume that you have a business key C that will also produce the hashkey H1.
The load process checks if C already exist, doesn’t find it, and thus wants to add a row to the hub. But because hashkey H1 already exists, a primary key violation will occur. This is good! Nothing is worse than an undetected hash key collision. If the load process would check for the hashkey H1, business key C would never be added to the hub and the describing context (satellite rows) and links of business key A would all be messed up with rows of business key C, but you would not know!
So detection is essential.
If for some reason your hubs do not have physical primary keys, you need to check afterwards for duplicate hashkeys. But this is the second-best approach, because then data corruption has already taken place. It can be a mess, but using a loading datetime and/or the source data (which is still in staging) you might be able to fix this.
Types of hash key collision strategies
So, Houston, we have a problem and Seattle, we need a solution.
How to deal with hash key collisions? You can deal with them at different levels.
Each level has a different impact on load processes and refactoring impact (the effort needed to change the load process and rehash existing business keys).
When chosing a different hash collision strategy, load impact and refactoring impact are more or less inversely proportional, so when the impact on loading process increases the refactoring impact will usually decrease.
Because of the low chances that a hash collision will ever happen, it often makes sense to choose a strategy that has a low impact on loading (and hence, a medium or high impact on refactoring, which might never be needed anyway).
In an overview:
|Level||Load impact||Refactoring impact|
|Table Level||Low||Low to High|
Implementing a hash key collision on System Level means that you change the hash key computation of the entire system. The advantage is that you need no logic in the load processes, they just use the (changed) hash key computation always.
The disadvantages can be:
- a performance impact: changed way of hashing might be slower on all tables.
- a risk of a high refactoring effort: the need to change a lot of ETL code if no data warehouse automation tool is used, and rehashing all business keys of the entire data warehouse.
In my opinion this feels like using a large ax to cut your lettuce.
Implementing a hash key collision on Table Level means that you change the hash key computation only for the business key of the hub that causes a hash key collision.
The advantages of this approach are:
- this exception can be hardcoded in manually made ETL code or meta data driven in a data warehouse automation tool. It can even be decided at runtime using the hubname as a parameter for the hash key calculation (although I would not recommend the latter because this could have a serious performance impact).
- a performance impact (if any) has only effect on one table.
- the refactoring impact can be Low to Medium: changing ETL code only affects the hash key of one hub and so does the rehashing of business keys.
The disadvantages can be:
- If it is a hub that is used in many links, the refactoring impact might still be Medium to High.
And finally hash key collision mitigation at the row level. This can be done by adding extra tables to the data vault that store business keys that would produce a duplicate hash key; the table row contains an alternate hash key for the business key to be used.
The advantage of this approach is:
- There is no refactoring impact. When a hash key collision occurs, just add the new business key that causes the collision to the exception table with an alternate hash key, and restart the load process.
The disadvantage however will be:
- Makes loading much more complicated: for every hash key computation a lookup in the exception table (which in most cases will be empty) needs to be done. This also can have a performance impact. I must admit I did not try, but would not dare to take the risk that this would not scale.
So my first thought was: no-go, no-go! In my opinion the influence on daily loads is just too big for a very small chance of a hash collision.
Picture credits: © Can Stock Photo / alexskp
Not discussed yet, how would you rehash?
In the Data Vault 2.0 training, upgrading the hashing algoritm when possible (e.g. MD5 to SHA-1) was mentioned. This is a valid choice, but has an impact on the length of your hash key (in the example from 16 to 20 bytes when stored as binary), and this introduces an exception in your table design patterns. Does not have my preference.
Euh .. is there something else I can do?
Yes! Dan Linstedt suggests to reverse the business key value before hashing, and this can work. The length of the hash key will stay the same, all that needs to be changed in the ETL code is the hash key computation for this hub (also when used in links!).
But the possibilities are almost endless! I will use the business key 1234AB as an example:
Besides reversing, you could also:
- duplicate the business key before hashing (1234AB1234AB)
- append the first character (1234AB1)
- prepend or append a fixed string (1234AB#, or #1234AB)
- etcetera etcetera. Just use your imagination and test what works best in terms of performance, code simplicy and uniqueness of the hash value.
Conclusion / Wrap up
In this post I discussed how you can prepare for the unlikely event of a Hash collission in your Data Vault 2.0 data warehouse.
Some things to keep in mind:
- Make sure you will detect a hash key collision when it happens.
- Make a decision on which level you want to deal with a hash key collision when it happens. Your options are: System, Table and Row level. I have a preference for Table level, but System level is also a valid choice. I think Row level is not a wise choice, for reasons described above.
- Know how to rehash the business keys for a hub that had a hash key collision. “do something” (as described above) with the business key before hashing until the hash values for the hub are unique and the performance is acceptable.
© 2017 datawarehouseautomation.guide – Do not steal the contents – spread the link instead – thank you.