Implementing a hash key collision strategy
Dan Linstedt, the inventor of Data Vault, has written a lot about hash keys.
For instance, one of his latest blog posts:
#datavault 2.0, Hashes, one more time.
I will not list all other sources, as you can use Google yourself.
A few comments on hash keys:
- You need them for scalability. Using sequence numbers means taking the risk that your data warehouse will not scale well later, when the amount of data grows.
- They can collide: two different business keys can produce the same hash key. However, the chance that this happens is very small. For instance, when using SHA-1 (which produces a hash value of 160 bits) you will have a 1 in 10^18 chance of a hash collision when you have 1.71 * 10^15 hash values (read: hub rows), according to this blog post.
- If collisions are unacceptable you need a hash key collision strategy.
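To get a feel for these numbers, here is a quick back-of-the-envelope check using the standard birthday-bound approximation (a sketch, not the exact formula, but accurate for probabilities this small):

```python
def collision_probability(n_keys: int, hash_bits: int) -> float:
    """Approximate birthday-bound probability of at least one collision
    among n_keys uniformly distributed hash values: n(n-1) / (2 * 2^bits)."""
    m = 2 ** hash_bits  # number of possible hash values
    return n_keys * (n_keys - 1) / (2 * m)

# ~1.71 * 10^15 hub rows hashed with SHA-1 (160 bits)
p = collision_probability(1_710_000_000_000_000, 160)
print(f"{p:.2e}")  # on the order of 1e-18, matching the figure above
```

This reproduces the 1-in-10^18 figure quoted above for 1.71 * 10^15 hub rows.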
Hash collision detection
If you take the possibility of a hash collision seriously, you’ll need a detection strategy. The easiest way to implement this is to make sure that a primary key violation will occur when a hash key collision occurs. When your ETL processes are set up following best practices your processes are restartable after fixing the error, or can be rolled back, so you can continue the load or reload later.
A hash key collision will be detected when the load process checks in the hub whether a business key (single-column or composite) already exists, and a primary key constraint is enabled on the hash key.
For instance, consider the contents of a hub table (simplified example):
|Hashkey||Business key|
|H1||A|
Let's assume that you have a business key C that will also produce the hash key H1.
The load process checks if C already exists, doesn't find it, and thus wants to add a row to the hub. But because hash key H1 already exists, a primary key violation will occur. This is good! Nothing is worse than an undetected hash key collision. If the load process checked for hash key H1 instead, business key C would never be added to the hub, and the describing context (satellite rows) and links of business key A would all be mixed up with rows of business key C, but you would not know!
So detection is essential.
If for some reason your hubs do not have physical primary keys, you need to check for duplicate hash keys afterwards. But this is the second-best approach, because by then data corruption has already taken place. It can be a mess, but using the load datetime and/or the source data (which is still in staging) you might be able to fix it.
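The detection mechanism can be sketched in a few lines. This is a minimal in-memory stand-in for a hub with a primary key on the hash key, not real ETL code; the point is that a collision surfaces as an error instead of silent corruption:

```python
import hashlib

def hash_key(business_key: str) -> bytes:
    """SHA-1 hash of the business key, used as the hub's primary key."""
    return hashlib.sha1(business_key.encode("utf-8")).digest()

class Hub:
    """In-memory stand-in for a hub table with a PK on the hash key."""
    def __init__(self):
        self.rows = {}  # hash key -> business key

    def load(self, business_key: str) -> None:
        # Existence check on the business key, as in the load process.
        if business_key in self.rows.values():
            return  # already in the hub, nothing to do
        hk = hash_key(business_key)
        # Simulated primary key constraint on the hash key: two different
        # business keys producing the same hash raise an error here,
        # never silently overwrite each other.
        if hk in self.rows:
            raise ValueError(
                f"hash key collision: {business_key!r} vs {self.rows[hk]!r}")
        self.rows[hk] = business_key

hub = Hub()
hub.load("A")
hub.load("A")  # idempotent: business key already present
hub.load("C")  # different key, different hash: loads fine
```

Without the primary key check, a colliding business key would simply be skipped and its satellites attached to the wrong hub row, which is exactly the undetected scenario described above.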
Types of hash key collision strategies
So, Houston, we have a problem and Seattle, we need a solution.
How to deal with hash key collisions? You can deal with them at different levels.
Each level has a different impact on load processes and refactoring impact (the effort needed to change the load process and rehash existing business keys).
When choosing a hash collision strategy, load impact and refactoring impact are more or less inversely proportional: when the impact on the loading process increases, the refactoring impact will usually decrease.
Because of the low chances that a hash collision will ever happen, it often makes sense to choose a strategy that has a low impact on loading (and hence, a medium or high impact on refactoring, which might never be needed anyway).
In an overview:
|Level||Load impact||Refactoring impact|
|System Level||Low||High|
|Table Level||Low||Low to High|
|Row Level||High||None|
Implementing a hash key collision strategy on System Level means that you change the hash key computation of the entire system. The advantage is that you need no extra logic in the load processes; they always just use the (changed) hash key computation.
The disadvantages can be:
- a performance impact: the changed way of hashing might be slower, and it affects all tables.
- a risk of a high refactoring effort: the need to change a lot of ETL code if no data warehouse automation tool is used, and rehashing all business keys of the entire data warehouse.
In my opinion this feels like using a large ax to cut your lettuce.
Implementing a hash key collision strategy on Table Level means that you change the hash key computation only for the business key of the hub that caused the hash key collision.
The advantages of this approach are:
- the exception can be hardcoded in manually written ETL code, or metadata-driven in a data warehouse automation tool. It can even be decided at runtime, using the hub name as a parameter for the hash key calculation (although I would not recommend the latter, because it could have a serious performance impact).
- a performance impact (if any) affects only one table.
- the refactoring impact can be Low to Medium: changing ETL code only affects the hash key of one hub and so does the rehashing of business keys.
The disadvantages can be:
- if the hub is used in many links, the refactoring impact might still be Medium to High.
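A metadata-driven Table Level exception can be as simple as a lookup of hubs that were rehashed. This is an illustrative sketch; the hub names and the choice of "reverse the key" as the alternate computation are assumptions, not prescriptions from the article:

```python
import hashlib

# Hypothetical metadata: hubs that needed an alternate hash computation
# after a collision. Here the alternate computation reverses the key.
REHASHED_HUBS = {"HubCustomer"}

def hub_hash_key(hub_name: str, business_key: str) -> bytes:
    """Compute the hash key, with a per-hub exception for rehashed hubs."""
    key = business_key[::-1] if hub_name in REHASHED_HUBS else business_key
    return hashlib.sha1(key.encode("utf-8")).digest()

# Only HubCustomer uses the changed computation; all other hubs are untouched.
a = hub_hash_key("HubCustomer", "1234AB")
b = hub_hash_key("HubProduct", "1234AB")
assert a != b  # same business key, different hash keys
```

The link load processes for this hub must of course use the same per-hub computation, otherwise hub and link hash keys no longer match.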
And finally, there is hash key collision mitigation at Row Level. This can be done by adding extra tables to the Data Vault that store the business keys that would produce a duplicate hash key; each table row contains an alternate hash key to be used for that business key.
The advantage of this approach is:
- There is no refactoring impact. When a hash key collision occurs, just add the new business key that causes the collision to the exception table with an alternate hash key, and restart the load process.
The disadvantage however will be:
- it makes loading much more complicated: for every hash key computation, a lookup in the exception table (which in most cases will be empty) needs to be done. This can also have a performance impact. I must admit I did not try it, but I would not dare to take the risk that it would not scale.
So my first thought was: no-go, no-go! In my opinion the impact on daily loads is just too big for a very small chance of a hash collision.
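For completeness, the Row Level lookup can be sketched as follows. The exception table is modeled as a dictionary and the alternate hash (hash of the reversed key) is just one possible choice; both are illustrative assumptions:

```python
import hashlib

# Hypothetical exception table: business keys that collided, mapped to a
# manually assigned alternate hash key. Almost always empty.
exception_hub: dict[str, bytes] = {}

def hash_key_with_exceptions(business_key: str) -> bytes:
    """Row Level strategy: look up the exception table first,
    fall back to the normal hash computation."""
    alt = exception_hub.get(business_key)
    if alt is not None:
        return alt
    return hashlib.sha1(business_key.encode("utf-8")).digest()

# After a collision on business key "1234AB", an operator registers an
# alternate hash key (here: hash of the reversed key) and restarts the load.
exception_hub["1234AB"] = hashlib.sha1(b"BA4321").digest()
```

Note that this lookup has to run for every single hash computation, which is exactly the load-time overhead criticized above.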
Picture credits: © Can Stock Photo / alexskp
Not discussed yet: how would you rehash?
In the Data Vault 2.0 training, upgrading the hashing algorithm where possible (e.g. MD5 to SHA-1) was mentioned. This is a valid choice, but it has an impact on the length of your hash key (in the example from 16 to 20 bytes when stored as binary), and this introduces an exception in your table design patterns. It does not have my preference.
Euh .. is there something else I can do?
Yes! Dan Linstedt suggests reversing the business key value before hashing, and this can work. The length of the hash key stays the same; all that needs to be changed in the ETL code is the hash key computation for this hub (also where it is used in links!).
But the possibilities are almost endless! Using the business key 1234AB as an example, besides reversing it you could also:
- duplicate the business key before hashing (1234AB1234AB)
- append the first character (1234AB1)
- prepend or append a fixed string (1234AB#, or #1234AB)
- etcetera, etcetera. Just use your imagination and test what works best in terms of performance, code simplicity and uniqueness of the hash values.
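The variants above are easy to try out. This sketch hashes the example business key 1234AB with each transformation and shows that every one produces a different hash value of the same length:

```python
import hashlib

def sha1(s: str) -> str:
    return hashlib.sha1(s.encode("utf-8")).hexdigest()

bk = "1234AB"
variants = {
    "original":     sha1(bk),
    "reversed":     sha1(bk[::-1]),    # BA4321
    "duplicated":   sha1(bk + bk),     # 1234AB1234AB
    "append first": sha1(bk + bk[0]),  # 1234AB1
    "suffixed":     sha1(bk + "#"),    # 1234AB#
    "prefixed":     sha1("#" + bk),    # #1234AB
}
# Every transformation yields a different hash value, all the same length.
assert len(set(variants.values())) == len(variants)
```

Whatever transformation you pick, apply it consistently in every place the hub's hash key is computed, links included.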
Conclusion / Wrap up
In this post I discussed how you can prepare for the unlikely event of a hash collision in your Data Vault 2.0 data warehouse.
Some things to keep in mind:
- Make sure you will detect a hash key collision when it happens.
- Make a decision on which level you want to deal with a hash key collision when it happens. Your options are: System, Table and Row level. I have a preference for Table level, but System level is also a valid choice. I think Row level is not a wise choice, for reasons described above.
- Know how to rehash the business keys for a hub that had a hash key collision: "do something" (as described above) with the business key before hashing, until the hash values for the hub are unique and the performance is acceptable.
© 2017 datawarehouseautomation.guide – Do not steal the contents – spread the link instead – thank you.
Thank you for such a well informed article, I appreciate the content. Keep up the great work!
Thank you Dan, good to hear that you as the inventor of Data Vault approve it. Best regards, Hans
About the collision chance: Roland Bouman did a good analysis of collision chances in his blog: http://rpbouman.blogspot.nl/2017/05/do-we-still-need-to-talk-about-data.html. I'd refer to it for people who want another kind of analysis on the topic of collision chances.
But good to see some ideas on how to fix this.
We should chat though, the possibilities for resolving hash collision that you mention for rehashing will not result in higher or better uniqueness. And the reverse hash technique you mention is not quite what I discuss in my article, drop me an email and we can discuss. Thanks, dan
Hi Dan, thanks for your view, I will contact you by email.
Hi Hans, nice write up – thanks for sharing.
As Roland pointed out, I believe I explored a lot of the ideas you're describing here in my blog from May 2017. What seems like a novel idea is your notion of the "row-level" solution.
If I understand correctly, in this solution you remember the business keys that caused a collision, storing them in a separate table – essentially creating a second hub. But how then do you load the Satellites?
I mean, if I understand Linstedt correctly, one of the main advantages of using hashes as keys is that you can compute the hash independently, whereas in a DV 1.0 scenario, the Hub would typically get its PK from a sequence, which means you’ll need to load the hub before the satellite (or else you won’t know what key value to store in the satellite to tie the satellite row to its corresponding row in the hub).
It seems to me that once you start using these exception hubs, you have again created a dependency between hub and satellite in that you’ll need to do at least one lookup to the Hub to figure out what hash key to use to tie that particular satellite row to its hub.
It’s possible that I misunderstood your solution, if that is the case, can you then please explain in more detail how you envision the satellite loading process to deal with your proposed solution?
Thanks in advance, and kind regards,
Sorry I missed your comment; I was updating WordPress and only saw it now.
The point of the "exception hub" is that it does not really create a dependency, because a row is only added after a collision, so not during the regular load cycles. So yes, you do a lookup, but you do not have to wait for the exception hub to be loaded. And most of the time it will be empty.
One way to implement it would be to do a LEFT JOIN on the exception hub, using the business keys in the join condition, and then use COALESCE(exceptionhub.Hashkey, HASHBYTES( … hash key calculation here … )) to pick the alternate hash key when one exists.
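The LEFT JOIN plus COALESCE pattern can be sketched outside SQL as well. This minimal Python analogue (the staging rows, key names and placeholder hash are all illustrative) shows how the exception hub's alternate hash key wins when present and the normal computation is used otherwise:

```python
# Staging rows to be hashed; the exception hub holds alternate hash keys
# for business keys that once collided (usually it is empty).
staging = [{"bk": "A"}, {"bk": "C"}]
exception_hub = {"C": "ALT_HASH"}

def compute_hash(bk: str) -> str:
    return "HASH_" + bk  # stands in for the HASHBYTES(...) calculation

# LEFT JOIN on the exception hub + COALESCE on the hash key:
for row in staging:
    row["hashkey"] = exception_hub.get(row["bk"]) or compute_hash(row["bk"])

assert [r["hashkey"] for r in staging] == ["HASH_A", "ALT_HASH"]
```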
Obviously, in a decentralized scenario you have to replicate/copy the exception hub tables to all storage locations, but again, this is only done manually after an error caused by a hash key collision.
Hope all is clear now.