Conflict resolution for hash-backed unique constraint of large text bodies


Question:

So we’ve just run into

ERROR: index row size 2736 exceeds maximum 2712 for index "foo__bar__un"
  Hint: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.

(foo is the table, bar is the column, and un stands for “unique”.)

Consider a function index of an MD5 hash of the value

Alright, let’s.
This is simple enough to do:

CREATE UNIQUE INDEX foo__bar__un ON foo(md5(bar));

So far, so good: this is the solution proposed in the error hint and recommended in numerous places online.

What I haven’t found yet, though, is how to deal with collisions.

In my case, I have a database with results for biochemical experiments. I can’t just go to the scientists and say “can you please change your result a tiny bit so we can actually store it in our database? yes, I understand this may cost a couple of your customers’ lives and you personally a potentially prohibitive amount of money, but I can’t allow you to store it otherwise”.

(Realistically, we’d just drop the constraint way before that, but you get my point.)

I do understand that collisions are almost negligibly rare, but I’d rather know I can deal with them if against all odds one should happen. And given that this isn’t a niche issue only we are having, I’m sure I’m not the only one to have ever wondered about this.

So what is the way to deal with collisions, in this case? I’d rather not have to replace the unique index with a DIY trigger-based workaround, if possible.

Can I somehow shoehorn in an additional “equals” check for entries with the same hash, for example, and only raise a uniqueness violation if the values really are equal?

Answer:

I don’t know why the results of a biochemical experiment would be enforced to be unique in the first place, but you can use an exclusion constraint that tests for equality. It will automatically resolve hash collisions by doing the character-by-character comparison.

-- drop the MD5 index first: the constraint creates an index of the same name
DROP INDEX foo__bar__un;
ALTER TABLE foo ADD CONSTRAINT foo__bar__un EXCLUDE USING hash (bar WITH =);
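To see why this answers the original question: the hash index only narrows the search down to rows whose values hash identically; the `=` operator is then rechecked against the actual stored values, so two different strings that happen to collide would not be rejected. A minimal sketch of the behavior, with hypothetical data (the truncated sequence values are placeholders, not from the question):

```
INSERT INTO foo (bar) VALUES ('ATGGCGTAA...');  -- succeeds

INSERT INTO foo (bar) VALUES ('ATGGCGTAA...');  -- fails:
-- ERROR: conflicting key value violates exclusion constraint "foo__bar__un"
```

Only a genuine duplicate of `bar` triggers the error; a mere hash collision passes the recheck.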

But if you are still on PostgreSQL 9.6, you should upgrade to something newer: among other things, hash indexes only became crash-safe (WAL-logged) in version 10.

If you take the square-root approximation for the birthday problem with 2^128 “days” in the year (that is the number of distinct MD5 hashes), and you ask how many table rows it takes before the collision probability exceeds 0.000000001, you end up with

SELECT sqrt(2 * 2 ^ 128 * 1E-9);  -- ≈ 8.25 × 10^14

So even if you had almost a quadrillion rows, the risk of a collision is so small that I would be more worried about a plane colliding with the data center.
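For readers without a psql prompt handy, the same square-root approximation can be sketched in Python (the function name and parameters are mine, not from the answer):

```python
import math

def rows_for_collision_probability(p, bits=128):
    """Birthday-problem square-root approximation: with 2**bits equally
    likely hash values, the collision probability reaches roughly p
    after about sqrt(2 * 2**bits * p) rows."""
    return math.sqrt(2 * 2**bits * p)

# Rows before an MD5 collision probability reaches one in a billion:
print(f"{rows_for_collision_probability(1e-9):.3e}")  # ≈ 8.250e+14
```

Swapping in `bits=256` shows how much further a SHA-256-based scheme would push the same threshold.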
