When do hash indexes become reasonable?

Posted on

Question :

MySQL natively doesn’t support hash indexes. So, making a pseudo hash column and creating an index on a hash column needs some thinking.

And it seems to be widely understood that if the text field is long then hashes are worth the overhead they require. But how long should text be to start thinking about hash?

If I intend to use 128bit hash then how many characters threshold would be a sufficient minimum for a text column?

Details listed in my another question maybe useful https://stackoverflow.com/questions/7403167/index-on-url-or-hashing-considering-ram

Answer :

Its starts to become reasonable when it starts to save your disk. But at the same time it starts to take your computational power.

Nowadays disk space is considered cheap but if you are using SSD then its not that cheap. As you need exact match then it will not take too much CPU so it may be a better option.

And exact answer of your question about the length of text column to satisfy the use of hashes can vary based on lot of factors starting from DBA’s skill to load on system.

But if you save more than 50% of RAM using hashes and total RAM saving is at least 2GB then I think it maybe worth to take this route otherwise it maybe an unnecessary rehearsal.

I’m not sure what are your reasons for wanting a hash index.

But if you are looking for indexing text/varchar columns then you can go with MyISAM/full-text or you can use something external like Sphinx or Lucene.

You can alternatively, create an additional column with a hashed version of another and as long as its 1000 bytes or less, then it can be indexed by MySQL.

I hope I have answered your question.

Having read your description, I could see how a hash would make sense. However, when you talk about memory use the size of an index column does not make it impossible for searching of a large data set. The basic index technology of mysql is a btree. Btrees divide and conquer.

The specific advice I can give is this: let’s assume you’re going to use an md5() hash. For maximum efficiency you should make the hash the primary key of innodb engine tables where the hash is a char(32). Since innodb has clustered indexes, retrieving the data will save on disk reads, and there is also reduced work that needs to be done when you use a char, as you will be able to do since the result of the md5() hash never changes.

Leave a Reply

Your email address will not be published. Required fields are marked *