Grouping/sorting performance choice between bigint and nvarchar

Question :

I want to store a hash-code for a variable-length text field (max 1000 chars) in a database table. The hash-code will be computed and assigned once on insert, and new rows will be inserted very often.

The hash-code will be used mainly for filtering (WHERE), grouping (GROUP BY), and sorting (ORDER BY) in a couple of queries. The table will grow to a few million rows over time; roughly 30% of rows will share a hash-code with another row (identical text), and the rest will be unique.

I have the choice of making the hash-code data type NVARCHAR (SHA1 of text) or BIGINT (converted bytes of SHA1 of text). I think BIGINT will be better in terms of storage space (fewer pages).

Generally speaking, which of these two data types will be better in terms of performance, considering the operations mentioned above?

Answer :

You want to store the hash using the densest datatype possible.

A bigint is 8 bytes, or 64-bits of data.
An nvarchar storing 8 bytes of character data would take 10 bytes on disk (the extra 2 bytes are variable-length overhead).

But how long are your hashes? SHA-1 hashes are 20 bytes (160 bits) long, so they won't fit in a bigint. Stored naively as an nvarchar (one two-byte character per digest byte, plus 2 bytes of variable-length overhead), each hash takes 42 bytes.

Your best option is to use binary(20), so you store exactly the 20 bytes (160 bits) of digest data.

Are you intending to do the hashing in the database or in client code? According to the documentation, as of SQL Server 2016 all hash algorithms other than SHA2_256 and SHA2_512 are deprecated.

But a cryptographic hash function is probably not the best choice of hash function for you if all you want is a random distribution of rows. There are many hash functions available, producing whatever length key you desire!

If you're able to go the BIGINT route and still accomplish what you need, that would be my preference. It will be more performant for the SQL Server engine to sort, filter, and join on than NVARCHAR, both because it is a numeric type and because the value is smaller. As you mentioned, it also takes up less space, theoretically allowing your table's data pages to load into memory faster, though for only a few million rows that difference is probably negligible.
