Will using bigint vs mediumint have a performance impact?

Question :

I’m considering using mediumint or bigint for an ID column. I’d prefer bigint, because that way I can be sure the very large numbers the client sometimes uses are stored without issue. However, I could technically split the numbers up and use a mediumint instead. I don’t care about storage, but I do care about speed/performance. Is it better to use mediumint than bigint?

This is on an indexed, auto-increment ID column which will be used a lot for SELECT queries.

Answer :

Why does the user know the value of the ID column? This should be abstracted from them. They should be able to identify their data by some other natural key (even if this is only implemented as a unique constraint and you use the surrogate ID as a primary key for performance reasons). If you take away the ID column and you can no longer uniquely identify a row, there is likely an issue with the design.
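For example, a minimal MySQL sketch of that pattern, using a hypothetical invoice table: the client identifies rows by invoice_number (the natural key, protected by a unique constraint), while the narrow auto-increment id serves as the surrogate primary key:

    -- Hypothetical table: the client works with invoice_number,
    -- while the auto-increment id is used internally for joins.
    CREATE TABLE invoice (
        id             BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        invoice_number VARCHAR(32)     NOT NULL,
        amount         DECIMAL(12,2)   NOT NULL,
        PRIMARY KEY (id),
        UNIQUE KEY uq_invoice_number (invoice_number)
    ) ENGINE=InnoDB;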

While you can save some space and improve performance in small ways by choosing a different data type, you need to balance performance with actual business requirements. If you’re going to exceed the bound of a smallint or mediumint then you shouldn’t use either. Don’t make your design complex for the exception (e.g. “splitting up the numbers”) to try and optimize for the general rule. This complexity is unnecessary IMHO and will only lead to problems.

In SQL Server we can take advantage of data compression, which essentially treats a bigint column in the following way:

  • if the number fits in a tinyint (0-255), it takes 1 byte
  • if the number fits in a smallint (-32K -> 32K), it takes 2 bytes
  • if the number fits in an int (-2Bn -> 2Bn), it takes 4 bytes
  • only if the number exceeds the bounds of an int does it take 8 bytes

(This is a simplification, and it describes the worst case: each value is only treated independently when the values on a page are unique and non-sequential, which wouldn’t be the case for an auto-increment column. In addition, page compression uses other methods such as dictionary and prefix compression, which can reduce storage even further.)
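For reference, a short T-SQL sketch of enabling it (the table and index names here are hypothetical):

    -- SQL Server: enable row compression on a hypothetical table.
    ALTER TABLE dbo.Orders REBUILD WITH (DATA_COMPRESSION = ROW);

    -- Page compression adds prefix/dictionary compression on top.
    ALTER INDEX IX_Orders_CustomerId ON dbo.Orders
        REBUILD WITH (DATA_COMPRESSION = PAGE);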

Of course compression trades CPU for I/O: you save raw storage and reduce I/O, but you pay in CPU. Even today most systems are I/O-bound, so this is usually a net benefit, but it’s not a good trade if you’re CPU-bound.

I don’t believe there is similar functionality in MySQL, but I am not certain. If there is you should look into it.

And one final note. You say “I do not care about storage, but I do care about speed/performance.” If you care about speed/performance, then you should care about storage. The more pages it takes to store your data, the harder the engine is going to have to work to read/write those pages, perform seeks, etc. But again, there needs to be balance. You shouldn’t take an extra four weeks over-engineering a solution that is going to save you 10 seconds a week.

As far as I understand your question, you are asking about the type of a primary key column. So why do you care about user input? If it’s a primary key, it should be a surrogate, and the user has no control over what is stored there. The only thing the user cares about is the uniqueness of that column.

Performance itself depends on the size of other columns, fragmentation, typical DML operations, the presence or absence of constraints, indexes, etc. In addition, I believe the ‘pure’ performance of different types differs across OS/CPU architectures.

I would completely ignore any potential performance difference between the various integer types for a primary key and use the type that makes more sense.

If your tables are small, it does not matter. In particular, if all the data and indexes are projected to fit in innodb_buffer_pool_size, then it won’t make much difference. But if the tables are too big to be cached, then…

“Count the disk hits”, or: “Smaller -> more cacheable -> faster”.
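As a rough sanity check, you can compare the combined data-plus-index size of your tables against the buffer pool; a sketch, assuming a hypothetical schema named mydb:

    -- Configured InnoDB buffer pool size, in bytes.
    SELECT @@innodb_buffer_pool_size;

    -- Approximate data + index size per table in the hypothetical schema 'mydb'.
    SELECT table_name,
           ROUND((data_length + index_length) / 1024 / 1024, 1) AS size_mb
    FROM   information_schema.tables
    WHERE  table_schema = 'mydb'
    ORDER  BY size_mb DESC;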

MEDIUMINT is always 3 bytes; BIGINT is always 8.

Also, use UNSIGNED and/or NOT NULL when appropriate.
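For example, a hypothetical table definition following that advice:

    CREATE TABLE widget (
        -- 3 bytes; UNSIGNED range 0 .. 16,777,215
        id   MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
        name VARCHAR(50)        NOT NULL,
        PRIMARY KEY (id)
    ) ENGINE=InnoDB;

MEDIUMINT UNSIGNED tops out at 16,777,215, so this only makes sense if the row count will stay comfortably below that.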

Later versions of InnoDB (and XtraDB) have some compression; I don’t know if it applies to INTs. InfoBright does a significant job of compressing. TokuDB also compresses.
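If you want to experiment, InnoDB table compression is enabled per table; a sketch with a hypothetical table (it requires innodb_file_per_table), and whether it helps an integer-heavy table is something you would have to measure:

    -- InnoDB COMPRESSED row format on a hypothetical table.
    CREATE TABLE events_compressed (
        id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        payload VARCHAR(255)    NOT NULL,
        PRIMARY KEY (id)
    ) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;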

In InnoDB, the PRIMARY KEY is implicitly part of every secondary key. So, that multiplies the storage cost (hence I/O time) for a BIGINT PK. This is also why a VARCHAR is usually not a good idea for an InnoDB PK.
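A sketch illustrating the cost, with hypothetical table and index names:

    CREATE TABLE orders (
        id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        customer_id INT UNSIGNED    NOT NULL,
        created_at  DATETIME        NOT NULL,
        PRIMARY KEY (id),
        -- Each entry in this secondary index also stores the BIGINT primary
        -- key (8 bytes); with a MEDIUMINT PK it would be 3 bytes instead.
        KEY ix_customer_created (customer_id, created_at)
    ) ENGINE=InnoDB;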

If the client’s keys are really huge integers (BIGINTs, UUIDs, GUIDs, MD5s), it might be better to “normalize” them. That is, build a mapping from their over-sized keys into a MEDIUMINT UNSIGNED (or other smallish INT). This adds a lot of coding on your part, but may be worth the effort.
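A sketch of such a mapping table, with hypothetical names:

    -- Map the client's over-sized keys to a small internal surrogate.
    CREATE TABLE client_key_map (
        small_id   MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
        client_key BINARY(16)         NOT NULL,  -- e.g. a UUID or MD5 digest
        PRIMARY KEY (small_id),
        UNIQUE KEY uq_client_key (client_key)
    ) ENGINE=InnoDB;

    -- Other tables then store small_id (3 bytes) instead of the 16-byte key.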

If what you care about is speed/performance, then use bigint.

A disk read fetches a whole block of bytes, so reading 8 bytes costs the same as reading 3 bytes.

But since 3-byte integers are not native to CPUs, a mediumint has to be converted before it can be used (memcpy’d into a 4- or 8-byte integer), which adds overhead, while 8-byte integers are native on 64-bit CPUs and can be used directly.
