char or smallint for a column with numeric codes?


Question :

I have columns in my application whose values always have exactly 4 digits, e.g. ‘0100’ or ‘1230’. One example is the Merchant Category Code.

For the time being, I have set the type of these columns to char(4). I have many columns of this type, and one of them is also included in a nonclustered UNIQUE index. I would like to establish a single approach for this type of column.

Is it better to use char(4), or maybe smallint with the application padding the value by adding ' ' or '0' to the beginning?

The table containing these columns is partitioned and will hold about 300 million records. UPDATEs will probably not be executed against columns of this type, and they will rarely appear in WHERE clauses (perhaps occasionally in ad hoc reports).

Which data type should I choose?

It’s like a zip code: it can be numeric, but you must store the leading zeros.

Exactly. I am trying to decide which approach is better in terms of performance. I also have to take into account that if I store it as smallint, the application code has to convert it to a string and add the leading zeros.

Answer :

The standard clearly says “four-digit code”, which in my view means it should always contain 4 numeric characters, which in turn means the appropriate data type is char(4).

If you want to ensure the code value adheres to ISO 18245, store it as char(4) in the database along with a check constraint.
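
For example, a minimal sketch of such a constraint, using a hypothetical table and column name (dbo.Transactions, MerchantCategoryCode), could look like this:

CREATE TABLE dbo.Transactions
(
    TransactionId bigint NOT NULL,
    -- exactly four numeric characters, so leading zeros are preserved
    MerchantCategoryCode char(4) NOT NULL
        CONSTRAINT CK_Transactions_MCC
            CHECK (MerchantCategoryCode LIKE '[0-9][0-9][0-9][0-9]')
);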

The documentation says the smallint data type has a storage size of 2 bytes, while for char(n) each character takes 1 byte (with a single-byte character set).

char [ ( n ) ] Fixed-size string data. n defines the string size in
bytes and must be a value from 1 through 8,000. For single-byte
encoding character sets such as Latin, the storage size is n bytes, and
the number of characters that can be stored is also n.

smallint: range -2^15 (-32,768) to 2^15-1 (32,767), storage size 2 bytes.

The data type size will consequently affect your index size, query result size, and memory use.

But on the other hand, you have to think about the performance cost of adding the leading zeros (e.g. '0121').
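
If you do go with smallint, the padding does not necessarily have to live in the application; it can also be done in T-SQL. A sketch, assuming a hypothetical smallint column named MerchantCategoryCode on dbo.Transactions:

SELECT RIGHT('0000' + CAST(MerchantCategoryCode AS varchar(4)), 4) AS MccText
FROM dbo.Transactions;

-- or, on SQL Server 2012 and later (FORMAT is convenient but slower than string concatenation):
SELECT FORMAT(MerchantCategoryCode, '0000') AS MccText
FROM dbo.Transactions;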

I think there is no single correct answer to this question; you can’t decide which one is better without testing.

If you want the best of both worlds, you can store it as CHAR(4) and add a persisted computed column to the table that uses the CONVERT function to expose it as an int as well. That way, if you need to report on it or use it as a predicate, it’s more efficient, and you can even create indexes on it.
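
A sketch of that approach, assuming the same hypothetical char(4) column from the earlier example:

-- persisted computed column holding the integer form of the code
ALTER TABLE dbo.Transactions
    ADD MerchantCategoryCodeInt AS CONVERT(int, MerchantCategoryCode) PERSISTED;

-- the persisted column can then be indexed like any regular column
CREATE NONCLUSTERED INDEX IX_Transactions_MCCInt
    ON dbo.Transactions (MerchantCategoryCodeInt);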

That being said, I don’t think CHAR(4) would cause much of a performance difference in your predicates even at 300 million records, but the only way to know is to test it. Note that the persisted computed column will also make the table consume more disk space, since it is essentially a materialized column at that point.
