Should I invest the time to change the column type from CHAR(36) to UUID?

Question:

I have a few million rows in my database already. I didn’t know about the PostgreSQL UUID data type when I designed my schema.

One of the tables has 16M rows (about 3.5M to 4M records per shard), growing at about 500K records per day. I still have the luxury of taking the production system down for a few hours if required. I won’t have this luxury in one or two weeks.

My question is, will it be worthwhile to do so? I’m wondering about JOIN performance, disk space use (full gzip’d dump is 1.25 GiB), things of that nature.

Table schema is:

# \d twitter_interactions
                Table "public.twitter_interactions"
         Column          |            Type             | Modifiers 
-------------------------+-----------------------------+-----------
 interaction_id          | character(36)               | not null
 status_text             | character varying(1024)     | not null
 screen_name             | character varying(40)       | not null
 twitter_user_id         | bigint                      | 
 replying_to_screen_name | character varying(40)       | 
 source                  | character varying(240)      | not null
 tweet_id                | bigint                      | not null
 created_at              | timestamp without time zone | not null
Indexes:
    "twitter_interactions_pkey" PRIMARY KEY, btree (interaction_id)
    "twitter_interactions_tweet_id_key" UNIQUE, btree (tweet_id)
    "index_twitter_interactions_on_created_at" btree (created_at)
    "index_twitter_interactions_on_screen_name" btree (screen_name)
Triggers:
    insert_twitter_interactions_trigger BEFORE INSERT ON twitter_interactions FOR EACH ROW EXECUTE PROCEDURE twitter_interactions_insert_trigger()
Number of child tables: 9 (Use \d+ to list them.)

Answer:

I would consider changing to the UUID type. A char(36) value takes 40 bytes while a uuid takes 16, so you’ll save 24 bytes per row; at 500K new rows a day that works out to roughly 12 MB a day, or about 4 GB after a year, plus the corresponding savings in every index that contains the column. Depending on your hardware that may or may not be much, but it adds up, especially if you have more improvement opportunities like this.
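The conversion itself is a single ALTER TABLE. A minimal, untested sketch; it assumes every existing value is well-formed UUID text and that you run it inside your downtime window, since the table is rewritten and exclusively locked while the statement runs:

BEGIN;

-- Rewrites the table and rebuilds the primary key index.
-- With inheritance-style partitioning, ALTER TABLE should recurse to the
-- child tables by default.
ALTER TABLE twitter_interactions
    ALTER COLUMN interaction_id TYPE uuid
    USING interaction_id::uuid;

COMMIT;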

Also, I see no constraint in your schema that ensures interaction_id is actually in the right format. Using the proper type gives you that validation as well.
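To illustrate: char(36) will silently accept any 36-character string, while a uuid column rejects malformed values outright (an illustrative snippet, not part of the original schema):

-- Fails with an error along the lines of: invalid input syntax for type uuid
SELECT 'not-a-valid-uuid'::uuid;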

If you like that idea, however, then using bigint would save even more and perform even better. It’s very unlikely that your application is so large that a bigint ID column won’t suffice.

I’m not a Postgres person by any stretch of the imagination, but based on what I know from SQL Server, the more rows you can fit onto a data page, the better performance you will get (reading data from disk is typically the most expensive operation). Going from a 36-ish-byte field (see Edit 1 below for the actual storage cost) to a 16-byte GUID therefore looks like a straightforward cost saving: the fewer reads you incur, the faster you can return results. All of this, of course, assumes that a GUID/UUID satisfies the business needs of the table. And if a UUID satisfies them, would a bigint? That would shave another 8 bytes per row off your storage costs.
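If you did go the bigint route, the key column could look something like this (a purely hypothetical sketch; twitter_interactions_v2 is an invented name, and bigserial is just one way to generate the values):

CREATE TABLE twitter_interactions_v2 (
    interaction_id bigserial PRIMARY KEY,  -- 8 bytes vs. 16 for uuid, ~40 for char(36)
    tweet_id       bigint NOT NULL UNIQUE
    -- ... remaining columns as in the original table
);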

Edit 1

Character data in Postgres carries an additional storage cost. Short strings, up to 126 bytes, have 1 byte of overhead, while anything longer has 4 bytes, which is how the other respondent arrived at a 40-byte cost for a 36-byte field. There is also automatic string compression, so perhaps it won’t cost the full 40 bytes. I can’t tell what the final cost would be, but the fundamentals remain: anything over 16 bytes will increase storage cost, take longer to read, and consume more memory.

The storage requirement for a short string (up to 126 bytes) is 1 byte
plus the actual string, which includes the space padding in the case
of character. Longer strings have 4 bytes of overhead instead of 1.
Long strings are compressed by the system automatically, so the
physical requirement on disk might be less.
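If you want to check the per-value sizes yourself, pg_column_size is handy. The exact numbers can vary with PostgreSQL version and context, but on a typical installation this returns roughly 40, 16 and 8:

SELECT pg_column_size('0d60a85e-0b90-4482-a14c-108aea2557aa'::char(36)) AS char36_bytes,
       pg_column_size('0d60a85e-0b90-4482-a14c-108aea2557aa'::uuid)     AS uuid_bytes,
       pg_column_size(1234567890123::bigint)                            AS bigint_bytes;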

Besides the space issue, keep in mind that you’ll need to change every table that stores this key to the same data type, or your join performance will tank badly.
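In practice that means converting every referencing column in the same maintenance window (twitter_mentions here is an invented table name, purely for illustration):

ALTER TABLE twitter_mentions
    ALTER COLUMN interaction_id TYPE uuid
    USING interaction_id::uuid;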

In addition to the savings in data and index size (as others have said), which do translate into I/O savings, you also need to consider how you will generate new values for interaction_id and what the impact will be on the indexes and on your query conditions (joins).

The index will be smaller; however, if a lot of your queries rely on index scans, switching to UUIDs might make those scans impossible (depending on how you generate the UUIDs), and bigint might be a much better choice.
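If you do stay with UUIDs and want the database to generate new ones, one option is a column default. Note that gen_random_uuid() is built into core only on newer PostgreSQL releases; on older ones you would need the pgcrypto extension, or uuid-ossp’s uuid_generate_v4() as an alternative:

ALTER TABLE twitter_interactions
    ALTER COLUMN interaction_id SET DEFAULT gen_random_uuid();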

Finally, since the actual performance impact also depends on your usage patterns and data distribution, you should run tests in a development and testing environment where you can exercise the change.

That will give you a much more precise answer about the performance impact.
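For example, on a development copy you could compare plans and timings for a representative lookup before and after the type change (this query only uses columns from the schema you posted; adapt it to your real workload):

EXPLAIN ANALYZE
SELECT screen_name, created_at
FROM   twitter_interactions
WHERE  interaction_id = '0d60a85e-0b90-4482-a14c-108aea2557aa';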
