Dilemma with Partition Key choice

Posted on

Question :

One of our tables is over 200gb and the old records are hardly ever queried, and this makes me think partitioning is a good solution, however I am having trouble figuring out which column to use as a Partition Key for a table.

We have 2 tables as follows (names and structures are altered for confidentiality):

  • transactions table – with a transaction ID,
  • checks table – With the schema (id, transcation_id, created_at, metadata, last_updated_at) there can be multiple different checks with for a given transaction_id

As part of data aggregation we write a summarized version of the checks into a single column in the transactions table. The way we do this aggregation is by running an UPDATE on transactions table with joins on the checks table filtering by last_updated_at to only check for checks that have been updated recently.

Users however query using created_at in the checks table, and also last_updated_at could potentially change for a given row (although not likely).

Which one should I use as partition key? created_at or last_updated_at?

Answer :

Most queries become slower with partitioning.

The exceptions I know are:

  • a query that needs a sequential scan, but contains the partitioning key in the WHERE clause
  • a query that joins two partitioned tables, and the join condition matches the partition key for both
  • a query that groups and aggregates by the partitioning key

So you should rather consider how data expire. Partition your tables so that you can get rid of old, obsolete data by dropping partitions.

Leave a Reply

Your email address will not be published. Required fields are marked *