Question :
One of our tables is over 200gb and the old records are hardly ever queried, and this makes me think partitioning is a good solution, however I am having trouble figuring out which column to use as a Partition Key for a table.
We have 2 tables as follows (names and structures are altered for confidentiality):
transactions
table – with a transaction ID,checks
table – With the schema(id, transcation_id, created_at, metadata, last_updated_at)
there can be multiple different checks with for a giventransaction_id
As part of data aggregation we write a summarized version of the checks into a single column in the transactions
table. The way we do this aggregation is by running an UPDATE
on transactions
table with joins on the checks table filtering by last_updated_at
to only check for checks that have been updated recently.
Users however query using created_at
in the checks
table, and also last_updated_at
could potentially change for a given row (although not likely).
Which one should I use as partition key? created_at
or last_updated_at
?
Answer :
Most queries become slower with partitioning.
The exceptions I know are:
- a query that needs a sequential scan, but contains the partitioning key in the
WHERE
clause - a query that joins two partitioned tables, and the join condition matches the partition key for both
- a query that groups and aggregates by the partitioning key
So you should rather consider how data expire. Partition your tables so that you can get rid of old, obsolete data by dropping partitions.