Question :
I’m having trouble finding the best way to approach the following problem (Postgres 9.5):
I get update batches of about 100,000 rows at a time from another system. This usually happens every 10-15 minutes, but it’s possible that I get multiple of these batches at the same time. The batches are separated by “category”, and one batch only ever contains data from a single category. Every “category” gets an updated set every 10-15 minutes.
New rows get inserted, old rows get deleted, and rows that still exist should be updated to the new values.
The problem is that the table amasses serious amounts of garbage data, the VACUUM processes run really slowly, and general table performance is really poor.
Now I thought I could solve this problem by creating child tables for every “category” within the data and thus “sharding” the data.
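Roughly, what I have in mind is inheritance-based partitioning (the only kind available in 9.5), with one child table per “category”. A sketch with placeholder names, not my real schema:

    -- Placeholder table/column names, one child table per category.
    CREATE TABLE item_data (
        category text   NOT NULL,
        item_id  bigint NOT NULL,
        payload  jsonb
    );

    CREATE TABLE item_data_cat_a (
        CHECK (category = 'cat_a')
    ) INHERITS (item_data);

    -- An INSERT trigger on item_data would route rows into the matching child table.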
Would this make sense in this case, or are there better options for me to pursue?
Answer :
Is vacuuming slow because you have inadequate IO throughput, or just because it is throttled too much?
The default throttling of autovacuum is not suitable for very write-intensive servers. You should probably decrease autovacuum_vacuum_cost_delay or increase vacuum_cost_limit. I also routinely set vacuum_cost_page_hit and vacuum_cost_page_miss to zero: page misses are inherently self-limiting, as the vacuum process can’t proceed until the page is delivered, so there is little reason to add intentional throttling on top of that.
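As a rough sketch only (the concrete values are assumptions and need tuning for your hardware; the 9.5 defaults are 20ms, 200, 1 and 10 respectively), the adjustment could look like this:

    -- Example values only; put them in postgresql.conf or apply via ALTER SYSTEM.
    ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '2ms';
    ALTER SYSTEM SET vacuum_cost_limit = 2000;
    ALTER SYSTEM SET vacuum_cost_page_hit = 0;
    ALTER SYSTEM SET vacuum_cost_page_miss = 0;
    SELECT pg_reload_conf();  -- reload so the new values take effect without a restart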
I get update batches of about 100,000 rows at a time from another system. […] Every “category” gets an updated set every 10-15 minutes. New rows get inserted, old rows get deleted, and rows that still exist should be updated to the new values.
Are you using UPSERT, i.e. INSERT ... ON CONFLICT DO UPDATE SET ...? If not, have you tried integrating it into your workflow and benchmarking it? It’s new functionality in Pg 9.5.
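As a sketch of what that could look like for one of your batches (the table, column and staging-table names here are hypothetical placeholders, and ON CONFLICT needs a unique index on the conflict target):

    -- Hypothetical schema: my_data(category, item_id, payload) with a unique
    -- constraint on (category, item_id); incoming_batch holds the new batch.
    INSERT INTO my_data (category, item_id, payload)
    SELECT category, item_id, payload
    FROM   incoming_batch
    ON CONFLICT (category, item_id)
    DO UPDATE SET payload = EXCLUDED.payload;

    -- Rows that no longer appear in the new batch still need an explicit DELETE.
    DELETE FROM my_data d
    WHERE  d.category = 'some_category'
    AND    NOT EXISTS (
               SELECT 1
               FROM   incoming_batch b
               WHERE  b.category = d.category
               AND    b.item_id  = d.item_id);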
Also, what about hardware: 10 million rows isn’t that many. Are you running SSDs?