ANALYZE strategy for big tables in PostgreSQL

Question :

In our PostgreSQL 9.4.4 database we have a table that receives around 600k new records each day. Every night we perform some ETL exports from this table. If the table has not been analyzed before the export, the export is really slow; if we run ANALYZE first, it is much faster, because the planner uses the multicolumn index covering the field we filter on. What would be the preferred way to solve the slow query issue? I see three options:

  • ANALYZE right before the export,
  • use automatic vacuum/analyze feature,
  • add query specific index.

The second option requires us to specify the auto-vacuum/analyze settings per table, since the defaults do not work well for big tables. And even if we set them properly, there are edge cases where it can still cause trouble. For example:

autovacuum_analyze_scale_factor = 0.1

is the current default – so if our table has ~25 million records, it will be re-analyzed only after ~2.5 million row changes, which is not frequent enough for us (we have ~600k new transactions every day). But if we set it to 0.02 (~500k records), it may happen that on a day with many transactions (900k, for example) ANALYZE runs after the first 500k, but the remaining 400k rows stay unanalyzed, which would hurt query performance.
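The point at which autovacuum schedules an ANALYZE can be computed directly from the two settings; a minimal sketch of that arithmetic (the helper name is my own):

```python
def autoanalyze_trigger(reltuples, scale_factor=0.1, base_threshold=50):
    """Number of changed rows before autovacuum schedules ANALYZE:
    autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples."""
    return base_threshold + scale_factor * reltuples

# ~25M-row table with the default 0.1 scale factor: ~2.5M changed rows
print(autoanalyze_trigger(25_000_000))        # 2500050.0
# With scale_factor = 0.02: ~500k changed rows
print(autoanalyze_trigger(25_000_000, 0.02))  # 500050.0
```

This makes the trade-off in the question concrete: with 600k daily inserts, the default triggers an ANALYZE only every four days or so, while 0.02 triggers roughly once a day but can leave the tail of a busy day unanalyzed.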

The table structure:

                                             Table "public.bet_transactions"
           Column           |            Type             |                           Modifiers
 id                         | integer                     | not null default nextval('bet_transactions_id_seq'::regclass)
 account_id                 | integer                     | not null
 amount_cents               | integer                     | not null default 0
 money_amount_cents         | integer                     | not null default 0
 bonus_amount_cents         | integer                     | not null default 0
 wager_amount_cents         | integer                     | not null default 0
 currency                   | character varying(3)        | not null default 'EUR'::character varying
 total_balance_before_cents | integer                     | not null default 0
 total_balance_after_cents  | integer                     | not null default 0
 money_balance_before_cents | integer                     | not null default 0
 money_balance_after_cents  | integer                     | not null default 0
 bonus_balance_before_cents | integer                     | not null default 0
 bonus_balance_after_cents  | integer                     | not null default 0
 wager_balance_before_cents | integer                     | not null default 0
 wager_balance_after_cents  | integer                     | not null default 0
 status                     | character varying(255)      | not null
 reason                     | character varying(255)      |
 provider                   | character varying(255)      | not null
 external_unique_id         | character varying(255)      |
 external_group_id          | character varying(255)      |
 external_reason            | character varying(255)      |
 external_data              | hstore                      | not null default ''::hstore
 search_vector              | tsvector                    |
 created_at                 | timestamp without time zone |
 updated_at                 | timestamp without time zone |
 exchange_rate              | double precision            | not null default 1
 original_currency          | character varying(3)        | not null default 'EUR'::character varying
 original_amount_cents      | integer                     | not null default 0
 game_id                    | integer                     | not null
 external_game_id           | character varying(255)      |
 big_win                    | boolean                     |
 contribution_factor        | integer                     |
    "bet_transactions_pkey" PRIMARY KEY, btree (id)
    "ux_bet_transactions_provider_external_unique_id" UNIQUE, btree (account_id, provider, external_unique_id) WHERE external_unique_id IS NOT NULL
    "ix_bet_transactions_account_game_status_and_date" btree (account_id, game_id, created_at) WHERE status::text = 'accepted'::text AND created_at >= '2015-01-01 00:00:00'::timestamp without time zone
    "ix_bet_transactions_account_id_external_group_id" btree (account_id, external_group_id)
    "ix_bet_transactions_date_created_at" btree (date(created_at))
Inherits: game_transactions

Sample query (batchified):

SELECT "bet_transactions".*,
       "bet_transactions".tableoid::regclass::text AS "table_name"
FROM "bet_transactions"
INNER JOIN "accounts" ON "accounts"."id" = "bet_transactions"."account_id"
INNER JOIN "players" ON "players"."id" = "accounts"."player_id"
WHERE (bet_transactions.tableoid::regclass::text = 'bet_transactions'::text)
  AND ("bet_transactions"."created_at" >= '2015-07-02 22:00:00.000000'
   AND "bet_transactions"."created_at" <  '2015-07-03 22:00:00.000000')
  AND "bet_transactions"."status" = 'accepted'
ORDER BY "bet_transactions"."created_at" ASC
LIMIT 5000 OFFSET 0

Do you see any alternatives? How is such situation usually handled?

EDIT: Added the table structure and a sample query.

Answer :

As already stated in the comments above, some details are hidden. I understand from your question that the query plan changes after an ANALYZE.
This may indicate that the statistics used by the query planner do not reflect the real distribution of the data.

In any case, ANALYZE only takes a sample – it does not scan the whole table. This means tweaking autovacuum_analyze_threshold only makes sense to me if the new rows would dramatically change the distribution across the whole table. Whether that is the case depends on your workload.

Much more important, it seems to me, is to adjust the size of the sample taken by ANALYZE. You can influence the sample size for your table by setting the statistics target (unfortunately it is not mentioned in the question). This blog post shows how the statistics target influences the validity of the sample taken by ANALYZE.
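The statistics target can be raised globally (default_statistics_target) or per column. A sketch for the created_at column used in the sample query – the value 1000 is only an illustration, not a recommendation:

```sql
-- Raise the per-column statistics target (the global default is 100);
-- ANALYZE then samples 300 * target rows when collecting stats for it.
ALTER TABLE bet_transactions
    ALTER COLUMN created_at SET STATISTICS 1000;

-- Re-collect statistics with the new target.
ANALYZE bet_transactions;
```

A larger target makes the histogram and most-common-values list finer-grained, at the cost of a slower ANALYZE and slightly slower planning.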

There are several ways you could deal with this problem. You could definitely make your autovacuum settings more aggressive. Josh Berkus has a lovely answer to a question called Aggressive Autovacuum on PostgreSQL.

You could add more workers and make the scale factors smaller, as you said, and adjust the costs and timings, as suggested in the referenced question.

Alternatively, you could set up a cron job or other timed process to supplement the work of autovacuum. A simple vacuumdb -U postgres -Z -t <big_table> <dbname> would work just fine (-Z runs ANALYZE only, without vacuuming). You would have to determine a reasonable schedule for your table. You could even make this part of your export script, as suggested in the comments above by @fvu.
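For example, a crontab entry running an ANALYZE-only pass shortly before the nightly export could look like this (the time, user, and database name are placeholders, not values from the question):

```
# Run ANALYZE only (-Z) on the big table at 21:30 every day,
# shortly before the 22:00 export window.
30 21 * * *  vacuumdb -U postgres -Z -t bet_transactions mydb
```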

Finally, you could use ALTER TABLE on this large table to set autovacuum_analyze_scale_factor = 0.0 and autovacuum_analyze_threshold = 10000 (or whatever value better fits your needs for this particular table).

These settings basically tell autovacuum that for this table an ANALYZE is needed after every 10,000 modified tuples, without scaling that value by the table size as the autoanalyze formula normally does. This keeps the statistics as up to date as your requirements dictate.
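In SQL, those per-table storage parameters look like this (10000 is the example value from above):

```sql
-- Fixed ANALYZE threshold for this table: ignore the scale factor
-- and re-analyze after every 10,000 modified tuples.
ALTER TABLE bet_transactions SET (
    autovacuum_analyze_scale_factor = 0.0,
    autovacuum_analyze_threshold    = 10000
);
```

Per-table settings like these override the postgresql.conf defaults only for this table, so the rest of the database keeps the standard autovacuum behaviour.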

Which solution is best depends on testing in your particular environment with your query patterns, as @ErwinBrandstetter commented above. Hopefully this gives you some ideas of different ways to handle it. Good luck. =)
