Postgres using seq scan with filter on indexed column + EXISTS on related table

Posted on

Question :

I’ve following query which takes about 25-35 seconds to execute:

EXPLAIN (analyze, buffers, format text) SELECT booking.*
FROM booking
WHERE booking.reference_number = '9999999999' OR booking.booking_id = '9999999999' OR 
    (
        EXISTS (
            SELECT * FROM booking_customer 
            WHERE booking_customer.booking_id = booking.booking_id AND 
                (booking_customer.email = '9999999999' OR booking_customer.phone = '9999999999') AND 
                booking_customer.deleted = false
        )
    );

The explain output is here: https://explain.depesz.com/s/RPNV

As visible from the plan, there is sequence scan done on booking table, inspite of filtering on indexed columns – reference_number and booking_id. The booking_customer table is properly using index scan though.

Is it something to do with EXISTS or OR clauses there?

My table structure is as below:

                              Table "public.booking"
        Column         |           Type           | Collation | Nullable | Default
-----------------------+--------------------------+-----------+----------+---------
 deleted               | boolean                  |           |          |
 booking_id            | character varying        |           | not null |
 reference_number      | character varying        |           |          |
 booking_owner         | character varying        |           |          |
 checkin_date          | timestamp with time zone |           |          |
 checkout_date         | timestamp with time zone |           |          |
 status                | character varying        |           |          |
 hold_till             | timestamp with time zone |           |          |
 version               | integer                  |           | not null |
 comments              | text                     |           |          |
 extra_information     | json                     |           |          |
 cancellation_reason   | character varying        |           |          |
 cancellation_datetime | timestamp with time zone |           |          |
 created_at            | timestamp with time zone |           | not null | now()
 modified_at           | timestamp with time zone |           | not null | now()
Indexes:
    "booking_pkey" PRIMARY KEY, btree (booking_id)
    "ix_booking_reference_number" UNIQUE, btree (reference_number)
    "idx_booking_checkin_date" btree (checkin_date)
    "idx_booking_checkout_date" btree (checkout_date)
    "ix_booking_status" btree (status)
    "trgm_booking_ref_num" gist (reference_number gist_trgm_ops)

And customer table:

                          Table "public.booking_customer"
        Column         |           Type           | Collation | Nullable | Default
-----------------------+--------------------------+-----------+----------+---------
 deleted               | boolean                  |           |          |
 customer_id           | character varying        |           | not null |
 booking_id            | character varying        |           | not null |
 first_name            | character varying        |           |          |
 last_name             | character varying        |           |          |
 gender                | character varying        |           |          |
 age                   | integer                  |           |          |
 nationality           | character varying        |           |          |
 phone                 | character varying        |           |          |
 email                 | character varying        |           |          |
 gst_addr_field1       | character varying        |           |          |
 gst_addr_field2       | character varying        |           |          |
 gst_addr_city         | character varying        |           |          |
 gst_addr_state        | character varying        |           |          |
 gst_addr_country      | character varying        |           |          |
 gst_pincode           | character varying        |           |          |
 legal_name            | character varying        |           |          |
 created_at            | timestamp with time zone |           | not null | now()
 modified_at           | timestamp with time zone |           | not null | now()
Indexes:
    "booking_customer_pkey" PRIMARY KEY, btree (customer_id, booking_id)
    "book_cust_idx" btree (booking_id, customer_id)
    "ix_booking_customer_email" btree (email)
    "ix_booking_customer_phone" btree (phone)
    "trgm_cust_last_name" gist (last_name gist_trgm_ops)

Details on total records in table are as follows:

db=> SELECT COUNT(*) FROM booking;
 count
--------
 958092
(1 row)

db=> SELECT COUNT(*) FROM booking_customer;
  count
---------
 2471445
(1 row)

db=> SELECT COUNT(*) FROM booking WHERE reference_number = '9999999999' OR booking_id = '9999999999';
 count
-------
     1
(1 row)

db=> SELECT COUNT(*) FROM booking_customer WHERE (email = '9999999999' OR phone = '9999999999') AND deleted = false;
 count
--------
 156377
(1 row)

db=> SELECT COUNT(DISTINCT(booking_id)) FROM booking_customer WHERE (email = '9999999999' OR phone = '9999999999') AND deleted = false;
 count
-------
 65196
(1 row)

So ideally, I should have gotten 65196 rows as result, which may be seems good reason to use sequence scan. But, when I try with a different value of phone or email, which results in roughly 1300 rows, still it uses seq scan on booking table.

I ran vacuum analyze on both tables yesterday midnight only.

Is there a way to optimise this query?

Postgres details:

db=> SELECT version();
                                    version
-------------------------------------------------------------------------------
 PostgreSQL 9.6.12 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.9.3, 64-bit
(1 row)

Update:

Just tried with UNION ALL, and the performance is too good. As good, as it makes me wonder if the result is correct? Or is there some drawback. Because why would anyone do OR condition in WHERE clause ever?

Here’s the new query:

EXPLAIN (analyze, buffers, format text) 
SELECT booking.*
FROM booking
WHERE booking.deleted = false AND booking.reference_number = '9999999999'
UNION ALL
    SELECT booking.*
    FROM booking
    WHERE booking.deleted = false AND booking.booking_id = '9999999999'
UNION ALL
    SELECT booking.*
    FROM booking
    WHERE booking.deleted = false AND EXISTS (
            SELECT * FROM booking_customer 
            WHERE booking_customer.booking_id = booking.booking_id AND 
                (booking_customer.email = '9999999999' OR booking_customer.phone = '9999999999') AND 
                booking_customer.deleted = false
        );

New plan is here: https://explain.depesz.com/s/aXMV

Answer :

The reason why the query is choosing a sequential scan on booking is because the estimated number of rows to be scanned is 478,187, and the entire table is 958,092. That means that more than half of the table will need to be scanned anyways–might as well just do a sequential scan, rather than pull 478K rows out of the index, then go to the disk to pull the actual rows out.

Note also that the estimate is incorrect. It estimates 478,187 rows, but there are actually only 61,695 rows — the estimate is 7.8x off. You may need to perform an ANALYZE on the booking table and try again.

However, to focus on the Sequential Scan is sort of misleading, as the bulk of the runtime is in the loop over Index Scan on book_cust_idx 958,047 times. Each scan of the index takes 0.025ms, but after looping through it nearly 1M times, the duration adds up. The reason why it loops 958,047 times is because for each row in booking, it needs to compare with the result of EXISTS.

You may be better off just doing a JOIN, so that the two tables are merged, rather than looped upon:

EXPLAIN (analyze, buffers, format text) SELECT b.*
FROM booking b
JOIN booking_customer bc ON b.booking_id=bc.booking_id
WHERE b.reference_number = '9999999999'
   OR b.booking_id = '9999999999'
   OR ((bc.email = '9999999999' OR bc.phone = '9999999999') AND NOT bc.deleted);

UPDATE:
Using a UNION ALL works as well, since the individual reference_number and booking_id columns are indexed, effectively partitioning the result set for faster retrieval

The part where it “is properly using index scan” is by far the slowest part of your original query. Is your goal to get the performance to be better, or to force it to use some preconceived notion of what is “proper”, even if that is very slow?

Just tried with UNION ALL, and the performance is too good. As good,
as it makes me wonder if the result is correct? Or is there some
drawback. Because why would anyone do OR condition in WHERE clause
ever?

With the UNION ALL, you will get the same row of “booking” returned multiple times if it satisfies multiple branches of the query. With your original query, it would not do that. So if the first query is correct, then the second one is incorrect. Whether this is incorrect in a way that actually matters to you is something only you can know. You could change “UNION ALL” to “UNION”, but then there is a possibility of being incorrect in a different way, by over-consolidating rows rather than duplicating them. But if one of the columns you are selecting is the primary key, then over-consolidating should not be possible and I believe your two queries are formally identical.

Leave a Reply

Your email address will not be published. Required fields are marked *