How to optimize a query for < operator

Posted on

Question :

I have a SELECT that is very slow when using < operator I’m looking for fixes or workarounds to perform this operation:

FROM "users" 
WHERE (engagement_level(social) < 1) 
    AND (social_peemv(social) < 33.333333333333336) 
    AND (array['United Kingdom'] <@ mixed_frequent_locations(location)) 
    AND (is_visible(social, flags) = TRUE) 
ORDER BY "users"."created_at" ASC 
                                                                                                               QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------    ---------------------------
 Limit  (cost=0.43..18572.10 rows=12 width=860) (actual time=5658.037..175582.743 rows=12 loops=1)
   ->  Index Scan using created_at_idx on users  (cost=0.43..6244724.16 rows=4035 width=860) (actual time=5658.035..175582.735 rows=12 loops=1)
         Filter: (is_visible(social, flags) AND (engagement_level(social) < 1) AND (social_peemv(social) < '33.3333333333333'::double precision) AND ('{"United Kingdom"}'::text[] <@ mixed_frequent_locations(location)))
         Rows Removed by Filter: 2816798
 Planning time: 1.573 ms
 Execution time: 175583.373 ms
(6 rows)
FROM "users" 
WHERE (engagement_level(social) < 1) 
  AND (social_peemv(social) = 33.3333) 
  AND (array['United Kingdom'] <@ mixed_frequent_locations(location)) 
  AND (is_visible(social, flags) = TRUE)
ORDER BY "users"."created_at" ASC
                                                                                QUERY PLAN
 Limit  (cost=380.04..380.31 rows=1 width=863) (actual time=0.051..0.051 rows=0 loops=1)
   ->  Result  (cost=380.04..380.31 rows=1 width=863) (actual time=0.050..0.050 rows=0 loops=1)
         ->  Sort  (cost=380.04..380.05 rows=1 width=896) (actual time=0.049..0.049 rows=0 loops=1)
               Sort Key: created_at
               Sort Method: quicksort  Memory: 25kB
               ->  Index Scan using idx_in_social_peemv on users  (cost=0.43..380.03 rows=1 width=896) (actual time=0.044..0.044 rows=0 loops=1)
                     Index Cond: (social_peemv(social) = '33.3333'::double precision)
                     Filter: (is_visible(social, flags) AND (engagement_level(social) < 1) AND ('{"United Kingdom"}'::text[] <@ mixed_frequent_locations(location)))
 Planning time: 0.459 ms
 Execution time: 0.095 ms

In the first case no Index Cond is applied and the execution time grows to 175583.373 ms

The index:

 CREATE INDEX idx_in_social_peemv ON users USING BTREE ( social_peemv(social) ) ;
 CREATE INDEX mixed_frequent_locations_idx on users USING GIN ( mixed_frequent_locations(location) ) ;
 CREATE INDEX created_at_idx ON users USING btree (created_at)
 CREATE INDEX idx_in_social_follower_count_and_created_at ON users USING btree (social_follower_count(social) DESC, created_at)
 CREATE INDEX idx_in_egagagement_level_and_created_at ON users USING btree (engagement_level(social), creat

The Table:

    name TEXT,
    bio TEXT,
    social jsonb,
    flags array,
    location jsonb,
    search_field ts_vector,

Postgres version: 10

The whole condition match 20 records out of a total of 3669284

Each condition matches:

(engagement_level(social) < 1) = 801176

(social_peemv(social) < 33.333333333333336) = 1621516

(array['United Kingdom'] <@ mixed_frequent_locations(location)) = 91625

(is_visible(social, flags) = TRUE) = 3333733

As suggested by @jjanes i tried removing the LIMIT and OFFSET and the query plan changed to a BitMap Heap Scan:

FROM "users"
WHERE (engagement_level(social) < 1)
     AND (social_peemv(social) < 33.333333333333336)
     AND (array['United Kingdom'] <@ mixed_frequent_locations(location))
     AND (is_visible(social, flags) = TRUE)
ORDER BY "users"."created_at" ASC;
                                                                                  QUERY PLAN
 Sort  (cost=77641.99..77652.36 rows=4148 width=1393) (actual time=1195.544..1195.546 rows=20 loops=1)
   Sort Key: created_at
   Sort Method: quicksort  Memory: 59kB
   ->  Bitmap Heap Scan on users  (cost=18046.48..77392.73 rows=4148 width=1393) (actual time=227.471..1195.481 rows=20 loops=1)
         Recheck Cond: (('{"United Kingdom"}'::text[] <@ mixed_frequent_locations(location)) AND (engagement_level(social) < 1))
         Filter: (is_visible(social, flags) AND (social_peemv(social) < '33.3333333333333'::double precision))
         Rows Removed by Filter: 19444
         Heap Blocks: exact=19238
         ->  BitmapAnd  (cost=18046.48..18046.48 rows=28415 width=0) (actual time=218.484..218.484 rows=0 loops=1)
               ->  Bitmap Index Scan on mixed_frequent_locations_idx  (cost=0.00..1356.36 rows=128634 width=0) (actual time=44.794..44.794 rows=108076 loops=1)
                     Index Cond: ('{"United Kingdom"}'::text[] <@ mixed_frequent_locations(location))
               ->  Bitmap Index Scan on idx_in_egagagement_level_and_created_at  (cost=0.00..16687.80 rows=1156662 width=0) (actual time=163.368..163.368 rows=801189 loops=1)
                     Index Cond: (engagement_level(social) < 1)
 Planning time: 3.326 ms
 Execution time: 1197.242 ms

All the conditions applied except the is_visible are given by the user

Answer :

You have a classic column dependence problem. PostgreSQL thinks there will be 4035 rows that meet the WHERE clause based on estimates. If I take the actual selectivities you reported and multiply them (which is correct only if each one is independent of the others), I get 8032 rows. But the true value is 20. Apparently people from the UK are abnormally unsocial, or something like that.

With the LIMIT 12, it thinks it will just walk the index in the same order as the ORDER BY and stop once it finds 12 rows which pass the filters. It thinks it will need to walk only about 12/4035 of the index. But it really needs to walk about 12/20 of the index (unless there is also a correlation between creation date and socialness, then even those estimates would be wrong). So it picks those methods because it thinks they will be faster.

In newer versions of PostgreSQL you can gather statistics on inter-column dependencies, but you can’t do that on functions/expressions so it would not work for you. I don’t think it would work for array columns, anyway.

If your cutoffs were hard-coded, you could create a partial index like the below, but it sounds like that won’t work for you either, unless you are willing to compromise on that.

CREATE INDEX mixed_frequent_locations_idx_2 on users 
  USING GIN ( mixed_frequent_locations(location) )
  WHERE (engagement_level(social) < 1)
   AND (social_peemv(social) < 33.333333333333336)
   AND (is_visible(social, flags) = TRUE) ;

You can often solve the problem by crafting a multicolumn index with just the right order of columns which is so good for the query that it gets used even if the selectivity estimates are way off. But since all selective conditions are based on inequality or array-membership, it isn’t obvious how you could come up with such an index. It is possible you could use a GiST index for this. btree_gist provides GiST operator classes for ‘<‘ on the numbers, but I don’t know of anything that provides <@ support on string arrays. You can try an index USING GIST ON (engagement_level(social), social_peemv(social)) but since most of your selectivity comes from mixed_frequent_locations(location) I would not have high hopes that an index without that would work well enough to appear better than the existing indexed ORDER BY plan.

Put it all together, and I don’t think you have a lot of great options here. Selectively hunting down problem queries and removing their LIMIT is probably the best option, as un-beautiful as it may be. Rather than removing the LIMIT, you could just specify the ORDER BY like this:

ORDER BY "users"."created_at" + '0 seconds'::interval ASC

It will prevent the falsely-attractive index from being used, without changing the output of the queries.

In the fast query, you’re returning 0 rows. That’s easy. PostgreSQL sees social_peemv(social) and has a juicy statistics estimate for it and scans idx_in_social_peemv. It returns 0 rows and the query is done.

In the other one you’re saying < 33.333333333333336. As this means nearly half of the table, PostgreSQL finds that the value isn’t worth the cost according to selectivity estimates. Instead, it scans a different index, and then removes 2816798 rows from its result set.

We still don’t know what your functions are doing, and we don’t have the table definition. You didn’t give us all the indexes created_at_idx either.

Leave a Reply

Your email address will not be published. Required fields are marked *