Postgres – fast way to enforce AND across multiple rows

Posted on

Question :

I’ve got a 4 column table with ~25M rows:

tweetifai=> d core_tagusage;
                                 Table "public.core_tagusage"
  Column  |         Type          |                         Modifiers
----------+-----------------------+------------------------------------------------------------
 id       | integer               | not null default nextval('core_tagusage_id_seq'::regclass)
 tag      | character varying(30) | not null
 score    | double precision      | not null
 tweet_id | character varying(16) | not null
Indexes:
    "core_tagusage_pkey" PRIMARY KEY, btree (id)
    "core_tagusage_tag_3ed0f993a18dd430_uniq" UNIQUE CONSTRAINT, btree (tag, tweet_id)
    "core_tagusage_cea90699" btree (tweet_id)
    "core_tagusage_score_772bb061400905c6_uniq" btree (score)
    "core_tagusage_tag_443351eb_uniq" btree (tag)
    "core_tagusage_tweet_id_11461293c1e23f80_like" btree (tweet_id varchar_pattern_ops)
Foreign-key constraints:
    "core_tagusage_tweet_id_11461293c1e23f80_fk_core_imagetweet_id" FOREIGN KEY (tweet_id) REFERENCES core_imagetweet(id) DEFERRABLE INITIALLY DEFERRED

There is a many-to-many relationship between tweet_id and tag.

Given a few tags (eg. “dog”, “grass”, “frisbee”) I’d like to retrieve all tweet_ids which match all tags and sort by score. Here’s my query:

SELECT *
FROM (SELECT
        COUNT(ort.tweet_id) AS numtags,
        ort.tweet_id,
        MIN(ort.score)      AS minscore
      FROM (SELECT
              tg.tweet_id AS tweet_id,
              tg.tag      AS tag,
              tg.score    AS score
            FROM core_tagusage AS tg
            WHERE tg.tag = 'dog'
                  OR tg.tag = 'grass'
                  OR tg.tag = 'frisbee'
           ) AS ort
      GROUP BY ort.tweet_id) AS ct
WHERE numtags = 3
ORDER BY minscore DESC
LIMIT 200;

Even without the ORDER BY this is remarkably slow (~10-20 seconds). Here’s what I get from EXPLAIN ANALYZE:

Limit  (cost=0.56..22720.77 rows=200 width=33) (actual time=24177.117..24177.117 rows=0 loops=1)
  ->  GroupAggregate  (cost=0.56..1327311.85 rows=11685 width=25) (actual time=24177.115..24177.115 rows=0 loops=1)
        Group Key: tg.tweet_id
        Filter: (count(tg.tweet_id) = 3)
        Rows Removed by Filter: 493764
        ->  Index Scan using core_tagusage_cea90699 on core_tagusage tg  (cost=0.56..1321691.90 rows=547389 width=25) (actual time=0.206..23322.853 rows=501654 loops=1)
              Filter: (((tag)::text = 'dog'::text) OR ((tag)::text = 'grass'::text) OR ((tag)::text = 'frisbee'::text))
              Rows Removed by Filter: 24309855
Planning time: 0.237 ms
Execution time: 24177.166 ms

Is there a different query I could write to achieve the same results?

edit:

EXPLAIN ANALYZE SELECT
              tg.tweet_id AS tweet_id,
              tg.tag      AS tag,
              tg.score    AS score
            FROM core_tagusage AS tg
            WHERE tg.tag = 'dog'
                  OR tg.tag = 'grass'
                  OR tg.tag = 'tree';

Bitmap Heap Scan on core_tagusage tg  (cost=295.51..44454.72 rows=13794 width=32) (actual time=8.253..105.051 rows=104691 loops=1)
  Recheck Cond: (((tag)::text = 'dog'::text) OR ((tag)::text = 'grass'::text) OR ((tag)::text = 'tree'::text))
  Heap Blocks: exact=1133
  ->  BitmapOr  (cost=295.51..295.51 rows=13797 width=0) (actual time=8.111..8.111 rows=0 loops=1)
        ->  Bitmap Index Scan on core_tagusage_tag_443351eb_uniq  (cost=0.00..95.06 rows=4599 width=0) (actual time=3.238..3.238 rows=38385 loops=1)
              Index Cond: ((tag)::text = 'dog'::text)
        ->  Bitmap Index Scan on core_tagusage_tag_443351eb_uniq  (cost=0.00..95.06 rows=4599 width=0) (actual time=2.217..2.217 rows=30477 loops=1)
              Index Cond: ((tag)::text = 'grass'::text)
        ->  Bitmap Index Scan on core_tagusage_tag_443351eb_uniq  (cost=0.00..95.06 rows=4599 width=0) (actual time=2.649..2.649 rows=35864 loops=1)
              Index Cond: ((tag)::text = 'tree'::text)
Planning time: 0.093 ms
Execution time: 187.032 ms

Answer :

I think you have unnecessary steps in your query, consider this:

SELECT *
FROM (SELECT
        COUNT(tag) AS numtags,
        tweet_id,
        MIN(score)      AS minscore
      FROM  core_tagusage
            WHERE tag = 'dog'
                  OR tag = 'grass'
                  OR tag = 'frisbee'
      GROUP BY tweet_id) AS ct
WHERE numtags = 3
ORDER BY minscore DESC
LIMIT 200;

and this:

SELECT
    tweet_id,
    MIN(score) AS minscore
FROM  core_tagusage
WHERE tag IN ('dog', 'grass', 'frisbee')
GROUP BY tweet_id
WHERE COUNT(*) = 3
ORDER BY minscore DESC
LIMIT 200;

I also think that running analyze on your table will help the query planner.

After playing with some fabricated data locally, I was seeing ~600K rows take ~1sec, so 10 seconds doesn’t seem unreasonable (but my data was so predictable that analyze made the query planner just go for a sequential scan)

I can’t test this, since I don’t have your data, but
you could try something like this – it’s possible the optimizer will produce the exact same plan. In any case, I think CTEs make queries much more readable but may affect your performance?

WITH ort AS
(
  SELECT
  tg.tweet_id AS tweet_id,
  tg.tag      AS tag, 
  tg.score    AS score
  FROM core_tagusage AS tg
  WHERE tg.tag = 'dog'
  OR tg.tag = 'grass'
  OR tg.tag = 'frisbee'
),
ct AS 
(
  SELECT
  COUNT(ort.tweet_id) AS numtags,
  ort.tweet_id,
  MIN(ort.score)      AS minscore
  FROM ort
  GROUP BY ort.tweet_id
)
SELECT * FROM ct
WHERE numtags = 3
ORDER BY minscore DESC
LIMIT 200;

Leave a Reply

Your email address will not be published. Required fields are marked *