Question:
I’ve got a 4 column table with ~25M rows:
tweetifai=> \d core_tagusage;
Table "public.core_tagusage"
  Column  |         Type          |                          Modifiers
----------+-----------------------+------------------------------------------------------------
 id       | integer               | not null default nextval('core_tagusage_id_seq'::regclass)
 tag      | character varying(30) | not null
 score    | double precision      | not null
 tweet_id | character varying(16) | not null
Indexes:
"core_tagusage_pkey" PRIMARY KEY, btree (id)
"core_tagusage_tag_3ed0f993a18dd430_uniq" UNIQUE CONSTRAINT, btree (tag, tweet_id)
"core_tagusage_cea90699" btree (tweet_id)
"core_tagusage_score_772bb061400905c6_uniq" btree (score)
"core_tagusage_tag_443351eb_uniq" btree (tag)
"core_tagusage_tweet_id_11461293c1e23f80_like" btree (tweet_id varchar_pattern_ops)
Foreign-key constraints:
"core_tagusage_tweet_id_11461293c1e23f80_fk_core_imagetweet_id" FOREIGN KEY (tweet_id) REFERENCES core_imagetweet(id) DEFERRABLE INITIALLY DEFERRED
There is a many-to-many relationship between tweet_id and tag. Given a few tags (e.g. “dog”, “grass”, “frisbee”) I’d like to retrieve all tweet_ids which match all tags, sorted by score. Here’s my query:
SELECT *
FROM (SELECT
        COUNT(ort.tweet_id) AS numtags,
        ort.tweet_id,
        MIN(ort.score) AS minscore
      FROM (SELECT
              tg.tweet_id AS tweet_id,
              tg.tag AS tag,
              tg.score AS score
            FROM core_tagusage AS tg
            WHERE tg.tag = 'dog'
               OR tg.tag = 'grass'
               OR tg.tag = 'frisbee'
           ) AS ort
      GROUP BY ort.tweet_id) AS ct
WHERE numtags = 3
ORDER BY minscore DESC
LIMIT 200;
Even without the ORDER BY this is remarkably slow (~10-20 seconds). Here’s what I get from EXPLAIN ANALYZE:
Limit  (cost=0.56..22720.77 rows=200 width=33) (actual time=24177.117..24177.117 rows=0 loops=1)
  ->  GroupAggregate  (cost=0.56..1327311.85 rows=11685 width=25) (actual time=24177.115..24177.115 rows=0 loops=1)
        Group Key: tg.tweet_id
        Filter: (count(tg.tweet_id) = 3)
        Rows Removed by Filter: 493764
        ->  Index Scan using core_tagusage_cea90699 on core_tagusage tg  (cost=0.56..1321691.90 rows=547389 width=25) (actual time=0.206..23322.853 rows=501654 loops=1)
              Filter: (((tag)::text = 'dog'::text) OR ((tag)::text = 'grass'::text) OR ((tag)::text = 'frisbee'::text))
              Rows Removed by Filter: 24309855
Planning time: 0.237 ms
Execution time: 24177.166 ms
Is there a different query I could write to achieve the same results?
Edit: for comparison, here’s EXPLAIN ANALYZE for just the inner tag filter (with a similar set of tags):
EXPLAIN ANALYZE SELECT
  tg.tweet_id AS tweet_id,
  tg.tag AS tag,
  tg.score AS score
FROM core_tagusage AS tg
WHERE tg.tag = 'dog'
   OR tg.tag = 'grass'
   OR tg.tag = 'tree';
Bitmap Heap Scan on core_tagusage tg  (cost=295.51..44454.72 rows=13794 width=32) (actual time=8.253..105.051 rows=104691 loops=1)
  Recheck Cond: (((tag)::text = 'dog'::text) OR ((tag)::text = 'grass'::text) OR ((tag)::text = 'tree'::text))
  Heap Blocks: exact=1133
  ->  BitmapOr  (cost=295.51..295.51 rows=13797 width=0) (actual time=8.111..8.111 rows=0 loops=1)
        ->  Bitmap Index Scan on core_tagusage_tag_443351eb_uniq  (cost=0.00..95.06 rows=4599 width=0) (actual time=3.238..3.238 rows=38385 loops=1)
              Index Cond: ((tag)::text = 'dog'::text)
        ->  Bitmap Index Scan on core_tagusage_tag_443351eb_uniq  (cost=0.00..95.06 rows=4599 width=0) (actual time=2.217..2.217 rows=30477 loops=1)
              Index Cond: ((tag)::text = 'grass'::text)
        ->  Bitmap Index Scan on core_tagusage_tag_443351eb_uniq  (cost=0.00..95.06 rows=4599 width=0) (actual time=2.649..2.649 rows=35864 loops=1)
              Index Cond: ((tag)::text = 'tree'::text)
Planning time: 0.093 ms
Execution time: 187.032 ms
Answer:
I think you have unnecessary steps in your query; consider this:
SELECT *
FROM (SELECT
        COUNT(tag) AS numtags,
        tweet_id,
        MIN(score) AS minscore
      FROM core_tagusage
      WHERE tag = 'dog'
         OR tag = 'grass'
         OR tag = 'frisbee'
      GROUP BY tweet_id) AS ct
WHERE numtags = 3
ORDER BY minscore DESC
LIMIT 200;
and this even simpler form, which filters on the aggregate with HAVING instead of a wrapping subquery:
SELECT
  tweet_id,
  MIN(score) AS minscore
FROM core_tagusage
WHERE tag IN ('dog', 'grass', 'frisbee')
GROUP BY tweet_id
HAVING COUNT(*) = 3
ORDER BY minscore DESC
LIMIT 200;
I also think that running ANALYZE on your table will help the query planner; the per-tag row estimates in your bitmap plan (rows=4599 estimated vs. ~30-38K actual) are off by nearly an order of magnitude.
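A statistics refresh is a single statement (standard PostgreSQL; the only name assumed here is your table from the \d output above):

-- refresh planner statistics so the tag-filter row estimates improve
ANALYZE core_tagusage;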
After playing with some fabricated data locally, I was seeing ~600K rows take ~1 sec, so 10 seconds doesn’t seem unreasonable (though my data was so predictable that ANALYZE made the query planner just go for a sequential scan).
I can’t test this, since I don’t have your data, but you could also try the CTE version below – it’s possible the optimizer will produce the exact same plan (you can check with EXPLAIN ANALYZE, as sketched after the query). In any case, I think CTEs make queries much more readable, but they may affect your performance.
WITH ort AS (
  SELECT
    tg.tweet_id AS tweet_id,
    tg.tag AS tag,
    tg.score AS score
  FROM core_tagusage AS tg
  WHERE tg.tag = 'dog'
     OR tg.tag = 'grass'
     OR tg.tag = 'frisbee'
),
ct AS (
  SELECT
    COUNT(ort.tweet_id) AS numtags,
    ort.tweet_id,
    MIN(ort.score) AS minscore
  FROM ort
  GROUP BY ort.tweet_id
)
SELECT * FROM ct
WHERE numtags = 3
ORDER BY minscore DESC
LIMIT 200;
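If you want to see whether the optimizer actually treats the variants differently, the quickest check is to compare their plans. A minimal sketch, reusing the HAVING form from above (the ANALYZE and BUFFERS options to EXPLAIN are standard PostgreSQL):

-- compare plans of the different variants; BUFFERS also shows how much of the table is actually read
EXPLAIN (ANALYZE, BUFFERS)
SELECT
  tweet_id,
  MIN(score) AS minscore
FROM core_tagusage
WHERE tag IN ('dog', 'grass', 'frisbee')
GROUP BY tweet_id
HAVING COUNT(*) = 3
ORDER BY minscore DESC
LIMIT 200;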