Question:
Imagine a very basic example of your average discussion board. For example:
CREATE TABLE threads (
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
title VARCHAR(100),
PRIMARY KEY (id)
)
CREATE TABLE replies (
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
thread_id INT UNSIGNED NOT NULL,
text TEXT NOT NULL,
PRIMARY KEY (id),
INDEX thread_id (thread_id)
)
There can be very lengthy discussions (people love bikeshedding!), maybe 100k or 200k replies per thread, and users can read them paginated (the number of replies per page is variable, depending on user preferences, but if needed for the solution it can be limited to a fixed set). These tables might have ~40 million replies and ~2 million threads.
So you might end up running this query to get the last replies of a thread:
SELECT * FROM replies
WHERE thread_id = 1234
ORDER BY id ASC
LIMIT 125400,10 /* whoops */
Which, as you know, is quite slow since MySQL has to walk 125,400 rows just to get there and return your 10 rows.
Hacky solutions I've thought of:
- Create a secondary index which assigns an incrementing number to each chunk of N posts (for example, a new field in the replies table which contains 1 for the first 1000 posts, 2 for the following 1000, etc.).
  - I'd have to heavily modify the application, since there are tons of queries that read the replies table; it's not just a simple SELECT here and there, and I really don't want to cripple and reengineer each query of the application.
  - It would force me to recalculate the numbers each time I delete a reply or perform other destructive operations (splitting, merging, etc.).
- For each link to the next page, attach the ID of the next post. That way the database can go directly to the row using the primary key of the replies table.
  - This is a web application, so this solution would have tricky SEO implications which I'd prefer not to deal with.
I might be dreaming here (and if so please do tell me!) but is there a solution that resides (almost) exclusively in the database and allows me to fix this problem without heavily modifying the application?
I’ve read about MySQL partitions, but I’m not sure they would help here.
Answer:
You can add a new column to replies, call it position, and fill it with consecutive numbers of replies per thread (the position of the reply in the thread).
For example:
id | thread_id | text | position
1  | 1         | .... | 1
2  | 2         | .... | 1
3  | 1         | .... | 2
4  | 1         | .... | 3
5  | 2         | .... | 2
6  | 3         | .... | 1
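As a rough sketch, the column could be added and backfilled like this (the DEFAULT 0 is only a placeholder until the backfill runs, and the backfill assumes MySQL 8.0+ for ROW_NUMBER(); older versions would need a user-variable approach instead):
ALTER TABLE replies ADD COLUMN position INT UNSIGNED NOT NULL DEFAULT 0;

/* One-time backfill: number the existing replies of each thread in id order */
UPDATE replies r
JOIN (
    SELECT id, ROW_NUMBER() OVER (PARTITION BY thread_id ORDER BY id) AS rn
    FROM replies
) numbered ON numbered.id = r.id
SET r.position = numbered.rn;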
Further, put an index on (thread_id, position, id), which allows you to write queries like
SELECT * FROM replies
WHERE thread_id = 1234
AND position BETWEEN 125401 AND 125410
ORDER BY id ASC
which runs fast, since MySQL can seek directly to the requested positions in the index instead of scanning the first 125,400 rows of the thread.
You can either update this column in your application, or write a database trigger to do this automatically.
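If you go the trigger route, a minimal sketch might look like this (assuming the position column above; note that the MAX()+1 lookup is not safe if two inserts into the same thread run concurrently, so a real implementation would need to serialize those inserts, e.g. with locking):
CREATE TRIGGER replies_set_position
BEFORE INSERT ON replies
FOR EACH ROW
SET NEW.position = (
    /* next free position within this thread */
    SELECT COALESCE(MAX(position), 0) + 1
    FROM replies
    WHERE thread_id = NEW.thread_id
);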
The initial effort is quite high, I admit. We used this trick a few years ago on a quite large table with a high write frequency, and as I said it took some effort to get it running, but once the solution was in place the performance gain was overwhelming.
Look at the query
SELECT * FROM replies
WHERE thread_id = 1234
ORDER BY id ASC
LIMIT 125400,10 /* whoops */
My guess is that you are forcing MySQL to gather the records with thread_id = 1234 and then sort the rows in id order. You need to give MySQL more help.
SUGGESTION #1: Use a Better Index
I would suggest changing the thread_id index as follows:
CREATE TABLE replies (
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
thread_id INT UNSIGNED NOT NULL,
text TEXT NOT NULL,
PRIMARY KEY (id),
INDEX thread_id_id (thread_id,id)
)
with these commands
ALTER TABLE replies DROP INDEX thread_id;
ALTER TABLE replies ADD INDEX thread_id_id (thread_id,id);
Even with such an index, the LIMIT 125400,10 will still cause an internal index scan. Nevertheless, because (thread_id,id) is already ordered by id per thread_id, the ORDER BY does not need to do any additional sorting.
SUGGESTION #2: Rewrite the Query (OPTIONAL)
Since you only need 10 rows returned from the query, you should refactor the query to retrieve the 10 keys first in a subquery. Then, do a LEFT JOIN to retrieve the needed rows. I use LEFT JOIN instead of an INNER JOIN to preserve the order of the subquery prior to the JOIN.
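A sketch of that refactoring for the example query (a "deferred join"; it assumes the (thread_id,id) index from SUGGESTION #1 so the subquery can be resolved from the index alone, and the derived-table alias ids is just illustrative):
SELECT r.*
FROM
(
    /* grab only the 10 primary keys, using the (thread_id,id) index */
    SELECT id
    FROM replies
    WHERE thread_id = 1234
    ORDER BY id ASC
    LIMIT 125400,10
) ids
LEFT JOIN replies r ON r.id = ids.id;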
CAVEAT #1: If you try SUGGESTION #2, you must do SUGGESTION #1 first.
CAVEAT #2: Refactoring the query may be worth it for a large dataset. Please read my old StackOverflow post Fetching a Single Row from Join Table for an example of adding an index and refactoring a query to retrieve 40 rows at the same speed no matter how many rows are in the table (I had to thoroughly explain in the comments why my answer was the fastest method among the 12 other answers; my answer got accepted and received the bounty).