Question :
I’m wondering if someone could help me optimize my tables/query to speed up a query. It is currently running ridiculously slow. I think a well-thought out index could help me. Any help would be really appreciated
Tables URLS and TAGS mentioned below are 2 and 20 million rows respectively (will probably end up having 10x). A query like the one below already takes 10 seconds to run.
An Example:
http://whatrethebest.com/php+tutorials
Tables
CREATE TABLE IF NOT EXISTS `TAGS` (
`hash` varchar(255) NOT NULL,
`tag` varchar(255) NOT NULL,
UNIQUE KEY `my_unique_key` (`hash`,`tag`),
KEY `tag` (`tag`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
and
CREATE TABLE IF NOT EXISTS `URLS` (
`url` text NOT NULL,
`domain` text,
`title` text NOT NULL,
`description` text,
`numsaves` int(11) NOT NULL,
`firstsaved` varchar(256) DEFAULT NULL,
`md5` varchar(255) NOT NULL DEFAULT '',
PRIMARY KEY (`md5`),
UNIQUE KEY `md5` (`md5`),
KEY `numsaves` (`numsaves`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
QUERY
SELECT urls.md5, urls.url, urls.title, urls.numsaves
FROM urls
JOIN tags ON urls.md5 = tags.hash
WHERE tags.tag
IN (
'php', 'tutorials'
)
GROUP BY urls.md5
HAVING COUNT( * ) =2
ORDER BY urls.numsaves DESC
LIMIT 20
EXPLAIN
I’m not sure what this shows
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE tags range my_unique_key,tag tag 767 NULL 230946 Using where; Using index; Using temporary; Using filesort
1 SIMPLE urls eq_ref PRIMARY,md5 PRIMARY 767 jcooper_whatrethebest_urls.tags.hash 1
So I think the problem is:
certain tags like ‘php have 34,000 entries, most of which only have under 5 saves. But in order to get the 20 most saved it is having to sort them all.Right?
I can’t really create a ‘numsaves’ column in TAGS and index on that because that number will be changing up and down, and that wouldnt make sense. Is it possible to create a cross-table index between urls.numsaves and tags.tag? Or a third table to use in my query somehow? Would this solve my problem? I know almost nothing about indexing.
Any help would be really appreciated!
EDITS BELOW
RESPONSE TO YperCube:
Thank you, Thank you, your suggestions have sped up my queries by a factor of 10-20X .
This is an immense improvement. I can’t thank you enough.
I’m posting my current queries and tables with execution times in case you or anyone else has any more optimization suggestions. I am worried that as my table grows I may not be able to keep my search times under 3 seconds, which would be a killer.
New Query Example 1
SELECT u.id, u.url, u.title, u.numsaves
FROM urls AS u
JOIN tags AS t1 ON t1.url_id = u.id
AND t1.tag = 'programming'
JOIN tags AS t2 ON t2.url_id = u.id
AND t2.tag = 'language'
ORDER BY u.numsaves DESC
LIMIT 20
Showing rows 20 - 19 ( 20 total, Query took 0.2009 sec)
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t2 ref tag_id tag_id 767 const 53820 Using where; Using index; Using temporary; Using filesort
1 SIMPLE t1 ref tag_id tag_id 772 const,jcooper_whatrethebest_urls.t2.url_id 1 Using where; Using index
1 SIMPLE u eq_ref PRIMARY,id_numsaves_IX PRIMARY 4 jcooper_whatrethebest_urls.t2.url_id 1
Neq Query Example 2 (seems to be slower)
SELECT u.id, u.url, u.title, u.numsaves
FROM urls AS u
JOIN
( SELECT ui.id, ui.numsaves
FROM urls AS ui
JOIN tags AS t1 ON t1.url_id = ui.id
AND t1.tag = 'programming'
JOIN tags AS t2 ON t2.url_id = ui.id
AND t2.tag = 'language'
ORDER BY ui.numsaves DESC
LIMIT 20
) AS ulim ON ulim.id = u.id
ORDER BY ulim.numsaves DESC ;
Showing rows 0 - 29 ( 2,794,577 total, Query took 0.4633 sec)
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 20 Using filesort
1 PRIMARY u eq_ref PRIMARY,id_numsaves_IX PRIMARY 4 ulim.id 1
2 DERIVED t2 ref tag_id tag_id 767 53820 Using where; Using index; Using temporary; Using filesort
2 DERIVED t1 ref tag_id tag_id 772 jcooper_whatrethebest_urls.t2.url_id 1 Using where; Using index
2 DERIVED ui eq_ref PRIMARY,id_numsaves_IX PRIMARY 4 jcooper_whatrethebest_urls.t2.url_id 1
Using Query Example on a Single Tag (slower by a lot)
SELECT u.id, u.url, u.title, u.numsaves
FROM urls AS u
JOIN tags AS t1 ON t1.url_id = u.id
AND t1.tag = 'programming'
ORDER BY u.numsaves DESC
LIMIT 20
Showing rows 20 - 19 ( 20 total, Query took 3.7395 sec)
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t1 ref tag_id tag_id 767 const 200576 Using where; Using index; Using temporary; Using filesort
1 SIMPLE u eq_ref PRIMARY,id_numsaves_IX PRIMARY 4 jcooper_whatrethebest_urls.t1.url_id 1
I’m not sure why this one is so much slower?
Do you have any ideas of a query to optimize for querying a single tag?
My Current Tables
CREATE TABLE `urls` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`url` text NOT NULL,
`domain` text,
`title` text NOT NULL,
`description` text,
`numsaves` int(11) NOT NULL,
`firstsaved` varchar(256) DEFAULT NULL,
`md5` varchar(255) NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
UNIQUE KEY `md5` (`md5`),
KEY `id_numsaves_IX` (`id`,`numsaves`)
) ENGINE=InnoDB AUTO_INCREMENT=2958560 DEFAULT CHARSET=utf8
CREATE TABLE `tags` (
`url_id` int(11) DEFAULT NULL,
`hash` varchar(255) NOT NULL,
`tag` varchar(255) NOT NULL,
UNIQUE KEY `tag_id` (`tag`,`url_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Thank you again
Answer :
Add an index on (tag, hash)
ALTER TABLE tags
DROP INDEX tag,
ADD UNIQUE INDEX tag_hash_UX
(tag, `hash`) ;
Then try this query:
SELECT u.md5, u.url, u.title, u.numsaves
FROM urls AS u
JOIN tags AS t1 ON t1.hash = u.md5
AND t1.tag = 'php'
JOIN tags AS t2 ON t2.hash = u.md5
AND t2.tag = 'tutorials'
ORDER BY u.numsaves DESC
LIMIT 20 ;
and after adding another index:
ALTER TABLE urls
DROP INDEX md5,
ADD INDEX md5_numsaves_IX
(md5, numsaves) ;
try this variation (explanation below, at point 3):
SELECT u.md5, u.url, u.title, u.numsaves
FROM urls AS u
JOIN
( SELECT ui.md5, ui.numsaves
FROM urls AS ui
JOIN tags AS t1 ON t1.hash = ui.md5
AND t1.tag = 'php'
JOIN tags AS t2 ON t2.hash = ui.md5
AND t2.tag = 'tutorials'
ORDER BY ui.numsaves DESC
LIMIT 20
) AS ulim ON ulim.md5 = u.md5
ORDER BY ulim.numsaves DESC ;
Other things you should consider:
- You have 2 identical indexes on
URLS (md5)
, one Primary and one Unique. You don’t need both, you can safely delete the redundant (unique) index. -
Why was
md5
was chosen as the primary key in the first place? Why not a surrogate (auto-incrementing) integer column? That is usually a better choice for primary key in InnODB tables because the primary key is also by default the clustered index of the table.If you make that change (add an
url_id
column and make it PK), you will also need to alter theTAGS
table by adding aurl_id
as well and removing thehash
column. This would help efficiency, too. Indexes onint
columns take much less space than indexes onVARCHAR(255)
columns. 4 bytes vs. 757 bytes wide is a lot of difference. -
Read this blog post about text and blob columns in InnoDB: Handling long texts/blobs in InnoDB If the issue turns out to be the text columns, you could split the table into two tables, having an 1-1 relationship between them, so text columns are separately from the narrow columns (or try the index I suggested, which is a similar fix for this.)