Question :
I have a MySQL database table that references different words and their locations in documents. I want to return the IDs of documents that contain all of the words.
Here is an example table.
docid wordid
1 4
2 4
1 2
1 5
Ok, now say that someone queried the database for the words that had WORDIDs 4, 2, and 5.
My erroneous SQL SELECT statement would be something like:
Select docid from table where wordid = 4 and wordid = 2 and wordid = 5
This is giving me 0 results.
I have seen in other places where the where in
clause has been suggested:
If I understand correctly, this is another way to write an OR clause. I have tried this:
select docid from table where wordid in (4,2,5)
But, this is giving me all the results. It should exclude docid 2 as that does not contain the other words. I’m expecting to just get docid 1.
However, I could be using the where in
clause incorrectly as I have very little db experience.
How can I return docids that contain all of the words?
Please note as well, my where clause will be dynamically generated in a FOR loop. The query could be as simple as one or two words, or it could be 10 or 12 words. I’m looking for a query structure that takes speed into consideration. Please let me know if you need anymore information.
For reference, I am trying to convert this code into PHP / MYSQL, but I don’t understand the sql statement here or its equivalent in MYSQL:
Answer :
This is the relational division problem and there is a question about it at SO, with a lot of ways to write this query, plus performance analysis for PostgreSQL: How to filter SQL results in a has-many-through relation
Shamelessly copying code form there and removing/changing code for answers that have features lacking from MySQL, like CTEs, EXCEPT
, INTERSECT
, etc, here are a few ways to do this.
Assumptions:
- the table is called
factors
- there is a
UNIQUE
constraint on(wordid, docid)
- there is a
documents
and awords
table:
Easy to write, medium efficiency:
-- Query 1 -- by Martin
SELECT d.docid, d.docname
FROM document d
JOIN factors f USING (docid)
WHERE f.wordid IN (2, 4, 5)
GROUP BY d.docid
HAVING COUNT(*) = 3 ; -- number of words
Easy to write, medium efficiency:
-- Query 2 -- by Erwin
SELECT d.docid, d.docname
FROM documents d
JOIN (
SELECT docid
FROM factors
WHERE wordid IN (2, 4, 5)
GROUP BY docid
HAVING COUNT(*) = 3
) f USING (docid) ;
More complex to write, very good efficiency in Postgres – probably lousy in MySQL:
-- Query 4 -- by Derek
SELECT d.docid, d.docname
FROM documents d
WHERE d.docid IN (SELECT docid FROM factors WHERE wordid = 2)
AND d.docid IN (SELECT docid FROM factors WHERE wordid = 4);
AND d.docid IN (SELECT docid FROM factors WHERE wordid = 5);
More complex to write, very good efficiency in Postgres – and probably the same in MySQL:
-- Query 5 -- by Erwin
SELECT d.docid, d.docname
FROM documents d
WHERE EXISTS (SELECT * FROM factors
WHERE docid = d.docid AND wordid = 2)
AND EXISTS (SELECT * FROM factors
WHERE docid = d.docid AND wordid = 4)
AND EXISTS (SELECT * FROM factors
WHERE docid = d.docid AND wordid = 5) ;
More complex to write, very good efficiency in Postgres – and probably the same in MySQL:
-- Query 6 -- by Sean
SELECT d.docid, d.docname
FROM documents d
JOIN factors x ON d.docid = x.docid
JOIN factors y ON d.docid = y.docid
JOIN factors z ON d.docid = z.docid
WHERE x.wordid = 2
AND y.wordid = 4
AND z.wordid = 5 ;
Easy to write and extend to an arbitrary set of words
but not as efficient as the JOIN
and EXISTS
solutions:
-- Query 7 -- by ypercube
SELECT d.docid, d.docname
FROM documents d
WHERE NOT EXISTS (
SELECT *
FROM words AS w
WHERE w.wordid IN (2, 4, 5)
AND NOT EXISTS (
SELECT *
FROM factors AS f
WHERE f.docid = d.docid
AND f.wordid = w.wordid
)
);
Easy to write, not good efficiency:
-- Query 8 -- by ypercube
SELECT d.docid, d.docname
FROM documents d
WHERE NOT EXISTS (
SELECT *
FROM (
SELECT 2 AS wordid UNION ALL
SELECT 4 UNION ALL
SELECT 5
) AS w
WHERE NOT EXISTS (
SELECT *
FROM factors AS f
WHERE f.docid = d.docid
AND f.wordid = w.wordid
)
);
Enjoy testing them 🙂