Question :
I cant seem to figure out what the tsql should be to do the following.
I would like to query Tweets table with all tweets that have category 1 and 2 (could be more than 2 categories passed)
So the query should return tweet 1, and 3 for the following inserts
USE tempdb
CREATE TABLE [dbo].[Tweets]([Id] [bigint] NOT NULL)
CREATE TABLE [dbo].[TweetCategories]([TweetId] [bigint] NOT NULL, [CategoryId] [int] NOT NULL)
INSERT INTO Tweets(Id) VALUES(1)
INSERT INTO Tweets(Id) VALUES(2)
INSERT INTO Tweets(Id) VALUES(3)
INSERT INTO Tweets(Id) VALUES(4)
INSERT INTO TweetCategories(TweetId, CategoryId) VALUES(1, 1)
INSERT INTO TweetCategories(TweetId, CategoryId) VALUES(1, 2)
INSERT INTO TweetCategories(TweetId, CategoryId) VALUES(2, 2)
INSERT INTO TweetCategories(TweetId, CategoryId) VALUES(2, 3)
INSERT INTO TweetCategories(TweetId, CategoryId) VALUES(3, 1)
INSERT INTO TweetCategories(TweetId, CategoryId) VALUES(3, 2)
INSERT INTO TweetCategories(TweetId, CategoryId) VALUES(3, 3)
DROP TABLE tweets
DROP TABLE TweetCategories
Efficiency is important as my tweets will have over 2M records and Categories could also be in few millions
Answer :
Easiest way, if it are always exactly two IDs:
SELECT *
FROM dbo.Tweets AS t
WHERE EXISTS (SELECT * FROM dbo.TweetCategories AS tc WHERE tc.CategoryId = 1 AND tc.TweetId = t.Id)
AND EXISTS (SELECT * FROM dbo.TweetCategories AS tc WHERE tc.CategoryId = 2 AND tc.TweetId = t.Id)
If the categories to query are flexible (number or CategoryID), you may want something as:
DROP TABLE IF EXISTS #categories_to_query
CREATE TABLE #categories_to_query (CategoryId INT NOT NULL PRIMARY KEY);
INSERT INTO #categories_to_query (CategoryId)
VALUES (1), (2)
DECLARE @ctq INT = (SELECT COUNT(*) FROM #categories_to_query AS ctq)
SELECT t.*
FROM dbo.Tweets AS t
INNER JOIN (SELECT tc.TweetId
FROM dbo.TweetCategories AS tc
INNER JOIN #categories_to_query AS ctq
ON ctq.CategoryId = tc.CategoryId
GROUP BY tc.TweetId
HAVING COUNT(*) = @ctq
) AS cat
ON cat.TweetId = t.Id
Depending, where / how the categories to query are “submitted”, you could either pass it as table variable to your procedure (drawback: you have to create a type for this), pass it as simple string list (‘1,2’ in your example) and split the string before inserting into #categories_to_query or make it quick and dirty by declaring e.g. 10 parameters to your procedure to allow up to 10 different categories to be filtered.
Regarding performance: you should have an index on CategoryId, TweetId (in this order!) on the dbo.TweetCategories
Since the list size is not gonna be the same for every execution, you could try Passing array parameters to a stored procedure. Here’s one way of doing it:
CREATE TYPE id_list AS TABLE (
id int NOT NULL PRIMARY KEY
);
GO
CREATE PROCEDURE [dbo].[selected_categories] (
@id_list id_list READONLY
)
AS
BEGIN
SELECT TC.TweetId
FROM TweetCategories TC
INNER JOIN @id_list CL ON TC.CategoryId = CL.id
GROUP BY TC.TweetId
HAVING COUNT(1) = (SELECT COUNT(1) FROM @id_list);
END;
GO
DECLARE @category_list id_list;
INSERT INTO @category_list (id)
VALUES (1), (2);
EXECUTE [dbo].[selected_categories]
@id_list = @category_list;
GO
DROP PROCEDURE selected_categories;
DROP TYPE id_list;
GO
For this approach to work properly I’m assuming (TweetId, CategoryId) are a composite primary key, although your sample code does not show it.