Query for when comparing to multiple child records

Posted on

Question :

I cant seem to figure out what the tsql should be to do the following.

I would like to query Tweets table with all tweets that have category 1 and 2 (could be more than 2 categories passed)

So the query should return tweet 1, and 3 for the following inserts

USE tempdb
CREATE TABLE [dbo].[Tweets]([Id] [bigint] NOT NULL)
CREATE TABLE [dbo].[TweetCategories]([TweetId] [bigint] NOT NULL, [CategoryId] [int] NOT NULL)

INSERT INTO Tweets(Id) VALUES(1)
INSERT INTO Tweets(Id) VALUES(2)
INSERT INTO Tweets(Id) VALUES(3)
INSERT INTO Tweets(Id) VALUES(4)

INSERT INTO TweetCategories(TweetId, CategoryId) VALUES(1, 1)
INSERT INTO TweetCategories(TweetId, CategoryId) VALUES(1, 2)

INSERT INTO TweetCategories(TweetId, CategoryId) VALUES(2, 2)
INSERT INTO TweetCategories(TweetId, CategoryId) VALUES(2, 3)

INSERT INTO TweetCategories(TweetId, CategoryId) VALUES(3, 1)
INSERT INTO TweetCategories(TweetId, CategoryId) VALUES(3, 2)
INSERT INTO TweetCategories(TweetId, CategoryId) VALUES(3, 3)


DROP TABLE tweets
DROP TABLE TweetCategories

Efficiency is important as my tweets will have over 2M records and Categories could also be in few millions

Answer :

Easiest way, if it are always exactly two IDs:

SELECT *
  FROM dbo.Tweets AS t
 WHERE EXISTS (SELECT * FROM dbo.TweetCategories AS tc WHERE tc.CategoryId = 1 AND tc.TweetId = t.Id)
   AND EXISTS (SELECT * FROM dbo.TweetCategories AS tc WHERE tc.CategoryId = 2 AND tc.TweetId = t.Id)

If the categories to query are flexible (number or CategoryID), you may want something as:

DROP TABLE IF EXISTS #categories_to_query
CREATE TABLE #categories_to_query (CategoryId INT NOT NULL PRIMARY KEY);
INSERT INTO #categories_to_query (CategoryId)
VALUES (1), (2)
DECLARE @ctq INT = (SELECT COUNT(*) FROM #categories_to_query AS ctq)

SELECT t.*
  FROM dbo.Tweets AS t
 INNER JOIN (SELECT tc.TweetId
               FROM dbo.TweetCategories AS tc
              INNER JOIN #categories_to_query AS ctq
                 ON ctq.CategoryId = tc.CategoryId
              GROUP BY tc.TweetId
              HAVING COUNT(*) = @ctq
            ) AS cat
   ON cat.TweetId = t.Id

Depending, where / how the categories to query are “submitted”, you could either pass it as table variable to your procedure (drawback: you have to create a type for this), pass it as simple string list (‘1,2’ in your example) and split the string before inserting into #categories_to_query or make it quick and dirty by declaring e.g. 10 parameters to your procedure to allow up to 10 different categories to be filtered.

Regarding performance: you should have an index on CategoryId, TweetId (in this order!) on the dbo.TweetCategories

Since the list size is not gonna be the same for every execution, you could try Passing array parameters to a stored procedure. Here’s one way of doing it:

CREATE TYPE id_list AS TABLE (
    id int NOT NULL PRIMARY KEY
);
GO

CREATE PROCEDURE [dbo].[selected_categories] (
    @id_list    id_list READONLY
)
AS
BEGIN

    SELECT TC.TweetId
    FROM TweetCategories TC
        INNER JOIN @id_list CL ON TC.CategoryId = CL.id
    GROUP BY TC.TweetId
    HAVING COUNT(1) = (SELECT COUNT(1) FROM @id_list);

END;
GO

DECLARE @category_list id_list;
    
INSERT INTO @category_list (id)
VALUES (1), (2);

EXECUTE [dbo].[selected_categories]
    @id_list = @category_list;
GO

DROP PROCEDURE selected_categories;
DROP TYPE id_list;
GO

For this approach to work properly I’m assuming (TweetId, CategoryId) are a composite primary key, although your sample code does not show it.

Leave a Reply

Your email address will not be published.