Odd execution plan behaviour from variations in WHERE…IN(List)

Posted on

Question :

(reposted from stack-overflow by request)

I’ve been looking at trying to optimise (or at least change) some EF code in C# to use stored procedures, and found what seems to be an anomaly (or something new to me) when finding rows matching a constant list.
The typical manually-generated short query I’d use would be something like…

SELECT Something FROM Table WHERE ID IN (one, two, others);

We had an EF query that we were replacing with a stored procedure call, so I looked at the output, saw that it was complex and thought my simpler query (similar to the above) would be better. It wasn’t. Here is a quick demo that reproduces this, trying a couple of intermediate variations to see where the change-over in plan behaviour occurs.

Can anyone explain why the execution plans for the final version – with the

...WHERE EXISTS(... (SELECT 1 AS X) AS Alias UNION ALL...) AS Alias...)

construct is better – seemingly because it omits the costly SORT operation, even though the plan includes TWO index scans (or more with more list members) rather than the one of the simpler query.

Here’s a self-contained example script (I hope)…

USE SandBox;  -- a dummy database, not a live one!
-- create our dummy table, dropping first if it exists
IF EXISTS (SELECT NULL FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = 'Test')
    DROP TABLE Test;
CREATE TABLE Test (Id INT IDENTITY(1,1) NOT NULL PRIMARY KEY, FormId INT NOT NULL, DateRead DATE NULL);
-- populate with some data
INSERT INTO Test VALUES (1, NULL), (1, GETDATE()), (1, NULL), (4, NULL), (5, NULL), (6, GETDATE());

-- Simple query that I might typically use
-- how many un-read entries are there for a set of 'forms' of interest, 1, 5 and 6
-- (we're happy to omit forms with none)
SELECT  T.FormId, COUNT(*) AS TheCount
  FROM  Test AS T
 WHERE  T.FormId IN (1, 5, 6)
   AND  T.DateRead IS NULL
 GROUP BY T.FormId;

-- This is the first step towards the EF-generated code
-- using an EXISTS gives basically the same plan but with constants
SELECT  T.FormId, COUNT(*) AS TheCount
  FROM  Test T
 WHERE  EXISTS (    SELECT NULL
                      FROM (VALUES (1), (5), (6)
                            ) AS X(FormId) 
                     WHERE X.FormId = T.FormId
               )
   AND  T.DateRead IS NULL
 GROUP BY T.FormId;

-- A step closer, using UNION ALL instead of VALUES to generate the 'table'
-- still the same plan
SELECT  T.FormId, COUNT(*) AS TheCount
  FROM  Test T
 WHERE  EXISTS (    SELECT NULL
                      FROM (    SELECT 1 
                                UNION ALL
                                SELECT 5 
                                UNION ALL
                                SELECT 6 
                            ) AS X(FormId) 
                     WHERE X.FormId = T.FormId
               )
   AND  T.DateRead IS NULL
 GROUP BY T.FormId;

-- Now what the EF actually generated (cleaned up a bit)
-- Adding in the "FROM (SELECT 1 as X) AS alias" changes the execution plan considerably and apparently costs less to run
SELECT  T.FormId, COUNT(*) AS TheCount
  FROM  Test T
 WHERE  EXISTS (    SELECT NULL
                      FROM (    SELECT 1 FROM (SELECT 1 AS X) AS X1
                                UNION ALL
                                SELECT 5 FROM (SELECT 1 AS X) AS X2
                                UNION ALL
                                SELECT 6 FROM (SELECT 1 AS X) AS X3
                            ) AS X(FormId) 
                     WHERE X.FormId = T.FormId
               )
   AND  T.DateRead IS NULL
 GROUP BY T.FormId;

Can anyone help me to understand why and (importantly for my learning) if there is a benefit in wider use for using this kind of query format?

I looked around for anything special in (SELECT 1 AS X) stuff and though many show it as being common in EF output, I couldn’t see anything about this particular apparent benefit.

Answer :

For the query that you have I don’t see a measurable performance advantage or disadvantage to adding the extraneous derived tables. I’m testing against SQL Server 2017 but I can easily reproduce the behavior that you’re seeing. I’m going to use a slightly simpler reproduction:

CREATE TABLE Test (FormId INT NOT NULL);

INSERT INTO Test
SELECT TOP (2580) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

-- query 1
SELECT  T.FormId, COUNT(*) AS TheCount
 FROM  Test T
 WHERE  EXISTS (    SELECT NULL
                      FROM (    SELECT 1
                                UNION ALL
                                SELECT 2
                            ) AS X(FormId) 
                     WHERE X.FormId = T.FormId
               )
GROUP BY T.FormId;

-- query 2
SELECT  T.FormId, COUNT(*) AS TheCount
  FROM  Test T
 WHERE  EXISTS (    SELECT NULL
                      FROM (    SELECT 1 FROM (SELECT 1 x) AS X1 -- added
                                UNION ALL
                                SELECT 2
                            ) AS X(FormId) 
                     WHERE X.FormId = T.FormId
               )
GROUP BY T.FormId;

With 2580 rows in the table I get different plans as you did. However, it takes just five logical reads to read all data from the table. For both queries I wasn’t able to get SET STATISTICS IO, TIME ON to output anything other than:

CPU time = 0 ms, elapsed time = 0 ms.

The query plans may be different, but the amount of data in the table is so tiny that it doesn’t matter.

Now I’m going to truncate the table and insert 10 more rows than before.

TRUNCATE TABLE Test;

INSERT INTO Test
SELECT TOP (2590) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2

OPTION (MAXDOP 1);

With that minor change I get effectively the same plans for both queries. Both queries do one scan of the data and one sort:

same plans

As far as I can tell the plan that does multiple scans is not a legal plan option for the first query. I was unable to force it with the USE XML hint, but I could have made a mistake. The second query is able to choose between both plans on a cost basis. The query rule responsible for the multiple scan approach appears to be JNonUNIA. You can see this in action with the undocumented query hint QUERYRULEOFF JNonUNIA. The Google does not give me any results when searching for this optimizer rule, but based on the description it’s an abbreviation for “Join Union All -> Union All Join”. In other words, take a join to a UNION ALL and switch it to a UNION ALL of joins. That’s exactly what we see in the query plan with multiple scans.

I understand that you’re using an EF so you don’t have complete control over the code. But just because you see a certain plan today doesn’t mean that you’ll get that plan in all future versions of SQL Server, especially for something as obscure as what you’re observing. If for some reason you need to pick one plan over the other for a query I recommend being as direct as possible. For example, in the current version of SQL Server the following query appears to always result in a single scan:

SELECT
  FormId
, COUNT(*) AS TheCount
FROM Test T
WHERE FormId IN (1, 2)
GROUP BY FormId;

And the following query seems to always result in two scans:

SELECT
  1 AS FormId
, COUNT(*) AS TheCount
FROM Test T
WHERE FormId = 1

UNION ALL

SELECT
  2 AS FormId
  , COUNT(*) AS TheCount
FROM Test T
WHERE FormId = 2
OPTION (MAXDOP 1);

Of course, that behavior isn’t guaranteed either. It still feels more reliable, if that makes sense.

It is possible to think of modified queries in which using the UNION ALL approach will give a performance boost. Suppose that you have a million rows in the table and you want to put any rows with a FormId between 1 and 600000 into bucket 1 and any rows with a FormId between 600001 and 1000000 into bucket 2.

The following query takes around 200 ms to execute on my machine:

SELECT
  CASE WHEN FormId <= 600000 THEN 1 ELSE 2 END FormBucket
, COUNT(*) AS TheCount
FROM Test T
WHERE FormId >= 1 AND FormId <= 1000000
GROUP BY CASE WHEN FormId <= 600000 THEN 1 ELSE 2 END
OPTION (MAXDOP 1);

Here’s the plan:

bad plan

However, this query executes in around 100 ms:

SELECT
  1 AS FormBucket
, COUNT(*) AS TheCount
FROM Test T
WHERE FormId >= 1 AND FormId <= 600000

UNION ALL

SELECT
  2 AS FormId
, COUNT(*) AS TheCount
FROM Test T
WHERE FormId >= 600001 AND FormId <= 1000000
OPTION (MAXDOP 1);

Here’s the plan:

good plan

Leave a Reply

Your email address will not be published. Required fields are marked *