I built the following SQL Server query, but it is encountering the anti-semi join defect in SQL Server 2005 which results in inaccurate cardinality estimates (1 — urgh!) and runs forever. Since it is a longstanding production SQL Server I can’t easily suggest to upgrade versions, and as such I cannot force the traceflag 4199 hint on this specific query.
I’m having a hard time refactoring the
WHERE AND NOT IN (SELECT). Can anyone care to help? I’ve made sure to try and use the best joins based on clustered key pairs.
SELECT TOP 5000 d.doc2_id ,d.direction_cd ,a.address_type_cd ,d.external_identification ,s.hash_value ,d.publishdate ,d.sender_address_id AS [D2 Sender_Address_id] ,a.address_id AS [A Address_ID] ,d.message_size ,d.subject ,emi.employee_id FROM assentor.emcsdbuser.doc2 d(NOLOCK) INNER JOIN assentor.emcsdbuser.employee_msg_index emi(NOLOCK) ON d.processdate = emi.processdate AND d.doc2_id = emi.doc2_id INNER LOOP JOIN assentor.emcsdbuser.doc2_address a(NOLOCK) ON emi.doc2_id = a.doc2_id AND emi.address_type_cd = a.address_type_cd AND emi.address_id = a.address_id INNER JOIN sis.dbo.sis s(NOLOCK) ON d.external_identification = s.external_identification WHERE d.publishdate > '2008-01-01' **AND d.doc2_id NOT IN ( SELECT doc2_id FROM assentor.emcsdbuser.doc2_address d2a(NOLOCK) WHERE d.doc2_id = d2a.doc2_id AND d2a.address_type_cd = 'FRM' )** OPTION (FAST 10)
Note that the
Employee_MSG_Index table is 500m rows,
doc2 is 1.5b rows,
SIS is ~500m rows.
Any help would be appreciated!
Since it is a longstanding production SQL Server I can’t easily suggest to upgrade versions
The anti semi join cardinality estimation bug is reproducible on all versions of SQL Server from 2005 to 2012 inclusive. All require trace flag 4199 to enable the fix, so upgrading would not solve your problem without activating 4199 (though there are many other good reasons to upgrade from 2005, of course).
…as such I cannot force the traceflag 4199 hint on this specific query.
If it is just one particular query that is affected, you can use
OPTION (QUERYTRACEON 4199) to enable the trace flag for just that query. This query hint is documented and supported for use with 4199, and applies from SQL Server 2005 Service Pack 2 onward.
This hint effectively runs
DBCC TRACEON (4199) and
DBCC TRACEOFF (4199) around the query, and requires sysadmin permission as a result. If that is a problem, add the hint using a plan guide.
You should also look at testing your whole system with 4199 enabled instance-wide. Plan regressions are possible, but overall you may find that the various optimizer fixes enabled by this flag are well worth it. All future plan-affecting query processor fixes require this flag to activate.
All that said…
As mentioned in ypercube’s answer, the bug requires two or more join columns to manifest (among many details). The redundancy in your
NOT IN clause causes the optimizer to see two column comparisons (though logically there is only one), thereby exposing the bug.
Removing this redundancy will ‘solve’ the problem for this particular query, though other queries that really do have more than one join predicate will still be vulnerable.
To illustrate, here is an example based on the CSS blog post linked in the question (but with a complete script!):
CREATE TABLE dbo.tst_TAB1 ( c1 integer NOT NULL, c2 integer NOT NULL, c3 integer NOT NULL ); CREATE TABLE dbo.tst_TAB2 ( c1 integer NOT NULL, c2 integer NOT NULL, c3 integer NOT NULL ); CREATE INDEX i ON dbo.tst_TAB1 (c1, c2); CREATE INDEX i ON dbo.tst_TAB2 (c1, c2);
INSERT dbo.tst_TAB1 (c1, c2, c3) SELECT number, number, number FROM master.dbo.spt_values WHERE [type] = N'P' AND number BETWEEN 1 AND 2047; INSERT dbo.tst_TAB2 (c1, c2, c3) VALUES (1, 1, 1);
Test query using
NOT IN with redundant predicate:
SELECT T1.c1 FROM tst_TAB1 AS t1 WHERE t1.c1 NOT IN ( SELECT t2.c1 FROM tst_TAB2 AS t2 -- This is redundant! WHERE t2.c1 = t1.c1 );
The estimated execution plan shows an estimate of 1 row after the anti semi join:
Side note: In fact this is an example of another (rare) bug. Writing the
WHERE clause as
t1.c1 = t2.c1 instead of
t2.c1 = t1.c1 allows the optimizer to see that the two join predicates are in fact the same, and the bug does not manifest.
The same query with
OPTION (QUERYTRACEON 4199):
SELECT T1.c1 FROM tst_TAB1 AS t1 WHERE t1.c1 NOT IN ( SELECT t2.c1 FROM tst_TAB2 AS t2 WHERE t2.c1 = t1.c1 ) OPTION (QUERYTRACEON 4199);
The estimated execution plan now shows an estimate of 2046 rows, which is exactly right:
We can also remove the redundant predicate:
SELECT T1.c1 FROM tst_TAB1 AS t1 WHERE t1.c1 NOT IN ( SELECT t2.c1 FROM tst_TAB2 AS t2 );
The execution plan happens to use an additional unrelated optimization (the Stream Aggregate), but the important point is that the post-join estimate is correct without having to enable 4199:
Multiple anti semi join columns
It is possible to express an anti semi join over multiple columns using
NOT IN syntax. These cases will require 4199. For example, the next query joins on
SELECT T1.c1 FROM tst_TAB1 AS t1 WHERE t1.c1 NOT IN ( SELECT t2.c1 FROM tst_TAB2 AS t2 WHERE t2.c2 = t1.c2 );
The execution plan shows the erroneous 1-row estimate:
With 4199, the issue is resolved:
SELECT T1.c1 FROM tst_TAB1 AS t1 WHERE t1.c1 NOT IN ( SELECT t2.c1 FROM tst_TAB2 AS t2 WHERE t2.c2 = t1.c2 ) OPTION (QUERYTRACEON 4199);
NOT IN in this way is best avoided, not least for the reasons mentioned in Books Online:
That issue with
NOT IN and
NULLs has been written about many times. There are many alternative syntaxes available, of which
NOT EXISTS is my personal preference. Note that changing syntax will not avoid the cardinality estimation bug:
SELECT T1.c1 FROM dbo.tst_TAB1 AS t1 WHERE NOT EXISTS ( SELECT 1 FROM dbo.tst_TAB2 AS t2 WHERE t2.c1 = t1.c1 AND t2.c2 = t1.c2 );
That two column anti semi join produces the 1-row estimate, and requires 4199 to fix it. The execution plans are exactly the same as seen previously, so I will not repeat them. The
NOT EXISTS syntax does avoid the
NULLs issue with
I agree with ypercube’s other observations.
NOLOCKhints over every table in a query is a bad code smell. If the query can truly tolerate
READ UNCOMMITTEDtransaction semantics, set the isolation level explicitly.
ORDER BYis another sign of poor code.
ORDER BYclause to define what
TOPmeans. Never rely on observed behaviour, use an explicit top-level
ORDER BYto obtain a guarantee.
INNER LOOP JOINand join hints in general, imply a
FORCE ORDERquery hint. This severely limits the optimizer’s freedom, and is usually misunderstood and misapplied. Never use hints that you don’t fully understand.
The link you provided says that the bug affects only joins with more than one column:
Note that you only experience this issue when multiple joining columns are involved in the join as the example above.
And I can’t understand why you have written the
NOT IN in this way (adding the
d.doc2_id = d2a.doc2_id condition in the subquery.) It’s redundant (unless the joined columns are nullable – are they?), so you could write the
NOT IN as:
AND d.doc2_id NOT IN ( SELECT d2a.doc2_id FROM assentor.emcsdbuser.doc2_address d2a WHERE d2a.address_type_cd = 'FRM' )
AND NOT EXISTS ( SELECT 1 FROM assentor.emcsdbuser.doc2_address d2a WHERE d.doc2_id = d2a.doc2_id AND d2a.address_type_cd = 'FRM' )
Try both and check if the cardinality estimate issue is cured.
- Do you have an index on
- Why the multiple use
ORDER BYmay give you different results per execution.