Question :
I am writing a table valued function to geocode an address based on its Suburb, State and Postcode. I try to geocode the address using different methods, in order of decreasing accuracy:
- Exact match on unique suburb-postcode-state combination
- Exact match on unique suburb-postcode combination
- Exact match on unique suburb-state combination
- Exact match on unique postcode
- Approximate match by non-unique Postcode, where all Suburbs with this Postcode are within 5 km of one another.
(I am working with a geographical region where Suburb-Postcode-State relationships are all many-to-many. In other words, one Suburb can have multiple postcodes; one postcode can have multiple suburbs and may exist in different states.)
Following is an extract from the table-valued function:
ALTER FUNCTION [geocode].[tvfn_Customer_Suburb_From_Address]
(
@Suburb NVARCHAR(100),
@State NVARCHAR(100),
@Postcode NVARCHAR(100),
@Country NVARCHAR(100)
)
RETURNS TABLE
AS
RETURN
(
SELECT TOP 1 *
FROM (
-- Unique suburb-postcode-state combinations
SELECT s.Suburb_DID
,s.Suburb
,s.State
,s.Postcode
,Geocode_DID = 4 -- Exact match by unique Postcode, Suburb and State
,s.Geocode_Latitude
,s.Geocode_Longitude
FROM geocode.tSuburbs_XX s
INNER JOIN [geocode].[tGeocode_Methods] gm
ON s.Geocode_DID = gm.Geocode_DID
WHERE s.[Is_Active] = 1
AND s.[Suburb] = @Suburb
AND s.[State] = @State
AND s.[Postcode] = @Postcode
-- Only suburbs that are geocoded with methods that can be used for geocoding customers
AND gm.[Can_Use_For_VIP] = 1
UNION ALL
-- -- Unique suburb-postcode combinations
SELECT s.Suburb_DID
,s.Suburb
,s.State
,s.Postcode
,Geocode_DID = 3 -- Exact match by unique Postcode & Suburb
,s.Geocode_Latitude
,s.Geocode_Longitude
FROM geocode.tSuburbs_XX s
INNER JOIN [geocode].[tGeocode_Methods] gm
ON s.Geocode_DID = gm.Geocode_DID
WHERE EXISTS ( SELECT *
FROM geocode.tSuburbs_XX
WHERE Is_Active = 1
AND Suburb = s.Suburb AND Postcode = s.Postcode
GROUP BY Postcode, Suburb
HAVING COUNT(*) = 1
)
AND s.Is_Active = 1
AND s.[Suburb] = @Suburb
AND s.[Postcode] = @Postcode
-- Only suburbs that are geocoded with methods that can be used for geocoding customers
AND gm.[Can_Use_For_VIP] = 1
UNION ALL
-- Exact match by unique Suburb and State
SELECT s.Suburb_DID
,s.Suburb
,s.State
,s.Postcode
,Geocode_DID = 6 -- Exact match by unique Suburb and State
,s.Geocode_Latitude
,s.Geocode_Longitude
FROM geocode.tSuburbs_XX s
INNER JOIN [geocode].[tGeocode_Methods] gm
ON s.Geocode_DID = gm.Geocode_DID
WHERE EXISTS ( SELECT *
FROM geocode.tSuburbs_XX
WHERE Is_Active = 1 AND Is_PO_Box = 0 -- Exclude PO Boxes
AND Suburb = s.Suburb AND Postcode = s.Postcode
GROUP BY Suburb, Postcode
HAVING COUNT(*) = 1
)
AND s.Is_Active = 1
AND s.[Suburb] = @Suburb
AND s.[Postcode] = @Postcode
-- Only suburbs that are geocoded with methods that can be used for geocoding customers
AND gm.[Can_Use_For_VIP] = 1
UNION ALL
-- Exact match by unique Postcode
SELECT s.Suburb_DID
,s.Suburb
,s.State
,s.Postcode
,Geocode_DID = 2 -- Exact match by unique Postcode
,s.Geocode_Latitude
,s.Geocode_Longitude
FROM geocode.tSuburbs_XX s
INNER JOIN [geocode].[tGeocode_Methods] gm
ON s.Geocode_DID = gm.Geocode_DID
WHERE EXISTS ( SELECT *
FROM geocode.tSuburbs_XX
WHERE Is_Active = 1
AND Postcode = s.Postcode
GROUP BY Postcode
HAVING COUNT(*) = 1
)
AND s.Is_Active = 1
AND s.[Postcode] = @Postcode
-- Only suburbs that are geocoded with methods that can be used for geocoding customers
AND gm.[Can_Use_For_VIP] = 1
-- Perform this extra check to make sure we don't match a postcode in a wrong country
AND ( @Country IN ('AAA', 'BBB', 'CCC')
OR @State IN ('MMM', 'NNN', 'OOO', 'PPP')
)
UNION ALL
-- Approximate match by non-unique Postcode, where all Suburbs with this Postcode are within 5 km of one another.
SELECT s.Suburb_DID
,s.Suburb
,s.State
,s.Postcode
,Geocode_DID = 5
,s.Geocode_Latitude
,s.Geocode_Longitude
FROM [geocode].[tPostcode_Distances] pd
INNER JOIN geocode.tSuburbs_XX s
ON pd.Approx_Suburb_DID = s.Suburb_DID
INNER JOIN [geocode].[tGeocode_Methods] gm
ON s.Geocode_DID = gm.Geocode_DID
WHERE s.Is_Active = 1
AND pd.[Postcode] = @Postcode
-- Only suburbs that are geocoded with methods that can be used for geocoding customers
AND gm.[Can_Use_For_VIP] = 1
-- Perform this extra check to make sure we don't match a postcode in a wrong country
AND ( @Country IN ('AAA', 'BBB', 'CCC')
OR @State IN ('MMM', 'NNN', 'OOO', 'PPP')
)
AND pd.Max_Distance <= 5000 -- within 5 km
) t
)
The above function works, but I would like to know if it can be improved. In particular, is it possible to force SQL Server to stop processing the SELECT
statements after the first SELECT
that returns a result-set (because we’re interested in the first matching result only – TOP 1
)?
Update
Thank you for suggestions thus far. I will add a [Priority]
column as suggested in several answers and comments and an ORDER BY
clause to ensure I get the top result.
I will also add a WITH SCHEMABINDING
to the TVFN so SQL Server can parallelise the plan. While we’re on this subject, the using a multi-statement table-valued function is a good idea (Thank you to Paul White), but multi-statement TVFN’s always force a serial plan.
I will now try Lennart’s answer, in which he suggested using a CTE.
Answer :
If you must use a single query (as required by a single inline function), you can use one of the two options below (illustrated in my recent answer to Relating 2 tables with possible wildcards?):
Option 1
Use multiple APPLY
clauses with a startup condition for each using an outer reference from a previous apply in the chain. The efficiency of this method depends on the presence of startup filters in the execution plan. Correct results are guaranteed, but plan shape is not.
Option 2
Add an extra column with a constant literal to each clause of the union e.g. [Priority] = 1
then add an ORDER BY [Priority] ASC
at the TOP (1)
scope. Efficient operation depends on the plan avoiding sorts.
On reflection, this is not what you want in this case, because one row from each option is required by the merge concatenation in the plan. Nevertheless, it is an option in more general situations (where the alternate inputs produce more than one row, and the first row at low cost).
In addition:
Option 3
Since you’re only returning a single row, you could use a multi-statement table-valued function instead, with explicit logic to try each option in order (in separate queries), returning as soon as the first result is found. This is guaranteed to produce correct results efficiently.
Note
The current function is tecnhically nondeterministic; SQL Server could evaluate the union all in any order it chooses, potentially returning a lower-priority result before evaluating a higher priority one.
I don’t know sql-server well enough to state whether it works this way or not, but in theory there is no way you can say that one part of the union is evaluated before another. I.e. even though you have a perfect match, you may still end up with an approximate one.
You can however force this behavior by adding a priority to each part of the union and order by that. Something like:
SELECT TOP 1 *
FROM (
-- Unique suburb-postcode-state combinations
SELECT 1 as prio
,s.Suburb_DID
,s.Suburb
,s.State
,s.Postcode
,Geocode_DID = 4 -- Exact match by unique Postcode, Suburb and State
,s.Geocode_Latitude
,s.Geocode_Longitude
FROM geocode.tSuburbs_XX s
INNER JOIN [geocode].[tGeocode_Methods] gm
ON s.Geocode_DID = gm.Geocode_DID
WHERE s.[Is_Active] = 1
AND s.[Suburb] = @Suburb
AND s.[State] = @State
AND s.[Postcode] = @Postcode
-- Only suburbs that are geocoded with methods that can be used for geocoding customers
AND gm.[Can_Use_For_VIP] = 1
UNION ALL
-- -- Unique suburb-postcode combinations
SELECT 2 as prio
,s.Suburb_DID
,s.Suburb
,s.State
[...]
) t
order by prio
Now you will get – one of – the row with the smallest prio in return. Note that the DBMS may still evaluate other options so there is no guarantee that it will perform better.
Another idea is to pipeline the parts via CTE’s in the order of preference:
with t1 as (
-- Unique suburb-postcode-state combinations
SELECT 1 as prio
,s.Suburb_DID
,s.Suburb
,s.State
,s.Postcode
,Geocode_DID = 4 -- Exact match by unique Postcode, Suburb and State
,s.Geocode_Latitude
,s.Geocode_Longitude
FROM geocode.tSuburbs_XX s
INNER JOIN [geocode].[tGeocode_Methods] gm
ON s.Geocode_DID = gm.Geocode_DID
WHERE s.[Is_Active] = 1
AND s.[Suburb] = @Suburb
AND s.[State] = @State
AND s.[Postcode] = @Postcode
-- Only suburbs that are geocoded with methods that can be used for geocoding customers
AND gm.[Can_Use_For_VIP] = 1
), t2 as (
- -- Unique suburb-postcode combinations
SELECT 2 as prio
,s.Suburb_DID
,s.Suburb
,s.State
[...]
WHERE NOT EXISTS ( SELECT 1 FROM T1 )
), t3 as (
[...]
WHERE NOT EXISTS ( SELECT 1 FROM T2 )
)
select * from t1
union all
select * from t2
union all
[...]
The optimizer may be smart enough to understand where to stop. It is at least obligated to evaluate t1 before t2 and so forth.
Combine all queries using LEFT OUTER JOINS
and use the resulting columns in reverse order as parameters of COALESCE function.
This function will evaluate all the parameters from left to right and take the first not null value.
you can populate the table and check for rows
CREATE FUNCTION dbo.F1 (@int int)
RETURNS @table table (col1 int, col2 varchar(10))
WITH EXECUTE AS CALLER
AS
BEGIN
--insert into @table values (@int, 'one');
if((select count(*) from @table) > 0)
return;
insert into @table values (@int, 'two');
return;
END;
GO
SELECT *
FROM dbo.F1(1);
Thank you to Paul White, Lennart and Kunal Chitkara for their answers.
Also, thanks to ypercubeᵀᴹ for being the first to point out that ORDER BY
is required even when a SELECT TOP 1 *
query consists of multiple UNION
‘ed subqueries.
4 Approaches
I compared performance of the query written in 4 different ways:
- Method 1
Ordered SELECT TOP 1 from UNION’ed subqueries, which has this form:
SELECT TOP 1 *
FROM (
subquery1
UNION ALL
subquery2
UNION ALL
...
subquery7
) t
ORDER BY [Preference]
- Method 2
Chained CTE’s followed by SELECT FROM UNION’ed subqueries from CTE’s, which has this form:
WITH s AS (sharedQuery),
cte1 AS (subquery1),
cte1 AS (subquery2
WHERE NOT EXISTS (SELECT 1 FROM q1)
), ...
cte7 AS (subquery7
WHERE NOT EXISTS (SELECT 1 FROM cte1 UNION ALL ... SELECT 1 FROM cte6)
)
SELECT *
FROM (
cte1
UNION ALL
cte2
UNION ALL
...
cte7
) t
- Method 3
SELECT FROM COALESCE’ed fields based on LEFT JOIN’ed subqueries and a CTE, which has this form:
WITH s AS (sharedQuery),
SELECT COALESCE(q1.A, q2.A, ... q7.A)
FROM (
subquery1
) q1 LEFT JOIN (
subquery2
) q2 ON 1=1
...
LEFT JOIN (
subquery7
) q7 ON 1=1
- Method 4
SELECT FROM COALESCE’ed fields based on OUTER APPLY’ed subqueries and a CTE, which has this form:
WITH s AS (sharedQuery),
SELECT COALESCE(q1.A, q2.A, ... q7.A)
FROM (
subquery1
) q1 OUTER APPLY (
subquery2
) q2
...
OUTER APPLY (
subquery7
) q7
As performance was the main reason for this exercise, it was desirable that the queries be written in a way that allows SQL Server query optimizer to produce a parallel plan. I needed to be aware that:
- the TVFN is an inline one, rather than a multi-statement one
- the TVFN must contain
WITH SCHEMABINDING
constraint, which means that all objects (tables, views, etc) referenced by the TVFN must reside in the same database - using
TOP
forces the zone (the [sub]query to whichTOP
applies) to be run serially - calculated fields that are based on scalar UDF force the whole plan to be run serially
Test method
-- Generate test data by loading some random addresses into a temporary table
SELECT TOP 1000 *
INTO #Addresses
FROM geocode.tDelivery_Address_Geocodes
ORDER BY NEWID()
-- Switch on statistics before running queries
SET STATISTICS TIME ON
SET STATISTICS IO ON
-- Clear cache before running each test
DBCC FREEPROCCACHE
-- Use a SELECT and OUTER APPLY to run the TVFN under test:
SELECT *
FROM #Addresses a
OUTER APPLY [geocode].[tvfn_Customer_Geocode_From_Address2](a.Suburb, a.TownCityState, a.PostCode, a.Country) t
Test results
==========================================================================
| | CPU time| Elapsed | | Est Subtree| Cached | CPU Parse &|
| Method | (s) | time (s)| DOP| Cost | plan size| Compile (s)|
|------------------------------------------------------------------------|
| Method 1 | 16.427 | 16.508 | 1 | 50.4757 | 256 KB | 0.186 |
| Method 2 | 86.908 | 86.996 | 1 | 1540.59 | 4048 KB | 5.241 |
| Method 3*| 28.641 | 28.802 | 0 | 254.32 | 352 KB | 0.422 |
| Method 3 | 32.088 | 15.675 | 4 | 254.506 | 360 KB | 0.422 |
| Method 4 | 24.710 | 24.878 | 1 | 205.314 | 304 KB | 0.344 |
==========================================================================
Note:
- DOP = Degree of Parallelism
- Methods marked with a star* were executed with
OPTION (MAXDOP 1)
We can see from the results that Method 1 was the fastest, but it will always generate a serial plan (because it uses TOP
in the outer-most query).
Method 3 was slower when executed in a serial plan (OPTION MAXDOP 1
), but was faster on a multi-CPU machine when executed in a parallel plan. Depending on the number of CPU’s available on the server, this method may be the best option, especially if the query is run as part of a larger query that would also benefit from parallelism.
Method 4 was also a good candidate, but I couldn’t get SQL Server to generate a parallel plan for it.
Method 2 was by far the slowest.