I need to read data from a linked server and insert it into a local table. I also need to remove duplicates, and I want to do that on the local server, because the remote server is overloaded. So I added a DISTINCT clause, which produces the Distinct Sort I want.
The problem is that the Remote Scan operator always estimates the number of rows as 10,000, while the real number of rows is around 3M. As a result, the sort spills to disk and becomes slow.
Is there a way to hint to the optimizer that the real number of rows is much larger than 10K?
Should I load the raw data into a local staging table and then run the DISTINCT off the local table? I didn’t want to write to disk twice.
The number of duplicate rows is small: a few hundred out of 3M. That is, before the duplicates are removed there are ~3,000,000 rows; after they are removed there are ~2,999,800 rows. So removing the duplicates on the remote server would not noticeably reduce the amount of data transferred over the network.
The destination table is truncated before the insert, so I’m always inserting into an empty table. The destination table also doesn’t have any indexes, triggers or constraints, but it does have many columns: about 110. This is the query:
```sql
WITH CTE_Raw AS
(
    SELECT [ManyManyColumns]
    FROM OPENQUERY([remote_server], '
        SELECT [ManyManyColumns] FROM [DB].[dbo].[remote_view]
    ')
),
CTE_Converted AS
(
    SELECT DISTINCT [ManyManyColumns]
    FROM CTE_Raw
)
INSERT INTO [dbo].[TestVBFast2] ([ManyManyColumns])
SELECT [ManyManyColumns]
FROM CTE_Converted;
```
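For comparison, the staging-table alternative I’m considering would look roughly like this (a sketch only; `[ManyManyColumns]` again stands for the real column list, and the staging table name is made up):

```sql
-- Pull the raw rows across the linked server once, into a local heap.
-- SELECT ... INTO creates the staging table for us.
SELECT [ManyManyColumns]
INTO   [dbo].[Staging_Raw]
FROM   OPENQUERY([remote_server], '
    SELECT [ManyManyColumns] FROM [DB].[dbo].[remote_view]
');

-- The DISTINCT now runs against a local table, where SQL Server can
-- build real statistics, so the sort should get a sensible memory
-- grant instead of spilling on a 10,000-row estimate.
INSERT INTO [dbo].[TestVBFast2] ([ManyManyColumns])
SELECT DISTINCT [ManyManyColumns]
FROM   [dbo].[Staging_Raw];

DROP TABLE [dbo].[Staging_Raw];
```

The obvious downside is exactly what I wanted to avoid: the 3M rows get written to disk twice, once into the staging table and once into the destination.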
SQL Server version:
Microsoft SQL Server 2012 (SP4) (KB4018073) - 11.0.7001.0 (X64) Aug 15 2017 10:23:29 Copyright (c) Microsoft Corporation Standard Edition (64-bit) on Windows NT 6.3 (Build 9600: ) (Hypervisor)
I want to perform the DISTINCT locally, because the remote server is overloaded and I want to reduce its load. The DISTINCT will remove only a few hundred rows out of 3M, so the amount of data transferred over the network won’t change much.
I assume [ManyManyColumns] is really multiple columns and not one column? I see your comment states it’s actually 110.
10,000 rows is the default cardinality estimate for a Remote Scan operator in your version of SQL Server, so I don’t think you can do much to change the estimate itself, unfortunately.
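One partial mitigation you could try (this treats the symptom, not the estimate): since SQL Server 2012 SP3, the MIN_GRANT_PERCENT query hint lets you request a larger minimum memory grant, which may keep the Distinct Sort from spilling even though the Remote Scan still estimates 10,000 rows. A sketch, with a percentage you’d need to tune:

```sql
-- MIN_GRANT_PERCENT raises the floor of the memory grant (as a
-- percentage of the default maximum grant). It does NOT fix the
-- 10,000-row estimate; it only gives the sort more memory to work in.
INSERT INTO [dbo].[TestVBFast2] ([ManyManyColumns])
SELECT DISTINCT [ManyManyColumns]
FROM OPENQUERY([remote_server], '
    SELECT [ManyManyColumns] FROM [DB].[dbo].[remote_view]
')
OPTION (MIN_GRANT_PERCENT = 50);
```

Be careful with this on a shared server: oversized grants can starve concurrent queries of workspace memory, so test the percentage rather than guessing high.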
How slow is slow, currently? Keep in mind that even with perfectly accurate cardinality estimates, 3 million rows is always going to be a lot of data to pipe across the network / linked server, especially with that many columns.
The only general ideas I have at the moment are to either pre-stage the DISTINCT data on your remote server, or to use a data-synchronization feature such as replication to copy the data to your local server instead of using a linked server. If I think of anything else, I’ll update my answer accordingly.
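To sketch the pre-staging idea (all object names here are placeholders, and the remote step would ideally run in an off-peak scheduled job so it doesn’t add to the load you’re worried about):

```sql
-- On the remote server, in an off-peak job:
-- materialize the de-duplicated rows once into a persisted table.
TRUNCATE TABLE [DB].[dbo].[remote_view_deduped];

INSERT INTO [DB].[dbo].[remote_view_deduped] ([ManyManyColumns])
SELECT DISTINCT [ManyManyColumns]
FROM [DB].[dbo].[remote_view];
```

```sql
-- On the local server: the DISTINCT (and the local sort) disappear
-- entirely; this is just a straight pull and insert.
INSERT INTO [dbo].[TestVBFast2] ([ManyManyColumns])
SELECT [ManyManyColumns]
FROM OPENQUERY([remote_server], '
    SELECT [ManyManyColumns] FROM [DB].[dbo].[remote_view_deduped]
');
```

Since only a few hundred of the 3M rows are duplicates, the win here isn’t network volume; it’s that the expensive sort happens once, on a schedule you control, instead of inside the load query.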
My suggestion is to switch to an SSIS package, which can handle this better: it buffers rows at both the source and the destination, and is typically faster than a linked server.
Please take a look at this: