I’ve got a data warehouse that goes through a full refresh each night that can take about an hour to process 16 million rows/25gigs of data and we’re looking for ways to reduce this time without going the incremental approach.
The basic format of our queries is as below, only I’ve stripped out about 20 more joins and 30+ more columns that would also be included. The stripped out columns and joins are very straightforward with no aggregation, subqueries, or other types of calculation involved. What’s left is the main fact table (First_Source_Table) and the most problematic datapoint to collect. Second_Source_Table consists of many records for each Account_ID, but we only want to include the first record for each Account_ID.
Now my constraints. This in a replicated environment on SQL Server 2008. Unfortunately I have no control over the source tables, and while I can add new indexes on them, they will be lost the next day. I’ve tried calculating an in-between table off of Second_Source_Table before I do the full-table, but as that would need to be re-calculated each night, it didn’t have a material impact on the overall calculation time.
The code below works, but if you look at the execution plan and IO Stats, the logic associated with Second_Source_Table constitutes about 80% of all resources used, but changing this field to NULL only cuts execution time in half. I’ll also point out again that being a replicated environment, there are no issues to worry about with locking or other users writing to the tables we’re in.
INSERT INTO New_Table SELECT First_Source_Table.Account_ID, ( select top 1 Second_Source_Table.Code FROM Second_Source_Table WHERE Second_Source_Table.Account_ID = First_Source_Table.Account_ID ORDER BY Second_Source_Table.ID ) as Code FROM First_Source_Table
You may want to consider partitioning instead of a scalar query.
So something like
insert into New_Table select [fst].Account_ID, [sst].Code from First_Source_Table as [fst] inner join (select row_number() over( partition by Account_ID order by Account_ID ) as [topN], Account_ID, Code from Second_Source_Table) as [sst] on ( [sst].Account_ID = [fst].Account_ID ) where ( [topN] = 1 ) --This is your topN query