How to insert/update millions of rows into SQL Server table(s) effectively?


Question :

Often in my job I have to create a procedure inside SQL Server that processes millions of rows, saves them into a temp (staging) table, and finally saves them into one or more tables in one or more databases.

I am not looking for alternative solutions, like SSIS.

I cannot disable the indexes or constraints, take the DB offline, change the recovery model, etc.

We have looked at scheduling this process to run when the systems are less busy, but we work in a 24/7/365 online retail environment.

There is a very similar question:
Performance Inserting and Updating Millions of rows into a table

This question is also relevant:
What is the fastest way to insert large numbers of rows?

Example One:

CREATE PROCEDURE [dbo].[udpstaging_page_import_fromFilter]
      @sourceDesc nvarchar(50) -- e.g. 'Coremetrics'
      ,@feedDesc nvarchar(50) -- e.g. 'Daily Exports'
      ,@process_job_task_logID bigint
AS BEGIN


      SET NOCOUNT ON;
      BEGIN TRY


            --truncate table prior INSERT
            exec dbo.udpstaging_page_truncateTable;

            declare @source_feedID int;
            exec crm_admin.dbo.udpsource_feedID_select @sourceDesc
                  ,@feedDesc
                  ,@source_feedID = @source_feedID OUTPUT;


            -- drop temp tables
            if exists (select * from tempdb.dbo.sysobjects o where o.xtype in ('U') and o.id = object_id(N'tempdb..#pageImport'))
                  drop table #pageImport;


            -- create temp tables
            create table #pageImport(
                  pageImportID [bigint] identity(1,1) NOT NULL
                  ,pageCode [varbinary](16) NOT NULL
            );


            insert into #pageImport(
                  pageCode
            )
            select pageCode
            from Coremetrics.PageView
            group by pageCode;


            -- add indexes to temp table
            CREATE CLUSTERED INDEX IDX_pageImport_pageImportID ON #pageImport(pageImportID);
            CREATE INDEX IDX_pageImport_pageCode ON #pageImport(pageCode);


            declare @recordCount bigint
                  ,@counter int
                  ,@maxCounter int
                  ,@updateRowCount int;

            select @counter = MIN(pageImportID)
                  ,@recordCount = MAX(pageImportID)
            from #pageImport;

            set @updateRowCount = 1000000;

            while @counter <= @recordCount
            begin

                  set @maxCounter = (@counter + @updateRowCount - 1);

                  with pageImport as (
                        select pv.pageCode
                              ,pv.websiteCode as 'pageCIV'
                              ,dbo.udfDerivePageName(pv.PAGE_ID, pv.CONTENT_CATEGORY_ID) as 'pageName'
                              ,dbo.udfDerivePageName(pv.PAGE_ID, pv.CONTENT_CATEGORY_ID) as 'pageDesc'
                              ,pv.[TIMESTAMP] as 'pageCreateDate'
                              ,pv.pageTypeCode
                              ,'' as 'pageTypeCIV'
                              ,pv.websiteCode
                              ,pv.marketID
                              ,@source_feedID as 'source_feedID'
                              ,@process_job_task_logID as 'process_job_task_logID'
                              ,GETDATE() as 'createdDate'
                              ,SUSER_NAME() as 'createdBy'
                              ,GETDATE() as 'modifiedDate'
                              ,SUSER_NAME() as 'modifiedBy'
                              ,ROW_NUMBER() over (
                                    PARTITION BY [pi].pageCode
                                    ORDER BY pv.[TIMESTAMP]
                              ) as 'is_masterPageImport'
                        from #pageImport [pi]
                        inner join Coremetrics.PageView pv on pv.pageCode = [pi].pageCode
                              and [pi].pageImportID between @counter and @maxCounter
                  )                 
                  insert into staging.[page](
                        pageCode
                        ,pageCIV
                        ,pageName
                        ,pageDesc
                        ,pageCreateDate
                        ,pageTypeCode
                        ,pageTypeCIV
                        ,websiteCode
                        ,marketID
                        ,source_feedID
                        ,process_job_task_logID
                        ,createdDate
                        ,createdBy
                        ,modifiedDate
                        ,modifiedBy
                  )
                  select pageCode
                        ,pageCIV
                        ,pageName
                        ,pageDesc
                        ,pageCreateDate
                        ,pageTypeCode
                        ,pageTypeCIV
                        ,websiteCode
                        ,marketID
                        ,source_feedID
                        ,process_job_task_logID
                        ,createdDate
                        ,createdBy
                        ,modifiedDate
                        ,modifiedBy 
                  from pageImport
                  where 1 = 1
                        and is_masterPageImport = 1;

                  set @counter = @counter + @updateRowCount;
            end;


            SET NOCOUNT OFF;
            RETURN 0;

      END TRY
      BEGIN CATCH
            print N'inner catch: ' + error_message();
            SET NOCOUNT OFF;
            RETURN -10;
      END CATCH

END;

Example Two:

This is just part of a stored procedure that is too big to post here.

IF OBJECT_ID('tempdb.dbo.#ztblOrgProductStockView', 'U') IS NOT NULL
    DROP TABLE #ztblOrgProductStockView;

CREATE TABLE #ztblOrgProductStockView (
    [lngID] [int] NOT NULL IDENTITY PRIMARY KEY,
    [sintMarketId] [smallint] NOT NULL,
    [sintChannelId] [smallint] NOT NULL,
    [strOrgVwName] [varchar](50) COLLATE Latin1_General_CI_AS NOT NULL,
    [tintSequence] [tinyint] NOT NULL,
    [tintOrgGrpId] [tinyint] NOT NULL,
    [strTier1] [varchar](20) COLLATE Latin1_General_CI_AS NOT NULL,
    [strTier2] [varchar](20) COLLATE Latin1_General_CI_AS NOT NULL,
    [strTier3] [varchar](20) COLLATE Latin1_General_CI_AS NOT NULL,
    [strTier4] [varchar](20) COLLATE Latin1_General_CI_AS NOT NULL,
    [strTier5] [varchar](20) COLLATE Latin1_General_CI_AS NOT NULL,
    [strTier6] [varchar](20) COLLATE Latin1_General_CI_AS NOT NULL,
    [strItemNo] [varchar](20) COLLATE Latin1_General_CI_AS NOT NULL,
    [strStockTypeName] [varchar](50) COLLATE Latin1_General_CI_AS NOT NULL,
    [tintStockTypeId] [tinyint] NOT NULL,
    [sintDueWithinDays] [tinyint] NOT NULL,
    [bitOverSellingAllowed] [bit] NOT NULL,
    [dtmStartDate] [datetime] NULL,
    [dtmEndDate] [datetime] NULL,
    [dtmExpected] [datetime] NULL,
    [blnIsLocalToMarket] [bit] NULL,
    [blnPremiunDelvAllowed] [bit] NULL,
    [strStdDeliveryDaysCode] [varchar](20)
)



INSERT into #ztblOrgProductStockView (
    sintMarketId
    ,sintChannelId
    ,strOrgVwName
    ,tintSequence
    ,tintOrgGrpId
    ,strTier1
    ,strTier2
    ,strTier3
    ,strTier4
    ,strTier5
    ,strTier6
    ,strItemNo
    ,strStockTypeName
    ,tintStockTypeId
    ,sintDueWithinDays
    ,bitOverSellingAllowed
    ,dtmStartDate
    ,dtmEndDate
    ,dtmExpected
    ,blnIsLocalToMarket
    ,blnPremiunDelvAllowed
    ,strStdDeliveryDaysCode
)
    select
        rv.sintMarketId
        ,rv.sintChannelId
        ,rv.strOrgVwName
        ,tintSequence
        ,tintOrgGrpId
        ,ISNULL(rv.pnTier1,'ÿ')
        ,ISNULL(rv.pnTier2,'ÿ')
        ,ISNULL(rv.pnTier3,'ÿ')
        ,ISNULL(rv.strTier4,'ÿ')
        ,ISNULL(rv.strTier5,'ÿ')
        ,ISNULL(rv.strTier6,'ÿ')
        ,rv.strItemNo
        ,strStockTypeName
        ,tintStockTypeId
        ,sintDueWithinDays
        ,bitOverSellingAllowed
        ,dtmStartDate
        ,dtmEndDate
        ,dtmExpected
        ,blnIsLocalToMarket
        ,blnPremiunDelvAllowed
        ,strStdDeliveryDaysCode
    from #ztblOrgProductRangeView_1 rv
    inner join #ztblOrgProductSeqView_1 sv on rv.strItemNo = sv.strItemNo
        and rv.lngOrgVwId = sv.lngOrgVwId
    --order by rv.sintMarketId, rv.sintChannelId, sv.tintOrgGrpId, rv.strItemNo, sv.tintStockTypeId

--set @DebugDate = convert(nvarchar(10),getdate(),108)
--raiserror('%s [%s]', 0, 1, N'Populated #ztblOrgProductStockView', @DebugDate) with nowait

--select [sintMarketId], [sintChannelId], [tintOrgGrpId], [strItemNo], [tintStockTypeId], count(*)
--from [#ztblOrgProductStockView]
--group by [sintMarketId], [sintChannelId], [tintOrgGrpId], [strItemNo], [tintStockTypeId]
--having count(*) > 1

    set @lngRowcount = @@ROWCOUNT
    set nocount on;

        While @lngRowcount > 0
        Begin
            Set @lngMinID = @lngMaxID
            Set @lngMaxID = @lngMaxID + 5000

            INSERT INTO [ztblOrgProductStockView]
                   ([sintActiveView]    
                   ,[sintMarketId]
                   ,[sintChannelId]
                   ,[strOrgVwName]  
                   ,[tintSequence]
                   ,[tintOrgGrpId]
                   ,[strTier1]
                   ,[strTier2]
                   ,[strTier3]
                   ,[strTier4]
                   ,[strTier5]
                   ,[strTier6]
                   ,[strItemNo]
                   ,[strStockTypeName]
                   ,[tintStockTypeId]
                   ,[sintDueWithinDays]
                   ,[bitOverSellingAllowed]
                   ,[dtmStartDate]
                   ,[dtmEndDate]
                   ,[dtmExpected]
                   ,[blnIsLocalToMarket]
                   ,[blnPremiunDelvAllowed]
                   ,[strStdDeliveryDaysCode])
            Select
                    @sintActiveView_new
                   ,[sintMarketId]
                   ,[sintChannelId]
                   ,[strOrgVwName]
                   ,[tintSequence]
                   ,[tintOrgGrpId]
                   ,[strTier1]
                   ,[strTier2]
                   ,[strTier3]
                   ,[strTier4]
                   ,[strTier5]
                   ,[strTier6]
                   ,[strItemNo]
                   ,[strStockTypeName]
                   ,[tintStockTypeId]
                   ,[sintDueWithinDays]
                   ,[bitOverSellingAllowed]
                   ,[dtmStartDate]
                    ,[dtmEndDate]
                   ,[dtmExpected]
                   ,[blnIsLocalToMarket]
                   ,[blnPremiunDelvAllowed]
                   ,[strStdDeliveryDaysCode]
            From #ztblOrgProductStockView
            Where lngID >= @lngMinID
            And   lngID <  @lngMaxID

            set @lngRowcount = @@ROWCOUNT

        End

Questions
Please note that opinion-based answers are not the most popular here; try to give evidence when possible.

1) How do I decide the best batch size? For example, in Example Two it is 5,000.

2) Would it generally improve performance if I BEGIN TRANSACTION and COMMIT TRANSACTION within the WHILE loop? One transaction per batch.

3) If I want to change the batch size, what should I monitor to decide whether I can increase it or whether I am causing I/O latency?

I currently find the I/O latency using the script below:

-- How to identify I/O latency issues
-- Below SQL code helps in identifying the I/O latency issues in a SQL Server system on a per-file basis.
-- http://sqlserverdbknowledge.wordpress.com/2011/11/08/how-to-identify-io-latency-issues/
--http://www.sqlskills.com/blogs/paul/how-to-examine-io-subsystem-latencies-from-within-sql-server/

--MARCELO MIORELLI 26-JULY-2013

SELECT 
--- virtual file latency
ReadLatency = CASE WHEN num_of_reads = 0
THEN 0 ELSE (io_stall_read_ms / num_of_reads) END,
WriteLatency = CASE WHEN num_of_writes = 0 
THEN 0 ELSE (io_stall_write_ms / num_of_writes) END,
Latency = CASE WHEN (num_of_reads = 0 AND num_of_writes = 0)
THEN 0 ELSE (io_stall / (num_of_reads + num_of_writes)) END,
--- avg bytes per IOP
AvgBPerRead = CASE WHEN num_of_reads = 0 
THEN 0 ELSE (num_of_bytes_read / num_of_reads) END,
AvgBPerWrite = CASE WHEN num_of_writes = 0 
THEN 0 ELSE (num_of_bytes_written / num_of_writes) END,
AvgBPerTransfer = CASE WHEN (num_of_reads = 0 AND num_of_writes = 0)
THEN 0 ELSE ((num_of_bytes_read + num_of_bytes_written) / 
(num_of_reads + num_of_writes)) END, 
LEFT (mf.physical_name, 2) AS Drive,
DB_NAME (vfs.database_id) AS DB,
--- vfs.*,
mf.physical_name
FROM sys.dm_io_virtual_file_stats (NULL,NULL) AS vfs
JOIN sys.master_files AS mf
ON vfs.database_id = mf.database_id
AND vfs.file_id = mf.file_id
--WHERE vfs.file_id = 2 -- log files
-- ORDER BY Latency DESC
-- ORDER BY ReadLatency DESC
ORDER BY WriteLatency DESC;
GO 

Answer :

Opinion-based answers are not the most popular here; try to give evidence when possible.

Well, that is not entirely fair given that ultimately, the “evidence” for what works best will come from your system ;-). Your hardware, your data, your system load, etc. will determine what works best. There are a lot of variables in both how you approach things and how your system behaves, so what works best in one system might not be so great in another.

1) How do I decide the best batch size? For example, in Example Two it is 5,000.

This is mostly a matter of trial-and-error to see what works best. However, it is important to keep in mind that lock escalation typically occurs at 5000 locks. Depending on how your data is organized, 5000 changes could be 5000 row locks or a few page locks. This is on a per-object basis since the ordering of the rows can be different across different indexes. The point is, changes made to tables that need to be usable by other processes during these operations should try to avoid table locks (i.e. the result of lock escalation). But tables such as the staging table and temp tables would benefit from table locks since it is a single operation and there should be no contention, hence the TABLOCK hint when doing a “bulk” INSERT.
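
For example, here is a minimal sketch of that TABLOCK idea against the temp-table load from Example 1; the table and column names are taken from the question, and the hint is the only change:

-- single load, nothing else reads #pageImport yet, so a table lock is harmless
-- and lets SQL Server use a bulk-style, minimally logged insert path where it can
insert into #pageImport with (tablock) (
      pageCode
)
select pageCode
from Coremetrics.PageView
group by pageCode;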

2) Would it generally improve performance if I BEGIN TRANSACTION and COMMIT TRANSACTION within the WHILE loop? One transaction per batch.

Wrapping multiple DML operations into an explicit transaction greatly helps performance when those operations are row-by-row. From the code you posted here, you are already doing set-based operations, so there is only a minor timing benefit from combining anything into a transaction. Also, in your WHILE loops in both examples you are doing a single INSERT operation, which is its own transaction, hence adding a transaction inside the WHILE loop gains nothing anyway. And adding an explicit transaction around the WHILE loop would put the entire set into a single transaction, which again might help a little on the timing, but you would then also have a huge transaction that would increase the chances of blocking as well as contribute to log file growth, since this transaction would be active for longer.
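
Just to illustrate what “one transaction per batch” would look like (all object names below are hypothetical placeholders, and, as noted, it changes essentially nothing when the batch body is a single INSERT):

declare @minID int = 1
      ,@maxID int
      ,@batchSize int = 5000
      ,@rowCount int = 1;

while @rowCount > 0
begin
      set @maxID = @minID + @batchSize - 1;

      begin transaction;

            -- dbo.liveTarget and #staging are hypothetical stand-ins for the real
            -- destination and staging tables; id is assumed to be a contiguous
            -- identity, as in Example Two
            insert into dbo.liveTarget (id, payload)
            select id, payload
            from #staging
            where id between @minID and @maxID;

            set @rowCount = @@ROWCOUNT;

      commit transaction;

      set @minID = @maxID + 1;
end;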

3) If I want to change the batch size, what should I monitor to decide whether I can increase it or whether I am causing I/O latency?

Monitor whether the process runs faster or slower. Try several different batch sizes and let each one run for several iterations of the process. Keep track of how long the process runs for each batch size and you will find what works best.
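
A minimal sketch of that kind of measurement, assuming a hypothetical logging table (dbo.batchSizeTimingLog is not part of your code):

-- hypothetical logging table for batch-size experiments
create table dbo.batchSizeTimingLog (
      logID int identity(1,1) not null primary key
      ,batchSize int not null
      ,startTime datetime2(3) not null
      ,endTime datetime2(3) not null
      ,durationMs as datediff(millisecond, startTime, endTime)
);

declare @batchSize int = 250000                 -- the value under test
      ,@start datetime2(3) = sysdatetime();

-- run one full pass of the batched load here, with @updateRowCount (Example 1)
-- or the 5000 literal (Example 2) replaced by @batchSize

insert into dbo.batchSizeTimingLog (batchSize, startTime, endTime)
values (@batchSize, @start, sysdatetime());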


Along these lines, I would at least try reducing the million-row batch size in Example 1.

I would also convert the scalar UDF dbo.udfDerivePageName into an Inline TVF and incorporate that into the query (Example 1) using CROSS APPLY or OUTER APPLY. And, considering that both calls to the UDF pass in the same two parameters, you would just reference the returned field twice (once as pageName and once as pageDesc) rather than having two calls to the iTVF.
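
A rough sketch of that shape; since the body of dbo.udfDerivePageName is not shown in the question, the parameter types and the expression inside the iTVF below are only placeholders:

-- hypothetical inline TVF replacing the scalar UDF; the SELECT expression is a
-- placeholder for whatever dbo.udfDerivePageName really does
create function dbo.tvfDerivePageName (
      @PAGE_ID int
      ,@CONTENT_CATEGORY_ID int
)
returns table
as
return
      select cast(@PAGE_ID as nvarchar(20)) + N'-'
             + cast(@CONTENT_CATEGORY_ID as nvarchar(20)) as pageName;
go

In the CTE from Example 1 the two scalar calls then collapse into a single APPLY, with the returned column referenced twice:

select pv.pageCode
      ,dpn.pageName as 'pageName'
      ,dpn.pageName as 'pageDesc'
      -- ...the remaining columns exactly as in the original CTE...
from #pageImport [pi]
inner join Coremetrics.PageView pv on pv.pageCode = [pi].pageCode
      and [pi].pageImportID between @counter and @maxCounter
cross apply dbo.tvfDerivePageName(pv.PAGE_ID, pv.CONTENT_CATEGORY_ID) as dpn;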

Another option to reduce contention on the destination table (if this is merely inserting new rows and not updating existing rows) is to use table partitioning. This would allow you to stage the data as you are currently doing, but then, rather than inserting that new data into the live table, you would SWITCH the new partition in, which is a rather quick operation. This won’t help with the time or I/O it takes to stage the data in the first place, but it could eliminate the time and contention taken by “merging” the staged data into the live table. It’s something to look into.
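
A minimal sketch of the SWITCH step only, with every name hypothetical: it assumes the live table is already partitioned on some batch or date key via a partition function pfBatchKey, and that the staging table has an identical structure (plus a check constraint matching the target range) on the same filegroup:

declare @newBatchKey date = '20130726'          -- hypothetical partitioning value for the newly staged data
      ,@targetPartition int;

-- map the batch key to the live table's partition number
-- (pfBatchKey is a hypothetical partition function on the live table)
select @targetPartition = $PARTITION.pfBatchKey(@newBatchKey);

-- metadata-only operation: the staged rows become the live partition
alter table dbo.page_staging
switch to dbo.page_live partition @targetPartition;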
