I’m looking for a strategy for extracting data from a SQL Server database that is configured for change data capture (CDC) using SSIS.
Most of the documentation I have seen around using CDC with SSIS covers simple scenarios: one source table configured for CDC, a conditional split to separate the inserts, updates, and deletes, and then the corresponding actions.
My current extract consists of queries that join multiple tables together and then feed the result into the transform and load. Changes could happen in any of those tables, not just one.
One approach could be to look for changes only on the base table and do lookups or merges on the other tables. However, that would miss a change to a supporting table when there is no corresponding change in the base table.
Another option would be to do individual extracts on all tables involved and then do the joining in the transformation layer. But again, there could be times when one or more tables hasn't changed, so nothing is available for the join.
What I’m leaning towards is one of these two options:
- Fully separate each table into its own ETL that then performs multiple inserts and updates on the destination table. For example, process inserts only for the base table of the query, then process updates for all the supporting tables after the base table has been populated.
- Store all previous extracts for a table in a versioned directory in the spirit of a “data lake” and then look up the related record from a previous version.
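The first option can be sketched in T-SQL. This is only an illustration, assuming hypothetical change tables staged from CDC (`stg.OrderChanges`, `stg.CustomerChanges`) and a destination `dw.FactOrder`; none of these names come from the actual schema:

```sql
-- Step 1: process inserts from the base table's change set.
INSERT INTO dw.FactOrder (OrderID, CustomerID, OrderDate, CustomerName)
SELECT c.OrderID, c.CustomerID, c.OrderDate, NULL   -- supporting columns filled in step 2
FROM   stg.OrderChanges AS c
WHERE  c.__$operation = 2;                          -- CDC operation code 2 = insert

-- Step 2: once the base rows exist, apply supporting-table changes as updates,
-- touching every destination row that references the changed supporting row.
UPDATE f
SET    f.CustomerName = cc.CustomerName
FROM   dw.FactOrder AS f
JOIN   stg.CustomerChanges AS cc
       ON cc.CustomerID = f.CustomerID
WHERE  cc.__$operation IN (2, 4);                   -- 2 = insert, 4 = update (after image)
```

The key property is ordering: supporting-table updates run after the base-table inserts, so a changed customer row still lands even when no order changed in the same window.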
Is there a strategy that I’m not aware of?
I have used the CDC functions to extract data in the past. My plan was to try to use the SSIS CDC components themselves if possible.
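For reference, the kind of pull I mean with the CDC functions, assuming a capture instance named `dbo_Orders` (illustrative, not my real table):

```sql
DECLARE @from_lsn binary(10), @to_lsn binary(10);

-- Interval boundaries for this pull; in practice @from_lsn would come from a
-- saved high-water mark rather than the minimum LSN of the capture instance.
SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_Orders');
SET @to_lsn   = sys.fn_cdc_get_max_lsn();

-- Net changes collapse multiple changes to a row into one final image.
SELECT __$operation, OrderID, CustomerID, OrderDate
FROM   cdc.fn_cdc_get_net_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');
```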
My main reason for looking into this is to increase the frequency of our loads to be on an hourly basis instead of a nightly basis. Using CDC on my biggest tables would reduce the load, and reduce the time needed. It would still be ideal to only pull the data that has changed no matter the size.
We also have the ability to add a read only secondary to our AG to scale out reads similar to what Dave describes in his answer. However, the time taken to pull all the data out would still be the same even though the same load wouldn’t be on the source system.
This appears to be your stumbling block in relation to most of the demos of SSIS CDC:
> My current extract consists of queries that join multiple tables together and then feed the result into the transform and load. Changes could happen in any of those tables, not just one.
As such, you’re on the right track with your first proposed option. For your join issues – it seems that you either have data that changes in tandem or a data set that shouldn’t implement CDC. You should probably reduce the number of CDC-tracked tables and do a full load from the others. At the least, you could come up with a custom incremental filter that makes sense for your JOIN requirements while also reducing your data set size. Something like:
```sql
SELECT * FROM InfoTable WHERE Date >= '2016-01-01';
```
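A slightly more robust version of that incremental filter keeps a per-table high-water mark instead of a hard-coded literal. The watermark table and column names here are illustrative, not an existing structure:

```sql
-- etl.Watermark holds one row per extracted source table.
DECLARE @last datetime2 =
    (SELECT LastLoadedAt FROM etl.Watermark WHERE TableName = 'InfoTable');

-- Pull only rows modified since the previous successful load.
SELECT *
FROM   InfoTable
WHERE  ModifiedDate >= @last;

-- After the load succeeds, advance the watermark for the next run.
UPDATE etl.Watermark
SET    LastLoadedAt = SYSUTCDATETIME()
WHERE  TableName = 'InfoTable';
```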
The typical pattern for data warehouse ETL is to move data table by table from the source into a Staging database that replicates the source table structures nearly 1:1 (only for the tables you actually use, and skipping properties such as IDENTITY that would break your inserts/transfer). Following this pattern should solve at least part of the CDC issue.
From Staging to DataWarehouse you should have your joins, business logic, clean-up, etc. in the ETL. One of the benefits of this pattern, besides simplifying incremental source-system pulls, is the reduced load on LoB/OLTP systems. You get in and out of the source system as fast as possible and leave the majority of the ETL burden on the SSIS/DW server(s).
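With every table staged, the multi-table join happens entirely on the SSIS/DW side, never against the OLTP source. Roughly (all names illustrative):

```sql
-- Staging-to-DW step: join the staged copies, not the live source tables.
INSERT INTO dw.FactOrder (OrderID, OrderDate, CustomerName, ProductName)
SELECT o.OrderID,
       o.OrderDate,
       c.CustomerName,
       p.ProductName
FROM   stg.Orders    AS o
JOIN   stg.Customers AS c ON c.CustomerID = o.CustomerID
JOIN   stg.Products  AS p ON p.ProductID  = o.ProductID;
```

Because each staging table holds a consistent copy, a supporting table that didn’t change in the last window is still present for the join.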
You could set up a variety of incremental load patterns across all your tables. Moving ETL logic and resource use out of the source systems becomes even more important when you’re running that frequently. If you’re trying to meet an hourly load window, there may be other options besides incremental loads – even if your time is consumed by data transfer (e.g. network settings).