Can Transactional Replication and triggers be used for near-realtime ETL [closed]

Question :

We want to have a data warehouse synced with our production OLTP database in near realtime. Is it possible to do this with Transactional Replication and triggers?

How we currently ETL
We use SQL Server hosted on Azure. We have an OLTP database for our application and keep a second SQL Server database as a data warehouse for reporting and modeling. Currently we orchestrate a nightly ETL (ELT, really) process with Azure Data Factory. The nightly process syncs reference tables in the data warehouse with the OLTP app database and calls heavy stored procedures to rebuild all the denormalized tables required for reporting and modeling. The nightly process takes a long time.

How we think it could work better in near real time
The proposal is to select the reference tables we want to sync from the OLTP database to the data warehouse and do a one-way sync with SQL Server's built-in transactional replication. On the data warehouse we would then put triggers for changes (insert/update/delete) on those reference tables. The triggers would insert a modification record into a table on the warehouse that acts as a queue tracking changes to reference-table data. We would then build a process to watch that queue. As new modification records come in, the process would see which table was changed and for which primary key, and then update all the denormalized tables that depend on that combination of reference table and primary key. We are unsure whether we would write those denormalization procedures as stored procedures in SQL Server or as a separate .NET application; either seems like an option.
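As a rough sketch of the trigger-plus-queue idea (all names here are hypothetical: `dbo.Customer` stands in for one replicated reference table, `dbo.ChangeQueue` for the modification queue on the warehouse), it could look something like this. Note the trigger must not be created with NOT FOR REPLICATION, or it won't fire when the distribution agent applies replicated changes:

```sql
-- Hypothetical queue table on the warehouse side.
CREATE TABLE dbo.ChangeQueue (
    QueueId     bigint IDENTITY PRIMARY KEY,
    TableName   sysname       NOT NULL,
    KeyValue    nvarchar(100) NOT NULL,  -- primary key of the changed row
    Operation   char(1)       NOT NULL,  -- 'I', 'U' or 'D'
    QueuedAtUtc datetime2     NOT NULL DEFAULT SYSUTCDATETIME()
);
GO

CREATE TRIGGER trg_Customer_Queue
ON dbo.Customer
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- Rows only in inserted = I, only in deleted = D, in both = U.
    INSERT dbo.ChangeQueue (TableName, KeyValue, Operation)
    SELECT 'Customer',
           CAST(COALESCE(i.CustomerId, d.CustomerId) AS nvarchar(100)),
           CASE WHEN d.CustomerId IS NULL THEN 'I'
                WHEN i.CustomerId IS NULL THEN 'D'
                ELSE 'U' END
    FROM inserted AS i
    FULL OUTER JOIN deleted AS d ON d.CustomerId = i.CustomerId;
END;
```

The watcher process would then dequeue rows (ordered by `QueueId`), dispatch on `TableName`, and call the appropriate denormalization procedure for each key.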

Does anyone with experience in these areas think this could work? Are we crazy?

Answer :

If you’re using a version and edition that supports Change Data Capture or Change Tracking, these mechanisms would be better suited to monitoring for changes to your reference tables because they provide a lightweight solution for identifying only changed rows.

You could implement CT or CDC on your reference tables and poll them frequently for changes. When a change is encountered, you can pull in the changed rows only, update your data warehouse and kick off your process for fixing the denormalised data in your data warehouse.
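A minimal Change Tracking setup, assuming a hypothetical `dbo.Customer` reference table in a database called `AppDb`, might look like this:

```sql
-- Enable Change Tracking at the database level, then per table.
ALTER DATABASE AppDb SET CHANGE_TRACKING = ON
    (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.Customer ENABLE CHANGE_TRACKING;

-- Polling query: rows changed since the version recorded on the last poll.
DECLARE @last_sync_version bigint = 0;  -- in practice, persisted from the previous poll
SELECT ct.CustomerId,
       ct.SYS_CHANGE_OPERATION,  -- 'I', 'U' or 'D'
       c.*                       -- NULL for deletes
FROM CHANGETABLE(CHANGES dbo.Customer, @last_sync_version) AS ct
LEFT JOIN dbo.Customer AS c ON c.CustomerId = ct.CustomerId;

-- Store this value and pass it as @last_sync_version next time.
SELECT CHANGE_TRACKING_CURRENT_VERSION();
```

Polling this way gives you only the net changed rows and their operations, without triggers or a hand-rolled queue table.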

You may also need to re-evaluate the design of your data warehouse. From your description, these reference tables are possibly slowly changing dimensions, and with the correct DWH design you shouldn't need to update your denormalised data in the warehouse after a change to a dimension; that change should be handled automatically.
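For example, in a Type 2 slowly changing dimension (table and column names here are purely illustrative), a change to a dimension row appends a new version rather than rewriting anything downstream:

```sql
-- Hypothetical Type 2 dimension: fact rows store the surrogate CustomerKey,
-- so they keep pointing at the attribute values that were current when the
-- fact was loaded. A dimension change closes the old row (ValidToUtc,
-- IsCurrent = 0) and inserts a new current row -- no fact rework needed.
CREATE TABLE dbo.DimCustomer (
    CustomerKey  int IDENTITY PRIMARY KEY,   -- surrogate key referenced by facts
    CustomerId   int          NOT NULL,      -- business key from the OLTP table
    Region       nvarchar(50) NOT NULL,      -- example tracked attribute
    ValidFromUtc datetime2    NOT NULL,
    ValidToUtc   datetime2    NULL,          -- NULL = still current
    IsCurrent    bit          NOT NULL DEFAULT 1
);
```

Whether this fits depends on the historical-data requirement discussed below: Type 2 preserves point-in-time attributes, while a Type 1 (overwrite-in-place) dimension would show only current values.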

There isn’t really a complete answer to your problem because there are a lot of unknowns that impact design decisions for both your data warehouse and ETL process. You need to properly define your requirements around performance, data latency (how far behind can the DWH be), historical data (should your fact table records reflect the dimension details at the time of insert or the current state of the dimension), reporting and several other factors which then inform your DWH and ETL design.

After the requirements are defined, you can start looking at technologies and architectures to meet these requirements.

Too long for a comment so here’s my take:
Just to add my two cents: I have experience using transactional replication to keep two databases in near-realtime synchronization (a few seconds of variance at most). The databases are about 100 GB each, and we probably make a few hundred MB of transactional changes every day.

It works pretty well overall. When it fails, it sucks. That happens only a few times a year (as a result of us making specific schema changes), but the subscriber database is effectively down for half a day while we debug the issue and generate a new snapshot to get replication working again. All things considered, we do well with our current setup.

That being said, I think replication is starting to become a little antiquated as other methodologies see heavier use (log shipping, database mirroring, availability groups, etc.). I've never used any of these personally, so I can't advise much on the pros and cons, but I definitely recommend reading up on each before deciding what makes sense for your needs.

Also, I think the key takeaway from Kendra's post linked above is "There's No Easy Answer to this Problem". Transactional replication is the answer in some scenarios (such as mine); other cases call for one of the other methodologies I listed above or another solution. Good luck!
