Is this a good strategy for importing a large amount of data and decomposing as an ETL?

Question :

I have a set of five tables (a highly decomposed schema for an ETL if I understand the nomenclature) that I’m going to load via bulk import, then run some inserts from those five tables into a SLEW of other tables, including inserts that just rely on the values I just inserted into the first tables.

I can do my inserts as an A, B, C process: insert into the first table, then insert into some table S where exists in A + T (T being some other table that has preloaded “configuration data”), then insert into Z where exists in B + U, and so on.

Should I be trying to batch those inserts with a cursor (I know, stone the traitor) or should I just run the raw insert-into scripts and let the server die a thousand heat deaths? Basically I’m worried about starving the server or causing it to collapse from too many index updates or something else related to the inserts.

Should I stage out the inserts as:

  1. Insert into one set of tables
  2. Insert into secondary tables based on the first tables
  3. Insert into tertiary tables, etc.

OR should I insert into all the tables where the data is needed but do it via cursor, in a “for loop” style pattern of 100k rows at a time?
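To illustrate the kind of batching I mean (all table and column names here are made up, not my real schema), something like:

    -- Hypothetical batched insert: copy rows from a staging table into a
    -- target table 100k at a time, so each transaction stays bounded.
    DECLARE @BatchSize int = 100000;
    DECLARE @Rows int = 1;

    WHILE @Rows > 0
    BEGIN
        INSERT INTO dbo.TargetZ (Id, Col1)
        SELECT TOP (@BatchSize) s.Id, s.Col1
        FROM dbo.StagingA AS s
        WHERE NOT EXISTS (SELECT 1 FROM dbo.TargetZ AS t WHERE t.Id = s.Id);

        -- @@ROWCOUNT hits zero once every staging row has been copied.
        SET @Rows = @@ROWCOUNT;
    END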

FWIW, this is a behavior I saw from the DBAs at my last job, so I figure that’s “what I’m supposed to do” (the batch process via cursors) but maybe I don’t understand enough about what they were doing (they were also live-loading into systems that already had data, and were loading new data afterwards).

Also bear in mind that I’m normally a C# dev, but I’ve got the most TSQL experience here and I’m trying to make the best process I can for raw-loading this data as opposed to our “current” method that is mostly webservice fetches and NHibernate save-commits.

Things I think are important to the question:

  1. There will be no other load on this server when I do this, I will have complete and total control of the box, and I’m the only user interested here (this is the initial data load before anyone else can do anything with the box)
  2. There are indexes on all the tables, where appropriate, and there are FKs and other predicates
  3. The entire db system is in use for slews of other clients right now, so I know the indexes are needed for operation, but if I should do something to disable those and re-enable them to make the inserts etc faster, I’m looking for those sorts of optimization hacks.
  4. I’m waiting on sample data but I expect the maximum records in a given table to be about a million rows, fairly wide, and that I’m going to insert into those other tables at most a million rows for the ancillary tables that can come secondary or tertiary.
  5. The hardware is “average”.
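For item 3, the sort of toggle I have in mind (hypothetical index and table names) is disabling nonclustered indexes before the load and rebuilding them after; as I understand it, rebuilding also re-enables the index, though disabling a clustered index would make the table itself inaccessible, so that one stays on:

    -- Disable a nonclustered index before the bulk load...
    ALTER INDEX IX_TargetZ_Col1 ON dbo.TargetZ DISABLE;

    -- ...run the inserts...

    -- ...then rebuild it afterwards (REBUILD re-enables the index).
    ALTER INDEX IX_TargetZ_Col1 ON dbo.TargetZ REBUILD;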

Answer :

If you are confident in the integrity of the data being imported, it may be a good idea to disable all the constraints on your database before beginning your inserts and then re-enable them after the fact.

See this helpful Stack Overflow answer from a while back: Can foreign key constraints be temporarily disabled using T-SQL?

This will save you the headache of having to layer the inserts to respect the existing constraints of the database you are loading into.
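A minimal sketch of that pattern for one table (the table name is hypothetical; you would repeat or script this across the tables involved):

    -- Stop checking all FK and CHECK constraints on the table.
    ALTER TABLE dbo.TargetZ NOCHECK CONSTRAINT ALL;

    -- ...load the data...

    -- Re-enable them, re-validating existing rows so the
    -- constraints come back as trusted.
    ALTER TABLE dbo.TargetZ WITH CHECK CHECK CONSTRAINT ALL;

The `WITH CHECK` on re-enable matters: without it the constraint is turned back on but left untrusted, and the optimizer can no longer rely on it.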

In terms of the actual inserts themselves, I’d come down on the side of not using cursors. Not only is the process slow, but cursors consume a large amount of memory and take out database locks. If you cursor through a very large number of rows, you also run the risk of rapidly growing the transaction log; if the server is only an average one, space may eventually become a concern. Try to take a more set-based approach for the additional inserts your process needs.

For example, prefer a single set-based insert:

insert into t1 (col1)
select src_col
from staging_table;

instead of row-by-row inserts like this:

insert into t1 (col1) values ('foo');
insert into t1 (col1) values ('bar');
insert into t1 (col1) values ('baz');

Personally, I would use SSIS for this task. First I would bulk insert into staging tables (or use a data flow), then do whatever cleanup and transform tasks you need and load the transformed data into a final set of staging tables. Then use a data flow task to send the data to the production tables.

The reason I would keep two separate sets of staging tables, one with the raw data and one with the transformed data, is that it makes it easier to research data problems that come up when running imports over time, and if the transform is done before the final load, then the final load that affects prod will be faster.

Note that an SSIS data flow does process one record at a time, but it does so much faster than the average cursor, especially if you are not doing any data transforms at that point. Further, in SSIS you can route bad data (invalid values, missing required fields, etc.) to an exception table, so you can inform the data provider of the problem while still processing the rest of the data.
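If you end up doing the same thing in plain T-SQL rather than SSIS, the raw-staging → clean-staging → prod flow with an exception table might look roughly like this (all table and column names are hypothetical):

    -- 1. Route rows that fail validation to an exception table.
    INSERT INTO dbo.Staging_Exceptions (Id, RawValue, Reason)
    SELECT r.Id, r.RawValue, 'missing required value'
    FROM dbo.Staging_Raw AS r
    WHERE r.RawValue IS NULL;

    -- 2. Transform the surviving rows into the clean staging table.
    INSERT INTO dbo.Staging_Clean (Id, CleanValue)
    SELECT r.Id, UPPER(LTRIM(RTRIM(r.RawValue)))
    FROM dbo.Staging_Raw AS r
    WHERE r.RawValue IS NOT NULL;

    -- 3. Final, fast load from clean staging into the production table.
    INSERT INTO dbo.Prod_Target (Id, Value)
    SELECT c.Id, c.CleanValue
    FROM dbo.Staging_Clean AS c;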
