Question :
I have a large-ish data warehouse staging table (~2.5 million rows) that is using a mapping table to standardize a text column. I have an update statement that works perfectly, but has terrible performance, and would like some recommendations on modifications to the statement and/or alternate strategies for standardizing the data.
The source text can contain anything, and I’m basically looking for keywords in the strings. Because of the nature of the data, a longer string is more specific and therefore correct, so I want to get the longest mapping value found for each given row in the staging table.
Source data (simplified):
CREATE TABLE STAGINGTABLE (TextColumn VARCHAR(100));
INSERT INTO STAGINGTABLE VALUES ('This entry has a full foo bar entry');
INSERT INTO STAGINGTABLE VALUES ('Today was a foo day');
INSERT INTO STAGINGTABLE VALUES ('Only found bar data');
INSERT INTO STAGINGTABLE VALUES ('Blah blah blah');
CREATE TABLE MAPPING (SourceMap VARCHAR(100), TargetMap VARCHAR(100));
INSERT INTO MAPPING VALUES ('%FOO BAR%', 'Foo Bar');
INSERT INTO MAPPING VALUES ('%FOO%', 'Foo');
INSERT INTO MAPPING VALUES ('%BAR%', 'Bar');
INSERT INTO MAPPING VALUES ('%%', 'Unknown');
Update statement:
UPDATE STAGINGTABLE
SET TextColumn = (
SELECT TargetMap
FROM (
SELECT TargetMap
, ROW_NUMBER() OVER (PARTITION BY TextColumn
ORDER BY LEN(SourceMap) DESC) AS ROWNUM
FROM STAGINGTABLE
INNER JOIN MAPPING
ON UPPER(TextColumn) LIKE SourceMap
WHERE TextColumn IS NOT NULL
) AS STG_MAP
WHERE ROWNUM = 1
);
Final values in staging table after mapping:
Foo Bar
Foo
Bar
Unknown
Edit: the TextColumn and SourceMap columns do have non-clustered indexes.
Answer :
Thanks to everyone’s comments and suggestions, I tried all of the ideas @mustaccio posted, but in the end the fastest strategy was to abandon the mapping table altogether, and just go with a big ugly CASE statement to handle all of the situations, and insert that into another staging table. Execution time dropped to only a few seconds, and I can’t argue with that. The situations to handle are mostly static, so if I have to re-deploy a SSIS package once or twice a year to add new situations if they come up, so be it.
Final example insert statement (not exact, just going off memory):
INSERT INTO STAGINGTABLE_TWO (TextColumn)
SELECT
CASE
WHEN TextColumn LIKE '%FOO BAR%' THEN 'Foo Bar'
WHEN TextColumn LIKE '%FOO%' THEN 'Foo'
WHEN TextColumn LIKE '%BAR%' THEN 'Bar'
ELSE 'Unknown'
END
FROM STAGINGTABLE;