We use a SQL Server PDW appliance for our data warehouse. One of the tables in our warehouse is a replicated table with about 20 million rows. As part of our ETL process we need to expire old records from this dimension; however, we are seeing that updating a handful of records (<100) takes over 1 hour to complete. This is what I would like to improve if I can.
Naturally, one option that I thought about was changing this Dimension from Replicated to Distributed. My testing shows that it would fix the issue with the ETL process taking long (from 1.5 hours came down to 30 secs) but all the joins against the Distributed version of this dimension would be affected since the joins are almost never based on the same distribution column. When I look at the execution plan of some of these queries I usually see either a ShuffleMove or a BroadcastMove operation.
So my question to the PDW guru’s here is:
Is there anything else that can be done in order to improve the performance of updating records in the replicated version of this Dimension?
Again, moving to a Distributed table doesn’t seem to be the best solution since it will affect hundreds of already written SQL queries and reports developed by other people.
A few questions. 20 million rows isn’t necessarily that large.
What process are you using to perform your updates and deletes right now?
Is the dimension a CLUSTERED COLUMNSTORE INDEX, CLUSTERED INDEX or HEAP?
Are you saying that there is movement whilst you update and delete this table or did you just see movement when you changed the table from replicated to distributed?
If it is the latter that is not surprising. You are unlikely to be join and aggregation compatible. If you are doing something to trigger the movement through your update / delete then we could look at that – although a concrete example would be helpful.
In general terms I would start by trying to keep the ETL simple.
Use CTAS against the dimension selecting only the rows you want to keep, union in any new rows and use CASE to pick up any changes (converting the UPDATE into a transform within the CTAS). Once complete you can then use a pair of RENAME OBJECT commands to switch from the current table to the new table. This gives you the added benefit of having a historic view of your table – which you can drop at your leisure.
Being replicated doesn’t stop you from using partitioning. Partition your table.
Then, for the rows you need to delete or update, CTAS the entire partition into a new table, using LEFT JOIN and COALESCE to get the appropriate (i.e. new) values for updates from the changed rows while keeping the rows you want and excluding the ones you don’t.
Finally, partition switch the new table with your old partition.
And done 🙂
In my experience, PDW doesn’t like updates and deletes. CTAS and partition switches work well.
UPDATE statements in PDW are only partially parallel rather than fully parallel like CTAS.
That said, this could well be down to indexing. What is the actual code that you’re running? Do you have indexes in place to help find those records you’re expiring? You do still need to apply some of the standard tuning techniques of applying non-clustered indexes to rowstore tables. The fact that PDW doesn’t support primary keys often seems to mean people forget to index their natural key, so don’t find yourself in that boat…