If I have an
UPDATE statement that does not actually change any data (because the data is already in the updated state). Is there any performance benefit in putting a check in the
WHERE clause to prevent the update?
For example would there be any difference in execution speed between UPDATE 1 and UPDATE 2 in the following:
CREATE TABLE MyTable (ID int PRIMARY KEY, Value int); INSERT INTO MyTable (ID, Value) VALUES (1, 1), (2, 2), (3, 3); -- UPDATE 1 UPDATE MyTable SET Value = 2 WHERE ID = 2 AND Value <> 2; SELECT @@ROWCOUNT; -- UPDATE 2 UPDATE MyTable SET Value = 2 WHERE ID = 2; SELECT @@ROWCOUNT; DROP TABLE MyTable;
The reason I ask is that I need the row count to include the unchanged row so I know whether to do an insert if the ID does not exist. As such I used the UPDATE 2 form. If there is a performance benefit to using the UPDATE 1 form, is it possible to get the row count that I need somehow?
If I have an UPDATE statement that does not actually change any data (because the data is already in the updated state), is there any performance benefit in putting a check in the where clause to prevent the update?
There certainly could be as there is a slight performance difference due to UPDATE 1:
- not actually updating any rows (hence nothing to write to disk, not even minimal log activity), and
- taking out less restrictive locks than what are required for doing the actual update (hence better for concurrency) (Please see Update section towards the end)
However, how much of a difference there is would need to be measured by you on your system with your schema, and data, and system load. There are several factors that play into how much impact a non-updating UPDATE has:
- the amount of contention on the table being updated
- the number of rows being updated
- if there are UPDATE Triggers on the table being updated (as noted by Mark in a comment on the Question). If you execute
UPDATE TableName SET Field1 = Field1, then an Update Trigger will fire and indicate that the field was updated (if you check using either the UPDATE() or COLUMNS_UPDATED functions), and that the field in both
DELETEDtables are the same value.
Also, the following summary section is found in Paul White’s article, The Impact of Non-Updating Updates (as noted by @spaghettidba in a comment on his answer):
SQL Server contains a number of optimisations to avoid unnecessary logging or page flushing when processing an UPDATE operation that will not result in any change to the persistent database.
- Non-updating updates to a clustered table generally avoid extra logging and page flushing, unless a column that forms (part of) the cluster key is affected by the update operation.
- If any part of the cluster key is ‘updated’ to the same value, the operation is logged as if data had changed, and the affected pages are marked as dirty in the buffer pool. This is a consequence of the conversion of the UPDATE to a delete-then-insert operation.
- Heap tables behave the same as clustered tables, except they do not have a cluster key to cause any extra logging or page flushing. This remains the case even where a non-clustered primary key exists on the heap. Non-updating updates to a heap therefore generally avoid the extra logging and flushing (but see below).
- Both heaps and clustered tables will suffer the extra logging and flushing for any row where a LOB column containing more than 8000 bytes of data is updated to the same value using any syntax other than ‘SET column_name = column_name’.
- Simply enabling either type of row versioning isolation level on a database always causes the extra logging and flushing. This occurs regardless of the isolation level in effect for the update transaction.
Please keep in mind (especially if you don’t follow the link to see Paul’s full article), the following two items:
Non-updating updates still have some log activity, showing that a transaction is beginning and ending. It is just that no data modification happens (which is still a good savings).
As I stated above, you need to test on your system. Use the same research queries that Paul is using and see if you get the same results. I am seeing slightly different results on my system than what is shown in the article. Still no dirty pages to be written, but a little more log activity.
… I need the row count to include the unchanged row so I know whether to do an insert if the ID does not exist. … is it possible to get the row count that I need somehow?
Simplistically, if you are just dealing with a single row, you can do the following:
UPDATE MyTable SET Value = 2 WHERE ID = 2 AND Value <> 2; IF (@@ROWCOUNT = 0) BEGIN IF (NOT EXISTS( SELECT * FROM MyTable WHERE ID = 2 -- or Value = 2 depending on the scenario ) ) BEGIN INSERT INTO MyTable (ID, Value) -- or leave out ID if it is an IDENTITY VALUES (2, 2); END; END;
For multiple rows, you can get the information needed to make that decision by using the
OUTPUT clause. By capturing exactly what rows were updated, then you can narrow down the items to look up to know the difference between not updating rows that don’t exist as opposed to not updating rows that exist but don’t need the update.
I show the basic implementation in the following answer:
How to avoid using Merge query when upserting multiple data using xml parameter?
The method shown in that answer doesn’t filter out rows that exist yet do not need to be updated. That portion could be added, but you would first need to show exactly where you are getting your dataset that you are merging into
MyTable. Are they coming from a temporary table? A table-valued parameter (TVP)?
I was finally able to do some testing and here is what I found regarding transaction log and locking. First, the schema for the table:
CREATE TABLE [dbo].[Test] ( [ID] [int] NOT NULL CONSTRAINT [PK_Test] PRIMARY KEY CLUSTERED, [StringField] [varchar](500) NULL );
Next, the test updating the field to the value that it already has:
UPDATE rt SET rt.StringField = '04CF508B-B78E-4264-B9EE-E87DC4AD237A' FROM dbo.Test rt WHERE rt.ID = 4082117
-- Transaction Log (2 entries): Operation ---------------------------- LOP_BEGIN_XACT LOP_COMMIT_XACT -- SQL Profiler (3 Lock:Acquired events): Mode Type -------------------------------------- 8 - IX 5 - OBJECT 8 - IX 6 - PAGE 5 - X 7 - KEY
Finally, the test that filters out the update due to the value not changing:
UPDATE rt SET rt.StringField = '04CF508B-B78E-4264-B9EE-E87DC4AD237A' FROM dbo.Test rt WHERE rt.ID = 4082117 AND rt.StringField <> '04CF508B-B78E-4264-B9EE-E87DC4AD237A';
-- Transaction Log (0 entries): Operation ---------------------------- -- SQL Profiler (3 Lock:Acquired events): Mode Type -------------------------------------- 8 - IX 5 - OBJECT 7 - IU 6 - PAGE 4 - U 7 - KEY
As you can see, nothing is written to the Transaction Log when filtering out the row, as opposed to the two entries marking the beginning and ending of the Transaction. And while it is true that those two entries are almost nothing, they are still something.
Also, the locking of the PAGE and KEY resources is less restrictive when filtering out the rows that haven’t changed. If no other processes are interacting with this table then it is probably a non-issue (but how likely is that, really?). Keep in mind that that testing shown in any of the linked blogs (and even my testing) implicitly assumes that there is no contention on the table since it is never part of the tests. Saying that non-updating updates are so light-weight that it doesn’t pay to do the filtering needs to be taken with a grain of salt since the testing has been done, more or less, in a vacuum. But in Production, this table is most likely not isolated. Of course, it could very well be that the little bit of logging and more restrictive locks don’t translate into less efficiency. So the most reliable source of information to answer this question? SQL Server. Specifically: your SQL Server. It will show you which method is better for your system :-).
If the operations in which the new value is the same as the current value (i.e. no update) out number the operations in which the new value is different and the update is necessary, then the following pattern might prove to be even better, especially if there is a lot of contention on the table. The idea is to do a simple
SELECT first to get the current value. If you don’t get a value then you have your answer regarding the
INSERT. If you do have a value, you can do a simple
IF and issue the
UPDATE only if it is needed.
DECLARE @CurrentValue VARCHAR(500) = NULL, @NewValue VARCHAR(500) = '04CF508B-B78E-4264-B9EE-E87DC4AD237A', @ID INT = 4082117; SELECT @CurrentValue = rt.StringField FROM dbo.Test rt WHERE rt.ID = @ID; IF (@CurrentValue IS NULL) -- if NULL is valid, use @@ROWCOUNT = 0 BEGIN -- row does not exist INSERT INTO dbo.Test (ID, StringField) VALUES (@ID, @NewValue); END; ELSE BEGIN -- row exists, so check value to see if it is different IF (@CurrentValue <> @NewValue) BEGIN -- value is different, so do the update UPDATE rt SET rt.StringField = @NewValue FROM dbo.Test rt WHERE rt.ID = @ID; END; END;
-- Transaction Log (0 entries): Operation ---------------------------- -- SQL Profiler (2 Lock:Acquired events): Mode Type -------------------------------------- 6 - IS 5 - OBJECT 6 - IS 6 - PAGE
So there are only 2 locks acquired instead of 3, and both of these locks are Intent Shared, not Intent eXclusive or Intent Update (Lock Compatibility). Keeping in mind that each lock acquired will also get released, each lock is really 2 operations, so this new method is a total of 4 operations instead of the 6 operations in the originally proposed method. Considering this operation is running once every 15 ms (approximately, as stated by the O.P.), that is about 66 times per second. So the original proposal amounts to 396 lock/unlock operations per second, while this new method amounts to only 264 lock/unlock operations per second of even lighter-weight locks. This is not a guarantee of awesome performance, but certainly worth testing :-).
Zoom out a little and think about the bigger picture. In the real world, is your update statement really going to look like this:
UPDATE MyTable SET Value = 2 WHERE ID = 2 AND Value <> 2;
Or is it going to look more like this:
UPDATE Customers SET AddressLine1 = '123 Main St', AddressLine2 = 'Apt 24', City = 'Chicago', State = 'IL', (and a couple dozen more fields) WHERE ID = 2 AND (AddressLine1 <> '123 Main St' OR AddressLine2 <> 'Apt 24' OR City <> 'Chicago' OR State <> 'IL' (and a couple dozen more fields))
Because in the real world, tables have lots of columns. That means you’re going to have to generate a lot of complex dynamic app logic to build dynamic strings, OR you’re going to have to specify every field’s before-and-after contents, every time.
If you build these update statements dynamically for every table, only passing in the fields that are being updated, you can quickly run into a plan cache pollution problem similar to the NHibernate parameter sizes problem from a few years back. Even worse, if you build the update statements in SQL Server (like in stored procedures), then you’ll burn precious CPU cycles because SQL Server isn’t terribly efficient at concatenating strings together at scale.
Because of those complexities, it doesn’t usually make sense to do this kind of row-by-row, field-by-field comparison as you’re doing the updates. Think set-based operations instead.
You could see a performance gain in skipping rows that do not need to be updated only when the number of rows is large (less logging, less dirty pages to write to disk).
When dealing with single row updates as in your case, the performance difference is completely negligible. If updating the rows in all cases makes it easier for you, do it.
For further information on the topic see Non Updating Updates by Paul White
You can combine the update and insert into one statement. On SQL Server, you can use a MERGE statement to do both the update and insert if not found. For MySQL, you can use INSERT ON DUPLICATE KEY UPDATE.
It’s common to filter out non-updating updates as it would affect audit trail triggers or audit columns like a LastModifiedDateTime. The easiest way to do this for multiple columns that accounts for NULL is to use EXCEPT
Only Update If Different Using EXCEPT
UPDATE Table1 SET Col1 = @NewVal1 ,Col2 = @NewVal2 ... /*Audit trail columns*/ ,LastModifiedBy = @UserID ,LastModifiedDateTime = GETDATE() WHERE EXISTS ( /*Only updates if at least 1 column is different*/ SELECT Col1,Col2 EXCEPT SELECT @NewVal1,@NewVal2 )