What’s the overhead of updating all columns, even the ones that haven’t changed [closed]

Question :

When it comes to updating a row, many ORM tools issue an UPDATE statement that sets every column associated to that particular entity.

The advantage is that you can easily batch the update statements, since the UPDATE statement is the same no matter which entity attribute changes. What’s more, you can use server-side and client-side statement caching as well.

So, if I load an entity and only set a single property:

Post post = entityManager.find(Post.class, 1L);
post.setScore(12);

All columns are going to be included in the UPDATE statement:

UPDATE post
SET    score = 12,
       title = 'High-Performance Java Persistence'
WHERE  id = 1

Now, assuming that we have an index on the title property as well, shouldn’t the DB realize that the value hasn’t changed anyway?

In this article, Markus Winand says:

The update on all columns shows the same pattern we have already
observed in the previous sections: the response time grows with each
additional index.

I wonder why there is this overhead, since the database loads the associated data page from disk into memory anyway, and so it can figure out whether a column value needs to be changed or not.

Even for indexes, it does not need to re-balance anything, since the index values don’t change for the columns that haven’t changed, yet they were included in the UPDATE.

Is it that the B+ Tree indexes associated to the redundant unchanged columns need to be navigated as well, only for the database to realize that the leaf value is still the same?

Of course, some ORM tools allow you to UPDATE just the changed properties:

UPDATE post
SET    score = 12
WHERE  id = 1

But this type of UPDATE might not always benefit from batch updates or statement caching when different properties are changed for different rows.
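To make that trade-off concrete, here is a minimal Java sketch (the table and column names are illustrative, not what any particular ORM generates): a full-column UPDATE yields one constant SQL string no matter which attributes are dirty, while a dynamic UPDATE yields a different string per distinct set of changed properties, fragmenting any statement cache keyed by SQL text.

```java
import java.util.List;
import java.util.stream.Collectors;

public class UpdateSqlDemo {

    // Full-column UPDATE: the SQL text is constant, so a statement cache
    // keyed by SQL text always hits, and JDBC batching can group every
    // row change into one batch.
    static String fullUpdate(List<String> allColumns, String table) {
        return "UPDATE " + table + " SET "
            + allColumns.stream().map(c -> c + " = ?").collect(Collectors.joining(", "))
            + " WHERE id = ?";
    }

    // Dynamic UPDATE: the SQL text depends on which columns are dirty, so
    // rows with different dirty sets produce different statements.
    static String dynamicUpdate(List<String> dirtyColumns, String table) {
        return "UPDATE " + table + " SET "
            + dirtyColumns.stream().map(c -> c + " = ?").collect(Collectors.joining(", "))
            + " WHERE id = ?";
    }

    public static void main(String[] args) {
        List<String> all = List.of("score", "title");
        System.out.println(fullUpdate(all, "post").equals(fullUpdate(all, "post")));
        // true: one cache entry, always batchable

        System.out.println(dynamicUpdate(List.of("score"), "post")
            .equals(dynamicUpdate(List.of("title"), "post")));
        // false: cache miss, statements cannot share one batch
    }
}
```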

Answer :

I know you’re mostly concerned about UPDATE and mostly about performance, but as a fellow “ORM” maintainer, let me give you another perspective on the problem of distinguishing between “changed”, “null”, and “default” values, which are three different things in SQL, but possibly only one thing in Java and in most ORMs:

Translating your rationale to INSERT statements

Your arguments in favour of batchability and statement cacheability hold true in the same way for INSERT statements as they do for UPDATE statements. But in the case of INSERT statements, omitting a column from the statement has different semantics than in UPDATE: it means applying the DEFAULT. The following two are semantically equivalent:

INSERT INTO t (a, b)    VALUES (1, 2);
INSERT INTO t (a, b, c) VALUES (1, 2, DEFAULT);

This isn’t true for UPDATE, where the first two are semantically equivalent, and the third one has an entirely different meaning:

-- These are the same
UPDATE t SET a = 1, b = 2;
UPDATE t SET a = 1, b = 2, c = c;

-- This is different!
UPDATE t SET a = 1, b = 2, c = DEFAULT;

Most database client APIs, including JDBC, and by consequence, JPA, don’t allow for binding a DEFAULT expression to a bind variable – mostly because the servers don’t allow this either. If you want to re-use the same SQL statement for the aforementioned batchability and statement cacheability reasons, you’d use the following statement in both cases (assuming (a, b, c) are all the columns in t):

INSERT INTO t (a, b, c) VALUES (?, ?, ?);

And since c isn’t set, you’d probably bind Java null to the third bind variable, because many ORMs also cannot distinguish between NULL and DEFAULT (jOOQ, for example, being an exception here). They only see Java null and don’t know whether this means NULL (as in the unknown value) or DEFAULT (as in the uninitialised value).
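One way around the ambiguity is to model bind values as an explicit tri-state. The following is a hypothetical sketch of that idea, not jOOQ’s actual API: a value is either set (possibly to NULL) or not set at all, in which case it should render as DEFAULT.

```java
public class BindValue<T> {
    // Three states: set to a value -> bind variable,
    //               set to null    -> SQL NULL,
    //               not set        -> SQL DEFAULT.
    private final boolean set;
    private final T value; // may be null even when set

    private BindValue(boolean set, T value) {
        this.set = set;
        this.value = value;
    }

    public static <T> BindValue<T> of(T value) { return new BindValue<>(true, value); }
    public static <T> BindValue<T> notSet()    { return new BindValue<>(false, null); }

    // How the value would render in generated SQL.
    public String toSql() {
        if (!set)          return "DEFAULT";
        if (value == null) return "NULL";
        return "?"; // a real bind variable
    }

    public static void main(String[] args) {
        System.out.println(BindValue.of(12).toSql());     // ?
        System.out.println(BindValue.of(null).toSql());   // NULL
        System.out.println(BindValue.notSet().toSql());   // DEFAULT
    }
}
```

With such a type, “the user never called the setter” and “the user explicitly set null” are no longer collapsed into the same Java null.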

In many cases, this distinction doesn’t matter, but in case your column c is using any of the following features, the statement is simply wrong:

  • It has a DEFAULT clause
  • It might be generated by a trigger

Back to UPDATE statements

While the above is true for all databases, I can assure you that the trigger issue is real for the Oracle database as well. Consider the following SQL:


CREATE TABLE x (a INT PRIMARY KEY, b INT, c INT, d INT);

INSERT INTO x VALUES (1, 1, 1, 1);

CREATE OR REPLACE TRIGGER t
  BEFORE UPDATE
  ON x
BEGIN
  IF updating('c') THEN
    dbms_output.put_line('Updating c');
  END IF;
  IF updating('d') THEN
    dbms_output.put_line('Updating d');
  END IF;
END;
/

UPDATE x SET b = 1 WHERE a = 1;
UPDATE x SET c = 1 WHERE a = 1;
UPDATE x SET d = 1 WHERE a = 1;
UPDATE x SET b = 1, c = 1, d = 1 WHERE a = 1;

When you run the above, you will see the following output:

table X created.
1 rows inserted.
TRIGGER T compiled
1 rows updated.
1 rows updated.
Updating c

1 rows updated.
Updating d

1 rows updated.
Updating c
Updating d

As you can see, the statement that always updates all the columns will always fire the trigger for all the columns, whereas the statements updating only the columns that have changed will fire only those triggers that are listening for such specific changes.

In other words:

The current behaviour of Hibernate that you’re describing is incomplete and could even be considered wrong in the presence of triggers (and probably other tools).

I personally think that your query cache optimisation argument is overrated in the case of dynamic SQL. Sure, there will be a few more queries in such a cache, and a bit more parsing work to be done, but this is usually not a problem for dynamic UPDATE statements, much less than for SELECT.

Batching is certainly an issue, but in my opinion, a single update shouldn’t be normalised to update all the columns just because there is a slight possibility of the statement being batchable. Chances are, the ORM can collect sub-batches of consecutive identical statements and batch those instead of the “whole batch” (in case the ORM is even capable of tracking the difference between “changed”, “null”, and “default”).
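The sub-batching idea can be sketched as follows (a simplification of what such a batch collector might do, not any specific ORM’s implementation): consecutive statements with identical SQL text go into one batch, and a new batch starts whenever the text changes.

```java
import java.util.ArrayList;
import java.util.List;

public class SubBatcher {

    // Groups consecutive identical SQL strings into sub-batches, e.g.
    // ["A", "A", "B", "A"] -> [["A", "A"], ["B"], ["A"]].
    // Each inner list could then be sent as one JDBC batch.
    static List<List<String>> subBatches(List<String> statements) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = null;
        for (String sql : statements) {
            if (current == null || !current.get(0).equals(sql)) {
                current = new ArrayList<>(); // SQL text changed: start a new batch
                batches.add(current);
            }
            current.add(sql);
        }
        return batches;
    }

    public static void main(String[] args) {
        System.out.println(subBatches(List.of(
            "UPDATE post SET score = ? WHERE id = ?",
            "UPDATE post SET score = ? WHERE id = ?",
            "UPDATE post SET title = ? WHERE id = ?")));
    }
}
```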

Answer :

I think the answer is: it’s complicated. I tried to write a quick proof using a longtext column in MySQL, but the answer is a little inconclusive. Proof first:

# in advance:
set global max_allowed_packet=1024*1024*1024;

# table definition (reconstructed from context; the original listing was truncated)
CREATE TABLE `t2` (
  `a` int NOT NULL AUTO_INCREMENT,
  `b` char(255) NOT NULL,
  `c` longtext,
  PRIMARY KEY (`a`)
);

mysql> insert into t2 (a, b, c) values (null, 'b', REPEAT('c', 1024*1024*1024));
Query OK, 1 row affected (38.81 sec)

mysql> UPDATE t2 SET b='new'; # fast (changed value)
Query OK, 1 row affected (6.73 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql>  UPDATE t2 SET b='new'; # fast
Query OK, 0 rows affected (2.87 sec)
Rows matched: 1  Changed: 0  Warnings: 0

mysql> UPDATE t2 SET b='new'; # fast
Query OK, 0 rows affected (2.61 sec)
Rows matched: 1  Changed: 0  Warnings: 0

mysql> UPDATE t2 SET c= REPEAT('d', 1024*1024*1024); # slow (changed value)
Query OK, 1 row affected (22.38 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> UPDATE t2 SET c= REPEAT('d', 1024*1024*1024); # still slow (no change)
Query OK, 0 rows affected (14.06 sec)
Rows matched: 1  Changed: 0  Warnings: 0

So there is some time difference between the slow update with a changed value and the slow update with no changed value. To dig further, I looked at another metric, pages written:

mysql> show global status like 'innodb_pages_written';
+----------------------+--------+
| Variable_name        | Value  |
+----------------------+--------+
| Innodb_pages_written | 198656 |
+----------------------+--------+
1 row in set (0.00 sec)

mysql> show global status like 'innodb_pages_written';
+----------------------+--------+
| Variable_name        | Value  |
+----------------------+--------+
| Innodb_pages_written | 198775 | <-- 119 pages written by the "no change" update
+----------------------+--------+
1 row in set (0.01 sec)

mysql> show global status like 'innodb_pages_written';
+----------------------+--------+
| Variable_name        | Value  |
+----------------------+--------+
| Innodb_pages_written | 322494 | <-- 123719 pages written by the "change" update!
+----------------------+--------+
1 row in set (0.00 sec)

So it looks like the time increased because there has to be a comparison to confirm that the value itself has not been modified, which in the case of a 1G longtext takes time (because it is split across many pages). But the modification itself does not seem to churn through the redo log.

I suspect that if the values are regular in-page columns, the comparison adds only a little overhead. And assuming the same optimization applies, such updates become no-ops as far as writes are concerned.

Longer Answer

I actually think the ORM should not eliminate columns that have been set but not actually changed, as this optimization has strange side-effects.

Consider the following in pseudo code:

# Initial Data does not make sense
# should be either "Harvey Dent" or "Two Face"

id: 1, firstname: "Two Face", lastname: "Dent"


# Both sessions read the row above, then try to fix it concurrently:

# Session 1 decides the correct value is "Two Face"
session1.firstname = "Two"
session1.lastname = "Face"

# Session 2 decides the correct value is "Harvey Dent"
session2.firstname = "Harvey"
session2.lastname = "Dent"

# Session 1 commits first, then session 2 commits

The outcome if the ORM were to “Optimize out” modification without change:

id: 1, firstname: "Harvey", lastname: "Face"

The outcome if the ORM sent all modifications to the server:

id: 1, firstname: "Harvey", lastname: "Dent"

The test-case here relies on repeatable-read isolation (MySQL default), but a time-window also exists for read-committed isolation where the session2 read occurs before session1 commit.

To put it another way: the optimization is only safe if you issue a SELECT .. FOR UPDATE to read the rows followed by an UPDATE. SELECT .. FOR UPDATE does not use MVCC and always reads the latest version of the rows.
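The anomaly is easy to reproduce in plain Java by simulating the two sessions: each takes a snapshot of the row when it reads, and a toy dirty-checker (not any real ORM’s implementation) writes back only the columns whose new value differs from that snapshot.

```java
import java.util.HashMap;
import java.util.Map;

public class LostUpdateDemo {

    // "Optimized" commit: writes back only the columns whose new value
    // differs from the snapshot the session originally read.
    static void commitChangedOnly(Map<String, String> db,
                                  Map<String, String> snapshot,
                                  Map<String, String> modified) {
        modified.forEach((col, val) -> {
            if (!val.equals(snapshot.get(col))) {
                db.put(col, val); // only "really changed" columns hit the DB
            }
        });
    }

    public static void main(String[] args) {
        Map<String, String> db = new HashMap<>(
            Map.of("firstname", "Two Face", "lastname", "Dent"));

        // Both sessions read the same (repeatable-read) snapshot.
        Map<String, String> snap1 = new HashMap<>(db);
        Map<String, String> snap2 = new HashMap<>(db);

        // Session 1 wants "Two" / "Face"; session 2 wants "Harvey" / "Dent".
        commitChangedOnly(db, snap1, Map.of("firstname", "Two", "lastname", "Face"));
        commitChangedOnly(db, snap2, Map.of("firstname", "Harvey", "lastname", "Dent"));

        // Session 2's lastname = "Dent" matched its snapshot and was
        // optimized out, so session 1's "Face" survives: a mixed-up row.
        System.out.println(db.get("firstname") + " " + db.get("lastname"));
        // prints "Harvey Face"
    }
}
```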

Edit: Made sure test case data set was 100% in memory. Adjusted timing results.
