Question :
I have a collection similar to the one below, with around 5,000,000 records.
{ "_id" : ObjectId("56d18f95d2961d1ddcb7244f"),
"rec_id" : 0,
"status" : "open",
"updated by" : "A"
}
User B is trying to set the status of every document to "closed" with the command below:
db.prod.update({},{'$set': {'status':'closed', 'updated by':'B'}}, false, true)
Let's say user B manages to update the status on 100,000 records, and then the connection is lost, or some other problem occurs, so the remaining documents are not updated. The data is now inconsistent. Is there any way to fix this kind of problem?
I need either all documents marked closed, or none at all.
Answer :
The documentation states:
When a single write operation modifies multiple documents, the modification of each document is atomic, but the operation as a whole is not atomic
Multi-row rollback is one of those things that are forgone in the NoSQL world, sadly.
That same link mentions the $isolated operator. This may be of some help, depending on the use case and topology.
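A minimal sketch of what that might look like here (assuming $isolated fits your deployment; it cannot be used on sharded collections, and it only stops other clients from interleaving with the update, it does not make the update all-or-nothing):

// Sketch: $isolated prevents concurrent readers/writers from seeing the
// update half-applied, but a failure mid-way still leaves a partial update.
db.prod.update(
    { '$isolated': 1 },
    { '$set': { 'status': 'closed', 'updated by': 'B' } },
    { multi: true }
)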
Also documented is the so-called "two-phase commit" pattern. This may require application changes, however, so it may not be practical.
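In rough outline it looks something like this (a sketch only; the transactions collection and its field names here are hypothetical stand-ins for the documented pattern):

// 1. Record the intent in a separate collection before touching any data.
db.transactions.insert({ _id: 1, op: 'close-all', state: 'initial' })

// 2. Move it to 'pending', apply the change, then mark it 'done'.
db.transactions.update({ _id: 1, state: 'initial' }, { '$set': { state: 'pending' } })
db.prod.update({ status: 'open' }, { '$set': { 'status': 'closed', 'updated by': 'B' } }, { multi: true })
db.transactions.update({ _id: 1, state: 'pending' }, { '$set': { state: 'done' } })

// 3. On restart, a transaction stuck in 'pending' tells you the bulk
// update may be partial; re-running it is safe here because the query
// only matches documents that are still open.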
Five million rows isn’t a silly-big number. Would it be possible to arrange an application outage until the writes succeed, with continuous retries? Maybe run the write application on the same hardware as Mongo so the network is no longer an issue.
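For example, a crude retry loop in the shell (a sketch; filtering on status: 'open' makes each attempt idempotent, so re-running after a failure is safe):

// Keep retrying until no open documents remain.
var remaining = db.prod.count({ status: 'open' });
while (remaining > 0) {
    try {
        db.prod.update(
            { status: 'open' },
            { '$set': { 'status': 'closed', 'updated by': 'B' } },
            { multi: true }
        );
    } catch (e) {
        print('update failed, retrying: ' + e);
    }
    remaining = db.prod.count({ status: 'open' });
}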
My two cents.
Based on the documentation, I believe that a single operation which changes multiple documents is atomic per document, and once it reaches the server it will run to completion regardless of whether the client is still connected.
BUT…
if your primary fails while the updates are being performed, then you'll have inconsistent documents: some will get the changes while others will not (depending on configuration, i.e. replication, write concerns, etc.).
In your case, I don't see any problem. Once the user fires off an update query, it doesn't matter whether the client is still connected. The user may not see the updated status, but the operation will succeed as long as the server is running without any issue.
db.prod.update({},{'$set': {'status':'closed', 'updated by':'B'}}, false, true)
Remember that MongoDB queries run on the server, not on the client, unless you are updating individual documents from the client side inside a loop. In that case, whatever queries the client manages to push to the server before the disconnection will succeed.
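To illustrate the difference (a sketch):

// One server-side multi-update: the server finishes it even if the
// client disconnects after the command has been received.
db.prod.update({}, { '$set': { 'status': 'closed' } }, { multi: true })

// Client-side loop: each iteration is a separate round trip, so only
// the updates sent before the disconnection are applied.
db.prod.find({ status: 'open' }).forEach(function (doc) {
    db.prod.update({ _id: doc._id }, { '$set': { 'status': 'closed' } });
});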
Frankly, that two-phase commit is just a gimmick, and a very bad one, because it's not even consistent after it starts 😉
A solution would be to build a log-based reader/writer on top of this collection, so you'd send all reads and writes through an additional layer. If you had your IDs sorted in ascending order, you could add something like:
Update WHERE id < LAST : Status = closed
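In shell terms that might be (LAST being the hypothetical high-water-mark ID the layer tracks):

db.prod.update({ rec_id: { '$lt': LAST } }, { '$set': { 'status': 'closed' } }, { multi: true })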
Then on read you'd get the item plus all pending operations in the log, and calculate the state of the item based on that. On write you'd put the operation on top of the log, and then some backend would modify the data in the collection based on what's in the log, removing each log entry when its operation is finished.
This way you'd have it non-blocking and atomic/consistent. But that would probably take some time to code … I don't know if it's worth it; for just one collection it shouldn't take long.
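A minimal sketch of that layer (the ops_log collection and its field names are hypothetical):

// Write path: append the intended operation instead of mutating directly.
db.ops_log.insert({
    ts: new Date(),
    filter: { status: 'open' },
    set: { status: 'closed', 'updated by': 'B' }
})

// Read path: fetch the raw document, then overlay pending ops in order.
function readItem(id) {
    var doc = db.prod.findOne({ _id: id });
    db.ops_log.find().sort({ ts: 1 }).forEach(function (op) {
        if (doc.status === op.filter.status) {   // naive filter match
            Object.keys(op.set).forEach(function (k) { doc[k] = op.set[k]; });
        }
    });
    return doc;
}

// A backend worker drains the log: apply each op with a multi-update,
// then remove the log entry once it has fully succeeded.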
Maybe it would be better to switch to a normal SQL database if you need transactions and consistency across the app. It's much easier to implement document-like behaviour in SQL than SQL-like behaviour in K/V stores.