On a site-redundant Galera cluster, a commit should only return after a quorum of database nodes has accepted the transaction. If one database node goes down, all commits are preserved by the rest of the cluster, and the downed node will get back in sync with the rest of the cluster when the database is started up again. If the database has become corrupt, one can always nuke the MySQL data dir and start with an empty database; it will eventually catch up with the rest of the cluster. Hence it sounds like we could improve performance by turning off the options that ensure local ACID compliance.
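For reference, resyncing a corrupted node from scratch is normally just a matter of stopping the service, emptying the data directory and starting it again, after which Galera pulls a full SST from a donor node. A rough sketch (the service name and data directory are assumptions – adjust to the local setup):
# systemctl stop mysql
# rm -rf /var/lib/mysql/*     # wipe the local data directory
# systemctl start mysql       # the node rejoins and requests a full SST from a donor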
So the question here is really … “What could possibly go wrong?”
Based on the comments, I’ll give some information about our particular setup:
The cluster consists of three nodes. Two of them are actively used in production, one of them more heavily used than the other. The third node is only used for quorum and backup purposes.
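For what it’s worth, the cluster membership and quorum state can be checked on any node through the wsrep status variables; a minimal sketch, assuming the standard Galera status variable names:
# mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'"      # should report 3
# mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status'"    # 'Primary' means the node is part of a quorum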
Site-redundant means that the nodes are in different data centers. I find it hard to think of anything that would cause two nodes to go down simultaneously – except a serious MySQL bug, and how likely is that? Admittedly, two of the nodes are situated less than 10 km apart (the backup/quorum node is separated by several hundred km plus a national border). A medium-sized nuclear bomb could potentially take out two of the nodes simultaneously … then again, “we have problems with the database” is probably the least of our concerns in such a scenario. Could a solar storm possibly take out both or all of the servers simultaneously?
Our performance problem is primarily due to the write cache on our SAN filling up from time to time. We’re working on mitigating this issue, but we can never guarantee that it won’t happen again. Every now and then we experience “hiccups” where transactions are left waiting for some 10–30 seconds.
In this particular setup, a 30 s delay may actually be a matter of life and death. Well, most likely not, but if the customer perceives it that way, then it’s bad enough. Reasonably low latency, and quick recovery should the whole cluster go down, are the most pressing priorities. Losing a few transactions would be bad enough, but it’s not a matter of life and death.
The performance problem we’re seeing is write transactions being stuck in the “wsrep in pre-commit” state. This is not a flow-control issue; it’s only one node having the problem. I did a bit of research on it; apparently all local write queries wait for locks while the node is applying remote writesets to the database. This problem should be fixed in Galera 4, but upgrading is currently not an option.
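For the record, this is roughly how the two cases can be told apart: stuck local commits show up in the process list, while flow control would show up in the wsrep status counters. A sketch of the checks (standard Galera 3 status variable names assumed):
# mysql -e "SHOW FULL PROCESSLIST" | grep 'wsrep in pre-commit'     # local commits stuck on this node
# mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused'"    # near 0 means no cluster-wide flow control
# mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue'"       # remote writesets waiting to be applied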
Our performance problem is only on one node, so it’s only on that node that I’m considering turning these things off. I would add to our documentation that, whatever happens, said node should never be used for bootstrapping the cluster.
We’re not doing sharding, and we do not intend to do any sharding. Apart from those hiccups we don’t have any performance problems.
I’ve done some research on this now, so I’ll answer my own question.
If the performance problems are observed only on one node, it’s perfectly safe to turn off innodb_doublewrite and relax innodb_flush_log_at_trx_commit on the affected node – just remember to use one of the other nodes for bootstrapping should the whole cluster fail.
# echo "[server]" >> /etc/mysql/conf.d/temp_performance_hax.cnf
# echo "innodb_doublewrite = 0" >> /etc/mysql/conf.d/temp_performance_hax.cnf
# echo "innodb_flush_log_at_trx_commit = 2" >> /etc/mysql/conf.d/temp_performance_hax.cnf
# systemctl restart mysql    # innodb_doublewrite cannot be set at run-time
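After the restart it’s worth verifying that the settings actually took effect:
# mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_doublewrite'"               # expect OFF
# mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_flush_log_at_trx_commit'"   # expect 2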
Also, in a true site-redundant Galera cluster, it should be reasonably safe to set innodb_flush_log_at_trx_commit to 2 and sync_binlog to 0 on any number of nodes. If one is experiencing performance problems, if uptime and performance are important, and if one lost transaction probably isn’t a matter of life and death – then no hesitation is needed, just go ahead and do it.
set global sync_binlog = 0;
set global innodb_flush_log_at_trx_commit = 2;
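Note that SET GLOBAL only lasts until the next restart, so if the relaxed settings are meant to stay, they also need to go into a config file, along the same lines as above (the file name is just an example):
# echo "[server]" >> /etc/mysql/conf.d/relaxed_sync.cnf
# echo "sync_binlog = 0" >> /etc/mysql/conf.d/relaxed_sync.cnf
# echo "innodb_flush_log_at_trx_commit = 2" >> /etc/mysql/conf.d/relaxed_sync.cnf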
and one can probably go ahead and turn off doublewrite as well if performance is more important than ACID compliance. The probability of a disaster is pretty low.
I can see two reasons for sticking to the “safe” settings:
- Preparedness for the worst-case-scenario: what if the whole cluster melts down simultaneously?
- More robustness on the nodes
Node robustness: keep innodb_doublewrite turned on
innodb_doublewrite can be safely turned off on one node without the cluster losing ACID compliance, but it may still be better to have it turned on. According to When is it safe to disable InnoDB doublewrite buffering?, turning off innodb_doublewrite may cause data corruption after a crash; in that case it may be necessary to wipe the data partition on the downed node and start it up again. This comes with three costs:
- manual sysadmin intervention
- elevated levels of IO on the cluster while the corrupted node is performing an SST
- the cluster will have to run with reduced redundancy until the node is fully operational again.
(As far as I can understand, a missing flush on log files shouldn’t cause problems starting a db node – i.e. for innodb_flush_log_at_trx_commit the manual explicitly states that “InnoDB’s crash recovery works regardless of the value”)
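A practical way to see whether a downed node can rejoin cleanly or will need a full SST is to look at its Galera state file before starting it again; a sketch, assuming the default data directory:
# cat /var/lib/mysql/grastate.dat    # seqno: -1 means an unclean shutdown; the node will need recovery or a full SST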
Reasons why the whole cluster may burn down at once
Consider innodb_flush_log_at_trx_commit set to 2, sync_binlog set to 0 and innodb_doublewrite set to 1. This may give performance gains, and as far as I can understand, the only failure mode it doesn’t cover is a quorum of nodes suffering a power loss or kernel panic at the same time. For a true multi-site setup, that seems rare enough that one may want to neglect it.
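In config-file terms that compromise would look something like this (only a sketch, with the variables at the values discussed above):
[server]
innodb_flush_log_at_trx_commit = 2    # flush to the OS at commit, fsync roughly once per second
sync_binlog = 0                       # let the OS decide when to sync the binlog
innodb_doublewrite = 1                # keep the doublewrite buffer for node robustness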
If the nodes are located near each other, a local problem (power outage, flooding, fire, etc.) can knock all of the nodes out at the same time. I’ve witnessed a whole server site going down for the weirdest reasons multiple times, even though everything was supposed to be redundant.
Could solar storms affect the whole cluster simultaneously? Especially if the cluster is located in the same city and in polar regions? I’ve seen two laptops crash simultaneously in the polar region during a solar storm, with no other plausible explanation (different OSes, it wasn’t during a leap second, etc.).
If all nodes run the same OS, a kernel bug could potentially cause all nodes to crash hard with a kernel panic at exactly the same time.
I believe that if innodb_flush_log_at_trx_commit is set to 0 or sync_binlog is set to 0, then a mysqld crash affecting all nodes simultaneously could cause data loss.
Has anyone ever observed galera crashing in such a way?
A sysadmin or a buggy script doing “kill -9” on mysqld on all the nodes simultaneously could do it too … but if a sysadmin wants to cause data loss, they will probably find a way regardless of these settings.
Cluster going down for other reasons
In the following cases, those settings shouldn’t matter:
Sudden and total network isolation between all three nodes, or the cluster being left fragmented so that no quorum can be formed. One would need to bootstrap the cluster after such an incident, and if one really wants to be 100% sure that no transactions are lost, one needs to manually investigate the binlogs (more on that below) – but since neither mysqld nor the OS has crashed, all state will eventually be synced, no matter the sync settings.
Bad timing – i.e. one node goes down, and while it’s down another node goes down or there is some network isolation. Afterwards, the cluster needs to be bootstrapped. However, everything the nodes have received will be written to disk, so the above-mentioned settings will not affect anything. (This has happened to us once.)
Manual intervention is needed on cluster failures
No matter the settings, to be 100% sure that no transaction is lost after a catastrophic cluster failure, it’s actually necessary to manually inspect/merge the binlogs. We need to bootstrap the cluster using only one node, but we don’t know whether this node received all the changesets; this information is only available by investigating the binlogs from all nodes (with log_slave_updates turned on, one can probably make do with the binlogs from a quorum of the nodes).
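In practice, the first step after a total meltdown would be to find out which node has the most recent state. A hedged sketch of how one might compare the nodes (how --wsrep-recover is invoked and where the binlogs live depend on the distribution and settings):
# mysqld --wsrep-recover --user=mysql    # logs “WSREP: Recovered position: <uuid>:<seqno>” for this node
# mysqlbinlog $(ls -1 /var/lib/mysql/mysql-bin.[0-9]* | tail -1) | tail -n 20    # last events/GTIDs in the newest binlog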
This means that you need to make a decision in advance – what should be the priority if we encounter a catastrophic cluster meltdown? Is the priority to get the cluster up and running again as fast as possible, or is it to make 100% sure no transactions have been lost before bootstrapping the cluster?
Your mileage may vary
If some downtime is acceptable under extreme circumstances (a full catastrophic db cluster meltdown is extreme) and if it’s paramount to ensure all transactions are in place before returning to production mode, then one needs to make 100% sure that the binlogs are synced and kept, and that the backup regime is good. Syncing the InnoDB logs and the doublewrite buffer is then non-critical and can be turned off if that gives performance benefits.
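In that case the important part is that binlogs exist and are retained long enough on every node; roughly something like this in the config (option names as in MySQL 5.x / MariaDB):
[server]
log_bin = mysql-bin        # write binlogs at all
log_slave_updates = 1      # also log writesets replicated from the other nodes
expire_logs_days = 7       # keep the binlogs long enough to reconstruct lost transactions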
If getting the cluster up and running again as fast as possible is important, then it’s important to keep innodb_flush_log_at_trx_commit = 1 and innodb_doublewrite = 1 on a quorum of the nodes; should the cluster go down, it can then be quickly bootstrapped from any of the servers with those settings turned on. (Some consideration may still be needed. On our previous meltdown, node C fell down and came up again; a network glitch then occurred between node A and node B before node C had properly rejoined the cluster, resulting in a cluster shutdown. Obviously, we can bootstrap from A or B without any data loss, but not from C.)
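For completeness, the bootstrap itself is then a matter of marking the chosen node as safe and starting it as a new cluster; a sketch (commands and file locations vary between distributions):
# sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat
# galera_new_cluster    # MariaDB; e.g. Percona uses “systemctl start mysql@bootstrap.service” instead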
By “site redundant”, do you mean that you have 1 node in each of 3 different datacenters?
Is your goal to survive any single point of failure? Including a hard crash of any one server or even one datacenter?
Are you willing to force SST if a hard crash occurs?
Then I ‘believe’ those 3 settings can be set to their fast-but-not-safe settings.
(Money-back guarantee on my advice.)