Can I run UPDATEs on a replicated MySQL slave, safely?

Posted on

Question :

We would like access to accurate MySQL backups for debugging, without having personal information (PII) in the backups.

Is it possible to run UPDATEs on the replicated database (ie. replacing text columns with Lorem Ipsum) without breaking replication?

(Note: The schema allows us to identify newly updated rows very easily – meaning it is trivial for us to identify new PII quickly)

Answer :

Yes with many caveats– I have literally this setup to expose Wikipedia database in public, and I would not recommend it unless you do not have an alternative options. We use triggers that update/blank fields that are private, replication filters to avoid full tables/databases, and views to limit certain per-row accesses.

Problems we faced:

  • Statement based replication is very flexible, but if you run certain writes based on existing data (e.g. INSERT…SELECT, updates based on modified columns), it will either break replication or create a data drift. You should limit application writes by inserts and updates by fields you are not going to modify-but it is not possible in most cases to control all possibilities except on simple models (insert-only writes, for example). The breakage can be very subtle but it may happen over time if the application is minimally complex.
  • Row based replication is more accurate (because you get the final state), but it does not trigger the triggers. We started using MariaDB 10.1 to have replica-only triggers to be able to do that (thus solving data drift), but we still have issues due to row based replication have limitation in terms of supporting different schema on master and replica.
  • Triggers and filters make things difficult to manage and more room for errors
  • If possible, I would suggest not having real time replication, but using dumps so you can filter on query time (in our cases it is not possible due to the size of the database and requiring real time from some usages).
  • Because you have different data, running pt-table-checksum and other comparing tools is not really possible (to check for data discrepancies)

When we have the resources, we will likely move to a custom filtering automation that analyzes the row based binary protocol in pseudo real-time and performs more complex filtering on application layer.

Yes, you can UPDATE replicated database without breaking replication but you have to evaluate the following:

  1. You can not update PRIMARY KEY fields.
  2. You can not update fields used in any modification WHERE in master:
    Example:

In Master:

DELETE table1 WHERE text_field='abc'

In SLAVE:

YOU CAN NOT UPDATE text_field

A thought…

Move PII info out of the current database into a separate one. Then binlog_ignore_db to avoid replicating it. (Or you could filter on the Slave or …)

Even without your current requirement, this “isolation” may be a good idea for security.

This has the flaw that the JOINs to get to such info are broken.

Another thought…

Don’t access PII via JOINs, use Functions. Then you have a single location to stop or scramble PII data.

Leave a Reply

Your email address will not be published. Required fields are marked *