WL#4209: Integrate Backup with Replication
Affects: Server-6.0 — Status: Complete
Rationale
---------
Replication and backup must work together in a consistent way.

Overview
--------
The MySQL backup system is part of the MySQL data integrity toolset. As such, it needs to be compatible with other data protection and recovery mechanisms. Replication is used for a variety of data integrity methods including, but not limited to, redundancy, recovery, and scale out.

The focus of this worklog shall be how backup can be used to improve replication. Since this work is the main starting point for replication and backup integration, it is necessary to take the perspective of replication in identifying how backup and restore can be used to enhance replication. However, related worklogs (e.g. WL#4280) may be written from the perspective of the backup system.

The MySQL backup system is not complete and therefore this work shall be limited to discussing the enhancement of replication based on what backup and restore can do now. Where appropriate, future enhancements to the backup system and how they can be used with replication shall be noted.

Common Replication Tasks
------------------------
To narrow the focus of this work, this section explores the more common replication tasks with suggestions on how backup and restore could be used to enhance the process.

1) Make a copy of a master to set up a new slave.

This is a very common customer need when replication is initiated or the replication topology is being expanded to multiple slaves. We create a slave by making a copy of the data on the master, restoring it to a new MySQL server, and then configuring the new slave to replicate from the master. There is a similar use where a slave has failed and the need is to restore redundancy capabilities without impacting production service on the master.

There are several possible solutions for this task. The best solution is to back up the master and restore it on the slave.
Another possibility is to start the slave and tell it to request the data from the master in the form of a backup on the master piped to a restore on the slave. Note: This is future work and is described by WL#4273.

2) Make a copy of a slave to another server that will become the master of the slave as soon as possible.

This is a case where the master has failed or needs to be replaced and the need is to fail over to a new master. Typically, the process is to keep the slave in service until the master has been restored, then resume replication with the restored master. It could also be a case of normal master failover to a slave, where the slave takes the place of the master.

While there are many ways to perform this task, the slave needs to be synchronized with the master. Whether this is done by first promoting the slave (failover) then restoring the master via replication from the promoted slave or by first restoring the master and reestablishing replication, backup can provide a method of creating a copy of the data in either case to be restored on the other server.

3) Make a copy of a slave to another slave so the new slave can join the replication topology.

This is similar to the first common task above, but in this case, we are creating an additional slave. This differs in that a backup performed on a slave has no binary log information to record. Indeed, unless the slave is also a master to another sub-topology, the binary log is not typically used. In this case, additional work may be needed to allow the backup to record information about the master’s binary log in the backup logs for later use.

The same basic backup and restore process can be used in this task. But this requires metadata about the master’s binary log in order to start the new slave at the correct point. This information could be recorded by the backup itself when it is performed on the slave, or noted manually by the user at that time.

4)
My master died, I have ten slaves, each at a different position.

The goal of this task is to synchronize the slaves enough so that one can be used in a failover process to replace the master. This is typically the slave that has the least amount of data loss as measured by how much of the master’s binary log events it has executed. Using backup and restore in this task requires that one first apply the master's binary log (for point-in-time recovery), then promote that server to the master and point the others to it.

While this is a normal replication recovery effort, backup could be used to make a copy of the promoted slave, restore it on the old master, and then fail over from the new master to the old one, thereby restoring the original topology.

Note: If a topology contains slaves set up to replicate a portion of the data on the master and these segments are not divided by database, backup and restore may not be possible until selective restore is available.

Using Backup and Restore in Replication
---------------------------------------
The normal use of backup and restore in replication is to recover from errors due to machine or data failures. The goal of using backup is to ensure a consistent copy of the data and the goal of using restore is to ensure the lossless recovery of that data.

There are two basic scenarios where replication can benefit from backup and restore. The first is making a backup of the master to recover from failures (complete server recovery) and the second is the restoration of either a master or a slave to a specific point (point-in-time recovery). There are also scenarios where backup and restore can be used on slaves but these are less likely scenarios.

The following illustrate the use of backup and restore in a basic replication topology of a single master and one or more slaves.
1) When a Master Fails

This is when the master has been taken offline either under an intentional shutdown (taking offline for preventive maintenance) or as the result of a crash. There are many ways backup can be used in this scenario.

* Backup as a preventive measure by making regular copies of data.
* Backup from a known good slave and restore to a new master.
* Restore data to repair the master.
* Restore data to create a new master with the backup data.

A sample process for this scenario follows.

1. Failover to slave is performed.
2. A backup is taken on new master (former slave).
3. Backup image is restored to make the old master a slave.

Another variant of this process follows.

1. After failure of master, master is restored from backup image.
2. A new backup is performed on master.
3. The resulting backup image is restored on slave.
4. Replication is restarted.

2) Point-in-time Recovery

When a situation exists where some operations (transactions, queries, etc.) in the binary log cause a problem (crash or loss of data) and one wishes to restore the master (or slave, but that is a special case) to a certain point prior to the problem, it is possible to use backup to enhance this process.

The most common way to use backup and restore in this scenario is to restore a known good backup image on both the master and slave, then replay the binary logs from the master on the master, allowing the changes to replicate. For additional information on point-in-time recovery, see the MySQL Users Guide. This requires the following.

* Backups are being made on the master as a preventive measure.
* Copies of the binary logs are saved.
* Copies of the backup logs are saved (optional).

A sample process for this scenario follows.

1. Replication is stopped manually (if not already interrupted).
2. A restore is performed on both master and slave.
3. Replication is resumed.
4.
The binary log information as stored in the backup image file is examined and a starting and stopping point are determined. Note: this can be obtained using the backup_history log or via the MySQL backup client (future release).
5. The binary log is played on the master and the slave is allowed to catch up (process all events in its relay log).

Note: Some users will use grep to filter out the catastrophic operation from the binary log. In that case, the grep on the binlog must also remove the restore event (see below).

Use Cases
---------
A brief discussion of how backup and restore can be used in a replication scenario has been presented above. This section examines the details of the expected behavior and enhancements to the backup system under a variety of use cases.

Given the base replication pair of a master and a slave, the following use cases explain how backup and restore will enhance replication. For those scenarios where a slave can be a master to another slave, both use cases shall apply. For example, use cases 1 and 3 apply when backup is run on a slave that is also a master.

Use Case 1 - Backup performed on a master.

When a backup is performed on a master, the master shall not log the backup event nor shall the master replicate any data produced (logged) by the backup. Therefore, the slave remains unaffected by a backup run on the master. The backup shall record the binary log position, time, and file name and store the information in the backup_history log as well as in the backup image file.

Use Case 2 - Restore performed on a master.

When a restore is run on the master, a signal must be generated to tell the slave that a significant change that cannot be replicated has occurred on the master. This shall be accomplished by issuing a restore incident event which will cause the slave to stop.
Note: Once WL#4273 is complete, it may be possible for the slave upon reading the restore incident event to request a backup from the master and thereby allow for automatic propagation of the restore on the master to the slave. Until such time, the slave shall stop when a restore incident event is encountered.

Use Case 3 - Backup performed on a slave.

When backup is performed on a slave, the slave shall not log the backup event in its binary log (if enabled). Since data is only read, there is no effect on replication. Also, when a backup is run on a slave, no binary log information is stored in the backup history log because the slave's relay log is not the same as the binary log on the master.

Note: If the slave is behind the master, care should be taken to ensure the binary log from the master is preserved and the information about the slave with respect to the master’s binary log position recorded. The information from the slave concerning the master’s binary log position and the slave's relay log position shall be stored in the backup logs.

Use Case 4 - Restore performed on a slave.

When restore is run on a slave and no binary log is enabled, the restore is performed as on a normal server. However, it should be noted this requires that the slave stop replication with the master during the restore. In this case, the slave’s position will need to be recorded and replication restarted from that point once the restore is complete.

If a slave has its binary log enabled, a signal shall be written to the binary log in the form of a restore incident event. This shall permit the slave to act as a master and therefore signal its slaves to stop.

Note: Conflicts with the data from the master are possible if a restore is performed on a replicated database. In this case, the replicated events from the master will be blocked until after the restore is complete.
While the MySQL backup system shall not prohibit this use case, the user must consider the side effects of running a restore on a slave. Also, to protect any connected slaves (where the slave is also a master) from conflicting changes as a result of the restore, replication shall be blocked until the restore is complete and no additional slaves shall be permitted to connect until after the restore is complete and the restore incident event is written.

Use Case 5 - Backup run with no binary log.

Since there is no logging, the server cannot act as a master and therefore backup shall run as normal with nothing logged in the binary log.

Use Case 6 - Restore run with binary log turned on but no slaves attached.

When restore is run in this scenario, it is similar to running restore on a master but there are no slaves. However, slaves could attach at a later time and request data from the master. Therefore, a restore incident event shall be written to the binary log.

Requirements
------------
The following additional requirements are considered the minimal set which describes how MySQL backup and replication shall work together.

R01: BACKUP commands shall not be logged.
R02: BACKUP commands run on a slave shall record the slave's binary log information (if enabled) in the backup logs.
R03: RESTORE commands shall not be logged when run on a master.
R04: RESTORE commands run on a slave are not permitted unless replication is turned off.
R05: The effects of the RESTORE shall not be logged. A restore incident event shall be issued instead to signal any connected slaves to stop.
R06: Changes to the backup logs shall not be logged.
R07: Replication shall not be allowed to execute while doing restore. This means replication is not allowed to start during restore.
R08: Backup on a master shall not affect its slaves.
R09: Restore shall not record any data events in the binary log unless specifically requested.
     Note: If the slave is also a master then it shall generate a restore incident event. But if just a slave, then no such event shall be generated.
R10: At no time shall the BACKUP or RESTORE commands be written to the binary log.

Restrictions
------------
MySQL backup must not disrupt or interfere with any of the uses of replication. Indeed, it is the purpose of this worklog to ensure that backup is designed and implemented in a manner that is complementary with replication.

There are many uses for replication and many more possible replication topologies. This work shall focus on the most basic replication topology: a single master with a single slave connected. All other replication topologies can be represented given this base pairing. For example, a star replication topology is simply a single master with two or more slaves connected.

There are no cases where replication inhibits backup or restore aside from the normal locking issues common to applicable DML and DDL events. For example, a long transaction event could block backup or restore commands.

Restore of a subset of the replicated databases is not supported at this time.

A restore incident event triggering the copying of databases from a master to a slave (WL#4273) is not within the current scope.

Notes
-----
In later versions, the slave position will be saved and synched with the backup image and thus the restore will set the replication state correctly so that it may continue replication (see WL#4105).

In later versions, a special option may be provided to override not logging restore. The override would permit the logging of the writes to the data, not the actual restore command. This could be useful for situations where the default backup drivers are used (these are currently the only type of drivers that can support logging of data during restore).

RESTORE commands run on a master may require blocking of new slaves from connecting until the restore operation is complete.

Replaying the binary log should normally be done with binary logging off.

When a master fails and a new backup image is created from a slave, the restore incident event shall not be sent to the slave after failover to the original master.

Users who perform point-in-time recovery should take care to remove the restore incident event from the master’s binary log prior to replaying the events.
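
The last note can be illustrated with a minimal sketch. It assumes the binary log has already been decoded into a list of simple records; the BinlogEvent class, its field names, and the "incident" label are hypothetical stand-ins for illustration and do not reflect the server's actual event format or any existing tool.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class BinlogEvent:
    """Hypothetical, simplified stand-in for a decoded binary log event."""
    position: int   # byte offset of the event in the binary log
    kind: str       # illustrative label, e.g. "query" or "incident"
    payload: str


def events_to_replay(events: Iterable[BinlogEvent],
                     start: int, stop: int) -> List[BinlogEvent]:
    """Keep events with start <= position < stop, dropping incident events."""
    return [e for e in events
            if start <= e.position < stop and e.kind != "incident"]


# Illustrative log: a restore incident event sits between two ordinary
# statements, and the statement at position 260 is the one to avoid.
log = [
    BinlogEvent(4, "query", "INSERT INTO t VALUES (1)"),
    BinlogEvent(120, "incident", "restore incident"),
    BinlogEvent(180, "query", "INSERT INTO t VALUES (2)"),
    BinlogEvent(260, "query", "DROP TABLE t"),
]

replay = events_to_replay(log, start=4, stop=260)
print([e.payload for e in replay])
```

The stopping point excludes the damaging statement, and the incident event is filtered out so that replaying the remaining events does not stop the slave again.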
The backup system shall be modified to support replication services. Under no circumstances shall the implementation of backup interfere with or disrupt replication. There must be no loss of data or functionality as a result of a backup or restore operation.

Note: Temporary suspension of service and destructive restore notwithstanding.

Description of Operations
-------------------------
There are several replication conditions under which backup and restore operations can be performed. Each of these presents a unique set of challenges. The following lists all of the possible replication conditions that backup and restore operations can encounter and the behavior of the command with respect to replication.

Operation  Condition       MySQL Backup behavior WRT Replication
---------  --------------  ------------------------------------------
backup     on master       Master does not log anything. Replication
                           is not affected.
restore    on master       A restore incident event will be written
                           to the binary log.
backup     on slave        No binary log information is saved unless
                           slave is also a master. Replication is
                           not affected.
restore    on slave        A restore incident event will be written
                           to the binary log if enabled.
backup     no logging      No effect - nothing is logged.
restore    no replication  A restore incident event will be written
                           to the binary log.

Server Services Needed
----------------------
The following services are required in order to successfully implement the requirements of this worklog and behavior of backup and restore under the above replication conditions.

* Detect when replication is underway (binary log is active).
* Suspend/resume replication (turn binary log on/off).
* Disable/enable connections from slaves.

Note: Additional services may be provided pending completion of the design for WL#4280.

It was decided that these operations shall be provided via a service interface which encapsulates the varied mechanisms for executing these operations.
The service interface work is being completed under WL#4280.

Additional Information
----------------------
* This work shall resolve the defect reported in BUG#36533.
* Advanced testing techniques are needed to fully cover all tests of the requirements. See WL#4612 for details.
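
The replication conditions listed under Description of Operations can be restated as a small lookup. The sketch below is one reading of that table, not part of the design: it assumes the six rows differ only in the operation and in whether the local binary log is enabled. The names Operation and binlog_action are hypothetical.

```python
from enum import Enum


class Operation(Enum):
    BACKUP = "backup"
    RESTORE = "restore"


def binlog_action(operation: Operation, binlog_enabled: bool) -> str:
    """Return what the operation writes to the local binary log.

    Assumption: only a restore on a server with its binary log enabled
    writes anything, and what it writes is a restore incident event.
    """
    if operation is Operation.RESTORE and binlog_enabled:
        return "restore incident event"
    return "nothing"


# The six conditions from the table as (operation, binlog_enabled).
# "backup on slave" is shown for a slave without a binary log.
conditions = {
    "backup on master": (Operation.BACKUP, True),
    "restore on master": (Operation.RESTORE, True),
    "backup on slave": (Operation.BACKUP, False),
    "restore on slave, binary log enabled": (Operation.RESTORE, True),
    "backup, no logging": (Operation.BACKUP, False),
    "restore, no replication": (Operation.RESTORE, True),
}

for name, (op, enabled) in conditions.items():
    print(f"{name}: {binlog_action(op, enabled)}")
```

Under this reading, the server's role matters only through whether its binary log is enabled, which is why a slave that is also a master behaves like a master for restore.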
The operations described above shall be implemented as calls to the service interface (WL#4280) from the MySQL backup code.

Overview
--------
Wherever possible, the calls to the service interface methods shall be used and at no time shall any calls be made to the server facilities directly (data items, classes, methods, etc.).

There are two sets of operations that are needed to realize this worklog. There are operations concerned with the control or status of the binary log. There are also operations concerned with the status and control of slave- and master-related operations. The design of each of these is described in more detail below.

Binary Log Operations
---------------------
All operations associated with the binary log shall be made via the server service interface (see WL#4280). These calls shall be placed as high as possible in the execution path of MySQL backup code.

Slave and Master Operations
---------------------------
All operations associated with the status and control of a master or slave shall be made via the server service interface. These calls shall be placed in the locations best suited to meet the requirements. Some of these calls may require modification of the server code. For example, when a slave attempts to connect, a call to the service interface methods may be required to check that it is ok to allow the connection (no backup is running on the master).

Fulfillment of Requirements
---------------------------
The following section describes how each of the requirements described above shall be satisfied by the design and/or implementation of these features.

R01: BACKUP commands are not written to the binary logs. The code to initiate the write does not exist in the backup source code. No further action is required.
R02: BACKUP commands run on a slave record the master's binary log information by writing a note to the backup_progress log. The information shall include the name of the master's binary log and the position.
R03: RESTORE commands are not written to the binary logs. The code to initiate the write does not exist in the backup source code. No further action is required.
R04: RESTORE commands run on slaves shall generate an error and the restore command shall fail. Code shall be added prior to checking privileges for the restore command to call the service interface to identify whether the server is acting as a slave and if so generate an error and abort the command.
R05: The effects of the RESTORE shall not be logged. A restore incident event shall be issued instead to signal any connected slaves to stop. This requirement requires two actions. First, the code must be changed to turn off binlogging while the restore is running and turn it back on after the restore. The position of these commands shall in no way affect replication or logging of other commands. Second, code must be inserted to generate an incident event if the binlog is engaged.
R06: When data is written to the backup logs and the log destination includes the TABLE option, the code shall be modified to turn off binlogging for the local thread while the write is in progress and turn it back on after the write.
R07: Replication shall not be allowed to execute while doing restore. This means replication is not allowed to start during restore. This requirement is satisfied by adding code to disable slave connections during the restore. This code shall be inserted prior to restore starting, and slave connections shall be enabled again after restore.
R08: Backup on a master shall not affect its slaves. Since neither the backup command nor the backup log writes are written to the binary log, this requirement is satisfied by R01 and R06.
R09: Restore shall not record any data events in the binary log unless specifically requested. Since neither the restore command nor the backup log writes are written to the binary log, this requirement is satisfied by R03, R05, and R06.
R10: At no time shall BACKUP or RESTORE commands be written to the binary log. This is satisfied by R01 and R03.
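
The ordering implied by R04, R05, and R07 can be sketched as follows. This is a minimal illustration under stated assumptions: FakeServerServices is a hypothetical stand-in for the WL#4280 service interface, and every method name on it is invented here. Whether an incident event should still be written when the restore itself fails is not specified above; the sketch writes it only on success.

```python
class FakeServerServices:
    """Hypothetical stand-in for the service interface; names are invented."""

    def __init__(self, is_slave=False, binlog_enabled=True):
        self.is_slave = is_slave
        self.binlog_enabled = binlog_enabled
        self.binlog_active = binlog_enabled
        self.calls = []  # records the order of service calls

    def disable_slave_connections(self):
        self.calls.append("disable_slave_connections")

    def enable_slave_connections(self):
        self.calls.append("enable_slave_connections")

    def suspend_binlog(self):
        self.binlog_active = False
        self.calls.append("suspend_binlog")

    def resume_binlog(self):
        self.binlog_active = True
        self.calls.append("resume_binlog")

    def write_incident_event(self):
        self.calls.append("write_incident_event")


def run_restore(services, do_restore):
    """Run do_restore() wrapped in the steps described by R04, R05, R07."""
    if services.is_slave:
        # R04: restore fails while the server is acting as a slave.
        raise RuntimeError("RESTORE not permitted while acting as a slave")
    services.disable_slave_connections()      # R07: block slaves first
    try:
        binlog_was_on = services.binlog_enabled
        if binlog_was_on:
            services.suspend_binlog()         # R05: effects are not logged
        try:
            do_restore()
        finally:
            if binlog_was_on:
                services.resume_binlog()
        if binlog_was_on:
            services.write_incident_event()   # R05: signal slaves to stop
    finally:
        services.enable_slave_connections()   # R07: allow slaves again


svc = FakeServerServices()
run_restore(svc, lambda: svc.calls.append("restore"))
print(svc.calls)
```

Slave connections are re-enabled only after the incident event is written, which matches the statement in Use Case 4 that no additional slaves may connect until the restore is complete and the event is in the binary log.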
Copyright (c) 2000, 2021, Oracle Corporation and/or its affiliates. All rights reserved.