WL#3610: Multi-master upgrade 5.1
Affects: Server-6.0
—
Status: Complete
SUMMARY
-------
The purpose of this worklog is to outline the changes needed to upgrade a system consisting of several masters from version 5.1 to any version after 5.1, or to identify the situations where this is not possible. This worklog only contains the work to do for 5.1.

Note added by Trudy Pelzer, 2006-12-08:
Per Rafal Somla, this WL is mostly documentation and specification work. Any concrete solution of the problem will be put as a separate WL task.

REQUIREMENT
-----------
Assume that OLD version < NEW version and that the major version numbers of OLD and NEW differ by at most 1.

R1. It should always be possible to replicate OLD -> NEW, e.g. 4.1.10 -> 5.0.12 or 4.1.10 -> 4.1.12.

R2. It should always be possible to replicate NEW -> OLD, where NEW is the latest minor version within its major version, e.g. 5.1.20 -> 5.0.12 (if 5.1.20 is the latest 5.1 release).

R1 is an old requirement, R2 is a new requirement.

Counter-examples to these requirements
--------------------------------------
- BUG#24674: Not possible to replicate 5.0.20 -> 5.0.19 due to the new DEFINER clause in SP.

USE CASE
--------
The following use case is due to Jimmy. (In the description below read s/Cluster/Server/.)
- Replication of Cluster A (Master 5.1) to Cluster B (Slave 5.1)
- Stop Cluster B
- Upgrade Cluster B to 5.2
- Start replication from Cluster A (Master 5.1) to Cluster B (Slave 5.2)
- Wait for Cluster B to 'catch up' with missed changes
- Stop replication at Cluster B
- Promote Cluster B to Master/Active
- All changes should be replicated to Cluster A in case the upgrade is a failure and must be backed out (Master 5.2 to Slave 5.1)
- Stop replication at Cluster A
- Stop Cluster A
- Upgrade Cluster A to 5.2
- Start replication on Cluster A
- Cluster A (Slave 5.2) and Cluster B (Master 5.2)

------------------------------------------------------------------------------

Version Upgrade Discussion
Replication Meeting, Stockholm, 2 February 2007

Problem
-------
We want to have online upgradeability. When we upgrade servers, the replication should result in a consistent copy of the master's data on the slave, i.e. the slave converges to the master.

In any topology, you can remove a server and change the topology such that any servers that converged prior to removal can still converge. For example, the removal of a server from a circular replication topology must not result in loss of convergence between the remaining servers.

Terminology
-----------
For this document, versions are defined as different releases of the server executable with different semantics of the binary log.

Requirements
------------
The slave shall always have the same contents (state) as the master, or the contents of the slave shall converge to the contents of the master.

In a circular replication topology, one shall be able to upgrade the servers without taking all of them down. Since the masters may be used for load-balancing writes, it is necessary to change the topology to form a smaller circle, i.e., all the remaining servers shall still converge to the same state.

Slave and master shall always be able to communicate. [What is the purpose of this requirement?
/Matz]

No event shall be reapplied to the server. For example, given a circular replication topology with three servers, if one server is removed, the other servers shall not apply events that have already been applied.

Implementation Details
----------------------
Record server_id. Record global position.

Servers maintain a vector of {server_id, binlog_pos}* called a binlog_vector.

When a slave connects to a master, it sends its binlog_vector to the master, requesting the latest position where the binlog of the master converges. The master returns the latest position of its binlog to the slave. The slave processes the binlog from the position returned by the master.

Possible Solutions
------------------
1. New->Old, downgrade new.
2. Old->New, upgrade new.
3. Use a binlog_vector of all servers in the slave's master replication graph (containing server_id of the master, binlog position, and slave's binlog position).

Solution 1
----------
Later versions of the server can always read older versions of the binary log. Later versions of the server can always produce older versions of the binary log. An upgraded server needs to have input and output transformers. The input transformer can stop,

Scenario
--------
Consider servers {A, B, C} where all servers are the same version. Assume the old version can replicate to the new version, but the new version cannot replicate to the old version.

1. Initial topology is A->B, B->C, C->A.
2. When C is removed, A->B and B->A.
3. When C is upgraded to C', A->C'. Note that changes in C' are not replicated.
4. When B is upgraded to B', B'<->C'.
5. When A is upgraded to A', A'->B', B'->C', C'->A'.

Problems (Rafal)
----------------
A. Removing a server from replication setup.
   1. Avoid re-execution of replication events.
   2. When a new master->slave connection is established, where does replication start?
B. Upgrading all servers in a replication setup.

Solution for Problem A.1.
-------------------------
Use 'global' replication event positions.
Remember the position of the last seen event for every replication node. Don't execute events which were seen previously.

Solution for Problem A.2.
-------------------------
Use a binlog_pos vector.

1. Slave connects to master.
2. Slave sends its binlog_pos vector to the master.
3. Master searches its binlog for the oldest event that has not been seen by the slave (via the slave's binlog_pos vector).
4. Master begins sending the binlog starting from the event identified.

Solution for Problem B (#1)
---------------------------
Use a loop to remove the old node and reconnect using the solution for A. Upgrade, then reconnect (one direction only).

Solution for Problem B (#2)
---------------------------
Allow a new master to talk with an old slave.

Rafal's Ideas for Version Numbering
-----------------------------------
There are 2 kinds of version problems:

1) The binary format is totally different and cannot be read (has a major change). The solution is to step the binlog version.

2) A compatible extension to the format is created that permits an old slave to still read the event. How should a new slave know if it can use the extension or not? Solutions are:
   a) use the length of the event (current solution),
   b) use a minor version number, or
   c) use the server version number.

Mats will add a case where these solutions do not completely work.
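
The solutions for Problems A.1 and A.2 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a design from this worklog: the names ``Event``, ``Node``, ``global_pos``, ``apply`` and ``events_for`` are hypothetical, and each event is assumed to carry a per-origin position that increases monotonically.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    server_id: int    # server on which the event originated
    global_pos: int   # hypothetical per-origin, monotonically increasing position
    payload: str


@dataclass
class Node:
    server_id: int
    binlog: list = field(default_factory=list)  # events in local binlog order
    seen: dict = field(default_factory=dict)    # binlog_pos vector: server_id -> last seen position

    def apply(self, ev):
        """Problem A.1: execute an event only if it was not seen before."""
        if ev.global_pos <= self.seen.get(ev.server_id, 0):
            return False  # already seen, do not re-execute
        self.seen[ev.server_id] = ev.global_pos
        self.binlog.append(ev)
        return True

    def events_for(self, slave_vector):
        """Problem A.2: send the binlog starting at the oldest event the slave has not seen."""
        for i, ev in enumerate(self.binlog):
            if ev.global_pos > slave_vector.get(ev.server_id, 0):
                return self.binlog[i:]
        return []
```

Events sent after the starting point may still include some the slave already knows, since events from different origins are interleaved; the check in ``apply`` is what keeps those from being executed twice.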
Requirements:

- The ability to test version 5.2 on Cluster B as the master with live data changes. If this testing fails, they want to switch back to Cluster A as Master 5.1, without losing data.

WORK
----
A) Preparations in MySQL 5.1 to:
   a) make the format more stable (some refactoring and documentation),
   b) add checks so that the format can't be changed by all of our bug fixers (adding assertions)

B) Add a check that MySQL 5.1 gets the correct version of the binlog and produces a good error message otherwise. (This might already work, but the code should be reviewed to check if there are any flaws.)

UPGRADE PROCEDURE
-----------------
Here is an outline of an upgrade procedure. For the purpose of this example, we assume that we have three servers, A, B, and C, running servers of version 5.1, replicating in a circle (A B C), and we want to upgrade C to version 5.2. We also assume that servers A, B, and C have server ids 1, 2, and 3 respectively.

We assume that each server records at least the position of the next event to execute for each server, i.e., the ``Exec_Master_Log_Pos`` for each server, and that the information is stored in durable storage. We assume that the table in worklog WL#3835 is used to track progress with respect to individual masters.

We will use a number of helper methods, which we now outline as separate sections. These methods can be automated, but do require specialized clients, since there is no support for accessing individual columns of, e.g., the output from ``SHOW BINLOG EVENTS``.

Redirecting server A to replicate from server B
===============================================
The goal of this method is, in a topology (A->B->C->A), to switch over A to replicate from B instead and to disassociate C from the replication group in order to shut it down. This will form the topology (A<->B) (C).

The goal is to find a position on C such that there is a known position on B that A can start replicating from.
We assume that server C is off-line, i.e., no statements are executed on C that cause entries to be written to its binary log, and only the slave thread executes any events. This means that all entries written to the binary log of C are sent from B, and were created either at A or at B.

In order to explain the procedure, we will use a *marker event* that should be distinguishable from other events in the binary log. The marker event can be optimized away from this procedure, as long as one event can be identified and distinguished from the other events in the binary log. The marker event will be introduced in server B, will propagate to server C, and is there used to find a binlog position that can be used for switching over A to B.

[I actually think it is sufficient to pick an arbitrary event on B and there is no need for a distinguishable marker event for this procedure to work. It just needs to be an event that has not propagated to C yet, so that we can stop C at the right time. /Matz]

1. Stop slave C. The reason for this is that the marker event below shall not yet be executed on server C. At this step, we also assume that B is not too far behind A, i.e., that ``Seconds_behind_master`` is within a reasonable range.

2. Insert a marker event on B, e.g., create a unique table.

   - Note the position and file of this event and call these *BPos* and *BFile*. This is done by executing ``SHOW BINLOG EVENTS`` and looking in the ``Info`` column for an event containing the marker statement.

3. Start slave C and let it run until it reaches the log file position (*BFile*,*BPos*). This is done by issuing the commands::

       START SLAVE UNTIL MASTER_LOG_FILE=*BFile*, MASTER_LOG_POS=*BPos*;
       SELECT MASTER_POS_WAIT(*BFile*,*BPos*);

   After this step, the slave will have stopped just after executing the marker event. Since C is off-line, only the slave thread can execute statements that go into the binary log of C, so the last event in the binary log of C is the marker event.

4.
Find the end of the last binary log on C. This is trivially done by executing ``SHOW MASTER STATUS``. Use *CFile* and *CPos* to denote the file and position respectively.

   Since both (*BFile*,*BPos*) and (*CFile*,*CPos*) refer to the end of the marker event, we have now found a position that we can use to redirect server A to replicate from B instead of from C.

   Since there are no events being written to the binary log on C, it is now the case that server A either has replicated to this position and is waiting for more input, or has outstanding events to execute before reaching the marker event.

5. Stop slave A. If A has *not* reached and executed the marker event, do the following steps to execute the marker event and stop after it::

       START SLAVE UNTIL MASTER_LOG_FILE=*CFile*, MASTER_LOG_POS=*CPos*;
       SELECT MASTER_POS_WAIT(*CFile*, *CPos*);

6. Redirect A to replicate from B instead by issuing the following command on server A::

       CHANGE MASTER TO MASTER_HOST=*B*, MASTER_LOG_FILE=*BFile*, MASTER_LOG_POS=*BPos*;

   [Need to add a host as well and check the exact syntax of the command, but am too tired right now. /Matz]

7. Start slave A.

A and B are now replicating each other, and C is disconnected from the master group.

Main upgrade procedure
======================
1. Bring server C off-line so that all clients are disconnected and no new clients are allowed to connect. It is essential to allow a user with SUPER privileges to connect in order to perform the upgrade.

   This is eased by implementing WL#3836 on 5.1, but might be done manually in a more tedious manner by setting the ``@@read_only`` variable and kicking out the clients one by one as they complete their queries. Preventing new users from connecting can be handled by removing the users from the ``mysql.user`` table (and saving them in another table, in order to make it easy to restore them on starting the server again).
At this point, it is also necessary to switch off the scheduler, since it can execute statements causing events to be added to the binary log.

2. At this point, there is nothing that can generate events into the binary log of server C, so we allow all changes done at server C to propagate to server A. Server A now has no outstanding changes from server C.

   In order to simplify the upgrade procedure, a method for waiting for all applied events to propagate could be implemented, but this is not critical and is not needed in version 5.1.

3. Execute the redirection procedure above to redirect A to replicate from B instead of from C. After this step, server C will not be replicating at all and both slave threads will be stopped.

4. Shut down server C and upgrade the system from version 5.1 to 5.2.

5. Start server C in an off-line read-only configuration with no client connected and no changes done to the server.

6. Instruct server C to generate a binary log for version 5.1. This requires implementation of WL#3837 in version 5.2.

7. Bring server C on-line and turn off read-only. If the server was shut down as described in 1, no user will be able to connect to the server except the super user, which is what is desired. This point requires no additional changes to 5.1.

8. Instruct server C to replicate from server B and let it catch up with server B. This can be done either by taking a backup of B and installing it on C, or by just starting the slave threads on C::

       C> START SLAVE;

9. Change master of server A to server C. This requires that the C server is behind the A server, which can be accomplished by stopping the C server and letting A get ahead a little bit. In order to achieve this, we execute the following steps:

   a) On C: ``STOP SLAVE``

   b) Wait until the binlog position on A is later than the binlog position on C.
This can be accomplished by issuing a ``SHOW SLAVE STATUS`` on C and noting the binlog file and position (call these *BFile* and *BPos* respectively), and then on A executing::

          SELECT MASTER_POS_WAIT(*BFile*, *BPos*);

      Note that this position denotes a position on B for both A and C.

   c) On A: ``STOP SLAVE``

   d) On A: Do a ``SHOW SLAVE STATUS`` and get the file and binlog position last executed. Call them *AFile* and *APos* respectively.

   e) On C: execute the statements::

          START SLAVE UNTIL MASTER_LOG_FILE=*AFile*, MASTER_LOG_POS=*APos*;
          SELECT MASTER_POS_WAIT(*AFile*, *APos*);

      C has now stopped at the position where A was stopped.

   f) On C: Do a ``SHOW MASTER STATUS`` and take the values of ``File`` and ``Position`` and call them *CFile* and *CPos* respectively. We have now mapped the position where A stopped to a position on C.

   g) On A: set A to replicate from C starting with this position by executing the statement::

          CHANGE MASTER TO MASTER_HOST=*C*, MASTER_LOG_FILE=*CFile*, MASTER_LOG_POS=*CPos*

   h) Start slave A: ``START SLAVE``

   i) Start slave C: ``START SLAVE``

   We have now changed the master of A to be C and have a circular replication topology with the upgraded server incorporated and all servers running.

10. Repeat steps 1-9 for each server in turn.

11. Instruct each server to generate binary logs for version 5.2. This requires implementation of WL#3837 in version 5.2.

All servers of the master group are now upgraded.

REFERENCES
----------
See WL#3614: Multi-master upgrade 5.2
Copyright (c) 2001-2007 by MySQL AB. All rights reserved.

What is the binary log
======================
The binlog is a trace of changes of the server's global state generated during its operation. It consists of events describing changes of this state. More precisely, binlog events describe actions which can be used to reproduce the same changes of global state which have happened on the server.

There are two types of binlogs (chosen by server options):

- statement based binlog,
- row based binlog.

In the first type, events contain normal SQL queries which reproduce changes of the data. In the latter, apart from SQL query events there are special events which directly describe changes of table contents in terms of rows which have been added, deleted or changed. (TODO: file related events)

TODO: explain "internal events" and how they differ from other ones.

Binlog and replication
======================
The binlog is used for replication purposes. The idea is that the binlog created on the master server is shipped out to the slave server and executed there. Many details of binlog format and handling are specific to that use of the binlog.

Format of binlog and how it differs in different binlog versions
================================================================
Currently there are 3 binlog versions supported in the server code:

  v1   generated by mysqld v 3.23
  v3   generated by mysqld v 4.0.2-4.1
  v4   generated by mysqld v 5.0+

An obsolete format v2 was generated by early versions of 4.0 mysqld.
TODO: Start of binary log: binlog magic

In all three versions events have the following general layout:

  +===================+
  | event header      |
  +===================+
  |                   |
  | event data        |
  |                   |
  +===================+

Structure of the header
-----------------------
The first 13 bytes of the header (let's call it the fixed part of the header) contain the following information:

  +==================+
  | timestamp : 4    |
  +------------------+
  | type code : 1    |  // at EVENT_TYPE_OFFSET
  +------------------+
  | server_id : 4    |  // at SERVER_ID_OFFSET
  +------------------+
  | length : 4       |  // at EVENT_LEN_OFFSET
  +------------------+

Event type codes are defined in log_event.h. The length field contains the length of the whole event packet (including the header). Server_id is the unique server id set in the server configuration files for the purpose of replication.

The size of the event header is different in different versions of binlog:

  v1: 13 bytes
  v3: 19 bytes
  v4: flexible size specified by a format description event (see below) but at least 19 bytes.

Event headers in different binlog formats (v1-4) are backward compatible. That is, a header in a newer format starts with the header of the old format. This is why, regardless of the binlog format, the first 13 bytes of the header contain the same information (which is exactly the v1 header).

The v3 header contains 2 additional fields:

  +===============+
  | log_pos : 4   |
  +---------------+
  | flags : 2     |
  +===============+

TODO: find the meaning of log_pos field.
TODO: describe used flags.

Currently, the v4 header is the same as the v3 header. However, binlog format v4 provides a mechanism for adding extra fields to the header without breaking the format (this is implemented via the format description event).
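
As an illustration of the header layout above, here is a minimal Python sketch that unpacks the fixed 13-byte part and, when present, the two v3 fields. The helper name ``parse_header`` and the ``EventHeader`` tuple are hypothetical, the numeric values of the three offsets are derived from the field sizes shown above, and the sketch assumes that multi-byte integers are stored little-endian, which this document does not state.

```python
import struct
from collections import namedtuple

# Offsets derived from the field sizes in the layout above.
EVENT_TYPE_OFFSET = 4
SERVER_ID_OFFSET = 5
EVENT_LEN_OFFSET = 9
V1_HEADER_LEN = 13
V3_HEADER_LEN = 19

EventHeader = namedtuple(
    "EventHeader", "timestamp type_code server_id length log_pos flags"
)


def parse_header(buf, header_len=V3_HEADER_LEN):
    """Unpack an event header; log_pos and flags are None for a 13-byte v1 header."""
    if len(buf) < header_len:
        raise ValueError("buffer shorter than header")
    # "<" selects little-endian with no padding: 4 + 1 + 4 + 4 = 13 bytes.
    timestamp, type_code, server_id, length = struct.unpack_from("<IBII", buf, 0)
    log_pos = flags = None
    if header_len >= V3_HEADER_LEN:
        # The v3 additions follow the fixed part directly: 4 + 2 = 6 bytes.
        log_pos, flags = struct.unpack_from("<IH", buf, V1_HEADER_LEN)
    return EventHeader(timestamp, type_code, server_id, length, log_pos, flags)
```

A v4 header with extra fields would be parsed the same way for its first 19 bytes; the extra fields are not interpreted here.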
Structure of event data
-----------------------
In v4 of the format event data is split into two parts: a fixed length part called a post-header (or event-specific header) and an (optional) variable length part (let's call it event payload):

  +=================+
  | post-header     |  } fixed length (the same for all events of one type)
  +=================+
  |                 |  }
  | event payload   |  } length can vary for different events
  |                 |  }
  +=================+

Detailed structure of event packets in different binlog formats
---------------------------------------------------------------

v1:

  +==================+
  | timestamp : 4    |  }
  +------------------+  }
  | type code : 1    |  }
  +------------------+  }  v1 header (13 bytes)
  | server_id : 4    |  }
  +------------------+  }
  | length : 4       |  }
  +==================+
  |                  |  }
  | event data       |  }  length - 13 bytes
  |                  |  }
  +==================+

v3:

  +==================+
  | timestamp : 4    |  }
  +------------------+  }
  | type code : 1    |  }
  +------------------+  }
  | server_id : 4    |  }
  +------------------+  }
  | length : 4       |  }  v3 header (19 bytes)
  +------------------+  }
  | log_pos : 4      |  }
  +------------------+  }
  | flags : 2        |  }
  +==================+
  |                  |  }
  | event data       |  }  length - 19 bytes
  |                  |  }
  +==================+

v4:

  +=========================+
  | timestamp : 4           |  }
  +-------------------------+  }
  | type code : 1           |  }
  +-------------------------+  }
  | server_id : 4           |  }
  +-------------------------+  }
  | length : 4              |  }  v4 header (x bytes)
  +-------------------------+  }  x is given by format description event (FDE).
  | log_pos : 4             |  }
  +-------------------------+  }
  | flags : 2               |  }
  +-------------------------+  }
  | extra hdr fields : x-19 |  }
  +=========================+
  | post-header : y         |  // y, specific to event type, given by FDE.
  +=========================+
  |                         |  }
  | event payload           |  }  length - (x+y) bytes
  |                         |  }
  +=========================+

Format description event
------------------------
Each binlog starts with a format description event. (In v1 and v3 this is called Start event).
The layout of the format description event is as follows. Note that the v4 format description event starts with the v3 event.

v1 & v3:

  +==========================+
  | event header             |  // 13 (v1) or 19 (v3) bytes
  +==========================+
  | binlog_version : 2       |
  +--------------------------+
  | server_version : CHAR(x) |  // x = ST_SERVER_VER_LEN (=50)
  +--------------------------+
  | created? : 4             |
  +==========================+

TODO: semantics of created field.

v4:

  +==========================+
  | event header             |  // 19 bytes
  +==========================+
  | binlog_version : 2       |
  +--------------------------+
  | server_version : CHAR(x) |  // x = ST_SERVER_VER_LEN
  +--------------------------+
  | created? : 4             |
  +--------------------------+
  | header length : 1        |
  +==========================+
  |                          |  }
  | post-header lengths for  |  }  n bytes where n is the number of events
  | all event types          |  }  used in this binlog format.
  |                          |  }
  +==========================+

Binlog rotation event
---------------------
Binlog events are stored in several files to limit the size of a single file. In the code these files are referred to as "binlogs", thus the view is that we have several binlogs, each containing a range of binlog events. The binlogs are identified by names (corresponding to the names of the files in which they are stored).

At the end of each binlog a ROTATE event is written which points to the next binlog in the sequence. The layout of this event is as follows:

  +====================+
  | common header      |
  +====================+
  | position : 8       |
  +====================+
  |                    |
  | next log name      |
  |                    |
  +====================+

'position' is the position of the first event in the new binlog. Note that this should be a format description event.

TODO: How it looks in different binlog versions?
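
The two layouts above can be read with a short Python sketch. It follows only the diagrams in this document: the function names ``parse_format_description`` and ``parse_rotate`` are hypothetical, integers are assumed to be little-endian, the server version string is assumed to be NUL-padded, and the Rotate layout is assumed to use a 19-byte common header (the TODO above leaves other versions open).

```python
import struct

ST_SERVER_VER_LEN = 50
EVENT_LEN_OFFSET = 9


def parse_format_description(event, header_len=19):
    """Parse a complete Start / format description event packet (header + data)."""
    (length,) = struct.unpack_from("<I", event, EVENT_LEN_OFFSET)
    pos = header_len
    (binlog_version,) = struct.unpack_from("<H", event, pos)
    pos += 2
    raw_version = event[pos:pos + ST_SERVER_VER_LEN]
    pos += ST_SERVER_VER_LEN
    (created,) = struct.unpack_from("<I", event, pos)
    pos += 4
    info = {
        "binlog_version": binlog_version,
        # Assumption: the fixed-width field is padded with NUL bytes.
        "server_version": raw_version.split(b"\0", 1)[0].decode("ascii", "replace"),
        "created": created,
    }
    if binlog_version >= 4:
        info["header_length"] = event[pos]
        pos += 1
        # One byte per event type, up to the end of the packet.
        info["post_header_lengths"] = list(event[pos:length])
    return info


def parse_rotate(event, header_len=19):
    """Return (position, next log name) from a complete Rotate event packet."""
    (length,) = struct.unpack_from("<I", event, EVENT_LEN_OFFSET)
    (position,) = struct.unpack_from("<Q", event, header_len)
    name = event[header_len + 8:length].decode("ascii", "replace")
    return position, name
```

For a v1 Start event the caller would pass ``header_len=13``; nothing here validates the type code, which a real reader would check first.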
Recognizing binlog format version
=================================
To correctly handle future format updates, the following should be true for all future formats:

a) the binlog starts with a format description event,
b) this event should start with the v3 header (19 bytes),
c) the 2 bytes at positions 20 and 21 (counting from 1, i.e. the two bytes immediately after the 19-byte header) should contain the format version number.

Note that requirement b) doesn't hold for v1 of the format. Thus v1 must be handled separately. One can recognize binlog format v1 by:

1) reading the type code from 1 byte at position EVENT_TYPE_OFFSET (=4) and checking that it is START_EVENT_V3 (=1)
2) reading the length of the event packet from 4 bytes at position EVENT_LEN_OFFSET (=9) and checking that it is smaller than (19+6+ST_SERVER_VER_LEN = 75)

This is exactly what is done inside mysqlbinlog to recognize binlog format v1.

For all other binlog formats, assuming that requirements a-c) are preserved, a procedure to recognize the binlog format is as follows: read 2 bytes at position 20 of the first event packet -- this gives the binlog version.

Further sanity checks can be added. For example, if the format version read is v3 or v4 then the type code (1 byte at EVENT_TYPE_OFFSET) should be START_EVENT_V3 (=1) or FORMAT_DESCRIPTION_EVENT (=15), respectively.

Note: The format description event in v4 is designed so that it can handle future format updates. A new format with the same layout of event packets as in v4 but with additional fields in the header and post-headers can use this format description event to correctly describe itself.

Actually, it is (theoretically) possible to have different "flavours" of binlog format v4 which have different (bigger) header lengths and even a different number of events. Current code is written to take care of this possibility. I.e. any code parsing a binlog and discovering that it is v4 uses header lengths as given by the format description event (thus potentially different from the values hard-wired in the server code).
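
The recognition procedure above can be sketched in Python as follows. This is an illustration only, not the mysqlbinlog code: the function name ``detect_binlog_version`` is hypothetical, integers are assumed to be little-endian, and the v1 threshold is computed as 19 + 6 + ST_SERVER_VER_LEN with ST_SERVER_VER_LEN = 50.

```python
import struct

EVENT_TYPE_OFFSET = 4
EVENT_LEN_OFFSET = 9
START_EVENT_V3 = 1
FORMAT_DESCRIPTION_EVENT = 15
ST_SERVER_VER_LEN = 50
V3_HEADER_LEN = 19
# Size of a complete v3 Start event: header + version (2) + created (4) + server version.
V3_START_EVENT_LEN = V3_HEADER_LEN + 6 + ST_SERVER_VER_LEN


def detect_binlog_version(first_event):
    """Return the binlog format version given the bytes of the first event packet."""
    if len(first_event) < V3_HEADER_LEN + 2:
        raise ValueError("first event too short")
    type_code = first_event[EVENT_TYPE_OFFSET]
    (length,) = struct.unpack_from("<I", first_event, EVENT_LEN_OFFSET)
    # v1 is the special case: its Start event has a 13-byte header and is
    # therefore shorter than a v3 Start event.
    if type_code == START_EVENT_V3 and length < V3_START_EVENT_LEN:
        return 1
    # Otherwise the version number directly follows the 19-byte header.
    (version,) = struct.unpack_from("<H", first_event, V3_HEADER_LEN)
    # Sanity check from the text: v3 -> Start event, v4 -> format description event.
    expected = {3: START_EVENT_V3, 4: FORMAT_DESCRIPTION_EVENT}.get(version)
    if expected is not None and type_code != expected:
        raise ValueError("type code does not match binlog version")
    return version
```

Versions other than 3 and 4 are returned unchecked, on the assumption that a future format keeps requirements a) to c).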
Note: Although headers of event packets in v4 of binlog can be longer than 19 bytes, the format description event is an exception. Its header is always 19 bytes long to meet the above backward compatibility requirements.

Where binlog events are read/parsed in the code
===============================================
1. Replication master's binlog dump thread (this is the thread handling the COM_BINLOG_DUMP command sent by a slave). It reads events from binlog files and sends them to the replication slave over the SQL connection.

   Code: mysql_binlog_send() function in sql/sql_repl.cc

2. Replication slave's I/O thread: Reads events sent by master's dump thread and stores them in relay logs (local to slave's host).

   Code: handle_slave_io() function in sql/slave.cc

3. Replication slave's SQL thread: Reads events from relay logs and executes them.

   Code: handle_slave_sql() function in sql/slave.cc

4. Mysqlbinlog program: Reads events from binlog files or sent by a mysqld server and presents them in the form of SQL queries.

   Code: dump_{local,remote}_log_entries() functions in client/mysqlbinlog.cc

Notes:

1a. If the current binlog file ends, mysql_binlog_send() switches to the next file.

1b. Mysql_binlog_send() sends a "fake" Rotate event before sending any other events from a binlog and after each single binlog file.

1c. In case an event offset was specified when requesting binlog dump, mysql_binlog_send() finds the format description event at the beginning of the binlog file and sends it first, before proceeding to the events at the given offset.

How different binlog formats are handled in the slave
=====================================================
The I/O thread reads the version of the server running on the master from the mysql->server_version value (where mysql is an object representing the established connection to the master). It aborts replication if the master's version is x.y.z with x <= 2.

The I/O thread reads events using the cli_safe_read() function which simply reads protocol packets (?).
Thus, this is binlog-version transparent.

Before writing events to the relay log, the I/O thread can process them:

- if it considers the current binlog to be in format v1-3 it translates events to v4 format,
- otherwise it copies event packets as they are, except that it recognizes and parses FD and Rotate events. It also recognizes (and doesn't put in the relay log) a Stop event.

The I/O thread decides about the binlog version based on the FD events it parses. Before the first FD event arrives, the binlog version is based on the master's server version. For masters with server version x.y.z with x > 4 the binlog version is assumed to be v4 (even though in fact it can be higher).

Parsing FD and Rotate events is done by constructors of the corresponding objects. These constructors don't do any checks on the binlog version.

Note: the Rotate_log_event constructor reads header and post-header lengths from the current format description event (thus possibly binlog format sensitive).

Translation of events from old formats to the current format is done by parsing event data using Log_event::read_log_event() (i.e., in the constructors of event objects) and then writing the event objects to the relay log.

Note: Looks like the event object constructors don't trust the binlog version stored in the FD event passed to them.
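
The I/O thread's initial guess described above can be summarised in a small Python sketch. The function names are hypothetical and this is not the server code. The text above only fixes the cases x <= 2 (abort) and x > 4 (v4); the mapping of 3.x to v1 and 4.x to v3 is an assumption taken from the version table earlier in this document, and it ignores the obsolete v2 format of early 4.0 servers.

```python
def initial_binlog_version(master_server_version):
    """Guess the binlog version from the master's server version string,
    for use until the first format description event has been parsed."""
    major = int(master_server_version.split(".", 1)[0])
    if major <= 2:
        # Replication is aborted for masters this old.
        raise ValueError("master version too old for replication")
    if major == 3:
        return 1  # assumption: 3.23 writes v1
    if major == 4:
        return 3  # assumption: 4.0.2-4.1 writes v3
    return 4      # x > 4: assumed to be v4, even if it is in fact higher


def needs_translation(binlog_version):
    """Events in formats v1-3 are translated to v4 before reaching the relay log."""
    return binlog_version < 4
```

Once a format description event arrives, the version it carries would replace this guess.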
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.