WL#3610: Multi-master upgrade 5.1

Affects: Server-6.0   —   Status: Complete

SUMMARY
-------

The purpose of this worklog is to outline the necessary changes to 
make in order to upgrade a system consisting of several masters 
from version 5.1 to any version after 5.1, or to outline situation 
where this is not possible. This worklog only contain work to do for 5.1.

Note added by Trudy Pelzer, 2006-12-08
Per Rafal Somla, this WL is mostly documentation and specification work. 
Any concrete solution of the problem will be put as a separate WL task.

REQUIREMENT
-----------
Assume that OLD version < NEW version and that the major 
version number of OLD and NEW differ with at most 1.

R1. It should always be possible to replicate OLD -> NEW,
    e.g. 4.1.10 -> 5.0.12  or  4.1.10 -> 4.1.12.

R2. It should always be possible to replicate NEW -> OLD, 
    where NEW is the latest minor version within its 
    major version, e.g. 5.1.20 -> 5.0.12 (if 5.1.20 is the 
    latest 5.1 release).

R1 is old requirement, R2 is new requirement.

Counter-examples to these requirements
--------------------------------------
- BUG#24674 : Not possible to replicate 5.0.20->5.0.19 due to
  new DEFINER clause in SP.


USE CASE
--------

The following use case is due to Jimmy.

(In the description below read s/Cluster/Server/.)

- Replication of Server A (Cluster 5.1) to Cluster B (Slave 5.1)
- Stop Cluster B
- Upgrade Cluster B to 5.2
- Start replication from Cluster A (Master 5.1) to Cluster B (Slave 5.2)
- Wait for Cluster B to 'catch up' with missed changes
- Stop replication at Cluster B
- Promote Cluster B to Master/Active
- All changes should be replicated to Cluster A in case upgrade is a failure
and must be backed-out (Master 5.2 to Slave 5.1)
- Stop replication at Cluster A
- Stop Cluster A
- Upgrade Cluster A to 5.2
- Start replication on Cluster A
- Cluster A (Slave 5.2) and Cluster B (Master 5.2)


------------------------------------------------------------------------------

Version Upgrade Discussion
Replication Meeting, Stockholm, 2 February 2007

Problem
-------

We want to have online upgradeability.

When we upgrade servers, the replication should result in a consistent
copy of the master's data on the slave, i.e. the slave converged to the
master.

In any topology, you can remove a server and change the topology such
that any servers that converged prior to removal can still converge. For
example, the removal of a server from a circular replication topology
cannot result in loss of convergence between the remaining servers.

Terminology
-----------

For this document, versions are defined as different releases of the
server executable with different semantics of the binary log.

Requirements
------------

The slave shall always have the same contents (state) as the master, or the
contents of the slave shall converge to the contents of the master.

In a circular replication topology, one shall be able to upgrade the
servers without taking all of them down.

Since the masters may be used for load-balancing writes, it is necessary to
change the topology to form a smaller circle, i.e., all the remaining servers
shall still move to converge to the same state.

Slave and master shall always be able to communicate. [What is the purpose of
this requirement? /Matz]

No event shall be reapplied to the server. For example, given a circular
replication topology with three servers, if one server is removed, the
other servers shall not apply events that have already been applied.

Implementation Details
----------------------

Record server_id.

Record global position.

Servers maintain a vector of {server_id, binlog_pos}* called a
binlog_vector.

When slave connects to master, it sends it?s binlog_vector to the master
requesting the latest position where binlog of the master converges.
The master returns the latest position of its binlog to the slave.
The slave processes the binlog from the position master.

Possible Solutions
------------------

1. New->Old, downgrade new.
2. Old->New, upgrade new.
3. Use a binlog_vector of all servers in the slave's master replication
   graph containing server_id of the master, binlog position, and slave's
   binlog position).

Solution 1
----------

Later versions of the server can always read older versions of the
binary log.

Later versions of the server can always produce older versions of the
binary log.

An upgraded server needs to have input and output transformers.
The input transformer can stop,


Scenario
--------

Consider servers {A, B, C} where all servers are the same version.
Assume old version can replicate to new version, but new version cannot
replicate to old version.

1. Initial topology is A->B, B->C, C->A.
2. When C is removed, A->B and B->A.
3. When C is upgraded to C?, A->C?. Note that changes in C? are not
replicated.
4. When B is upgraded to B?, B?<->C?.
5. When A is upgraded to A?, A?->B? B?->C? C?->A?.

Problems (Rafal)
----------------

A. Removing a server from replication setup.
1. Avoid re-execution of replication events.
2. When new master->slave connection established, where does replication
start?
B. Upgrading all servers in a replication setup.

Solution for Problem A.1.
-------------------------

Use 'global' replication event positions.

Remember position of last seen event for every replication node.

Don't execute events which were seen previously.

Solution for Problem A.2.
-------------------------

Use binlog_pos vector.

Slave connects to master.

Slave sends binlog_pos vector to the master.

Master searches its binlog for the oldest event that has not been seen
by the slave (via the slave?s binlog_pos vector).

Master begins sending binlog starting from the event identified.

Solution for Problem B (#1)
---------------------------

Use a loop to remove old node and reconnect using solution for A.

Upgrade then reconnect (one direction only).

Solution for Problem B (#2)
---------------------------

Allow new master to talk with old slave.

Rafal's Ideas for Version Numbering
-----------------------------------

There are 2 kinds of version problems:

1) If the binary is totally different and cannot be read (has a major
   change). Solution is to step binlog version.

2) A compatible extension to the format is created and permits an old
   slave to still read the event. How should a new slave know if it can use
   the extension or not? Solutions are:

   a) use the length of the event (current solution),

   b) use minor version number, or

   c) use server version number.

Mats will add a case where these solutions do not completely work.
Requirements:

- The ability to test version 5.2 on Cluster B as the master with live data
  changes.  If this testing fails, they want to switch back to Cluster A as
  Master 5.1, without losing data.


WORK
----

A) Preparations in MySQL 5.1 to:
   a) make the format more stable (some refactoring and documentation),
   b) add checks so that the format can't be changed by all of 
      our bug fixers (adding assertions)

B) Add check that MySQL 5.1 gets correct version of binlog and
   produces good error message otherwise.  (This might already work,
   but the code should be reviewed to check if there are any flaws.)


UPGRADE PROCEDURE
-----------------

Here is an outline of an upgrade procedure. For the purpose of this example, we
assume that we have three servers, A, B, and C, running servers of version 5.1,
replicating in a circle (A B C), and we want to upgrade C to version 5.2. We
also assume that server A, B, and C has server id 1, 2, and 3 respectively.

We assume that each server records at least the position of the next event to
execute for each server, i.e., the ``Exec_Master_Log_Pos`` for each server and
that the information is stored in durable storage. We assume that the table in
worklog WL#3835 is used to track progress with respect to individual masters.

We will use a number of helper methods, which we now outline as separate
sections. These methods can be automated, but does require specialized clients,
since there is no support for accessing individual columns of, e.g., the output
from ``SHOW BINLOG EVENTS``.


Redirecting server B to replicate from server A
===============================================

The goal of this method is to in a topology (A->B->C->A) to switch over A to
replicate from B instead and disassociate C from the replication group in order
to shut it down. This will form the topology (A<->B) (C). The goal is to find a
position on C such that there is a known position on A that B can start
replicating from.

We assume that server C is off line, i.e., no statements are executed on C such
that there are entries written to the binary log for this and only the slave
thread executes any events. This means that all entries written to the binary
log of C are sent from A, and are either created at A or at B.

In order to explain the procedure, we will use a *marker event* that should be
distinguishable from other events in the binary log. The marker event can be
optimized away from this procedure, as long as one event can be identified and
distinguished from the other events in the binary log.  The marker event will be
introduced in server B, will propagate to server C, and there used to find a
binlog position that can be used for switching over A to B.

[I actually think it is sufficient to pick an arbitrary event on B and there is
no need for a distinguishable marker event for this procedure to work. It just
needs to be an event that has not propagated to C yet, so that we can stop C at
the right time. /Matz]

1. Stop slave C. The reason for this is that the marker event below shall
   not be executed on the server B. At this step, we also assume that B is
   not too far behind A, i.e., that ``Seconds_behind_master`` is within
   reasonable range.

2. Insert a marker event on B, e.g., create a unique table.
   
   - Note the position and file of this event and call these *BPos*
     and *BFile*.

     This is done by executing ``SHOW BINLOG EVENTS`` and looking in
     the ``Info`` column for an event containing the marker statement.

3. Start slave C and let it run until it reaches the log file
   position (*BFile*,*BPos*). This is done by issuing the commands::

      START SLAVE UNTIL
          MASTER_LOG_FILE=*BFile*, MASTER_LOG_POS=*BPos*;
      SELECT MASTER_POS_WAIT(*BFile*,*BPos*);

   After this step, the slave will have stopped just after executing the
   marker event. Since C is off-line, only the slave thread can execute
   statements that go into the binary log of C, so the last event in
   the binary log of C is the marker event.

4. Find the end of the last binary log on C. This is trivially done by
   executing ``SHOW MASTER STATUS``. Use *CFile* and *CPos* to denote
   the file and position respectively.

   Since both (*BFile*,*BPos*) and (*CFile*,*CPos*) refers to the end
   of the marker event, we have now found a position that we can use
   to redirect server A to replicate from B instead of from C.

   Since there are no events being written to the binary log on C, it is
   now the case that server A either has replicated to this position, and
   is waiting for more input, or have outstanding events to execute before
   reaching the marker event.

5. Stop slave A. If A has *not* reached and executed the marker event, do
   the following steps to execute the marker event and stop after it::

       START SLAVE UNTIL
           MASTER_LOG_FILE=*CFile*, MASTER_LOG_POS=*CPos*;
       SELECT MASTER_POS_WAIT(*CFile*, *CPos*);

6. Redirect A to replicate from B instead by issuing the following command
   on server A::

       CHANGE MASTER TO
           MASTER_HOST=*B*,
           MASTER_LOG_FILE=*BFile*, MASTER_LOG_POS=*BPos*;

   [Need add a host as well and check the exact syntax of the command, but
   is to tired right now. /Matz]

7. Start slave A.

A and B are now replicating each other, and C is disconnected from the master group.


Main upgrade procedure
======================

1. Bring server C off-line so that all clients are disconnected and no new
   clients are allowed to connect. It is essential to allow a user with SUPER
   privileges to connect in order to perform the upgrade.

   This is eased by implementing WL#3836 on 5.1, but might be done manually in
   a more tedious manner by setting the ``@@read_only`` variable and kicking
   out the clients one by one as they complete their queries. Preventing new
   users to connect can be handled by removing the users from the ``mysql.user``
   table (and saving them in another table, in order to make it easy to restore
   them on starting the server again.

   At this point, it is also necessary to switch off the scheduler, since
   it can execute statements causing events to be added to the binary log.

2. At this point, there is nothing that can generate events into the binary
   log of server C, so we allow all changes done at server C to propagate to
   server A.

   Server A now has no outstanding changes from server C.

   In order to simplify the upgrade procedure, a method for waiting for all
   applied events to propagate could be implemented, but this is not critical
   and is not needed in version 5.1.

3. Execute the redirection procedure above to redirect A to replicate from B
   instead of from C. After this step, the server C will not be replicating at
   all and both slave threads will be stopped.

4. Shut down server C and upgrade the system from version 5.1 to 5.2.

5. Start server C in an off-line read-only configuration with no client
   connected and no changes done to the server.

6. Instruct server C to generate a binary log for version 5.1.

   This requires implementation of WL#3837 in version 5.2.

7. Bring server C on-line and turn off read-only.

   If the server was shut down as described in 1, no user will be able to
   connect to the server except the super user, which is what is desired.

   This point requires no additional changes to 5.1.

8. Instruct server C to replicate from server B and let it catch up
   with server B. This can be done by either taking a backup of B
   and install it on C, or by just starting the slave threads on C.

     C> START SLAVE;

9. Change master of server A to server C.

    This requires that the C server is behind the A server, which can be
    accomplished by stopping the C server and let A get ahead a little bit.
    In order to achieve this, we execute the following steps:

    a) On C: ``STOP SLAVE``

    b) Wait until the binlog position on A is later than binlog position
       on C.  This can be accomplished by issuing a ``SHOW SLAVE STATUS``
       on C and noting the binlog file and position (call these *BFile*
       and *BPos* respectively), and then on A execute::

           SELECT MASTER_POS_WAIT(*BFile*, *BPos*);

       Note that this position denotes a position on B for both A and C.

    c) On A: ``STOP SLAVE``

    d) On A: Do a ``SHOW SLAVE STATUS`` and get the file and binlog
       position last executed. Call them *AFile* and *APos* respectively.

    e) On C: execute the statement::

         START SLAVE UNTIL
             MASTER_LOG_FILE=*AFile*, MASTER_LOG_POS=*APos*;
         SELECT MASTER_POS_WAIT(*AFile*, *APos*);

       C has now stopped at the position where A was stopped.

    f) On C: Do a ``SHOW MASTER STATUS`` and take the value of
       ``File`` and ``Position`` and call them *CFile* and *CPos*
       respectively.

       We now have mapped the position where A stopped to a position
       on C

    g) On A: set A to replicate from C starting with this position by
       executing the statement::

         CHANGE MASTER TO
             MASTER_HOST=*C*,
             MASTER_LOG_FILE=*CFile*, MASTER_LOG_POS=*CPos*

    h) Start slave A: ``START SLAVE``

    i) Start slave C: ``START SLAVE``

    We have now changed the master of A to be C and now have a circular 
    replication topology with the upgraded server incorporated and all
    servers are running.

10. Repeat step 1-9 for each server in turn.

11. Instruct each server to generate binary logs for version 5.2.

    This requires implementation of WL#3837 in version 5.2.


All servers of the master group is now upgraded.


REFERENCES
----------

See WL#3614: Multi-master upgrade 5.2
Copyright (c) 2001-2007 by MySQL AB. All rights reserved.

What is the binary log
=======================

Binlog is a trace of changes of the server's global state generated during its
operation. It consists of events describing changes of this state. More
precisely, binlog events describe actions which can be used to reproduce the
same changes of global state which have happened on server.

There are two types of binlogs (choosen by server options): 

- statement based binlog,
- row based binlog.

In the first type, events contain normal SQL queries which reproduce changes of
the data. In the former, apart from SQL query events there are special events
which directly describe changes of table contents in terms of rows which have
been added, deleted or changed.

(TODO: file related events)

TODO: explain "internal events" and how they differ from other ones.


Binlog and replication
=======================

Binlog is used for the replication purposes. The idea is that binlog created on
master server is shipped out to slave server and executed there. Many details of
binlog format and handling are specific to that use of the binlog.


Format of binlog and how it differs in different binlog versions
=================================================================

Currently there are 3 binlog versions supported in the server code:

v1	generated by mysqld v 3.23
v3      generated by mysqld v 4.0.2-4.1
v4      generated by mysqld v 5.0+

An obsolete format v2 was generated by early versions of 4.0 mysqld.

TODO: Start of binary log: binlog magic

In all three versions events have the following general layout:

  +===================+
  |   event header    |
  +===================+
  |                   |
  |    event data     |
  |                   |
  +===================+

Structure of the header
-----------------------
  
First 13 bytes of the header (let's call it fixed part of the header) contain
the following information:

  +==================+
  | timestamp    : 4 |	
  +------------------+
  | type code    : 1 |     // at EVENT_TYPE_OFFSET
  +------------------+     
  | server_id    : 4 |     // at SERVER_ID_OFFSET
  +------------------+     
  | length       : 4 |     // at EVENT_LEN_OFFSET
  +------------------+

Event type codes are defined in log_event.h. The length field contains the
length of whole event packet (including the header). Server_id is the unique
server id set in the server configuration files for the purpose of replication.

Size of the event header is different in different versions of binlog:

v1:  13 bytes
v3:  19 bytes
v4:  flexible size specified by a format description event (see below) but at 
     least 19 bytes.

Event headers in different binlog formats (v1-4) are backward compatibile. That
is, a header in newer format starts with the header of old format. This is why
regardless of the binlog format the first 13 bytes of the header contain the
same information (which is exactly the v1 header).

The v3 header contains 2 additional fields:

  +===============+
  | log_pos   : 4 |
  +---------------+
  | flags     : 2 |
  +===============+

TODO: find the meaning of log_pos field.
TODO: describe used flags.

Currently, v4 header is the same as v3 header. However, binlog format v4
provides mechanism for adding extra fields to the header without breaking the
format (this is implemented via format description event).

Structure of event data
-----------------------

In v4 of the format event data is split into two parts: fixed length part called
a post-header (or event-specific header) and an (optional) variable length part
(let's call it event payload):

  +=================+   
  |   post-header   |   } fixed length (the same for all events of one type)
  +=================+   
  |                 |   }
  |  event payload  |   } length can vary for different events
  |                 |   }
  +=================+


Detailed structure of event packets in different binlog formats
---------------------------------------------------------------

v1:

  +==================+
  | timestamp    : 4 |	}
  +------------------+  }
  | type code    : 1 |  }  
  +------------------+  }  v1 header (13 bytes)  
  | server_id    : 4 |  }  
  +------------------+  }   
  | length       : 4 |  }   
  +==================+
  |                  |  }
  |   event data     |  }  length - 13 bytes
  |                  |  }
  +==================+
  
v3:

  +==================+
  | timestamp    : 4 |	}
  +------------------+  }
  | type code    : 1 |  }  
  +------------------+  }     
  | server_id    : 4 |  }  
  +------------------+  }   
  | length       : 4 |  }  v3 header (19 bytes)
  +------------------+  }
  | log_pos      : 4 |  }
  +------------------+  }
  | flags        : 2 |  }
  +==================+
  |                  |  }
  |   event data     |  }  length - 19 bytes
  |                  |  }
  +==================+

v4:

  +=========================+
  | timestamp        : 4    |  }
  +-------------------------+  }
  | type code        : 1    |  }  
  +-------------------------+  }     
  | server_id        : 4    |  }  
  +-------------------------+  }   
  | length           : 4    |  }  v4 header (x bytes)
  +-------------------------+  }  x is given by format description event (FDE).
  | log_pos          : 4    |  }
  +-------------------------+  }
  | flags            : 2    |  }
  +-------------------------+  }
  | extra hdr fields : x-19 |  }   
  +=========================+
  |    post-header   : y    |  // y, specific to event type, given by FDE.
  +=========================+
  |                         |  }
  |      event payload      |  }  length - (x+y) bytes
  |                         |  }
  +=========================+


Format description event
------------------------

Each binlog starts with a format description event. (In v1 and v3 this is called
Start event). The layout of format description event is as follows. Note that
the v4 format description event starts with v3 event.

v1 & v3:
 
 +==========================+
 |      event header        |  // 13 (v1) or 19 (v3) bytes
 +==========================+
 | binlog_version : 2       |
 +--------------------------+
 | server_version : CHAR(x) |  // x = ST_SERVER_VER_LEN (=50)
 +--------------------------+
 | created?       : 4       |
 +==========================+

TODO: semantics of created field.

v4:

 +==========================+
 |      event header        |  // 19 bytes
 +==========================+
 | binlog_version : 2       |
 +--------------------------+
 | server_version : CHAR(x) |  // x = ST_SERVER_VER_LEN
 +--------------------------+
 | created?       : 4       |
 +--------------------------+
 | header length  : 1       |
 +==========================+
 |                          | }
 | post-header lengths for  | } n bytes where n is the number of events
 |     all event types      | } used in this binlog format.
 |                          | }
 +==========================+


Binlog rotation event
---------------------

Binlog events are stored in several files to limit size of a single file. In the
code these files are reffered to as "binlogs", thus the view is that we have
several binlogs each containing a range of binlog events. The binlogs are
identified by names (corresponding to names of the files in which they are
stored). At the end of each binlog a ROTATE event is written which points to the
next binlog in a squence. The layout of this event is as follows:

 +====================+
 |   common header    |
 +====================+
 |    position    : 8 |
 +====================+
 |                    |
 |    next log name   |
 |                    |
 +====================+

 'position' is the position of the first event in the new binlog. Note that 
 this should be a format description event.

TODO: How it looks in different binlog versions?


Recognizing binlog format version
==================================

To correctly handle future format updates, the following should be true for all
future formats:

a) binlog starts with format description event,
b) this event should start with v3 header (19 bytes),
c) the 2 bytes at positions 20 and 21 should contain format version number.

Note that requirement b) doesn't hold for v1 of the format. Thus v1 must be
handled separately. One can recognize binlog format v1 by:

1) reading type code from 1 byte at postion EVENT_TYPE_OFFSET (=4) and checking 
   that it is START_EVENT_V3 (=1)
2) reading length of the event packet from 4 bytes at position EVENT_LEN_OFFSET 
   (=9) and checking that it is smaller than (19+6+ST_SERVER_VER_LEN = 78)

This is exactly what is done inside mysqlbinlog to recognize binlog format v1.

For all other binlog formats, assuming that requirements a-c) are preserved, a
procedure to recognize binlog format is as follows: read 2 bytes at position 20
of first event packet -- this gives binlog version. 

Further sanity checks can be added. For example if format version read is v3 or
v4 then the type code (1 byte at EVENT_TYPE_OFFSET) should be START_EVENT_V3
(=1) or FORMAT_DESCRIPTION_EVENT (=15), respectively.

Note: Format description event in v4 is designed so that it can handle future
format updates. A new format with the same layout of event packets as in v4 but
with additional fields in header and post-headers can use this format
description event to correctly describe itself. Actually, it is (theoretically)
possible to have different "flavours" of binlog format v4 which have different
(bigger) header lengths and even different number of events. 

Current code is written to take care of this possibility. I.e. any code parsing
binlog and discovering that it is v4, uses header lengths as given by the format
description event (thus potentially different from the values hard-wired in the
server code).

Note: Although headers of event packets in v4 of binlog can be longer than 19
bytes, the format description event is an exception. Its header is always 19
bytes long to meet the above backward compatibility requirements.


Where binlog events are read/parsed in the code
================================================

1. Replication master's binlog dump thread (this is the thread handling 
   COM_BINLOG_DUMP command sent by a slave).

  It reads events from binlog files and sends them to replication slave 
  over the SQL connection.

  Code: mysql_binlog_send() function in sql/sql_repl.cc

2. Replication slave's I/O thread:

  Reads events sent by master's dump thread and stores them in relay logs 
  (local to slave's host).

  Code: handle_slave_io() function in sql/slave.cc

3. Replication slave's SQL thread:

  Reads events from relay logs and executes them.

  Code: handle_slave_sql() function in sql/slave.cc

4. Mysqlbinlog program:

  Reads events from binlog files or sent by a mysqld server and presents them in
  a form of SQL queries.

  Code: dump_{local,remote}_log_entries() functions in client/mysqlbinlog.cc


Notes:

1a. If the current binlog file ends, mysql_binlog_send() switches to the next 
    file.

1b. Mysql_binlog_send() sends a "fake" Rotate event before sending any other 
    events from a binlog and after each single binlog file. 

1c. In case an event offset was specified when requesting binlog dump, 
    mysql_binlog_send() finds the format description event at the beginning of 
    the binlog file and sends it first, before proceeding to the events at the 
    given offset.


How different binlog formats are handled in the slave
======================================================

I/O thread reads version of the server running on master from the
mysql->server_version value (where mysql is an object representing established
connection to the master). It aborts replication if master's version is x.y.z
with x <= 2.

I/O thread reads events using cli_safe_read() function which simply reads
protocol packets (?). Thus, this is binlog-version transparent.

Before writing events to the relay log, I/O thread can process them:

- if it consider current binlog to be in format v1-3 it translates events to v4 
  format,

- otherwise it copies event packets as they are, except that it recognizes and
  parses FD and Rotate events. It also recognizes (and don't put in relay log) a 
  Stop event.

I/O thread decides about binlog version based on the FD events it parses. Before
first FD event arrives, binlog version is based on the master's server version.
For masters with server x.y.z with x > 4 the binlog version is assumed to be v4
(even though in fact it can be higher).

Parsing FD and Rotate events is done by constructors of corresponding objects.
These constructors don't do any checks on binlog version. Note: the
Rotate_log_event constructor reads header and post-header lengths from the
current event description event (thus possibly binlog format sensitive).

Translation of evnets from old formats to the current format is done by parsing
event data using Log_event::read_log_event() (i.e., in the constructors of event
objects) and then writting the event objects to the relay log. 

Note: Looks like the event object constructors don't trust the binlog version
stored in the FD event passed to them.