WL#2540: Replication event checksums

Affects: Server-5.6   —   Status: Complete

SITUATION
=========

Events are written to the binary log as they are executed. They are
then sent to the slave on an event-by-event basis either through a
socket or over a network.


RATIONALE
=========

This is a fundamental validity check that verifies that events are
replicated without corruption.

It would make it much clearer to our customers when a replication
failure is due to network/disk/memory failures and when the failure is
due to bugs in the servers.  See the list of incidents below for
customers that may already have been affected.

WISHFUL REQUIREMENT
===================

After this is implemented, it should (preferably) not be possible
to ever crash the slave due to corrupted events.  To make this happen,
one could go through all values in all events and check that none of
them is illegal.  This could possibly be a second patch.

PROBLEM
=======

Some customers get very strange replication failures and 
it is impossible to know what causes them.  Sinisa says
that they could be "network problems" (e.g. CSC#4792).

The failures cause the slave to corrupt the data rather than stop,
which would be the appropriate action.

Some reported incidents (please update these once the patch is pushed):

- BUG#25737
- BUG#26123
- BUG#27048
- BUG#23619
- BUG#29309
- http://forums.mysql.com/read.php?26,148423,148423
- BUG#22889
- BUG#5116
- BUG#38718

This WL depends on the fixes from BUG#49741 for tests to pass
with the new options supplied to the servers.

SUMMARY
=======

Add checksum to binary log events.

SINISA WRITES (2005-04-13, CSC#4792):
This points to the missing reliability features, that SHOULD be
implemented in 5.0.

* checksum stored in binary and relay log to check for RAM / disk
  corruption. 

* checksum sent to slave for each event to check for network
  corruption. 


TYPE OF CHECKSUM
================

Alternatives:

  1. 8-12 bit checksum
     PRO: Can use the flags part of common event header 
          and thus not need to change binary log format
     CON: has higher than CRC32 probability of undetectable error
 
  2. CRC32
     PRO: Standard, implemented with multiple algorithms.
          Generally computationally inexpensive
 
  3. SHA or MD5
     PRO: Can also be used for authentication of events

     CON: Are very long (at least 128 bits)
          Computationally expensive.

Mats and Lars are currently considering a 32-bit checksum
(described in ISO 3309).
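
As an illustration of alternative 2, here is a minimal sketch that uses
the CRC32 from Python's zlib module. The helper names crc32_of and
flip_bit and the sample event bytes are made up for this example; they
are not part of the server. The sketch flips each bit of a sample event
in turn and records any flip that leaves the checksum unchanged.

```python
import zlib


def crc32_of(event: bytes) -> int:
    """Return the CRC32 of the event bytes as an unsigned 32-bit integer."""
    return zlib.crc32(event) & 0xFFFFFFFF


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


event = b"example event body"  # hypothetical event content
original = crc32_of(event)

# Collect every single-bit corruption that the checksum fails to notice.
undetected = [
    i for i in range(len(event) * 8)
    if crc32_of(flip_bit(event, i)) == original
]
```

A CRC detects all single-bit errors, so the undetected list stays empty
for this input.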


REPAIRING EVENTS
================

- Should the checksum be capable of also repairing the event?
  PRO: That would be really nice, especially if the mysqlbinlog 
       client can be extended to do repairs of binary logs
  CON: Then we can't use the bits for authentication

Mats and Lars think that we can store the type of the checksum in the
format_description log event (or implicitly as the binary log version),
so that we can change to an error-correcting checksum in the future.


SUGGESTED SOLUTION (By Mats)
============================

To ensure the integrity of each event arriving to the slave, a checksum should
be added to each event.  This allows the slave to check that the event was
transmitted correctly and written correctly to the relay log. If it was not, the
slave can stop indicating an error rather than trying to process the event.

This is particularly important when using row-based replication, since subtle
transmission errors can be applied without any form of error.

In my opinion, we should focus on the integrity of the events, and ignore issues
that relate to authenticity, since methods for handling that are
computationally expensive, and authenticity can be achieved through other means
(e.g., using SSL). Note that the suggestion below (using SSL) does not solve the
issue of corruption in the binary log or in the relay log: it only handles
corruption during transfer of the event.


INTERESTING WORK-AROUND (ANDREI'S TEXT)
=======================================

From reading about SSL's features [1] and from simulating corrupted
packets on the slave (by changing the data just after libc's recv
function returns the buffer), we can conclude that this idea (using SSL
to create and verify a checksum) should work.

Of course I cannot check all possible situations; for that we
would need to study the algorithms that are used.  Almost certainly the
encryption algorithms in SSL take care of a checksum.

Reference: 

    [1] 5.1 Manual, 5.8.7.1. Basic SSL Concepts

    SSL is a protocol that uses different encryption algorithms to
    ensure that data received over a public network can be trusted. It
    has mechanisms to detect any data change, loss, or replay.

Since SSL is not used for every connection, we still need to consider per-event
checksums.


OPEN ISSUES
===========

1. How about only having checksums at commit time?  How would that work?
   What about non-trx tables, these would probably need a 
   checksum in that event.  

   Probably the solution is that *whenever* something is committed, 
   the checksum is needed, be it a statement (due to autocommit), 
   a non-transactional table update, or a real transaction.

Requirements for the design include:

R0. Checksumming shall be optional, both for computing and for verifying.
R1. Checksumming aims at detecting corrupted events *before* the event
    starts to be applied.
R2. The Old Master shall work with the New checksum-aware slave seamlessly.
R3. The checksum-aware New Master shall, as far as possible, work with the
    Old Slave.
R4. The checksumming framework should not be constrained to just one CRC32
    implementation.


The checksum is attached as the last item, following the event data.
The typical structure of an event with the checksum (V) is as
follows:

       +-----------+------------+------------+--------------+
       |           |            |            |              |
       | Common Hdr| SubHeader  | Payload    | Checksum (V) |
       +-----------+------------+------------+--------------+
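
The layout above can be sketched as follows. This is only an
illustration: the function names are hypothetical, CRC32 from zlib
stands in for the checksum, and storing (V) as four little-endian bytes
is an assumption of this sketch, not something this text specifies.

```python
import struct
import zlib

CHECKSUM_LEN = 4  # a CRC32 value occupies four bytes


def append_checksum(event_without_cs: bytes) -> bytes:
    """Return header + subheader + payload followed by the checksum (V)."""
    crc = zlib.crc32(event_without_cs) & 0xFFFFFFFF
    # Assumption: (V) is stored little-endian.
    return event_without_cs + struct.pack("<I", crc)


def verify_checksum(event_with_cs: bytes) -> bool:
    """Recompute the checksum over everything before (V) and compare."""
    if len(event_with_cs) < CHECKSUM_LEN:
        return False
    body = event_with_cs[:-CHECKSUM_LEN]
    stored = event_with_cs[-CHECKSUM_LEN:]
    return struct.pack("<I", zlib.crc32(body) & 0xFFFFFFFF) == stored
```

A reader that finds verify_checksum() returning False can stop with an
error before the event starts to be applied, which is the intent of R1.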

Additionally, the FormatDescriptor (FD) event contains the checksum algorithm
descriptor (A), which is placed between the Payload and (V), the
checksum value.
All events following the FD are checksummed according to the FD's (A).
The values of (A) are:

0         no checksum
1         CRC32 algorithm
[2 - 127] extensible range for new algorithm
[128-254] reserved range
255       Event that contains (A) == 255 is generated by
          checksum-unaware server
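
The value ranges of (A) listed above can be expressed as a small lookup.
The function name classify_alg and the returned labels are illustrative
only.

```python
def classify_alg(a: int) -> str:
    """Map a one-byte algorithm descriptor (A) to its meaning."""
    if not 0 <= a <= 255:
        raise ValueError("(A) must fit in one byte")
    if a == 0:
        return "no checksum"
    if a == 1:
        return "CRC32"
    if a <= 127:
        return "extensible range"
    if a <= 254:
        return "reserved range"
    # 255: the event was generated by a checksum-unaware server
    return "checksum-unaware server"
```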

A binlog file can contain only one type of events: either all with a
checksum (and then all with the same algorithm) or all without (the
algorithm is then none).
Since all events other than the FD rely on the checksum info in the FD,
the FD itself becomes vulnerable. In particular, we need to distinguish
between a checksummed-and-corrupted FD and an FD that is not checksummed.
To resolve this issue the duplication technique can be applied. However,
it is simpler to *always* checksum the FD regardless of the current policy
of R0. The FD then needs two descriptors: one for the binlog and one for
itself. The simplest approach is to add a new flag (F) to the Common
Header (CH) part that determines whether the events following the FD
carry a checksum or not:

       +-----------+------------+------------+------+------------+
       | Common Hdr|            |            |      |            |
       |   (F)     | SubHeader  | Payload    |  (A) | Checksum(V)|
       +-----------+------------+------------+------+------------+

If (F) is down, (A) describes the algorithm used to verify (V)
solely for the FD event.
If (F) is up, (A) applies to the FD and to all following events, which are
treated as having the trailing (V) component.
(Footnote: this part of the reliable FD is worked out from the current WL.)
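
The two cases for (F) can be written down as a tiny decision function.
This is a sketch of the rule as stated here, with hypothetical names; it
is not taken from the server code.

```python
ALG_OFF = 0  # (A) == 0 means "no checksum"


def checksum_scope(flag_f_up: bool, alg_a: int) -> tuple:
    """Return (algorithm verifying the FD itself,
               algorithm applied to the events following the FD)."""
    fd_alg = alg_a  # the FD is always verified according to (A)
    following_alg = alg_a if flag_f_up else ALG_OFF
    return fd_alg, following_alg
```

With (F) down and (A) == 1 the result is (1, 0): only the FD carries a
CRC32. With (F) up it is (1, 1).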


NOTES FROM DISCUSSION (Lars, Luís) - 26/02/2009,
edited by Andrei to extend with further details
===============================================

     S1. mysqlbinlog
        we need to upgrade the tools to work with the new checksum
        field (these should now handle the checksum as well)

     S2. Server must work with multiple slave versions
       1. OM -> OS      (ok)
       2. OM -> NS      (ok)
       3. NM -> OS      ok, if NM does not compute checksum
       4. NM -> NS      (ok)
       OM - Old Master, OS - Old Slave, 
       NM - New Master, NS - New Slave

       NOTES:
         a) we need to check if the slave registers some information
            about its version when connecting to the master

     S3. Make implementation modular 
       - make a function that can be reused in several places
       
     S4. Configurable options

       The following server scope configuration parameters are added:

       1. --binlog-checksum = NONE | CRC32 [1]; the default is NONE
       for backward compatibility.
       CRC32 is the only checksum algorithm implemented.
       When set to CRC32, the master server logs events with a checksum.
       Changing the value from NONE to CRC32, or back, forces a rotation of
       the binlog so that the new binlog file starts with an FD whose (A)
       matches the value being set. This keeps each binlog file
       checksum-homogeneous.
       Note that the IO thread of the slave server and its user sessions
       create local events and write them into the relay log.
       Computation of the checksum in this case
       is controlled by the relay-log checksumming status, which records
       the checksum algorithm of the last FD written into the RL.

       2. --master-verify-checksum=0,1; the default is 0.
       When set to 1, the master verifies the checksum of an event when
       reading it from the binlog, whether on behalf of a user
       (via SHOW BINLOG EVENTS) or of the dump threads, as well as at
       recovery and binlog initialization.
       The value 0 is acceptable without much compromising the reliability
       of the system, considering that the slave IO thread always verifies
       the checksum upon receiving a checksummed event from the network.

       3. --slave-sql-verify-checksum=0,1; the default is 1.
       When set to 1, the slave SQL thread verifies the checksum when reading
       an event from the relay log file. The value 1 also forces the
       slave server to start the relay log with its local FD checksummed,
       (A) != 0.
       If the value is set to 0, the checksum is not verified by the SQL
       thread, and the local FD is created with (A) == 0.
       Note that the Relay Log can have heterogeneous content, with events
       checksummed in different ways. Still, each FD event carries an (A)_b
       value matching the checksum status of the events that follow it.

       The parameters initialize the namesake global server variables:
 
       @@global.binlog_checksum can be altered at run time; doing so forces
       rotation of the current binlog file as a side effect.
       The binlog is supposed to be homogeneous, that is, to hold either
       all events with a checksum or all events without one;
       @@global.master_verify_checksum can be turned ON|OFF;
       @@global.slave_sql_verify_checksum can be turned ON|OFF.
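
       The rotation side effect of changing @@global.binlog_checksum can
       be modelled with a toy binlog. Everything in this sketch (the
       Binlog class, its methods, the event names) is hypothetical. It
       also assumes that setting the variable to its current value does
       not rotate, which this text does not state.

```python
class Binlog:
    """Toy model: a list of files, each a list of (event, checksum_alg)."""

    def __init__(self, checksum: str = "NONE"):
        self.checksum = checksum
        self.files = [[("FD", checksum)]]  # every file starts with an FD

    def write(self, event: str) -> None:
        """Append an event, checksummed per the current setting."""
        self.files[-1].append((event, self.checksum))

    def set_checksum(self, new_value: str) -> None:
        """Change the setting and rotate so each file stays homogeneous."""
        if new_value not in ("NONE", "CRC32"):
            raise ValueError("unsupported checksum algorithm")
        if new_value == self.checksum:
            return  # assumption: no rotation when the value is unchanged
        self.write("Rotate")  # terminal event still uses the old value
        self.checksum = new_value
        self.files.append([("FD", new_value)])

    def is_homogeneous(self) -> bool:
        """True if every file uses a single checksum algorithm."""
        return all(len({alg for _, alg in f}) == 1 for f in self.files)
```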
 
       Requirement R3 is implemented as follows.
       NM rejects a connection attempt from OS if OS requests a binlog
       position in a checksummed binlog file.
       NM disconnects from OS if it is about to send the first checksummed
       event. Note that NM -> OS support is generally limited, as documented.
       Still, as long as NM is configured without checksumming, NM -> OS
       works.
       Note that the New checksum-aware mysqlbinlog might be used
       to transport the NM's binlog file to the Old Slave.
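
       The R3 behaviour can be sketched as a loop over the events the
       master is about to send. The function serve_events and the state
       labels are invented for this illustration; the real dump thread
       is more involved.

```python
ALG_OFF = 0
ALG_CRC32 = 1


def serve_events(event_algs, slave_is_checksum_aware: bool):
    """Return (number of events sent, final connection state).

    event_algs lists the checksum algorithm of each event, in send order.
    """
    sent = 0
    for alg in event_algs:
        if alg != ALG_OFF and not slave_is_checksum_aware:
            # About to send a checksummed event to an Old Slave: stop here.
            return sent, "disconnected"
        sent += 1
    return sent, "ok"
```

       An Old Slave that starts inside a checksummed file gets zero
       events, which corresponds to the rejected connection attempt.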


[1] CRC32 of zlib.
    My searches for a nomenclature of a standard revealed

http://www.gzip.org/zlib/rfc-gzip.html

    CRC32 (CRC-32)
    
    This contains a Cyclic Redundancy Check value of the uncompressed
    data computed according to CRC-32 algorithm used in the ISO 3309
    standard and in section 8.1.1.6.2 of ITU-T recommendation V.42.

Comparing the specifications and polynomials from
http://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-V.42-200203-I!!PDF-E&type=items
and
http://en.wikipedia.org/wiki/Cyclic_redundancy_check

points at

CRC-32-IEEE 802.3 

x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
(V.42, MPEG-2, PNG, POSIX cksum)
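
As a cross-check of the polynomial above, the following sketch builds
its 32-bit representation from the exponents, reflects it, and runs a
plain bit-by-bit CRC. The initial value and final XOR of 0xFFFFFFFF are
the usual CRC-32 parameters and are assumed here rather than taken from
this text. The result can be compared against zlib.crc32.

```python
import zlib

# Exponents of the generator polynomial, without the implicit x^32 term.
EXPONENTS = [26, 23, 22, 16, 12, 11, 10, 8, 7, 5, 4, 2, 1, 0]
POLY_NORMAL = sum(1 << e for e in EXPONENTS)
# The common software form processes bits LSB-first, so reverse the bits.
POLY_REFLECTED = int(f"{POLY_NORMAL:032b}"[::-1], 2)


def crc32_bitwise(data: bytes) -> int:
    """Bit-at-a-time CRC-32 (reflected, init and final XOR 0xFFFFFFFF)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


matches_zlib = crc32_bitwise(b"binlog event") == zlib.crc32(b"binlog event")
```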

Because the header part of all events other than the FD is unchanged, the
FD is the only way to find out whether an event contains a checksum.
The low-level design shall therefore make sure that the checksum-holding
FD event can be located in all of the following cases:

C1. SHOW BINLOG EVENTS
C2. The dump thread, when it reads from the binlog to send to the slave
C3. The IO thread, when it receives an event from the network to queue it
    into the Relay Log
C4. The SQL thread, when it reads from the Relay Log
C5. The artificial Rotate event that is sent to the slave *before* any FD.

On top of that, the RL is specific in that it need not be
checksum-homogeneous, because the slave server creates the very first RL
file without coordination with the master server, and the slave server
writes the relay-log local events itself.
Given these facts, the following task takes shape:

C6. Recording into the RL is done respecting its checksum status.
    The status changes when a newly recorded FD brings in a new checksum
    value. The RL checksum status must survive any admin SQL commands that
    preserve the relay-log index, which includes STOP SLAVE and FLUSH LOGS
    but not CHANGE MASTER or RESET SLAVE.

Finally, OM -> NS forces

C7. Identifying the version of the FD event and comparing it against (H), a
    checksum-home constant. If FD.version >= (H), the FD instance must have
    the (A) and (V) components.
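
The version test of C7 can be sketched as a tuple comparison. The value
of (H) is not given in this text, so it is passed in as a parameter; the
(5, 6, 1) used in the usage note below is purely illustrative. The
function names are hypothetical.

```python
import re


def parse_version(server_version: str) -> tuple:
    """Extract (major, minor, patch) from a string such as '5.6.10-log'."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", server_version)
    if not m:
        raise ValueError("unrecognized version string")
    return tuple(int(part) for part in m.groups())


def fd_must_have_checksum_fields(server_version: str, home: tuple) -> bool:
    """True if FD.version >= (H), i.e. the FD must carry (A) and (V)."""
    return parse_version(server_version) >= home
```

Comparing tuples rather than strings matters: with an illustrative (H)
of (5, 6, 1), the version '5.10.0' compares as newer, whereas a string
comparison would rank it as older.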

C1-C2 are addressed naturally by watching for FDs in the stream and
extracting their (A) descriptor. Events after the FD just read are treated
as checksummed according to (A).
C3 is done by maintaining the RL checksum status and updating it
with the new value of (A) from FD_m, the FormatDescriptor event from the
master.
C4 is addressed like C1-C2. Even though the RL contains FDs from both master
and slave, the last FD encountered is treated as the checksum holder for the
stream that follows.

    There are three patterns of RL events sequence:

    1. FD_s (A_s), R_s (or S_s)
    2. FD_s (A_s), R_m
    3. FD_s (A_s), FD_m (A_m), E_1, ... E_l, R_s (or S_s)

    Here FD_s and FD_m are format descriptor events from the Slave and the
    Master, R is a rotate event, E is an event from the master, and S_s is a
    stop event written by the slave.
    Our task is to make sure that in 1-2 R_s and R_m are checksummed
    according to FD_s's A_s, and that in 3 R_s is checksummed according to
    the last seen FD_m.
    The 2nd pattern requires computing the checksum on the slave
    if NM does not send R_m checksummed or if the event is OM's R_m.
    In capturing 3 we rely on the stability of the RL checksum status, which
    is not affected by STOP SLAVE.
    Note that the last event in the 3rd sequence is recorded at
    restart of the slave service.
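
    The rule that the last FD seen governs the checksum of the events
    after it can be sketched as below. Events are modelled as
    (name, alg) pairs where alg is the FD's (A) for FD events and None
    otherwise; this representation and the function name are assumptions
    made for the example. The status starts at 255, as for a relay log
    that has not yet seen any FD.

```python
ALG_UNDEF = 255  # initial relay-log checksum status


def assign_checksum_algs(events):
    """Return [(name, effective_alg)] for a relay-log event sequence.

    Each FD updates the running checksum status; every other event is
    checksummed according to the most recent FD before it.
    """
    status = ALG_UNDEF
    result = []
    for name, alg in events:
        if name.startswith("FD"):
            status = alg
        result.append((name, status))
    return result


# Pattern 3: FD_s (A_s = 1), FD_m (A_m = 0), E_1, E_2, R_s
pattern3 = [("FD_s", 1), ("FD_m", 0), ("E_1", None), ("E_2", None),
            ("R_s", None)]
effective = assign_checksum_algs(pattern3)
```

    In this run R_s ends up with the algorithm of FD_m (0), not of FD_s.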
    
C5 is addressed by querying the master at the time of connecting to the dump
thread. The master's @@binlog_checksum value at that moment becomes known to
both parties, so that the IO thread knows what to do with the first fake
Rotate event.

The RL checksum status of C6 is implemented as a member of the
Relay_log_Info.relay_log entry. It is initialized to 255, and the Slave's
first local FD changes the value. From then on, the value is altered
by each FD_m received from a master.


Further fine details of implementation.

1. @@global.binlog_checksum

   requires a new sys_var class that

   - engages mysql_bin_log's LOCK_log mutex as the guard when modifying
     the actual value of opt_binlog_checksum.
   - has an update method that makes sure the current binlog is rotated,
     with the terminal event in it checksummed according to the old
     value, while the new file's FD event receives the checksum algorithm
     descriptor byte and a checksum value according to the new value
     being set.

2. @@global.master_verify_checksum
   @@global.slave_sql_verify_checksum

   require a new class with the standard LOCK_global_system_variables
   as a guard. Unlike item 1, the update method has no special behavior.

3. Computation of the checksum starts at the time of Log_event::write() for
   events written directly to the binlog, and at MYSQL_BIN_LOG::write_cache()
   for events written via the transaction cache.
   In the existing framework,
   MYSQL_BIN_LOG::LOCK_log is acquired before, and released after, computing
   and writing to the binlog file. Although opt_binlog_checksum can be read
   multiple times during this process, its value stays the same thanks to the
   mutex of item 1.
   Log_event::write() can receive an event instance marked
   for checksumming by the caller.
   The FD event is treated specially. Even when @@binlog_checksum is NONE, an
   FD_m instance carries (A). In that case (A) describes the checksum
   algorithm of the FD alone.
   The position of (A) and (V) at the end of the FD makes OS
   construct its FD instance with extra bytes at the end of the
   post_header_len[] array. The extra bytes remain unreferenced but harmless
   garbage in the OS's FD post_header_len: none of the event constructors
   that take an FD argument can access them, because post_header_len[] is
   indexed by event type code and the index of the extra bytes is >= the max
   event type code of OS:

      post_header_len[]= {UNKNOWN_EVENT, START_EVENT_V3, ... ,
                          HEARTBEAT_LOG_EVENT /* the current max code */,
                          (A), (V)}

   This gives transparent NM -> OS support, provided that NM does not
   checksum events other than the FD (@@global.binlog_checksum == NONE).
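
   The argument about unreachable trailing bytes can be illustrated with
   a toy array. The event type range (0..4), the length values, and the
   function name are all invented for this sketch; only the indexing
   rule comes from the text above.

```python
OLD_SLAVE_MAX_EVENT_TYPE = 4  # hypothetical highest type code known to OS


def lookup_post_header_len(post_header_len, event_type: int) -> int:
    """Old Slave lookup: only indexes by event type codes it knows."""
    if not 0 <= event_type <= OLD_SLAVE_MAX_EVENT_TYPE:
        raise ValueError("event type unknown to this server")
    return post_header_len[event_type]


# Made-up post-header lengths for event types 0..4.
from_old_master = [0, 56, 13, 0, 8]
# A New Master appends (A) and four bytes of (V) after the known entries.
from_new_master = from_old_master + [1, 0xAA, 0xBB, 0xCC, 0xDD]

same_view = all(
    lookup_post_header_len(from_old_master, t)
    == lookup_post_header_len(from_new_master, t)
    for t in range(OLD_SLAVE_MAX_EVENT_TYPE + 1)
)
```

   Every lookup the Old Slave can make returns the same value for both
   arrays, so the extra bytes are never consulted.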

   The slave can mark local relay-log events via Log_event::checksum_alg prior
   to calling Log_event::write(), which triggers the computation as well.

4. The checksum handshake is initiated by the slave, which sends a user
   variable setting to the master:
      set @master_binlog_checksum= @@global.binlog_checksum
   An Unknown system variable error indicates an Old master.
   Otherwise the setting lets the dump thread know which algorithm to use
   when checksumming the fake Rotate event, and the slave is aware
   of the algorithm as well.
   The slave reads back the @master_binlog_checksum value to learn
   what (A) is going to be for the first event from the master, which
   is the so-called fake Rotate (see C5 above).
   mysql_binlog_send() verifies the capability of the client (slave or
   mysqlbinlog) each time the dump thread
   reads a binlog event whose parent FD is marked with (A) == CRC32.
5. Testing shall demonstrate the user interface, including the default values,
   the value domain, OLD<->NEW compatibility, effectiveness, and an
   insignificant performance penalty, if any.

Notice, not all tests can pass currently with the new option --binlog-checksum
due to BUG#49741.

Observation: write_cache() forced laborious hacking because it had earlier
received changes that convert relative offsets into absolute ones.
In my (Andrei's) opinion, that should not have happened. Instead, the handler
of SHOW BINLOG EVENTS, which needs the relative-to-absolute conversion,
should perform the conversion itself when it is called (rather than having it
done per event), and the binlog should contain the events of a transaction
with their relative offsets.