WL#2540: Replication event checksums
Affects: Server-5.6
—
Status: Complete
SITUATION
=========
Events are written to the binary log as they are executed. They are then
sent to the slave on an event-by-event basis either through a socket or
over a network.

RATIONALE
=========
This is some fundamental validity checking that would check that
replication works correctly. It would make it much clearer to our
customers when a replication failure is due to network/disk/memory
failures and when the failure is due to bugs in the servers. See the
amazingly long list below for potentially previously affected customers.

WISHFUL REQUIREMENT
===================
After this is implemented, it should (preferably) not be possible to ever
crash the slave due to corrupted events. To make this happen, one could go
through all values in all events and check that they can't be illegal.
This could possibly be a second patch.

PROBLEM
=======
Some customers get very strange replication failures and it is impossible
to know what causes them. Sinisa says that they could be "network
problems" (e.g. CSC#4792). The failures cause the slave to corrupt the
data rather than stop, which would be the appropriate action.

Some reported incidents (please update these once the patch is pushed):
- BUG#25737
- BUG#26123
- BUG#27048
- BUG#23619
- BUG#29309
- http://forums.mysql.com/read.php?26,148423,148423
- BUG#22889
- BUG#5116
- BUG#38718

The WL is expecting fixes from BUG#49741 to make the tests pass with the
new options supplied to the servers.

SUMMARY
=======
Add checksum to binary log events.

SINISA WRITES (2005-04-13, CSC#4792):
This points to the missing reliability features, that SHOULD be
implemented in 5.0.
* checksum stored in binary and relay log to check for RAM / disk
  corruption.
* checksum sent to slave for each event to check for network corruption.

TYPE OF CHECKSUM
================
Alternatives:
1. 8-12 bit checksum
   PRO: Can use the flags part of the common event header and thus does
        not need to change the binary log format
   CON: Has a higher probability of undetectable error than CRC32
2. CRC32
   PRO: Standard, implemented with multiple algorithms.
        Generally computationally inexpensive
3. SHA or MD5
   PRO: Can also be used for authentication of events
   CON: Are very long (at least 128 bits)
        Computationally expensive.

Mats and Lars are currently considering a 32 bit checksum (described in
ISO 3309).

REPAIRING EVENTS
================
- Should the checksum be capable of also repairing the event?
  PRO: That would be really nice, especially if the mysqlbinlog client
       can be extended to do repairs of binary logs
  CON: Then we can't use the bits for authentication

Mats and Lars think that we can have the type of the checksum in the
format_description log event (or implicitly as the binary log version),
so that we can, in the future, change to an error correcting checksum.

SUGGESTED SOLUTION (By Mats)
============================
To ensure the integrity of each event arriving at the slave, a checksum
should be added to each event. This allows the slave to check that the
event was transmitted correctly and written correctly to the relay log. If
it was not, the slave can stop and indicate an error rather than trying to
process the event. This is particularly important when using row-based
replication, since subtle transmission errors can be applied without any
form of error.

In my opinion, we should focus on the integrity of the events, and ignore
issues that relate to authenticity, since methods for handling that are
computationally expensive, and can be achieved through other means (e.g.,
using SSL).

Note that the suggestion below (using SSL) does not solve the issue of
corruption in the binary log or in the relay log: it only handles
corruption during transfer of the event.

INTERESTING WORK-AROUND (ANDREI'S TEXT)
=======================================
From reading about SSL's features [1] and simulating corrupted packets on
the slave by changing data just after libc's recv function returns the
buffer, we can conclude that this idea (using SSL to create and verify a
checksum) should work. Of course I cannot check all the possible
situations; for that we would need to study the algorithms that are used.
Almost certainly the encryption algorithms in SSL take care of a checksum.

Reference: [1] 5.1 Manual, 5.8.7.1. Basic SSL Concepts
  SSL is a protocol that uses different encryption algorithms to ensure
  that data received over a public network can be trusted. It has
  mechanisms to detect any data change, loss, or replay

Since SSL is not used for every connection, we still need to consider
per-event checksums.

OPEN ISSUES
===========
1. How about only having checksums at commit time? How would that work?
   What about non-trx tables? These would probably need a checksum in
   that event. Probably the solution is that *whenever* something is
   committed, the checksum is needed, be it a statement (due to
   autocommit), a non-transactional table update, or a real transaction.

Requirements for the design include:
R0. Checksumming shall be optional, both for computing and for verifying.
R1. Checksumming aims at detecting corrupted events *before* the event
starts to be applied.
R2. The Old Master shall work with the New checksum-aware slave seamlessly.
R3. The checksum-aware New Master shall, as far as possible, work with the Old Slave.
R4. The checksumming framework should not be constrained to just one CRC32
implementation.
The checksum is attached as the last item, following the event data.
The typical structure of an event with the checksum (CS, shown as (V) below)
is as follows:
+-----------+------------+------------+--------------+
| | | | |
| Common Hdr| SubHeader | Payload | Checksum (V) |
+-----------+------------+------------+--------------+
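The trailing-checksum layout can be sketched in a few lines of Python. This is only an illustration: the helper names, the 4-byte little-endian encoding of (V), and the sample event bytes are assumptions of the sketch, not details taken from this worklog.

```python
import struct
import zlib

CHECKSUM_LEN = 4  # a CRC32 value is 32 bits wide


def append_checksum(event_body: bytes) -> bytes:
    """Return the event followed by its CRC32 as a trailing (V) field."""
    crc = zlib.crc32(event_body) & 0xFFFFFFFF
    # Little-endian byte order is an assumption of this sketch.
    return event_body + struct.pack("<I", crc)


def verify_checksum(event: bytes) -> bool:
    """Recompute the CRC32 over everything before (V) and compare."""
    if len(event) < CHECKSUM_LEN:
        return False
    body, trailer = event[:-CHECKSUM_LEN], event[-CHECKSUM_LEN:]
    (stored,) = struct.unpack("<I", trailer)
    return (zlib.crc32(body) & 0xFFFFFFFF) == stored


# Hypothetical event bytes standing in for Common Hdr + SubHeader + Payload.
event = append_checksum(b"common-hdr|sub-hdr|payload")
corrupted = bytes([event[0] ^ 0x01]) + event[1:]  # flip a single bit
```

A receiver that calls verify_checksum() before applying the event would stop on the corrupted copy, which is the behaviour R1 asks for.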
Additionally, the FormatDescriptor (FD) event contains the checksum algorithm
descriptor (A), which is placed between the Payload and (V), the
checksum value.
All events following an FD are checksummed according to that FD's (A).
Values of (A):
    0          no checksum
    1          CRC32 algorithm
    [2-127]    extensible range for new algorithms
    [128-254]  reserved range
    255        an event that contains (A) == 255 was generated by a
               checksum-unaware server
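As a small illustration of the value ranges of (A), the following Python sketch maps a descriptor byte to the range it falls in. The function name and the returned labels are hypothetical; only the numeric ranges come from the list above.

```python
def describe_checksum_alg(a: int) -> str:
    """Classify the one-byte checksum algorithm descriptor (A)."""
    if not 0 <= a <= 255:
        raise ValueError("(A) is a single byte")
    if a == 0:
        return "none"
    if a == 1:
        return "crc32"
    if a <= 127:
        return "extensible"   # room for new algorithms
    if a <= 254:
        return "reserved"
    return "checksum-unaware"  # 255: written by a checksum-unaware server
```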
A binlog file can hold only one kind of event: either all events carry a
checksum (computed with the same alg) or none do (the alg is then NONE).
Since all events other than FD rely on the checksum info in FD, the FD itself
becomes the vulnerable point. In particular, we need to distinguish between
an FD that is checksummed but corrupted and an FD that is not checksummed.
The duplication technique could resolve this issue. However,
it is simpler to *always* checksum the FD regardless of the current policy of R0.
FD then needs two descriptors: one for the binlog and the other for itself.
The simplest approach is to add a new flag (F) to the CH (Common Header) part
that determines whether the events following the FD carry a checksum or not:
+-----------+------------+------------+------+------------+
| Common Hdr| | | | |
| (F) | SubHeader | Payload | (A) | Checksum(V)|
+-----------+------------+------------+------+------------+
If (F) is not set, (A) describes the alg used to verify (V)
for the FD event only.
If (F) is set, (A) applies to the FD and to all following events, which are
treated as having the trailing (V) component.
(Footnote: this part of the reliable FD is worked out from the current WL.)
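A minimal sketch of how a reader might interpret (F) and (A) together is given below. The flag bit value and the class layout are hypothetical and chosen only for illustration; the worklog does not fix them.

```python
from dataclasses import dataclass

FLAG_CHECKSUM_FOLLOWING = 0x1  # hypothetical bit position for (F)
ALG_NONE = 0
ALG_CRC32 = 1


@dataclass
class FormatDescriptor:
    flags: int  # flags field of the common header, carrying (F)
    alg: int    # checksum algorithm descriptor (A)


def alg_for_fd_itself(fd: FormatDescriptor) -> int:
    # The FD is always checksummed, so (A) always applies to the FD.
    return fd.alg


def alg_for_following_events(fd: FormatDescriptor) -> int:
    # (A) extends to the following events only when (F) is set.
    if fd.flags & FLAG_CHECKSUM_FOLLOWING:
        return fd.alg
    return ALG_NONE


fd_only = FormatDescriptor(flags=0, alg=ALG_CRC32)
fd_and_rest = FormatDescriptor(flags=FLAG_CHECKSUM_FOLLOWING, alg=ALG_CRC32)
```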
NOTES FROM DISCUSSION (Lars, Luís) - 26/02/2009,
edited by Andrei to extend with further details
===============================================
S1. mysqlbinlog
we need to upgrade the tools to work with the new checksum
field (these should now handle the checksum as well)
S2. Server must work with multiple slave versions
1. OM -> OS (ok)
2. OM -> NS (ok)
3. NM -> OS ok, if NM does not compute checksum
4. NM -> NS (ok)
OM - Old Master, OS - Old Slave,
NM - New Master, NS - New Slave
NOTES:
a) we need to check if the slave registers some information
about its version when connecting to the master
S3. Make implementation modular
- make a function that can be reused in several places
S4. Configurable options
The following server scope configuration parameters are added:
1. --binlog-checksum = NONE | CRC32 [1]; the default is NONE
for backward compatibility.
CRC32 is the only checksum algorithm implemented.
When set to CRC32, the master server logs events with a checksum.
Changing the value from NONE to CRC32, or back, forces a rotation of the binlog
so that the new binlog file is headed by an FD whose (A) reflects the new
setting ((A) equal to one for CRC32). That keeps each binlog file
checksum-homogeneous.
Notice that the IO thread of the slave server and its user sessions
create local events and write them into the relay log.
Computation of the checksum in this case
is controlled by the relay-log checksumming status, which records
the checksum alg of the last FD written into the RL.
2. --master-verify-checksum=0,1; the default is zero.
When 1, the master verifies the checksum of an event whenever it is read
from the binlog, whether by a user (via SHOW BINLOG EVENTS)
or by the dump threads, as well as at recovery and binlog
initialization.
The value 0 is acceptable without much compromise to the reliability
of the system, considering that the slave IO thread always
verifies the checksum anyway upon receiving a
checksummed event from the network.
3. --slave-sql-verify-checksum=0,1; the default is 1.
When 1, the slave SQL thread verifies the checksum when reading
an event from the relay log file. The value 1 also forces the
slave server to start the relay log with its local FD checksummed ((A) != 0).
If the value is set to zero, the checksum is not verified by the SQL thread,
and the local FD is created with (A) == 0.
Notice that the Relay Log can have heterogeneous content, with events
checksummed in different ways. Still, each FD event carries an (A)_b value
matching the checksum status of the events that follow it.
The parameters initialize the namesake global server variables:
@@global.binlog_checksum can be altered in run-time to force
rotation of the current binlog file as a side effect.
The binlog is supposed to be homogeneous, that is, to hold either
all events with a checksum or all events without one;
@@global.master_verify_checksum can be turned ON|OFF;
@@global.slave_sql_verify_checksum can be turned ON|OFF.
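The rotate-on-change behaviour of binlog_checksum can be modelled with a toy in-memory binlog. The class below is a hypothetical model, not server code; in particular, treating a no-change assignment as a no-op is an assumption of the sketch.

```python
class ToyBinlog:
    """In-memory model: a list of files, each a list of (event, alg) pairs."""

    ALGS = ("NONE", "CRC32")

    def __init__(self, alg: str = "NONE"):
        self.alg = alg
        # Every file is headed by an FD describing the alg used in that file.
        self.files = [[("FD", alg)]]

    def write(self, event_name: str) -> None:
        self.files[-1].append((event_name, self.alg))

    def set_checksum(self, new_alg: str) -> None:
        if new_alg not in self.ALGS:
            raise ValueError(new_alg)
        if new_alg == self.alg:
            return  # assumption of this sketch: no change, no rotation
        # The terminal Rotate event still follows the old setting.
        self.files[-1].append(("Rotate", self.alg))
        self.alg = new_alg
        # The new file starts with an FD that carries the new setting.
        self.files.append([("FD", new_alg)])


log = ToyBinlog("NONE")
log.write("Query")
log.set_checksum("CRC32")
log.write("Query")
```

After the switch, each file in log.files contains events of one checksum kind only, which is the homogeneity property described above.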
Requirement R3 is implemented as follows.
NM rejects a connection attempt by OS if OS requests a binlog position
in a checksummed binlog file.
NM disconnects from OS when it is about to send the first checksummed
event. Notice that NM -> OS support is generally limited, as documented.
Still, as long as NM is configured without checksumming, NM -> OS works.
Notice also that the New checksum-aware mysqlbinlog can be used
to transport the NM's binlog file to the Old Slave.
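The NM -> OS rule can be sketched as a loop over the events a dump thread is about to send. The function and its return values are hypothetical; events are modelled as (name, alg) pairs where alg is the (A) value in effect for that event.

```python
def serve(events, slave_is_checksum_aware: bool):
    """Send events until one would require a capability the slave lacks."""
    sent = []
    for name, alg in events:
        if alg != 0 and not slave_is_checksum_aware:
            # About to send the first checksummed event to an Old Slave.
            return sent, "disconnected"
        sent.append(name)
    return sent, "ok"


plain = [("Query", 0), ("Xid", 0)]
mixed = [("Query", 0), ("Query", 1), ("Xid", 1)]
```

With a master that never checksums (the plain stream), an Old Slave is served normally; with the mixed stream it is cut off before the first checksummed event.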
[1] CRC32 of zlib.
My searches for a nomenclature of a standard revealed
http://www.gzip.org/zlib/rfc-gzip.html
CRC32 (CRC-32)
This contains a Cyclic Redundancy Check value of the uncompressed
data computed according to CRC-32 algorithm used in the ISO 3309
standard and in section 8.1.1.6.2 of ITU-T recommendation V.42.
Comparing the specifications and polynomials from
http://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-V.42-200203-I!!PDF-E&type=items
and
http://en.wikipedia.org/wiki/Cyclic_redundancy_check
points at
CRC-32-IEEE 802.3
x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
(V.42, MPEG-2, PNG, POSIX cksum)
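As a cross-check of the polynomial, the following Python sketch builds the 32-bit constant from the exponents listed above and runs a bit-at-a-time CRC in the reflected form used by zlib. The function name is hypothetical; the comparison against zlib.crc32 is only a sanity check, not part of the worklog.

```python
import zlib

# Exponents of the polynomial below x^32 (the x^32 term is implicit).
EXPONENTS = [26, 23, 22, 16, 12, 11, 10, 8, 7, 5, 4, 2, 1, 0]
POLY = sum(1 << e for e in EXPONENTS)            # 0x04C11DB7
POLY_REFLECTED = int(f"{POLY:032b}"[::-1], 2)    # 0xEDB88320


def crc32_bitwise(data: bytes) -> int:
    """Bit-at-a-time CRC-32 (reflected, init and final XOR 0xFFFFFFFF)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF
```

For any input, crc32_bitwise() is expected to agree with zlib.crc32(), which is the implementation referred to in [1].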
Because the header part of all events other than FD is left unchanged, the FD is
the only way to find out whether an event contains a checksum.
Low-level design shall also make sure that the checksum-holder FD event
can be located in all of the following cases:
C1. SHOW BINLOG EVENTS
C2. the dump thread, when it reads from the binlog to send to the slave
C3. the IO thread, when it receives an event from the network to queue it into the Relay Log
C4. the SQL thread, when it reads from the Relay Log
C5. Artificial Rotate event that is sent to the slave *before* any FD.
On top of that, the RL is by its nature not checksum-homogeneous, because
the slave server creates the very first RL file without coordination with
the master server, and the slave server writes the relay-log local events
itself.
Given these facts the following task takes shape:
C6. Recording into the RL is done respecting its checksum status.
The status changes when a newly recorded FD brings in a new checksum value.
The RL checksum status must survive any admin SQL commands that preserve
the relay-log index, which includes STOP SLAVE and FLUSH LOGS but not
CHANGE MASTER or RESET SLAVE.
Finally, OM -> NS requires
C7. Identifying the version of an FD event and comparing it against (H), a
checksum-home constant. If FD.version >= (H), the FD instance must have the
(A),(V) components.
C1-C2 are addressed naturally by watching for FDs in the stream and extracting
their (A) descriptor. Events after the FD just read are treated as checksummed
according to (A).
C3 is done by maintaining the RL checksum status and updating it
with the new value of (A) from FD_m, the FormatDescriptor event from the master.
C4 is addressed like C1-C2. Even though the RL contains FDs from both master and slave,
the most recently seen FD is treated as the checksum-holder for the stream that follows.
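The rule that the most recently seen FD governs the events after it can be sketched as a single pass over a stream. Events are modelled here as (name, alg) pairs, where alg is meaningful only for FD entries; this representation and the function name are assumptions of the sketch.

```python
ALG_UNDEF = 255  # no FD seen yet


def annotate_with_alg(stream):
    """Pair every non-FD event with the (A) of the closest preceding FD."""
    current = ALG_UNDEF
    annotated = []
    for name, alg in stream:
        if name.startswith("FD"):
            current = alg  # this FD becomes the checksum-holder
        else:
            annotated.append((name, current))
    return annotated


# A relay-log-like stream: slave FD without checksum, then master FD with CRC32.
relay_log = [("FD_s", 0), ("FD_m", 1), ("E_1", None), ("E_2", None)]
```

A reader such as the SQL thread would then verify E_1 and E_2 with the master's algorithm, since FD_m is the last FD seen before them.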
There are three patterns of RL events sequence:
1. FD_s (A_s), R_s (or S_s)
2. FD_s (A_s), R_m
3. FD_s (A_s), FD_m (A_m), E_1, ... E_l, R_s (or S_s)
here FD_s, FD_m - format descriptor events from Slave and Master, R -
rotate, E - event from master, S_s stop event by the slave.
Our task is to make sure that in 1-2 R_s,m have a checksum according to
FD_s's A_s, and that in 3 R_s has a checksum according to the last
seen FD_m.
The 2nd pattern requires computing the checksum on the slave
if NM does not send R_m checksummed or if the event is OM's R_m.
In capturing 3 we rely on the stability of the RL checksum status, which
is not affected by STOP SLAVE.
Notice that the last event in the 3rd sequence is recorded at
restart of the slave service.
C5 is addressed by querying the master at the time of connecting to the dump thread.
The master's @@binlog_checksum value at that moment becomes known to both parties,
so that the IO thread knows what to do with the first fake Rotate event.
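The query made at connect time can be modelled with a toy master object. Everything here is hypothetical scaffolding: the classes, method names, and the fallback to NONE are assumptions of the sketch. It assumes that a checksum-unaware master rejects the reference to its binlog_checksum system variable with an unknown-system-variable error.

```python
class UnknownSystemVariable(Exception):
    pass


class ToyMaster:
    def __init__(self, binlog_checksum=None):
        # binlog_checksum=None models a checksum-unaware (Old) master.
        self.sysvars = {}
        if binlog_checksum is not None:
            self.sysvars["binlog_checksum"] = binlog_checksum
        self.uservars = {}

    def set_uservar_from_sysvar(self, uservar: str, sysvar: str) -> None:
        if sysvar not in self.sysvars:
            raise UnknownSystemVariable(sysvar)
        self.uservars[uservar] = self.sysvars[sysvar]

    def select_uservar(self, uservar: str) -> str:
        return self.uservars[uservar]


def slave_handshake(master: ToyMaster) -> str:
    """Return the alg the slave should expect for the first fake Rotate."""
    try:
        master.set_uservar_from_sysvar("master_binlog_checksum",
                                       "binlog_checksum")
    except UnknownSystemVariable:
        return "NONE"  # old master: events arrive without checksum
    return master.select_uservar("master_binlog_checksum")
```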
The RL checksum status of C6 is implemented as a member of the
Relay_log_Info.relay_log entry. It is initialized to 255, and the first local FD
of the Slave changes the value. After that, the value is altered
whenever an FD_m is received from a master.
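The life cycle of the RL checksum status can be sketched as a tiny state holder. The class and method names are hypothetical; only the initial value 255 and the two kinds of updates come from the text above.

```python
CHECKSUM_ALG_UNDEF = 255


class RelayLogChecksumStatus:
    """Tracks which alg applies to events being written to the relay log."""

    def __init__(self):
        self.alg = CHECKSUM_ALG_UNDEF

    def on_format_descriptor(self, alg: int) -> None:
        # Called for the slave's local FD_s and for every FD_m received.
        self.alg = alg

    def alg_for_local_event(self) -> int:
        # Local events (e.g. R_s, S_s) follow the current status.
        return self.alg


# Pattern 3: FD_s (A_s = 0), FD_m (A_m = 1), ..., R_s
status = RelayLogChecksumStatus()
initial = status.alg_for_local_event()
status.on_format_descriptor(0)   # FD_s written by the slave
after_fd_s = status.alg_for_local_event()
status.on_format_descriptor(1)   # FD_m received from the master
after_fd_m = status.alg_for_local_event()
```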
Further fine details of implementation.
1. @@global.binlog_checksum
requires a new sys_var class that
- engages mysql_bin_log's LOCK_log mutex as the guard for modifying
the actual value of opt_binlog_checksum.
- has an update method that makes sure the current binlog is rotated,
with the terminal event in it checksummed according
to the old value, while
the new file's FD event receives the checksum alg description byte
and a checksum value set according to the new value.
2. @@global.master_verify_checksum
@@global.slave_sql_verify_checksum
require a new class with the standard LOCK_global_system_variables
as a guard. Unlike item 1, the update method needs no special handling.
3. Computation of the checksum starts in Log_event::write() for
events written directly to the binlog, and in MYSQL_BIN_LOG::write_cache()
for events written via the transaction cache.
In the existing framework,
MYSQL_BIN_LOG::LOCK_log is acquired before, and released after, computing and
writing to the binlog file. Although opt_binlog_checksum can be read
multiple times during this process, its value stays the same thanks to the
mutex of item 1.
Log_event::write() can receive an event instance marked
for checksumming by the caller.
The FD event is treated specially. Even when @@binlog_checksum is NONE, an FD_m
instance carries (A). In that case (A) describes the checksumming algorithm
of the FD alone.
The position of (A) and (V) at the end of the FD makes the OS
construct its FD instance with extra bytes at the end of the post_header_len[]
array. The extra bytes remain unreferenced but harmless garbage in the OS's
FD post_header_len: none of the event constructors that take an FD
argument can access them, because post_header_len[] is indexed by event type
codes and the index of the extra bytes is >= the max event type code
of the OS,
post_header_len[]= {UNKNOWN_EVENT, START_EVENT_V3, ... ,
HEARTBEAT_LOG_EVENT /* the current max code */,
(A), (V)}
That provides transparent NM -> OS support, provided that NM does not
checksum events other than the FD (@@global.binlog_checksum == NONE).
The slave can mark local relay-log events via Log_event::checksum_alg prior
to calling Log_event::write(), which triggers the computation as well.
4. The checksum handshake is initiated by the slave, which sends a user
variable setting to the master:
set @master_binlog_checksum= @@global.binlog_checksum
An Unknown system variable error in reply indicates an Old Master, since the
statement is evaluated on the master.
Otherwise the setting lets the dump thread know which alg to use
in checksumming the fake Rotate event, and makes the slave aware
of the alg as well.
The slave reads back the @master_binlog_checksum value to learn
what (A) is going to be for the first event from the master, which
is the so-called fake Rotate (see C5 above).
mysql_binlog_send() verifies the capability of the client (slave or mysqlbinlog)
each time the dump thread
reads a binlog event whose parent FD is marked with (A) == CRC32.
5. Testing shall demonstrate the user interface, including the default values
and the value domain, OLD<->NEW compatibility, effectiveness, and an insignificant
performance penalty, if any.
Notice that not all tests can currently pass with the new option --binlog-checksum
due to BUG#49741.
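The post_header_len argument of item 3 can be illustrated with a toy table. The event type codes, the lengths, and the trailing byte values below are made up for the sketch; the only property carried over from the text is that an Old Slave indexes the array by event type codes it knows, so bytes appended past that range are never read.

```python
# Hypothetical event type codes 0..5 known to the Old Slave.
OLD_EVENT_TYPES = list(range(6))

# Made-up post-header lengths, one entry per known event type.
post_header_len = [10 + t for t in OLD_EVENT_TYPES]

# The same array as sent by a New Master: (A) plus a 4-byte (V) appended.
post_header_len_from_nm = post_header_len + [1, 0xAA, 0xBB, 0xCC, 0xDD]


def old_slave_lookup(table, event_type: int) -> int:
    """An Old Slave only ever indexes with event types it knows."""
    if event_type not in OLD_EVENT_TYPES:
        raise ValueError("unknown event type for this slave")
    return table[event_type]


same_for_all_known_types = all(
    old_slave_lookup(post_header_len, t)
    == old_slave_lookup(post_header_len_from_nm, t)
    for t in OLD_EVENT_TYPES
)
```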
Observation: write_cache() forced laborious hacking because it had earlier
received changes that convert relative offsets into absolute ones.
In my (Andrei's) opinion, that should not have happened. Instead, the handler
of SHOW BINLOG EVENTS, which needed that relative-to-absolute conversion,
should perform the conversion itself when it is called (instead of doing
it per event), and the binlog should contain the events of a transaction with their
relative offsets.
Copyright (c) 2000, 2025, Oracle Corporation and/or its affiliates. All rights reserved.