WL#2540: Replication event checksums
Affects: Server-5.6
—
Status: Complete
SITUATION
=========

Events are written to the binary log as they are executed. They are then
sent to the slave on an event-by-event basis, either through a socket or
over a network.

RATIONALE
=========

This is fundamental validity checking that verifies that replication works
correctly. It would make it much clearer to our customers when a
replication failure is due to network/disk/memory failures and when it is
due to bugs in the servers. See the long list below for potentially
previously affected customers.

WISHFUL REQUIREMENT
===================

After this is implemented, it should (preferably) not be possible to ever
crash the slave due to corrupted events. To make this happen, one could go
through all values in all events and check that they can't be illegal. This
could possibly be a second patch.

PROBLEM
=======

Some customers get very strange replication failures and it is impossible
to know what causes them. Sinisa says that they could be "network problems"
(e.g. CSC#4792). The failures cause the slave to corrupt the data rather
than stop, which would be the appropriate action.

Some reported incidents (please update these once the patch is pushed):

- BUG#25737
- BUG#26123
- BUG#27048
- BUG#23619
- BUG#29309
- http://forums.mysql.com/read.php?26,148423,148423
- BUG#22889
- BUG#5116
- BUG#38718

The WL is expecting fixes from BUG#49741 to make tests pass with the new
options supplied to the servers.

SUMMARY
=======

Add checksum to binary log events.

SINISA WRITES (2005-04-13, CSC#4792):

This points to the missing reliability features, that SHOULD be implemented
in 5.0.

* checksum stored in binary and relay log to check for RAM / disk
  corruption.
* checksum sent to slave for each event to check for network corruption.

TYPE OF CHECKSUM
================

Alternatives:

1. 8-12 bit checksum
   PRO: Can use the flags part of the common event header and thus does
        not need to change the binary log format
   CON: Has a higher probability of undetectable error than CRC32

2. CRC32
   PRO: Standard, implemented with multiple algorithms.
        Generally computationally inexpensive

3. SHA or MD5
   PRO: Can also be used for authentication of events
   CON: Are very long (at least 128 bits)
        Computationally expensive.

Mats and Lars are currently considering a 32 bit checksum (described in
ISO 3309).

REPAIRING EVENTS
================

- Should the checksum be capable of also repairing the event?
  PRO: That would be really nice, especially if the mysqlbinlog client
       can be extended to do repairs of binary logs
  CON: Then we can't use the bits for authentication

Mats and Lars think that we can have the type of the checksum in the
format_description log event (or implicitly as the binary log version), so
that we can change to an error correcting checksum in the future.

SUGGESTED SOLUTION (By Mats)
============================

To ensure the integrity of each event arriving at the slave, a checksum
should be added to each event. This allows the slave to check that the
event was transmitted correctly and written correctly to the relay log. If
it was not, the slave can stop and indicate an error rather than trying to
process the event. This is particularly important when using row-based
replication, since subtly corrupted events can be applied without raising
any error.

In my opinion, we should focus on the integrity of the events and ignore
issues that relate to authenticity, since methods for handling that are
computationally expensive, and authenticity can be achieved through other
means (e.g., using SSL).

Note that the suggestion below (using SSL) does not solve the issue of
corruption in the binary log or in the relay log: it just handles
corruption during transfer of the event.
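The per-event integrity check suggested above can be illustrated with a
minimal sketch. It assumes CRC32 from Python's zlib as the checksum and a
4-byte little-endian trailer; both are illustration choices rather than
something this worklog fixes at this point, and the function names
append_checksum and verify_checksum are hypothetical.

```python
import struct
import zlib


def append_checksum(event: bytes) -> bytes:
    """Return the event followed by a 4-byte CRC32 trailer.

    The little-endian trailer layout is an assumption for this sketch.
    """
    return event + struct.pack("<I", zlib.crc32(event) & 0xFFFFFFFF)


def verify_checksum(framed: bytes) -> bytes:
    """Return the event body if the trailer matches, else raise.

    Raising stands in for "the slave stops and indicates an error"
    instead of trying to process a corrupted event.
    """
    if len(framed) < 4:
        raise ValueError("event too short to carry a checksum")
    body, trailer = framed[:-4], framed[-4:]
    (stored,) = struct.unpack("<I", trailer)
    if zlib.crc32(body) & 0xFFFFFFFF != stored:
        raise ValueError("event checksum mismatch")
    return body


if __name__ == "__main__":
    framed = append_checksum(b"example event bytes")
    assert verify_checksum(framed) == b"example event bytes"

    # Flip a single bit to simulate corruption in transit or on disk.
    corrupted = bytes([framed[0] ^ 0x01]) + framed[1:]
    try:
        verify_checksum(corrupted)
    except ValueError as err:
        print("slave would stop:", err)
```

The same pair of functions would apply equally when writing to and reading
from the relay log, which is the case SSL alone does not cover.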
INTERESTING WORK-AROUND (ANDREI'S TEXT)
=======================================

From reading about SSL's features [1] and simulating corrupted packets on
the slave (by changing data just after libc's recv function returns the
buffer), we can conclude that this idea (using SSL to create and verify a
checksum) should work. Of course I cannot check all the possible
situations; for that we would need to study the algorithms that are used.
Almost certainly the encryption algorithms in SSL take care of a checksum.

Reference:

[1] 5.1 Manual, 5.8.7.1. Basic SSL Concepts
    SSL is a protocol that uses different encryption algorithms to ensure
    that data received over a public network can be trusted. It has
    mechanisms to detect any data change, loss, or replay

Since SSL is not used for every connection, we still need to consider
per-event checksums.

OPEN ISSUES
===========

1. How about only having checksums at commit time? How would that work?
   What about non-trx tables? These would probably need a checksum in that
   event. Probably the solution is that *whenever* something is committed,
   the checksum is needed, be it a statement (due to autocommit), a
   non-transactional table update, or a real transaction.
Requirements for the design include:

R0. Checksumming shall be optional, both for computing and for verifying.
R1. Checksumming aims at detecting corrupted events *before* the event
    starts to be applied.
R2. The Old Master shall work with the New checksum-aware slave seamlessly.
R3. The checksum-aware New Master shall, as far as possible, work with the
    Old Slave.
R4. The checksumming framework should not be constrained to just one CRC32
    implementation.

The checksum is attached as the last item, following the event data.
The typical structure of an event with the checksum (CS) is as follows:

+-----------+------------+------------+--------------+
|           |            |            |              |
| Common Hdr| SubHeader  | Payload    | Checksum (V) |
+-----------+------------+------------+--------------+

Additionally, the FormatDescriptor (FD) event contains the checksum
algorithm descriptor (A), which is placed between the Payload and (V), the
checksum value. All events following the FD are checksummed according to
the FD's (A).

0          no checksum
1          CRC32 algorithm
[2 - 127]  extensible range for new algorithms
[128-254]  reserved range
255        an event that contains (A) == 255 is generated by a
           checksum-unaware server

The binlog file can have only one type of events: either with checksum (and
then all with the same algorithm) or without (the algorithm is then none).

Since events other than FD rely on the checksum information in the FD, the
FD itself becomes a point of vulnerability. In particular, we need to
distinguish between a checksummed-and-corrupted FD and an FD that is not
checksummed. To resolve this issue the duplication technique can be
applied. However, it is simpler to *always* checksum the FD regardless of
the current policy of R0. The FD needs two descriptors: one for the binlog
and the other for itself.
The simplest is to add a new flag (F) to the Common Header part that
determines whether the events following the FD carry a checksum or not:

+-----------+------------+------------+------+------------+
| Common Hdr|            |            |      |            |
|    (F)    | SubHeader  | Payload    | (A)  | Checksum(V)|
+-----------+------------+------------+------+------------+

If (F) stays down, (A) describes the algorithm to use in order to verify
(V) solely for the FD event. If (F) is up, (A) applies to the FD and all
following events, which are treated as having the trailing (V) component.
(Footnote: this part of the reliable FD is worked out from the current WL.)

NOTES FROM DISCUSSION (Lars, Luís) - 26/02/2009,
edited by Andrei to extend with further details
===============================================

S1. mysqlbinlog

    We need to upgrade the tools to work with the new checksum field (these
    should now handle the checksum as well).

S2. Server must work with multiple slave versions

    1. OM -> OS (ok)
    2. OM -> NS (ok)
    3. NM -> OS ok, if NM does not compute checksum
    4. NM -> NS (ok)

    OM - Old Master, OS - Old Slave, NM - New Master, NS - New Slave

    NOTES:
    a) we need to check if the slave registers some information about its
       version when connecting to the master

S3. Make implementation modular
    - make a function that can be reused in several places

S4. Configurable options

    The following server scope configuration parameters are added:

    1. --binlog-checksum = NONE | CRC32 [1]; the default is NONE for
       backward compatibility. CRC32 is the only checksum algorithm
       implemented. When set to CRC32, the master server logs events with
       a checksum.
       Changing from NONE to CRC32 or back forces rotation of the binlog,
       so that the new binlog file is headed by an FD whose (A) is equal
       to the value being set. That keeps the binlog
       checksum-homogeneous.
       Notice that the IO thread of the slave server and its user sessions
       create local events to write them into the relay log.
       Computation of the checksum in this case is controlled by the
       relay-log checksumming status, which indicates the checksum
       algorithm of the last FD recorded into the RL.

    2. --master-verify-checksum=0,1; the default is zero.
       When 1, the master verifies the checksum of an event when reading
       it from the binlog, both for user sessions (via SHOW BINLOG EVENTS)
       and for the dump threads, as well as at recovery and binlog
       initialization.
       The value 0 is acceptable without much compromise to the
       reliability of the system, since the slave IO thread always
       verifies the checksum upon receiving a checksummed event from the
       network.

    3. --slave-sql-verify-checksum=0,1; the default is 1.
       When 1, the slave SQL thread verifies the checksum when reading an
       event from the relay log file. The value 1 also forces the slave
       server to start the relay log with its local FD checksummed,
       (A) != 0.
       If the value is set to zero, the checksum is not verified by the
       SQL thread, and the local FD is created with (A) == 0.
       Notice that the Relay Log can have heterogeneous content, with
       events checksummed in different ways. Still, FD events carry an
       (A)_b value matching the checksum status of the events that follow.

    The parameters initialize the namesake global server variables:

    @@global.binlog_checksum can be altered at run time, which forces
    rotation of the current binlog file as a side effect. The binlog is
    supposed to be homogeneous, that is, to hold either all events with
    checksum or all events without it;
    @@global.master_verify_checksum can be turned ON|OFF;
    @@global.slave_sql_verify_checksum can be turned ON|OFF.

The R3 requirement is implemented as follows.
NM rejects an OS connection attempt if the OS requests a binlog position in
a checksummed binlog file.
NM disconnects from the OS if it is about to send the first checksummed
event.
Notice that NM -> OS support is generally limited, as documented.
Still, while NM is configured without checksumming, NM -> OS works.
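The S2 matrix and the R3 rule above can be condensed into a small
decision sketch. The constant and function names are hypothetical, and the
sketch only models the decision "may the dump thread send events from a
binlog whose FD carries this (A)"; it does not model the FD's own checksum
or the actual connection protocol.

```python
ALG_NONE = 0     # (A) == 0: no checksum
ALG_CRC32 = 1    # (A) == 1: CRC32
ALG_UNDEF = 255  # (A) == 255: written by a checksum-unaware server


def master_may_send(binlog_alg: int, slave_is_checksum_aware: bool) -> bool:
    """Return True if events from this binlog may be sent to the slave.

    A binlog without checksums can go to any slave; a checksummed binlog
    can only go to a checksum-aware (New) slave.
    """
    if binlog_alg in (ALG_NONE, ALG_UNDEF):
        return True
    return slave_is_checksum_aware


if __name__ == "__main__":
    for alg in (ALG_NONE, ALG_CRC32):
        for aware in (False, True):
            verdict = "ok" if master_may_send(alg, aware) else "reject"
            slave = "NS" if aware else "OS"
            print(f"(A)={alg} -> {slave}: {verdict}")
```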
Notice that the New checksum-aware mysqlbinlog may be used to transport the
NM's binlog file to the Old Slave.

[1] CRC32 of zlib.

    My searches for a nomenclature of a standard revealed
    http://www.gzip.org/zlib/rfc-gzip.html

      CRC32 (CRC-32)
      This contains a Cyclic Redundancy Check value of the uncompressed
      data computed according to CRC-32 algorithm used in the ISO 3309
      standard and in section 8.1.1.6.2 of ITU-T recommendation V.42.

    Comparing the specifications and polynomials from
    http://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-V.42-200203-I!!PDF-E&type=items
    and
    http://en.wikipedia.org/wiki/Cyclic_redundancy_check
    points at

      CRC-32-IEEE 802.3
      x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7
      + x^5 + x^4 + x^2 + x + 1
      (V.42, MPEG-2, PNG, POSIX cksum)
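As a cross-check of the footnote above, the following sketch builds the
polynomial from the listed exponents and runs a plain bit-by-bit CRC-32,
comparing the result with zlib.crc32. The bit-reflected processing, the
initial value 0xFFFFFFFF and the final inversion are the conventions zlib
is known to use; the helper names are hypothetical.

```python
import zlib

# Exponents of the polynomial below x^32, as listed in the footnote.
EXPONENTS = (26, 23, 22, 16, 12, 11, 10, 8, 7, 5, 4, 2, 1, 0)
POLY = sum(1 << e for e in EXPONENTS)  # normal form, 0x04C11DB7


def reflect32(value: int) -> int:
    """Reverse the order of the 32 bits of value."""
    return int(f"{value:032b}"[::-1], 2)


POLY_REFLECTED = reflect32(POLY)  # 0xEDB88320, used by LSB-first CRC-32


def crc32_bitwise(data: bytes) -> int:
    """Bit-by-bit CRC-32 (reflected, init 0xFFFFFFFF, final XOR)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    sample = b"123456789"
    assert POLY == 0x04C11DB7
    assert crc32_bitwise(sample) == zlib.crc32(sample)
    print(hex(crc32_bitwise(sample)))
```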
Because the header part of every event other than FD is unchanged, the FD
is the only way to find out whether an event contains a checksum.

Low-level design shall also aim at making sure the checksum-holder FD event
can be located in all of the following cases:

C1. SHOW BINLOG EVENTS
C2. the dump thread, when it reads from the binlog to send to the slave
C3. the IO thread, when it receives an event from the network to queue it
    into the Relay Log
C4. the SQL thread, when it reads from the Relay Log
C5. the artificial Rotate event that is sent to the slave *before* any FD.

On top of that, the RL is not necessarily checksum-homogeneous, because the
slave server creates the very first RL file without coordination with the
master server, and the slave server writes the relay-log local events
itself. Given these facts, the following task takes shape:

C6. Recording into the RL is done respecting its checksum status.
    The status changes when a newly recorded FD brings in a new checksum
    value.
    The RL checksum status must survive any admin SQL commands that
    preserve the relay-log index, which includes STOP SLAVE and FLUSH LOGS
    but not CHANGE MASTER or RESET SLAVE.

Finally, OM -> NS forces

C7. Identifying the version of the FD event and comparing it against (H),
    a checksum-home constant. If FD.version >= (H), the FD instance must
    have the (A),(V) components.

C1-C2 are addressed naturally by watching for FDs in the stream and
extracting their (A) descriptor. Events after the read-out FD are treated
as checksummed according to (A).

C3 is done by maintaining the RL checksum status and updating it with the
new value of (A) of FD_m, the FormatDescriptor event from the master.

C4 is addressed like C1-C2. Even though the RL contains FDs from both
master and slave, the last FD met is treated as the checksum-holder of the
following stream.
There are three patterns of RL event sequences:

1. FD_s (A_s), R_s (or S_s)
2. FD_s (A_s), R_m
3. FD_s (A_s), FD_m (A_m), E_1, ... E_l, R_s (or S_s)

Here FD_s, FD_m are the format descriptor events from Slave and Master,
R is a rotate event, E is an event from the master, and S_s is a stop event
by the slave.

Our task is to make sure that in 1-2 R_s,m have a checksum according to
FD_s's A_s, and that in 3 R_s has a checksum according to the last seen
FD_m.
The 2nd pattern requires computing the checksum on the slave if NM does not
send R_m checksummed or if the event is an OM's R_m.
In capturing 3 we rely on the stability of the RL checksum status, which is
not affected by STOP SLAVE. Notice that the last event in the 3rd sequence
is recorded at restart of the slave service.

C5 is addressed by querying the master at the time of connecting to the
dump thread. The master's @@binlog_checksum value at that moment becomes
known to both parties, so that the IO thread knows what to do with the
first fake Rotate event.

The RL checksum status of C6 is implemented as a member of the
Relay_log_Info.relay_log entry. It is initialized as 255, and the first
local FD of the slave changes the value. From then on, the value is altered
by receiving FD_m from a master.

Further fine details of implementation.

1. @@global.binlog_checksum requires a new sys_var class that
   - engages mysql_bin_log's LOCK_lock mutex as the guard for modifying
     the actual value of opt_binlog_checksum.
   - has an update method that makes sure the current binlog is rotated,
     with the terminal event in it checksummed according to the old value,
     while the new file's FD event receives the checksum algorithm
     description byte and its checksum value is set according to the new
     value being set.

2. @@global.master_verify_checksum
   @@global.slave_sql_verify_checksum
   require a new class with the standard LOCK_global_system_variables as a
   guard. Unlike item 1, the update method has no specifics of its own.

3. Computation of the checksum is started at the time of
   Log_event::write() for events written directly to the binlog, and at
   MYSQL_BIN_LOG::write_cache() for events written via the transaction
   cache.
   In the existing framework, MYSQL_BIN_LOG::LOCK_log is acquired before
   and released after computing and writing to the binlog file is done.
   While opt_binlog_checksum can be read multiple times during this
   process, its value stays the same thanks to the mutex of item 1.
   Log_event::write() can receive the event instance marked for
   checksumming by the caller.

   The FD event is treated specially. Even when @@binlog_checksum is NONE,
   an FD_m instance carries (A). In that case (A) describes the algorithm
   for checksumming the FD alone.
   The position of (A) and (V) at the end of the FD makes the OS construct
   its FD instance with extra bytes at the end of the post_header_len[]
   array.
   The extra bytes remain unreferenceable but harmless garbage in the OS's
   FD post_header_len. None of the event constructors taking an FD
   argument can access them, because post_header_len[] is indexed by event
   type codes and the index of the extra bytes is >= the maximum event
   type code of the OS:

     post_header_len[]= {UNKNOWN_EVENT, START_EVENT_V3, ... ,
                         HEARTBEAT_LOG_EVENT /* the current max code */,
                         (A), (V)}

   This shows that NM -> OS support is transparent, provided that NM does
   not checksum events other than the FD
   (@@global.binlog_checksum == NONE).

   The slave can mark local relay-log events via Log_event::checksum_alg
   prior to calling Log_event::write(), which triggers the computation as
   well.

4. The checksum handshake is initiated by the slave, which sends a user
   variable setting to the master:

     set @master_binlog_checksum= @@global.binlog_checksum

   An "Unknown system variable" error indicates an Old Master.
   Otherwise the setting lets the dump thread know which algorithm to use
   in checksumming the fake Rotate event, and the slave is aware of the
   algorithm as well.
   The slave reads back the @master_binlog_checksum value to know what (A)
   is going to be for the first event from the master, the so-called fake
   Rotate (see C4.2 above).
   mysql_binlog_send() verifies the capability of the client (slave or
   mysqlbinlog) each time the dump thread reads a binlog event whose
   parent FD is marked with (A) == CRC32.

5. Testing shall demonstrate the user interface, including the default
   values, the value domain, OLD<->NEW compatibility, effectiveness, and
   an insignificant performance penalty, if any.
   Notice that not all tests can currently pass with the new option
   --binlog-checksum due to BUG#49741.

Observation: write_cache() forced laborious hacking, because it had earlier
received changes to convert relative offsets into absolute ones.
In my (Andrei's) opinion, that should not have happened. Instead, the
handler of SHOW BINLOG EVENTS, which needed that relative-to-absolute
conversion, should perform the conversion itself when it is called (rather
than doing it per event), and the binlog should contain the events of a
transaction with their relative offsets.
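The way C1-C4 and C6 locate the checksum-holder FD can be illustrated with
a small stream reader. This is a simplified sketch under stated
assumptions: events are modelled as in-memory records rather than the real
binary layout, the trailer is a 4-byte little-endian CRC32, and the special
rule that the FD itself is always checksummed is left out. All class and
function names are hypothetical.

```python
import struct
import zlib
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional

ALG_OFF, ALG_CRC32, ALG_UNDEF = 0, 1, 255


@dataclass
class Record:
    is_fd: bool                   # True for a FormatDescriptor event
    raw: bytes                    # body, plus 4-byte trailer if checksummed
    fd_alg: Optional[int] = None  # (A); only meaningful for FD records


def crc_frame(body: bytes) -> bytes:
    """Append the assumed CRC32 trailer to an event body."""
    return body + struct.pack("<I", zlib.crc32(body))


def read_stream(records: Iterable[Record]) -> Iterator[bytes]:
    """Yield verified event bodies, tracking the checksum status.

    The status starts as 255 (undefined) and is replaced by the (A) of
    every FD met, so the last FD seen governs the events that follow.
    """
    status = ALG_UNDEF
    for rec in records:
        if rec.is_fd:
            status = rec.fd_alg
        if status == ALG_CRC32:
            body, trailer = rec.raw[:-4], rec.raw[-4:]
            if struct.pack("<I", zlib.crc32(body)) != trailer:
                raise ValueError("checksum mismatch")
            yield body
        else:
            yield rec.raw


if __name__ == "__main__":
    # Pattern 3: FD_s (A_s = 0), FD_m (A_m = CRC32), E_1, R_s
    relay_log = [
        Record(True, b"FD_s", ALG_OFF),
        Record(True, crc_frame(b"FD_m"), ALG_CRC32),
        Record(False, crc_frame(b"E_1")),
        Record(False, crc_frame(b"R_s")),  # follows the last seen FD_m
    ]
    print(list(read_stream(relay_log)))
```

In this model a slave-local Rotate written after FD_m has to be
checksummed according to FD_m, otherwise the reader rejects it; this is the
situation pattern 3 describes.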