WL#7319: Infrastructure for GTID based delayed replication and replication lag monitoring
Affects: Server-8.0
—
Status: Complete
Replication lag monitoring with the Seconds_Behind_Master field in SHOW SLAVE STATUS does not work well in topologies other than simple master-slave, for the following reasons:

1. If there are multiple masters, e.g. in a circular or chained topology, there is still only one field, and it measures the delay of the last executed transaction. It would be better to report the delay for each master in a separate field (i.e., a table).

2. In a chained replication setup, there is no way to measure the delay from the nearest master. This would be useful to measure the delay added by the intermediate servers.

3. The delay is measured per-event, from the time the statement started to execute. It would be more truthful to measure the delay *per-transaction*, from the time the transaction *committed*.

4. The timestamp stored in the binary log is relative to the original master's timezone. If servers have different timezones, the delay is offset by the timezone difference. In a simple master-slave setup, this is compensated because the slave knows the time difference with the nearest master. In a chained replication setup, however, the slave does not know the timezone difference with the original master, so the measured delay is offset by the timezone difference between the nearest master and the original master.

5. The current event timing information has only second precision, not microsecond precision. Microseconds would be useful especially as we go on to improve the multi-threaded slave.

In this worklog, we create the infrastructure for correcting these problems. In particular, we will introduce two new timestamps that are associated with each transaction (not each event or statement) in the binary log:

- The original_commit_timestamp is the number of microseconds since the epoch at which the transaction was committed on the original master.
- The immediate_commit_timestamp is the number of microseconds since the epoch at which the transaction was committed on the immediate master.
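To make the two definitions concrete, here is a minimal sketch in Python of how the timestamps would evolve along a chain A -> B -> C. The class and function names are hypothetical and are not part of the server code. The last helper, which derives a delay from the two timestamps, is an assumption about how later work could use this infrastructure; this worklog itself only records the timestamps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GtidTimestamps:
    """Hypothetical container for the two per-transaction timestamps."""
    original_commit_timestamp: int   # microseconds since epoch, fixed on the original master
    immediate_commit_timestamp: int  # microseconds since epoch, set by the server writing its binlog


def commit_on_original_master(now_us: int) -> GtidTimestamps:
    # On the original master both timestamps describe the same commit.
    return GtidTimestamps(now_us, now_us)


def commit_on_slave(upstream: GtidTimestamps, now_us: int) -> GtidTimestamps:
    # The original timestamp is propagated unchanged; the immediate one is
    # regenerated when the slave commits the transaction to its own binlog.
    return GtidTimestamps(upstream.original_commit_timestamp, now_us)


def delay_since_original_commit(ts: GtidTimestamps) -> int:
    # Assumed use: microseconds between the original commit and the commit
    # on the server that wrote this binlog entry.
    return ts.immediate_commit_timestamp - ts.original_commit_timestamp


on_a = commit_on_original_master(1_000_000)
on_b = commit_on_slave(on_a, 1_250_000)
on_c = commit_on_slave(on_b, 1_900_000)
```

In this sketch the entry in C's binary log still carries A's commit time as original_commit_timestamp, while immediate_commit_timestamp differs in every server's binary log.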
FUNCTIONAL REQUIREMENTS
-----------------------

F1. We will introduce a new SQL variable:

    NAME: original_commit_timestamp
    SCOPE: session
    TYPE: numeric
    RANGE: [0, 2^55]
    DEFAULT: 2^55
    DESCRIPTION: For internal use by replication. When re-executing a
    transaction on a slave, this is set to the time when the transaction
    was committed on the original master, measured in microseconds since
    the epoch. This allows the original commit timestamp to be propagated
    throughout a replication topology.

F2. mysqlbinlog shall print the original_commit_timestamp as a commented SET statement that will only be executed as a query by 5.8.0+ servers:

    /*!50800 SET @@SESSION.original_commit_timestamp=XXX*/

F3. mysqlbinlog shall print the immediate_commit_timestamp as a comment in the following format:

    # immediate_commit_timestamp = N (YYYY-MM-DD HH:MM:SS.UUUUUU TIMEZONE)

    The timezone shall be the system timezone.

F4. mysqlbinlog shall print the original_commit_timestamp in human-readable form as a comment:

    # original_commit_timestamp = YYYY-MM-DD HH:MM:SS.UUUUUU TIMEZONE

    The timezone shall be the system timezone.

F5. Any timestamp recorded in gtid_log_event or written to the binary log shall be in UTC. In general, the code should always handle timestamps in UTC. Any user-visible interface (mysqlbinlog / P_S) should report time in the system timezone. If GTID-MODE=OFF, we store the timestamp in anonymous_gtid_log_event.

F6. A slave will log only one warning if the original_commit_timestamp is more recent than the immediate_commit_timestamp. When the timestamps are back to normal, another warning should be logged indicating that the timestamps are valid again.

NON-FUNCTIONAL REQUIREMENTS
---------------------------

NF1. The performance degradation resulting from this worklog shall be negligible.
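The comment formats in F3 and F4 can be illustrated with a short Python sketch. This is not the mysqlbinlog implementation; the function names are hypothetical, and the `tz` parameter exists only so that the example can be run with a fixed timezone. Passing `tz=None` makes `astimezone` convert to the system timezone, which is what F3 and F4 ask for.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def format_commit_timestamp(micros: int, tz=None) -> str:
    """Render microseconds since the epoch as YYYY-MM-DD HH:MM:SS.UUUUUU TIMEZONE."""
    # Integer timedelta arithmetic avoids float rounding of the microseconds.
    moment = (EPOCH + timedelta(microseconds=micros)).astimezone(tz)
    return moment.strftime("%Y-%m-%d %H:%M:%S.%f %Z")


def immediate_commit_comment(micros: int, tz=None) -> str:
    # F3: raw value N followed by the human-readable form in parentheses.
    return f"# immediate_commit_timestamp = {micros} ({format_commit_timestamp(micros, tz)})"


def original_commit_comment(micros: int, tz=None) -> str:
    # F4: human-readable form only.
    return f"# original_commit_timestamp = {format_commit_timestamp(micros, tz)}"


print(immediate_commit_comment(1_500_000_000_123_456, tz=timezone.utc))
# prints "# immediate_commit_timestamp = 1500000000123456 (2017-07-14 02:40:00.123456 UTC)"
```

The stored value stays in UTC microseconds (F5); only the printed form depends on the timezone.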
I1. A timestamp - the *immediate_commit_timestamp* - shall be generated whenever a transaction is committed, and stored in the Gtid_log_event in the binary log. For a replicated transaction, the immediate_commit_timestamp may differ between the master's binary log and the slave's binary log.

I2. A timestamp - the *original_commit_timestamp* - shall be generated when a transaction is committed on the first master, stored in the Gtid_log_event in the binary log, and propagated to the binary logs of slaves using the @@session.original_commit_timestamp variable. For a replicated transaction, the original_commit_timestamp must be identical in the master's binary log and in all slaves' binary logs.

    If the transaction was generated by a group member, the original_commit_timestamp shall be generated when the transaction is ready to be committed. This means that the original_commit_timestamp will never be equal to the immediate_commit_timestamp, even on the original master, but it will still be the same on all servers to which the transaction is replicated.

    Transactions that consist of a View_change_log_event are the exception: this event is created locally on each group member every time a member joins, and it is not replicated. In this case, the original_commit_timestamp is set to zero, as there is no single original master.

I3. If the two timestamps are equal (on the original master), original_commit_timestamp shall not be stored explicitly in the binary log; instead a single bit will indicate that the timestamp is not included. This saves space.

I4. The default value of the original_commit_timestamp session variable can only be used to set the session variable, which means that it will not be written to the binary log. Its main purpose, when computing the original_commit_timestamp's value, is to determine whether the value is already known or still needs to be determined.

I5. The possible values for the original_commit_timestamp that can be stored in a GTID/Anonymous_GTID event are:

    - 0: the value is unknown;
    - >0 && <2^55: the value is the transaction commit timestamp on the original master.

I6. When a transaction is rolled back, we will restore the value of the original_commit_timestamp variable to its default value, 2^55.

I7. The layout of the gtid_log_event is now as follows:

    +------+--------+-------+-------+--------------+---------------+------------+
    |unused|SID     |GNO    |lt_type|last_committed|sequence_number| timestamps*|
    |1 byte|16 bytes|8 bytes|1 byte |8 bytes       |8 bytes        | 7/14 bytes |
    +------+--------+-------+-------+--------------+---------------+------------+

    * The worklog adds the "timestamps" fields depicted above to the existing gtid_log_event structure. They contain the commit timestamps on the original and on the immediate master.

    This is how the timestamps section is serialized to a memory buffer:

    if original_commit_timestamp != immediate_commit_timestamp:

      +-7 bytes, high bit (1<<55) set----+-7 bytes--------------------+
      | immediate_commit_timestamp       | original_commit_timestamp  |
      +----------------------------------+----------------------------+

    else:

      +-7 bytes, high bit (1<<55) cleared-+
      | immediate_commit_timestamp        |
      +-----------------------------------+

I8. If GTID-MODE=OFF, there is no gtid_log_event, so we store the timestamps in anonymous_gtid_log_event.
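The serialization rule in I7 can be sketched in Python as follows. This is an illustration, not the server code: the helper names mirror the 7-byte read/write helpers int7korr() and int7store() that this worklog introduces, but their Python signatures here are hypothetical, and the little-endian byte order is an assumption of this sketch.

```python
from typing import Tuple

FLAG_BIT = 1 << 55  # high bit of the 7-byte immediate_commit_timestamp field


def int7store(value: int) -> bytes:
    # Write a value below 2^56 as 7 bytes (little-endian is assumed here).
    return value.to_bytes(7, "little")


def int7korr(buf: bytes, offset: int = 0) -> int:
    # Read 7 bytes starting at offset back into an integer.
    return int.from_bytes(buf[offset:offset + 7], "little")


def encode_timestamps(immediate: int, original: int) -> bytes:
    if not (0 <= immediate < FLAG_BIT and 0 <= original < FLAG_BIT):
        raise ValueError("timestamps must fit in 55 bits")
    if original != immediate:
        # Flag set: the original timestamp follows in a second 7-byte field.
        return int7store(immediate | FLAG_BIT) + int7store(original)
    # Flag cleared: a single field stands for both timestamps.
    return int7store(immediate)


def decode_timestamps(buf: bytes) -> Tuple[int, int]:
    first = int7korr(buf)
    if first & FLAG_BIT:
        return first & (FLAG_BIT - 1), int7korr(buf, 7)
    return first, first


same = encode_timestamps(1_500_000_000_000_000, 1_500_000_000_000_000)
differ = encode_timestamps(1_500_000_000_500_000, 1_500_000_000_000_000)
```

Here `same` occupies 7 bytes and `differ` occupies 14, matching the "7/14 bytes" column in the layout above. A microsecond timestamp stays below 2^55 for roughly the next thousand years, which is why the top bit of the 7-byte field is free to act as the flag.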
This worklog does the following:

L1) Add two fields, original_commit_timestamp and immediate_commit_timestamp, to gtid_log_event. These are sent from master to slave, so they must be added to the encoding/decoding functions. The commit time is recorded in write_gtid(), so we assume that the commit timestamp of a transaction is the time at which it is written to the binary log.

L2) Add the has_commit_timestamps flag to the Gtid_event class. This flag indicates whether the current gtid_log_event includes at least the immediate_commit_timestamp. If the gtid_log_event does not include any commit timestamp, it was replicated from a server with a MySQL version lower than 5.8 and has to be processed accordingly.

L3) Instead of writing 8 bytes to the binary log, we make an optimization and write only 7 bytes for each commit timestamp. We also use the highest bit (MSB) of the 7 bytes of the immediate_commit_timestamp as a flag to detect whether the original_commit_timestamp is the same as the immediate_commit_timestamp. If the two timestamps are the same, we write only one of them to the binary log. The encoding and decoding logic is modified to handle this. We introduce int7korr() and int7store() to read and write 7 bytes from and to memory, respectively.

L4) As defined in the functional requirements, we add the information regarding the two timestamps to the mysqlbinlog output.

L5) Before this worklog, the gtid_log_event contained only a header and no body. In this worklog, we introduce a body for gtid_log_event that contains:

    L5.1) a single commit timestamp on the original master, because the original and immediate commit timestamps written to the binlog are the same on that server;
    L5.2) both the original and the immediate commit timestamp otherwise.

    The reason we store the timestamps in the body section (and NOT in the header) is that their length varies depending on points (L5.1) and (L5.2).

L6) The heuristic to compute the original_commit_timestamp is the following:

    1) When the original_commit_timestamp session variable is set to a value other than UNDEFINED_COMMIT_TIMESTAMP (2^55), either the timestamp is known ( > 0 ) or the timestamp is not known ( == 0 ).
    2) When original_commit_timestamp == UNDEFINED_COMMIT_TIMESTAMP, we assume that:
       a) if this thread is a slave applier, then the timestamp is unknown ( = 0 );
       b) else, this is considered a new transaction, original on this server ( = immediate_commit_timestamp).

L7) Though not related to this worklog, we make a slight change to the names of two methods:

    - write_header_to_memory()
    - write_data_header_to_memory()
    + write_common_header_to_memory()
    + write_post_header_to_memory()

    This is done only to make the method names clearer.

L8) Testing:

    Added mysql-test/suite/rpl/t/rpl_timestamps_line_topology.test
    Added mysql-test/suite/rpl/t/rpl_timestamps_line_topology_cross_version.test
    Added mysql-test/suite/sys_vars/t/original_commit_timestamp_basic.test
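The heuristic for computing the original_commit_timestamp can be written out as a small decision function. The sketch below is in Python for brevity; the function name and parameters are hypothetical and do not correspond to identifiers in the server source. Only UNDEFINED_COMMIT_TIMESTAMP and its value 2^55 come from the text above.

```python
UNDEFINED_COMMIT_TIMESTAMP = 1 << 55  # default value of the session variable


def resolve_original_commit_timestamp(session_value: int,
                                      is_slave_applier: bool,
                                      immediate_commit_timestamp: int) -> int:
    """Return the original_commit_timestamp to store for a committing transaction."""
    if session_value != UNDEFINED_COMMIT_TIMESTAMP:
        # Case 1: the variable was set explicitly. A value > 0 is the known
        # timestamp; 0 means the timestamp is explicitly unknown.
        return session_value
    if is_slave_applier:
        # Case 2a: a replicated transaction that carried no timestamp,
        # for example one that came from an older master.
        return 0
    # Case 2b: a new transaction that originates on this server.
    return immediate_commit_timestamp
```

Because the default 2^55 lies outside the range of storable values (0 to 2^55 - 1), it can act as a "not set" sentinel without ever reaching the binary log.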
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.