WL#7319: Infrastructure for GTID based delayed replication and replication lag monitoring
Affects: Server-8.0
—
Status: Complete
Replication lag monitoring with the Seconds_Behind_Master field in SHOW SLAVE
STATUS does not work well in replication topologies other than simple
master-slave, for the following reasons:
1. If there are multiple masters, e.g. in a circular or chained topology, then
there is still only one field that measures the delay for the last executed
transaction. It would be better to distinguish the delays for each master
into separate fields (i.e., a table).
2. In a chained replication setup, there is no way to measure the delay from the
nearest master. This would be useful to measure the delay added by the
intermediate servers.
3. The delay is measured per-event, from the time the statement started to
execute. It would be more truthful to measure the delay *per-transaction*,
from the time the transaction *committed*.
4. The timestamp stored in the binary log is relative to the original master's
timezone. If servers have different timezones, this causes the delay to be
offset by the time zone difference. In a simple master-slave setup, this is
compensated because the slave knows the time difference with the nearest
master. However, in a chained replication setup the time zone difference with
the original master is not known to the slave, so the measured delay will
be offset by the timezone difference between the nearest master and the
original master.
5. The current event timing information has only second precision, not
microsecond precision. Microseconds would be especially useful as we go on to
improve the multi-threaded slave.
In this worklog, we create the infrastructure for correcting these problems. In
particular, we will introduce two new timestamps that are associated with each
transaction (not each event or statement) in the binary log:
- The original_commit_timestamp is the number of microseconds since the epoch
when the transaction was committed on the original master.
- The immediate_commit_timestamp is the number of microseconds since the epoch
when the transaction was committed on the immediate master.
FUNCTIONAL REQUIREMENTS
-----------------------
F1. We will introduce a new SQL variable:
NAME: original_commit_timestamp
SCOPE: session
TYPE: numeric
RANGE: [0, 2^55]
DEFAULT: 2^55
DESCRIPTION: For internal use by replication. When re-executing a
transaction on a slave, this is set to the time when the transaction was
committed on the original master, measured in microseconds since
the epoch. This allows the original commit timestamp to be propagated
throughout a replication topology.
F2. mysqlbinlog shall print the original_commit_timestamp as a commented SET
statement that will only be executed as a query by 5.8.0+ servers:
/*!50800 SET @@SESSION.original_commit_timestamp=XXX*/
F3. mysqlbinlog shall print the immediate_commit_timestamp as a comment in the
following format:
# immediate_commit_timestamp = N (YYYY-MM-DD HH:MM:SS.UUUUUU TIMEZONE)
The timezone shall be the system timezone.
F4. mysqlbinlog shall print the original_commit_timestamp in human-readable
form as a comment:
# original_commit_timestamp = YYYY-MM-DD HH:MM:SS.UUUUUU TIMEZONE
The timezone shall be the system timezone.
F5: Any timestamp recorded in gtid_log_event or written to the binary log
should be in UTC. In general, the code should always handle timestamps in UTC.
Any user-visible interface (mysqlbinlog / P_S) should report time in the system
timezone.
If GTID-MODE=OFF, we store the timestamps in anonymous_gtid_log_event.
F6: A slave will log only one warning if the original_commit_timestamp is
more recent than the immediate_commit_timestamp. When the timestamps are
back to normal, another warning should be logged indicating that the
timestamps are valid again.
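The comment formats required by F3 to F5 can be illustrated with a short Python sketch. It is not the mysqlbinlog implementation: the function names and the optional tz parameter are assumptions made for illustration. The stored value is treated as microseconds since the epoch in UTC (F5) and is converted to a display timezone only when printed (F3, F4).

```python
from datetime import datetime, timedelta, timezone, tzinfo
from typing import Optional

# Timestamps in the binary log are microseconds since the epoch, in UTC (F5).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def format_commit_timestamp(micros: int, tz: Optional[tzinfo] = None) -> str:
    """Render a commit timestamp as 'YYYY-MM-DD HH:MM:SS.UUUUUU TIMEZONE'.

    tz is a hypothetical parameter: None means the system timezone, which is
    what F3 and F4 ask for; passing an explicit tz makes output reproducible.
    """
    # Integer microsecond arithmetic avoids floating-point rounding.
    utc_time = EPOCH + timedelta(microseconds=micros)
    local_time = utc_time.astimezone(tz)
    return local_time.strftime("%Y-%m-%d %H:%M:%S.%f %Z")


def immediate_comment(micros: int, tz: Optional[tzinfo] = None) -> str:
    """Comment line in the F3 format (raw value plus human-readable form)."""
    return "# immediate_commit_timestamp = %d (%s)" % (
        micros, format_commit_timestamp(micros, tz))


def original_comment(micros: int, tz: Optional[tzinfo] = None) -> str:
    """Comment line in the F4 format (human-readable form only)."""
    return "# original_commit_timestamp = %s" % format_commit_timestamp(micros, tz)


if __name__ == "__main__":
    print(immediate_comment(1500000000123456, timezone.utc))
    print(original_comment(1500000000123456, timezone.utc))
```

With timezone.utc passed explicitly, the value 1500000000123456 is rendered as 2017-07-14 02:40:00.123456 UTC. Omitting tz uses the timezone of the machine running the script.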
NON-FUNCTIONAL REQUIREMENTS
---------------------------
NF1: The performance degradation resulting from this worklog shall be
negligible.
I1. A timestamp - the *immediate_commit_timestamp* - shall be generated whenever
a transaction is committed, and stored in the Gtid_log_event in the binary
log. For a replicated transaction, the immediate_commit_timestamp may differ
between the master's binary log and the slave's binary log.
I2. A timestamp - the *original_commit_timestamp* - shall be generated when a
transaction is committed on the first master, stored in the Gtid_log_event
in the binary log, and propagated to binary logs of slaves using the
@@session.original_commit_timestamp variable. For a replicated transaction,
the original_commit_timestamp must be identical in the master's binary log
and in all slaves' binary logs.
If the transaction was generated by a group member, the
original_commit_timestamp shall be generated when the transaction is ready
to be committed. This means that the original_commit_timestamp will never
be equal to the immediate_commit_timestamp, even on the original master,
but will still be the same on all servers to which the transaction is
replicated.
Transactions that consist of a View_change_log_event are the exception:
this event is created locally on each group member every time a member
joins and it is not replicated. In this case, the original_commit_timestamp
is set to zero as there is no single original master.
I3. If the two timestamps are equal (on the original master),
original_commit_timestamp shall not be stored explicitly in the binary log;
instead a single bit will indicate that the timestamp is not included.
This saves space.
I4. The default value of the original_commit_timestamp session variable (2^55)
can only be assigned to the session variable; it is never written to the
binary log. Its main purpose is to indicate, when the
original_commit_timestamp is computed, whether the value is already known or
still needs to be determined.
I5. The possible values for the original_commit_timestamp that can be stored in
a GTID/Anonymous_GTID event are:
- 0: the value is unknown;
- >0 && <2^55: the value is the transaction commit timestamp on the original
master.
I6. When a transaction is rolled back, we will restore the value of the
original_commit_timestamp variable to its default value, 2^55.
I7. The layout of the gtid_log_event now is as follows:
+------+--------+-------+-------+--------------+---------------+------------+
|unused|SID     |GNO    |lt_type|last_committed|sequence_number| timestamps*|
|1 byte|16 bytes|8 bytes|1 byte |8 bytes       |8 bytes        | 7/14 bytes |
+------+--------+-------+-------+--------------+---------------+------------+
* The worklog adds to the already existing gtid_log_event structure the
fields referred to as "timestamps" depicted above, which contain the commit
timestamps on the original and on the immediate master.
This is how the timestamps section is serialized to a memory buffer:
if original_commit_timestamp != immediate_commit_timestamp:
  +-7 bytes, high bit (1<<55) set----+-7 bytes--------------------+
  | immediate_commit_timestamp       | original_commit_timestamp  |
  +----------------------------------+----------------------------+
else:
  +-7 bytes, high bit (1<<55) cleared-+
  | immediate_commit_timestamp        |
  +-----------------------------------+
I8. If GTID-MODE=OFF, there is no gtid_log_event, so we store the timestamps
in anonymous_gtid_log_event.
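The serialization of the timestamps section described in I3 and I7 can be sketched in Python as follows. This is an illustration under stated assumptions, not the server code: the names int7store() and int7korr() are borrowed from this worklog, but their Python signatures, the little-endian byte order, and the helper names encode_timestamps() and decode_timestamps() are assumptions.

```python
from typing import Tuple

# Bit 55 is the highest bit of a 7-byte (56-bit) field. It flags whether a
# second 7-byte field holding original_commit_timestamp follows.
ORIGINAL_FOLLOWS_FLAG = 1 << 55


def int7store(value: int) -> bytes:
    """Write an integer below 2^56 as 7 bytes (little-endian is assumed)."""
    return value.to_bytes(7, "little")


def int7korr(buf: bytes, offset: int = 0) -> int:
    """Read a 7-byte integer from buf starting at offset."""
    return int.from_bytes(buf[offset:offset + 7], "little")


def encode_timestamps(immediate: int, original: int) -> bytes:
    """Serialize the timestamps section: 7 bytes if equal, 14 bytes if not."""
    for value in (immediate, original):
        if not 0 <= value < ORIGINAL_FOLLOWS_FLAG:
            raise ValueError("timestamp must fit in 55 bits")
    if original != immediate:
        # Flag set: the original timestamp is stored explicitly.
        return int7store(immediate | ORIGINAL_FOLLOWS_FLAG) + int7store(original)
    # Flag cleared: the original timestamp is implied by the immediate one.
    return int7store(immediate)


def decode_timestamps(buf: bytes) -> Tuple[int, int]:
    """Return (immediate_commit_timestamp, original_commit_timestamp)."""
    first = int7korr(buf)
    immediate = first & (ORIGINAL_FOLLOWS_FLAG - 1)
    if first & ORIGINAL_FOLLOWS_FLAG:
        original = int7korr(buf, 7)
    else:
        original = immediate
    return immediate, original
```

In this sketch an original_commit_timestamp of 0 (unknown, see I5) differs from any real immediate_commit_timestamp, so it always takes the 14-byte form.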
This worklog does the following:
L1) Add two fields, original_commit_timestamp and immediate_commit_timestamp, to
gtid_log_event. These are sent from master to slave, so they must be added
to the encoding/decoding functions. The commit time is recorded in
write_gtid(), so we take the commit timestamp of a transaction to be the
time at which it is written to the binary log.
L2) Add the has_commit_timestamps flag to the Gtid_event class. This flag
indicates whether the current gtid_log_event includes at least the
immediate_commit_timestamp. If the gtid_log_event does not include any
commit timestamp, it means that it was replicated from a server with a
MySQL version lower than 5.8 and has to be processed accordingly.
L3) Instead of writing 8 bytes in the binary log, we make an optimization and
write only 7 bytes for each commit timestamp. We also use the highest bit
(MSB) of the 7 bytes of the immediate_commit_timestamp as a flag that tells
whether the original_commit_timestamp is the same as the
immediate_commit_timestamp.
If the two timestamps are the same, we write only one of them in the binary log.
The encoding and decoding logic is modified to handle this.
Introduced int7korr() and int7store() to read and write 7 bytes from and to
memory, respectively.
L4) As defined in the functional requirements, we add the information regarding
the two timestamps in mysqlbinlog.
L5) Before this worklog, the gtid_log_event contained only a header and no body.
In this worklog, we introduce the body of gtid_log_event, which contains
L5.1) a single commit timestamp on the original master, because the original
and immediate commit timestamps written to the binlog are the same on
that server; or
L5.2) both the original and the immediate commit timestamp.
The reason we store the timestamps in the body section (and NOT the header)
is that their total length varies between cases (L5.1) and (L5.2).
L6) The heuristic to compute the original_commit_timestamp is the following:
1) When the original_commit_timestamp session variable is set to a value
other than UNDEFINED_COMMIT_TIMESTAMP (2^55), it means that either the
timestamp is known ( > 0 ) or the timestamp is not known ( == 0 ).
2) When original_commit_timestamp == UNDEFINED_COMMIT_TIMESTAMP, we assume
that:
a) if this thread is a slave applier, then the timestamp is unknown
( = 0 );
b) else, this is considered a new transaction, original on this server
( = immediate_commit_timestamp);
L7) Though not related to this worklog, we rename two methods:
- write_header_to_memory()
- write_data_header_to_memory()
+ write_common_header_to_memory()
+ write_post_header_to_memory()
This is done only to make the method names clearer.
L8) Testing:
Added mysql-test/suite/rpl/t/rpl_timestamps_line_topology.test
Added mysql-test/suite/rpl/t/rpl_timestamps_line_topology_cross_version.test
Added mysql-test/suite/sys_vars/t/original_commit_timestamp_basic.test
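The heuristic in L6 for computing the original_commit_timestamp can be sketched in Python as follows. UNDEFINED_COMMIT_TIMESTAMP and its value 2^55 come from this worklog; the function name and its parameters are hypothetical and only illustrate the decision order.

```python
# Default (sentinel) value of the session variable, as defined in F1 and L6.
UNDEFINED_COMMIT_TIMESTAMP = 1 << 55


def resolve_original_commit_timestamp(session_value: int,
                                      is_slave_applier: bool,
                                      immediate_commit_timestamp: int) -> int:
    """Return the original_commit_timestamp to store in the GTID event.

    Hypothetical helper: parameter names are assumptions for illustration.
    """
    if session_value != UNDEFINED_COMMIT_TIMESTAMP:
        # L6 case 1: the session variable was set explicitly, so the value
        # is either known (> 0) or explicitly unknown (== 0).
        return session_value
    if is_slave_applier:
        # L6 case 2a: a replicated transaction that arrived without a
        # timestamp, so the original commit time is unknown.
        return 0
    # L6 case 2b: a new transaction that originates on this server.
    return immediate_commit_timestamp
```

For example, a client transaction committed locally at time T with the session variable left at its default resolves to T, while the same default in a slave applier thread resolves to 0.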
Copyright (c) 2000, 2025, Oracle Corporation and/or its affiliates. All rights reserved.