WL#7319: Infrastructure for GTID based delayed replication and replication lag monitoring

Affects: Server-8.0   —   Status: Complete

Replication lag monitoring with the Seconds_Behind_Master field in SHOW SLAVE
STATUS does not work well in replication topologies other than simple
master-slave, for the following reasons:
 1. If there are multiple masters, e.g. in a circular or chained topology, then
    there is still only one field that measures the delay for the last executed
    transaction. It would be better to distinguish the delays for each master
    into separate fields (i.e., a table).
 2. In a chained replication setup, there is no way to measure the delay from the
    nearest master. This would be useful to measure the delay added by the
    intermediate servers.
 3. The delay is measured per-event, from the time the statement started to
    execute. It would be more truthful to measure the delay *per-transaction*,
    from the time the transaction *committed*.
 4. The timestamp stored in the binary log is relative to the original master's
    timezone. If servers have different timezones, this causes the delay to be
    offset by the time zone difference. In a simple master-slave setup, this is
    compensated because the slave knows the time difference with the nearest
    master. However, in a chained replication setup the time zone difference with
    the original master is not known to the slave, so the measured delay will
    be offset by the timezone difference between the nearest master and the
    original master.
 5. The current event timing information has only one-second precision, not
    microsecond precision. Microsecond precision would be especially useful
    for future improvements to the multi-threaded slave.

In this worklog, we create the infrastructure for correcting these problems. In
particular, we will introduce two new timestamps that are associated with each
transaction (not each event or statement) in the binary log:
- The original_commit_timestamp is the number of microseconds since the epoch
  at which the transaction was committed on the original master.
- The immediate_commit_timestamp is the number of microseconds since the epoch
  at which the transaction was committed on the immediate master.
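
To illustrate how the two timestamps address problems 1 to 3 above, here is a
minimal sketch of a lag computation. It is not taken from the server code; the
function names and the `applied_at` parameter are hypothetical, and the
convention that 0 means "unknown" follows the value definitions given later in
this worklog.

```python
import time


def now_micros() -> int:
    """Current time as microseconds since the Unix epoch (UTC)."""
    return time.time_ns() // 1_000


def replication_lags(original_commit_ts, immediate_commit_ts, applied_at):
    """Return (end_to_end_lag, last_hop_lag) in microseconds.

    end_to_end_lag is measured from the commit on the original master,
    last_hop_lag from the commit on the immediate master.
    An original timestamp of 0 is treated as unknown, giving None.
    """
    last_hop_lag = applied_at - immediate_commit_ts
    if original_commit_ts == 0:
        end_to_end_lag = None
    else:
        end_to_end_lag = applied_at - original_commit_ts
    return end_to_end_lag, last_hop_lag
```

A monitoring tool could call `replication_lags(o, i, now_micros())` once per
applied transaction; the difference between the two results is the delay added
by intermediate servers.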

F1. We will introduce a new SQL variable:
      NAME: original_commit_timestamp
      SCOPE: session
      TYPE: numeric
      RANGE: [0, 2^55]
      DEFAULT: 2^55
      DESCRIPTION: For internal use by replication. When re-executing a
        transaction on a slave, this is set to the time when the transaction was
        committed on the original master, measured in microseconds since
        the epoch. This allows the original commit timestamp to be propagated
        throughout a replication topology.

F2. mysqlbinlog shall print the original_commit_timestamp as a commented SET
    statement that will only be executed as a query by 5.8.0+ servers:
    /*!50800 SET @@SESSION.original_commit_timestamp=XXX*/

F3. mysqlbinlog shall print the immediate_commit_timestamp as a comment in the
    following format:
    # immediate_commit_timestamp = N (YYYY-MM-DD HH:MM:SS.UUUUUU TIMEZONE)
    The timezone shall be the system timezone.

F4. mysqlbinlog shall print the original_commit_timestamp in human-readable
    form as a comment:
    # original_commit_timestamp = YYYY-MM-DD HH:MM:SS.UUUUUU TIMEZONE
    The timezone shall be the system timezone.
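
The comment formats in F3 and F4 can be sketched as follows. This is only an
illustration of the layout, not mysqlbinlog's actual code: the helper names are
hypothetical, and the timezone abbreviation printed depends on the platform.

```python
from datetime import datetime, timezone


def format_commit_timestamp(micros, tz=None):
    """Render microseconds since the epoch as 'YYYY-MM-DD HH:MM:SS.UUUUUU TZ'.

    tz=None converts to the system timezone, as F3 and F4 require.
    """
    seconds, frac = divmod(micros, 1_000_000)
    dt = datetime.fromtimestamp(seconds, tz=timezone.utc).astimezone(tz)
    return "%s.%06d %s" % (dt.strftime("%Y-%m-%d %H:%M:%S"), frac, dt.tzname())


def immediate_comment(micros, tz=None):
    """F3 layout: numeric value followed by the human-readable form."""
    return "# immediate_commit_timestamp = %d (%s)" % (
        micros, format_commit_timestamp(micros, tz))


def original_comment(micros, tz=None):
    """F4 layout: human-readable form only."""
    return "# original_commit_timestamp = %s" % format_commit_timestamp(micros, tz)
```

Keeping the seconds and the microsecond fraction as integers avoids the
rounding that a floating-point conversion could introduce.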

F5: Any timestamp recorded in gtid_log_event or written to the binary log shall
    be in UTC. In general, the code should always handle timestamps in UTC.
    Any user-visible interface (mysqlbinlog / P_S) should report time in the
    system timezone.

    If GTID-MODE=OFF, we store the timestamp in anonymous_gtid_log_event.

F6: A slave will log only one warning if the original_commit_timestamp is
    more recent than the immediate_commit_timestamp. When the timestamps are
    back to normal, another warning should be logged indicating that the
    timestamps are valid again.
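
The "warn once, then report recovery" behaviour of F6 amounts to a two-state
machine. The sketch below assumes a hypothetical class name and message texts;
it only shows the state transitions, not how the server actually logs.

```python
class TimestampSanityMonitor:
    """Emit one warning when timestamps become invalid and one on recovery."""

    def __init__(self):
        self.invalid = False  # True while original > immediate was last seen

    def check(self, original_ts, immediate_ts):
        """Return a message on a state change, otherwise None."""
        is_invalid = original_ts > immediate_ts
        if is_invalid and not self.invalid:
            self.invalid = True
            return ("original_commit_timestamp is more recent than "
                    "immediate_commit_timestamp")
        if not is_invalid and self.invalid:
            self.invalid = False
            return "commit timestamps are valid again"
        return None
```

Repeated invalid transactions therefore produce a single warning, which keeps
the error log readable when clocks stay skewed for a long period.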

NF1: The performance degradation resulting from this worklog shall be

I1. A timestamp - the *immediate_commit_timestamp* - shall be generated whenever
    a transaction is committed, and stored in the Gtid_log_event in the binary
    log. For a replicated transaction, the immediate_commit_timestamp may differ
    between the master's binary log and the slave's binary log.

I2. A timestamp - the *original_commit_timestamp* - shall be generated when a
    transaction is committed on the first master, stored in the Gtid_log_event 
    in the binary log, and propagated to binary logs of slaves using the
    @@session.original_commit_timestamp variable. For a replicated transaction,
    the original_commit_timestamp must be identical in the master's binary log
    and in all slaves' binary logs.
    If the transaction was generated by a group member, the
    original_commit_timestamp shall be generated when the transaction is ready
    to be committed. This means that the original_commit_timestamp will never
    be equal to the immediate_commit_timestamp, even on the original master,
    but will still be the same on all servers to which the transaction is
    replicated.
    Transactions that consist of a View_change_log_event are the exception:
    this event is created locally on each group member every time a member
    joins, and it is not replicated. In this case, the original_commit_timestamp
    is set to zero, as there is no single original master.

I3. If the two timestamps are equal (on the original master),
    original_commit_timestamp shall not be stored explicitly in the binary log;
    instead a single bit will indicate that the timestamp is not included.
    This saves space.

I4. The default value of the original_commit_timestamp session variable (2^55)
    can only be assigned to the session variable; it is never written to the
    binary log. Its purpose is to act as a sentinel: when computing the
    original_commit_timestamp of a transaction, it indicates whether the value
    is already known or still needs to be determined.

I5. The possible values for the original_commit_timestamp that can be stored in
    a GTID/Anonymous_GTID event are:
    - 0: the value is unknown;
    - >0 && <2^55: the value is the transaction commit timestamp on the original
      master.

I6. When a transaction is rolled back, we will restore the value of the
    original_commit_timestamp variable to its default value, 2^55.

I7. The layout of the gtid_log_event now is as follows:

  |unused|SID     |GNO    |lt_type|last_committed|sequence_number| timestamps*|
  |1 byte|16 bytes|8 bytes|1 byte |8 bytes       |8 bytes        | 7/14 bytes |

  * The worklog adds to the already existing gtid_log_event structure the
    fields referred to as "timestamps" depicted above, which contain the commit
    timestamps on the original and on the immediate master.
    This is how the timestamps section is serialized to a memory buffer:

    if original_commit_timestamp != immediate_commit_timestamp:

       +-7 bytes, high bit (1<<55) set-----+-7 bytes-------------------+
       | immediate_commit_timestamp        | original_commit_timestamp |
       +-----------------------------------+---------------------------+

    else:

       +-7 bytes, high bit (1<<55) cleared-+
       | immediate_commit_timestamp        |
       +-----------------------------------+
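
The serialization of the timestamps section can be sketched as below. This is
an illustration under stated assumptions, not the server implementation: the
Python functions merely borrow the names int7store and int7korr, little-endian
byte order is assumed, and `encode_timestamps` / `decode_timestamps` are
hypothetical helpers.

```python
HAS_ORIGINAL_FLAG = 1 << 55          # MSB of the 7-byte immediate timestamp
TIMESTAMP_MASK = HAS_ORIGINAL_FLAG - 1


def int7store(value):
    """Write an integer as 7 bytes (little-endian assumed)."""
    return value.to_bytes(7, "little")


def int7korr(buf):
    """Read a 7-byte integer from the start of buf (little-endian assumed)."""
    return int.from_bytes(buf[:7], "little")


def encode_timestamps(original, immediate):
    """Return 7 bytes if both timestamps are equal, else 14 bytes."""
    for ts in (original, immediate):
        if not 0 <= ts < HAS_ORIGINAL_FLAG:
            raise ValueError("timestamp must fit in 55 bits")
    if original != immediate:
        return int7store(immediate | HAS_ORIGINAL_FLAG) + int7store(original)
    return int7store(immediate)


def decode_timestamps(buf):
    """Return (original, immediate) from a 7- or 14-byte buffer."""
    immediate = int7korr(buf)
    if immediate & HAS_ORIGINAL_FLAG:
        return int7korr(buf[7:14]), immediate & TIMESTAMP_MASK
    return immediate, immediate
```

Because valid timestamps are below 2^55, the top bit of the 56-bit field is
always free to carry the flag, so no extra byte is needed.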

I8. If GTID-MODE=OFF, there is no gtid_log_event, so we store the timestamps in
    anonymous_gtid_log_event.

This worklog does the following:

L1) Add two fields, original_commit_timestamp and immediate_commit_timestamp, to
    gtid_log_event. These are sent from master to slave, hence these fields
    must be added to the encoding/decoding functions. The commit time is
    recorded in write_gtid(), so we assume that the commit timestamp of a
    transaction is the time at which it is written to the binary log.
L2) Add the has_commit_timestamps flag to the Gtid_event class. This flag
    indicates whether the current gtid_log_event includes at least the
    immediate_commit_timestamp. If the gtid_log_event does not include any
    commit timestamp, it means that it was replicated from a server with a
    MySQL version lower than 5.8 and has to be processed accordingly.
L3) Instead of writing 8 bytes in the binary log, we make an optimization and
    write only 7 bytes for each commit timestamp. Also, we use the
    highest bit (MSB) of the 7 bytes of the immediate_commit_timestamp as a
    flag to detect whether the original_commit_timestamp is the same as
    the immediate_commit_timestamp.
    If the two timestamps are the same, we write only one of them in the
    binary log. The encoding and decoding logic is modified to handle this.
    Introduced int7korr() and int7store() to read and write 7 bytes to memory.
L4) As defined in the functional requirements, we add the information regarding
    the two timestamps in mysqlbinlog.
L5) Before this worklog, the gtid_log_event contains only a header and no body.
    In this worklog, we introduce the body of gtid_log_event, which contains
    L5.1) only one commit timestamp on the original master, because the
          original and immediate commit timestamps written to the binlog are
          the same at this server, or
    L5.2) both the original and the immediate commit timestamp.

    The reason we store the timestamps as part of the body section (and NOT
    the header) is that the length is variable, depending on points (L5.1)
    and (L5.2).
L6) The heuristic to compute the original_commit_timestamp is the following:
    1) When the original_commit_timestamp session variable is set to a value
       other than UNDEFINED_COMMIT_TIMESTAMP (2^55), that value is used as is:
       either the timestamp is known ( > 0 ) or it is not known ( == 0 ).
    2) When original_commit_timestamp == UNDEFINED_COMMIT_TIMESTAMP, we assume:
       a) if this thread is a slave applier, then the timestamp is unknown
          ( = 0 );
       b) else, this is considered a new transaction, original on this server
          ( = immediate_commit_timestamp).
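
The heuristic above can be written as a small decision function. This is a
sketch only: the function name and the `is_slave_applier` parameter are
hypothetical, while UNDEFINED_COMMIT_TIMESTAMP and its value come from the
text.

```python
UNDEFINED_COMMIT_TIMESTAMP = 1 << 55  # default of the session variable


def resolve_original_commit_timestamp(session_value, immediate_ts,
                                      is_slave_applier):
    """Pick the original_commit_timestamp to write for a transaction."""
    if session_value != UNDEFINED_COMMIT_TIMESTAMP:
        # Case 1: explicitly set; either known (> 0) or unknown (== 0).
        return session_value
    if is_slave_applier:
        # Case 2a: replicated from a server that sent no timestamp.
        return 0
    # Case 2b: the transaction originates on this server.
    return immediate_ts
```

In case 2b the two timestamps are equal, so only a single 7-byte value needs
to be written to the event.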
L7) Though not related to this worklog we make a slight change in the names
    of two methods:

    - write_header_to_memory()
    - write_data_header_to_memory()
    + write_common_header_to_memory()
    + write_post_header_to_memory()

    This is done only to make the method names clearer.

L8) Testing:
  Added mysql-test/suite/rpl/t/rpl_timestamps_line_topology.test
  Added mysql-test/suite/rpl/t/rpl_timestamps_line_topology_cross_version.test
  Added mysql-test/suite/sys_vars/t/original_commit_timestamp_basic.test