WL#9211: InnoDB: Clone Replication Coordinates

Affects: Server-8.0 — Status: Complete

EXECUTIVE SUMMARY

This worklog implements support for extracting, propagating and storing consistent replication positions during the process of cloning a server as depicted by WL#9209 and WL#9210. After this worklog is implemented the donor server shall extract and send over to the recipient server consistent replication positions. The recipient shall store and use them to start replicating from a consistent logical point in time with respect to the data it copied from the donor.

USER STORIES

As a MySQL DBA I want to make use of the clone command to simplify provisioning of new replicas, so that I replace my scripts that rely on external backup and restore tools and instead I rely on a couple of commands, e.g., CLONE + START SLAVE.
As a MySQL Group Replication developer, I want to make use of the clone infrastructure with correct replication positions to enhance distributed recovery, so that the group replication recovery phase can automatically decide whether to use incremental recovery based on binary logs or full state transfer based on clone.

Please refer to User Scenarios in HLS section for more details.

SCOPE

The clone operation is a physical copy of data stored in InnoDB. For additional details, please refer to WL#9209 and WL#9210. Within this context, this worklog addresses provisioning (extracting, transmitting and applying) both types of replication positions:

physical (filename, offset)
logical (global transaction identifiers)

This worklog does not address:

Cloning binary log files consistently across donor and recipient (which would be a way to setup replication positions as well).
Adjusting other replication related options after a clone takes place.
Any other provisioning activities.
Cloning servers across different versions. Therefore it has no applicability for upgrades.

The rest of the provisioning activities and details are covered on other worklogs listed in references section.

GLOSSARY

Donor: Server that is being cloned.
Recipient: Server that becomes a clone of Donor server.
GTID: global transaction identifier
GTID_EXECUTED: set of GTID that were handled by a given server.
PHY: physical position

REFERENCES

WL#11636 InnoDB: Clone Remote provisioning
WL#9210 InnoDB: Clone remote replica
WL#9209 InnoDB: Clone local replica

Functional

F-1. The replication positions MUST match data that is cloned.

F-1.1. GTID: If, in the donor, for data set 'D' there is a GTID_EXECUTED 'G', then if a recipient clones 'D' its GTID_EXECUTED shall be 'G' as well.

SELECT @@GLOBAL.GTID_EXECUTED;

F-1.2 PHY: If, in the donor, for data set 'D' there is a '(a,b)' PHY exposed in

SHOW MASTER STATUS;  // File, Position

then if a recipient clones 'D' it SHALL report the same '(a,b)' in

SELECT BINLOG_FILE, BINLOG_POSITION FROM PERFORMANCE_SCHEMA.CLONE_STATUS;

F-2. The recipient MUST be able to start replicating from the donor in the exact position in the replication stream, if the donor still has binary logs available for the recipient to resume replication.

 NOTE: If the donor does not have binary logs available then the
 standard replication handshake error SHALL be emitted. This is not
 a requirement for this worklog, it just means that there is no
 special handling for errors in that case. The normal exception
 operation SHALL take place.

F-3. GROUP REPLICATION: The recipient MUST be able to join a group replication cluster after being provisioned with CLONE, provided that by the time it has applied the snapshot and rejoins the cluster there are enough binary logs in the cluster for it to catch up. I.e., binary logs needed to catch up have not been purged from cluster.

F-4. SEMI-SYNC: The recipient MUST be able to setup semi-sync replication from the donor after being provisioned with CLONE, provided that by the time it has applied the snapshot and connects to the donor, the donor has enough binary logs for it to catch up i.e., binary logs needed to catch up have not been purged from the donor by then.

F-5. ASYNC: The recipient MUST be able to setup asynchronous replication from the donor after being provisioned with CLONE, provided that by the time it has applied the snapshot and connects to the donor, the donor has enough binary logs for it to catch up i.e., binary logs needed to catch up have not been purged from the donor by then.

F-6. If a recipient running with GTID_MODE = ON clones a donor with GTID_MODE != OFF, then CLONE SHALL set both PHY and GTID positions correctly on the recipient.

SELECT @@GLOBAL.GTID_EXECUTED;
SELECT BINLOG_FILE, BINLOG_POSITION FROM PERFORMANCE_SCHEMA.CLONE_STATUS;

F-7. For all innodb transactions with GTID, GTID MUST be durable along with the transaction i.e. in case of a server crash, the GTID set in GTID_EXECUTED table must have all the committed and XA prepared transactions in Innodb after recovery.

SELECT * from MYSQL.GTID_EXECUTED;

NOTE: This is an independent requirement and can be tested without
clone. It is implicit requirement of clone.

Non-functional

NF-1: [Donor Availability] Should not block donor for long time (< 1 second)

NF-2: Sys QA performance test with both binlog and GTID enabled should not show any regression. This should be part of usual system QA run but we stress it here to ensure no performance impact of GTID persistence.

User Interface

No external interfaces are changed.
CLONE SQL command from WL#11636 is used for cloning data
performance_schema.clone_status from WL#11636 is used to get PHY physical position after clone.
performance_schema.clone_status from WL#11636 is used to get GTID_EXECUTED
@@GLOBAL.GTID_EXECUTED can also be used to check GTID_EXECUTED immediately after clone.

User scenarios

US-1: provision a node and use in replication

Preconditions:

donor: a node in GR cluster or master/slave in other replication setup.
recipient: a node with running mysqld server to be added to the cluster.
user: has sufficient privilege for clone operation. For details refer to WL#11636:Clone Remote provisioning

[step-1]: Connect to recipient and execute CLONE SQL command.

mysql> CLONE INSTANCE FROM user@<host-donor>:<port-donor> identified by "xyz";

[step-2]: Wait for recipient server to be available after clone completed successfully.

[step-3]: Check "performance_schema.clone_status" to verify that clone recovery finished successfully.

mysql> SELECT state, error_no, error_message FROM performance_schema.clone_status;

STATE	ERROR_NO	ERROR_MESSAGE
Completed	0

The node is now ready with "gtid_executed" set up to whatever transactions are cloned.

mysql> SELECT @@GLOBAL.GTID_EXECUTED;

@@GLOBAL.GTID_EXECUTED
uuid:interval[:interval], ...

mysql> select gtid_executed from performance_schema.clone_status;

GTID_EXECUTED
uuid:interval[:interval], ...

US-1.1: Add the node to GR cluster.

[step-4]: Configure GR and add node to the cluster. Synchronizes based on GTID.

mysql> CHANGE MASTER TO MASTER_USER= <>, MASTER_PASSWORD=<>
       ...
       FOR CHANNEL 'group_replication_recovery';
mysql> START GROUP_REPLICATION;

US-1.2: Add as slave for ASYNC/SEMI-SYNC replication with GTID

[step-4]: Configure and start as slave.

mysql> CHANGE MASTER TO MASTER_HOST=<host-donr>, MASTER_PORT=<port-donor>,
       ...
       MASTER_AUTO_POSITION = 1;
mysql> START SLAVE USER='<>' PASSWORD='<>';

US-1.3: Add as slave for ASYNC/SEMI-SYNC with binary log position

[step-4] Get cloned binary log position.

mysql> SELECT BINLOG_FILE, BINLOG_POSITION
       FROM performance_schema.clone_status;

BINLOG_FILE	BINLOG_POSITION
binlog.xxxxxx	xxx

[step-5] Configure and start as slave using PHY binary log coordinates from previous step.

mysql> CHANGE MASTER TO MASTER_HOST=<host-donr>, MASTER_PORT=<port-donor>,
   ...
   MASTER_LOG_FILE = 'binlog.xxxxxx'
   MASTER_LOG_POS = xxx;

mysql> START SLAVE USER='<>' PASSWORD='<>';

Design Summary

Clone GTID:

Clone doesn't need to do any synchronization or blocking for getting consistent GTID.
GTIDs are now persistent within Innodb and guaranteed to be consistent with committed transaction.
- GTID is persisted in UNDO log header when a transaction with GTID is committed in Innodb. Same way GTID is persisted for external XA prepare, commit and rollback.
- GTIDs are persisted to mysql.gtid_executed table at intervals by a new innodb background thread "gtid_persistor".
- Purge background is not allowed to purge undo logs till GTIDs are written to mysql.gtid_executed by "gtid persistor".
- During recovery, the GTIDs are extracted from unpurged undo logs and send to gtid_persistor which persists them to gtid_executed table before replication GTID processing at startup.

Clone `PHY` binary log position:

This is relevant when binary log is enabled at donor.

Binary log file name and position are updated in Innodb TRX_SYS page at every transaction commit.
Clone forces ordered commit to ensure binary log order is same as SE order of transactions.
Clone waits for all unordered prepared transactions by binlog to finish.
Clone blocks external XA transaction prepare/commit/rollback operations at end of clone as this operations are unordered. This is done for very short time. It needs to be separately handled because binlog_order_commits doesn't ensure order for XA operations.
On recipient the binary log position PHY is extracted from Innodb TRX_SYS page and set in performance_schema.clone_status.

Security Context

No specific security aspects for this feature. The privilege needed for CLONE is covered in WL#11636.

Upgrade/Downgrade and Cross-Version Replication

One should be able to upgraded/downgrade a cloned instance same way as any database instance.
Clone is not supported across different mysql versions.
- Following error would be thrown as documented in WL#11636.
ER_CLONE_DONOR_VERSION "Clone Donor MySQL version: [V1] is different from Recipient MySQL version [V2]"
- Cannot be used during upgrade downgrade of a cluster.
- Cannot be used in cross version replication.
Undo format change is transparently handled in SE and has no impact on upgrade/downgrade. More details in LLD spec File Format section.
If GTID configuration @@gtid_mode (ON/OFF) is different between donor and recipient, it is allowed to perform clone. The effect is exactly same if one restarts a server after modifying the GTID configuration.

Deployment and Installation

There are some additional meta information added to Innodb system tablespace and undo tablespace. The changes have no effect on upgrade of existing server.
There is no change in redo log. MEB backup tool will be unaffected by the change.

Failure Model Specification

There is no special failure handling for this feature. WL#11636 would ensure that following scenarios are handled.
- recipient crash during clone: After server is restarted by managed service clone_status should be checked and clone needs to be re-issued.
- recipient crash during recovery: Recovery is idempotent. After server is restarted by managed service clone would complete.

Performance

GTID persistence impact should be insignificant on running server.
We ensure that GTID recovery skips unneeded rollback segments and undo logs and impact on recovery time is expected to be minimal.
Commit order for PHY: If binlog_order_commits is ON (current default), there is no impact. Otherwise, it is enforced for short time while clone is finishing i.e. the time it would take to commit the internally prepared transactions by binlog. It is expected to be very low as there is no blocking point after prepare.

BLOCK DIAGRAM

Module and Interface diagram.

@startuml
package "Server" {
() "Get GTID" as get_gtid
() "Save GTID" as save_gtid
[Session(THD)] -down- get_gtid
[Replication] -down- save_gtid
}
package "Innodb" {
() "Add GTID" as add_gtid
() "Get Oldest Transaction" as get_oldest
[Transaction] -up--> get_gtid
[GTID\npersistor] -up--> save_gtid
[GTID\npersistor] -right- add_gtid
[GTID\npersistor] -down- get_oldest
[Transaction] -left--> add_gtid
[undo] -up- Persist
[undo] -left- Remove
[Transaction]--> Persist
[Purge]
[Purge] -right--> Remove
[Purge] -up--> get_oldest
}
@enduml

Interface Specification

I-1: [Replication] Save GTID: Save a set of GTIDs in disk table mysql.gtid_executed. Called byGTID_persistor` to save GTIDs buffered in memory.

Gtid_table_persistor::save()

I-2: [GTID_persistor] Get GTID: Get transaction GTID information from session THD. Called by Transaction in Innodb before commit.

Clone_persist_gtid::get_gtid_info();

I-3: [GTID_persistor] Add GTID: Add GTID to persistor in memory buffer. Called by Transaction in Innodb after commit.

Clone_persist_gtid::add(const Gtid_desc id_desc);

I-4: [GTID_persistor] Get Oldest Transaction Get oldest transaction number for which GTID is not yet written to table. Called by Purge from time to time. It is safe to purge all older transaction undo data.

Clone_persist_gtid::get_oldest_trx_no()

I-5: [Undo] Persist: Persist GTID in undo log. Called by Transaction in Innodb during Commit.

trx_undo_gtid_write()

I-6: [Undo] Remove: Select Undo log of a transaction to be purged. Called by Purge while doing background cleanup.

trx_purge_choose_next_log()

File Format:

Format-1: Undo log header

In transaction undo log, one byte store for XID_EXISTS is changed to bit flag with last BIT storing XID_EXISTS. This we keep it backward compatible and no upgrade issue for old database. Second bit is used to store if GTID exists or not. We use 65 byte extra per transaction for GTID information

#define TRX_UNDO_FLAGS                               \
  20 /*! Transaction UNDO flags in one byte. This is \
     backward compatible as earlier we were storing  \
     either 1 or 0 for TRX_UNDO_XID_EXISTS. */
#define TRX_UNDO_FLAG_XID                    \
  0x01 /*!< TRUE if undo log header includes \
       X/Open XA transaction identification  \
       XID */
#define TRX_UNDO_FLAG_GTID                   \
  0x02 /*!< TRUE if undo log header includes \
       GTID information from replication */

/** GTID version offset */
#define TRX_UNDO_LOG_GTID_VERSION (TRX_UNDO_LOG_XA_HDR_SIZE)

/** GTID offset */
#define TRX_UNDO_LOG_GTID (TRX_UNDO_LOG_XA_HDR_SIZE + 1)

/** Total length of GTID */
#define TRX_UNDO_LOG_GTID_LEN 64

/** Total size of GTID information. */
#define TRX_UNDO_LOG_HDR_SIZE (TRX_UNDO_LOG_GTID + TRX_UNDO_LOG_GTID_LEN)

Format-2: Rollback segment header

We use free space after the slot array end. For old databases it is always zero. So, no upgrade issue. Adding this field will disallow this array to grow but that probably is not possible anyway as we don't keep any version field or array size in header. The only possible way looks to be adding a version in the empty space that follow and possibly add more slot tables if necessary.

#define TRX_RSEG_SLOT_END \
(TRX_RSEG_UNDO_SLOTS + (TRX_RSEG_SLOT_SIZE * TRX_RSEG_N_SLOTS))

/* Maximum transaction number ever added to this rollback 
segment history list. It is always increasing number over 
lifetime starting from zero. The size is 8 bytes. */
#define TRX_RSEG_MAX_TRX_NO TRX_RSEG_SLOT_END

Format-3: TRX_SYS header

We use 8 bytes after binary log position. For old database it is zero and we handle it as no GTID case.

/** Reserve next 8 bytes for transaction number up to which 
GTIDs are persisted to table */
#define TRX_SYS_TRX_NUM_GTID \
  (TRX_SYS_MYSQL_LOG_INFO + TRX_SYS_MYSQL_LOG_NAME + TRX_SYS_MYSQL_LOG_NAME_LEN)

#define TRX_SYS_TRX_NUM_END = (TRX_SYS_TRX_NUM_GTID + 8)

Key design points

We discuss key points of GTID persistence design here.

1. Transaction commit, XA prepare actions

We allocate "update undo" rollback segment slot. Insert undo segments are freed immediately at commit since they are not useful for mvcc. So, for insert only transactions we allocate one extra slot here.
When the transaction is serialized, we get Gtid_info from THD by Clone_persist_gtid::get_gtid_info() and persist the information in update undo log along with transaction end point trx_no.
Once the undo mtr is committed, while committing transaction in memory we add the Gtid_info to the GTID persistor by Clone_persist_gtid::add() under trx_sys_mutex. The persistor adds Gtid_info to in memory list (active list). It is important that we add the in memory information only after committing the undo mtr because we don't want GTID to be persisted before the transaction is committed.
We hold trx_sys_mutex for a bit longer here while removing transaction from serialization list. Need to verify that no significant degradation for standard test run when GTID is enabled.

2. gtid persistor background

The background thread from time to time switches the active list and persists accumulated GTID to gtid_executed table. The interval is currently 1 second or 1024 transactions, whichever is earlier. These thresholds are not exposed to user and thus not configurable. We might need to tune it in future internally. It is also possible to trigger it from other modules. Currently clone triggers it before ending operation.
It holds trx_sys mutex during the switch to block concurrent transaction adding GTIDs.
While switching list, under trx_sys_mutex, it also collects the oldest trx_no which is still active, i.e. whose GTID is not yet added to the active list. After writing GTIDs to the gtid_executed table Clone_persist_gtid::periodic_write(), it updates the oldest transaction number copy that it holds.

3. Purge background:

While constructing the oldest Read View, it consults GTID Persistor via Clone_persist_gtid::get_oldest_trx_no() and lowers the limit in its Read View if necessary. This way purge never purges any transaction which has not yet persisted its GTID to `gtid_executed. table.
This delay could cause more undo to get accumulated. Currently the persistor background thread is invoked both on time and number of transaction threshold and while it could mean more undo disk space should not affect performance.

4. Recovery optimization

The oldest transaction number below which all GTIDs are written to table is persisted in TRX_SYS page. TRX_SYS_TRX_NUM_GTID
The maximum transaction number ever added to a history list. TRX_RSEG_MAX_TRX_NO
During recovery we scan through the transaction undo headers pages in rollback segment history and add any GTID to gtid_persistor. We optimize it to avoid loading too many undo pages.
- Skip rollback segment if TRX_RSEG_MAX_TRX_NO is less than TRX_SYS_TRX_NUM_GTID.
- Within rollback segment, while scanning through history list, stop when the transaction trx_no falls below TRX_SYS_TRX_NUM_GTID.
- Note that the history list is always in order of trx_no.

Data structures/methods

Currently GTID persistor is owned by clone system in Innodb.

class Clone_Sys {
...
  /** @return GTID persistor */
  Clone_persist_gtid *get_gtid_persistor()
...
}

The persistor interface and implementation are in
  storage/innobase/include/clone0repl.h
  storage/innobase/src/clone0repl.cc

/** Serialized GTID information size */
static const size_t GTID_INFO_SIZE = 64;

/** GTID format version. */
static const uint32_t GTID_VERSION = 1;

/** Serialized GTID */ using Gtid_info = std::array;

/** Persist GTID along with transaction commit */
class Clone_persist_gtid {
  ...
  /** Get transaction GTID information */
  get_gtid_info(trx_t *trx, Gtid_desc >id_desc);

  /** Add GTID to in memory list.
  void add(const Gtid_desc >id_desc);

  /** Write GTIDs periodically to disk table. */
  void periodic_write();

  /** Get oldest transaction number for which GTID is not
  persisted to table. Transactions committed after this point
  should not be purged.
  @return oldest transaction number. */
  trx_id_t get_oldest_trx_no();
  ...
}

/* THD class methods for GTID persistence. */
Class THD {
  ...
    /* Defer freeing owned GTID (THD::owned_gtid) and SID (THD::owned_sid)
    till unpinned. Set by SE to consume GTID after binlog operation is over
    for external XA transactions. */
    void THD::pin_gtid()
    void THD::unpin_gtid()

    /* binlog reset operation. When set, SE flush log interface ensures that
    GTIDs are also flushed. */
    void THD::set_log_reset();
    void THD::clear_log_reset();
    bool THD::is_log_reset() const;

    /* SE sets it to guarantees GTID persistence */
    void THD::set_gtid_persisted_by_se();
    void THD::reset_gtid_persisted_by_se();
    bool THD::se_persists_gtid() const;

    /* Checked by SE to identify if XA 2 phase commit. */
    bool THD::is_one_phase_commit();
...
}

/* Replication class methods used by SE to write GTID to table. */

class Gtid_state {
    ...
    update_prev_gtids();
    save_gtids_of_last_binlog_into_table();
    ...
}

class Gtid_table_persistor {
    ...
    fetch_gtids();
    save();
    ...
}