WL#7846: MTS: slave-preserve-commit-order when log-slave-updates/binlog is disabled

Affects: Server-8.0 — Status: Complete

EXECUTIVE SUMMARY

This worklog implements applier commit order even in cases where the worker threads are not logging the changes to the replica's binary log (--log-slave-updates=off).

MOTIVATION

The replication applier applies the binary log in the order in which it was received and stored in the relay log. However, this only means that the transactions are scheduled to execute in the received order. They can still execute and commit asynchronously, creating a commit history on the replica that differs from the one on the master.

This is fine for DATABASE parallelization, as it is for COMMIT_ORDER. DATABASE parallelization assumes that transactions operating on different schemas are unrelated and non-contending. COMMIT_ORDER assumes that transactions are non-contending and committed concurrently in the original execution, hence the commit history is not relevant there. With WRITESET parallelization, however, this is not the case. Transactions can commit out of order, which can create execution histories that look distorted from the application's point of view. For example, a single-threaded workload on the master can be applied and committed in parallel on the replicas, which may lead to views of the data that differ from those on the master.

Note though that this does not break the eventual consistency guarantees that master-slave replication builds on. The consistency guarantees provided are the same: if updates stop, then the system converges.

Regardless, there are options to control the commit history on the replica. The behavior can be controlled by enabling the option slave_preserve_commit_order. This ensures that transactions applied by the replication worker threads, and that are logged to the replica's own binary log, are logged and committed in the same order as they appeared in the relay log. However, this option can only enforce commit order if the replica has the binary log turned on (--log-bin) and the applier threads are logging to it (--log-slave-updates). Thus, on binlogless replicas this option becomes a no-op. It also fails to order the transactions if the slave uses binlog filters, such as binlog-do-db and binlog-ignore-db.

In many cases, people want to use binlogless replicas, and they have no mechanism to ensure commit ordering in that case. This becomes even more problematic once they begin to use WRITESET parallelization.

So, to summarize, the main drivers for this are:
- binlogless replicas have no way to ensure commit ordering
- to make the most out of WRITESET parallelization, we need a way for users to deploy it while preserving commit histories, even if they use a binlogless replica. Once that is done, WRITESET may be the default dependency tracking mechanism on the master.

USER STORIES

As a MySQL DBA I want to turn on parallel replication on a replica that has no binary log enabled, without compromising monotonic writes on my session, to make replication faster.
As a MySQL DBA using parallel replication and ordered commits, I want to disable the binary log without compromising monotonic writes on my session, in order to save disk space.
As a MySQL DBA I want to use WRITESET dependency tracking on my master server, but not create holes in the execution history for each session.
As a MySQL DBA, I do not want to see gaps in the gtid_executed variable value, as it is easier to read without gaps and does not give the impression that a particular transaction got lost.

Functional Requirements:

FR1. The ordered commit (slave_preserve_commit_order) functionality SHALL work regardless of log-bin and log-slave-updates configuration.
FR2. The ordered commit (slave_preserve_commit_order=ON) MUST work even when binlog_transaction_dependency_tracking system variable is set to WRITESET dependency tracking on master and SHALL NOT create holes in the execution history of replicas.
FR3. The MySQL Applier threads SHALL cleanly exit (no fatal error) in case slave sql threads are stopped/killed.
FR4. Let us assume that T1 appears before T2 in the relay log; then T2 should not be externalized before T1. T2 may be externalized after T1, or both may be externalized atomically together.
FR5. Let us assume that T1 appears before T2 in the relay log; then T2's GTID should not be externalized (added to @@global.gtid_executed) before T1's GTID is externalized. In other words, either T1's GTID must be externalized before T2's GTID, or both must be externalized atomically together.
FR6. Let us assume that T1 appears before T2 in the relay log; then T2 should not persist before T1. T2 may persist after T1, or both may persist atomically together.
FR7. If a lock_wait_timeout occurs, the transaction SHALL be retried.
FR8. If the applier thread fails to successfully apply a transaction after the maximum number of retries configured, it SHALL roll back, and the subsequent transactions waiting for this one to commit SHALL be rolled back as well.

Non-Functional Requirements:

NF1. This worklog must have no impact on transaction execution performance, due to code changes made in it.
NF2. When binary logging is disabled (log_bin=OFF), running with ordered commit disabled (slave_preserve_commit_order=OFF) SHALL give the same or higher throughput than running with ordered commit enabled (slave_preserve_commit_order=ON). This is already the case for binary logging enabled (log_bin=ON), and SHALL remain so after this worklog.
NF3. When ordered commit is enabled (slave_preserve_commit_order=ON), running with binary logging disabled (log_bin=OFF) SHALL give the same or higher throughput than running with binary logging enabled (log_bin=ON). This is already the case for ordered commit disabled (slave_preserve_commit_order=OFF), and SHALL remain so after this worklog.

Non-Requirements / Limitations:

NR1. This worklog does not guarantee that transactions persist before getting externalized. So the following anomaly can occur:
1. Concurrent client observes that T1 has committed.
2. Server crashes and restarts.
3. Concurrent client observes that T1 has not committed.
NR2. This worklog does not handle transaction rollbacks on replicas. So the following anomaly can occur:
1. Rollbacks are not replicated to replicas, except when a transaction contains a mix of transactional and non-transactional statements and the master uses statement-based replication. If such a rollback gets replicated, it will cause the replica's transaction commit order to diverge from its relay log order.

High-Level Specification

1. INTRODUCTION

The slave_preserve_commit_order option ensures that transactions are externalized on the replica in the same order as they appear in the replica's relay log. This option is useful only when multiple threads (slave_parallel_workers > 1) are executing in parallel on the replica, since a single thread would read the relay log and commit transactions one at a time, in order. With slave_preserve_commit_order enabled, the executing thread waits until all previous transactions are committed before committing. However, this option can only enforce commit order if the replica has the binary log turned on (--log-bin=ON) and the applier threads are logging to it (--log-slave-updates=ON). The binary log group commit (BGC) feature, which works only when the binary log is enabled, improves performance by grouping several writes to the binary log and storage engine instead of writing them one by one. The implementation of slave_preserve_commit_order before WL#7846 was tightly integrated with BGC, so disabling the binary log disabled BGC and, in the process, slave_preserve_commit_order as well.

2. SLAVE-PRESERVE-COMMIT-ORDER FUNCTIONALITY - THE BASICS

Suppose we have two transactions T1 and T2, and assume that T1 appears before T2 in the relay log. The transactions T1 and T2 can begin to execute in parallel, but the executing thread waits before the commit until all previous transactions are committed. In this case T2 can execute before T1 but will not commit until T1 is committed.

A commit order queue is used to maintain the transaction order. The worker is registered/pushed into this commit order queue when the coordinator dispatches a transaction to the worker. The worker at the top of the queue is removed/popped and allowed to commit its transaction, while the rest of the workers wait for their turn to reach the top and commit.

The basic slave_preserve_commit_order functionality described above, as originally implemented, remains unchanged by this worklog.
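
The queue discipline described above can be sketched with a small single-threaded model. This is only an illustration under simplifying assumptions: the class CommitOrderModel and the helper commit_order_for() are hypothetical names, not part of the server, and real workers block on a condition variable instead of being driven by a loop.

```cpp
#include <cassert>
#include <deque>
#include <set>
#include <vector>

// Hypothetical model of the commit order queue; not the server's
// Commit_order_manager.
class CommitOrderModel {
 public:
  // Coordinator: register the worker when a transaction is dispatched to
  // it, i.e. in relay log order.
  void dispatch(int worker_id) { queue_.push_back(worker_id); }

  // Worker: the transaction is applied and ready to commit. Returns the
  // workers that commit as a result, in commit order.
  std::vector<int> ready_to_commit(int worker_id) {
    assert(ready_.count(worker_id) == 0);  // a worker reports only once
    ready_.insert(worker_id);
    std::vector<int> committed;
    // Only the worker at the top of the queue may commit. Its commit may
    // unblock later workers that are already waiting.
    while (!queue_.empty() && ready_.count(queue_.front()) > 0) {
      committed.push_back(queue_.front());
      ready_.erase(queue_.front());
      queue_.pop_front();
    }
    return committed;
  }

 private:
  std::deque<int> queue_;  // workers in relay log order
  std::set<int> ready_;    // workers waiting for their turn
};

// Replays a schedule: workers are dispatched in `dispatch_order` and become
// ready to commit in `ready_order`. Returns the resulting commit order.
std::vector<int> commit_order_for(const std::vector<int> &dispatch_order,
                                  const std::vector<int> &ready_order) {
  CommitOrderModel model;
  for (int worker : dispatch_order) model.dispatch(worker);
  std::vector<int> result;
  for (int worker : ready_order) {
    for (int committed : model.ready_to_commit(worker)) {
      result.push_back(committed);
    }
  }
  return result;
}
```

With dispatch order {1, 2, 3} and ready order {3, 1, 2}, the model commits in the order 1, 2, 3: worker 3 finishes applying first but still commits last.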

2.1. ORDER COMMIT DEADLOCK:

The transactions T1 and T2, where T1 appears before T2 in the relay log, can sometimes get stuck in the following deadlock while executing in parallel:
Transaction T2, which is waiting for transaction T1 to commit, is holding a lock that is required by transaction T1. The issue is resolved by rolling back transaction T2 and retrying it later.

The control flow for this deadlock between transactions T1 and T2 is shown below:

Worker1(T1)                                     Worker2(T2)
===========                                     =============
...                                             ...
                                                Engine acquires lock A
Engine acquires lock A.                         ...
1. found T2 is holding the lock.
2. check if it causes an order commit
   deadlock; if it does:
a. set deadlock flag of worker2.
b. awake worker2 waiting for T1 to commit
   first.
3. waiting for T2 to release lock A.
                                                COMMIT(waiting for T1 to
                                                commit first).
                                                a. awakes due to signal
                                                   from worker1.
                                                b. Found the deadlock
                                                   flag set by worker1
                                                   and then return
                                                   with ER_LOCK_DEADLOCK.

                                                Rollback the transaction
Get lock A and go ahead.
...
                                                Retry the transaction
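
The wake-up decision taken by the waiting worker in the flow above can be sketched as follows. The names WaitingWorker, WakeupAction, report_order_commit_deadlock() and on_wakeup() are hypothetical, and locking and signalling are omitted; the sketch only shows how the deadlock flag turns a wait into a rollback and retry.

```cpp
#include <cassert>

// What a worker waiting for its turn to commit does after it is awakened.
enum class WakeupAction { kKeepWaiting, kCommit, kRollbackAndRetry };

// Hypothetical per-worker state; the real server keeps this elsewhere.
struct WaitingWorker {
  bool at_top_of_queue = false;  // all earlier transactions have committed
  bool deadlock_flag = false;    // set by an earlier worker blocked on us
};

// Called on behalf of the earlier transaction (T1) when it finds that the
// later, waiting transaction (T2) holds a lock it needs.
void report_order_commit_deadlock(WaitingWorker &later_worker) {
  later_worker.deadlock_flag = true;
  // A real implementation would also signal the waiting worker here.
}

// Decision taken by the waiting worker (T2) each time it wakes up.
WakeupAction on_wakeup(WaitingWorker &self) {
  if (self.deadlock_flag) {
    self.deadlock_flag = false;  // consumed, so the retry starts clean
    return WakeupAction::kRollbackAndRetry;  // reported as ER_LOCK_DEADLOCK
  }
  return self.at_top_of_queue ? WakeupAction::kCommit
                              : WakeupAction::kKeepWaiting;
}

// Replays the scenario from the diagram and returns T2's action.
WakeupAction replay_deadlock_scenario() {
  WaitingWorker worker2;  // T2 waits for T1 to commit first
  assert(on_wakeup(worker2) == WakeupAction::kKeepWaiting);
  report_order_commit_deadlock(worker2);  // T1 needs lock A held by T2
  return on_wakeup(worker2);
}
```

After the rollback releases lock A, T1 can proceed, and the retried T2 waits again behind it.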

2.2. ORDER COMMIT ROLLBACK:

If a fatal error happens while executing a transaction, the worker notifies the following transactions to roll back by setting the rollback flag. The rollback flag applies to all the workers of the current channel. When a worker gets its turn to commit and finds the rollback flag set, it removes itself from the commit order queue and returns with the ER_SLAVE_WORKER_STOPPED_PREVIOUS_THD_ERROR error to roll back itself.
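
As a rough illustration of the rollback flag, the sketch below models one channel with a shared flag and a commit order queue. ChannelState, fail_fatally(), take_turn() and run_channel() are hypothetical names, and the model assumes that the failing worker unregisters when it is at the top of the queue.

```cpp
#include <cassert>
#include <deque>
#include <vector>

enum class TurnOutcome { kCommitted, kFailed, kStoppedByPreviousError };

// Hypothetical per-channel state shared by all workers of the channel.
struct ChannelState {
  std::deque<int> commit_order_queue;  // worker ids in relay log order
  bool rollback_flag = false;
};

// A worker hits a fatal error: it leaves the queue and tells every
// following worker of the channel to roll back.
void fail_fatally(ChannelState &channel, int worker_id) {
  assert(!channel.commit_order_queue.empty() &&
         channel.commit_order_queue.front() == worker_id);
  channel.commit_order_queue.pop_front();
  channel.rollback_flag = true;
}

// A worker gets its turn to commit. If the rollback flag is set, it only
// removes itself from the queue and reports the "previous thread" error.
TurnOutcome take_turn(ChannelState &channel, int worker_id) {
  assert(!channel.commit_order_queue.empty() &&
         channel.commit_order_queue.front() == worker_id);
  channel.commit_order_queue.pop_front();
  return channel.rollback_flag ? TurnOutcome::kStoppedByPreviousError
                               : TurnOutcome::kCommitted;
}

// Workers 1..worker_count are queued in order; `failing_worker` fails
// fatally. Returns the outcome of each worker, in queue order.
std::vector<TurnOutcome> run_channel(int worker_count, int failing_worker) {
  ChannelState channel;
  for (int id = 1; id <= worker_count; ++id) {
    channel.commit_order_queue.push_back(id);
  }
  std::vector<TurnOutcome> outcomes;
  for (int id = 1; id <= worker_count; ++id) {
    if (id == failing_worker) {
      fail_fatally(channel, id);
      outcomes.push_back(TurnOutcome::kFailed);
    } else {
      outcomes.push_back(take_turn(channel, id));
    }
  }
  return outcomes;
}
```

In this model, workers ahead of the failing one still commit, while every worker behind it stops without committing.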

2.3. ORDER COMMIT TRANSACTION STAGES:

enum class enum_transaction_stage { REGISTERED, WAITED, FINISHED };

Each worker executing transactions in commit order maintains an enum_transaction_stage value. The stages have the following meaning:

REGISTERED

The REGISTERED stage is set when the worker is not in the commit order queue, which means that either the worker has not been assigned a transaction or it is doing the last part of the clean-up after committing a transaction.
Initially, the stage of every worker is set to REGISTERED.
When a worker in the FINISHED stage is unregistered/removed from the commit order queue because its transaction has been processed (committed or rolled back), its stage is set back to REGISTERED.

WAITED

The WAITED stage is set when the worker is assigned a new transaction.
The worker then applies the transaction. If, during this time, all previous workers have committed, the worker's stage is changed to FINISHED.
If the worker is done applying the transaction and is ready to commit but finds that previous workers have not committed yet, it waits for them. When all previous workers have committed, the worker's stage is changed to FINISHED.

FINISHED

The FINISHED stage is set when the worker is done applying the transaction and is ready to commit, and all previous workers have committed.
When a worker is waiting for its turn in the WAITED stage, its stage is set to FINISHED in either of the following cases:
- the previous worker, after committing, awakes this worker, which then sets its stage to FINISHED.
- the transaction is signaled to roll back, and when its turn to commit comes it sets its stage to FINISHED.
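
The stage descriptions above amount to a three-state cycle, which can be sketched as below. Only the enum itself comes from this worklog; is_valid_stage_transition() and StageTracker are hypothetical helpers written for illustration.

```cpp
#include <cassert>

enum class enum_transaction_stage { REGISTERED, WAITED, FINISHED };

// Hypothetical helper: true if `from -> to` is one of the transitions
// described in the text.
bool is_valid_stage_transition(enum_transaction_stage from,
                               enum_transaction_stage to) {
  switch (from) {
    case enum_transaction_stage::REGISTERED:
      // The coordinator assigns a new transaction to the worker.
      return to == enum_transaction_stage::WAITED;
    case enum_transaction_stage::WAITED:
      // The worker's turn has come, or it was signalled to roll back.
      return to == enum_transaction_stage::FINISHED;
    case enum_transaction_stage::FINISHED:
      // The worker was removed from the commit order queue.
      return to == enum_transaction_stage::REGISTERED;
  }
  return false;
}

// Hypothetical per-worker holder that enforces the cycle.
class StageTracker {
 public:
  enum_transaction_stage stage() const { return stage_; }

  void advance(enum_transaction_stage next) {
    assert(is_valid_stage_transition(stage_, next));
    stage_ = next;
  }

 private:
  // Initially every worker is in the REGISTERED stage.
  enum_transaction_stage stage_ = enum_transaction_stage::REGISTERED;
};
```

A worker therefore always moves REGISTERED, WAITED, FINISHED and back to REGISTERED; it never skips a stage or moves backwards.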

3. SUMMARY OF THE APPROACH

3.1. POINTS TO INTERCEPT THE COMMIT

To implement ordered commit even when the binary log is turned off, the ordered commit is called from the storage commit handlers: ha_commit_low(), ha_xa_prepare(), Sql_cmd_xa_commit::process_external_xa_commit() and Sql_cmd_xa_rollback::process_external_xa_rollback().

The handler function ha_commit_low() is called when a transaction is committed to the storage engine, so all 1PC commits invoke ordered commit from ha_commit_low(). In the case of a 2PC commit, the second call to ordered commit is skipped.
The handler function ha_xa_prepare() is called when XA transactions are prepared. XA PREPARE is treated as a transaction from the replication perspective, with its own GTID, and so it also needs to be ordered on the slave.
XA ROLLBACK and XA COMMIT, when the xid corresponds to an external XA transaction (that is, a transaction generated outside the current session context), do not call ha_commit_low() like their internal XA transaction counterparts. To commit them in order, we invoke ordered commit in Sql_cmd_xa_commit::process_external_xa_commit() and Sql_cmd_xa_rollback::process_external_xa_rollback().
Rolled-back transactions do not need to be handled in ha_rollback_low(), as they are not replicated to the replica, except when a transaction contains a mix of transactional and non-transactional statements and statement-based replication is used. This is listed as limitation NR2 in section 'Non-Requirements / Limitations'.
The XA ROLLBACK transactions will get handled by ha_commit_low().

3.2. GROUP COMMIT:

When slave_preserve_commit_order is enabled, all calls to ha_commit_low() are serialized to maintain the same commit order as in the relay log, and with innodb_flush_log_at_trx_commit=1 the fsync calls are serialized too, i.e. transaction N always commits and fsyncs before transaction N+1. This defeats the optimization that normally applies with innodb_flush_log_at_trx_commit=1, where multiple fsync requests are grouped together to improve performance and avoid one flush per commit.

When the binary log is enabled, the binary log group commit feature introduced in MySQL 5.6 takes care of grouping multiple fsync requests. But since this worklog makes slave_preserve_commit_order work even when the binary log is disabled, this becomes an issue: one flush per commit increases the number of fsync calls, which are costly, and reduces server performance. So, to improve performance, this worklog reuses some functionality of binary log group commit.

The binary log group commit has three stages: FLUSH, SYNC and COMMIT. Every stage has its own queue and a leader. Every new transaction's thread is added to the queue, and the first thread in the queue is chosen as the leader. All threads except the leader wait until their transaction is committed. It is the leader's job to flush, sync and then commit for all other threads.

FLUSH STAGE: In the FLUSH stage, threads are added to the flush queue and the first thread becomes the leader. The leader flushes transactions to the log of the storage engine (in InnoDB, the redo log) and flushes the thread caches of all threads waiting in the flush queue to the binary log.

SYNC STAGE: In the SYNC stage, threads are added to the sync queue and the first thread becomes the leader. The leader syncs the binary log file to disk.

COMMIT STAGE: In the COMMIT stage, threads are added to the commit queue and the first thread becomes the leader. The leader commits all transactions in order. This stage is skipped if the commits do not need to be ordered, in which case each thread commits its own transaction.

For slave_preserve_commit_order we do not need the SYNC and COMMIT stages and need only part of the FLUSH stage. We want to avoid one flush per commit and decrease the number of fsync calls, i.e. we want to group multiple flushes of transactions to the log of the storage engine into a single flush. For that, we use the FLUSH stage functionality to group multiple threads waiting in the FLUSH stage queue and flush all the waiting threads' transactions to the log of the storage engine together.

Note: Introducing group commit functionality for commit order helps a server with binary logging disabled (log-bin=OFF) perform better than one with binary logging enabled (log-bin=ON), even when commit order is enabled (slave_preserve_commit_order=ON).
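
The leader/follower grouping used in the FLUSH stage can be sketched with a single-threaded model. FlushQueueModel and flush_calls_for_group() are hypothetical names, and a counter stands in for the actual storage engine log flush; the point is only that one leader flush covers the whole queue.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of the commit order flush queue.
class FlushQueueModel {
 public:
  // A committed but not yet flushed transaction joins the queue. Returns
  // true if the caller found the queue empty and so becomes the leader.
  bool enroll(int trx_id) {
    const bool is_leader = queue_.empty();
    queue_.push_back(trx_id);
    return is_leader;
  }

  // Leader only: take the whole queue and flush it with a single call.
  // Returns the number of transactions covered by that flush.
  std::size_t flush_as_leader() {
    assert(!queue_.empty());
    const std::size_t group_size = queue_.size();
    queue_.clear();  // the next transaction to enroll becomes a new leader
    ++flush_calls_;  // stands in for one storage engine log flush
    return group_size;
  }

  int flush_calls() const { return flush_calls_; }

 private:
  std::vector<int> queue_;
  int flush_calls_ = 0;
};

// Enrolls `trx_count` (at least 1) transactions before the leader flushes
// and returns the number of flush calls needed for all of them.
int flush_calls_for_group(int trx_count) {
  FlushQueueModel model;
  for (int trx = 0; trx < trx_count; ++trx) {
    const bool is_leader = model.enroll(trx);
    assert(is_leader == (trx == 0));  // only the first thread leads
    (void)is_leader;
  }
  const std::size_t flushed = model.flush_as_leader();
  assert(flushed == static_cast<std::size_t>(trx_count));
  (void)flushed;
  return model.flush_calls();
}
```

However many transactions enroll before the leader runs, the group costs one flush instead of one flush per commit.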

4. CONTROL FLOW GRAPH FOR SLAVE-PRESERVE-COMMIT-ORDER

The control flow for slave-preserve-commit-order is described below, where each point explains one node of the control flow graph:

4.1 Coordinator Thread Tasks:

C1. The workers are started for executing replication transactions in parallel. The system variable 'slave_parallel_workers' defines the number of worker threads to start.
C2. A Commit Order Queue of size equal to the number of worker threads is initialized, with the initial transaction stage set to REGISTERED for each worker thread.
C3. The Coordinator thread reads the next transaction in the relay log and assigns it to the least occupied worker. The worker is added to the Commit Order Queue and its transaction stage is changed to WAITED. The worker is signalled to wake up.

4.2 Worker Thread Tasks:

W1. The worker thread waits for a transaction to be assigned by the Coordinator thread.
W2. The worker thread (waiting in step W1) wakes up when signalled to execute a newly assigned transaction. It executes all events until it encounters the COMMIT event.
W3. If there is any error in transaction execution, go to step W4; else go to step W6.
W4. If the error is transient, the transaction can be safely rolled back, and the maximum number of retries has not been exceeded, then roll back and go back to W2 to retry the transaction; else go to step W5.
W5. If the transaction cannot be retried, stop this worker thread, signal the next worker whose turn it is to commit, change this worker's ordered commit status to REGISTERED, and set the rollback error so that other workers, when they reach step W8, also proceed to stop their worker threads.
W6. When it encounters the COMMIT event, it waits for its turn to commit, i.e. to reach the top of the commit order queue.
W7. If it is not its turn to commit and the worker has to wait, but a deadlock is found, go to step W3 so that the transaction can be rolled back and retried later.
Note: See section 2.1 (ORDER COMMIT DEADLOCK) above for more details about the deadlock.
W8. If it is its turn to commit, but some older worker thread has set the rollback error, then:

the transaction stage is set to FINISHED
the next worker waiting for its turn is signalled to proceed
the error ER_SLAVE_WORKER_STOPPED_PREVIOUS_THD_ERROR is set and this worker thread proceeds to stop.

W9. If it is its turn to commit, then it changes the transaction stage to FINISHED.
W10 i. If binlogging is disabled for this thread, then set HA_IGNORE_DURABILITY so that the transaction is not flushed to the storage engine immediately on commit.
W10 ii. else skip setting HA_IGNORE_DURABILITY and go to step W11.
W11. The transaction is committed but is not flushed if the durability property for the transaction is set to HA_IGNORE_DURABILITY.
W12. If there is an error in transaction execution or a deadlock is found, go to step W4; else go to step W13.
W13. The worker thread removes itself from the queue and signals the next transaction in the commit order queue. It also changes its own transaction stage to FINISHED.
W14. The worker is added to the commit order flush queue. The first worker added to the queue, also called the 'leader', is not blocked, while all the other threads, also called 'followers', wait for the leader to flush their transactions to the storage engine and externalize their GTIDs. While the flush stage is shared by both binlog and commit order threads, each kind adds its threads to its own flush queue (the commit order flush queue and the binlog flush queue). They share the flush stage so that a single leader can be chosen when both flush queues are non-empty and more transaction data can be flushed together.
If both the binlog and commit order flush queues are empty and the first worker thread is a commit order thread, then it is chosen as the leader and goes to step W15; otherwise, if it is a follower, it goes to step W17.
W15. If the leader, which is a commit order thread, finds that the binlog flush queue is not empty, it signals the first thread in the binlog flush queue to become the leader, becomes a follower itself and goes to step W17; otherwise it remains the commit order leader and goes to step W16.
W16. The commit order leader thread does the following tasks:
- fetches the entire commit order flush queue, which is the list of workers, and empties it
- flushes transactions to the storage engine in a group
- externalizes the GTIDs of all follower threads
- awakes all waiting follower threads
W17. The follower thread waits for a signal from the leader thread and, on getting the signal, goes to step W18.
W18. The worker thread switches back to step W1 and waits for an event from the Coordinator thread to execute the next transaction from the binary log.
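
The retry decision in steps W2 to W5 can be sketched as a small loop. This is a simplified illustration, not the server's code: FlakyTransaction, apply_with_retries() and run_worker() are hypothetical, and the rollback itself is reduced to a comment.

```cpp
#include <cassert>

enum class ApplyStatus { kOk, kTransientError, kFatalError };
enum class WorkerOutcome { kReadyToCommit, kStopped };

// Hypothetical transaction that fails transiently a fixed number of times
// (for example with a deadlock) before it applies cleanly.
class FlakyTransaction {
 public:
  explicit FlakyTransaction(unsigned transient_failures)
      : remaining_failures_(transient_failures) {}

  ApplyStatus apply_once() {
    if (remaining_failures_ > 0) {
      --remaining_failures_;
      return ApplyStatus::kTransientError;
    }
    return ApplyStatus::kOk;
  }

 private:
  unsigned remaining_failures_;
};

// Steps W2 to W5: apply the transaction and, on a transient error, roll
// back and retry until `max_retries` retries have been used.
WorkerOutcome apply_with_retries(FlakyTransaction &trx, unsigned max_retries) {
  unsigned retries = 0;
  for (;;) {
    const ApplyStatus status = trx.apply_once();  // W2
    if (status == ApplyStatus::kOk) return WorkerOutcome::kReadyToCommit;
    const bool can_retry =
        status == ApplyStatus::kTransientError && retries < max_retries;  // W4
    if (!can_retry) return WorkerOutcome::kStopped;  // W5
    ++retries;  // the rollback would happen here, then back to W2
    assert(retries <= max_retries);
  }
}

// Convenience wrapper: builds the transaction and runs the retry loop.
WorkerOutcome run_worker(unsigned transient_failures, unsigned max_retries) {
  FlakyTransaction trx(transient_failures);
  return apply_with_retries(trx, max_retries);
}
```

A transaction that fails twice succeeds with three retries allowed, but stops the worker when only one retry is allowed; in the latter case the worker would go on to set the rollback error for the workers behind it.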

4.3 Binary Log Thread Tasks:

For this control flow graph we are only interested in the FLUSH STAGE, as it is shared by binlog and commit order threads.

B1. The thread enters binlog ordered commit.
B2. The thread enters Stage#0: Commit Order Stage, which ensures that the commit order for transactions flushing to the binary log is exactly the order in which they were written to the relay log.
B3. The thread enters Stage#1: Binlog Flush Stage, where new threads enter and are flushed together by the leader (the first thread to enter the queue). The thread is added to the binlog flush queue. If the thread is not the leader (the first thread in the binlog flush queue), it waits for a signal from the leader thread.
B4. If the first thread is the leader but the commit order flush queue is not empty, it waits for a signal from the commit order leader thread to become the leader of both the binlog and commit order flush queues.
B5. The binlog leader thread does not block in the binlog flush queue; it returns and then does the following tasks:
1. fetches the entire binlog and commit order flush queues and empties them, so that the next batch has a new leader.
2. flushes logs to the storage engine.
3. externalizes GTIDs and unblocks the threads waiting in the commit order list.
B6. Commit the queued threads' transactions and add the GTIDs of the transactions to gtid_executed.
B7. Unblock all waiting threads of the binlog order list, as their transactions have been committed and their GTIDs added to gtid_executed.
1. Binlog followers receive the signal once the leader has committed their transactions, in case the COMMIT STAGE is not skipped.
2. In case the COMMIT STAGE is skipped, the leader does not commit for the binlog followers; they are signalled to wake up and each executes its transaction commit itself.

Interface Specification

Configuration/sysvar changes:

I-1: ER_DONT_SUPPORT_SLAVE_PRESERVE_COMMIT_ORDER:

ER_DONT_SUPPORT_SLAVE_PRESERVE_COMMIT_ORDER SHALL be raised only when slave-parallel-type = DATABASE and slave-preserve-commit-order is enabled. This is an existing error code and is not added in this worklog.

Modifications in source files:

handler.h/handler.cc

M1) ha_commit_low():

a. The call to Commit_order_manager::wait() is added before the storage engine commit handler function to make the transaction wait for its turn to commit. Commit_order_manager::wait() is called only when the condition below is true, which can be understood as follows:

When the binlog is enabled, this will be done from MYSQL_BIN_LOG::ordered_commit and should not be done here. Therefore, we have the condition thd->is_current_stmt_binlog_disabled().
This function is usually called once per statement, with all=false. We should not preserve the commit order when this function is called in that context. Therefore, we have the condition ending_trans(thd, all).
Statements such as ANALYZE/OPTIMIZE/REPAIR TABLE will call ha_commit_low multiple times with all=true from within mysql_admin_table, mysql_recreate_table, and handle_histogram_command. After returning to mysql_execute_command, it will call ha_commit_low a final time. It is only in this final call that we should preserve the commit order. Therefore, we set the flag thd->is_intermediate_commit_without_binlog while executing mysql_admin_table, mysql_recreate_table, and handle_histogram_command, clear it when returning from those functions, and check the flag here in ha_commit_low().
In all the above cases, we should make the current transaction fail early in case a previous transaction has rolled back. Therefore, we also invoke the commit order manager in case get_rollback_status returns true.

if ((!thd->is_intermediate_commit_without_binlog &&
     thd->is_current_stmt_binlog_disabled() && ending_trans(thd, all)) ||
    Commit_order_manager::get_rollback_status(thd)) {
  if (Commit_order_manager::wait(thd)) {
    error = 1;
    /*
      Remove applier thread from waiting in Commit Order Queue and
      allow next applier thread to be ordered.
    */
    Commit_order_manager::wait_and_finish(thd, error);
    goto err;
  }
  is_applier_wait_enabled = true;
}

b. After the commit, the thread is removed from the wait and the next thread is signalled, which serializes the commit calls. Commit_order_manager::wait_and_finish() is called only if Commit_order_manager::wait() was called for the thread before the commit.

 /*
   After ensuring externalization order for applier thread, remove it
   from waiting (Commit Order Queue) and allow next applier thread to
   be ordered.
 */
 if (is_applier_wait_enabled) {
   Commit_order_manager::wait_and_finish(thd, error);
 }

M2) ha_xa_prepare():

a. The call to Commit_order_manager::wait() is added before the storage engine prepare handler function to make the XA transaction wait for its turn to be in the prepared state. If an error occurs while executing Commit_order_manager::wait(), the transaction is rolled back and an error is returned from ha_xa_prepare().

 /* Ensure externalization order for applier threads. */
 if (Commit_order_manager::wait(thd)) {
   thd->commit_error = THD::CE_NONE;
   ha_rollback_trans(thd, true);
   error = 1;
   gtid_error = true;
   goto err;
 }

b. After the prepare, the thread is removed from the wait and the next thread is signalled, which serializes the prepare calls.

 /*
   After ensuring externalization order for applier thread, remove it
   from waiting (Commit Order Queue) and allow next applier thread to
   be ordered.
 */
 Commit_order_manager::wait_and_finish(thd, error);

Note: The function ha_prepare() has been renamed to ha_xa_prepare() as it deals with XA transactions only.

xa.cc

M3) ha_prepare() -> ha_xa_prepare()

The function ha_prepare() has been renamed to ha_xa_prepare(), so all calls to ha_prepare() in xa.cc are replaced with ha_xa_prepare().

M4) Sql_cmd_xa_commit::process_external_xa_commit() and Sql_cmd_xa_rollback::process_external_xa_rollback():

a. The thread acquires metadata locks during process_external_xa_commit() and process_external_xa_rollback(), so we create a savepoint before calling Commit_order_manager::wait() and roll back to the savepoint if there is an error.

 /*
   Metadata locks taken during XA COMMIT should be released when
   there is error in commit order execution, so we take a savepoint
   and rollback to it in case of error.
 */
 MDL_savepoint mdl_savepoint = thd->mdl_context.mdl_savepoint();

b. The call to Commit_order_manager::wait() is added before storage engine commit/rollback handler function i.e. ha_commit_or_rollback_by_xid() to make the transaction wait for its turn to commit/rollback.

 /* Ensure externalization order for applier threads. */
 if (Commit_order_manager::wait(thd)) {
   /*
     In case of error, wait_and_finish() checks if transaction can be
     retried and if it can be retried then only it remove itself from
     waiting (Commit Order Queue) and allow next applier thread to be
     ordered.
   */
   Commit_order_manager::wait_and_finish(thd, true);

   need_clear_owned_gtid = true;
   gtid_error = true;
   gtid_state_commit_or_rollback(thd, need_clear_owned_gtid, !gtid_error);
   thd->mdl_context.rollback_to_savepoint(mdl_savepoint);
   return true;
 }

c. After the commit/rollback, the thread is removed from the wait and the next thread is signalled, which serializes the commit/rollback calls.

 /*
   After ensuring externalization order for applier thread, remove it
   from waiting (Commit Order Queue) and allow next applier thread to
   be ordered.
 */
 Commit_order_manager::wait_and_finish(thd, res);

rpl_slave_commit_order_manager.h/rpl_slave_commit_order_manager.cc

M5. enum class enum_transaction_stage

enum class enum_transaction_stage { REGISTERED, WAITED, FINISHED };

The older enumerators are replaced with new names as follows:
QUEUED -> REGISTERED
WAITED -> WAITED
DONE -> FINISHED

M6. Commit_order_manager::wait():

This function serializes commit/prepare calls by making the caller wait for its turn.

Algorithm:

The worker enters the commit order queue (wait state) if its ordered commit status is WAITED.
If it does not enter the commit order queue (wait state) and the worker executing the transaction is signalled to roll back, then the function wait_for_its_turn() returns with an error (return value = true).
The worker waits in the commit order queue (wait state) with status WAITED as long as:
- the worker is not at the front of the commit order queue, and
- no deadlock is found, and
- no error has occurred while executing the transaction and it is not marked to roll back.
If a deadlock is found or the worker executing the transaction is signalled to roll back, an error (return value = true) is returned.
If the worker reaches the front of the commit order queue, its ordered commit status is set to FINISHED.
If binlogging is disabled for the thread and the transaction is not signalled to roll back, then durability_property is set to HA_IGNORE_DURABILITY so that the transaction is not flushed at the commit stage and multiple transactions can be flushed together later in a group.

Return value: The function wait_for_its_turn() returns the following values in the cases listed below.

It SHALL return true:

whenever a deadlock is found.
whenever while executing transaction some error has occurred and its marked to rollback.
whenever the previous worker has found some error and has signalled all the workers to rollback.

It SHALL return false:

when the worker is not signalled to roll back, and,
whenever a worker is not registered for waiting i.e. worker status is not REGISTERED, or
whenever a worker has already waited i.e. it waited and reached top of the commit order queue and status has been changed from WAITED to FINISHED, or,

  /**
    Wait for its turn to commit or unregister.

    @param[in] worker The worker which is executing the transaction.

    @return
      @retval false  All previous transactions succeed, so this transaction can
                     go ahead and commit.
      @retval true   One or more previous transactions rollback, so this
                     transaction should rollback.
  */
  bool wait(Slave_worker *worker);

M7) Commit_order_manager::wait_and_finish():

This function unregisters/removes the transaction from the commit order queue.

Algorithm: Check 4.2 Worker Thread Tasks step W12 to W20 in hls.

/**
  Wait for its turn to unregister and signal the next one to go ahead. In
  case error happens while processing transaction, notify the following
  transaction to rollback.

  @param[in] thd    The THD object of current thread.
  @param[in] error  If true failure in transaction execution
*/
static void wait_and_finish(THD *thd, bool error);
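
How the error argument folds the older report_commit() and report_rollback() into one call can be illustrated with a single-threaded model. This is only a sketch under stated assumptions: MiniCommitOrder and its members are hypothetical, the real function returns void and blocks until its turn, and here the caller is simply required to be at the front of the queue.

```cpp
#include <cassert>
#include <deque>

// Hypothetical single-threaded model; real workers block on a condition
// variable instead of asserting that they are at the front of the queue.
class MiniCommitOrder {
 public:
  void register_trx(int worker_id) { m_queue.push_back(worker_id); }

  // Models wait_and_finish(): returns true if this transaction has to
  // roll back, either because of its own error or an earlier one.
  bool wait_and_finish(int worker_id, bool error) {
    assert(!m_queue.empty() && m_queue.front() == worker_id);
    const bool must_rollback = error || m_rollback;
    if (error) m_rollback = true;  // notify the following transactions
    m_queue.pop_front();           // unregister and let the next one proceed
    return must_rollback;
  }

  bool get_rollback_status() const { return m_rollback; }

 private:
  std::deque<int> m_queue;
  bool m_rollback = false;
};
```

With error=false the call behaves like the old commit report, and with error=true it additionally raises the rollback status seen by the workers queued behind it.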

M8. Commit_order_manager::finish_one():

This function is called to unregister the worker from the commit order queue and signal the next worker to awake.

/**
  Unregister the thread from the commit order queue and signal
  the next thread to awake.

  @param[in] worker     The worker which is executing the transaction.
*/
void finish_one(Slave_worker *worker);

M9. Commit_order_manager::flush_engine_and_signal_threads():

This function is called to flush transactions as a group. It adds the transaction thread to the commit order flush queue; the first transaction in the queue, i.e. the leader, flushes for all the queued threads together and then wakes them from their wait.

Algorithm: Check 4.2 step W17 to W21 for more details.

/**
  Flush record of transactions for all the waiting threads and then
  awake them from their wait. It also calls
  gtid_state->update_commit_group() which updates both the THD and the
  Gtid_state for the whole commit group to reflect that the set of
  transactions has ended.

  @param[in] worker  The worker which is executing the transaction.
*/
void flush_engine_and_signal_threads(Slave_worker *worker);
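
The leader/follower idea behind the group flush can be sketched without threads. The class MiniFlushQueue below is hypothetical and only counts flushes; it assumes that the first thread to enroll in an empty queue becomes the leader, as described above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of a commit order flush queue.
class MiniFlushQueue {
 public:
  // Returns true if the caller became the leader (the queue was empty).
  bool enroll(int thd_id) {
    const bool leader = m_queue.empty();
    m_queue.push_back(thd_id);
    return leader;
  }

  // Leader only: one engine flush covers every queued transaction. The
  // returned list stands for the threads that are woken up afterwards.
  std::vector<int> flush_and_signal() {
    ++m_engine_flushes;
    std::vector<int> woken;
    woken.swap(m_queue);
    return woken;
  }

  int engine_flushes() const { return m_engine_flushes; }

 private:
  std::vector<int> m_queue;
  int m_engine_flushes = 0;
};
```

The point of the design is visible in the counter: three enrolled transactions cost a single flush instead of three.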

M10. Commit_order_manager::finish():

It processes the transaction thread only when the transaction stage is WAITED.

Algorithm: Check 4.2 step W15 for more details.

/**
  Unregister the transaction from the commit order queue and signal the next
  one to go ahead.

  @param[in] worker     The worker which is executing the transaction.
*/
void finish(Slave_worker *worker);

M11. Commit_order_manager::get_rollback_status():

/**
  Get rollback status.

  @return
    @retval true   Transactions in the queue should rollback.
    @retval false  Transactions in the queue shouldn't rollback.
*/
bool get_rollback_status();

M12. Commit_order_manager::set_rollback_status():

/**
  Set rollback status to true.
*/
void set_rollback_status();

M13. Commit_order_manager::unset_rollback_status():

/**
  Unset rollback status to false.
*/
void unset_rollback_status();

static void check_and_report_deadlock(THD *thd_self, THD *thd_wait_for);

M14. Commit_order_manager::wait():

 This is a wrapper over the internal wait() function with the same name
 but a THD argument instead of a worker argument.

/**
  Wait for its turn to commit or unregister.

  @param[in] thd  The THD object of current thread.

  @return
    @retval false  All previous transactions succeed, so this transaction can
                   go ahead and commit.
    @retval true   The transaction is marked to rollback.
*/
static bool wait(THD *thd);

M15. Commit_order_manager::get_rollback_status():

 This is a wrapper over the internal get_rollback_status() function with
 the same name but a THD argument.

/**
  Get transaction rollback status.

  @param[in] thd    The THD object of current thread.

  @return
    @retval true   Current transaction should rollback.
    @retval false  Current transaction shouldn't rollback.
*/
static bool get_rollback_status(THD *thd);

M16. Commit_order_manager::finish_one():

 This is a wrapper over the internal finish_one() function with the same
 name but a THD argument instead of a worker argument.

/**
  Unregister the thread from the commit order queue and signal
  the next thread to awake.

  @param[in] thd    The THD object of current thread.
*/
static void finish_one(THD *thd);

M16. Commit_order_manager::assert_mutex_owner():

 This function asserts that m_mutex is locked by caller.

/**
   Assert that the mutex is held.
*/
void assert_mutex_owner();

M17. Commit_order_manager::lock_mutex():

 This function locks m_mutex mutex.

/**
   Lock mutex
*/
void lock_mutex();

M18. Commit_order_manager::unlock_mutex():

 This function releases the lock on the m_mutex mutex.

/**
   Unlock mutex
*/
void unlock_mutex();
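
The three helpers above (assert_mutex_owner(), lock_mutex(), unlock_mutex()) follow a common pattern: a mutex that remembers its owner so that internal functions can assert that the caller holds it. A minimal sketch of that pattern with standard C++ primitives is given below; the class OwnedMutex and the helper is_owner() are hypothetical, and the server uses its own mysql_mutex_t facilities rather than std::mutex.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical owner-tracking mutex, for illustration only.
class OwnedMutex {
 public:
  void lock_mutex() {
    m_mutex.lock();
    m_owner.store(std::this_thread::get_id());
  }

  void unlock_mutex() {
    m_owner.store(std::thread::id());  // reset to "no thread"
    m_mutex.unlock();
  }

  bool is_owner() const {
    return m_owner.load() == std::this_thread::get_id();
  }

  // Mirrors the intent of assert_mutex_owner(): fail fast in debug builds
  // when a function that requires the lock is called without it.
  void assert_mutex_owner() const { assert(is_owner()); }

 private:
  std::mutex m_mutex;
  std::atomic<std::thread::id> m_owner{std::thread::id()};
};
```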

M19.

The following functions have been removed or replaced:

wait_for_its_turn() is replaced by wait().
unregister_trx() is removed.
report_rollback() and report_commit() are replaced by wait_and_finish(). The error argument of wait_and_finish() now selects between the behavior of the older report_commit() and report_rollback().
commit_order_manager_check_deadlock() is replaced by Commit_order_manager::check_and_report_deadlock(), which is internal to the class Commit_order_manager.

M20. has_commit_order_manager():

 The has_commit_order_manager function is moved from binlog.cc to
 rpl_slave_commit_order_manager.cc.

binlog.cc

M21. MYSQL_BIN_LOG::MYSQL_BIN_LOG(uint *sync_period, bool relay_log):

A new boolean argument relay_log is added to differentiate between binlog and relay log objects. The distinction is needed because the Commit_stage_manager class is used only for the binlog.

M22. MYSQL_BIN_LOG::init_thd_variables():

It is used to initialize the thread variables used in MYSQL_BIN_LOG::ordered_commit().

/**
  Set thread variables used while flushing a transaction.

  @param[in] thd  thread whose variables need to be set
  @param[in] all   This is @c true if this is a real transaction commit, and
               @c false otherwise.
  @param[in] skip_commit
               This is @c true if the call to @c ha_commit_low should
               be skipped (it is handled by the caller somehow) and @c
               false otherwise (the normal case).
*/
void init_thd_variables(THD *thd, bool all, bool skip_commit);

M23. MYSQL_BIN_LOG::fetch_and_process_flush_stage_queue():

Fetches and empties the flush queues, flushes the transactions of all the queued threads, and unblocks all commit order threads.

/**
  Fetch and empty BINLOG_FLUSH_STAGE and COMMIT_ORDER_FLUSH_STAGE flush
  queues and flush transactions to the disk, and unblock threads executing
  slave preserve commit order.

  @param[in] check_and_skip_flush_logs
               if false then flush prepared records of transactions to the
               log of storage engine.
               if true then flush prepared records of transactions to the
               log of storage engine only if COMMIT_ORDER_FLUSH_STAGE queue
               is non-empty.

  @return Pointer to the first session of the BINLOG_FLUSH_STAGE stage
          queue.
*/
THD *fetch_and_process_flush_stage_queue(
    const bool check_and_skip_flush_logs = false);
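
The effect of check_and_skip_flush_logs, as documented in the comment above, reduces to a two-input decision. The helper below is a hypothetical restatement of that rule and not part of MYSQL_BIN_LOG; the second parameter stands for "the COMMIT_ORDER_FLUSH_STAGE queue is non-empty".

```cpp
#include <cassert>

// Hypothetical helper: decides whether the prepared records are flushed
// to the storage engine log, following the documented parameter semantics.
inline bool should_flush_engine_logs(bool check_and_skip_flush_logs,
                                     bool commit_order_queue_non_empty) {
  // false: always flush.
  if (!check_and_skip_flush_logs) return true;
  // true: flush only when commit order threads are waiting for it.
  return commit_order_queue_non_empty;
}
```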

M24. MYSQL_BIN_LOG::ordered_commit():

The call to wait_for_its_turn() is substituted with Commit_order_manager::wait() and a new Stage #0 is created for it.

Stage #0: ensure commit order for transactions flushing to binary log

The new Stage #0 does not need a separate StageID, as Commit_order_manager maintains its own queue and its own order for the commit.

M25. MYSQL_BIN_LOG::gtid_end_transaction:

update_on_commit() is called only for threads for which slave-preserve-commit-order is disabled, because the GTID is externalized as a group for threads for which slave-preserve-commit-order is enabled.

M26. The has_commit_order_manager function is moved from binlog.cc to rpl_slave_commit_order_manager.cc.

rpl_slave.cc:

M27. handle_slave_sql:

The following changes are done in handle_slave_sql():

remove the condition that prevented the system variable slave_preserve_commit_order from being enabled unless log-bin and log-slave-updates are set.
remove the error ER_DONT_SUPPORT_SLAVE_PRESERVE_COMMIT_ORDER, which was raised when LOGICAL_CLOCK is used for a multithreaded slave and neither log-bin nor log-slave-updates is set.

rpl_rli.cc:

M28. Relay_log_info::Relay_log_info()

The initialization of the MYSQL_BIN_LOG relay_log object passes the argument relay_log as true, so that the initialization and destruction of the Commit_stage_manager object can be skipped.

rpl_commit_stage_manager.h/rpl_commit_stage_manager.cc

The Stage_manager class is moved from binlog.cc/binlog.h to the new files rpl_commit_stage_manager.h/rpl_commit_stage_manager.cc and renamed to Commit_stage_manager, because the flush queue is now needed and shared by both binlog and commit order. The Commit_stage_manager class maintains the commit stages for group commit.

M29. Commit_stage_manager::Mutex_queue::init():

The mutexes are now initialized in Commit_stage_manager::init() instead of Commit_stage_manager::Mutex_queue::init(). So, instead of the PSI_mutex_key variable passed earlier, an already initialized mutex is now passed to the function.

 void init(mysql_mutex_t *lock);

M30. Commit_stage_manager::get_leader():

Returns first member of the queue.

 /**
    Fetch the first thread of the queue.

   @return first thread of the queue.
 */
 THD *get_leader() { return m_first; }

M31. Commit_stage_manager::m_is_initialized:

The new variable is used to check whether the class is already initialized.

 /** check if Commit_stage_manager variables already initialized. */
 bool m_is_initialized;

M32. Commit_stage_manager::fetch_and_empty_acquire_lock():

A wrapper over fetch_and_empty() which directs it to acquire queue lock before fetching and emptying the queue threads.

 /**
   Fetch the entire queue for a stage. It is a wrapper over
   fetch_and_empty() and acquires queue lock before fetching
   and emptying the queue threads.

   @return  Pointer to the first session of the queue.
 */
 THD *fetch_and_empty_acquire_lock();

M33. Commit_stage_manager::fetch_and_empty_skip_acquire_lock():

A wrapper over fetch_and_empty() which directs it to skip acquiring queue lock before fetching and emptying the queue threads.

 /**
   Fetch the entire queue for a stage. It is a wrapper over
   fetch_and_empty(). The caller must acquire queue lock before
   calling this function.

   @return  Pointer to the first session of the queue.
 */
 THD *fetch_and_empty_skip_acquire_lock();

M34. Commit_stage_manager::fetch_and_empty:

The mutex lock and unlock calls are now removed; it is now the caller's responsibility to acquire the lock.
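
The split between a lock-acquiring wrapper and a caller-locked wrapper around one private fetch_and_empty() can be sketched as follows. MiniMutexQueue is a hypothetical stand-in that stores integer thread ids in a std::vector; the real Mutex_queue links THD objects and uses mysql_mutex_t.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical queue illustrating the two wrapper flavours.
class MiniMutexQueue {
 public:
  void lock() { m_lock.lock(); }
  void unlock() { m_lock.unlock(); }

  void append(int thd_id) {
    std::lock_guard<std::mutex> guard(m_lock);
    m_items.push_back(thd_id);
  }

  // Acquires the queue lock itself, then fetches and empties the queue.
  std::vector<int> fetch_and_empty_acquire_lock() {
    std::lock_guard<std::mutex> guard(m_lock);
    return fetch_and_empty();
  }

  // The caller must already hold the queue lock.
  std::vector<int> fetch_and_empty_skip_acquire_lock() {
    return fetch_and_empty();
  }

 private:
  // No locking here: both wrappers decide who takes the lock.
  std::vector<int> fetch_and_empty() {
    std::vector<int> out;
    out.swap(m_items);
    return out;
  }

  std::mutex m_lock;
  std::vector<int> m_items;
};
```

The skip variant is useful when the caller needs to do more work under the same lock, for example to inspect two queues that share one mutex before emptying them.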

M35. Commit_stage_manager::StageID:

The FLUSH_STAGE is now replaced by BINLOG_FLUSH_STAGE and COMMIT_ORDER_FLUSH_STAGE.

  /**
     Constants for queues for different stages.
   */
  enum StageID {
    BINLOG_FLUSH_STAGE,
    SYNC_STAGE,
    COMMIT_STAGE,
    COMMIT_ORDER_FLUSH_STAGE,
    STAGE_COUNTER
  };

M36. Commit_stage_manager::fetch_queue_acquire_lock():

It is a wrapper over the private fetch_and_empty_acquire_lock().

  /**
    Fetch the entire queue and empty it. It acquires queue lock before fetching
    and emptying the queue threads.

    @param[in]  stage             Stage identifier for the queue to append to.

    @return Pointer to the first session of the queue.
  */
  THD *fetch_queue_acquire_lock(StageID stage);

M37. Commit_stage_manager::fetch_queue_skip_acquire_lock():

It is a wrapper over the private fetch_and_empty_skip_acquire_lock().

  /**
    Fetch the entire queue and empty it. The caller must acquire queue lock
    before calling this function.

    @param[in]  stage             Stage identifier for the queue to append to.

    @return Pointer to the first session of the queue.
  */
  THD *fetch_queue_skip_acquire_lock(StageID stage);

M38. Commit_stage_manager::signal_done():

The function is called only by the binlog leader thread.

  /**
    The function is called after follower threads are processed by leader,
    to unblock follower threads.

    @param queue   the thread list which needs to be unblocked
    @param stage   Stage identifier current thread belongs to.
  */
  void signal_done(THD *queue, StageID stage = BINLOG_FLUSH_STAGE);

M39. Commit_stage_manager::process_final_stage_for_ordered_commit_group():

The function is called only by the binlog leader thread.

 /**
   This function gets called after transactions are flushed to the engine
   i.e. after calling ha_flush_logs, to unblock commit order thread list
   which does not need to wait for other stages.

   @param first     the thread list which needs to be unblocked
 */
 void process_final_stage_for_ordered_commit_group(THD *first);

M40. Commit_stage_manager::lock_queue():

  /**
    Wrapper on Mutex_queue lock(), acquires lock on stage queue.

    @param[in]  stage  Stage identifier for the queue to append to.
  */
  void lock_queue(StageID stage) { m_queue[stage].lock(); }

M41. Commit_stage_manager::unlock_queue():

  /**
    Wrapper on Mutex_queue unlock(), releases lock on stage queue.

    @param[in]  stage  Stage identifier for the queue to append to.
  */
  void unlock_queue(StageID stage) { m_queue[stage].unlock(); }

M42. Commit_stage_manager::m_stage_cond_leader:

The binlog leader thread waits on this condition variable when the commit order queue is not empty, and is woken up later by a signal from the commit order flush queue leader.

  /**
     The binlog leader waits on this condition variable till it is indicated
     to wake up. If binlog flush queue gets first thread in the queue but
     by then commit order flush queue has already elected leader, then the
     first thread of binlog queue waits on this condition variable and gets
     signalled to wake up from commit order flush queue leader later.
  */
  mysql_cond_t m_stage_cond_leader;

M43. Commit_stage_manager::m_stage_cond_binlog:

The binlog follower thread waits on this condition variable.

  /**
     Condition variable to indicate that the binlog threads can wake up
     and continue.
  */
  mysql_cond_t m_stage_cond_binlog;

M44. Commit_stage_manager::m_stage_cond_commit_order:

The commit order follower thread waits on this condition variable.

  /**
     Condition variable to indicate that the flush to storage engine
     is done and commit order threads can again wake up and continue.
  */
  mysql_cond_t m_stage_cond_commit_order;

M45. Commit_stage_manager::m_queue_lock:

It holds the locks for the different stage queues. The binlog flush stage and the commit order flush stage share the same lock.

  /** Mutex used for the stage level locks */
  mysql_mutex_t m_queue_lock[STAGE_COUNTER - 1];
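
The array size STAGE_COUNTER - 1 follows from the two flush stages sharing one lock: four stage queues, three mutexes. One possible way to express the mapping is sketched below. The StageID values are copied from the enum above, while lock_index() and kLockCount are hypothetical names; the sketch assumes that the shared lock is the one at the BINLOG_FLUSH_STAGE index.

```cpp
#include <cassert>
#include <cstddef>

enum StageID {
  BINLOG_FLUSH_STAGE,
  SYNC_STAGE,
  COMMIT_STAGE,
  COMMIT_ORDER_FLUSH_STAGE,
  STAGE_COUNTER
};

// Three locks serve four stage queues.
constexpr std::size_t kLockCount = STAGE_COUNTER - 1;

// Hypothetical mapping from a stage queue to the index of its lock.
// Assumption: the commit order flush queue reuses the binlog flush lock.
constexpr std::size_t lock_index(StageID stage) {
  return stage == COMMIT_ORDER_FLUSH_STAGE
             ? static_cast<std::size_t>(BINLOG_FLUSH_STAGE)
             : static_cast<std::size_t>(stage);
}
```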

M46. Commit_stage_manager::init():

It initializes the m_stage_cond_leader, m_stage_cond_binlog and m_stage_cond_commit_order condition variables. It also initializes the mutexes for the flush, sync and commit stage queues. The binlog flush stage and the commit order flush stage share the same mutex.

M47. Commit_stage_manager::deinit():

It destroys the m_stage_cond_leader, m_stage_cond_binlog and m_stage_cond_commit_order condition variables. It also destroys the mutexes for the flush, sync and commit stage queues.

rpl_rli_pdb.cc

M48. report_rollback() / report_commit():

The calls to the functions report_rollback() and report_commit() have been replaced by wait_and_finish(). The error argument of wait_and_finish() now determines whether it behaves like report_commit() or report_rollback().

M49. Slave_worker::check_and_report_end_of_retries():

The code inside retry_transaction which determines whether the transaction can be retried is moved to the new function check_and_report_end_of_retries().

  /**
    Checks if the transaction can be retried, and if not, reports an error.

    @param[in] thd          The THD object of current thread.

    @return
      @retval returns std::tuple<bool, bool, uint> where each element has
              following meaning:

              first element of tuple is function return value and determines
                false  if the transaction should be retried
                true   if the transaction should not be retried

              second element of tuple determines:
                the function will set the value to true, in case the retry
                should be "silent". Silent means that the caller should not
                report it in performance_schema tables, write to the error
                log, or sleep. Currently, silent is used by NDB only.

              third element of tuple determines:
                If the caller should report any other error than that stored
                in thd->get_stmt_da()->mysql_errno(), then this function
                will store that error in this third element of the tuple.

  */
  std::tuple<bool, bool, uint> check_and_report_end_of_retries(THD *thd);
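
A caller would typically unpack the three tuple elements in one statement. The sketch below shows this with std::tie. The function check_end_of_retries_stub() is a hypothetical stand-in for the real member function, and its retry budget parameters and its decision rule are assumptions made only so that the example is self-contained.

```cpp
#include <cassert>
#include <tuple>

// Hypothetical stand-in for Slave_worker::check_and_report_end_of_retries().
// Element 0: true if the transaction should not be retried.
// Element 1: true if the retry should be "silent".
// Element 2: error to report instead of the one in the diagnostics area.
inline std::tuple<bool, bool, unsigned int> check_end_of_retries_stub(
    unsigned long retries_done, unsigned long max_retries) {
  const bool give_up = retries_done >= max_retries;
  const bool silent = false;     // silent retries are not modelled here
  const unsigned int error = 0;  // 0: no replacement error in this sketch
  return std::make_tuple(give_up, silent, error);
}

// Unpacks the tuple the way a caller such as retry_transaction() might.
inline bool should_retry(unsigned long retries_done,
                         unsigned long max_retries) {
  bool give_up = true;
  bool silent = false;
  unsigned int error = 0;
  std::tie(give_up, silent, error) =
      check_end_of_retries_stub(retries_done, max_retries);
  (void)silent;  // a real caller would skip logging and sleeping if set
  (void)error;   // a real caller would report this error if non-zero
  return !give_up;
}
```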

M50. Slave_worker::found_commit_order_deadlock_skip_acquire_lock():

found_order_commit_deadlock() is replaced by found_commit_order_deadlock_skip_acquire_lock() and found_commit_order_deadlock_acquire_lock().

  /**
     Return true if slave-preserve-commit-order is enabled and an
     earlier transaction is waiting for a row-level lock held by this
     transaction. The caller must acquire commit order manager lock
     before calling this function.
  */
  bool found_commit_order_deadlock_skip_acquire_lock();

M51. Slave_worker::found_commit_order_deadlock_acquire_lock():

found_order_commit_deadlock() is replaced by found_commit_order_deadlock_skip_acquire_lock() and found_commit_order_deadlock_acquire_lock(). The only difference is that found_commit_order_deadlock_acquire_lock() acquires the commit order manager mutex before accessing the m_commit_order_deadlock variable.

  /**
     Return true if slave-preserve-commit-order is enabled and an
     earlier transaction is waiting for a row-level lock held by this
     transaction.
  */
  bool found_commit_order_deadlock_acquire_lock();

M52. Slave_worker::flush_info():

The error code in calls to write_info() and flush_info() is no longer overwritten by ER_RPL_ERROR_WRITING_SLAVE_WORKER_CONFIGURATION.

M53.

The following functions and variable are renamed:

report_order_commit_deadlock() -> report_commit_order_deadlock()
reset_order_commit_deadlock() -> reset_commit_order_deadlock()
m_order_commit_deadlock -> m_commit_order_deadlock

sql/thd_raii.c

M54. Disable_intermediate_commit_without_binlog class

Added a new RAII class Disable_intermediate_commit_without_binlog to disable invocation of commit order for intermediate commits. Statements such as ANALYZE/OPTIMIZE/REPAIR TABLE call ha_commit_low multiple times with all=true from within mysql_admin_table, mysql_recreate_table, and handle_histogram_command. After returning to mysql_execute_command, the statement calls ha_commit_low a final time. It is only in this final call that we should preserve the commit order. Therefore, we set the flag thd->is_intermediate_commit_without_binlog while executing mysql_admin_table, mysql_recreate_table, and handle_histogram_command, clear it when returning from those functions, and check the flag in ha_commit_low().

 /**
   RAII class for temporarily turning ON is_intermediate_commit_without_binlog
   variable.
   Currently being used for ANALYZE/OPTIMIZE/REPAIR TABLE statements to
   disable invocation of commit order for intermediate commits.
 */

  class Disable_intermediate_commit_without_binlog {
  public:
   Disable_intermediate_commit_without_binlog(THD *thd)
       : m_thd(thd),
         m_skip_disable_intermediate_commit_without_binlog(
             thd->is_intermediate_commit_without_binlog) {
     thd->is_intermediate_commit_without_binlog = true;
   }

   Disable_intermediate_commit_without_binlog(THD *thd, Alter_info *alter_info)
       : Disable_intermediate_commit_without_binlog(thd) {
     thd->is_intermediate_commit_without_binlog =
         !(alter_info->flags & Alter_info::ALTER_ADMIN_PARTITION);
   }

   ~Disable_intermediate_commit_without_binlog() {
     m_thd->is_intermediate_commit_without_binlog =
         m_skip_disable_intermediate_commit_without_binlog;
   }

  private:
   THD *const m_thd;
   bool m_skip_disable_intermediate_commit_without_binlog;
 };
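
The save/set/restore behavior of the class can be tried out in isolation with a stub in place of THD. In the sketch below, FakeThd, ScopedIntermediateCommit and flag_inside_scope() are hypothetical names; only the flag name is taken from the class above, and the Alter_info constructor is not modelled.

```cpp
#include <cassert>

// Hypothetical stand-in for THD that carries only the flag of interest.
struct FakeThd {
  bool is_intermediate_commit_without_binlog = false;
};

// Same pattern as the RAII class above: remember the old value, set the
// flag for the lifetime of the object, restore the old value on exit.
class ScopedIntermediateCommit {
 public:
  explicit ScopedIntermediateCommit(FakeThd *thd)
      : m_thd(thd), m_saved(thd->is_intermediate_commit_without_binlog) {
    thd->is_intermediate_commit_without_binlog = true;
  }

  ~ScopedIntermediateCommit() {
    m_thd->is_intermediate_commit_without_binlog = m_saved;
  }

  ScopedIntermediateCommit(const ScopedIntermediateCommit &) = delete;
  ScopedIntermediateCommit &operator=(const ScopedIntermediateCommit &) =
      delete;

 private:
  FakeThd *const m_thd;
  const bool m_saved;
};

// Returns the flag value observed while the guard is alive.
inline bool flag_inside_scope(FakeThd *thd) {
  ScopedIntermediateCommit guard(thd);
  return thd->is_intermediate_commit_without_binlog;
}
```

Because the destructor restores the saved value rather than unconditionally clearing the flag, nested scopes leave an outer scope's setting intact.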

sql/sql_admin.cc

M55. mysql_admin_table() and Sql_cmd_analyze_table::handle_histogram_command():

Set thd->is_intermediate_commit_without_binlog by instantiating the Disable_intermediate_commit_without_binlog RAII class, which prevents intermediate commits from invoking commit order. The flag thd->is_intermediate_commit_without_binlog is reset when the object goes out of scope.

 Disable_intermediate_commit_without_binlog interm_commit_without_binlog(thd);

sql/sql_table.cc

M56. mysql_recreate_table():

Set thd->is_intermediate_commit_without_binlog by instantiating the Disable_intermediate_commit_without_binlog RAII class, which prevents intermediate commits from invoking commit order. The flag thd->is_intermediate_commit_without_binlog is reset when the object goes out of scope.

 Disable_intermediate_commit_without_binlog interm_commit_without_binlog(thd);