WL#9556: Writeset-based MTS dependency tracking on master

Affects: Server-8.0   —   Status: Complete   —   Priority: Medium

EXECUTIVE SUMMARY

This worklog introduces an option to the MySQL server that allows a user to
choose whether to put information in the binary log that allows the slave to
parallelize based on commit timestamps or transaction write sets.

Parallelizing on write sets has potentially much more parallelism since it does
not depend on the commit history, and as such, applying binary logs on a slave
may explore better the underlying computing hardware (cpu cores) and ultimately
this means that replication can become a lot faster.

This builds on infrastructure that was already developed as part of Group
Replication and as such the amount of changes needed is small for a 5.7 GA server.

USER STORIES

- As a DBA, I want my server to apply relay-logs as quick as possible, even for
  low concurrent workloads, so that I am able to catch up more quickly with the
  master or group.

- As a DBA, I want to create a replication chain without decreasing the
  performance of the parallel applier down the chain, so that my replication
  topology is able to propagate changes quickly to the entire set of servers.

- As a DBA, I want my new server, that has joined the replication group, to do
  recovery as fast as it can by applying changes in parallel, so that it becomes
  ONLINE quickly.

BACKGROUND

MySQL 5.7 introduced the ability to have the replication applier work in
parallel with any kind of workload, with the basic assumption that two
transactions can execute in parallel on the slave if they ran in parallel on the
master. The Logical Clock (LC) scheduler of the Multi-threaded Slave Applier
(MTS) was devised for this, using the group commit mechanism to mark each
transaction with a group commit parent, which is the first transaction committed
in the same group commit. Two transactions that shared the same commit parent
can run in parallel on the slave (MySQL  5.7.3 increased the parallelism window
a bit beyond the size of a single binary log group-commit).

The Logical Clock scheduler allowed replication in MySQL 5.7 to use the
multi-threaded applier on intra-schema, unlike 5.6 where it could only be used
with one thread per schema. But there is a fundamental limit to the parallelism
that can be achieved by this approach, which is the number of transactions that
are estimated to have been executed in parallel on the master. If the number of
clients - or even the number of transactions that group commit together given
the way the parallelism is detected - is small the slave will not be able to run
multiple transactions in parallel.

So this approach works well when the number of client threads is high and/or the
size of the group commits are large, but low-concurrency in the writing clients
and very fast storage systems restrict its effectiveness. A particularly
troubling side-effect of this approach is that in multi-level replication
topologies the parallelization potential is gradually lost. Eliminating this
loss is in fact one of the motivations for Binlog Servers [2].

To support Group Replication (GR) the ability to generate the write-sets for
each transaction was also introduced in MySQL 5.7. The write-set of a
transaction is a set of hashes that identify each row in the database changed by
it. The write-sets are used by GR to detect which transactions can run in
parallel between members servers - using the certification mechanism – but the
same idea is also used in each member to detect which transactions can be
applied in parallel in the slaves.

GR takes advantage of the Logical Clock scheduler logic in MTS, but instead of
using the commit group as the source of the commit parent, it uses the write-set
to track and find the last transaction that changed each row and uses that
transaction as the commit parent for transactions that change those same rows.
Since a transaction can change rows changed previously by multiple transactions,
the latest of those transactions is used as the commit parent and the
slave-preserve-order must be used to manage multiple dependencies.

MOTIVATION

This worklog proposes implementing an option in the generation of the Logical
Clock commit parent for MTS that takes advantage of the write-sets. It uses
writesets to determine the parallelization window instead of commit timestamps.
This will:

- bring to the asynchronous replication infrastructure what is already done in
  Group Replication after recovery has finished;

- speed up the catch up of servers that are added to a replication topology or
  even to a group (in group replication), especially on low concurrent workloads.
  Considering some important use cases, this is a major competitive advantage
  over other forks of MySQL out there;

- enable a faster catchup mechanism, as low-concurrency workloads can be quickly
  inserted onto a slave/server at much higher rates then the original server
  executed them. This will be possible because transaction dependencies can be
  evaluated by the data touched (write set) and not by the transactions commit
  timestamps (group commit).

- allow parallelization information to remain optimal in a replication topology,
  instead of loosing parallelization potential as transactions traverse servers
  in a replication chain as it happens with commit timestamps parallelization
  scheme. This is a huge advantage, especially for some of our users that
  have noticed that the parallelization potential diminishes as transactions
  flow in the replication topology.

- build on the existing write set extraction mechanism in the server and as such
  the design proposed results in an implementation that is very contained in its
  implications and does not leave a memory footprint. The new parallelization is
  controlled through a new option which is not active by default, leaving room
  for a clean backport if needed.

PERFORMANCE BENEFITS

In terms of performance there are several benefits in the GR MTS approach,
resulting mostly from the fact that transactions are detected as able to run in
parallel by the data they touch and not by the moment they execute. Transactions
committed in very different moments can be executed in parallel if they don't
touch the same rows and from master to slave there is no loss of parallelism
because of it.
In particular, there is no need to have parallelism on the master to be able to
execute a workload in parallel on the slave, and even single-threaded workloads
can benefit from parallelism when being applied in bulk on the slaves.

For GR this would be particularly effective for performing recovery when a new
member joins a GR group, but presently GR performs recovery using the
asynchronous replication infrastructure and the binary logs of the donor members
– which use the unchanged Logical Clock logic. So, instead of taking advantage
of the write-sets generated, only the commit order is used to generate the
commit parent of the transactions – with the aforementioned limitations.

To test the potential gains of this approach a Sysbench RW benchmark was
performed where the LC MTS did not present any benefit: with a single client
thread. However, to present the same workload on the master and on the slaves
the RW test was modified so that only the writes were done on the master,
something we call Sysbench Write-only.

The test was ran 5 minutes on the master and then the resulting binary log (RBR)
was applied on the slave using the single-threaded applier (STS) and using the
multi-threaded applier (MTS) with a varying number of threads (the server used,
siv01, has 36 cores).

+----------------+---------+-----------+-----------+
|Test            |  Run    |  Speedup  |   Speedup |
|                |  Time   | vs client |   vs STS  |
+----------------+---------+-----------+-----------+
|Client (master) |  05:00  |     1,00  |     0,27  |
+----------------+---------+-----------+-----------+
|STS 0           |  01:20  |     3,73  |     1,00  |
|MTS 1 thread    |  01:26  |     3,47  |     0,93  |
|MTS 2 threads   |  00:53  |     5,58  |     1,50  |
|MTS 4 threads   |  00:30  |     9,94  |     2,66  |
|MTS 6 threads   |  00:22  |    13,15  |     3,52  |
|MTS 8 threads   |  00:22  |    13,40  |     3,59  |
|MTS 10 threads  |  00:21  |    13,77  |     3,69  |
|MTS 12 threads  |  00:21  |    14,06  |     3,77  |
|MTS 16 threads  |  00:19  |    15,27  |     4,09  |
|MTS 20 threads  |  00:20  |    14,87  |     3,98  |
|MTS 24 threads  |  00:19  |    15,09  |     4,04  |
|MTS 32 threads  |  00:19  |    15,09  |     4,04  |
+----------------+---------+-----------+-----------+

The results clearly show that the GR MTS approach can present significant
benefits even on the single-threaded client case, with speedups larger than 4x,
results come close to the best that can be achieved with LC on many threads if
we exclude the additional speedup coming from the group commit. So, as a result
of a slight increase in execution time per transaction (for calculating the
write-sets), we can apply on the slave 4 times faster even a single-threaded
workload, the most difficult workloads for MTS.

REFERENCES

[1] WL#8440 Group Replication: Parallel applier support
[2]
http://blog.booking.com/evaluating_mysql_parallel_replication_3-benchmarks_in_production.html
FUNCTIONAL REQUIREMENTS

FR1. A new variable named binlog-transaction-dependency-tracking shall be added,
     with the possible modes being COMMIT_ORDER, WRITESET and WRITESET_SESSION,
     teh default is COMMIT_ORDER.

FR2. If binlog-transaction-dependency-tracking=COMMIT_ORDER the parallelization
     information shall be generated from the commit timestamps in the master.

FR3. If binlog-transaction-dependency-tracking=WRITESET the parallelization
     information shall be generated in the master from the write set (any two
     transactions that write to different tuples can be parallelizable),
     replacing the value of the commit_parent in the binary log.

FR4. If binlog-transaction-dependency-tracking=WRITESET_SESSION the parallelization
     information shall be generated in the master from the write set, but two
     updates from the same session cannot be reordered, replacing the value of
     the commit_parent in the binary log.

FR5. If the user attempts to set binlog-transaction-dependency-tracking to WRITESET
     or WRITESET_SESSION while the option transaction-write-set-extraction is set
     to OFF, then an ERROR (ER_WRONG_USAGE) is thrown and the variable will remain
     unchanged.

FR6. If the user attempts to change transaction-write-set-extraction
     while binlog-transaction-dependency-tracking is set to WRITESET or
     WRITESET_SESSION, will also trigger the ERROR (ER_WRONG_USAGE) and
     the variable will remain unchanged.

FR7. The binlog-transaction-dependency-tracking variable shall be changeable while
     the binlog is in use and the values will take effect on the next
     transaction to be written to the binary log.

FR8. Tranactions without writesets will return in WRITESET and WRITESET_SESSION
     modes what would be returned in COMMIT_ORDER mode.

FR9. The WRITESET and WRITESET_SESSION modes will not deliver transaction
     dependencies that are newer then those that would be returned in
     COMMIT_ORDER.

FR10.A new variable named binlog-transaction-dependency-history-size shall be added,
     with possible values ranging from 1 to 1 million, the default will be
     kept at 25000.

FR11.The binlog-transaction-dependency-history-size will define how many row hashes
     will be kept in memory to look back in time and figure out which transaction
     changed some row last. Once the history size if reached, the history map
     will be cleared (no step by step removals for performance reasons).

FR12.Tables without primary keys should be supported by reverting to
     COMMIT_ORDER. 

FR13.Tables with foreign key constraints should be supported by reverting to
     COMMIT_ORDER and clearing history when necessary, which is when there are 
     transactions that affect tables the are parents of any FK relation.

FR14.FILTERED transactions and EMPTY transactions shall change the commit_parent
     and reset the history.

FR15.SAVEPOINTS: A transaction shall count as if it updates a non-index table
     or a foreign key parent table, even if all such updates are rolled back
     within a ROLLBACK TO SAVEPOINT.

NON-FUNCTIONAL REQUIREMENTS

NF1. The performance degradation on the server shall be inferior to 5% compared to a
     server with transaction-write-set-extraction set to a value other than OFF.

NF2. The structure of information sent on the binlog shall not change because
     binlog-transaction-dependency-tracking=WRITESET or
     binlog-transaction-dependency-tracking=WRITESET_SESSION.
     It shall be supported by current MySQL 5.7 slaves and allow those to
     automatically take advantage of it and not affect upgrade/downgrade
     capabilities.
APPROACH

The approach to take is to track in a map/hash the last transaction that used
each row hash (which represents a row in the database), up to a maximum history
log. Whenever a new transaction is processed, the writesets it uses are searched
in the history and the most recent transaction that used any of the writesets in
this transaction becomes the commit parent of the new transaction.

If the search returns no commit_parent, a variable that represents the oldest
transaction that can be a commit_parent is used. This variable also allows the
history to be cleared regularly and introduces support for transaction that
have DDL and need to be executed sequentially, while also supporting the switch
between commit order and writesets as source for the logical clock.

The following diagram provides an overview:

trx1 ---------- B, R(a), W(a), C ----------------------------------->
trx2 ---------------------B, R(b), W(b) C -------------------------->
trx3 ----------------------------- B, R(a), W(a), R(b), W(b) C ----->

* hashes:
  - trx1 commits => { a=trx1 }
  - trx2 commits => { a=trx1, b=trx2 }
  - trx3 commits => { a=trx3, b=trx3 }

* writesets:
  - trx1 = { a }
  - trx2 = { b }
  - trx3 = { a, b }

* Clocks/timestamps:
  - trx1 LC= { Nil , trx1 }
  - trx2 LC= { Nil , trx2 }
  - trx3 LC= { trx2, trx3 }

* Notes
  - trx3 depends on trx2 and trx1, but only trx2 is captured as dependency.
  - trx3 takes the most recently committed transaction that touched one of
    its items in the writeset

* The lower timestamp is calculated similarly to what is
  depicted on the oversimplified routine below

  on_commit(trx, ws):
     max=0
     foreach i in ws:
       if hash[i] > max or max == 0:
         max= hash[i]
       hash[i]= trx.commit_seq
     return max

DETAILS

* New system variable to control which method to capture dependencies
  between transactions:

- Name: binlog-transaction-dependency-tracking
  - Input Values: { COMMIT_ORDER | WRITESET | WRITESET_SESSION}
  - Default: COMMIT_ORDER
  - Description: This option controls whether transaction dependencies are
    established using write sets or commit timestamps. A server applying
    binary logs uses this information to determine which transactions can
    be applied in parallel.
    If this variable is equal to WRITESET_SESSION the master will store the
    last transaction issues by each client and will not allow reordering
    of transactions issued by the same client.
  - Dynamic: yes.
  - Scope: Global
  - Type: Enumeration

- Name: binlog-transaction-dependency-history-size
  - Input Values: [1, 1000000]
  - Default: 25000
  - Description: This option controls the maximum size of the row hash
    history that is kept to track transaction dependencies.
  - Dynamic: yes.
  - Scope: Global
  - Type: Enumeration

* Observability
  - It relies on the infrastructure that we have in place already and the
    results are observable with the tools that mine the binary log.

* Cross-version replication
  - Current MySQL 5.7 servers can automatically take advantage of the new
    commit_parents generated, even if they are not aware of how they were
    generated. There is no extra requirements other than support for the
    LOGICAL_CLOCK scheduler in MTS (so MySQL 5.6 will not use it).

* Rolling-upgrades
  - The upgrade procedures do not change as a results of this worklog.
    The slaves of masters which are configured with the variable
    binlog-transaction-dependency-tracking != COMMIT_ORDER
    may see a different history from the master, but that already happens
    between sessions if slave-preserve-commit-order is not ON.

* Group Replication
  - Servers in GR fulfill all the requirements for this worklod and can use
    the WRITE_SET option without new restrictions.
    If the members are configured with COMMIT_ORDER the recovery that happens
    when new members join the group will not take advantage of the WRITE_SET
    source to improve the parallelism on the node. Also, the asynchronous
    replication slaves that connect to it will also not be able to take
    advantage of the performance improvements.

* DDL
  DDL must be handle sequentialy on the slaves, so all transactions that don't
  have writeset (such as DDLs) are configured to run sequentially on the slave.

CAVEATS

* Using the WRITESET to generate the commit_parent makes it possible that the
  slaves see a state in the database that was never seen in the master.
  To reduce this effect the WRITESET_SESSION value can be used at the cost of
  significantly reducing the parallelism that can be achieved, in particular
  when for transactions ran be the same client.

* This worklog deals with DDLs, FK, filters and savepoints by reverting to
  commit order, which is sub-optimal way. Later we will deal specifically with
  optimizing the commit_parent generation for such cases.
The MYSQL_BIN_LOG::write_gtid function is responsible both for writing the GTID
of each transaction to the binary log if GTIDs are active, and for writing the
(relative_last_committed, relative_sequence_number) pair that is used by MTS to
track transactions that can be executed in parallel.

At present the (relative_last_committed, relative_sequence_number) pair are
calculated taking into account only the binary log group commit order, but after
this worklog the user will be able to select a different source for the
relative_last_committed based on the write-sets touched by each transaction.

At the point in the code path that MYSQL_BIN_LOG::write_gtid is executed the
transaction write-set is already filled with a hash code for each row the
transaction touches - in case this is enabled. As such, in case the user
selects the WRITE_SET logical clock source those hash codes will be used to
track the last transaction to have touched each row and use that transaction as
the relative_last_committed value of the aforementioned pair.

In case GR is running the last transactions to use each row can be extracted
from the certification_info structure. However, to support the WRITE_SET source
without GR one needs to implement an independent map to track it.

To support DDL operations and to limit the memory usage for rows that are
seldomly used, there is a running cursor that tracks the minimum commit_parent,
which is increased whenever there is an operation that must run sequentially and
whenever there memory to store that last transaction to store a row was
reclaimed. So, this map is cleared whenever there is a transaction that has no
write-set, such as DDL, or after a maximum number of rows is stored in it.

The implementation of the GR MTS approach for replication in general is
straightforward, taking into account what was already done for GR, with the core
changes residing on the MYSQL_BIN_LOG::write_gtid function (binlog.cc).