WL#9556: Writeset-based MTS dependency tracking on master

Affects: Server-8.0   —   Status: Complete

EXECUTIVE SUMMARY

This worklog introduces a server option that lets the user choose whether the
information written to the binary log allows the slave to parallelize
transactions based on commit timestamps or on transaction write-sets.

Parallelizing on write-sets offers potentially much more parallelism, since it
does not depend on the commit history. Applying binary logs on a slave can
therefore make better use of the underlying hardware (CPU cores), which
ultimately means that replication can become much faster.

This builds on infrastructure already developed as part of Group Replication,
so the amount of change needed on a 5.7 GA server is small.

USER STORIES

- As a DBA, I want my server to apply relay logs as quickly as possible, even
  for low-concurrency workloads, so that I am able to catch up more quickly
  with the master or group.

- As a DBA, I want to create a replication chain without decreasing the
  performance of the parallel applier down the chain, so that my replication
  topology is able to propagate changes quickly to the entire set of servers.

- As a DBA, I want my new server, that has joined the replication group, to do
  recovery as fast as it can by applying changes in parallel, so that it becomes
  ONLINE quickly.

BACKGROUND

MySQL 5.7 introduced the ability to have the replication applier work in
parallel with any kind of workload, with the basic assumption that two
transactions can execute in parallel on the slave if they ran in parallel on the
master. The Logical Clock (LC) scheduler of the Multi-threaded Slave Applier
(MTS) was devised for this, using the group commit mechanism to mark each
transaction with a group commit parent, which is the first transaction committed
in the same group commit. Two transactions that share the same commit parent
can run in parallel on the slave (MySQL 5.7.3 widened the parallelism window
slightly beyond the size of a single binary log group commit).

The Logical Clock scheduler allowed replication in MySQL 5.7 to use the
multi-threaded applier within a single schema (intra-schema), unlike 5.6,
where parallelism was limited to one thread per schema. But there is a
fundamental limit to the parallelism this approach can achieve: the number of
transactions estimated to have executed in parallel on the master. If the
number of clients is small - or even if, given the way parallelism is
detected, the number of transactions that group-commit together is small - the
slave will not be able to run multiple transactions in parallel.

So this approach works well when the number of client threads is high and/or
the group commits are large, but low concurrency in the writing clients and
very fast storage systems restrict its effectiveness. A particularly
troubling side-effect of this approach is that in multi-level replication
topologies the parallelization potential is gradually lost. Eliminating this
loss is in fact one of the motivations for Binlog Servers [2].

To support Group Replication (GR), the ability to generate write-sets for each
transaction was also introduced in MySQL 5.7. The write-set of a transaction
is a set of hashes that identify each row in the database changed by it. GR
uses write-sets to detect which transactions can run in parallel across member
servers - through the certification mechanism - and the same idea is used
within each member to detect which transactions can be applied in parallel by
the applier.

GR takes advantage of the Logical Clock scheduler logic in MTS, but instead of
using the commit group as the source of the commit parent, it uses the write-set
to track and find the last transaction that changed each row and uses that
transaction as the commit parent for transactions that change those same rows.
Since a transaction can change rows previously changed by multiple
transactions, the latest of those transactions is used as the commit parent,
and slave_preserve_commit_order must be enabled to manage the remaining
dependencies.

MOTIVATION

This worklog proposes an option that generates the Logical Clock commit parent
for MTS from write-sets, determining the parallelization window from the data
each transaction touches instead of from commit timestamps. This will:

- bring to the asynchronous replication infrastructure what is already done in
  Group Replication after recovery has finished;

- speed up the catch-up of servers that are added to a replication topology or
  to a group (in Group Replication), especially on low-concurrency workloads.
  Considering some important use cases, this is a major competitive advantage
  over other forks of MySQL;

- enable a faster catch-up mechanism, as low-concurrency workloads can be
  applied on a slave/server at much higher rates than the original server
  executed them. This is possible because transaction dependencies are
  evaluated by the data touched (write-set) and not by the transactions'
  commit timestamps (group commit);

- allow parallelization information to remain optimal across a replication
  topology, instead of losing parallelization potential as transactions
  traverse servers in a replication chain, as happens with the
  commit-timestamp parallelization scheme. This is a huge advantage,
  especially for users who have noticed the parallelization potential
  diminishing as transactions flow through the replication topology;

- build on the existing write-set extraction mechanism in the server; the
  proposed design therefore results in an implementation that is very
  contained in its implications and has a small memory footprint. The new
  parallelization scheme is controlled through a new option which is not
  active by default, leaving room for a clean backport if needed.
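Assuming the option as it eventually shipped (named
binlog_transaction_dependency_tracking; this name is not stated in this
worklog text and is given here as an assumption), enabling write-set tracking
on the master would look like:

```ini
# my.cnf fragment (assumed option names) -- master side
[mysqld]
# COMMIT_ORDER (default) | WRITESET | WRITESET_SESSION
binlog_transaction_dependency_tracking = WRITESET
# rows remembered in the write-set history before falling back
binlog_transaction_dependency_history_size = 25000
```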

PERFORMANCE BENEFITS

In terms of performance there are several benefits to the GR MTS approach,
resulting mostly from the fact that transactions are detected as
parallelizable by the data they touch and not by the moment they execute.
Transactions committed at very different moments can be executed in parallel
if they do not touch the same rows, and no parallelism is lost between master
and slave because of it. In particular, there is no need for parallelism on
the master to execute a workload in parallel on the slave: even
single-threaded workloads can benefit from parallelism when applied in bulk on
the slaves.

For GR this would be particularly effective for performing recovery when a new
member joins a GR group, but presently GR performs recovery using the
asynchronous replication infrastructure and the binary logs of the donor members
– which use the unchanged Logical Clock logic. So, instead of taking advantage
of the write-sets generated, only the commit order is used to generate the
commit parent of the transactions – with the aforementioned limitations.

To test the potential gains of this approach, a Sysbench RW benchmark was run
in the configuration where LC MTS shows no benefit: a single client thread.
However, to present the same workload on the master and on the slaves, the RW
test was modified so that only the writes were executed on the master -
something we call Sysbench Write-only.

The test was run for 5 minutes on the master and then the resulting binary log
(RBR) was applied on the slave using the single-threaded applier (STS) and the
multi-threaded applier (MTS) with a varying number of threads (the server
used, siv01, has 36 cores).

+----------------+---------+-----------+-----------+
|Test            |  Run    |  Speedup  |   Speedup |
|                |  Time   | vs client |   vs STS  |
+----------------+---------+-----------+-----------+
|Client (master) |  05:00  |     1.00  |     0.27  |
+----------------+---------+-----------+-----------+
|STS 0           |  01:20  |     3.73  |     1.00  |
|MTS 1 thread    |  01:26  |     3.47  |     0.93  |
|MTS 2 threads   |  00:53  |     5.58  |     1.50  |
|MTS 4 threads   |  00:30  |     9.94  |     2.66  |
|MTS 6 threads   |  00:22  |    13.15  |     3.52  |
|MTS 8 threads   |  00:22  |    13.40  |     3.59  |
|MTS 10 threads  |  00:21  |    13.77  |     3.69  |
|MTS 12 threads  |  00:21  |    14.06  |     3.77  |
|MTS 16 threads  |  00:19  |    15.27  |     4.09  |
|MTS 20 threads  |  00:20  |    14.87  |     3.98  |
|MTS 24 threads  |  00:19  |    15.09  |     4.04  |
|MTS 32 threads  |  00:19  |    15.09  |     4.04  |
+----------------+---------+-----------+-----------+

The results clearly show that the GR MTS approach can deliver significant
benefits even in the single-threaded client case, with speedups larger than
4x. These results come close to the best that can be achieved with LC on many
client threads, if we exclude the additional speedup coming from group commit.
So, at the cost of a slight increase in execution time per transaction (to
calculate the write-sets), even a single-threaded workload - the most
difficult case for MTS - can be applied on the slave four times faster.

REFERENCES

[1] WL#8440: Group Replication: Parallel applier support
[2] http://blog.booking.com/evaluating_mysql_parallel_replication_3-benchmarks_in_production.html