WL#9556: Writeset-based MTS dependency tracking on master
Affects: Server-8.0
Status: Complete
EXECUTIVE SUMMARY

This worklog introduces an option that lets the user choose whether the
information written to the binary log allows the slave to parallelize based on
commit timestamps or on transaction write sets. Parallelizing on write sets
offers potentially much more parallelism, since it does not depend on the
commit history; applying binary logs on a slave can therefore make better use
of the underlying hardware (CPU cores), and ultimately replication can become
a lot faster. This builds on infrastructure that was already developed as part
of Group Replication, so the amount of change needed for a 5.7 GA server is
small.

USER STORIES

- As a DBA, I want my server to apply relay logs as quickly as possible, even
  for low-concurrency workloads, so that I am able to catch up more quickly
  with the master or group.
- As a DBA, I want to create a replication chain without decreasing the
  performance of the parallel applier down the chain, so that my replication
  topology is able to propagate changes quickly to the entire set of servers.
- As a DBA, I want a new server that has joined the replication group to
  recover as fast as it can by applying changes in parallel, so that it
  becomes ONLINE quickly.

BACKGROUND

MySQL 5.7 introduced the ability to have the replication applier work in
parallel with any kind of workload, under the basic assumption that two
transactions can execute in parallel on the slave if they ran in parallel on
the master. The Logical Clock (LC) scheduler of the Multi-threaded Slave
applier (MTS) was devised for this, using the group commit mechanism to mark
each transaction with a group commit parent, which is the first transaction
committed in the same group commit. Two transactions that share the same
commit parent can run in parallel on the slave (MySQL 5.7.3 increased the
parallelism window a bit beyond the size of a single binary log group commit).

The Logical Clock scheduler allowed replication in MySQL 5.7 to use the
multi-threaded applier intra-schema, unlike 5.6 where it could only be used
with one thread per schema. But there is a fundamental limit to the
parallelism that can be achieved by this approach: the number of transactions
that are estimated to have been executed in parallel on the master. If the
number of clients - or the number of transactions that group commit together,
given the way parallelism is detected - is small, the slave will not be able
to run multiple transactions in parallel. So this approach works well when the
number of client threads is high and/or the group commits are large, but low
concurrency in the writing clients and very fast storage systems restrict its
effectiveness. A particularly troubling side-effect of this approach is that
in multi-level replication topologies the parallelization potential is
gradually lost. Eliminating this loss is in fact one of the motivations for
Binlog Servers [2].

To support Group Replication (GR), the ability to generate the write set for
each transaction was also introduced in MySQL 5.7. The write set of a
transaction is a set of hashes that identify each row in the database changed
by it. The write sets are used by GR to detect which transactions can run in
parallel between member servers - using the certification mechanism - but the
same idea is also used in each member to detect which transactions can be
applied in parallel on the slaves.
GR takes advantage of the Logical Clock scheduler logic in MTS, but instead of
using the commit group as the source of the commit parent, it uses the write
set to track the last transaction that changed each row and uses that
transaction as the commit parent for transactions that change those same rows.
Since a transaction can change rows previously changed by multiple
transactions, the latest of those transactions is used as the commit parent,
and slave-preserve-commit-order must be used to manage multiple dependencies.

MOTIVATION

This worklog proposes an option that generates the Logical Clock commit parent
for MTS from the write sets: the parallelization window is determined by the
data each transaction touches instead of by commit timestamps. This will:

- bring to the asynchronous replication infrastructure what is already done in
  Group Replication after recovery has finished;
- speed up the catch-up of servers that are added to a replication topology or
  to a group (in Group Replication), especially on low-concurrency workloads.
  Considering some important use cases, this is a major competitive advantage
  over other forks of MySQL;
- enable a faster catch-up mechanism, since low-concurrency workloads can be
  applied on a slave/server at much higher rates than the original server
  executed them. This is possible because transaction dependencies are
  evaluated from the data touched (the write set) and not from the
  transactions' commit timestamps (group commit);
- allow parallelization information to remain optimal across a replication
  topology, instead of losing parallelization potential as transactions
  traverse servers in a replication chain, as happens with the commit
  timestamp parallelization scheme. This is a huge advantage, especially for
  users who have noticed that the parallelization potential diminishes as
  transactions flow through the replication topology;
- build on the existing write set extraction mechanism in the server, so the
  proposed design results in an implementation that is very contained in its
  implications and has a small, bounded memory footprint. The new
  parallelization is controlled through a new option which is not active by
  default, leaving room for a clean backport if needed.

PERFORMANCE BENEFITS

In terms of performance there are several benefits to the GR MTS approach,
resulting mostly from the fact that transactions are detected as able to run
in parallel by the data they touch and not by the moment they execute.
Transactions committed at very different moments can be executed in parallel
if they do not touch the same rows, and no parallelism is lost from master to
slave because of it. In particular, there is no need to have parallelism on
the master to be able to execute a workload in parallel on the slave, and even
single-threaded workloads can benefit from parallelism when applied in bulk on
the slaves.

For GR this would be particularly effective when performing recovery as a new
member joins a group, but presently GR performs recovery using the
asynchronous replication infrastructure and the binary logs of the donor
members, which use the unchanged Logical Clock logic. So, instead of taking
advantage of the write sets generated, only the commit order is used to
generate the commit parent of the transactions, with the aforementioned
limitations.
To test the potential gains of this approach, a Sysbench RW benchmark was
performed in the scenario where the LC MTS presents no benefit: a single
client thread. However, to present the same workload on the master and on the
slaves, the RW test was modified so that only the writes were done on the
master, something we call Sysbench Write-only. The test was run for 5 minutes
on the master and then the resulting binary log (RBR) was applied on the slave
using the single-threaded applier (STS) and the multi-threaded applier (MTS)
with a varying number of threads (the server used, siv01, has 36 cores).

  +----------------+---------+-----------+-----------+
  |Test            | Run     | Speedup   | Speedup   |
  |                | Time    | vs client | vs STS    |
  +----------------+---------+-----------+-----------+
  |Client (master) | 05:00   | 1,00      | 0,27      |
  +----------------+---------+-----------+-----------+
  |STS 0           | 01:20   | 3,73      | 1,00      |
  |MTS 1 thread    | 01:26   | 3,47      | 0,93      |
  |MTS 2 threads   | 00:53   | 5,58      | 1,50      |
  |MTS 4 threads   | 00:30   | 9,94      | 2,66      |
  |MTS 6 threads   | 00:22   | 13,15     | 3,52      |
  |MTS 8 threads   | 00:22   | 13,40     | 3,59      |
  |MTS 10 threads  | 00:21   | 13,77     | 3,69      |
  |MTS 12 threads  | 00:21   | 14,06     | 3,77      |
  |MTS 16 threads  | 00:19   | 15,27     | 4,09      |
  |MTS 20 threads  | 00:20   | 14,87     | 3,98      |
  |MTS 24 threads  | 00:19   | 15,09     | 4,04      |
  |MTS 32 threads  | 00:19   | 15,09     | 4,04      |
  +----------------+---------+-----------+-----------+

The results clearly show that the GR MTS approach can bring significant
benefits even in the single-threaded client case, with speedups larger than
4x; the results come close to the best that can be achieved with LC on many
threads, if we exclude the additional speedup coming from the group commit.
So, at the cost of a slight increase in execution time per transaction (to
calculate the write sets), even a single-threaded workload - the most
difficult workload for MTS - can be applied on the slave 4 times faster.

REFERENCES

[1] WL#8440: Group Replication: Parallel applier support
[2] http://blog.booking.com/evaluating_mysql_parallel_replication_3-benchmarks_in_production.html
FUNCTIONAL REQUIREMENTS

FR1.  A new variable named binlog-transaction-dependency-tracking shall be
      added, with the possible modes being COMMIT_ORDER, WRITESET and
      WRITESET_SESSION; the default is COMMIT_ORDER.
FR2.  If binlog-transaction-dependency-tracking=COMMIT_ORDER, the
      parallelization information shall be generated from the commit
      timestamps on the master.
FR3.  If binlog-transaction-dependency-tracking=WRITESET, the parallelization
      information shall be generated on the master from the write set (any two
      transactions that write to different tuples can be parallelized),
      replacing the value of the commit_parent in the binary log.
FR4.  If binlog-transaction-dependency-tracking=WRITESET_SESSION, the
      parallelization information shall be generated on the master from the
      write set, but two updates from the same session cannot be reordered,
      replacing the value of the commit_parent in the binary log.
FR5.  If the user attempts to set binlog-transaction-dependency-tracking to
      WRITESET or WRITESET_SESSION while transaction-write-set-extraction is
      set to OFF, an ERROR (ER_WRONG_USAGE) shall be thrown and the variable
      shall remain unchanged.
FR6.  If the user attempts to change transaction-write-set-extraction while
      binlog-transaction-dependency-tracking is set to WRITESET or
      WRITESET_SESSION, an ERROR (ER_WRONG_USAGE) shall also be thrown and the
      variable shall remain unchanged (both checks are sketched after the
      requirements below).
FR7.  The binlog-transaction-dependency-tracking variable shall be changeable
      while the binlog is in use, and the new value shall take effect on the
      next transaction written to the binary log.
FR8.  Transactions without write sets shall return, in WRITESET and
      WRITESET_SESSION modes, what would be returned in COMMIT_ORDER mode.
FR9.  The WRITESET and WRITESET_SESSION modes shall not deliver transaction
      dependencies that are newer than those that would be returned in
      COMMIT_ORDER mode.
FR10. A new variable named binlog-transaction-dependency-history-size shall be
      added, with possible values ranging from 1 to 1 million; the default is
      25000.
FR11. The binlog-transaction-dependency-history-size variable shall define how
      many row hashes are kept in memory to look back in time and determine
      which transaction changed each row last. Once the history size is
      reached, the history map shall be cleared (no step-by-step removals, for
      performance reasons).
FR12. Tables without primary keys shall be supported by reverting to
      COMMIT_ORDER.
FR13. Tables with foreign key constraints shall be supported by reverting to
      COMMIT_ORDER and clearing the history when necessary, that is, when
      there are transactions that affect tables that are parents in any FK
      relation.
FR14. FILTERED transactions and EMPTY transactions shall change the
      commit_parent and reset the history.
FR15. SAVEPOINTS: a transaction shall count as if it updates a non-indexed
      table or a foreign key parent table, even if all such updates are rolled
      back within a ROLLBACK TO SAVEPOINT.

NON-FUNCTIONAL REQUIREMENTS

NF1.  The performance degradation on the server shall be less than 5% compared
      to a server with transaction-write-set-extraction set to a value other
      than OFF.
NF2.  The structure of the information written to the binlog shall not change
      because of binlog-transaction-dependency-tracking=WRITESET or
      binlog-transaction-dependency-tracking=WRITESET_SESSION. It shall be
      supported by current MySQL 5.7 slaves, allowing them to automatically
      take advantage of it, and shall not affect upgrade/downgrade
      capabilities.
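The cross-checks described in FR5 and FR6 can be summarized by the minimal
C++ sketch below. The names used here (DependencyTracking, WriteSetExtraction,
BinlogSettings) are illustrative only and are not the server's sys_var
machinery; the sketch merely captures the rule that the two settings must stay
compatible.

  #include <stdexcept>

  enum class DependencyTracking { COMMIT_ORDER, WRITESET, WRITESET_SESSION };
  enum class WriteSetExtraction { OFF, XXHASH64, MURMUR32 };

  struct BinlogSettings {
    DependencyTracking tracking = DependencyTracking::COMMIT_ORDER;
    WriteSetExtraction extraction = WriteSetExtraction::XXHASH64;

    // FR5: WRITESET/WRITESET_SESSION require write-set extraction to be on.
    void set_tracking(DependencyTracking value) {
      if (value != DependencyTracking::COMMIT_ORDER &&
          extraction == WriteSetExtraction::OFF)
        throw std::logic_error("ER_WRONG_USAGE");  // variable stays unchanged
      tracking = value;
    }

    // FR6: extraction cannot be changed while write-set tracking is active.
    void set_extraction(WriteSetExtraction value) {
      if (tracking != DependencyTracking::COMMIT_ORDER)
        throw std::logic_error("ER_WRONG_USAGE");  // variable stays unchanged
      extraction = value;
    }
  };

For example, calling set_tracking(DependencyTracking::WRITESET) while
extraction is OFF fails and leaves the tracking mode unchanged, mirroring FR5.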
APPROACH

The approach is to track in a map/hash the last transaction that used each row
hash (which represents a row in the database), up to a maximum history size.
Whenever a new transaction is processed, the row hashes of its write set are
searched in the history and the most recent transaction that used any of them
becomes the commit parent of the new transaction. If the search returns no
commit_parent, a variable that represents the oldest transaction that can be a
commit_parent is used. This variable also allows the history to be cleared
regularly and introduces support for transactions that contain DDL and need to
be executed sequentially, while also supporting the switch between commit
order and write sets as the source of the logical clock.

The following diagram provides an overview:

  trx1 ---------- B, R(a), W(a), C ----------------------------------->
  trx2 ---------------------B, R(b), W(b) C -------------------------->
  trx3 ----------------------------- B, R(a), W(a), R(b), W(b) C ----->

  * hashes:
    - trx1 commits => { a=trx1 }
    - trx2 commits => { a=trx1, b=trx2 }
    - trx3 commits => { a=trx3, b=trx3 }
  * writesets:
    - trx1 = { a }
    - trx2 = { b }
    - trx3 = { a, b }
  * Clocks/timestamps:
    - trx1 LC= { Nil , trx1 }
    - trx2 LC= { Nil , trx2 }
    - trx3 LC= { trx2, trx3 }
  * Notes:
    - trx3 depends on trx2 and trx1, but only trx2 is captured as a
      dependency.
    - trx3 takes the most recently committed transaction that touched one of
      the items in its write set.

The lower timestamp is calculated similarly to the oversimplified routine
below:

  on_commit(trx, ws):
    max= 0
    foreach i in ws:
      if hash[i] > max or max == 0:
        max= hash[i]
      hash[i]= trx.commit_seq
    return max

(A slightly more complete C++ sketch of this routine, including the
history-size limit, is given after the CAVEATS section below.)

DETAILS

* New system variable to control which method is used to capture dependencies
  between transactions:
  - Name: binlog-transaction-dependency-tracking
  - Input Values: { COMMIT_ORDER | WRITESET | WRITESET_SESSION }
  - Default: COMMIT_ORDER
  - Description: This option controls whether transaction dependencies are
    established using write sets or commit timestamps. A server applying
    binary logs uses this information to determine which transactions can be
    applied in parallel. If this variable is set to WRITESET_SESSION the
    master will store the last transaction issued by each client and will not
    allow reordering of transactions issued by the same client.
  - Dynamic: yes
  - Scope: Global
  - Type: Enumeration

* New system variable to control the size of the dependency history:
  - Name: binlog-transaction-dependency-history-size
  - Input Values: [1, 1000000]
  - Default: 25000
  - Description: This option controls the maximum size of the row hash history
    that is kept to track transaction dependencies.
  - Dynamic: yes
  - Scope: Global
  - Type: Integer

* Observability
  - This work relies on the infrastructure that is already in place, and the
    results are observable with the tools that mine the binary log.

* Cross-version replication
  - Current MySQL 5.7 servers can automatically take advantage of the new
    commit_parents generated, even if they are not aware of how they were
    generated. There are no extra requirements other than support for the
    LOGICAL_CLOCK scheduler in MTS (so MySQL 5.6 will not use it).

* Rolling upgrades
  - The upgrade procedures do not change as a result of this worklog. The
    slaves of masters configured with binlog-transaction-dependency-tracking
    != COMMIT_ORDER may see a different history from the master, but that
    already happens between sessions if slave-preserve-commit-order is not ON.
* Group Replication
  - Servers in GR fulfill all the requirements for this worklog and can use
    the WRITESET option without new restrictions. If the members are
    configured with COMMIT_ORDER, the recovery that happens when new members
    join the group will not take advantage of the WRITESET source to improve
    the parallelism on the node. The asynchronous replication slaves that
    connect to it will also not be able to take advantage of the performance
    improvements.

* DDL
  - DDL must be handled sequentially on the slaves, so all transactions that
    do not have a write set (such as DDL) are configured to run sequentially
    on the slave.

CAVEATS

* Using WRITESET to generate the commit_parent makes it possible for the
  slaves to see a state in the database that was never seen on the master. To
  reduce this effect the WRITESET_SESSION value can be used, at the cost of
  significantly reducing the parallelism that can be achieved, in particular
  for transactions run by the same client.

* This worklog deals with DDL, foreign keys, filters and savepoints by
  reverting to commit order, which is sub-optimal. Optimizing the
  commit_parent generation for such cases is left for later work.
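To make the approach described above more concrete, here is a minimal,
self-contained C++ sketch of the history map, assuming the logical clock is
represented by monotonically increasing sequence numbers. The class and method
names (WritesetHistory, get_commit_parent, require_sequential_execution) are
illustrative and do not correspond to the server's actual code.

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <unordered_map>
  #include <vector>

  class WritesetHistory {
   public:
    explicit WritesetHistory(std::size_t max_size) : m_max_size(max_size) {}

    // Returns the commit parent for a transaction with the given write set
    // and records the transaction as the last writer of each row hash.
    int64_t get_commit_parent(const std::vector<uint64_t> &write_set,
                              int64_t sequence_number) {
      // Default parent: the oldest transaction that may still be depended on.
      int64_t parent = m_history_start;
      for (uint64_t row_hash : write_set) {
        auto it = m_history.find(row_hash);
        if (it != m_history.end()) {
          if (it->second != sequence_number)        // ignore duplicate hashes
            parent = std::max(parent, it->second);  // latest writer wins
          it->second = sequence_number;
        } else {
          m_history.emplace(row_hash, sequence_number);
        }
      }
      // FR11: once the history is full it is simply cleared, and later
      // transactions depend at least on the current one.
      if (m_history.size() >= m_max_size)
        require_sequential_execution(sequence_number);
      return parent;
    }

    // Used for DDL, tables without primary keys, FK parents, etc.: discard
    // the history so that no later transaction is scheduled before this one.
    void require_sequential_execution(int64_t sequence_number) {
      m_history.clear();
      m_history_start = sequence_number;
    }

   private:
    std::size_t m_max_size;                           // FR10 history size
    int64_t m_history_start = 0;                      // oldest usable parent
    std::unordered_map<uint64_t, int64_t> m_history;  // row hash -> last writer
  };

For the trx1/trx2/trx3 example in the APPROACH section, trx3's write set
{a, b} finds trx2 as the most recent writer (of b), so trx2 becomes trx3's
commit parent, matching the LC pair { trx2, trx3 } in the diagram.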
LOW-LEVEL DESIGN

The MYSQL_BIN_LOG::write_gtid function is responsible both for writing the
GTID of each transaction to the binary log, if GTIDs are active, and for
writing the (relative_last_committed, relative_sequence_number) pair that is
used by MTS to track transactions that can be executed in parallel. At present
the (relative_last_committed, relative_sequence_number) pair is calculated
taking into account only the binary log group commit order, but after this
worklog the user will be able to select a different source for the
relative_last_committed, based on the write sets touched by each transaction.

At the point in the code path where MYSQL_BIN_LOG::write_gtid is executed, the
transaction write set is already filled with a hash code for each row the
transaction touches, in case write-set extraction is enabled. As such, if the
user selects the WRITESET logical clock source, those hash codes will be used
to track the last transaction to have touched each row, and that transaction
will be used as the relative_last_committed value of the aforementioned pair.

When GR is running, the last transaction to use each row can be extracted from
the certification_info structure. However, to support the WRITESET source
without GR, an independent map must be implemented to track it. To support DDL
operations and to limit the memory usage for rows that are seldom used, there
is a running cursor that tracks the minimum commit_parent, which is increased
whenever there is an operation that must run sequentially and whenever the
memory used to store the last transaction to touch a row is reclaimed. So,
this map is cleared whenever there is a transaction that has no write set,
such as DDL, or after a maximum number of row hashes is stored in it.

The implementation of the GR MTS approach for replication in general is
straightforward, taking into account what was already done for GR, with the
core changes residing in the MYSQL_BIN_LOG::write_gtid function (binlog.cc).
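As an illustration only, and not the actual binlog.cc code, the selection of
the relative_last_committed source at write_gtid time could look like the
sketch below. It reuses the hypothetical DependencyTracking enum and
WritesetHistory class from the earlier sketches; TrxInfo,
compute_last_committed and the per-session map keyed by thread id are invented
names used purely for illustration.

  #include <algorithm>
  #include <cstdint>
  #include <unordered_map>
  #include <vector>

  struct TrxInfo {
    uint32_t thread_id;               // issuing session
    int64_t sequence_number;          // logical clock of this transaction
    int64_t commit_order_parent;      // last_committed from the group commit
    bool has_write_set;               // false for DDL and similar statements
    std::vector<uint64_t> write_set;  // row hashes from write-set extraction
  };

  int64_t compute_last_committed(
      DependencyTracking mode, WritesetHistory &history,
      std::unordered_map<uint32_t, int64_t> &last_trx_of_session,
      const TrxInfo &trx) {
    int64_t parent = trx.commit_order_parent;
    if (mode != DependencyTracking::COMMIT_ORDER) {
      if (!trx.has_write_set) {
        // FR8/FR12/FR13: no usable write set, so fall back to commit order
        // and force later transactions to wait for this one.
        history.require_sequential_execution(trx.sequence_number);
      } else {
        int64_t ws_parent =
            history.get_commit_parent(trx.write_set, trx.sequence_number);
        // FR9: never report a dependency newer than the commit-order one.
        parent = std::min(ws_parent, parent);
      }
      if (mode == DependencyTracking::WRITESET_SESSION) {
        // Transactions issued by the same session keep their original order.
        auto it = last_trx_of_session.find(trx.thread_id);
        if (it != last_trx_of_session.end())
          parent = std::max(parent, it->second);
        last_trx_of_session[trx.thread_id] = trx.sequence_number;
      }
    }
    return parent;
  }

The resulting value would then be written, together with the sequence number,
as the (relative_last_committed, relative_sequence_number) pair of the
transaction's GTID event, at the same place where MYSQL_BIN_LOG::write_gtid
writes it today.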