WL#6314: MTS: Prepared transactions slave parallel applier
Affects: Server-5.7 — Status: Complete — Priority: Low
In this WL we will propose a method of enabling inter-transaction multi-threaded slaves. After this feature is implemented the slave will be able to apply transactions in parallel, even within a database, as long as they are non contending(have disjoint read and write set). When executing multiple transaction in parallel on the master, there might be several transactions being committed as the same time. This leads to contention on the slave side, but we can leverage this to execute transactions in parallel on the slave. We can execute transactions in parallel if we knew that the y do not interfere with each other, i.e. their read/write sets are disjoint. All the transactions committing together, are guaranteed to be modifying disjoint read and write sets and will not interfere with each other and can be applied in parallel. This allows us to avoid parsing the read and write sets for each of the transactions, which is normally costly. RATIONALE ========= The throughput of the slave applier infrastructure/procedure is not optimal since it does not fully realizes the potential of its multi-threaded nature currently. Transactions are only applied in parallel if they do not touch the same schema(s). This worklog lifts this limitation by implementing a mechanism to apply transactions in parallel, even within the same database. The only requirement is that at commit time they are non-conflicting, thence are free to commit in the same snapshot of the database. A detailed analysis of possible policies for determining which transactions can execute in parallel on the slave is done in the attached text file intra_schema_mts_policies. In the first version of this worklog we are implementing Policy 4 from this file.
1 General logic ================= Since MySQL is using a lock-based scheduler, all threads that are in the prepare phase but have not as yet committed can be executed in parallel on the slave without violating the consistency. All transactions should be marked with a logical time-stamp, which identifies the last transaction that was committed when the current transaction entered the prepare stage. Details of this logical time stamp is given in the next section. On the slave side all the transactions with the same time-stamp can execute in parallel. 2 Master Side ================ On master side the commit parent time-stamping can be done by using a Lamport clock We implement a Logical clock for commit parent timestamp in the SQL engine layer. The logic of the same is given by the following pseudocode. Global:: Logical_clock commit_clock; 2.1 in prepare stage: >> Fetch the time-stamp from the commit_clock, this will be stored as the commit parent of the transaction. 2.2 in commit stage: /* after the transaction is written to the binlog before the low-level commit */ >> step the commit_clock; 3 Slave Side =============== On the slave side, the coordinator thread will group the events based on the commit parent (i.e. transactions with same commit parent will be in the same group). All transaction in a group can be executed in parallel. 3.1 Event scheduling ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The events are scheduled to worker threads by picking a worker from the list of idle threads. if none are found the coordinator waits. 3.2 Problems ~~~~~~~~~~~~~~~ Since Coordinator waits after each group, in case the groups are small, the over-head of scheduling the events and waiting for the workers to finish may override the performance improvement while applying events in parallel. The best performance can be guaranteed when the number of clients doing writes ion master is high. 3.3 Proposed changes ~~~~~~~~~~~~~~~~~~~~~~~ 1. We will use the existing infrastructure of the slave workers and the coordinator. The change however will be to ignore the database partitioning information. 2. The thread association with a database will no longer be used. We will schedule the tasks in a group by assigning the threads in a round-robin method. 3. The coordinator will be blocked to make sure that the previous group has been applied before the event in the next group is scheduled. During this time coordinator will do periodic check-pointing 3 New Options: =============== 3.1. On slave we should have a system variable slave_parallel_type=[logical_clock|database] The option can only be changed after a stop slave; REPLICATION CONSISTENCY ANALYSIS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ BACKGROUND ========== - Replication Consistency Criteria  1. Linearizabile or Atomic Consistency If a single client forwards a transaction Ta to a replica Ri and gets the result of Ta , any other transaction Tb sent later by this same client to any other replica Rj should be able to read the updates caused by Ta , assuming that no other transaction is submitted to the system between Ta and Tb . 2. Sequential Consistency The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program. So, it can be implemented using FIFO total order for applying all write operations in system replicas. Note that this does not avoid the problem outlined in , since sequential consistency ensures that all updates will be applied following the same sequence in all replicas. However, if replica Rj is overloaded and holds a long queue of pending updates (to be applied in the database), it might serve the first read accesses of Tb before applying the updates of Ta and, of course, before locally committing Ta. 3. Causal consistency (Cache Consistency) This model only requires that accesses are sequentially consistent on a per-item basis. There are some replication protocols [...] that are able to comply with the requirements of this model but provide a consistency slightly higher, but that does not correspond to any already specified model. Such protocols are based on total order update propagation, but they allow that writeset application breaks such total order when writesets do not conflict (i.e., there are no write-write conflicts) with any of the previously-delivered but not-yet-committed transactions. Note that this ensures a per-item sequential consistency (as requested in the cache model), but also a per-transaction-writeset consistency (i.e., we can not commit half of a writeset WSA before writeset WSB and the other half of WSA afterward), although not a complete sequential consistency. ANALISYS ======== 1. MySQL Asynchronous Replication and Single-threaded Applier - Details All backups/slaves execute the same transactions in the same order. No two different slaves execute the same two transactions in a different order. - End user impact Eventually, the user will see the same execution history on every slave. The commit history will match that of the master. - Consistency Criterion Sequential Consistency. 2. MySQL 5.6 Asynchronous Replication and Multi-threaded Applier - Details All backups/slaves executing transactions T1 and T2 on schema S will will apply T1 and T2 on the same order. In other words, no two different slaves executing the same two transactions on the same schema will commit them in a different order. Transactions changing different schemas are considered concurrent and can commit in a different order at two different slaves. - End user impact Eventually, and if updates stop on the master, the state on the slaves will converge to the same state, which matches that of the master. While updates are ongoing, different execution histories can be observed on different slaves, and may be different from the execution history on the master. Execution histories differ only w.r.t. databases. Application invariants and semantics that require sequential consistency between all servers in the replication topology may be broken, but only if these semantics/invariants cross-reference schemas. - Consistency Criterion Causal Consistency. Causality is determined by the schema on which transactions operate. 3. MySQL 5.6 Asynchronous Replication and Multi-threaded Applier - Details All backups/slaves executing transactions T1 and T2 marked as having prepared on different commit parents will apply T1 and T2 on the same order among themselves. In other words, no two different slaves executing the same two transactions that prepared on different commit parents will commit them in a different order. Two transactions prepared on the same commit parent can commit in different order at different slaves. - End user impact Eventually, and if updates stop on the master, the state on the slaves will converge to the same state, which matches that of the master. While updates are ongoing, different execution histories can be observed on different slaves, and may be different from the execution history on the master. Execution histories differ w.r.t. transactions that are concurrent and prepared on the same commit parent. Application invariants and semantics that require sequential consistency between all servers in the replication topology may be broken. - Consistency Criterion Causal consistency. Causality is determined by the snapshot on which transactions prepare. 4. MySQL 5.6 Asynchronous Replication and Multi-threaded Applier - Details All backups/slaves executing transactions T1 and T2 marked as having prepared on the same or different commit parents will apply T1 and T2 on the same order among themselves. In other words no two backups/slaves will externalize commit transactions in a different order. - End user impact Eventually, the user will see the same execution history on every slave. The commit history will match that of the master. - Consistency Criterion Sequential consistency. REFERENCES ==========  http://dl.acm.org/citation.cfm?id=1693858  http://web.iti.upv.es/~fmunyoz/research/pdf/TR-ITI-SIDI-2009003.pdf  http://dl.acm.org/citation.cfm?id=1141442
1. CHANGES TO DATA-STRUCTURES ============================= 1.1. On the slave side ============================= class Slave_job_group ++ ulonglong parent_seqno; ulonglong total_seqno; class Relay_log_info ++ ulonglong mts_last_known_parent_group_id; ++ ulonglong mts_last_seen_commit_parent; /*lock for attaching and setaching tables from temporary tables.*/ ++ mts_temp_table_LOCK class Gtid_log_event ++ ulonglong commit_parent class Query_log_event ++ ulonglong commit_parent 1.2. On the master side ================================ struct IO_CACHE ++ ulonglong commit_seq_ts int commit_seq_offset 2. COMMIT Timestamp in events: =============================== We store the commit parent timestamp in the GTID events or in the first Query log event of the transaction(if GTID is disabled). - For GTID log event we store the commit parent after the GID and GNO - For Query log event we store the commit parent in the status var section. initially when the commit timestamp is available it is stored in the corresponding cache since we have different caches for transactional and non- transactional events. Also we may have non-transactional events interleaved between transactional events (in which case, the non-transactional events will be written to the binary log before the transactional events.) and they will commit saperately on the binary-log. 3. Coordinator scheduling =========================== The coordinator stores the last seen commit parent and compares it with the current transaction's commit parent. If the two are found to be the same then the transaction is scheduled immediately(i.e.) in the same group. If the two are found to be different then the coordinator schedules the event after the current job group has finished. 4. WORKER EXECUTION ====================== The worker works the same way as for DB partitioned MTS. However the the temporary tables are handled differently. Steps involved in handling temporary tables. 1. While assigning the worker thread to an event, the coordinator stores the thread id and server id in worker_thd::pseudo_thread_id and worker_thd::server_id respectively 2. Upon the commencement of the transaction execution, temporary tables are attached to the worker thread. This is done by sequentially checking the temporary tables of the coordinator for the server id and the thread id. 3. All the tables in the coordinator's list matching the pseudo_thread_id of the worker is attached to the worker. 4. upon transaction commit, the temporary tables of the worker are returned to the coordinator. 5. Since coordinator->temporary_tables is shared between the slave workers, we need to isolate both attaching and detaching process. for this reason we add a new lock mts_temp_table_LOCK which should be locked before attaching or detaching the temporary tables. 5. Handling special situations/cases ===================================== 5.1 Handling change in topology ================================== In case a Format descriptor event is encountered rli->mts_last_seen_commit_parent to uninitialized, which forces the coordinator to start a new group. 5.2. handling skipped events ============================== Assuming the commit parent number is stored in an N-bit integer, When exactly 2^N-1 transactions are skipped, so that the commit parent number wraps around and reaches the exact same value as the last non-skipped transaction. ***we will not have this problem since coordinator starts a new group every time a format descriptor event is encountered. And since the format descriptor events are never skipped, we will never have this scenario. 5.3 Handling online alteration of schema and ISOLATION levels ============================================================== Schemas may be altered online(e.g. ALTER TABLE...) while executing DMLs. This means we will have interleaved DDls and DMLs in the binary log. But the current implementation will take care of the same since a schema alter and a DML on the same schema cannot run simultaneously on the slave since they will never be a part of the same binlog group. Also In case of different isolation levels on master and slaves, it is only possible to have Row based replication between the master and the slave. This means that in tricky situations like snapshot isolation, mixed mode will switch to RBR and there will be no problem. Furthermore in case of SBR the slave SQL thread reports an error.
Copyright (c) 2000, 2016, Oracle Corporation and/or its affiliates. All rights reserved.