WL#6314: MTS: Prepared transactions slave parallel applier
Affects: Server-5.7
—
Status: Complete
In this WL we propose a method of enabling inter-transaction multi-threaded
slaves. After this feature is implemented the slave will be able to apply
transactions in parallel, even within a single database, as long as they are
non-conflicting (i.e. they have disjoint read and write sets).
When multiple transactions execute in parallel on the master, several of them
may be committing at the same time. This leads to contention on the slave side,
but we can leverage it to execute transactions in parallel on the slave. We
could execute transactions in parallel if we knew that they do not interfere
with each other, i.e. that their read/write sets are disjoint. All transactions
that are committing together are guaranteed to be modifying disjoint read and
write sets, so they do not interfere with each other and can be applied in
parallel. This allows us to avoid parsing the read and write sets of each
transaction, which is normally costly.
RATIONALE
=========
The throughput of the slave applier infrastructure is currently not optimal,
since it does not fully realize the potential of its multi-threaded nature:
transactions are only applied in parallel if they do not touch the same
schema(s). This worklog lifts that limitation by implementing a mechanism to
apply transactions in parallel even within the same database. The only
requirement is that at commit time they are non-conflicting, and hence are free
to commit in the same snapshot of the database.
A detailed analysis of possible policies for determining which transactions can
execute in parallel on the slave is given in the attached text file
intra_schema_mts_policies. The first version of this worklog implements
Policy 4 from that file.
1 General logic
=================
Since MySQL uses a lock-based scheduler, all transactions that are in the
prepare phase but have not yet committed can be executed in parallel on the
slave without violating consistency.
Every transaction is marked with a logical timestamp, which identifies the last
transaction that had committed when the current transaction entered the prepare
stage. Details of this logical timestamp are given in the next section.
On the slave side, all transactions carrying the same timestamp can execute in
parallel.
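As an illustration of this marking (the values are only an example):

  T1 prepares: no transaction has committed yet   -> timestamp of T1 = 0
  T1 commits
  T2 prepares: last committed transaction is T1   -> timestamp of T2 = 1
  T3 prepares: last committed transaction is T1   -> timestamp of T3 = 1
  T2 commits, T3 commits

T2 and T3 carry the same timestamp, so the slave may apply them in parallel;
T1 carries an earlier timestamp and must be applied before both of them.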
2 Master Side
================
On the master side the commit parent timestamping can be done by using a
Lamport clock.
We implement a logical clock for the commit parent timestamp in the SQL engine
layer. Its logic is given by the following pseudocode.
Global: Logical_clock commit_clock;
2.1 In the prepare stage:
>> Fetch the timestamp from the commit_clock; this will be stored as the
commit parent of the transaction.
2.2 In the commit stage: /* after the transaction is written to the binlog,
before the low-level commit */
>> Step the commit_clock.
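A minimal C++ sketch of the clock described above. The class and method names
follow the pseudocode; the locking detail and the exact integration points are
assumptions made for illustration, not the final implementation (the server
would use its own mutex primitives):

  #include <cstdint>
  #include <mutex>

  class Logical_clock
  {
    std::mutex m_lock;
    uint64_t m_state= 0;        // last assigned commit timestamp
  public:
    // Prepare stage: the value read here is stored as the transaction's
    // commit parent.
    uint64_t get_timestamp()
    {
      std::lock_guard<std::mutex> guard(m_lock);
      return m_state;
    }
    // Commit stage: called after the transaction has been written to the
    // binary log, before the low-level (engine) commit.
    uint64_t step()
    {
      std::lock_guard<std::mutex> guard(m_lock);
      return ++m_state;
    }
  };

  Logical_clock commit_clock;   // global clock, as in the pseudocode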
3 Slave Side
===============
On the slave side, the coordinator thread groups the events based on the
commit parent (i.e. transactions with the same commit parent are placed in the
same group). All transactions in a group can be executed in parallel.
3.1 Event scheduling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Events are scheduled to worker threads by picking a worker from the list of
idle threads; if none is found, the coordinator waits.
3.2 Problems
~~~~~~~~~~~~~~~
Since the coordinator waits after each group, when groups are small the
overhead of scheduling the events and waiting for the workers to finish may
outweigh the gain from applying events in parallel.
The best performance can be expected when the number of clients doing writes
on the master is high.
3.3 Proposed changes
~~~~~~~~~~~~~~~~~~~~~~~
1. We will use the existing infrastructure of the slave workers and the
coordinator. The change, however, is to ignore the database partitioning
information.
2. The thread association with a database will no longer be used. We will
schedule the tasks in a group by assigning the threads in a round-robin
manner.
3. The coordinator will block to make sure that the previous group has been
applied before any event of the next group is scheduled. During this time the
coordinator does periodic check-pointing.
A sketch of this scheduling policy is given below.
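The following self-contained sketch only illustrates the policy described in
the three points above (round-robin assignment within a group, a wait at every
group boundary); the structure and function names are illustrative
assumptions, and the real coordinator reuses the existing MTS worker
infrastructure:

  #include <cstdint>
  #include <cstdio>
  #include <vector>

  struct Trx
  {
    uint64_t id;
    uint64_t commit_parent;
  };

  // Assign transactions to workers round-robin, ignoring which database a
  // transaction touches, and wait whenever the commit parent changes.
  static void schedule(const std::vector<Trx> &relay_log, int n_workers)
  {
    if (relay_log.empty())
      return;
    uint64_t last_seen_commit_parent= relay_log[0].commit_parent;
    int next_worker= 0;
    for (const Trx &trx : relay_log)
    {
      if (trx.commit_parent != last_seen_commit_parent)
      {
        // Group boundary: block until all workers have applied the previous
        // group; the real coordinator also does periodic check-pointing
        // while it waits.
        std::printf("-- wait for group with commit parent %llu --\n",
                    (unsigned long long) last_seen_commit_parent);
        last_seen_commit_parent= trx.commit_parent;
        next_worker= 0;
      }
      std::printf("trx %llu -> worker %d\n",
                  (unsigned long long) trx.id, next_worker);
      next_worker= (next_worker + 1) % n_workers;
    }
  }

  int main()
  {
    // Commit parents as they might appear in the relay log.
    std::vector<Trx> relay_log= {{1, 0}, {2, 1}, {3, 1}, {4, 3}};
    schedule(relay_log, 2);
    return 0;
  }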
4 New Options:
===============
4.1. On the slave we add a system variable
slave_parallel_type=[logical_clock|database]
The option can only be changed after a STOP SLAVE.
REPLICATION CONSISTENCY ANALYSIS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
BACKGROUND
==========
- Replication Consistency Criteria [1][2]
1. Linearizable or Atomic Consistency
If a single client forwards a transaction Ta to a replica Ri and
gets the result of Ta, any other transaction Tb sent later by
this same client to any other replica Rj should be able to read
the updates caused by Ta, assuming that no other transaction is
submitted to the system between Ta and Tb.
2. Sequential Consistency
The result of any execution is the same as if the operations of
all the processors were executed in some sequential order, and
the operations of each individual processor appear in this
sequence in the order specified by its program.
So, it can be implemented using FIFO total order for applying all
write operations in system replicas. Note that this does not
avoid the problem outlined in [24], since sequential consistency
ensures that all updates will be applied following the same
sequence in all replicas. However, if replica Rj is overloaded
and holds a long queue of pending updates (to be applied in the
database), it might serve the first read accesses of Tb before
applying the updates of Ta and, of course, before locally
committing Ta.
3. Causal consistency (Cache Consistency)
This model only requires that accesses are sequentially
consistent on a per-item basis.
There are some replication protocols [...] that are able to
comply with the requirements of this model but provide a
consistency slightly higher, but that does not correspond to any
already specified model. Such protocols are based on total order
update propagation, but they allow that writeset application
breaks such total order when writesets do not conflict (i.e.,
there are no write-write conflicts) with any of the
previously-delivered but not-yet-committed transactions. Note
that this ensures a per-item sequential consistency (as requested
in the cache model), but also a per-transaction-writeset
consistency (i.e., we can not commit half of a writeset WSA
before writeset WSB and the other half of WSA afterward),
although not a complete sequential consistency.
ANALYSIS
========
1. MySQL Asynchronous Replication and Single-threaded Applier
- Details
All backups/slaves execute the same transactions in the same
order. No two different slaves execute the same two transactions
in a different order.
- End user impact
Eventually, the user will see the same execution history on every
slave. The commit history will match that of the master.
- Consistency Criterion
Sequential Consistency.
2. MySQL 5.6 Asynchronous Replication and Multi-threaded Applier
- Details
All backups/slaves executing transactions T1 and T2 on schema S
will apply T1 and T2 in the same order. In other words, no
two different slaves executing the same two transactions on the
same schema will commit them in a different order. Transactions
changing different schemas are considered concurrent and can
commit in a different order at two different slaves.
- End user impact
Eventually, and if updates stop on the master, the state on the
slaves will converge to the same state, which matches that of the
master. While updates are ongoing, different execution histories
can be observed on different slaves, and may be different from the
execution history on the master. Execution histories differ only
w.r.t. databases.
Application invariants and semantics that require sequential
consistency between all servers in the replication topology may
be broken, but only if these semantics/invariants cross-reference
schemas.
- Consistency Criterion
Causal Consistency. Causality is determined by the schema on
which transactions operate.
3. MySQL 5.7 Asynchronous Replication and Multi-threaded Applier
- Details
All backups/slaves executing transactions T1 and T2 marked as
having prepared on different commit parents will apply T1 and T2
in the same order among themselves. In other words, no two
different slaves executing the same two transactions that
prepared on different commit parents will commit them in a
different order. Two transactions prepared on the same commit
parent can commit in a different order at different slaves.
- End user impact
Eventually, and if updates stop on the master, the state on the
slaves will converge to the same state, which matches that of the
master. While updates are ongoing, different execution histories
can be observed on different slaves, and may be different from
the execution history on the master. Execution histories differ
w.r.t. transactions that are concurrent and prepared on the same
commit parent.
Application invariants and semantics that require sequential
consistency between all servers in the replication topology may
be broken.
- Consistency Criterion
Causal consistency. Causality is determined by the snapshot on
which transactions prepare.
4. MySQL 5.7 Asynchronous Replication and Multi-threaded Applier
- Details
All backups/slaves executing transactions T1 and T2, marked as
having prepared on the same or different commit parents, will
apply T1 and T2 in the same order among themselves. In other
words, no two backups/slaves will externalize (commit)
transactions in a different order.
- End user impact
Eventually, the user will see the same execution history on every
slave. The commit history will match that of the master.
- Consistency Criterion
Sequential consistency.
REFERENCES
==========
[1] http://dl.acm.org/citation.cfm?id=1693858
[2] http://web.iti.upv.es/~fmunyoz/research/pdf/TR-ITI-SIDI-2009003.pdf
[24] http://dl.acm.org/citation.cfm?id=1141442
1. CHANGES TO DATA-STRUCTURES
=============================
1.1. On the slave side
=============================
class Slave_job_group
++ ulonglong parent_seqno;
ulonglong total_seqno;
class Relay_log_info
++ ulonglong mts_last_known_parent_group_id;
++ ulonglong mts_last_seen_commit_parent;
/* lock for attaching and detaching the temporary tables. */
++ mts_temp_table_LOCK
class Gtid_log_event
++ ulonglong commit_parent
class Query_log_event
++ ulonglong commit_parent
1.2. On the master side
================================
struct IO_CACHE
++ ulonglong commit_seq_ts;
int commit_seq_offset;
2. COMMIT Timestamp in events:
===============================
We store the commit parent timestamp in the GTID event or, if GTIDs are
disabled, in the first Query log event of the transaction.
- For the Gtid log event we store the commit parent after the GID and GNO.
- For the Query log event we store the commit parent in the status-vars
section.
Initially, when the commit timestamp becomes available, it is stored in the
corresponding cache, since we have different caches for transactional and
non-transactional events. We may also have non-transactional events
interleaved between transactional events (in which case the non-transactional
events are written to the binary log before the transactional events), and
they commit separately in the binary log.
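The following self-contained sketch only models the behaviour described above:
one timestamp per cache, written out with the first event of the corresponding
group. The structure and function names are illustrative assumptions, not the
server's actual IO_CACHE/binlog cache code:

  #include <cstdint>
  #include <cstdio>

  // One entry per binlog cache, mirroring the transactional and the
  // statement (non-transactional) caches.
  struct Binlog_cache
  {
    const char *name;
    uint64_t commit_seq_ts;   // commit parent recorded for this cache
    bool has_timestamp;
  };

  // Prepare time: the commit parent fetched from commit_clock is stored in
  // the cache the transaction is writing to, so transactional and
  // non-transactional content keep their own timestamps even when they are
  // interleaved.
  static void store_commit_parent(Binlog_cache *cache, uint64_t ts)
  {
    cache->commit_seq_ts= ts;
    cache->has_timestamp= true;
  }

  // Flush time: the timestamp travels with the first event of the group,
  // i.e. the Gtid_log_event when GTIDs are enabled, otherwise the
  // status-vars section of the first Query_log_event.
  static void flush_cache(Binlog_cache *cache)
  {
    if (!cache->has_timestamp)
      return;
    std::printf("%s cache: commit parent %llu written into the first event\n",
                cache->name, (unsigned long long) cache->commit_seq_ts);
    cache->has_timestamp= false;
  }

  int main()
  {
    Binlog_cache stmt_cache= {"statement", 0, false};
    Binlog_cache trx_cache= {"transaction", 0, false};
    store_commit_parent(&stmt_cache, 41);  // non-transactional content
    store_commit_parent(&trx_cache, 42);   // transactional content
    flush_cache(&stmt_cache);  // non-transactional events are written first
    flush_cache(&trx_cache);   // and commit separately in the binary log
    return 0;
  }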
3. Coordinator scheduling
===========================
The coordinator stores the last seen commit parent and compares it with the
current transaction's commit parent. If the two are the same, the transaction
is scheduled immediately, i.e. into the same group.
If the two differ, the coordinator schedules the event only after the current
job group has finished.
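A minimal sketch of this decision; the member name follows section 1.1, while
the sentinel value and the function shapes are assumptions made for
illustration:

  #include <cstdint>

  // Sentinel meaning "no commit parent seen yet"; it forces a new group.
  static const uint64_t SEQ_UNSEEN= UINT64_MAX;   // assumed value

  struct Coordinator
  {
    uint64_t mts_last_seen_commit_parent= SEQ_UNSEEN;

    // Returns true if the transaction can join the group currently being
    // applied, false if the coordinator must first wait for that group to
    // finish and then open a new one.
    bool belongs_to_current_group(uint64_t commit_parent)
    {
      if (commit_parent == mts_last_seen_commit_parent)
        return true;                    // same commit parent: schedule now
      mts_last_seen_commit_parent= commit_parent;
      return false;                     // new group: wait, then schedule
    }

    // Section 5.1: a Format description event (change of master) resets the
    // last seen commit parent so that the next transaction starts a new
    // group.
    void reset_on_format_description_event()
    {
      mts_last_seen_commit_parent= SEQ_UNSEEN;
    }
  };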
4. WORKER EXECUTION
======================
The worker works the same way as for DB-partitioned MTS; however, the
temporary tables are handled differently.
Steps involved in handling temporary tables.
1. While assigning a worker thread to an event, the coordinator stores the
thread id and server id in worker_thd::pseudo_thread_id and
worker_thd::server_id, respectively.
2. Upon the commencement of the transaction execution, temporary tables are
attached to the worker thread. This is done by sequentially checking the
temporary tables of the coordinator for the server id and the thread id.
3. All the tables in the coordinator's list matching the pseudo_thread_id of
the worker are attached to the worker.
4. Upon transaction commit, the temporary tables of the worker are returned to
the coordinator.
5. Since coordinator->temporary_tables is shared between the slave workers, we
need to isolate the attaching and detaching processes. For this reason we add
a new lock, mts_temp_table_LOCK, which must be held while attaching or
detaching the temporary tables. A sketch of these steps is given below.
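The following self-contained sketch only models steps 2-5 with standard
containers; the list, structure and function names are illustrative
assumptions and do not correspond to the server's THD/temporary-table code
(only mts_temp_table_LOCK comes from section 1.1):

  #include <cstdint>
  #include <list>
  #include <mutex>

  struct Temp_table
  {
    uint32_t pseudo_thread_id;   // thread id recorded by the coordinator
    uint32_t server_id;
  };

  std::mutex mts_temp_table_LOCK;                 // added in section 1.1
  std::list<Temp_table *> coordinator_temp_tables;

  // Steps 2-3: at the start of the transaction the worker takes over every
  // temporary table whose pseudo_thread_id and server_id match the values
  // the coordinator assigned to it.
  static std::list<Temp_table *>
  attach_temp_tables(uint32_t pseudo_thread_id, uint32_t server_id)
  {
    std::lock_guard<std::mutex> guard(mts_temp_table_LOCK);
    std::list<Temp_table *> mine;
    for (auto it= coordinator_temp_tables.begin();
         it != coordinator_temp_tables.end(); )
    {
      if ((*it)->pseudo_thread_id == pseudo_thread_id &&
          (*it)->server_id == server_id)
      {
        mine.push_back(*it);
        it= coordinator_temp_tables.erase(it);
      }
      else
        ++it;
    }
    return mine;
  }

  // Step 4: upon commit the worker returns its temporary tables to the
  // coordinator's shared list.
  static void detach_temp_tables(std::list<Temp_table *> &mine)
  {
    std::lock_guard<std::mutex> guard(mts_temp_table_LOCK);
    coordinator_temp_tables.splice(coordinator_temp_tables.end(), mine);
  }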
5. Handling special situations/cases
=====================================
5.1 Handling change in topology
==================================
When a Format description event is encountered,
rli->mts_last_seen_commit_parent is reset to its uninitialized value, which
forces the coordinator to start a new group.
5.2. Handling skipped events
==============================
Assume the commit parent number is stored in an N-bit integer. If exactly
2^N-1 transactions were skipped, the commit parent number could wrap around
and reach exactly the same value as that of the last non-skipped transaction.
We will not have this problem, since the coordinator starts a new group every
time a format descriptor event is encountered; and since format descriptor
events are never skipped, this scenario can never arise.
5.3 Handling online alteration of schema and ISOLATION levels
==============================================================
Schemas may be altered online (e.g. ALTER TABLE ...) while DMLs are executing.
This means we may have interleaved DDLs and DMLs in the binary log. The
current implementation takes care of this, since a schema alteration and a DML
on the same schema cannot run simultaneously on the slave: they will never be
part of the same binlog group.
Also, in case of different isolation levels on the master and the slave, only
row-based replication is possible between them. This means that in tricky
situations such as snapshot isolation, mixed mode will switch to RBR and there
will be no problem. Furthermore, in case of SBR the slave SQL thread reports
an error.