WL#6314: MTS: Prepared transactions slave parallel applier

Affects: Server-5.7 — Status: Complete

Description
Dependent Tasks
High Level Architecture
Low Level Design

In this WL we will propose a method of enabling inter-transaction 
multi-threaded slaves. After this feature is implemented the slave will
be able to apply transactions in parallel, even within a database, as long
as they are non contending(have disjoint read and write set). 

When executing multiple transaction in parallel on the master, there
might be several transactions being committed as the same time. This leads to 
contention on the slave side, but we can leverage this to  execute transactions in 
parallel on the slave. We can execute transactions in parallel if we knew that the 
y do not interfere with each other, i.e. their read/write sets are disjoint. All 
the transactions committing together, are guaranteed to be modifying disjoint read 
and write sets and will not interfere with each other and can be applied in 
parallel. This allows us to avoid parsing the read and write sets for each of the 
transactions, which is normally costly.

RATIONALE
=========

The throughput of the slave applier infrastructure/procedure is not optimal
since it does not fully realizes the potential of its multi-threaded nature
currently. Transactions are only applied in parallel if they do not touch the
same schema(s).

This worklog lifts this limitation by implementing a mechanism to apply
transactions in parallel, even within the same database. The only requirement
is that at commit time they are non-conflicting, thence are free to commit
in the same snapshot of the database. 

A detailed analysis of possible policies for determining which transactions can 
execute in parallel on the slave is done in the attached text file 
intra_schema_mts_policies. In the first version of this worklog we are 
implementing Policy 4 from this file.

WL#6813: MTS: ordered commits (sequential consistency)

1 General logic
=================
Since MySQL is using a lock-based scheduler, all threads that are in the prepare
phase but have not as yet committed can be executed in parallel on the slave
without violating the consistency.

All transactions should be marked with a logical time-stamp, which identifies 
the last transaction that was committed when the current transaction entered the 
prepare stage. Details of this logical time stamp is given in the next section.
On the slave side all the transactions with the same time-stamp can execute in 
parallel. 

2 Master Side 
================
On master side the commit parent time-stamping can be done by using a Lamport 
clock

We implement a Logical clock for commit parent timestamp in the SQL engine 
layer.

The logic of the same is given by the following pseudocode.

Global:: Logical_clock commit_clock; 

2.1 in prepare stage:
   >> Fetch the time-stamp from the commit_clock, this will be stored as the 
commit parent of the transaction.
2.2 in commit stage: /* after the transaction is written to the binlog before 
the low-level commit */
   >> step the commit_clock; 


3 Slave Side 
===============
On the slave side, the coordinator thread will group the events based on the 
commit parent (i.e. transactions with same commit parent will be in the same 
group). All transaction in a group can be executed in parallel.

3.1  Event scheduling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The events are scheduled to worker threads by picking a worker from the list of 
idle threads. if none are found the coordinator waits. 

3.2 Problems 
~~~~~~~~~~~~~~~
Since Coordinator waits after each group, in case the groups are small, the 
over-head of scheduling the events and waiting for the workers to finish may 
override the performance improvement  while applying events in parallel.
The best performance can be guaranteed when the number of clients doing writes 
ion master is high.

3.3 Proposed changes 
~~~~~~~~~~~~~~~~~~~~~~~
1. We will use the existing infrastructure of the slave workers and the
   coordinator. The change however will be to ignore the database partitioning
   information.
2. The thread association with a database will no longer be used. We will
   schedule the tasks in a group by assigning the threads in a round-robin
   method.
3. The coordinator will be blocked to make sure that the previous group has been
   applied before the event in the next group is scheduled. During this  time
   coordinator will do periodic check-pointing

3 New Options:
===============
3.1. On slave we should have a system variable 

slave_parallel_type=[logical_clock|database]

The option can only be changed after a stop slave;


REPLICATION CONSISTENCY ANALYSIS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

BACKGROUND
==========

- Replication Consistency Criteria [1][2]

  1. Linearizabile or Atomic Consistency

     If a single client forwards a transaction Ta to a replica Ri and
     gets the result of Ta , any other transaction Tb sent later by
     this same client to any other replica Rj should be able to read
     the updates caused by Ta , assuming that no other transaction is
     submitted to the system between Ta and Tb .
     
  2. Sequential Consistency

     The result of any execution is the same as if the operations of
     all the processors were executed in some sequential order, and
     the operations of each individual processor appear in this
     sequence in the order specified by its program.

     So, it can be implemented using FIFO total order for applying all
     write operations in system replicas. Note that this does not
     avoid the problem outlined in [24], since sequential consistency
     ensures that all updates will be applied following the same
     sequence in all replicas. However, if replica Rj is overloaded
     and holds a long queue of pending updates (to be applied in the
     database), it might serve the first read accesses of Tb before
     applying the updates of Ta and, of course, before locally
     committing Ta.

  3. Causal consistency (Cache Consistency)

     This model only requires that accesses are sequentially
     consistent on a per-item basis.

     There are some replication protocols [...] that are able to
     comply with the requirements of this model but provide a
     consistency slightly higher, but that does not correspond to any
     already specified model. Such protocols are based on total order
     update propagation, but they allow that writeset application
     breaks such total order when writesets do not conflict (i.e.,
     there are no write-write conflicts) with any of the
     previously-delivered but not-yet-committed transactions.  Note
     that this ensures a per-item sequential consistency (as requested
     in the cache model), but also a per-transaction-writeset
     consistency (i.e., we can not commit half of a writeset WSA
     before writeset WSB and the other half of WSA afterward),
     although not a complete sequential consistency.

ANALISYS
========

1. MySQL Asynchronous Replication and Single-threaded Applier

   - Details
 
     All backups/slaves execute the same transactions in the same
     order. No two different slaves execute the same two transactions
     in a different order.

   - End user impact
   
     Eventually, the user will see the same execution history on every
     slave. The commit history will match that of the master.

   - Consistency Criterion
     
     Sequential Consistency.
 
2. MySQL 5.6 Asynchronous Replication and Multi-threaded Applier

   - Details

     All backups/slaves executing transactions T1 and T2 on schema S
     will will apply T1 and T2 on the same order. In other words, no
     two different slaves executing the same two transactions on the
     same schema will commit them in a different order. Transactions
     changing different schemas are considered concurrent and can
     commit in a different order at two different slaves.

   - End user impact

     Eventually, and if updates stop on the master, the state on the
     slaves will converge to the same state, which matches that of the
     master.  While updates are ongoing, different execution histories
     can be observed on different slaves, and may be different from the
     execution history on the master. Execution histories differ only
     w.r.t. databases.

     Application invariants and semantics that require sequential
     consistency between all servers in the replication topology may
     be broken, but only if these semantics/invariants cross-reference
     schemas.

   - Consistency Criterion

     Causal Consistency. Causality is determined by the schema on
     which transactions operate.

3. MySQL 5.6 Asynchronous Replication and Multi-threaded Applier 

   - Details

     All backups/slaves executing transactions T1 and T2 marked as
     having prepared on different commit parents will apply T1 and T2
     on the same order among themselves. In other words, no two
     different slaves executing the same two transactions that
     prepared on different commit parents will commit them in a
     different order. Two transactions prepared on the same commit 
     parent can commit in different order at different slaves.

   - End user impact

     Eventually, and if updates stop on the master, the state on the
     slaves will converge to the same state, which matches that of the
     master.  While updates are ongoing, different execution histories
     can be observed on different slaves, and may be different from
     the execution history on the master. Execution histories differ
     w.r.t. transactions that are concurrent and prepared on the same
     commit parent.

     Application invariants and semantics that require sequential
     consistency between all servers in the replication topology may
     be broken.

   - Consistency Criterion

     Causal consistency. Causality is determined by the snapshot on
     which transactions prepare.

4. MySQL 5.6 Asynchronous Replication and Multi-threaded Applier


   - Details

     All backups/slaves executing transactions T1 and T2 marked as
     having prepared on the same or different commit parents will 
     apply T1 and T2 on the same order among themselves. In other words
     no two backups/slaves will externalize commit transactions in a 
     different order.

   - End user impact

     Eventually, the user will see the same execution history on every
     slave. The commit history will match that of the master.

   - Consistency Criterion

     Sequential consistency. 

REFERENCES
==========

[1] http://dl.acm.org/citation.cfm?id=1693858
[2] http://web.iti.upv.es/~fmunyoz/research/pdf/TR-ITI-SIDI-2009003.pdf
[24] http://dl.acm.org/citation.cfm?id=1141442

1. CHANGES TO DATA-STRUCTURES  
=============================


1.1. On the slave side
=============================
  class Slave_job_group

    ++  ulonglong parent_seqno;
        ulonglong total_seqno;

  class Relay_log_info
  
    ++  ulonglong mts_last_known_parent_group_id;
    ++  ulonglong mts_last_seen_commit_parent;
        /*lock for attaching and setaching tables from temporary tables.*/
    ++  mts_temp_table_LOCK

  class Gtid_log_event
    ++  ulonglong commit_parent

  class Query_log_event
    ++  ulonglong commit_parent

1.2. On the master side
================================
    struct IO_CACHE
     ++ ulonglong commit_seq_ts
        int commit_seq_offset

  

2. COMMIT Timestamp in events:
===============================
We store the commit parent timestamp in the GTID events or in the first Query 
log event of the transaction(if GTID is disabled).

- For GTID log event  we store the commit parent after the GID and GNO
- For Query log event we store the commit parent in the status var section.

initially when the commit timestamp is available it is stored in the 
corresponding cache since we have different caches for transactional and non-
transactional events. Also we may have non-transactional events interleaved 
between transactional events (in which case, the non-transactional events will 
be written to the binary log before the transactional events.) and they will 
commit saperately on the binary-log.


3. Coordinator scheduling
===========================
The coordinator stores the last seen commit parent and compares it with the 
current transaction's commit parent. If the two are found to be the same then 
the transaction is scheduled immediately(i.e.) in the same group.
If the two are found to be different then the coordinator schedules the 
event after the current job  group has finished.

4. WORKER EXECUTION
======================
The worker works the same way as for DB partitioned MTS. However the the 
temporary tables are handled differently.
 
Steps involved in handling temporary tables.

1. While assigning the worker thread to an event, the coordinator stores the 
thread id and server id in worker_thd::pseudo_thread_id and 
worker_thd::server_id respectively
2. Upon the commencement of the transaction execution, temporary tables are 
attached to the worker thread. This is done by sequentially checking the 
temporary tables of the coordinator for the server id and the thread id.
3. All the tables in the coordinator's list matching the pseudo_thread_id of the 
worker is attached to the worker.
4. upon transaction commit, the temporary tables of the worker are returned to 
the coordinator.
5. Since coordinator->temporary_tables is shared between the slave workers, we 
need to isolate both attaching and detaching process. for this reason we add a 
new lock mts_temp_table_LOCK which should be locked before attaching or 
detaching the temporary tables.

5. Handling special situations/cases
=====================================  
   
5.1 Handling change in topology
==================================
In case a Format descriptor event is encountered 
rli->mts_last_seen_commit_parent to uninitialized, which forces the coordinator 
to start a new group. 

5.2. handling skipped events
==============================
Assuming the commit parent number is stored in an N-bit integer,
When exactly 2^N-1 transactions are skipped, so that the commit parent number 
wraps around and reaches the exact same value as the last non-skipped 
transaction.

***we will not have this problem since coordinator starts a new group every time 
a format descriptor event is encountered. And  since the format descriptor  
events are never skipped, we will never have this scenario.

5.3 Handling online alteration of schema and ISOLATION levels
==============================================================
Schemas may be altered online(e.g. ALTER TABLE...) while executing DMLs. This 
means we will have interleaved DDls and DMLs in the binary log. But the current 
implementation will take care of the same since a schema alter and a DML on the 
same schema cannot run simultaneously on the slave since they will never be a 
part of the same binlog group.

Also In case of different isolation levels on master and slaves, it is only 
possible to have Row based replication between the master and the slave. This 
means that in tricky situations like snapshot isolation, mixed mode will switch 
to RBR and there will be no problem. Furthermore in case of SBR  the slave SQL 
thread reports an error.