WL#5569: Replication events parallel execution via hashing per database name
Affects: Server-5.6
—
Status: Complete
High-Level Description

MOTIVATION
----------
In many cases the user application logically partitions data per database,
allowing some flexible patterns of applying replication events on the slave.
In one important use case, updates within a single database must follow the
master total order, but any two databases do not have to coordinate their
applying.

OBJECTIVE
---------
Given the user application's logical partitioning per database, the goal is
largely the same as in WL#943 and WL#4648: to scale up slave performance by
engaging multiple threads to apply replication events. The improvement in
performance effectively decreases the slave's lag behind the master.
A prototype implementation, subject to a few technical limitations, is
considered in WL#5563.

FEATURES
--------
Parallelization by db name does not affect the binlog format of events;

Transactions partitioned to different worker threads do not have commit
coordination;

Parallelization by db name is transparent to the engine type;

Automatic switching from the concurrent PARALLEL mode to a temporary
SEQUENTIAL mode in special cases when events can't be executed in parallel.
Once a special case is handled, the slave automatically restores the
concurrent PARALLEL mode.

NOTE_1: An event that updates multiple databases does not necessarily mean
        such switching.
NOTE_2: There are a few event types like Format_Descriptor and Rotate that
        are executed by the standard Single-Threaded Slave algorithm (that
        is by Coordinator). They do not introduce any throttle to
        concurrency as long as binary logs are not rotated frequently.

Internal Goals
--------------
To create a framework facilitating further extension towards support of
more use cases of partitioning, e.g. per table.

Conceptual limitations and differences from the single thread
--------------------------------------------------------------
a. This framework relies on WL#2775 implementing a base for automatic
   recovery. The transactional storage keeping the descriptor of the
   last-updated info is extended to also record an array of last-executed
   group positions of the worker threads (see more in the Recovery section).

b. Status info displayed via SHOW SLAVE STATUS will largely remain the same.
   In particular that relates to the monotonic growth of EXEC_MASTER_LOG_POS.
   However, the parameter will turn into a Low-Water-Mark (LWM). The LWM
   indicates that before (in the master logging order sense) the last
   committed transaction or executed group of events there are no
   not-yet-committed transactions/groups. But there can be some committed
   after the LWM. SECONDS_BEHIND_MASTER can't be updated per each event;
   updating happens in intervals along with EXEC_MASTER_LOG_POS (see LLD for
   more details).

c. Additional info on the status of individual workers can be provided via
   the mysql.SLAVE_WORKER_INFO table of WL#5599.

d. May not be supported:
   - transaction retry after a temporary failure
   - START SLAVE UNTIL
   - foreign key constraints if the referenced table is in a different
     database (see WL#6007). Currently, the MTS user has to make sure that
     possible FK-dependencies do not cross over the explicit set of table
     databases. Otherwise it could lead to deadlocks involving slave worker
     threads and to data inconsistency.
   - LOAD-DATA related events from a pre-5.0 master server version

e. A master vulnerable to BUG#37426 can't be detected by MTS. This is a
   technical issue that could be lifted later.

Recovery model
--------------
The parallel execution mode does not change in principle the recovery for
DDL or any other non-transactional events.
In this case the recovery routine is deemed to be manual. To make recovery
as simple as possible, when the user expects failures and is accustomed to
some application-specific recovery procedure, a DDL can (optionally) force
the slave to apply it in the SEQUENTIAL mode, waiting first for the
previously distributed parallel load to be completed by the Workers.

This worklog requires some extensions to the WL#2775 slave system tables for
automatic recovery. The relay-log-info table is supposed to be augmented
(WL#5599) to contain each worker's last executed group position. Another
change is meant for the last executed group position of the single SQL
thread, which is converted into the low-water mark (LWM) of executed group
positions among all Workers; it indicates that no gaps in committed groups
exist before it.

Depending on the binary log events format the following methods can be
applied:

1. (Not to be implemented here) Idempotent recovery when
   binlog_format == ROW.
   The low-water-mark last executed group position that the Coordinator
   maintains alone can suffice. It is regarded as an anchor to restart
   applying events the same way as the last executed group position of the
   single SQL thread does. Possible errors like Dup-key, Key-not-found and
   a few more are benign and will be ignored. The idempotent applying policy
   changes to the user-preferred one as soon as the low-water mark has
   reached the max group position among workers. This type of recovery does
   not require the Worker pool and can be conducted solely by the
   Coordinator. Notice a great advantage of this method: although bound to
   the ROW format, it is transparent to the engine type.

2. General recovery regardless of the binlog format (read more on that in
   WL#5599).
   This method leaves aside the mentioned DDL and non-transactional
   Query-log-events and applies to the transactional Rows- and
   Query-log-events. This type of recovery involves information that
   Workers left in their info-table. In short, MTS recovery reads the
   Workers' info and constructs a set of not-applied transactions that
   follow the LWM. Starting from the low-water-mark the Coordinator parses
   events to ignore those that were committed and applies the rest.

Favorable environment
----------------------
While Statement-format events including Query-log-event and all types of
engine will be supported, the primary concern is achieving performance
scale-up under the following conditions:

- InnoDB storage engine;
- ROW binlog format;
- Transactions operate on disjoint sets of databases (i.e., they do not
  conflict and can commit independently).

Sketch of Architecture
----------------------
It is assumed for now that any replication event can be partitioned.
Query-log or Rows-log events to different databases fall into that category
with partitioning per db name. Such a partitioning model suggests equipping
the single SQL thread system with a worker thread pool. The SQL thread
changes its current multi-purpose role to serve as their Coordinator.
Interaction between Coordinator and Workers follows the producer-consumer
model, pretty close to what had been implemented in WL#4648 (in fact that
piece of work can be reused).
Pseudo-code for Coordinator:

  // Coordinator, the former SQL thread, keeps operating as in the single
  // thread env, i.e. it stays in the read-execute loop, but instead of
  // executing an event through apply_event_and_update_pos() it distributes
  // tasks to workers.

  while (!killed)
    exec_relay_log_event();

  where

  exec_relay_log_event()
  {
    if (workers_state() is W_ERROR)
      return error;
    ev= read();
    switch (mts_execution_mode(ev))
    {
    case TEMPORARY_SEQUENTIAL:
      wait_for_workers_to_finish();
      /* fall through */
    case PARALLEL_CONCURRENT:
      engage_worker(ev, w_thd[worker_id]);
      break;

    // Some event types like Format_Descriptor, Rotate the Coordinator
    // executes itself. Depending on the event property its execution
    // can require synchronization with Workers.
    case SEQUENTIAL_SYNC:
      wait_for_workers_to_finish();
      /* fall through */
    case SEQUENTIAL_ASYNC:
      apply_event_and_update_pos();
    }
  }

Pseudo-code for Worker:

  // The worker stays in the wait-execute loop
  while (!killed)
  {
    ev= get_event();      // wait for an event to show up
    exec_res= ev->apply_event(rli);
    ev->update_pos();     // updating the worker's info-table with the
                          // task's group position
    report(exec_res);     // asynch status communication with Coordinator
  }

The worker pool is activated via START SLAVE. STOP SLAVE terminates all
threads including the worker pool.
WL#2775: System tables for master.info, relay_log.info
WL#5563: MTS: Prototype for Slave parallelized by db name
WL#5754: Query event parallel execution
Requirements:
=============
R0. To provide the user interface for the parallel execution mode including
A. @@global.slave_parallel_workers
that specifies the number of worker threads.
A zero number of workers, the default, designates no worker pool and
therefore the legacy sequential execution mode, or Single-Threaded-Slave
(STS).
Use cases:
U.A: set @@global.slave_parallel_workers= `value`;
(see LLD section for the details of the variable definition)
R1. To provide transparent detection of limitations for parallelization
and to perform corresponding actions for such events in
the following cases:
A: the hash function mapping conflicts;
U.A.1: a transaction spanning multiple databases
Let's consider multi-statement transactions T_1, T_2, T_3, where
T_1: begin; update db1; commit;
T_2: begin; update db1; commit;
T_3: begin; update db1; update db2; commit;
Verify that the correct applying takes place and the parallel exec mode
resumes after T_3 has committed.
U.A.2: a query crossing databases
set autocommit = ON;
update db1.table, db2.table ...;
In case the number of databases exceeds a maximum, the whole hash
will be locked to the Worker executing the query, which reduces
parallelism.
B: the exceptional set of events to be executed by Coordinator;
U.B: Events limited to sequential execution by Coordinator include
all event types in the list:
STOP_EVENT,
ROTATE_EVENT,
FORMAT_DESCRIPTION_EVENT,
INCIDENT_EVENT
C: events from an old master with reduced parallelism;
U.C: A query event from an MTS-unaware old master is treated as if the
query crosses the maximum number of databases of case U.A.2.
D: events that can't be supported in MTS;
There are cases when MTS starts reading events not from the boundary of
a group. Unfortunately that can happen with a legacy log file created by
an old server.
U.D: LOAD-DATA events generated by a 4.0 or 4.1 master. MTS execution will
stop with an error.
R3. To deliver an MTS implementation that scales up.
To annotate the favorable as well as the least performant use cases.
use case:
U: to demonstrate when the max performance is achieved as a
function of parameters including:
the number of computing units, databases, Workers,
the binary log format, and engine types.
NOTE_1: the crash-recovery requirement is addressed by WL#5599.
NOTE_2: monitoring status is considered by other WLs,
both the individual per-Worker info and the aggregate info
on the overall performance, incl. the LWM.
A P_S Worker table (WL#5631)
holds all the per-Worker info including execution times,
the last group executed by a worker, the total tasks performed by a
worker, the currently assigned queues and the status of the queues.
The applying progress of an individual Worker can be monitored via the
mysql.SLAVE_WORKER_INFO table of WL#5599.
NOTE_3: an option to set up the database distribution manually is not
implemented here.
Manual partitioning would allow the user to bind two or
more specific databases to be served through one specific queue,
thereby accomplishing a many *dependent* databases to one Worker
association.
A similar idea is to implement a priority mechanism for applying
events from different queues.
Replication events applying phases, tasks decomposition, mapping
----------------------------------------------------------------
The sequential execution of an event breaks into five basic activities
as the following diagram represents
+--------------------------------------------------------+--------------------+
| | |
| +---------------+ +--------------+ +-------------+ | |
| | | | | | | | |
| | Read | | Construct | | Skip | | COORDINATOR |
| | Event --+---+> --+---+-> Until | | STAGES |
| | (relay-log) | | Event | | Delay | | |
| | | | | | | | (There is only |
| +-----^---------+ +--------------+ +-------+-----+ | one instance |
| ! | | of a coordinator) |
|-------+----------------------------------------+-------+--------------------+
| | | | |
| | +--------------+ +-------+-----+ | |
| | | | | V | | WORKERS |
| | | | | | | STAGES |
| +-------------+ COMMIT <+---+-- EXECUTE | | |
| | | | | | (multiple workers |
| | | | | | can exist at the |
| +--------------+ +-------------+ | same time) |
| | |
+--------------------------------------------------------+--------------------+
- Coordinator: reads events from relay log, instantiates them,
handles the skip/until/delay stages and then
schedules them to the correct worker
- Worker: executes events, committing statements and
whole transactions
The relation between coordinator and workers is: 1-N, where 1 is
the one and only coordinator thread and N is the number of threads
in the workers pool.
Agglomerating Read, Construct and Skip-Until-Delay into
one thread is reasonable considering the three tasks are light and much
of the existing code can be reused.
The Execute and Commit tasks are fit for delegating to Worker threads.
Indeed, splitting Execute into pieces cannot really be
backed up by any reasonable requirement.
The Commit task is actually a part of Execute, and
the last executed group position is a field of the RLI table to be updated
and committed within - at the end of - the master transaction.
So agglomerating the two to be carried out by one Worker is a reasonable
choice.
Communication mechanism between C and W consists of an assignment
Worker Queue (WQ) and Assigned Partition Hash (APH) which correspond
to the downward arrow from C to W of the diagram.
For recovery purpose there is yet another Groups Assignment Queue
(GAQ) that corresponds to the upward arrow.
A WQ holds events scheduled for execution that belong to a certain
partition. Coordinator calls a hash function on an event instance in
order to determine a partition identifier P_id.
The partition to Worker mapping is controlled with the APH hash.
The hash contains (P_id, N, W_id) tuples, where N is the number of ongoing
*groups* of events.
APH is complemented with a Current Group Assigned Partitions (CGAP)
object on the Coordinator's side that lists all P_id:s found during the
current group's assigning. Multiple events of a group map to one P_id as
long as only one database is updated. There can be many CGAP items if many
databases are updated by the group. Each item of the list forces either
an insert or an update of a tuple in the APH at scheduling time, as
follows.
On the Coordinator side, after computing P_id, Coordinator first searches
CGAP for an existing item.
If the search is positive, APH does not change.
If the search is negative, a new record with P_id is first inserted into
CGAP. Then the P_id-keyed tuple is searched for in APH.
If not found, a new (P_id, 1, W_id) tuple is inserted; W_id is determined
as a new Worker's id if the max pool size allows spawning one more thread,
and as the least occupied Worker's id otherwise.
Otherwise the found record is updated by incrementing N.
Finally, if the event is the current group terminator, all items in the
CGAP are discarded.
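
As an illustration only, below is a minimal C++ sketch of this
Coordinator-side bookkeeping, using std:: containers in place of the
server's DYNAMIC_ARRAY/HASH structures. The names (Coordinator_sketch,
schedule(), etc.) are hypothetical, and the fallback logic that keeps all
partitions of one group on the same Worker (see II.1) is omitted.

  #include <algorithm>
  #include <map>
  #include <set>
  #include <string>
  #include <vector>

  // Hypothetical names; (N, W_id) tuples keyed by the partition id P_id.
  struct ApTuple { unsigned long n_groups; size_t w_id; };

  struct Coordinator_sketch
  {
    std::map<std::string, ApTuple> aph;  // Assigned Partition Hash
    std::set<std::string> cgap;          // Current Group Assigned Partitions
    std::vector<unsigned long> load;     // per-Worker assignment counters
    size_t max_workers;                  // assumed >= 1

    Coordinator_sketch(size_t max_w) : max_workers(max_w) {}

    // Decide which Worker executes an event that maps to partition p_id.
    size_t schedule(const std::string &p_id)
    {
      if (cgap.count(p_id))                  // already assigned in this group:
        return aph[p_id].w_id;               // APH stays untouched
      cgap.insert(p_id);

      std::map<std::string, ApTuple>::iterator it= aph.find(p_id);
      if (it != aph.end())
      {
        it->second.n_groups++;               // one more ongoing group (N++)
        return it->second.w_id;
      }
      size_t w_id;
      if (load.size() < max_workers)         // the pool may still grow
      {
        w_id= load.size();
        load.push_back(0);
      }
      else                                   // else the least occupied Worker
        w_id= std::min_element(load.begin(), load.end()) - load.begin();
      load[w_id]++;
      ApTuple t= { 1, w_id };
      aph[p_id]= t;
      return w_id;
    }

    // The current group terminator discards all CGAP items.
    void end_of_group() { cgap.clear(); }
  };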
On the Worker's side, when the current group is being committed, N is
decremented in all the group's tuples that Coordinator inserted or
updated. A tuple is (optionally) removed from APH once N becomes zero.
The partition identifiers of the group are collected in the Current Group
Executed Partitions (CGEP) list. That is the Worker's-side complement to
the APH object, behaving analogously to the Coordinator's CGAP.
Thus each item of the CGEP list forces an update of N and possibly a
deletion of the corresponding tuple from APH at group commit time.
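
A complementary sketch of the Worker-side bookkeeping at group commit time;
for self-containment only the `N' usage counters of APH are modelled here,
and all names are hypothetical.

  #include <map>
  #include <string>
  #include <vector>

  // Hypothetical Worker-side view of APH: P_id -> number of ongoing groups.
  struct Worker_sketch
  {
    std::vector<std::string> cgep;      // Current Group Executed Partitions

    // Remember every partition id this Worker executed within the group.
    void note_partition(const std::string &p_id) { cgep.push_back(p_id); }

    // At group commit time decrement N for every CGEP item and (optionally)
    // erase the tuple once no ongoing group references the partition.
    // In the real design APH access is protected by its mutex (omitted).
    void commit_group(std::map<std::string, unsigned long> &aph_usage)
    {
      for (size_t i= 0; i < cgep.size(); i++)
      {
        std::map<std::string, unsigned long>::iterator it=
          aph_usage.find(cgep[i]);
        if (it != aph_usage.end() && --it->second == 0)
          aph_usage.erase(it);          // partition is free for re-assignment
      }
      cgep.clear();
    }
  };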
An event propagation is depicted as the following
+---------+----------+
++-------------++ | Queue_1 + Worker_1 |
|| || +---------+----------+
+---+ || Coordinator || hash(E) +---------+----------+
| E +--->|| |+----------->| Queue_2 + Worker_2 |
+---+ || || +---------+----------+
++-------------++ ...
+---------+----------+
| Queue_n | Worker_n |
+---------+----------+
where hash(E): E -> P_id, and Queue_i contains zero or more P_id:s.
A worker that was attached to the P_id executes the event and updates
the RLI recovery table if the event is the current group terminator.
Although workers don't have any communication with each other,
they coordinate their commits to RLI as explained further below.
A WQ has a maximum size threshold.
The limited WQ makes Coordinator wait until the actual size of
the target WQ decreases below the limit.
Notice there are a few constraints to this type of
inter-transaction parallelization:
- the entire transaction is mapped to at most one Worker;
- a few types of the Log_event class (see R1 use cases) are executed in
the Sequential mode.
Those types represent internal replication facilities and can't
have any significant impact on performance (see LLD for further
details);
- workers still have common resources such as the RLI recovery table
and therefore are subject to a few implicit synchronizations.
The Commit task has a strong connection with WL#5599. When a Worker
commits to RLI it is supposed to update its own last committed position
and possibly the common Low-Water-Mark.
That is, the Worker's role includes carrying out a group commit to RLI.
In order to facilitate that, C and {W} maintain yet another Groups
Assignment Queue (GAQ) that the Worker updates atomically with RLI.
GAQ contains tokens for each group of events being processed by the
different Workers.
A queue item consists of
a sequence number of the group in the master binlog order,
the partition id, the assigned Worker and a status - Committed or not.
The queue's head is the next-to-the-Low-Water-Mark (NLWM) group.
It separates the already committed groups without gaps from the rest,
which thereby form a sequence of still ongoing and committed groups
starting from the not-yet-committed NLWM.
C enqueues an item to the GAQ once it
recognizes a new group of events is being read out. W marks the group
as Committed, and if its assignment index immediately follows the index
of the current NLWM, W rearranges the head of the queue as in the
following example.
Let's look at a use case with 3 Workers identified as W#1, W#2, W#3.
On the following diagram, let the NLWM, the head of the GAQ, contain a
group being executed by Worker #2.
Assume further that Coordinator distributed 5 groups as follows
| 0 | 1 | 2 | 3 | 4
+---------+----------+----------+--------+----------+
| W #2 | W #1 | W #1 | W #2 | W #3 |
| | | | | |
+---|-----+----------+-----+----+--------+----------+
here the first number row gives the sequence numbers in the master
binlog order.
Suppose the first task of W #1 and the task of W #3
got committed by the time Worker #2 is committing.
W #2 has to rearrange GAQ
| 0 | 1 | 2 | 3 | 4
+---------+----------+----------+--------+----------+----
| W #2 | W #1 | W #1 | W #2 | W #3 | ...
|committin| Committed| | | committed|
+---|-----+----------+-----+----+--------+----------+---
| |
+--- NLWM------------->+
to discard its own group and the committed group next to it, which is
the group with the sequence number 1. The new NLWM will shift to point
at group 2 and the queue becomes
| 2 | 3 | 4 |
+---------+--------+----------+
| W #1 | W #2 | W #3 |
| | | committed|
+---------+--------+----------+
Now in case W #1 commits the 2nd group
| 2 | 3 | 4 |
+---------+--------+----------+
| W #1 | W #2 | W #3 |
|committin| | committed|
+---------+--------+----------+
the queue transforms to
| 3 | 4 |
+--------+----------+
| W #2 | W #3 |
| | committed|
+--------+----------+
and at last when W #2 commits it becomes empty.
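
For illustration, a hedged C++ sketch of that head rearrangement follows;
the real GAQ is a fixed-size circular buffer guarded by a mutex, while the
sketch uses a std::deque and invented names.

  #include <cstddef>
  #include <deque>

  // Hypothetical sketch of the GAQ head rearrangement a Worker performs
  // after marking its own group as Committed.
  struct Gaq_item
  {
    unsigned long long seqno;   // sequence number in the master binlog order
    size_t w_id;                // assigned Worker
    bool committed;
  };

  struct Gaq_sketch
  {
    std::deque<Gaq_item> q;     // head == the next-to-LWM (NLWM) group

    // Mark group `seqno' committed, then pop every leading committed item.
    // The new head is the first still-uncommitted group, i.e. the new NLWM.
    void commit_and_advance(unsigned long long seqno)
    {
      for (size_t i= 0; i < q.size(); i++)
        if (q[i].seqno == seqno)
        {
          q[i].committed= true;
          break;
        }
      while (!q.empty() && q.front().committed)
        q.pop_front();          // LWM moves forward, no gaps remain behind it
    }
  };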
Invariants of Parallel and Sequential execution modes
-----------------------------------------------------
++-------------++
+-----------------+ || ||
| |Execute || Worker ||
| Event +------------+| ||
| | || ||
| | ++------+------++
| | | distribute
| do_apply_event()|Execute ++------+------++
| +------------++ ||
+-----------------+ || Coordinator ||
|| ||
|| ||
++-------------++
Whichever of Coordinator or Worker executes an event, it must
call the Event::do_apply_event() method.
The method is the core of the Execute task.
The diagram above shows that an event can be applied through
Coordinator::distribute + Worker::execute (the Parallel exec mode) or
through Coordinator::Execute. The event class hierarchy does not
receive any changes due to the new Parallel execution mode, thanks to
the Worker execution context design.
Error Reporting (to the End User)
=================================
In multi-threaded execution mode a Worker can hit an
error. At that point, it propagates that error to the
Coordinator. The most recent error that a Worker found, and that
forced the slave to stop, will eventually be displayed in "SHOW
SLAVE STATUS".
However, an important note needs to be raised here. When a Worker
stops because of an error, there may be the case that there will
be gaps in the execution history (some other Workers may have
already executed events that were serialized, by the master,
ahead of this one; some other, serialized before, might not even
have finished their execution). Thus apart from the Worker error
that is reported in SHOW SLAVE STATUS, we also issue a warning:
"The slave Coordinator and Worker threads are stopped, possibly
leaving data in inconsistent state. The following restart
shall restore consistency automatically. There might be
exceptional situations in the recovery caused by combination
of non-transactional storage for either of Coordinator or
Workers info tables and updating non-transactional data tables
or DDL queries. In such cases you have to examine your
data (see documentation for details)."
We chose this approach because of a SHOW SLAVE STATUS
limitation: it is not prepared to show more than one error,
and we don't want to shadow the original cause.
Future Work (Feature Request):
In the future we can improve the MTS reporting capabilities by
implementing an information/performance schema table which the
user can query to get details on the individual Worker's internal
state. Thus, we can promote the warning mentioned above to an
error, show it in SHOW SLAVE STATUS and then exhibit the
original cause (and the subsequent list of errors issued during
the entire execution of the offending statement, until the
Worker stops) in the [I|P]_S table.
This section elaborates on the following fine details of the project:
I. Basic objects, lifetimes, states, cooperation
0. Coordinator
1. Worker
2. Worker pool, lifetime, states
3. Coordinator <-> Worker communications;
WQ, GAQ interfaces
4. The load distribution hash function and its supplementary mechanisms
II. Execution modes and flow; use cases and their handling
0. WQ overflow protection
1. Fallback to the sequential exec mode
2. Deferred scheduling of BEGIN;
III. Interfaces for recovery
0. GAQ from the RLI-commit point of view for WL#5599
1. Executing recovery at starting the slave.
IV. User interfaces
I. Basic objects, lifetimes, states, cooperation
===================================================
0. Coordinator, roles
----------------------
It is implemented on the basis of the former SQL thread, in the way
of the WL#5563 prototype.
After Coordinator instantiates an event, it finds out its
partition/database information. For instance Coordinator does not
sql-parse a Query-log-event but rather relies on the info prepared
on the master side.
Coordinator decides that an event needs sequential execution
if the partition id can't be computed (more than one database with
the automatic hashing) or the event type belongs to the exceptional
kinds (see HLS).
If the event is parallelizable, Coordinator calls a hash function
to produce P_id, the identifier of a partition/database.
APH is searched for an existing record keyed by P_id; the found
record is updated, or a new one is inserted.
In the update case there already exists a Worker that remains to
handle the partition.
In the insert case the least loaded Worker is assigned.
A separate constant activity is to fill in GAQ each time a new
group is recognized; see further details on handling BEGIN.
Other activities include:
Invoking the recovery routine at the slave SQL/Coordinator thread
start, at bootstrap;
Watching the pool status, per event, for any Worker that has failed.
A noticed failure triggers marking the remaining workers with KILLED
and the Coordinator's exit;
Shutting down the Worker pool, at exit.
1. Worker
----------
Worker is implemented as a system thread that handles events much
like the sequential-mode SQL thread.
Its execution context is the Slave_worker class, based on
Relay_log_info, which corresponds to the execution context of the
SQL thread or Coordinator.
Worker calls a new Log_event::do_apply_event_worker(Slave_worker *w)
that forwards to Log_event::do_apply_event(w) for all subclasses of
Log_event but Xid_log_event.
Xid_log_event::do_apply_event(Slave_worker *), used when Worker
executes the event, differs from do_apply_event(Relay_log_info *)
not just in the type of the argument but rather in the logic of
modifying the slave info (recovery) table (see WL#5599).
The argument passed to do_apply_event() can call its own
methods. Those methods need to be declared virtual in the
Relay_log_info base class and defined specifically for
Slave_worker.
Example: report().
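
A minimal illustration of that virtual-method pattern; apart from the
Relay_log_info/Slave_worker/report() names taken from the text above,
everything here is a simplified stand-in.

  #include <cstdio>

  // Sketch only: the real classes carry much more state.
  class Relay_log_info
  {
  public:
    virtual ~Relay_log_info() {}
    // Declared virtual so the execution context decides how to report.
    virtual void report(const char *msg) const
    {
      std::printf("SQL thread/Coordinator: %s\n", msg);
    }
  };

  class Slave_worker : public Relay_log_info
  {
  public:
    // Worker-specific reporting, e.g. relayed asynchronously to Coordinator.
    virtual void report(const char *msg) const
    {
      std::printf("Worker: %s\n", msg);
    }
  };

  // do_apply_event() only sees the base type; the override is picked at
  // run time depending on who executes the event.
  void do_apply_event_sketch(const Relay_log_info *rli)
  {
    rli->report("applying an event");
  }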
2. Worker pool, lifetime, states
---------------------------------
The Worker pool is a DYNAMIC_ARRAY of THD instances, each attached to
an OS thread.
Worker and Coordinator maintain a WQ as a communication channel.
W and WQ associate 1-to-1, which justifies making WQ a
member of the Worker class.
The pool is initialized with a number of members according to
@@global.slave_parallel_workers at the time the SQL/Coordinator thread
starts.
The pool is destroyed by Coordinator at its shutdown time.
A worker resides in a wait-execute loop; the wait is for
an event in its WQ, execution is through ev->do_apply_event().
Commit or XID group terminators, or a self-terminating event
like a DDL Query_log_event, force an update of the Worker info table.
Committing the group affects GAQ as described in the HLS part.
A worker leaves the loop in case of being killed or erroring out.
KILLED is "hard", forcing an instant exit like after a
KILL query.
Optionally, not to be implemented here, there could be a soft KILLED
to make the Worker process all its assignments first and exit upon that.
The soft variant would be a facility to stop the slave with no gaps
after the LWM (see STOP SLAVE).
3. Coordinator <-> Worker communications; WQ, GAQ, APH interfaces
--------------------------------------------------------------
WQ and GAQ are implemented through an array of a fixed size in the
manner of a circular buffer. Here is a template definition:
typedef struct queue
{
DYNAMIC_ARRAY Q; // placeholder
mysql_mutex_t lock;
mysql_cond_t cond;
public:
uint en_queue(item*);
item* head_queue();
item* de_queue();
bool update(uint index, item*);
} Slave_queue;
`item*' stands for a reference to an opaque object corresponding
to the task assignment (WQ) or the group of events descriptor (GAQ).
Coordinator calls WQ.en_queue(); Worker calls head_queue() and
de_queue(). Phasing the dequeue operation into two calls is required
by the specifics of the Execute logic.
Coordinator calls ind= GAQ.en_queue() when a new BEGIN is found in the
stream of read-out events. Later, when the partition id and the Worker
id have been determined, it calls GAQ.update(ind, {id, W_id,
Not-committed}).
Worker calls GAQ.update(ind, {id, W_id, Committed}) and GAQ.de_queue()
as many times as necessary if the NLWM needs moving.
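
For illustration, a simplified circular-buffer sketch of the
en_queue/head_queue/de_queue mechanics under the queue's mutex/cond pair;
the class name and the plain void* items are stand-ins, not the actual
Slave_queue implementation.

  #include <pthread.h>
  #include <stddef.h>

  // Simplified stand-in for the Slave_queue circular buffer; the real
  // queue stores task assignments (WQ) or group descriptors (GAQ).
  struct Circ_queue_sketch
  {
    static const size_t SIZE= 1024;
    void *buf[SIZE];
    size_t head, len;                 // head index and current length
    pthread_mutex_t lock;
    pthread_cond_t cond;

    Circ_queue_sketch() : head(0), len(0)
    {
      pthread_mutex_init(&lock, NULL);
      pthread_cond_init(&cond, NULL);
    }

    size_t en_queue(void *item)       // producer side; returns the slot index
    {
      pthread_mutex_lock(&lock);
      while (len == SIZE)             // overflow protection: wait for room
        pthread_cond_wait(&cond, &lock);
      size_t ind= (head + len) % SIZE;
      buf[ind]= item;
      len++;
      pthread_cond_signal(&cond);     // wake a waiting consumer
      pthread_mutex_unlock(&lock);
      return ind;
    }

    void *head_queue()                // consumer side: peek without removing
    {
      pthread_mutex_lock(&lock);
      while (len == 0)
        pthread_cond_wait(&cond, &lock);
      void *item= buf[head];
      pthread_mutex_unlock(&lock);
      return item;
    }

    void *de_queue()                  // consumer side: remove the head item;
    {                                 // assumed to follow head_queue()
      pthread_mutex_lock(&lock);
      void *item= buf[head];
      head= (head + 1) % SIZE;
      len--;
      pthread_cond_signal(&cond);     // wake a producer blocked on overflow
      pthread_mutex_unlock(&lock);
      return item;
    }
  };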
APH is a hash with a semaphore (mutex/cond var). The mutex is
to be held by the caller for any operation.
Coordinator searches for, inserts or updates a record.
A Worker updates or deletes.
Coordinator can wait for a condition like the `N' usage counter
zeroing out (see fallback in II.1).
A possible optimization (not yet implemented) for operations on WQ,
GAQ and APH is to implement spin-locking, and for APH record-level
locking.
The latest (by Tue May 10 15:38:55 EEST 2011) benchmarking suggests
this approach.
4. The load distribution hash function and its supplementary mechanisms
------------------------------------------------------------------------
The database partitioning info in a Rows-event is available to
Coordinator for free, but not in a Query-log-event.
Early preparation of the hash info on the master side is therefore
added to the Query-log-event class through a new status variable of
the class.
Each time a query is logged, all databases that it has updated are
put into a list in the new status var.
Coordinator checks the list through a new method similar to the
existing Log_event::get_db() and calls the hash function as
explained in p.1.
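
As a hedged illustration, the per-database hashing step could look like the
sketch below; the hash function and names are invented and do not reflect
the server's actual hash.

  #include <string>
  #include <vector>

  // Illustrative only: derive a numeric hash for each updated database
  // listed in the event (a Rows-event carries one db; the Query-log-event's
  // new status variable may carry several). The db name itself acts as the
  // partition id P_id; the numeric hash only selects a bucket.
  static unsigned long hash_db_name(const std::string &db)
  {
    // Simple FNV-1a style hash as a stand-in for the server's function.
    unsigned long h= 2166136261UL;
    for (size_t i= 0; i < db.size(); i++)
      h= (h ^ (unsigned char) db[i]) * 16777619UL;
    return h;
  }

  static std::vector<unsigned long>
  partition_hashes(const std::vector<std::string> &updated_dbs)
  {
    std::vector<unsigned long> ids;
    for (size_t i= 0; i < updated_dbs.size(); i++)
      ids.push_back(hash_db_name(updated_dbs[i]));
    return ids;
  }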
II. Execution modes and flow; use cases and their handling
=====================================================
0. WQ overflow protection and lessening C-to-W synchronization
-------------------------------------------------------------
C always waits for a Worker if it can't assign a task due to
the limited size of that Worker's WQ.
Whenever it is established that all Workers have been assigned a
sufficient bulk of load, C goes to sleep for a small amount of time
automatically computed based on statistics of WQ processing.
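
A rough sketch of such an adaptive throttle; the threshold, the nap bound
and the names are invented for illustration, the actual heuristic being an
implementation detail.

  #include <stddef.h>
  #include <unistd.h>

  // Illustration only: nap proportionally to how full the Worker queues
  // are, bounded by a small maximum, so Coordinator does not spin when
  // Workers already have plenty of queued work.
  void coordinator_throttle_sketch(size_t total_queued_events,
                                   size_t total_queue_capacity)
  {
    const useconds_t max_nap_us= 1000;        // 1 ms upper bound (assumed)
    if (total_queue_capacity == 0)
      return;
    double fill= (double) total_queued_events / (double) total_queue_capacity;
    if (fill > 0.5)                           // Workers have enough backlog
      usleep((useconds_t) (max_nap_us * fill));
  }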
1. Fallback to the sequential exec mode
----------------------------------------
This part is extensively highlighted in HLS section.
When an event (e.g. a Query-log-event) contains a db in its partition
info that is not handled by the current Worker but instead has been
assigned to some other one, C does the following (see the sketch after
this list):
a. marks the APH record with a flag requesting a signal on the
cond var when the `N' usage counter is dropped to zero by the other
Worker;
b. waits for the other Worker to finish its tasks on that partition
and gets the signal;
c. updates the APH record to point to the first Worker (naturally,
N := 1), schedules the event, and goes back into the parallel mode.
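
A minimal sketch of that wait/rebind protocol, assuming an APH record
guarded by the hash's mutex/cond pair; all names are hypothetical.

  #include <pthread.h>
  #include <stddef.h>

  // Hypothetical APH record; the real one also stores the partition id etc.
  struct Aph_record_sketch
  {
    unsigned long n_usage;      // `N', ongoing groups on this partition
    size_t w_id;                // currently bound Worker
    bool wanted;                // Coordinator asked to be signalled at N == 0
  };

  // Coordinator side of steps a..c: wait until the other Worker drains the
  // partition, then rebind it to new_w_id and resume the parallel mode.
  void wait_and_rebind(Aph_record_sketch *rec, size_t new_w_id,
                       pthread_mutex_t *aph_lock, pthread_cond_t *aph_cond)
  {
    pthread_mutex_lock(aph_lock);
    rec->wanted= true;                       // a. request the signal
    while (rec->n_usage > 0)                 // b. wait for N to drop to zero
      pthread_cond_wait(aph_cond, aph_lock);
    rec->wanted= false;
    rec->w_id= new_w_id;                     // c. rebind the partition
    rec->n_usage= 1;                         //    N := 1 for the new group
    pthread_mutex_unlock(aph_lock);
  }

  // Worker side: after decrementing N at group commit, signal if requested.
  void worker_release(Aph_record_sketch *rec,
                      pthread_mutex_t *aph_lock, pthread_cond_t *aph_cond)
  {
    pthread_mutex_lock(aph_lock);
    if (rec->n_usage > 0 && --rec->n_usage == 0 && rec->wanted)
      pthread_cond_signal(aph_cond);
    pthread_mutex_unlock(aph_lock);
  }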
2. A special treatment to BEGIN Query-log-event
-------------------------------------------------
The BEGIN Query_log_event is regarded as the token that begins a group
of events.
It allocates a new item in GAQ which stays anonymous until the next
event reveals the partition/database id.
III. Interfaces for recovery
============================
0. GAQ from the RLI-commit point of view for WL#5599
----------------------------------------------------
The recovery implementation part, whose use cases include:
a. crash
b. STOP SLAVE
c. a slave thread is stopped
d. a slave thread errors out
A Commit Query-log-event or an XID event forces the Worker to
rearrange the NLWM, that is to remove the first items of GAQ as
garbage.
1. Executing recovery at starting the slave
-------------------------------------------
see recovery after the temporary failure in section II as well.
Read details on this section in WL#5599.
IV. USER Interfaces
====================
0. User interfaces to Coordinator
----------------------------------
a. set @@global.slave_parallel_workers= [0..ULONG],
Cmd line: YES
Scope: GLOBAL
Dynamic: YES (does not affect ongoing slave session)
Range: 0..1024
In case @@global.slave_parallel_workers == 0, execution will
be in the legacy sequential mode where SQL thread executes
events.
Changes to the variable become effective only after STOP SLAVE,
that is, the running pool size can't change.
b. @@global.slave_pending_jobs_size_max
Cmd line: YES
Scope: GLOBAL
Dynamic: YES (does not affect ongoing slave session)
Range: 0..ULONG
The maximum total size of all events being processed by all Workers.
The least possible value must not be less than the master-side
max_allowed_packet.
c. Recovery related (WL#5599), including
@@global.slave_checkpoint_group
Cmd line: YES
Scope: GLOBAL
Dynamic: YES (does not affect ongoing slave session)
Range: 0..ULONG
@@global.slave_checkpoint_period
Cmd line: YES
Scope: GLOBAL
Dynamic: YES (does not affect ongoing slave session)
Range: 0..ULONG
d. At STOP SLAVE Coordinator waits till all assigned groups of events
have been processed. Waiting is limited to a timeout (1 minute,
similarly to the Single-Threaded-Slave case).
e. KILL Query|Connection of Coordinator|Worker has an immediate
effect; that is, a Worker may not finish up the current ongoing group,
thereby letting the committed groups have some gaps.
Changes to SHOW SLAVE STATUS parameters
f. The last executed group of events coordinates, as described by
EXEC_MASTER_LOG_POS et al., are not updated per each group
(transaction) but rather per some number of processed groups of
events. The number can't be greater than
@@global.slave_checkpoint_group, and in any case the SBM updating
rate does not exceed @@global.slave_checkpoint_period.
g. Notice that the Seconds_Behind_Master (SBM) updating policy is
slightly different for MTS.
The status parameter is updated similarly to EXEC_MASTER_LOG_POS
et al. of p.f.
SBM is set to a new value after processing the terminal event
(e.g. Commit) of a group.
Coordinator resets SBM when it notices there are no more groups left
either to read from the relay log or to be processed by Workers.
1. User interfaces to Workers
------------------------------
a. Monitoring through a new P_S table of wl#5631
b. (not to be implemented here) Manual performance tuning
- to freeze (unfreeze) the Worker's communication channel, that is
to make it work through the current set of partitions
- to combine multiple partitions/databases considered to be mutually
dependent into one WQ
- to start multiple workers at the
slave bootstrap and manually associate each with a set of
partitions/databases