WL#5569: Replication events parallel execution via hashing per database name
Affects: Server-5.6
—
Status: Complete
High-Level Description

MOTIVATION
----------
In many cases the user application logically partitions data per database,
allowing some flexible patterns of applying replication events on the slave.
In one important use case, updates within a single database must follow the
master total order, but any two databases do not have to coordinate their
applying.

OBJECTIVE
---------
Given the user application's logical partitioning per database, the goal is
largely the same as in WL#943 and WL#4648: to scale up slave performance by
engaging multiple threads to apply replication events. The improvement in
performance effectively decreases the slave's lag behind the master.
A prototype implementation, subject to a few technical limitations, is
considered in WL#5563.

FEATURES
--------
Parallelization by db name does not affect the binlog format of events;

Transactions partitioned to different worker threads do not have commit
coordination;

Parallelization by db name is transparent to the engine type;

Automatic switching from the concurrent PARALLEL mode to a temporary
SEQUENTIAL mode in special cases when events can't be executed in parallel.
Once a special case is handled, the slave automatically restores the
concurrent PARALLEL mode.

NOTE_1: An event that updates multiple databases does not necessarily mean
        such switching.
NOTE_2: There are a few event types like Format_Descriptor and Rotate that
        are executed by the standard Single-Threaded Slave algorithm (that
        is by Coordinator). They do not introduce any throttle to
        concurrency as long as binary logs are not rotated frequently.

Internal Goals
--------------
To create a framework facilitating further extension towards support of
more use cases of partitioning, e.g. per table.

Conceptual limitations and differences from the single thread
--------------------------------------------------------------
a. This framework relies on WL#2775 implementing a base for automatic
   recovery. The transactional storage keeping the descriptor of the
   last-updated info is extended to also record an array of last-executed
   group positions of the worker threads (see more in the Recovery section).

b. Status info displayed via SHOW SLAVE STATUS will largely remain the same.
   In particular that relates to the monotonic growth of EXEC_MASTER_LOG_POS.
   However, the parameter will turn into a Low-Water-Mark (LWM). The LWM
   indicates that before (in the master logging order sense) the last
   committed transaction or executed group of events there are no
   not-yet-committed transactions/groups. But there can be some committed
   after the LWM. SECONDS_BEHIND_MASTER can't be updated per each event;
   updating happens in intervals along with EXEC_MASTER_LOG_POS (see LLD for
   more details).

c. Additional info on the status of individual workers can be provided via
   the mysql.SLAVE_WORKER_INFO table of WL#5599.

d. May not be supported:
   - transaction retry after a temporary failure
   - START SLAVE UNTIL
   - foreign key constraints if the referenced table is in a different
     database (see WL#6007). Currently, the MTS user has to make sure that
     possible FK-dependencies do not cross over the explicit set of table
     databases. Otherwise it could lead to deadlocks involving slave worker
     threads and to data inconsistency.
   - LOAD-DATA related events from a pre-5.0 master server version

e. A master vulnerable to BUG#37426 can't be detected by MTS. This is a
   technical issue that could be lifted later.

Recovery model
--------------
The parallel execution mode does not change in principle the recovery for
DDL or any other non-transactional events.
In this case the recovery routine is deemed to be manual. To make recovery
as simple as possible, when the user expects failures and is accustomed to
some application-specific recovery procedure, a DDL can (optionally) force
the slave to apply it in the SEQUENTIAL mode, waiting first for the
previously distributed parallel load to be completed by the Workers.

This worklog requires some extensions to the WL#2775 slave system tables for
automatic recovery. The relay-log-info table is supposed to be augmented
(WL#5599) to contain each worker's last executed group position. Another
change is meant for the last executed group position of the single SQL
thread, which is converted into the low-water mark (LWM) of executed group
positions among all Workers; it indicates that no gaps in committed groups
exist before it.

Depending on the binary log events format the following methods can be
applied:

1. (Not to be implemented here) Idempotent recovery when
   binlog_format == ROW.
   The low-water-mark last executed group position that the Coordinator
   maintains alone can suffice. It is regarded as an anchor to restart
   applying events the same way as the last executed group position of the
   single SQL thread does. Possible errors like Dup-key, Key-not-found and
   a few more are benign and will be ignored. The idempotent applying policy
   changes to the user-preferred one as soon as the low-water mark has
   reached the max group position among workers. This type of recovery does
   not require the Worker pool and can be conducted solely by the
   Coordinator. Notice a great advantage of this method: although bound to
   the ROW format, it is transparent to the engine type.

2. General recovery regardless of the binlog format (read more on that in
   WL#5599).
   This method leaves aside the mentioned DDL and non-transactional
   Query-log-events and applies to the transactional Rows- and
   Query-log-events. This type of recovery involves information that
   Workers left in their info-table. In short, MTS recovery reads the
   Workers' info and constructs a set of not-applied transactions that
   follow the LWM. Starting from the low-water-mark the Coordinator parses
   events to ignore those that were committed and applies the rest.

Favorable environment
----------------------
While Statement-format events including Query-log-event and all types of
engine will be supported, the primary concern is achieving performance
scale-up under the following conditions:

- InnoDB storage engine;
- ROW binlog format;
- Transactions operate on disjoint sets of databases (i.e., they do not
  conflict and can commit independently).

Sketch of Architecture
----------------------
It is assumed for now that any replication event can be partitioned.
Query-log or Rows-log events to different databases fall into that category
with partitioning per db name. Such a partitioning model suggests equipping
the single SQL thread system with a worker thread pool. The SQL thread
changes its current multi-purpose role to serve as their Coordinator.
Interaction between Coordinator and Workers follows the producer-consumer
model, pretty close to what had been implemented in WL#4648 (in fact that
piece of work can be reused).
Pseudo-code for Coordinator:

  // Coordinator, the former SQL thread, keeps operating as in the single
  // thread env, i.e. it stays in the read-execute loop, but instead of
  // executing an event through apply_event_and_update_pos() it distributes
  // tasks to workers.

  while (!killed)
    exec_relay_log_event();

  where

  exec_relay_log_event()
  {
    if (workers_state() is W_ERROR)
      return error;
    ev= read();
    switch (mts_execution_mode(ev))
    {
    case TEMPORARY_SEQUENTIAL:
      wait_for_workers_to_finish();
      /* fall through */
    case PARALLEL_CONCURRENT:
      engage_worker(ev, w_thd[worker_id]);
      break;

    // Some event types like Format_Descriptor, Rotate the Coordinator
    // executes itself. Depending on the event property its execution
    // can require synchronization with Workers.
    case SEQUENTIAL_SYNC:
      wait_for_workers_to_finish();
      /* fall through */
    case SEQUENTIAL_ASYNC:
      apply_event_and_update_pos();
    }
  }

Pseudo-code for Worker:

  // The worker stays in the wait-execute loop
  while (!killed)
  {
    ev= get_event();      // wait for an event to show up
    exec_res= ev->apply_event(rli);
    ev->update_pos();     // updating the worker's info-table with the
                          // task's group position
    report(exec_res);     // asynch status communication with Coordinator
  }

The worker pool is activated via START SLAVE. STOP SLAVE terminates all
threads including the worker pool.
WL#2775: System tables for master.info, relay_log.info
WL#5563: MTS: Prototype for Slave parallelized by db name
WL#5754: Query event parallel execution
Requirements:
=============
R0. To provide the user interface for the parallel execution mode including
A. @@global.slave_parallel_workers
that specifies the number of worker threads.
A zero number of workers, the default, designates no worker pool and
therefore the legacy sequential execution mode, or Single-Threaded-Slave
(STS).
Use cases:
U.A: set @@global.slave_parallel_workers= `value`;
(see LLD section for the details of the variable definition)
R1. To provide transparent detection of limitations for parallelization
and to perform corresponding actions for such events in
the following cases:
A: the hash function mapping conflicts;
U.A.1: a transaction spanning multiple databases
Let's consider multi-statement transactions T_1, T_2, T_3, where
T_1: begin; update db1; commit;
T_2: begin; update db1; commit;
T_3: begin; update db1; update db2; commit;
Verify that the correct applying takes place and the parallel exec mode
resumes after T_3 has committed.
U.A.2: a query crossing databases
set autocommit = ON;
update db1.table, db2.table ...;
In case the number of databases exceeds a maximum, the whole hash
will be locked to the Worker executing the query, which reduces
parallelism.
B: the exceptional set of events to be executed by Coordinator;
U.B: Events limited to sequential execution by Coordinator include
all event types in the list:
STOP_EVENT,
ROTATE_EVENT,
FORMAT_DESCRIPTION_EVENT,
INCIDENT_EVENT
C: events from an old master with reduced parallelism;
U.C: A query event from an MTS-unaware old master is treated as if the
query crosses the maximum number of databases of case U.A.2.
D: events that can't be supported in MTS;
There are cases when MTS starts reading events not from the boundary of
a group. Unfortunately that can happen with a legacy log file created by
an old server.
U.D: LOAD-DATA events generated by a 4.0 or 4.1 master. MTS execution will
stop with an error.
R3. To deliver an MTS implementation that scales up.
To annotate the favorable as well as the least performant use cases.
use case:
U: to demonstrate when the max performance is achieved as a
function of parameters including:
the number of computing units, databases, Workers,
the binary log format, and engine types.
NOTE_1: the crash-recovery requirement is addressed by WL#5599.
NOTE_2: monitoring status is considered by other WLs,
both the individual per-Worker info and the aggregate info
on the overall performance, incl. the LWM.
A P_S Worker table (WL#5631)
holds all the per-Worker info including execution times,
the last group executed by a worker, the total tasks performed by a
worker, the currently assigned queues and the status of the queues.
The applying progress of an individual Worker can be monitored via the
mysql.SLAVE_WORKER_INFO table of WL#5599.
NOTE_3: an option to set up the database distribution manually is not
implemented here.
Manual partitioning would allow the user to bind two or
more specific databases to be served through one specific queue,
thereby accomplishing a many *dependent* databases to one Worker
association.
A similar idea is to implement a priority mechanism for applying
events from different queues.
Replication events applying phases, tasks decomposition, mapping
----------------------------------------------------------------
The sequential execution of an event breaks into five basic activities
as the following diagram represents
+--------------------------------------------------------+--------------------+
| | |
| +---------------+ +--------------+ +-------------+ | |
| | | | | | | | |
| | Read | | Construct | | Skip | | COORDINATOR |
| | Event --+---+> --+---+-> Until | | STAGES |
| | (relay-log) | | Event | | Delay | | |
| | | | | | | | (There is only |
| +-----^---------+ +--------------+ +-------+-----+ | one instance |
| ! | | of a coordinator) |
|-------+----------------------------------------+-------+--------------------+
| | | | |
| | +--------------+ +-------+-----+ | |
| | | | | V | | WORKERS |
| | | | | | | STAGES |
| +-------------+ COMMIT <+---+-- EXECUTE | | |
| | | | | | (multiple workers |
| | | | | | can exist at the |
| +--------------+ +-------------+ | same time) |
| | |
+--------------------------------------------------------+--------------------+
- Coordinator: reads events from relay log, instantiates them,
handles the skip/until/delay stages and then
schedules them to the correct worker
- Worker: executes events, committing statements and
whole transactions
The relation between coordinator and workers is: 1-N, where 1 is
the one and only coordinator thread and N is the number of threads
in the workers pool.
Agglomerating Read, Construct and Skip-Until-Delay into
one thread is reasonable considering the three tasks are light and much
of the existing code can be reused.
The Execute and Commit tasks are fit for delegating to Worker threads.
Indeed, splitting Execute into pieces cannot really be
backed up by any reasonable requirement.
The Commit task is actually a part of Execute, and
the last executed group position is a field of the RLI table to be updated
and committed within - at the end of - the master transaction.
So agglomerating the two to be carried out by one Worker is a reasonable
choice.
Communication mechanism between C and W consists of an assignment
Worker Queue (WQ) and Assigned Partition Hash (APH) which correspond
to the downward arrow from C to W of the diagram.
For recovery purpose there is yet another Groups Assignment Queue
(GAQ) that corresponds to the upward arrow.
A WQ holds events scheduled for execution that belong to a certain
partition. Coordinator calls a hash function on an event instance in
order to determine a partition identifier P_id.
The partition to Worker mapping is controlled with the APH hash.
The hash contains (P_id, N, W_id) tuples, where N is the number of ongoing
*groups* of events.
APH is complemented with a Current Group Assigned Partitions (CGAP)
object on the Coordinator's side that lists all P_id:s found during the
current group's assigning. Multiple events of a group map to one P_id as
long as only one database is updated. There can be many CGAP items if many
databases are updated by the group. Each item of the list forces either
an insert or an update of a tuple in the APH at scheduling time, as
follows.
On the Coordinator side, after computing P_id, Coordinator first searches
CGAP for an existing item.
If the search is positive, APH does not change.
If the search is negative, a new record with P_id is first inserted into
CGAP. Then the P_id-keyed tuple is searched for in APH.
If not found, a new (P_id, 1, W_id) tuple is inserted; W_id is determined
as a new Worker's id if the max pool size allows spawning one more thread,
and as the least occupied Worker's id otherwise.
Otherwise the found record is updated by incrementing N.
Finally, if the event is the current group terminator, all items in the
CGAP are discarded.
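
As an illustration only, below is a minimal C++ sketch of this
Coordinator-side bookkeeping, using std:: containers in place of the
server's DYNAMIC_ARRAY/HASH structures. The names (Coordinator_sketch,
schedule(), etc.) are hypothetical, and the fallback logic that keeps all
partitions of one group on the same Worker (see II.1) is omitted.

  #include <algorithm>
  #include <map>
  #include <set>
  #include <string>
  #include <vector>

  // Hypothetical names; (N, W_id) tuples keyed by the partition id P_id.
  struct ApTuple { unsigned long n_groups; size_t w_id; };

  struct Coordinator_sketch
  {
    std::map<std::string, ApTuple> aph;  // Assigned Partition Hash
    std::set<std::string> cgap;          // Current Group Assigned Partitions
    std::vector<unsigned long> load;     // per-Worker assignment counters
    size_t max_workers;                  // assumed >= 1

    Coordinator_sketch(size_t max_w) : max_workers(max_w) {}

    // Decide which Worker executes an event that maps to partition p_id.
    size_t schedule(const std::string &p_id)
    {
      if (cgap.count(p_id))                  // already assigned in this group:
        return aph[p_id].w_id;               // APH stays untouched
      cgap.insert(p_id);

      std::map<std::string, ApTuple>::iterator it= aph.find(p_id);
      if (it != aph.end())
      {
        it->second.n_groups++;               // one more ongoing group (N++)
        return it->second.w_id;
      }
      size_t w_id;
      if (load.size() < max_workers)         // the pool may still grow
      {
        w_id= load.size();
        load.push_back(0);
      }
      else                                   // else the least occupied Worker
        w_id= std::min_element(load.begin(), load.end()) - load.begin();
      load[w_id]++;
      ApTuple t= { 1, w_id };
      aph[p_id]= t;
      return w_id;
    }

    // The current group terminator discards all CGAP items.
    void end_of_group() { cgap.clear(); }
  };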
On the Worker's side, when the current group is being committed, N is
decremented in all the group's tuples that Coordinator inserted or
updated. A tuple is (optionally) removed from APH once N becomes zero.
The partition identifiers of the group are collected in the Current Group
Executed Partitions (CGEP) list. That is the Worker's-side complement to
the APH object, behaving analogously to the Coordinator's CGAP.
Thus each item of the CGEP list forces an update of N and possibly a
deletion of the corresponding tuple from APH at group commit time.
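
A complementary sketch of the Worker-side bookkeeping at group commit time;
for self-containment only the `N' usage counters of APH are modelled here,
and all names are hypothetical.

  #include <map>
  #include <string>
  #include <vector>

  // Hypothetical Worker-side view of APH: P_id -> number of ongoing groups.
  struct Worker_sketch
  {
    std::vector<std::string> cgep;      // Current Group Executed Partitions

    // Remember every partition id this Worker executed within the group.
    void note_partition(const std::string &p_id) { cgep.push_back(p_id); }

    // At group commit time decrement N for every CGEP item and (optionally)
    // erase the tuple once no ongoing group references the partition.
    // In the real design APH access is protected by its mutex (omitted).
    void commit_group(std::map<std::string, unsigned long> &aph_usage)
    {
      for (size_t i= 0; i < cgep.size(); i++)
      {
        std::map<std::string, unsigned long>::iterator it=
          aph_usage.find(cgep[i]);
        if (it != aph_usage.end() && --it->second == 0)
          aph_usage.erase(it);          // partition is free for re-assignment
      }
      cgep.clear();
    }
  };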
An event propagation is depicted as the following
+---------+----------+
++-------------++ | Queue_1 + Worker_1 |
|| || +---------+----------+
+---+ || Coordinator || hash(E) +---------+----------+
| E +--->|| |+----------->| Queue_2 + Worker_2 |
+---+ || || +---------+----------+
++-------------++ ...
+---------+----------+
| Queue_n | Worker_n |
+---------+----------+
where hash(E): E -> P_id, and Queue_i contains zero or more P_id:s.
A worker that was attached to the P_id executes the event and updates
the RLI recovery table if the event is the current group terminator.
Although workers don't have any communication with each other,
they coordinate their commits to RLI as explained further below.
A WQ has a maximum size threshold.
The limited WQ makes Coordinator wait until the actual size of
the target WQ decreases below the limit.
Notice there are a few constraints to this type of
inter-transaction parallelization:
- the entire transaction is mapped to at most one Worker;
- a few types of the Log_event class (see R1 use cases) are executed in
the Sequential mode.
Those types represent internal replication facilities and can't
have any significant impact on performance (see LLD for further
details);
- workers still have common resources such as the RLI recovery table
and therefore are subject to a few implicit synchronizations.
The Commit task has a strong connection with WL#5599. When a Worker
commits to RLI it is supposed to update its own last committed position
and possibly the common Low-Water-Mark.
That is, the Worker's role includes carrying out a group commit to RLI.
In order to facilitate that, C and {W} maintain yet another Groups
Assignment Queue (GAQ) that the Worker updates atomically with RLI.
GAQ contains tokens for each group of events being processed by the
different Workers.
A queue item consists of
a sequence number of the group in the master binlog order,
the partition id, the assigned Worker and a status - Committed or not.
The queue's head is the next-to-the-Low-Water-Mark (NLWM) group.
It separates the already committed groups without gaps from the rest,
which thereby form a sequence of still ongoing and committed groups
starting from the not-yet-committed NLWM.
C enqueues an item to the GAQ once it
recognizes a new group of events is being read out. W marks the group
as Committed, and if its assignment index immediately follows the index
of the current NLWM, W rearranges the head of the queue as in the
following example.
Let's look at a use case with 3 Workers identified as W#1, W#2, W#3.
On the following diagram, let the NLWM, the head of the GAQ, contain a
group being executed by Worker #2.
Assume further that Coordinator distributed 5 groups as follows
| 0 | 1 | 2 | 3 | 4
+---------+----------+----------+--------+----------+
| W #2 | W #1 | W #1 | W #2 | W #3 |
| | | | | |
+---|-----+----------+-----+----+--------+----------+
here the first number row gives the sequence numbers in the master
binlog order.
Suppose the first task of W #1 and the task of W #3
got committed by the time Worker #2 is committing.
W #2 has to rearrange GAQ
| 0 | 1 | 2 | 3 | 4
+---------+----------+----------+--------+----------+----
| W #2 | W #1 | W #1 | W #2 | W #3 | ...
|committin| Committed| | | committed|
+---|-----+----------+-----+----+--------+----------+---
| |
+--- NLWM------------->+
to discard its own group and the committed group next to it, which is
the group with the sequence number 1. The new NLWM will shift to point
at group 2 and the queue becomes
| 2 | 3 | 4 |
+---------+--------+----------+
| W #1 | W #2 | W #3 |
| | | committed|
+---------+--------+----------+
Now in case W #1 commits the 2nd group
| 2 | 3 | 4 |
+---------+--------+----------+
| W #1 | W #2 | W #3 |
|committin| | committed|
+---------+--------+----------+
the queue transforms to
| 3 | 4 |
+--------+----------+
| W #2 | W #3 |
| | committed|
+--------+----------+
and at last when W #2 commits it becomes empty.
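
For illustration, a hedged C++ sketch of that head rearrangement follows;
the real GAQ is a fixed-size circular buffer guarded by a mutex, while the
sketch uses a std::deque and invented names.

  #include <cstddef>
  #include <deque>

  // Hypothetical sketch of the GAQ head rearrangement a Worker performs
  // after marking its own group as Committed.
  struct Gaq_item
  {
    unsigned long long seqno;   // sequence number in the master binlog order
    size_t w_id;                // assigned Worker
    bool committed;
  };

  struct Gaq_sketch
  {
    std::deque<Gaq_item> q;     // head == the next-to-LWM (NLWM) group

    // Mark group `seqno' committed, then pop every leading committed item.
    // The new head is the first still-uncommitted group, i.e. the new NLWM.
    void commit_and_advance(unsigned long long seqno)
    {
      for (size_t i= 0; i < q.size(); i++)
        if (q[i].seqno == seqno)
        {
          q[i].committed= true;
          break;
        }
      while (!q.empty() && q.front().committed)
        q.pop_front();          // LWM moves forward, no gaps remain behind it
    }
  };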
Invariants of Parallel and Sequential execution modes
-----------------------------------------------------
++-------------++
+-----------------+ || ||
| |Execute || Worker ||
| Event +------------+| ||
| | || ||
| | ++------+------++
| | | distribute
| do_apply_event()|Execute ++------+------++
| +------------++ ||
+-----------------+ || Coordinator ||
|| ||
|| ||
++-------------++
Whichever of Coordinator or Worker executes an event, it must
call the Event::do_apply_event() method.
The method is the core of the Execute task.
The diagram above shows that an event can be applied through
Coordinator::distribute + Worker::execute (the Parallel exec mode) or
through Coordinator::Execute. The event class hierarchy does not
receive any changes due to the new Parallel execution mode, thanks to
the Worker execution context design.
Error Reporting (to the End User)
=================================
In multi-threaded execution mode a Worker can hit an
error. At that point, it propagates that error to the
Coordinator. The most recent error that a Worker found, and that
forced the slave to stop, will eventually be displayed in "SHOW
SLAVE STATUS".
However, an important note needs to be raised here. When a Worker
stops because of an error, there may be the case that there will
be gaps in the execution history (some other Workers may have
already executed events that were serialized, by the master,
ahead of this one; some other, serialized before, might not even
have finished their execution). Thus apart from the Worker error
that is reported in SHOW SLAVE STATUS, we also issue a warning:
"The slave Coordinator and Worker threads are stopped, possibly
leaving data in inconsistent state. The following restart
shall restore consistency automatically. There might be
exceptional situations in the recovery caused by combination
of non-transactional storage for either of Coordinator or
Workers info tables and updating non-transactional data tables
or DDL queries. In such cases you have to examine your
data (see documentation for details)."
We chose this approach because of a SHOW SLAVE STATUS
limitation: it is not prepared to show more than one error,
and we don't want to shadow the original cause.
Future Work (Feature Request):
In the future we can improve the MTS reporting capabilities by
implementing an information/performance schema table which the
user can query to get details on the individual Worker's internal
state. Thus, we can promote the warning mentioned above to an
error, show it in SHOW SLAVE STATUS and then exhibit the
original cause (and the subsequent list of errors issued during
the entire execution of the offending statement, until the
Worker stops) in the [I|P]_S table.
This section elaborates on the following fine details of the project:
I. Basic objects, lifetimes, states, cooperation
0. Coordinator
1. Worker
2. Worker pool, lifetime, states
3. Coordinator <-> Worker communications;
WQ, GAQ interfaces
4. The load distribution hash function and its supplementary mechanisms
II. Execution modes and flow; use cases and their handling
0. WQ overflow protection
1. Fallback to the sequential exec mode
2. Deferred scheduling of BEGIN;
III. Interfaces for recovery
0. GAQ from the RLI-commit point of view for WL#5599
1. Executing recovery at starting the slave.
IV. User interfaces
I. Basic objects, lifetimes, states, cooperation
===================================================
0. Coordinator, roles
----------------------
It is implemented on the basis of the former SQL thread, in the way
of the WL#5563 prototype.
After Coordinator instantiates an event, it finds out its
partition/database information. For instance Coordinator does not
sql-parse a Query-log-event but rather relies on the info prepared
on the master side.
Coordinator decides that an event needs sequential execution
if the partition id can't be computed (more than one database with
the automatic hashing) or the event type belongs to the exceptional
kinds (see HLS).
If the event is parallelizable, Coordinator calls a hash function
to produce P_id, the identifier of a partition/database.
APH is searched for an existing record keyed by P_id; the found
record is updated, or a new one is inserted.
In the update case there already exists a Worker that remains to
handle the partition.
In the insert case the least loaded Worker is assigned.
A separate constant activity is to fill in GAQ each time a new
group is recognized; see further details on handling BEGIN.
Other activities include:
Invoking the recovery routine at the slave SQL/Coordinator thread
start, at bootstrap;
Watching the pool status, per event, for any Worker that has failed.
A noticed failure triggers marking the remaining workers with KILLED
and the Coordinator's exit;
Shutting down the Worker pool, at exit.
1. Worker
----------
Worker is implemented as a system thread that handles events much
like the sequential-mode SQL thread.
Its execution context is the Slave_worker class, based on
Relay_log_info, which corresponds to the execution context of the
SQL thread or Coordinator.
Worker calls a new Log_event::do_apply_event_worker(Slave_worker *w)
that forwards to Log_event::do_apply_event(w) for all subclasses of
Log_event but Xid_log_event.
Xid_log_event::do_apply_event(Slave_worker *), used when Worker
executes the event, differs from do_apply_event(Relay_log_info *)
not just in the type of the argument but rather in the logic of
modifying the slave info (recovery) table (see WL#5599).
The argument passed to do_apply_event() can call its own
methods. Those methods need to be declared virtual in the
Relay_log_info base class and defined specifically for
Slave_worker.
Example: report().
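
A minimal illustration of that virtual-method pattern; apart from the
Relay_log_info/Slave_worker/report() names taken from the text above,
everything here is a simplified stand-in.

  #include <cstdio>

  // Sketch only: the real classes carry much more state.
  class Relay_log_info
  {
  public:
    virtual ~Relay_log_info() {}
    // Declared virtual so the execution context decides how to report.
    virtual void report(const char *msg) const
    {
      std::printf("SQL thread/Coordinator: %s\n", msg);
    }
  };

  class Slave_worker : public Relay_log_info
  {
  public:
    // Worker-specific reporting, e.g. relayed asynchronously to Coordinator.
    virtual void report(const char *msg) const
    {
      std::printf("Worker: %s\n", msg);
    }
  };

  // do_apply_event() only sees the base type; the override is picked at
  // run time depending on who executes the event.
  void do_apply_event_sketch(const Relay_log_info *rli)
  {
    rli->report("applying an event");
  }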
2. Worker pool, lifetime, states
---------------------------------
The Worker pool is a DYNAMIC_ARRAY of THD instances, each attached to
an OS thread.
Worker and Coordinator maintain a WQ as a communication channel.
W and WQ associate 1-to-1, which justifies making WQ a
member of the Worker class.
The pool is initialized with a number of members according to
@@global.slave_parallel_workers at the time the SQL/Coordinator thread
starts.
The pool is destroyed by Coordinator at its shutdown time.
A worker resides in a wait-execute loop; the wait is for
an event in its WQ, execution is through ev->do_apply_event().
Commit or XID group terminators, or a self-terminating event
like a DDL Query_log_event, force an update of the Worker info table.
Committing the group affects GAQ as described in the HLS part.
A worker leaves the loop in case of being killed or erroring out.
KILLED is "hard", forcing an instant exit like after a
KILL query.
Optionally, not to be implemented here, there could be a soft KILLED
to make the Worker process all its assignments first and exit upon that.
The soft variant would be a facility to stop the slave with no gaps
after the LWM (see STOP SLAVE).
3. Coordinator <-> Worker communications; WQ, GAQ, APH interfaces
--------------------------------------------------------------
WQ and GAQ are implemented through an array of a fixed size in the
manner of a circular buffer. Here is a template definition:
typedef struct queue
{
DYNAMIC_ARRAY Q; // placeholder
mysql_mutex_t lock;
mysql_cond_t cond;
public:
uint en_queue(item*);
item* head_queue();
item* de_queue();
bool update(uint index, item*);
} Slave_queue;
`item*' stands for a reference to an opaque object corresponding
to the task assignment (WQ) or the group of events descriptor (GAQ).
Coordinator calls WQ.en_queue(); Worker calls head_queue() and
de_queue(). Phasing the dequeue operation into two calls is required
by the specifics of the Execute logic.
Coordinator calls ind= GAQ.en_queue() when a new BEGIN is found in the
stream of read-out events. Later, when the partition id and the Worker
id have been determined, it calls GAQ.update(ind, {id, W_id,
Not-committed}).
Worker calls GAQ.update(ind, {id, W_id, Committed}) and GAQ.de_queue()
as many times as necessary if the NLWM needs moving.
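
For illustration, a simplified circular-buffer sketch of the
en_queue/head_queue/de_queue mechanics under the queue's mutex/cond pair;
the class name and the plain void* items are stand-ins, not the actual
Slave_queue implementation.

  #include <pthread.h>
  #include <stddef.h>

  // Simplified stand-in for the Slave_queue circular buffer; the real
  // queue stores task assignments (WQ) or group descriptors (GAQ).
  struct Circ_queue_sketch
  {
    static const size_t SIZE= 1024;
    void *buf[SIZE];
    size_t head, len;                 // head index and current length
    pthread_mutex_t lock;
    pthread_cond_t cond;

    Circ_queue_sketch() : head(0), len(0)
    {
      pthread_mutex_init(&lock, NULL);
      pthread_cond_init(&cond, NULL);
    }

    size_t en_queue(void *item)       // producer side; returns the slot index
    {
      pthread_mutex_lock(&lock);
      while (len == SIZE)             // overflow protection: wait for room
        pthread_cond_wait(&cond, &lock);
      size_t ind= (head + len) % SIZE;
      buf[ind]= item;
      len++;
      pthread_cond_signal(&cond);     // wake a waiting consumer
      pthread_mutex_unlock(&lock);
      return ind;
    }

    void *head_queue()                // consumer side: peek without removing
    {
      pthread_mutex_lock(&lock);
      while (len == 0)
        pthread_cond_wait(&cond, &lock);
      void *item= buf[head];
      pthread_mutex_unlock(&lock);
      return item;
    }

    void *de_queue()                  // consumer side: remove the head item;
    {                                 // assumed to follow head_queue()
      pthread_mutex_lock(&lock);
      void *item= buf[head];
      head= (head + 1) % SIZE;
      len--;
      pthread_cond_signal(&cond);     // wake a producer blocked on overflow
      pthread_mutex_unlock(&lock);
      return item;
    }
  };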
APH is a hash with a semaphore (mutex/cond var). The mutex is
to be held by the caller for any operation.
Coordinator searches for, inserts or updates a record.
A Worker updates or deletes.
Coordinator can wait for a condition like the `N' usage counter
zeroing out (see fallback in II.1).
A possible optimization (not yet implemented) for operations on WQ,
GAQ and APH is to implement spin-locking, and for APH record-level
locking.
The latest (by Tue May 10 15:38:55 EEST 2011) benchmarking suggests
this approach.
4. The load distribution hash function and its supplementary mechanisms
------------------------------------------------------------------------
The database partitioning info in a Rows-event is available to
Coordinator for free, but not in a Query-log-event.
Early preparation of the hash info on the master side is therefore
added to the Query-log-event class through a new status variable of
the class.
Each time a query is logged, all databases that it has updated are
put into a list in the new status var.
Coordinator checks the list through a new method similar to the
existing Log_event::get_db() and calls the hash function as
explained in p.1.
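
As a hedged illustration, the per-database hashing step could look like the
sketch below; the hash function and names are invented and do not reflect
the server's actual hash.

  #include <string>
  #include <vector>

  // Illustrative only: derive a numeric hash for each updated database
  // listed in the event (a Rows-event carries one db; the Query-log-event's
  // new status variable may carry several). The db name itself acts as the
  // partition id P_id; the numeric hash only selects a bucket.
  static unsigned long hash_db_name(const std::string &db)
  {
    // Simple FNV-1a style hash as a stand-in for the server's function.
    unsigned long h= 2166136261UL;
    for (size_t i= 0; i < db.size(); i++)
      h= (h ^ (unsigned char) db[i]) * 16777619UL;
    return h;
  }

  static std::vector<unsigned long>
  partition_hashes(const std::vector<std::string> &updated_dbs)
  {
    std::vector<unsigned long> ids;
    for (size_t i= 0; i < updated_dbs.size(); i++)
      ids.push_back(hash_db_name(updated_dbs[i]));
    return ids;
  }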
II. Execution modes and flow; use cases and their handling
=====================================================
0. WQ overflow protection and lessening C-to-W synchronization
-------------------------------------------------------------
C always waits for a Worker if it can't assign a task due to
the limited size of that Worker's WQ.
Whenever it is established that all Workers have been assigned a
sufficient bulk of load, C goes to sleep for a small amount of time
automatically computed based on statistics of WQ processing.
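
A rough sketch of such an adaptive throttle; the threshold, the nap bound
and the names are invented for illustration, the actual heuristic being an
implementation detail.

  #include <stddef.h>
  #include <unistd.h>

  // Illustration only: nap proportionally to how full the Worker queues
  // are, bounded by a small maximum, so Coordinator does not spin when
  // Workers already have plenty of queued work.
  void coordinator_throttle_sketch(size_t total_queued_events,
                                   size_t total_queue_capacity)
  {
    const useconds_t max_nap_us= 1000;        // 1 ms upper bound (assumed)
    if (total_queue_capacity == 0)
      return;
    double fill= (double) total_queued_events / (double) total_queue_capacity;
    if (fill > 0.5)                           // Workers have enough backlog
      usleep((useconds_t) (max_nap_us * fill));
  }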
1. Fallback to the sequential exec mode
----------------------------------------
This part is extensively highlighted in HLS section.
When an event (e.g. a Query-log-event) contains a db in its partition
info that is not handled by the current Worker but instead has been
assigned to some other one, C does the following (see the sketch after
this list):
a. marks the APH record with a flag requesting a signal on the
cond var when the `N' usage counter is dropped to zero by the other
Worker;
b. waits for the other Worker to finish its tasks on that partition
and gets the signal;
c. updates the APH record to point to the first Worker (naturally,
N := 1), schedules the event, and goes back into the parallel mode.
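
A minimal sketch of that wait/rebind protocol, assuming an APH record
guarded by the hash's mutex/cond pair; all names are hypothetical.

  #include <pthread.h>
  #include <stddef.h>

  // Hypothetical APH record; the real one also stores the partition id etc.
  struct Aph_record_sketch
  {
    unsigned long n_usage;      // `N', ongoing groups on this partition
    size_t w_id;                // currently bound Worker
    bool wanted;                // Coordinator asked to be signalled at N == 0
  };

  // Coordinator side of steps a..c: wait until the other Worker drains the
  // partition, then rebind it to new_w_id and resume the parallel mode.
  void wait_and_rebind(Aph_record_sketch *rec, size_t new_w_id,
                       pthread_mutex_t *aph_lock, pthread_cond_t *aph_cond)
  {
    pthread_mutex_lock(aph_lock);
    rec->wanted= true;                       // a. request the signal
    while (rec->n_usage > 0)                 // b. wait for N to drop to zero
      pthread_cond_wait(aph_cond, aph_lock);
    rec->wanted= false;
    rec->w_id= new_w_id;                     // c. rebind the partition
    rec->n_usage= 1;                         //    N := 1 for the new group
    pthread_mutex_unlock(aph_lock);
  }

  // Worker side: after decrementing N at group commit, signal if requested.
  void worker_release(Aph_record_sketch *rec,
                      pthread_mutex_t *aph_lock, pthread_cond_t *aph_cond)
  {
    pthread_mutex_lock(aph_lock);
    if (rec->n_usage > 0 && --rec->n_usage == 0 && rec->wanted)
      pthread_cond_signal(aph_cond);
    pthread_mutex_unlock(aph_lock);
  }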
2. A special treatment to BEGIN Query-log-event
-------------------------------------------------
The BEGIN Query_log_event is regarded as the token that begins a group
of events.
It allocates a new item in GAQ which stays anonymous until the next
event reveals the partition/database id.
III. Interfaces for recovery
============================
0. GAQ from the RLI-commit point of view for WL#5599
----------------------------------------------------
The recovery implementation part, whose use cases include:
a. crash
b. STOP SLAVE
c. a slave thread is stopped
d. a slave thread errors out
A Commit Query-log-event or an XID event forces the Worker to
rearrange the NLWM, that is to remove the first items of GAQ as
garbage.
1. Executing recovery at starting the slave
-------------------------------------------
see recovery after the temporary failure in section II as well.
Read details on this section in WL#5599.
IV. USER Interfaces
====================
0. User interfaces to Coordinator
----------------------------------
a. set @@global.slave_parallel_workers= [0..ULONG],
Cmd line: YES
Scope: GLOBAL
Dynamic: YES (does not affect ongoing slave session)
Range: 0..1024
In case @@global.slave_parallel_workers == 0, execution will
be in the legacy sequential mode where SQL thread executes
events.
Changes to the variable become effective only after STOP SLAVE,
that is, the running pool size can't change.
b. @@global.slave_pending_jobs_size_max
Cmd line: YES
Scope: GLOBAL
Dynamic: YES (does not affect ongoing slave session)
Range: 0..ULONG
The maximum total size of all events being processed by all Workers.
The least possible value must not be less than the master-side
max_allowed_packet.
c. Recovery related (WL#5599), including
@@global.slave_checkpoint_group
Cmd line: YES
Scope: GLOBAL
Dynamic: YES (does not affect ongoing slave session)
Range: 0..ULONG
@@global.slave_checkpoint_period
Cmd line: YES
Scope: GLOBAL
Dynamic: YES (does not affect ongoing slave session)
Range: 0..ULONG
d. At STOP SLAVE Coordinator waits till all assigned groups of events
have been processed. Waiting is limited to a timeout (1 minute,
similarly to the Single-Threaded-Slave case).
e. KILL Query|Connection of Coordinator|Worker has an immediate
effect; that is, a Worker may not finish up the current ongoing group,
thereby letting the committed groups have some gaps.
Changes to SHOW SLAVE STATUS parameters
f. The last executed group of events coordinates, as described by
EXEC_MASTER_LOG_POS et al., are not updated per each group
(transaction) but rather per some number of processed groups of
events. The number can't be greater than
@@global.slave_checkpoint_group, and in any case the SBM updating
rate does not exceed @@global.slave_checkpoint_period.
g. Notice that the Seconds_Behind_Master (SBM) updating policy is
slightly different for MTS.
The status parameter is updated similarly to EXEC_MASTER_LOG_POS
et al. of p.f.
SBM is set to a new value after processing the terminal event
(e.g. Commit) of a group.
Coordinator resets SBM when it notices there are no more groups left
either to read from the relay log or to be processed by Workers.
1. User interfaces to Workers
------------------------------
a. Monitoring through a new P_S table of wl#5631
b. (not to be implemented here) Manual performance tuning
- to freeze (unfreeze) the Worker's communication channel, that is
to make it work through the current set of partitions
- to combine multiple partitions/databases considered to be mutually
dependent into one WQ
- to start multiple workers at the
slave bootstrap and manually associate each with a set of
partitions/databases