WL#5754: Query event parallel execution

Affects: Server-Prototype Only   —   Status: Complete

MOTIVATION
----------

This worklog tracks the changes needed to make the parallel
inter-transaction applier on the slave work when the master is
configured to log in STATEMENT format.

In a nutshell, on the master side it details the changes needed to log
every database involved in the execution of a statement. On the slave
side it addresses the work needed to support temporary tables when the
slave operates in parallel mode.


OBJECTIVE
---------

This worklog is a sibling of the base framework of WL#5569; it
implements full and effective parallelization for Query events.

FEATURES
--------

Transparency with respect to the engine type;

Query content is not restricted; in particular, a query can refer to
  user variables and/or modify a temporary table;

Parallelization is by the actual database name;

Cross-database updates are supported as well;

Automatic back-and-forth switching between PARALLEL and SEQUENTIAL in
  case of a partitioning conflict, e.g. when a query being distributed
  for execution updates a database that is in use by another Worker
  thread.



================================================================================
                    High-Level Specification
================================================================================

0. Parallelization for Query events follows the same semantics and rules,
   recovery included, and is controlled by the same user interface as
   WL#5569;

1. A query can include arbitrary features: call stored routines, select
   from or update a temporary table, refer to user variables, etc.;

2. Partitioning conflicts are detected transparently, with a fall-back to
   sequential execution in such cases;

3. Scale-up performance close to linear is to be demonstrated in certain
   use cases.


A notable difference between the Query event and the Rows event bundle is
that a Query event can update multiple databases yet carries no explicit
database names anywhere in its event header.
This lack of information is addressed by introducing and exploiting a new
status variable (Updated-db:s) of the Query-log-event class.
The names of the databases to be updated, including those touched by
involved stored routines, are all known at table-locking time; they are
gathered in a list that is eventually copied into the new Updated-db:s
status variable when the event is written to the binary log.
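
For illustration, the master-side gathering could look roughly like the
sketch below; the class name, the member updated_db_names and the helper
add_updated_db() are assumptions made for the sketch, not the worklog's
actual identifiers.

  /*
    Sketch only: master-side gathering of updated database names.
    THD_sketch, updated_db_names and add_updated_db() are hypothetical
    names used purely for illustration.
  */
  #include <set>
  #include <string>

  class THD_sketch {
  public:
    // Called at table-locking time for every write-locked table and for
    // every database touched by an invoked stored routine.
    void add_updated_db(const std::string &db_name) {
      updated_db_names.insert(db_name);   // duplicates collapse naturally
    }

    // Consulted later when the Query-log-event is constructed, to fill
    // the Updated-db:s status variable.
    const std::set<std::string> &get_updated_dbs() const {
      return updated_db_names;
    }

  private:
    std::set<std::string> updated_db_names;
  };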

Supporting the User-var, Rand and Int-var events is a rather
straightforward task: it only requires holding back the assignment of
those events until their parent query has shown up. The temporary
tables case is a special one.
For temporary tables, the Coordinator and the Workers continue to use
the current placeholder of all replicated temporary tables, but the
pieces of code that modify the placeholder are converted into critical
sections guarded by a new mutex.
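
To illustrate the "hold back" step, the Coordinator could buffer the
satellite events and hand them to the chosen Worker together with the
parent query. The queue and the helper below are assumptions for the
sketch, not the worklog's actual interface.

  /*
    Sketch only: deferring User_var/Int_var/Rand events until the parent
    Query event arrives.  deferred_events and schedule_to_worker() are
    illustrative assumptions.
  */
  #include <vector>

  class Log_event;        // stand-ins for the server classes
  class Slave_worker;

  class Coordinator_sketch {
  public:
    // Satellite events are not assigned immediately; they wait for the
    // parent query.
    void on_satellite_event(Log_event *ev) { deferred_events.push_back(ev); }

    // When the parent Query event shows up, the whole mini-group goes to
    // the same Worker so that the session context (user variables etc.)
    // is preserved.
    void on_query_event(Log_event *query_ev, Slave_worker *w) {
      for (Log_event *ev : deferred_events) schedule_to_worker(ev, w);
      schedule_to_worker(query_ev, w);
      deferred_events.clear();
    }

  private:
    // Placeholder for the WL#5569 assignment of an event to a Worker.
    void schedule_to_worker(Log_event *ev, Slave_worker *w) {
      (void)ev; (void)w;
    }
    std::vector<Log_event *> deferred_events;
  };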

I. Database name propagation and handling on the slave side

The Coordinator does not SQL-parse a Query-log-event; it relies instead
on the information prepared on the master side.
Gathering the list of updated databases splits into a few cases:

1. DML

   THD::decide_logging_format(tables)

2. CREATE [temporary] table 

   mysql_create_table()

3. DROP   table(s)

   mysql_rm_table_no_locks()

4. CREATE database

   mysql_create_db()

5. DROP   database

   mysql_rm_db()

6. CREATE sf, sp

   sp_create_routine()

7. DROP   sf, sp

   sp_drop_routine()

8. CREATE|DROP trigger

   mysql_create_or_drop_trigger()

9. ALTER, CREATE|DROP index

   mysql_alter_table() 

10. CREATE, DROP event

11. RENAME table(s)

12. ALTER|CREATE view, DROP VIEW(s)

13. GRANT, REVOKE
    (notice these two can affect multiple databases).

Notice that the remaining DDL queries, including ALTER DATABASE, ALTER
PROCEDURE, ALTER EVENT, and statements dealing with indexes, tablespaces
or the query logs, do not change any database.

In the 1st (DML) case the databases of all write-locked tables are of
interest. In the rest of the cases the database names are found in each
individual context.
The list of updated databases is stored into a new member of the thd, to
be consulted by the Query_log_event::Query_log_event() constructor, which
produces the Updated-db:s value stored into the Query-log-event's header
along with the other status variables.
Because the status area has to be of limited size, Updated-db:s is
subject to a maximum number of databases that it can hold.
If the number of databases exceeds that maximum, the list is not stored
and the event is applied sequentially on the slave.
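
A minimal sketch of how the constructor might serialize that list into
the status area follows; the layout shown (a one-byte database count
followed by NUL-terminated names) and the identifiers are assumptions
for illustration, not the definitive on-disk format.

  /*
    Sketch only: writing the Updated-db:s status variable.
    The 1-byte-count-plus-NUL-terminated-names layout and the identifiers
    are illustrative assumptions.
  */
  #include <cstddef>
  #include <cstring>
  #include <set>
  #include <string>

  static const unsigned char kMaxDbsInQueryMts = 16;   // stand-in default

  // Returns the number of bytes written into `buf`; writes only the
  // over-max marker when the list is too long to be stored (the slave
  // then applies the event sequentially).
  size_t write_updated_dbs(unsigned char *buf,
                           const std::set<std::string> &dbs) {
    if (dbs.size() > kMaxDbsInQueryMts) {
      buf[0] = 254;             // over-max marker: "too many databases"
      return 1;
    }
    unsigned char *p = buf;
    *p++ = static_cast<unsigned char>(dbs.size());
    for (const std::string &db : dbs) {
      std::memcpy(p, db.c_str(), db.size() + 1);  // name incl. trailing NUL
      p += db.size() + 1;
    }
    return p - buf;
  }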
The runtime maximum is defined as MAX_DBS_IN_QUERY_MTS. It can be modified
via cmake-configure and is subject to the constraint:

MAX_DBS_IN_QUERY_MTS < OVER_MAX_DBS_IN_QUERY_MTS == 254.

The latter constant is not meant to be changed. Its value is written into
the database-number part of the status variable in the above-mentioned
case where the actual number of databases exceeds MAX_DBS_IN_QUERY_MTS.
Notice that the slave compares *its own* MAX_DBS_IN_QUERY_MTS against the
number in the event, and if the latter is greater the event is handled
sequentially.
Sequential applying is likewise forced for Query events from an old
master or for events that don't have the new status variable set at all.
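
The slave-side decision could be expressed roughly as below; the function
name and the way the database count is obtained from the event are
assumptions for the sketch.

  /*
    Sketch only: deciding whether a received Query event can be scheduled
    in parallel.  slave_max_dbs mirrors the slave's MAX_DBS_IN_QUERY_MTS;
    the constant and the function name are illustrative assumptions.
  */
  static const unsigned int kOverMaxDbsInQueryMts = 254;

  bool must_apply_sequentially(unsigned int dbs_in_event,
                               bool has_updated_dbs_status,
                               unsigned int slave_max_dbs) {
    if (!has_updated_dbs_status)                 // old-master style event
      return true;
    if (dbs_in_event == kOverMaxDbsInQueryMts)   // master hit its limit
      return true;
    return dbs_in_event > slave_max_dbs;         // slave limit exceeded
  }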

At instantiation on the slave the event contains the list of databases and
the core distribution algorithm of WL#5569 is applied.
The new method Log_event::mts_get_dbs() returns the list of databases to
its caller inside Slave_worker* Log_event::get_slave_worker_id(), where the
list is looped through to invoke get_slave_worker() for each db.
If a db is occupied, the function waits until the db is released.
This is no different from waiting for a db needed by a new query of a
transaction that is being scheduled.
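
In pseudo-C++ the loop inside get_slave_worker_id() could look like the
sketch below. Only the method names mts_get_dbs(), get_slave_worker_id()
and get_slave_worker() come from the text above; the container type and
the exact signatures are assumptions.

  /*
    Sketch only: distributing a multi-database Query event over the
    WL#5569 per-db assignment routine.
  */
  #include <string>
  #include <vector>

  class Slave_worker;      // stand-ins for the server classes
  class Relay_log_info;

  Slave_worker *get_slave_worker_id_sketch(
      const std::vector<std::string> &dbs,   // result of mts_get_dbs()
      Relay_log_info *rli,
      // Hypothetical signature of the WL#5569 per-db assignment; the real
      // call blocks while the db is occupied by another Worker.
      Slave_worker *(*get_slave_worker)(const std::string &,
                                        Relay_log_info *)) {
    Slave_worker *assigned = nullptr;
    for (const std::string &db : dbs) {
      // Every db of the query must end up attached to the same Worker; a
      // conflict makes the call wait until the db is released.
      assigned = get_slave_worker(db, rli);
    }
    return assigned;
  }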

Notice that the conservative estimate of the total size of a replication
event, derived from @@max_allowed_packet and MAX_LOG_EVENT_HEADER, becomes
less precise because MAX_LOG_EVENT_HEADER is incremented by the maximum
size of the new status variable.
The increment is proportional to the maximum number of databases that the
status variable can hold and exceeds 1K (de facto the unit of measure of
@@max_allowed_packet) in the default case of MAX_DBS_IN_QUERY_MTS == 16.
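
For a rough worked example, assume each database name occupies at most
NAME_LEN bytes (192 in typical server sources) plus one terminating byte;
this per-name overhead is an assumption of the example, not a figure given
by the worklog:

  16 * (192 + 1) = 3088 bytes  >  1024 bytes (1K)

so the header increment indeed exceeds 1K with the default
MAX_DBS_IN_QUERY_MTS == 16.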


II. Temporary tables

On the slave side a Query-log-event is regarded as the parent of possible
satellite events that precede it in the binlog and prepare its execution
context.
The logic of scheduling such a mini-group of events is similar to the one
for the Table-map + Rows-log-event case implemented in WL#5569.

The temporary table case in a Query-log event is handled as follows.
The Coordinator SQL thread's thd->temporary_tables remains the placeholder
of all replicated temporary tables.
Look-up in the list, insertion into it and removal from it can be done by
any Worker and also by the Coordinator, thanks to a new mutex that is
acquired exclusively by the slave threads. Ordinary connection threads
don't grab the mutex.

The following cases represent all access possibilities for the slave threads: 

Insert:  open_table_uncached()   initial population
Look-up: open_table() - look-up for a temporary table
Release: mark_used_tables_as_free_for_reuse()
DROP:    find_temporary_table()
Removal: close_temporary_tables()

Any access to thd->temporary_tables by a slave thread should

1. make sure `thd' is the Coordinator's thd;
2. surround the accessing code with lock/unlock of rli->mts_temp_tables_lock
   (see the sketch below).
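
A minimal sketch of that access pattern, assuming a pthread-style mutex
named mts_temp_tables_lock on the relay-log-info object; apart from that
member name and thd->temporary_tables, the identifiers are illustrative.

  /*
    Sketch only: guarded access to the Coordinator's temporary-table list.
    THD_stub, Relay_log_info_sketch and the function are illustrative;
    mts_temp_tables_lock is the mutex named in the rules above.
  */
  #include <pthread.h>

  struct TABLE;                        // stand-in for the server's struct

  struct THD_stub {
    TABLE *temporary_tables;           // the shared placeholder list
  };

  struct Relay_log_info_sketch {
    THD_stub *info_thd;                // the Coordinator's thd
    pthread_mutex_t mts_temp_tables_lock;
  };

  // Prepend a freshly opened temporary table to the Coordinator's list
  // (the "Insert: open_table_uncached()" case from the list above).
  void link_temp_table_guarded(Relay_log_info_sketch *rli, TABLE *table,
                               TABLE **table_next /* table's next pointer */) {
    THD_stub *thd = rli->info_thd;                   // rule 1: Coordinator's thd
    pthread_mutex_lock(&rli->mts_temp_tables_lock);  // rule 2: take the mutex
    *table_next = thd->temporary_tables;             // link at the list head
    thd->temporary_tables = table;
    pthread_mutex_unlock(&rli->mts_temp_tables_lock);
  }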