WL#8599: Reduce contention in IO and SQL threads

Affects: Server-8.0 — Status: Complete

With 5.7 MTS the replication slave is able to execute more transactions
than the master, even on write-only workloads, as long as the SQL thread
is running without the IO thread.

If the IO thread is active, receiving events from the master and
writing them to the relay log, the overhead can reach 60%, which eats up
much of the benefit from a more efficient MTS scheduler.  This seems to
be caused by contention on MYSQL_RELAY_LOG::LOCK_log.

The main goal (G1) of this worklog is to reduce the contention between
receiver (a.k.a. I/O) and applier (a.k.a. SQL) replication threads.

Other References
----------------

BUG#77778: TUNING THE LOG_LOCK CONTENTION FOR IO_THREAD AND SQL_THREAD IN
RPL_SLAVE.CC

Functional requirements
-----------------------

F-1: Receiver (a.k.a. I/O) thread should not block read access to the
     relay log while queuing events.
F-2: Applier (a.k.a. SQL) thread should not block write access to the
     relay log while reading events.

Non-Functional requirements
---------------------------

NF-1: The performance of the replication threads should not be worse
      than before this worklog implementation.
NF-2: The overall replication performance (particularly the time to
      catch up) should be improved, allowing the slave to have greater
      TPS than the master when using MTS.

Interface Specification
-----------------------

I-1: No new files
I-2: No changes in existing syntax
I-3: No new commands
I-4: No new tools
I-5: No impact on existing functionality

Background
----------

- The relay log has multiple writers:
  - zero, one, or more clients may issue FLUSH RELAY LOGS;
  - one applier (a.k.a. SQL) thread (purging old logs);
  - one receiver (a.k.a. I/O) thread;
- The relay log has multiple readers:
  - zero, one, or more applier threads;
  - zero, one, or more clients may issue SHOW RELAYLOG EVENTS;

- Master_info::description_event has one writer:
  - one receiver thread;
- Master_info::description_event has multiple readers:
  - zero, one, or more clients may issue FLUSH RELAY LOGS;
  - one receiver thread reads it in many places;

- Other Master_info member fields have multiple writers:
  - zero, one, or more clients may issue FLUSH RELAY LOGS;
  - one receiver thread changes them in two places:
    - handle_slave_io(), when connecting to the master or stopping;
    - queue_event(), when queuing new events to the relay log;
- Other Master_info member fields have multiple readers:
  - zero, one, or more applier threads;
  - zero, one, or more clients may issue SHOW SLAVE STATUS;
  - zero, one, or more clients may query replication P_S tables;

- Currently:
  - All relay log operations are serialized by LOCK_log;
  - All access to Master_info member fields (including
    description_event) is serialized by
    Master_info::data_lock;
  - Locking order is: 1. data_lock, 2. LOCK_log;

High Level Specification
------------------------

P1) Contention between receiver and applier threads in the "hot log"

When the applier thread of a slave server applies events from the same
relay log file to which the receiver thread is appending the events
received from the master, the two threads contend with each other, and
both perform worse than they do when running alone.

According to the collected P_S data, the main issue is contention on
relay_log->LOCK_log (P1a).

The next_event() function acquires the relay log LOCK_log when the
applier is reading from the "hot_log", to prevent other threads from
rotating the relay log (and closing the current relay log file) while
the applier is in the middle of the "read_event" operation.

As the receiver thread acquires the relay log LOCK_log to write content
to the relay log, the use of this lock by the next_event() function
causes contention when the slave is almost caught up (the applier is
reading events from the tail of the relay log, and the receiver is
writing events to the same relay log file).

A second issue (P1b) is contention related to the GTID implementation,
on "wait/synch/rwlock/sql/gtid_commit_rollback".

For each Gtid_log_event or Anonymous_gtid_log_event received, the I/O
thread acquires the global SID lock to check whether the current server
GTID_MODE is correct for the event received.

For the Gtid_log_event case, it takes the global SID lock
opportunistically, because it will need to add the received GTID to the
retrieved GTID set (which is based on the global SID map).

For the Anonymous_gtid_log_event case, it also takes the global SID
lock, because it calls the get_gtid_mode() function specifying no lock.

The applier thread will also verify the GTID_MODE when applying those
events.

A third issue (P1c), affecting only the receiver thread, is that it is
taking the mi->data_lock and the rli->LOCK_log twice per event received:
once when queuing the event (writing it to the relay log), and once when
flushing master info after successfully queuing an event.

As the mi->data_lock is also taken by the monitoring threads (using
either SHOW SLAVE STATUS or the replication_connection_status P_S
table), monitoring the receiver thread status can also degrade the
performance of the receiver thread.

Rationale and proposed implementation steps
-------------------------------------------

The first step (S1) this worklog will implement is to use the same
approach implemented by the Binlog_sender (WL#5721) to dump binary log
contents to slave servers in MySQL 5.7. Instead of using a shared
IO_CACHE, the Binlog_sender has its own IO_CACHE, used only to read
events from the binary log. This step should solve P1a.

The second step (S2) is to give the retrieved GTID sets of the receiver
threads their own SID map/lock. Most of the time there is no relation
between the retrieved GTID sets and the server GTID state (both are used
together only when connecting the receiver thread to a master server
with the GTID auto positioning protocol). This step should also
implement a way of checking the current server GTID_MODE without always
relying on a given lock. This step should solve P1b.

The third step (S3) is to make the receiver thread flush master info
inside the queue_event function, while holding the mi->data_lock and
relay_log->LOCK_log. This should solve P1c.

Rationale behind step 2
-----------------------

Thinking about the impact on replication threads when GTID_MODE=ON, we have:

Each receiver thread needs to do three GTID related operations per GTID
transaction:

R1) Check if the GTID_MODE allows the GTID type (rdlock);
R2) Store the queued GTID to be used later (rdlock,
    but it may be upgraded to wrlock and downgraded again for new UUIDs);
R3) Add the queued GTID to the channel RETRIEVED_GTID_SET (rdlock);
Notice that R1 and R2 are done while holding the same lock.

Each applier (or worker) thread needs to do three GTID related operations per
GTID transaction:

W1) Check if the GTID_MODE allows the GTID type (rdlock);
W2) Acquire the ownership of the GTID (rdlock,
    but it may be upgraded to wrlock and downgraded again for new UUIDs);
W3) Add the GTID to GTID_EXECUTED when committed (rdlock);
Notice that W1 and W2 are done while holding the same lock.
Notice that W3 might be done once for many threads committing in a group.

Each binlog sender thread has to do one GTID related operation per GTID event
sent to a receiver:

S1) Check if the GTID_MODE allows the GTID type (rdlock);

Any query into performance_schema.replication_connection_status table will:

M1) Dump the RETRIEVED_GTID_SET for receiver threads (wrlock);

Any monitoring thread executing SHOW SLAVE STATUS will:

M2) Dump the RETRIEVED_GTID_SET for receiver threads (wrlock);
M3) Dump GTID_EXECUTED (wrlock);
Notice that M2 and M3 are done while holding the same lock.

The proposal will make R1, R2, R3, M1 and M2 rely on a channel-specific lock
instead of relying on the global SID lock.

Currently:

Any channel doing R1/R2/R3 is blocking:
B1.1) any thread doing M1 on same channel;
B1.2) any thread doing M1 on other channels;
B1.3) any thread doing M2/M3;

Any channel worker doing W1/W2/W3 is blocking:
B2.1) any thread doing M1 on any channel;
B2.2) any thread doing M2/M3;

Any monitoring statement doing M1 is blocking:
B3.1) same channel from doing R1/R2/R3;
B3.2) any other channel from doing R1/R2/R3;
B3.3) any worker from doing W1/W2/W3;
B3.4) any binlog sender from doing S1;
B3.5) any other thread doing M1 on same channel;
B3.6) any other thread doing M1 on other channels;
B3.7) any thread doing M2/M3;

Any monitoring statement doing M2/M3 is blocking:
B4.1) any channel from doing R1/R2/R3;
B4.2) any worker from doing W1/W2/W3;
B4.3) any binlog sender from doing S1;
B4.4) any thread doing M1;
B4.5) any other thread doing M2/M3;

With the proposed design, the blocking cases B1.2, B2.1, B3.2, B3.3, B3.4
and B3.6 no longer occur: the operations involved can run concurrently.
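
The effect of moving the retrieved GTID set to a channel-specific lock
can be illustrated with a minimal sketch. It is not the server code:
ChannelSketch, add_retrieved() and dump_retrieved() are hypothetical
names, GTIDs are reduced to plain integers, and std::shared_mutex
(C++17) stands in for the server's rwlock. A monitoring dump (M1) takes
the write lock of one channel only, so a receiver of another channel
(R3) is not blocked by it.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <set>
#include <shared_mutex>
#include <string>

// Hypothetical per-channel state: each channel owns its SID lock instead
// of sharing one global SID lock with every other channel.
struct ChannelSketch {
  std::shared_mutex sid_lock;               // channel-specific rwlock
  std::mutex set_mutex;                     // protects the set under rdlock
  std::set<std::uint64_t> retrieved_gtids;  // simplified RETRIEVED_GTID_SET

  // R3: the receiver adds a queued GTID while holding only a read lock
  // on its own channel lock.
  void add_retrieved(std::uint64_t gno) {
    std::shared_lock<std::shared_mutex> rdlock(sid_lock);
    std::lock_guard<std::mutex> guard(set_mutex);
    retrieved_gtids.insert(gno);
  }

  // M1: a monitoring statement dumps the set under the write lock of
  // this channel only; other channels are untouched.
  std::string dump_retrieved() {
    std::unique_lock<std::shared_mutex> wrlock(sid_lock);
    std::string out;
    for (std::uint64_t gno : retrieved_gtids) {
      if (!out.empty()) out += ',';
      out += std::to_string(gno);
    }
    return out;
  }
};
```

With a single global lock, holding the write lock for a dump on one
channel would stall add_retrieved() on every channel. In the sketch,
only callers using the same ChannelSketch instance wait for each other.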

Each step proposed in the HLS is largely independent of (or incremental
to) the other steps. Because of this we intend to split the code
review of this worklog into 5 reviews (one per proposed step).


Step 1: Use a separate IO_CACHE on SQL thread always
----------------------------------------------------
The infrastructure for binlog_end_pos handling is already coded, so
there is no need for a new variable to determine the relay log "limit"
for readers.

Most of the changes will be on the next_event() function
(sql/rpl_slave.cc) and on Relay_log_info::init_relay_log_pos().
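
The idea behind this step can be sketched as follows, under heavy
simplification. HotLogSketch and LogReaderSketch are hypothetical names,
not server classes: the relay log is modelled as a preallocated byte
buffer, the writer publishes an atomic end position (playing the role of
binlog_end_pos) after each complete event, and the reader keeps its own
cursor and never takes the writer's lock.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for the "hot" relay log file.
class HotLogSketch {
 public:
  explicit HotLogSketch(std::size_t capacity) : data_(capacity) {}

  // Receiver side: writers are serialized by the log lock. The new end
  // position is published only after the whole event has been copied.
  bool append(const std::string &event) {
    std::lock_guard<std::mutex> guard(lock_log_);
    const std::size_t end = end_pos_.load(std::memory_order_relaxed);
    if (event.size() > data_.size() - end) return false;  // buffer full
    std::copy(event.begin(), event.end(), data_.data() + end);
    end_pos_.store(end + event.size(), std::memory_order_release);
    return true;
  }

  std::size_t end_pos() const {
    return end_pos_.load(std::memory_order_acquire);
  }
  const char *bytes() const { return data_.data(); }

 private:
  std::mutex lock_log_;                  // plays the role of LOCK_log
  std::vector<char> data_;               // fixed size, never reallocated
  std::atomic<std::size_t> end_pos_{0};  // readable "limit" of the log
};

// Applier side: a private cursor, analogous to a private IO_CACHE.
class LogReaderSketch {
 public:
  explicit LogReaderSketch(const HotLogSketch &log) : log_(log) {}

  // Returns everything published since the last call, without locking.
  std::string read_available() {
    const std::size_t end = log_.end_pos();
    std::string out(log_.bytes() + pos_, end - pos_);
    pos_ = end;
    return out;
  }

 private:
  const HotLogSketch &log_;
  std::size_t pos_ = 0;
};
```

The reader only touches bytes below the published end position, and the
writer only touches bytes at or above it, so the two sides do not need a
common lock for the read path in this model. Log rotation, which the
real next_event() must also handle, is left out of the sketch.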


Step 2: Make retrieved GTID sets to use their own SID map/lock
--------------------------------------------------------------

We will create a new PSI_rwlock_key key_rwlock_receiver_sid_lock to be
used by the SID maps of the receiver threads.

Replace the global_sid_lock by the retrieved_gtid_set sid_lock on every
place that is dealing with the retrieved GTID set:
- SHOW SLAVE STATUS;
- replication_connection_status P_S table;
- MYSQL_BIN_LOG::init_gtid_sets();
- MYSQL_BIN_LOG::open_binlog();
- MYSQL_BIN_LOG::reset_logs();
- MYSQL_BIN_LOG::after_append_to_relay_log();
- Previous_gtids_log_event::Previous_gtids_log_event();
- Relay_log_info::add_logged_gtid();
- Relay_log_info::purge_relay_logs();
- recover_relay_log();
- request_dump();
- queue_event();
- Group Replication/Service Channel Interface;

In order to avoid acquiring the global SID lock to check GTID_MODE, as this
variable does not change often, we will introduce a global (atomic) counter
of how many times the GTID_MODE was changed since the server startup.

We will also introduce the Gtid_mode_copy class to hold a copy of the last
GTID_MODE to be returned without the need of acquiring locks if the
local GTID mode counter has the same value as the global atomic counter.
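
A minimal sketch of the counter-based copy described above follows. It
is an illustration under assumptions, not the actual Gtid_mode_copy
class: GtidModeCopySketch, set_gtid_mode_sketch(), the g_* globals and
the refreshes() helper are hypothetical, and a plain std::mutex stands
in for the lock that protects the real GTID_MODE.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

enum class GtidMode { OFF, OFF_PERMISSIVE, ON_PERMISSIVE, ON };

// Hypothetical globals standing in for the server state.
std::mutex g_gtid_mode_lock;                        // protects g_gtid_mode
GtidMode g_gtid_mode = GtidMode::OFF;
std::atomic<unsigned long> g_gtid_mode_counter{1};  // bumped on each change

// Changing the mode is rare: take the lock and bump the counter.
void set_gtid_mode_sketch(GtidMode mode) {
  std::lock_guard<std::mutex> guard(g_gtid_mode_lock);
  g_gtid_mode = mode;
  g_gtid_mode_counter.fetch_add(1, std::memory_order_release);
}

// Per-thread cached copy. The lock is taken only when the global counter
// differs from the counter seen at the last refresh.
class GtidModeCopySketch {
 public:
  GtidMode get() {
    const unsigned long current =
        g_gtid_mode_counter.load(std::memory_order_acquire);
    if (current != local_counter_) {
      std::lock_guard<std::mutex> guard(g_gtid_mode_lock);
      mode_ = g_gtid_mode;
      // Re-read under the lock so the cached mode and counter match.
      local_counter_ = g_gtid_mode_counter.load(std::memory_order_acquire);
      ++refreshes_;
    }
    return mode_;
  }

  unsigned refreshes() const { return refreshes_; }  // for illustration

 private:
  GtidMode mode_ = GtidMode::OFF;
  unsigned long local_counter_ = 0;  // 0 forces a refresh on first use
  unsigned refreshes_ = 0;
};
```

In the common case, where GTID_MODE has not changed, get() costs one
atomic load and no lock acquisition. Each thread is assumed to own its
copy, so the cached fields themselves need no synchronization.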


Step 3: Optimize mi->data_lock/relay_log->LOCK_log acquires per event
---------------------------------------------------------------------

This should only affect the queue_event() and handle_slave_io() functions.

The main idea is to add a new parameter to flush_master_info() in order
to tell it to not acquire locks (asserting that the locks are already
taken).

Then, move the flush_master_info() call from handle_slave_io() to
queue_event(), to a place where the locks are already acquired.

Finally, we need to make queue_event() return two possible errors:
error while queuing, and error while flushing info.

The return code will be handled by the caller, which throws the error
according to the failure.
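
The shape of this change can be sketched as below. The code is a
simplified model, not the server implementation: MasterInfoSketch,
QueueEventResult, the *_sketch functions and the fail_flush test hook
are hypothetical, and the real functions take different parameters. The
sketch keeps the documented locking order (data_lock, then LOCK_log)
and flushes while both locks are still held.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical return codes for the two failure cases.
enum class QueueEventResult { OK, ERROR_QUEUING, ERROR_FLUSHING_INFO };

// Hypothetical, heavily reduced model of Master_info plus the relay log.
struct MasterInfoSketch {
  std::mutex data_lock;                // plays the role of mi->data_lock
  std::mutex log_lock;                 // plays the role of LOCK_log
  std::vector<std::string> relay_log;  // queued events
  unsigned long long master_log_pos = 0;
  unsigned long long flushed_pos = 0;  // position persisted by the flush
  bool fail_flush = false;             // hook to simulate a flush failure
};

// Returns true on success. With need_lock == false the caller must
// already hold data_lock and log_lock (std::mutex cannot assert this).
bool flush_master_info_sketch(MasterInfoSketch &mi, bool need_lock) {
  std::unique_lock<std::mutex> data_guard(mi.data_lock, std::defer_lock);
  std::unique_lock<std::mutex> log_guard(mi.log_lock, std::defer_lock);
  if (need_lock) {
    data_guard.lock();  // locking order: 1. data_lock
    log_guard.lock();   //                2. LOCK_log
  }
  if (mi.fail_flush) return false;
  mi.flushed_pos = mi.master_log_pos;
  return true;
}

// Queues one event and flushes master info under a single acquisition
// of both locks, instead of locking twice per event.
QueueEventResult queue_event_sketch(MasterInfoSketch &mi,
                                    const std::string &event) {
  std::lock_guard<std::mutex> data_guard(mi.data_lock);
  std::lock_guard<std::mutex> log_guard(mi.log_lock);
  if (event.empty()) return QueueEventResult::ERROR_QUEUING;
  mi.relay_log.push_back(event);
  mi.master_log_pos += event.size();
  if (!flush_master_info_sketch(mi, /*need_lock=*/false))
    return QueueEventResult::ERROR_FLUSHING_INFO;
  return QueueEventResult::OK;
}
```

A caller in the role of handle_slave_io() would switch on the returned
value and report either a queuing error or a flush error. Callers that
do not hold the locks keep passing need_lock == true.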