WL#10406: Improve usability when receiver thread is waiting for disk space

Affects: Server-8.0   —   Status: Complete

The receiver thread blocks when waiting for disk space. Querying SHOW SLAVE
STATUS (or the replication P_S tables related to the receiver thread) also
blocks, and even if it did not block, it would not show that the thread is
waiting for disk space to continue its work. Finally, killing the receiver
thread while it is waiting for disk space will probably produce a failure in the
applier thread, as the content already written to the relay log is not truncated
(the relay log will probably have an incomplete event in it).

The first goal (G1) of this worklog is to make replication
monitoring work even when the receiver thread is waiting for disk
space.

The second goal (G2) of this worklog is to make the receiver thread
safely stoppable even when waiting for disk space.

User Stories
------------

As a MySQL DBA, I want to be able to inspect the receiver thread status even
when it is waiting for disk space. I can use either "SHOW SLAVE STATUS" or check
the thread state using "SHOW PROCESSLIST". The status should display a
meaningful message like "Waiting for disk space".

As a MySQL DBA, knowing that the receiver thread is waiting for disk space, I
want to be able to stop it without affecting the applier thread. The receiver
thread should stop promptly (within a few seconds) and no "partial events"
should be left in the last active relay log file.

Functional requirements
-----------------------

F-1: Receiver thread should be safely stoppable even when waiting for disk
     space.
F-2: Replication monitoring should not be blocked by the receiver thread
     waiting for disk space.

Interface Specification
-----------------------

I-1: No new files
I-2: No changes in existing syntax
I-3: No new commands
I-4: No new tools
I-5: No impact on existing functionality

Background
----------

- The relay log has multiple writers:
  - zero, one, or more clients may issue FLUSH RELAY LOGS;
  - one applier (a.k.a. SQL) thread (purging old logs);
  - one receiver (a.k.a. I/O) thread;
- The relay log has multiple readers:
  - zero, one, or more applier threads;
  - zero, one, or more clients may issue SHOW RELAYLOG EVENTS;

- Master_info::description_event has one writer:
  - one receiver thread;
- Master_info::description_event has multiple readers:
  - zero, one, or more clients may issue FLUSH RELAY LOGS;
  - one receiver thread reads it in many places;

- Other Master_info member fields have multiple writers:
  - zero, one, or more clients may issue FLUSH RELAY LOGS;
  - one receiver thread changes them in two places:
    - handle_slave_io(), when connecting to the master or stopping;
    - queue_event(), when queuing new events to the relay log;
- Other Master_info member fields have multiple readers:
  - zero, one, or more applier threads;
  - zero, one, or more clients may issue SHOW SLAVE STATUS;
  - zero, one, or more clients may query replication P_S tables;

- Currently:
  - All relay log operations are serialized by LOCK_log;
  - All access to Master_info member fields (including
    description_event) is serialized by
    Master_info::data_lock;
  - Locking order is: 1. data_lock, 2. LOCK_log;

High Level Specification
------------------------

P1) Receiver blocked waiting for disk space

A receiver thread queuing an event may block when there is no space left
on disk to write the event into the current relay log file.

Currently, the receiver blocks while holding both the mi->data_lock and the
relay_log->LOCK_log.

Monitoring threads cannot collect the receiver thread status when the
receiver thread is blocked waiting for disk space (P1a), as they acquire
the mi->data_lock in order to display consistent information about the
receiver thread.

An applier thread applying events from old relay log files (cold ones)
may need to purge those old relay log files in order to free disk space
and unblock the receiver thread, but the applier cannot purge relay log
files while the receiver is holding the mi->data_lock (still P1a).

A receiver thread cannot be stopped "safely" while waiting for disk
space (P1b). It would only be able to process the "kill" signal after
writing the full event to the relay log file.

Rationale and proposed implementation steps
-------------------------------------------

The first step (S1) is to make the receiver thread not hold the
mi->data_lock while writing events to the relay log file. This would
allow monitoring threads to query the receiver thread status and would
also allow the applier threads to purge old relay log files, freeing
disk space that can potentially unblock the receiver thread. This step
should solve P1a.

The second step (S2) is to make the receiver thread not use the
MY_WAIT_IF_FULL flag when initializing the IO_CACHE used to write to the
relay log. This would allow partial events in the relay logs in the case
of a full disk, but as the reader threads are not allowed to read past a
given position (controlled by the relay_log->binlog_end_pos variable),
having the "temporary" partial event is not an issue. If the receiver
thread is requested to stop while waiting for disk space, it will
truncate the current relay log at relay_log->binlog_end_pos before
(releasing relay_log->LOCK_log and) stopping. This step should solve
P1b.

Step 1: Minimize receiver use of mi->data_lock while queuing events
-------------------------------------------------------------------

The idea is to hold the mi->data_lock only while updating the Master_info
information (master positions) after a successful write of an event into the
relay log file. This should guarantee that SHOW SLAVE STATUS will not display
inconsistent information about the receiver thread status and will not be
blocked when the I/O thread is waiting for disk space.
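
The locking pattern described in Step 1 can be sketched as follows. This is a
minimal illustration using standard C++ mutexes, not the server code:
MasterInfoSketch, RelayLogSketch and queue_event_sketch() are hypothetical
names, and the write itself is passed in as a callback so that a blocking
write can be simulated. The sketch assumes LOCK_log is released before
data_lock is taken, which keeps it clear of the documented locking order.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>

// Hypothetical stand-in for the position fields of Master_info.
struct MasterInfoSketch {
  std::mutex data_lock;              // plays the role of mi->data_lock
  std::uint64_t master_log_pos = 0;  // guarded by data_lock
};

// Hypothetical stand-in for the relay log.
struct RelayLogSketch {
  std::mutex lock_log;  // plays the role of relay_log->LOCK_log
};

// Queue one event. write_event performs the (possibly blocking) write to the
// relay log and returns true on success.
bool queue_event_sketch(MasterInfoSketch &mi, RelayLogSketch &relay_log,
                        std::uint64_t event_len,
                        const std::function<bool()> &write_event) {
  assert(static_cast<bool>(write_event));
  {
    // Only LOCK_log is held during the write. If the disk is full, the
    // receiver blocks here without holding data_lock, so monitoring
    // threads can still read Master_info.
    std::lock_guard<std::mutex> log_guard(relay_log.lock_log);
    if (!write_event()) return false;
  }
  // data_lock is held only for the short position update.
  std::lock_guard<std::mutex> data_guard(mi.data_lock);
  mi.master_log_pos += event_len;
  return true;
}
```

With this shape, a thread that takes data_lock to report the receiver status
waits at most for the position update, never for the disk.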


Step 2: Make receiver thread stoppable while waiting for disk space
-------------------------------------------------------------------

The general idea is to allow the receiver thread to: (2a) display its process
state as "Waiting for disk space" when there is no space left on the device the
relay log is being written to; and (2b) stop promptly when a STOP SLAVE
[IO_THREAD] is issued while the I/O thread is waiting for disk space,
truncating the relay log file to remove any partially written event.

In order to implement 2a, we will introduce "enter_stage_hook" to make
my_write() able to set the current thread stage to "Waiting for disk space"
before calling the wait_for_free_space() function and to restore the previous
thread stage after the function call. This will make any thread waiting for
disk space in my_write() report this information not only in the error log but
also in the thread status interfaces (performance schema tables, SHOW SLAVE
STATUS, etc.).
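
The set-and-restore behavior of the hook can be sketched as below. Only the
name "enter_stage_hook" and the "Waiting for disk space" stage come from this
worklog; the hook signature, the thread-local current_stage variable, the
"Queueing event" placeholder stage and the wrapper function are assumptions
made for illustration and do not reflect the actual mysys interface.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical per-thread stage, with a placeholder initial value.
thread_local std::string current_stage = "Queueing event";

// Hypothetical hook signature: install new_stage, return the previous stage.
using enter_stage_hook_t = std::string (*)(const std::string &new_stage);

// Unset by default, so code running without a hook sees no stage change.
enter_stage_hook_t enter_stage_hook = nullptr;

// One possible hook implementation, operating on current_stage.
std::string set_current_stage(const std::string &new_stage) {
  std::string previous = current_stage;
  current_stage = new_stage;
  return previous;
}

// Stand-in for the part of my_write() that waits on a full disk. The
// callback plays the role of wait_for_free_space().
void wait_for_free_space_with_stage(
    const std::function<void()> &wait_for_free_space) {
  assert(static_cast<bool>(wait_for_free_space));
  const enter_stage_hook_t hook = enter_stage_hook;
  std::string previous;
  if (hook) previous = hook("Waiting for disk space");
  wait_for_free_space();
  if (hook) hook(previous);  // restore the stage seen before the wait
}
```

A server component would install the hook once at startup (here, by assigning
set_current_stage to enter_stage_hook); callers that never install it keep
their current behavior.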

Also, we will have to create a function to truncate the relay log at
"binlog_end_pos" when the receiver thread is stopped (probably in the cleanup
of the handle_slave_io() function).
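
A minimal sketch of such a truncation, assuming C++17 std::filesystem rather
than the server's own file layer. The RelayLogSketch class and its methods are
hypothetical; only the idea of cutting the file back to binlog_end_pos (the end
of the last completely written event) comes from this worklog.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <utility>

// Hypothetical model of a relay log file plus its binlog_end_pos.
class RelayLogSketch {
 public:
  explicit RelayLogSketch(std::filesystem::path file) : file_(std::move(file)) {
    std::ofstream create(file_, std::ios::binary | std::ios::trunc);
    assert(create.is_open());
  }

  // A complete event: write it and advance binlog_end_pos.
  bool append_event(const std::string &bytes) {
    if (!write_bytes(bytes)) return false;
    binlog_end_pos_ += bytes.size();
    return true;
  }

  // Models a write cut short by a full disk: bytes reach the file but
  // binlog_end_pos does not move, so readers never see them.
  bool append_partial(const std::string &bytes) { return write_bytes(bytes); }

  // Drop everything after binlog_end_pos. Returns false on I/O error or if
  // the file is shorter than binlog_end_pos (inconsistent state).
  bool truncate_at_end_pos() {
    std::error_code ec;
    const std::uintmax_t size = std::filesystem::file_size(file_, ec);
    if (ec || size < binlog_end_pos_) return false;
    if (size == binlog_end_pos_) return true;  // nothing to remove
    std::filesystem::resize_file(file_, binlog_end_pos_, ec);
    return !ec;
  }

  std::uintmax_t binlog_end_pos() const { return binlog_end_pos_; }

 private:
  bool write_bytes(const std::string &bytes) {
    std::ofstream out(file_, std::ios::binary | std::ios::app);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    out.flush();
    return static_cast<bool>(out);
  }

  std::filesystem::path file_;
  std::uintmax_t binlog_end_pos_ = 0;
};
```

In this model the receiver cleanup would call truncate_at_end_pos() once
before exiting, so that the applier never finds a partial event at the end of
the last relay log file.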