WL#10406: Improve usability when receiver thread is waiting for disk space
Affects: Server-8.0
Status: Complete
The receiver thread blocks while it waits for disk space. Querying SHOW SLAVE STATUS (or the replication P_S tables related to the receiver thread) also blocks, and even if it did not block, it would not show that the thread is waiting for disk space to continue its work. Finally, killing the receiver thread while it is waiting for disk space will probably cause a failure in the applier thread, because the content already written to the relay log is not truncated (the relay log will probably contain an incomplete event).

The first goal (G1) of this worklog is to make replication monitoring work even when the receiver thread is waiting for disk space.

The second goal (G2) of this worklog is to make the receiver thread safely stoppable even when it is waiting for disk space.

User Stories
------------

As a MySQL DBA, I want to be able to inspect the receiver thread status even when it is waiting for disk space. I can use either "SHOW SLAVE STATUS" or check the thread state using "SHOW PROCESSLIST". The status should display a meaningful message like "Waiting for disk space".

As a MySQL DBA, knowing that the receiver thread is waiting for disk space, I want to be able to stop it without affecting the applier thread. The receiver thread should stop promptly (within a few seconds) and no "partial events" should be left in the last active relay log file.
Functional requirements
-----------------------

F-1: Receiver thread should be safely stoppable even when waiting for disk space.

F-2: Replication monitoring should not be blocked by the receiver thread waiting for disk space.
Interface Specification
-----------------------

I-1: No new files
I-2: No changes in existing syntax
I-3: No new commands
I-4: No new tools
I-5: No impact on existing functionality

Background
----------

- The relay log has multiple writers:
  - zero, one, or more clients may issue FLUSH RELAY LOGS;
  - one applier (a.k.a. SQL) thread (purging old logs);
  - one receiver (a.k.a. I/O) thread;
- The relay log has multiple readers:
  - zero, one, or more applier threads;
  - zero, one, or more clients may issue SHOW RELAYLOG EVENTS;
- Master_info::description_event has one writer:
  - one receiver thread;
- Master_info::description_event has multiple readers:
  - zero, one, or more clients may issue FLUSH RELAY LOGS;
  - one receiver thread reads it in many places;
- Other Master_info member fields have multiple writers:
  - zero, one, or more clients may issue FLUSH RELAY LOGS;
  - one receiver thread changes them in two places:
    - handle_slave_io(), when connecting to the master or stopping;
    - queue_event(), when queuing new events to the relay log;
- Other Master_info member fields have multiple readers:
  - zero, one, or more applier threads;
  - zero, one, or more clients may issue SHOW SLAVE STATUS;
  - zero, one, or more clients may query replication P_S tables;
- Currently:
  - All relay log operations are serialized by LOCK_log;
  - All access to Master_info member fields (including description_event) is serialized by Master_info::data_lock;
  - Locking order is: 1. data_lock, 2. LOCK_log;

High Level Specification
------------------------

P1) Receiver blocked waiting for disk space

A receiver thread queuing an event may block when there is no space left on disk to write the event into the current relay log file. Currently, the receiver blocks while holding mi->data_lock and relay_log->LOCK_log.
Monitoring threads cannot collect the receiver thread status while the receiver thread is blocked waiting for disk space (P1a), because they acquire mi->data_lock in order to display consistent information about the receiver thread.

An applier thread applying events from old relay log files (cold ones) may need to purge those files in order to free disk space and unblock the receiver thread, but the applier cannot purge relay log files while the receiver is holding mi->data_lock (still P1a).

A receiver thread cannot be stopped "safely" while waiting for disk space (P1b). It can only process the "kill" signal after writing the full event to the relay log file.

Rationale and proposed implementation steps
-------------------------------------------

The first step (S1) is to make the receiver thread not hold mi->data_lock while writing events to the relay log file. This would allow monitoring threads to query the receiver thread status and would also allow the applier threads to purge old relay log files, freeing disk space that can potentially unblock the receiver thread. This step should solve P1a.

The second step (S2) is to make the receiver thread not use the MY_WAIT_IF_FULL flag when initializing the IO_CACHE used to write to the relay log. This would allow partial events in the relay log when the disk is full, but because reader threads are not allowed to read past a given position (controlled by the relay_log->binlog_end_pos variable), having a "temporary" partial event is not an issue. If the receiver thread is requested to stop while waiting for disk space, it will truncate the current relay log at relay_log->binlog_end_pos before (releasing relay_log->LOCK_log and) stopping. This step should solve P1b.
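The two steps (S1 and S2) can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the server code: RelayLogModel, MasterInfoModel, capacity, readable(), truncate_to_end_pos() and read_master_log_pos() are hypothetical names invented for the illustration, and the "disk" is an in-memory string with a byte limit. Only the roles of data_lock, LOCK_log and binlog_end_pos are taken from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <string>

// Hypothetical model of a relay log file on a device with limited space.
struct RelayLogModel {
  std::mutex lock_log;       // stands in for relay_log->LOCK_log
  std::string file;          // bytes physically written so far
  std::size_t end_pos = 0;   // stands in for relay_log->binlog_end_pos
  std::size_t capacity = 0;  // bytes the modelled "disk" can hold

  // S2: write as much as fits and return instead of waiting for space.
  // Returns true only when the whole event was written; end_pos advances
  // only in that case, so a partial event stays beyond end_pos.
  bool write_event(const std::string &event) {
    std::lock_guard<std::mutex> guard(lock_log);
    assert(end_pos <= file.size());
    file.resize(end_pos);  // model assumption: drop a leftover partial event
    std::size_t room = capacity > file.size() ? capacity - file.size() : 0;
    std::size_t n = event.size() < room ? event.size() : room;
    file.append(event, 0, n);
    if (n < event.size()) return false;  // disk full: partial event on disk
    end_pos = file.size();
    return true;
  }

  // What a reader may see: nothing past end_pos.
  std::string readable() {
    std::lock_guard<std::mutex> guard(lock_log);
    return file.substr(0, end_pos);
  }

  // Intended to run when the receiver stops: remove any partial event.
  void truncate_to_end_pos() {
    std::lock_guard<std::mutex> guard(lock_log);
    file.resize(end_pos);
  }
};

// Hypothetical model of the receiver state protected by mi->data_lock.
struct MasterInfoModel {
  std::mutex data_lock;  // stands in for mi->data_lock
  std::uint64_t master_log_pos = 4;
};

// S1: write the event without holding data_lock, then take data_lock only
// to update the master position after a successful write.
bool queue_event(MasterInfoModel &mi, RelayLogModel &log,
                 const std::string &event, std::uint64_t inc_pos) {
  if (!log.write_event(event)) return false;  // data_lock is not held here
  std::lock_guard<std::mutex> guard(mi.data_lock);
  mi.master_log_pos += inc_pos;
  return true;
}

// What a monitoring thread would do: it needs data_lock only briefly.
std::uint64_t read_master_log_pos(MasterInfoModel &mi) {
  std::lock_guard<std::mutex> guard(mi.data_lock);
  return mi.master_log_pos;
}
```

In this model a failed write leaves bytes in the file beyond end_pos, readers never see them, and the master position is not advanced, so a status query returns the last consistent value. Calling truncate_to_end_pos() before stopping leaves the file ending on an event boundary.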
Step 1: Minimize receiver use of mi->data_lock while queuing events
-------------------------------------------------------------------

The idea is to hold mi->data_lock only while updating the Master_info information (master positions) after an event has been successfully written to the relay log file.

This should guarantee that SHOW SLAVE STATUS will not display inconsistent information about the receiver thread status and will not be blocked when the I/O thread is waiting for disk space.

Step 2: Make receiver thread stoppable while waiting for disk space
-------------------------------------------------------------------

The general idea is to allow the receiver thread to:

(2a) display its process state as "Waiting for disk space" when there is no space left on the device the relay log is being written to; and

(2b) stop promptly when STOP SLAVE [IO_THREAD] is issued while the I/O thread is waiting for disk space, truncating the relay log file to remove any partially written event.

In order to implement 2a, we will introduce an "enter_stage_hook" that lets my_write() set the current thread stage to "Waiting for disk space" before calling the wait_for_free_space() function and restore the previous thread stage after the call. This will make any thread waiting for disk space in my_write() report this information not only in the error log but also in the thread status interfaces (performance schema tables, SHOW SLAVE STATUS, etc.).

Also, we will have to create a function to truncate the relay log at "binlog_end_pos" when the receiver thread is stopped (probably in the cleanup of the handle_slave_io() function).
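The stage handling described for 2a (set "Waiting for disk space" before the wait, restore the previous stage afterwards) can be sketched with a scope guard. This is a standalone illustration, not the actual hook: current_stage, ScopedStage and write_with_stage are hypothetical names, the two callbacks stand in for the write attempt and for wait_for_free_space(), and the rule that the wait callback returns false when a stop was requested is an assumption of the sketch.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical per-thread stage, standing in for the server's thread stage.
thread_local std::string current_stage = "Queueing event";

// Scope guard: enters a stage and restores the previous one on scope exit.
class ScopedStage {
 public:
  explicit ScopedStage(std::string stage) : previous_(current_stage) {
    current_stage = std::move(stage);
  }
  ~ScopedStage() { current_stage = previous_; }
  ScopedStage(const ScopedStage &) = delete;
  ScopedStage &operator=(const ScopedStage &) = delete;

 private:
  std::string previous_;
};

// Model of a write that retries while the device is full.
// try_write returns true once the write succeeded.
// wait_for_space models the wait; in this sketch it returns false when the
// thread was asked to stop while waiting.
bool write_with_stage(const std::function<bool()> &try_write,
                      const std::function<bool()> &wait_for_space) {
  while (!try_write()) {
    ScopedStage stage("Waiting for disk space");  // visible during the wait
    if (!wait_for_space()) return false;          // stop requested
  }  // the previous stage is restored at the end of each iteration
  return true;
}
```

Using a scope guard means the previous stage is restored on every exit path, including the early return taken when the thread is asked to stop.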