MySQL 8.4.2
Source Code Documentation
Background redo log threads

Four background log threads are responsible for writes of new data to disk:

  1. Log writer - writes from the log buffer or write-ahead buffer to OS buffers.
  2. Log flusher - writes from OS buffers to disk (fsyncs).
  3. Log write_notifier - notifies user threads about completed writes to disk (when write_lsn is advanced).
  4. Log flush_notifier - notifies user threads about completed fsyncs (when flushed_to_disk_lsn is advanced).

One background log thread is responsible for checkpoints (reclaiming space in log files):

  1. Log checkpointer - determines log.available_for_checkpoint_lsn and writes checkpoints.

Thread: log writer

This thread is responsible for writing data from the log buffer to the log files. It only copies data to the OS buffers and does not call fsync(); calling fsync() is the responsibility of the log flusher thread.

The log writer thread has to address the following points:

  1. Find out how much data is ready in the log buffer, which is concurrently filled in by multiple user threads.

    In the log recent written buffer, user threads set links for every finished write to the log buffer. Each such link is represented as a number of bytes written, starting from a start_lsn. The link is stored in the slot assigned to the start_lsn of the write.

    The log writer thread tracks links in the recent written buffer, traversing a connected path created by the links. It stops when it encounters a missing outgoing link. In such case the next fragment of the log buffer is still being written (or the maximum assigned lsn was reached).

    It also stops as soon as it has traversed more than 4kB, which is enough for the next write (unless we decide again to do fsyncs from inside the log writer thread). After traversing links and clearing the slots occupied by the links (in the recent written buffer), the log writer thread updates log.buf_ready_for_write_lsn.

    Example of links in the recent written buffer

    Note
    The log buffer has no holes up to the log.buf_ready_for_write_lsn (all concurrent writes for smaller lsn have been finished).

    If there were no links to traverse, log.buf_ready_for_write_lsn was not advanced and the log writer thread needs to wait. In such case it first uses spin delay and afterwards switches to wait on the writer_event.

  2. Prepare log blocks for writing - update their headers and footers.

    The log writer thread detects completed log blocks in the log buffer. Such log blocks will not receive any more writes. Hence their headers and footers could be easily updated (e.g. checksum is calculated).

    Complete blocks are detected and written

    If any complete blocks were detected, they are written directly from the log buffer (after updating headers and footers). Afterwards the log writer thread retries the previous step before making next decisions. For each write consisting of one or more complete blocks, the MONITOR_LOG_FULL_BLOCK_WRITES is incremented by one.

    Note
    There is a special case - when write-ahead is required, data needs to be copied to the write-ahead buffer and the last incomplete block could also be copied and written. For details read below and check the next point.

    The special case is also for the last, incomplete log block. Note that log.buf_ready_for_write_lsn could be in the middle of such block. In such case, next writes are likely incoming to the log block.

    Incomplete block is copied

    For performance reasons we often need to write the last incomplete block. It turned out that waiting user threads should be released as soon as possible, so that they can handle the next transactions and provide more data.

    In such case:

    • the last log block is first copied to the dedicated buffer, up to the log.buf_ready_for_write_lsn,
    • the remaining part of the block in the dedicated buffer is filled in with 0x00 bytes,
    • header fields are updated,
    • checksum is calculated and stored in the block's footer,
    • the block is written from the dedicated buffer,
    • the MONITOR_LOG_PARTIAL_BLOCK_WRITES is incremented by one.

      Note
      The write-ahead buffer is used as the dedicated buffer for writes of the last incomplete block. That's because, whenever we needed a next write-ahead (even for complete blocks), we possibly can also write the last incomplete block during the write-ahead. The monitor counters for full/partial block writes are incremented before the logic related to writing ahead is applied. Hence the counter of partial block writes is not incremented if a full block write was possible (in which case only requirement for write-ahead could be the reason of writing the incomplete block).
      Remarks
      The log writer thread never updates the first_rec_group field. It has to be set by user threads before the block is allowed to be written, because only user threads know where the boundaries between groups of log records are. The user thread that has written data ending at the lsn which first_rec_group needs to point to is the one responsible for setting the field. A user thread that has written exactly up to the end of a log block is considered to end at the lsn just after the header of the next log block. That's because after such a write, the log writer is allowed to write the next, empty log block (buf_ready_for_write_lsn then points to that lsn). The first_rec_group field is updated before the link is added to the log recent written buffer.
  3. Avoid read-on-write issue.

    The log writer thread is also responsible for writing ahead to avoid the read-on-write problem. It tracks up to which point the write ahead has been done. When a write would go further:

    • If we were trying to write more than size of single write-ahead region, we limit the write to completed write-ahead sized regions, and postpone writing the last fragment for later (retrying with the first step and updating the buf_ready_for_write_lsn).

      Note
      If we needed to write complete regions of write-ahead bytes, they are ready in the log buffer and could be written directly from there. Such writes would not cause read-on-write problem, because size of the writes is divisible by write-ahead region.
    • Else, we copy data to special write-ahead buffer, from which we could safely write the whole single write-ahead sized region. After copying the data, the write-ahead buffer is completed with 0x00 bytes.

      Note
      The write-ahead buffer is also used for copying the last incomplete log block, which was described in the previous point.
  4. Update write_lsn.

    After doing a single write (a single log_data_blocks_write()), the log writer thread updates log.write_lsn and falls back to its main loop. That's because a lot more data could have been prepared in the meantime, as the write operation could take significant time.

    That's why the general rule is that after doing log_data_blocks_write(), we need to update log.buf_ready_for_write_lsn before making next decisions on how much to write within next such call.

  5. Notify log write_notifier thread using os_event_set() on the write_notifier_event.

    See also
    Waiting until log has been written to disk
  6. Notify log flusher thread using os_event_set() on the flusher_event.
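
The link traversal from point 1 can be sketched as follows. This is a minimal single-threaded model, not the InnoDB implementation: the class RecentWritten and its methods add_link() and advance_ready_lsn() are hypothetical names, the slot index is assumed to be start_lsn modulo the number of slots, and the atomics needed for concurrent user threads are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using lsn_t = std::uint64_t;

// Hypothetical model of the recent written buffer. Each slot holds the
// length of a finished write starting at the lsn mapped to that slot;
// 0 means "no outgoing link yet".
class RecentWritten {
 public:
  RecentWritten(std::size_t slots, lsn_t start_lsn)
      : links_(slots, 0), ready_lsn_(start_lsn) {}

  // User-thread side: record that [start_lsn, end_lsn) has been copied
  // to the log buffer.
  void add_link(lsn_t start_lsn, lsn_t end_lsn) {
    assert(end_lsn > start_lsn);
    assert(start_lsn >= ready_lsn_);
    // The start of the write must fall inside the window of slots.
    assert(start_lsn - ready_lsn_ < links_.size());
    links_[start_lsn % links_.size()] = end_lsn - start_lsn;
  }

  // Log-writer side: follow the connected path of links, clearing each
  // visited slot. Stops at a missing link or once more than
  // max_traverse bytes (4kB in the text) have been traversed.
  lsn_t advance_ready_lsn(lsn_t max_traverse = 4096) {
    lsn_t traversed = 0;
    while (traversed <= max_traverse) {
      std::uint64_t &slot = links_[ready_lsn_ % links_.size()];
      if (slot == 0) break;  // next fragment is still being written
      const std::uint64_t len = slot;
      slot = 0;  // free the slot for reuse
      ready_lsn_ += len;
      traversed += len;
    }
    return ready_lsn_;
  }

 private:
  std::vector<std::uint64_t> links_;
  lsn_t ready_lsn_;  // plays the role of log.buf_ready_for_write_lsn
};
```

In this model, a link for [110, 120) without a link for [100, 110) leaves the ready lsn at 100; once the missing link is added, a single call advances it to 120.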

Thread: log flusher

The log flusher thread is responsible for doing fsync() of the log files.

When the fsync() calls are finished, the log flusher thread updates the log.flushed_to_disk_lsn and notifies the log flush_notifier thread using os_event_set() on the flush_notifier_event.

Remarks
A small optimization has been applied: if only a single log block was flushed since the previous flush, the log flusher thread notifies user threads directly (instead of notifying the log flush_notifier thread). The impact of the optimization turned out to be positive for some scenarios and negative for others, so further investigation is required. However, because the change makes sense from a logical point of view, it has been preserved.

If the log flusher thread detects that none of the conditions is satisfied, it simply waits and retries the checks. After initial spin delay, it waits on the flusher_event.
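
One iteration of the flusher's work can be sketched as below. This is a simplified model under stated assumptions: LogState and flusher_step() are hypothetical names, the fsync() call is passed in as a callback, and a plain counter stands in for os_event_set() on the flush_notifier_event. The sketch also assumes that write_lsn is sampled before the fsync, since only data written before the call is known to be durable when it returns.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>

using lsn_t = std::uint64_t;

// Hypothetical, reduced view of the shared log state.
struct LogState {
  std::atomic<lsn_t> write_lsn{0};
  std::atomic<lsn_t> flushed_to_disk_lsn{0};
  int flush_notifications = 0;  // stand-in for flush_notifier_event
};

// Performs one flush if there is anything to flush. Returns false when
// there is nothing to do; a real loop would then spin briefly and wait
// on the flusher_event before retrying.
bool flusher_step(LogState &log, const std::function<void()> &do_fsync) {
  // Sample write_lsn BEFORE the fsync: writes that arrive while the
  // fsync is running are not guaranteed to be covered by it.
  const lsn_t flush_up_to = log.write_lsn.load();
  if (flush_up_to <= log.flushed_to_disk_lsn.load()) {
    return false;
  }
  do_fsync();
  log.flushed_to_disk_lsn.store(flush_up_to);
  ++log.flush_notifications;  // "notify the log flush_notifier thread"
  return true;
}
```

If more data is written while the fsync is in progress, flushed_to_disk_lsn only reaches the sampled value, and the next iteration picks up the remainder.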

Thread: log flush_notifier

The log flush_notifier thread is responsible for notifying all user threads that are waiting for log.flushed_to_disk_lsn >= lsn, when the condition is satisfied.

Remarks
It also notifies when the condition is very likely to be satisfied (lsn values are within the same log block). It is allowed to make mistakes, and it is the responsibility of the notified user threads to ensure that flushed_to_disk_lsn has advanced sufficiently.

The log flush_notifier thread waits for the advanced flushed_to_disk_lsn in loop, using os_event_wait_time_low() on the flush_notifier_event. When it gets notified by the log flusher, it ensures that the flushed_to_disk_lsn has been advanced (single new byte is enough though).

It notifies user threads waiting on all events between (inclusive):

  • event for a block with the previous value of flushed_to_disk_lsn,
  • event for a block containing the new value of flushed_to_disk_lsn.

Events are assigned per blocks in the circular array of events using mapping:

    event_slot = (lsn-1) / OS_FILE_LOG_BLOCK_SIZE % S

where S is size of the array (number of slots with events). Each slot has single event, which groups all user threads waiting for flush up to any lsn within the same log block (or log block with number greater by S*i).
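
The mapping and the inclusive range of notified slots can be illustrated with a short sketch. The function names event_slot() and slots_to_notify() are hypothetical, and OS_FILE_LOG_BLOCK_SIZE is assumed to be 512 bytes here. Capping the range at S slots is also an assumption of this sketch: once the lsn has advanced by S or more blocks, every slot is covered exactly once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using lsn_t = std::uint64_t;

// Assumed block size for this sketch.
constexpr lsn_t OS_FILE_LOG_BLOCK_SIZE = 512;

// Mapping from the text: (lsn - 1) / OS_FILE_LOG_BLOCK_SIZE % S.
inline std::size_t event_slot(lsn_t lsn, std::size_t S) {
  assert(lsn > 0 && S > 0);
  return static_cast<std::size_t>((lsn - 1) / OS_FILE_LOG_BLOCK_SIZE % S);
}

// Slots to notify when the tracked lsn advances from prev_lsn to
// new_lsn: every block from the one holding prev_lsn through the one
// holding new_lsn, inclusive, visiting each slot at most once.
std::vector<std::size_t> slots_to_notify(lsn_t prev_lsn, lsn_t new_lsn,
                                         std::size_t S) {
  assert(prev_lsn > 0 && new_lsn >= prev_lsn && S > 0);
  const lsn_t first_block = (prev_lsn - 1) / OS_FILE_LOG_BLOCK_SIZE;
  const lsn_t last_block = (new_lsn - 1) / OS_FILE_LOG_BLOCK_SIZE;
  const lsn_t count =
      std::min<lsn_t>(last_block - first_block + 1, static_cast<lsn_t>(S));
  std::vector<std::size_t> slots;
  for (lsn_t i = 0; i < count; ++i) {
    slots.push_back(static_cast<std::size_t>((first_block + i) % S));
  }
  return slots;
}
```

With S = 4, lsn values 1 through 512 map to slot 0, lsn 513 maps to slot 1, and lsn 2049 wraps around to slot 0 again, which is why a waiter must re-check its own lsn after being woken.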

Notifications executed on slots

The event's internal mutex is used to avoid missed notifications (which would be worse than false notifications).

However, there is also maximum timeout defined for the waiting on the event. After the timeout was reached (default: 1ms), the flushed_to_disk_lsn is re-checked in the user thread (just in case).

Note
Because flushes are possible for log.write_lsn set in the middle of a log block, it is likely that the same slot for the same block will be notified multiple times in a row. We tried delaying notifications for the last block, but that only made the results worse. It turned out that latency is extremely important here.
See also
Waiting until log has been flushed to disk

Thread: log write_notifier

The log write_notifier thread is responsible for notifying all user threads that are waiting for log.write_lsn >= lsn, when the condition is satisfied.

Remarks
It also notifies when the condition is very likely to be satisfied (lsn values are within the same log block). It is allowed to make mistakes, and it is the responsibility of the notified user threads to ensure that write_lsn has advanced sufficiently.

The log write_notifier thread waits for the advanced write_lsn in loop, using os_event_wait_time_low() on the write_notifier_event. When it gets notified (by the log writer), it ensures that the write_lsn has been advanced (single new byte is enough). Then it notifies user threads waiting on all events between (inclusive):

  • event for a block with the previous value of write_lsn,
  • event for a block containing the new value of write_lsn.

Events are assigned per blocks in the circular array of events using mapping:

    event_slot = (lsn-1) / OS_FILE_LOG_BLOCK_SIZE % S

where S is size of the array (number of slots with events). Each slot has single event, which groups all user threads waiting for write up to any lsn within the same log block (or log block with number greater by S*i).

The event's internal mutex is used to avoid missed notifications (which would be worse than false notifications).

However, there is also maximum timeout defined for the waiting on the event. After the timeout was reached (default: 1ms), the write_lsn is re-checked in the user thread (just in case).

Note
Because writes are possible for log.write_lsn set in the middle of log block, it is likely that the same slot for the same block will be notified multiple times in a row.
See also
Waiting until log has been written to disk

Thread: log checkpointer

The log checkpointer thread is responsible for:

  1. Checking if a checkpoint write is required (to decrease checkpoint age before it gets too big).
  2. Checking if synchronous flush of dirty pages should be forced on page cleaner threads, because of space in redo log or age of the oldest page.
  3. Writing checkpoints (it's the only thread allowed to do it!).

This thread has been introduced at the very end. It was not required for the performance, but it makes the design more consistent after we have introduced other log threads. That's because user threads are not doing any writes to the log files themselves then. Previously they were writing checkpoints when needed, which required synchronization between them.

The log checkpointer thread updates log.available_for_checkpoint_lsn, which is calculated as:

    min(log.buf_dirty_pages_added_up_to_lsn, max(0, oldest_lsn - L))

where:

  • oldest_lsn = min(oldest modification of the earliest page from each flush list),
  • L is the number of slots in the log recent closed buffer.

The special case is when there is no dirty page in flush lists - then it's basically set to the log.buf_dirty_pages_added_up_to_lsn.
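
The calculation, including the special case, can be sketched as follows. The function name and parameters are hypothetical; the oldest modification lsn of each flush list is passed in as a plain vector, and an empty vector models the case with no dirty pages. Because lsn values are unsigned, max(0, oldest_lsn - L) is written as an explicit comparison to avoid wrap-around.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using lsn_t = std::uint64_t;

// Sketch of: min(buf_dirty_pages_added_up_to_lsn, max(0, oldest_lsn - L)).
// oldest_per_flush_list holds the oldest modification lsn of the
// earliest page in each non-empty flush list.
lsn_t available_for_checkpoint_lsn(
    lsn_t dirty_pages_added_up_to_lsn,
    const std::vector<lsn_t> &oldest_per_flush_list, lsn_t L) {
  assert(L > 0);  // L is a number of slots, so it is positive
  if (oldest_per_flush_list.empty()) {
    // Special case: no dirty pages in any flush list.
    return dirty_pages_added_up_to_lsn;
  }
  const lsn_t oldest_lsn = *std::min_element(oldest_per_flush_list.begin(),
                                             oldest_per_flush_list.end());
  // max(0, oldest_lsn - L), guarded against unsigned underflow.
  const lsn_t lagged = oldest_lsn > L ? oldest_lsn - L : 0;
  return std::min(dirty_pages_added_up_to_lsn, lagged);
}
```

For example, with L = 64 and flush lists whose oldest modifications are 900 and 950, the result is 836 as long as log.buf_dirty_pages_added_up_to_lsn is at least that large.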

Note
Previously, all user threads tried to calculate this lsn concurrently, causing contention on the flush_list mutex, which is required to read the oldest_modification of the earliest added page. Now the lsn is updated by a single thread.

Waiting until log has been written to disk

A user thread has to wait until the log writer thread has written data from the log buffer to disk for lsn >= end_lsn of the log range used by that thread, which is true when:

    write_lsn >= end_lsn

The log.write_lsn is updated by the log writer thread.

Waiting is implemented with an array of events. A user thread waiting for a given lsn waits on the event at position:

    slot = (end_lsn - 1) / OS_FILE_LOG_BLOCK_SIZE % S

where S is number of entries in the array. Therefore the event corresponds to log block which contains the end_lsn.

The log write_notifier thread tracks how the log.write_lsn is advanced and notifies user threads for consecutive slots.

Remarks
When the write_lsn is in the middle of log block, all user threads waiting for lsn values within the whole block are notified. When user thread is notified, it checks if the current value of the write_lsn is sufficient and retries waiting if not. To avoid missed notifications, event's internal mutex is used.
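
The wait-and-recheck protocol can be sketched with standard C++ primitives. This is not the InnoDB code: Event (a mutex plus a condition variable) stands in for an os_event, the names Log, advance_write_lsn() and wait_until_written() are hypothetical, the block size is assumed to be 512 bytes, and max_waits is an artificial bound added so that the sketch always terminates.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

using lsn_t = std::uint64_t;
constexpr lsn_t OS_FILE_LOG_BLOCK_SIZE = 512;  // assumed block size

// Hypothetical stand-in for one event slot.
struct Event {
  std::mutex mutex;
  std::condition_variable cv;
};

struct Log {
  explicit Log(std::size_t slots) : write_events(slots) {}
  std::atomic<lsn_t> write_lsn{0};
  std::vector<Event> write_events;
};

// Notifier side: publish the new write_lsn, then wake waiters on every
// slot between the previous and the new block (inclusive).
void advance_write_lsn(Log &log, lsn_t new_lsn) {
  assert(new_lsn > 0);
  const lsn_t prev_lsn = log.write_lsn.exchange(new_lsn);
  const std::size_t S = log.write_events.size();
  const lsn_t first =
      prev_lsn == 0 ? 0 : (prev_lsn - 1) / OS_FILE_LOG_BLOCK_SIZE;
  const lsn_t last = (new_lsn - 1) / OS_FILE_LOG_BLOCK_SIZE;
  for (lsn_t b = first, n = 0; b <= last && n < S; ++b, ++n) {
    Event &ev = log.write_events[b % S];
    // Holding the event mutex orders the notification after a waiter's
    // check-then-wait, so the wake-up cannot be missed.
    std::lock_guard<std::mutex> guard(ev.mutex);
    ev.cv.notify_all();
  }
}

// User-thread side: wait on the slot for end_lsn, re-checking write_lsn
// after every wake-up or timeout. Returns false after max_waits waits.
bool wait_until_written(
    Log &log, lsn_t end_lsn, int max_waits,
    std::chrono::milliseconds timeout = std::chrono::milliseconds(1)) {
  assert(end_lsn > 0);
  const std::size_t slot =
      (end_lsn - 1) / OS_FILE_LOG_BLOCK_SIZE % log.write_events.size();
  Event &ev = log.write_events[slot];
  std::unique_lock<std::mutex> lock(ev.mutex);
  for (int waits = 0;; ++waits) {
    if (log.write_lsn.load() >= end_lsn) return true;
    if (waits == max_waits) return false;
    ev.cv.wait_for(lock, timeout);  // may also wake up spuriously
  }
}
```

A false or spurious wake-up is harmless in this scheme because the waiter only returns after it has observed write_lsn >= end_lsn itself.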

Waiting until log has been flushed to disk

If a user thread needs to ensure log persistence in case of a crash (e.g. on COMMIT of a transaction), it has to wait until the log flusher thread has flushed the log files to disk for lsn >= end_lsn of the log range used by that thread, which is true when:

    flushed_to_disk_lsn >= end_lsn

The log.flushed_to_disk_lsn is updated by the log flusher thread.

Waiting is implemented with an array of events. A user thread waiting for a given lsn waits on the event at position:

    slot = (end_lsn - 1) / OS_FILE_LOG_BLOCK_SIZE % S

where S is number of entries in the array. Therefore the event corresponds to log block which contains the end_lsn.

The log flush_notifier thread tracks how the log.flushed_to_disk_lsn is advanced and notifies user threads for consecutive slots.

Remarks
When the flushed_to_disk_lsn is in the middle of log block, all user threads waiting for lsn values within the whole block are notified. When user thread is notified, it checks if the current value of the flushed_to_disk_lsn is sufficient and retries waiting if not. To avoid missed notifications, event's internal mutex is used.