WL#4007: Improve binlog writing speed by using pwrite
Affects: WorkLog-3.4 — Status: Assigned — Priority: Medium
RATIONALE
---------

The intention of this worklog is to reduce contention on the binary log by allowing several transactions to be written in parallel. The idea is to allow multiple concurrent writes by pre-calculating the position of the next write and holding the mutex only during this calculation. This reduces the time the binlog mutex is held: once the position of the next write is known, the transaction can be written to the binary log *after* the binlog mutex has been released.

SUMMARY
-------

When writing to the binary log, we have two different approaches depending on whether we are writing a transaction or writing directly to the binary log. By harmonizing the two approaches, it is possible to significantly improve the writing speed by using pwrite().

Currently, we have a *next write position* variable holding the next valid position in the binary log that can be written. This variable usually refers to one-after-the-end of the previously written event.

Currently, data is written to the binary log by the following method:

1. Lock the log
2. Write the data
3. Update the next write position to refer to the next valid position that can be written. This is given by the member function that writes an event to the binary log.
4. Unlock the log

By ensuring that the amount of data to write to the binary log is known beforehand (by writing data to an internal buffer and checking what size that buffer has, or by computing the size of the data by any other means), we can use the following method to improve the writing speed of the binary log:

1. Lock the log
2. Compute the next valid position to write to and update next_write_pos with this position
3. Unlock the log
4. Write the data

By using this technique, we allow several (client) threads to write data to the binary log simultaneously.
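
The reserve-then-write method above can be sketched as follows. This is a minimal illustration, not the server code: the class ``PositionedLog`` and its members are hypothetical names, and the only real API used is POSIX ``pwrite()``. It assumes the size of the data is known before the mutex is taken.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <mutex>
#include <string>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical stand-in for the binary log; not the server's MYSQL_BIN_LOG.
class PositionedLog {
public:
  explicit PositionedLog(int fd) : fd_(fd) { assert(fd >= 0); }

  // Lock the log, compute the position for this write, advance the
  // next write position, unlock. The mutex covers only this calculation.
  off_t reserve(std::size_t len) {
    std::lock_guard<std::mutex> guard(lock_);
    const off_t pos = next_write_pos_;
    next_write_pos_ += static_cast<off_t>(len);
    return pos;
  }

  // Write the data at its reserved offset with the mutex already released.
  // Returns true once every byte has been written.
  bool write_event(const std::string &data) {
    const off_t pos = reserve(data.size());
    std::size_t done = 0;
    while (done < data.size()) {
      const ssize_t n = ::pwrite(fd_, data.data() + done, data.size() - done,
                                 pos + static_cast<off_t>(done));
      if (n < 0 && errno == EINTR) continue;  // interrupted: retry
      if (n <= 0) return false;               // write error
      done += static_cast<std::size_t>(n);
    }
    return true;
  }

private:
  int fd_;
  std::mutex lock_;
  off_t next_write_pos_ = 0;
};
```

Several threads can call ``write_event()`` on the same object at once, because each writes only to the byte range it reserved. One consequence of this sketch is that a writer that fails after reserving leaves a gap in the file, so a reader would need some way to know which ranges are complete.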
Since the operating system keeps the file in memory until it can be safely written to disk (synced), this technique will allow the operating system to write the data in larger chunks, as well as allowing data to be written to the internal memory buffer at a higher speed.

See also WL#2492, especially the "commit throttling" feature:

There is a limit on how many syncs per second the binlog will do. If a new sync is about to happen earlier than 1/Nth of a second after the previous one, we wait. The point is that if syncs follow one after another, they may exhaust all the available I/O capacity, while a user may want to keep some of the I/O bandwidth for the actual table reads and writes. Note that this is different from sync-binlog, because syncs are not skipped and thus durability is preserved. Throughput isn't affected either, only the duration of a single sync: we do fewer syncs per second, but more work on every sync.
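
The commit-throttling rule quoted above (wait if the next sync would come earlier than 1/Nth of a second) can be sketched as a small helper. ``SyncThrottle`` and ``delay_for()`` are hypothetical names chosen for illustration, not part of WL#2492. The sketch is not thread-safe and assumes the caller already serializes syncs.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical helper: allows at most N syncs per second by delaying,
// never by skipping, a sync.
class SyncThrottle {
public:
  explicit SyncThrottle(int max_syncs_per_second)
      : min_interval_(std::chrono::duration_cast<Clock::duration>(
                          std::chrono::seconds(1)) /
                      max_syncs_per_second) {
    assert(max_syncs_per_second > 0);
  }

  // Returns how long the caller should wait before syncing, given the
  // current time, and records the time at which that sync will happen.
  Clock::duration delay_for(Clock::time_point now) {
    if (!has_synced_) {
      has_synced_ = true;
      last_sync_ = now;
      return Clock::duration::zero();
    }
    const Clock::time_point earliest = last_sync_ + min_interval_;
    if (now >= earliest) {
      last_sync_ = now;
      return Clock::duration::zero();
    }
    last_sync_ = earliest;  // the sync is postponed to this instant
    return earliest - now;
  }

private:
  Clock::duration min_interval_;
  Clock::time_point last_sync_{};
  bool has_synced_ = false;
};
```

A caller would sleep for the returned duration (for example with ``std::this_thread::sleep_for``) and then sync. Commits that arrive during the wait are covered by the same sync, which is why each sync does more work.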
The current situation uses two different techniques when writing to the binary log: one method when writing non-transactionally (directly to the binary log) and one when writing a transaction to the binary log on commit.

Writing directly to binary log
==============================

This situation occurs when writing an event outside a transaction (i.e., either AUTO_COMMIT=1 or we are after a START TRANSACTION with no matching COMMIT or ROLLBACK) or when writing a non-transactional event and there is no data in the transaction cache. For this case, we are assuming that the above situation directs the server to write the event to the binary log directly.

1. Writing an event to the binary log starts through the call of ``MYSQL_BIN_LOG::write(Log_event*)``. Some initial checks are made to decide if the above situation holds, but these checks will, by the above assumption, direct the server to write directly to the binary log.
2. Lock the binary log by grabbing the mutex ``LOCK_log``.
3. Write a context to the binary log consisting of: the RAND() seed, INSERT_ID(), user variables used by the statement, etc. This is not done for row-based logging.
4. Write the event by calling ``Log_event::write(IO_CACHE*)``.
5. Flush and sync the binlog, if the ``sync_binlog_counter`` says that it should.
6. Signal that the binlog is updated, in order to allow any listening dump threads to proceed.
7. Rotate binlog if the ``max_size`` is exceeded.
8. Unlock the binary log by releasing the mutex ``LOCK_log``.
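
For contrast with the proposed change, the direct-write sequence above can be modelled as below. This is a toy model under stated assumptions: ``DirectWriteLog`` and all of its members are hypothetical, a ``std::string`` stands in for the binlog file, and sync and rotation are reduced to counters. The point it illustrates is that the mutex is held across the write, the sync, and the rotation.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <string>

// Toy model of the current direct-write path; not server code.
struct DirectWriteLog {
  std::mutex lock_log;                  // plays the role of LOCK_log
  std::condition_variable update_cond;  // wakes waiting dump threads
  std::string file;                     // stands in for the binlog file
  unsigned sync_period = 0;             // assumption: 0 means never sync
  unsigned sync_counter = 0;
  std::size_t max_size = 1024;
  unsigned syncs = 0;
  unsigned rotations = 0;

  void write(const std::string &context, const std::string &event) {
    std::lock_guard<std::mutex> guard(lock_log);  // lock the binary log
    file += context;                              // write the context
    file += event;                                // write the event
    if (sync_period != 0 && ++sync_counter >= sync_period) {
      sync_counter = 0;                           // flush and sync
      ++syncs;
    }
    update_cond.notify_all();                     // signal the update
    if (file.size() > max_size) {                 // rotate if too large
      file.clear();
      ++rotations;
    }
  }  // the guard releases the mutex here
};
```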
This low-level design depends on the patches for BUG#43262 and WL#4832 (3 patches).

The affected routines are:

1) write(THD *thd, IO_cache *cache, Log_event *commit_event)

This routine is the place where the binlog now gets written. The functions append and appendv are only used by the relay log, and the relay log is only written by one thread, the slave I/O thread, so it is not such an interesting problem to solve currently.

When we enter this function the following holds:

1) The IO_cache contains everything to write except for a BEGIN and a COMMIT record. For an event with commit_event == NULL, no special handling of BEGIN and COMMIT is needed.

What we really need to prepare here is to find out how many bytes we are planning to write to the binlog.
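
One possible reading of that byte count is sketched below. The struct ``PendingWrite``, its fields, and ``bytes_to_write()`` are hypothetical, and the sketch assumes that the total is the cache contents plus the BEGIN and COMMIT records whenever a commit event is present. The worklog text does not spell out the exact computation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical summary of one pending write; all names are illustrative.
struct PendingWrite {
  std::size_t cache_bytes;         // bytes already in the IO_cache
  std::size_t begin_event_bytes;   // encoded size of the BEGIN record
  std::size_t commit_event_bytes;  // encoded size of the COMMIT record
  bool has_commit_event;           // false models commit_event == NULL
};

// Number of bytes to reserve in the binlog for this write.
std::size_t bytes_to_write(const PendingWrite &w) {
  if (!w.has_commit_event) {
    return w.cache_bytes;  // no BEGIN/COMMIT handling needed
  }
  return w.begin_event_bytes + w.cache_bytes + w.commit_event_bytes;
}
```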
Copyright (c) 2000, 2016, Oracle Corporation and/or its affiliates. All rights reserved.