WL#4007: Improve binlog writing speed by using pwrite

Affects: WorkLog-3.4   —   Status: Assigned

RATIONALE
---------

The intention of this worklog is to reduce the contention on the binary log by
allowing several transactions to be written in parallel.

The idea is to allow multiple concurrent writes by pre-calculating the
position of the next write and only hold the mutex during this calculation.

This will reduce the time the binlog mutex is held: since the position of the
next write is known in advance, the transaction can be written to the binary
log *after* the binlog mutex has been released.


SUMMARY
-------

When writing to the binary log, we have two different approaches depending on
whether we are writing a transaction or writing directly to the binary log. By
harmonizing the two approaches, it is possible to significantly improve the
writing speed by using pwrite().

Currently, we have a *next write position* variable with the next valid position
in the binary log that can be written. This variable usually refers to
one-after-the-end of the previously written event.

Currently, data is written to the binary log by the following method:

1. Lock the log

2. Write the data

3. Update the next write position to refer to the next valid position that
   can be written. This is given by the member function that writes an event
   to the binary log.

4. Unlock the log
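
As a rough illustration, the current pattern corresponds to the following
standalone C++ sketch (POSIX write(); the names binlog_fd and next_write_pos
are illustrative, not the actual server symbols):

    #include <mutex>
    #include <unistd.h>

    static std::mutex LOCK_log;        // models the binlog mutex
    static int        binlog_fd = -1;  // models the open binlog file
    static off_t      next_write_pos = 0;

    // Current approach: the whole write happens while LOCK_log is held,
    // so concurrent writers are fully serialized (error handling omitted).
    void write_event_locked(const char *buf, size_t len)
    {
      std::lock_guard<std::mutex> guard(LOCK_log);  // 1. lock the log
      ::write(binlog_fd, buf, len);                 // 2. write the data
      next_write_pos += static_cast<off_t>(len);    // 3. update next write position
    }                                               // 4. unlock on scope exit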


By ensuring that the amount of data to write to the binary log is known
beforehand (by writing the data to an internal buffer and checking the size of
that buffer, or by computing the size of the data by some other means), we can
use the following method to improve the writing speed of the binary log:

1. Compute the size of the data to be written (for example by writing it to
   an internal buffer), as described above

2. Lock the log

3. Compute the next valid position to write to and update next_write_pos
   with this position

4. Unlock the log

5. Write the data
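
A minimal sketch of the proposed pattern, assuming a plain POSIX file
descriptor and std::mutex (again, the names are illustrative): only the
position reservation is serialized, and the actual write uses pwrite() at the
reserved offset, outside the lock.

    #include <mutex>
    #include <unistd.h>

    static std::mutex LOCK_log;
    static int        binlog_fd = -1;
    static off_t      next_write_pos = 0;

    // Proposed approach: reserve the offset under the mutex, then write
    // concurrently with other threads using pwrite() (error handling omitted).
    void write_event_pwrite(const char *buf, size_t len)
    {
      off_t my_pos;
      {
        std::lock_guard<std::mutex> guard(LOCK_log);  // 1. lock the log
        my_pos = next_write_pos;                      // 2. reserve this writer's position
        next_write_pos += static_cast<off_t>(len);    //    and advance it for the next writer
      }                                               // 3. unlock the log
      ::pwrite(binlog_fd, buf, len, my_pos);          // 4. write the data outside the lock
    }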

By using this technique, we allow several (client) threads to write data to the
binary log simultaneously. Since the operating system keeps the file in memory
until it can be safely written to disk (synced), this technique allows the
operating system to write the data in larger chunks and lets threads write data
into the in-memory file buffer at a higher speed.

See also WL#2492, especially the "commit throttling feature":

There is a limit on how many syncs per second the binlog will be doing. If a
new sync is going to happen earlier than 1/Nth of a second after the previous
one, we wait. The point is that if syncs follow one after another they may
exhaust all the available I/O capacity, while a user may want to keep some of
the I/O bandwidth for the actual table reads and writes. Note that this is
different from sync-binlog, because syncs are not skipped and thus durability
is preserved. Throughput isn't affected either, only the duration of a single
sync: we do fewer syncs per second, but more work on every sync.
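
A hedged sketch of such throttling, assuming a simple steady-clock rate
limiter (the variable max_syncs_per_second and the helper name are
illustrative, not the WL#2492 implementation):

    #include <chrono>
    #include <mutex>
    #include <thread>
    #include <unistd.h>

    static std::mutex sync_mutex;
    static std::chrono::steady_clock::time_point last_sync;
    static unsigned max_syncs_per_second = 10;   // the "N" in the description above

    // Delay, if necessary, so that consecutive fsync() calls are at least
    // 1/N seconds apart; the sync itself is never skipped.
    void throttled_sync(int fd)
    {
      std::lock_guard<std::mutex> guard(sync_mutex);
      auto min_gap = std::chrono::duration<double>(1.0 / max_syncs_per_second);
      auto now = std::chrono::steady_clock::now();
      if (now - last_sync < min_gap)
        std::this_thread::sleep_for(min_gap - (now - last_sync));
      ::fsync(fd);                               // error handling omitted
      last_sync = std::chrono::steady_clock::now();
    }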

Currently, two different techniques are used when writing to the binary log:
one when writing non-transactionally (directly to the binary log) and one when
writing a transaction to the binary log on commit.

Writing directly to binary log
==============================

This situation occurs when writing an event outside a transaction (i.e., either
AUTO_COMMIT=1 or we are after a START TRANSACTION with no matching COMMIT or
ROLLBACK) or when writing a non-transactional event and there is no data in the
transaction cache.

In the description below, we assume that one of the situations above holds, so
the server writes the event directly to the binary log.


1. Writing an event to the binary log starts with a call to
   ``MYSQL_BIN_LOG::write(Log_event*)``. Some initial checks are made to
   decide whether the above situation holds; by the above assumption, these
   checks direct the server to write directly to the binary log.

2. Lock the binary log by grabbing the mutex ``LOCK_log``.

3. Write context information to the binary log, consisting of the RAND()
   seed, the INSERT_ID(), user variables used by the statement, etc. This is
   not done for row-based logging.

4. Write the event by calling ``Log_event::write(IO_CACHE*)``.

5. Flush and sync the binlog, if ``sync_binlog_counter`` indicates that it
   should be done.

6. Signal that the binlog is updated, in order to allow any listening dump
   threads to proceed.

7. Rotate binlog if the ``max_size`` is exceeded.

8. Unlock the binary log by releasing the mutex ``LOCK_log``.
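
As a rough standalone sketch of this control flow (the member names model, but
are not, the actual server symbols such as LOCK_log, COND_update and
sync_binlog_counter; error handling is omitted):

    #include <condition_variable>
    #include <mutex>
    #include <unistd.h>

    struct Binlog_sketch {
      std::mutex              LOCK_log;      // models LOCK_log
      std::condition_variable update_cond;   // dump threads wait on this
      int       fd = -1;
      unsigned  sync_period = 1;             // models sync_binlog
      unsigned  sync_counter = 0;            // models sync_binlog_counter
      off_t     bytes_written = 0;
      off_t     max_size = 1073741824;

      // Models steps 2-8 of the direct-write path described above.
      void write_event(const char *context, size_t context_len,
                       const char *event, size_t event_len)
      {
        std::lock_guard<std::mutex> guard(LOCK_log);            // 2. lock
        if (context_len != 0)
          bytes_written += ::write(fd, context, context_len);   // 3. context (statement-based only)
        bytes_written += ::write(fd, event, event_len);         // 4. the event itself
        if (++sync_counter >= sync_period) {                    // 5. flush/sync if due
          sync_counter = 0;
          ::fsync(fd);
        }
        update_cond.notify_all();                               // 6. signal dump threads
        if (bytes_written >= max_size)
          rotate();                                             // 7. rotate if too large
      }                                                         // 8. unlock on scope exit

      void rotate() { /* open a new binlog file; omitted here */ }
    };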


LOW-LEVEL DESIGN
----------------

This low-level design depends on the patches for BUG#43262 and WL#4832
(3 patches).

The affected routines are:

1) write(THD *thd, IO_CACHE *cache, Log_event *commit_event)

This routine is the place where the binlog currently gets written. The
functions append() and appendv() are only used for the relay log, and the
relay log is only written by a single thread (the slave I/O thread), so it is
not such an interesting problem to solve currently.

When we enter this function, the following holds:

1) The IO_CACHE contains everything to write except for a BEGIN and a COMMIT
record.

If commit_event == NULL, then no special handling of BEGIN and COMMIT is
needed.

What we really need to prepare here is to find out how many bytes we plan to
write to the binlog, as in the sketch below.
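
A standalone sketch of that preparation, assuming the cache length is already
known (in the server it would come from the IO_CACHE, for instance via
my_b_tell(); the function and variable names here are illustrative):

    #include <mutex>
    #include <unistd.h>

    static std::mutex LOCK_log;
    static off_t      next_write_pos = 0;

    // Reserve binlog space for a transaction once its total size is known.
    // begin_len and commit_len are 0 when commit_event == NULL.
    off_t reserve_binlog_space(size_t cache_len, size_t begin_len, size_t commit_len)
    {
      size_t total = begin_len + cache_len + commit_len;
      std::lock_guard<std::mutex> guard(LOCK_log);
      off_t pos = next_write_pos;                  // where this transaction starts
      next_write_pos += static_cast<off_t>(total); // the next transaction starts after it
      return pos;                                  // caller pwrite()s at pos after unlocking
    }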