WL#4007: Improve binlog writing speed by using pwrite

Affects: WorkLog-3.4   —   Status: Assigned

RATIONALE
---------

The intention of this worklog is to reduce the contention on the binary log by
allowing several transactions to be written in parallel.

The idea is to allow multiple concurrent writes by pre-calculating the
position of the next write and holding the mutex only during this calculation.

This will reduce the locking time of the binlog mutex: once the position of
the next write is known, the transaction can be written to the binary log
*after* the binlog mutex has been released.


SUMMARY
-------

When writing to the binary log, we have two different approaches depending on
whether we are writing a transaction or writing directly to the binary log. By
harmonizing the two approaches, it is possible to significantly improve the
writing speed by using pwrite().

Currently, we have a *next write position* variable holding the next valid
position in the binary log that can be written to. This variable usually
points just past the end of the previously written event.

Currently, data is written to the binary log by the following method (a
sketch in code follows the list):

1. Lock the log

2. Write the data

3. Update the next write position to refer to the next valid position that
   can be written; this new position is returned by the member function that
   writes the event to the binary log.

4. Unlock the log
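
As a rough illustration, here is a minimal sketch of the current scheme. The
names (Binlog, next_write_pos, binlog_write_locked) and the use of a plain
POSIX mutex and write() are assumptions made for this sketch only; they are
not the server's actual classes or member functions:

  #include <pthread.h>
  #include <unistd.h>
  #include <cstddef>

  /* Hypothetical, simplified binlog handle -- names are illustrative only. */
  struct Binlog {
    pthread_mutex_t lock;           /* the binlog mutex                     */
    int             fd;             /* file descriptor of the active binlog */
    off_t           next_write_pos; /* next valid position to write to      */
  };

  /* Current scheme: the mutex is held for the whole duration of the write. */
  ssize_t binlog_write_locked(Binlog *log, const void *buf, size_t len)
  {
    pthread_mutex_lock(&log->lock);              /* 1. Lock the log          */
    ssize_t written = write(log->fd, buf, len);  /* 2. Write the data        */
    if (written > 0)
      log->next_write_pos += written;            /* 3. Update next write pos */
    pthread_mutex_unlock(&log->lock);            /* 4. Unlock the log        */
    return written;
  }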


By ensuring that the amount of data to write to the binary log is known
beforehand (by writing the data to an internal buffer and checking its size,
or by computing the size of the data by other means), we can use the following
method, sketched in code after the list, to improve the writing speed of the
binary log:

1. Lock the log

2. Compute the next valid position to write to and update next_write_pos
   with this position

3. Unlock the log

4. Write the data
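
A corresponding sketch of the proposed scheme, under the same illustrative
assumptions (simplified Binlog handle, hypothetical function names). The key
point is that the mutex protects only the reservation of next_write_pos; the
data itself is written outside the lock with pwrite() at the reserved offset:

  #include <pthread.h>
  #include <unistd.h>
  #include <cstddef>

  struct Binlog {
    pthread_mutex_t lock;           /* the binlog mutex                     */
    int             fd;             /* file descriptor of the active binlog */
    off_t           next_write_pos; /* next valid position to write to      */
  };

  /* Proposed scheme: len must be known in advance, e.g. by first writing
     the events to an internal buffer and taking its size.                  */
  ssize_t binlog_write_reserved(Binlog *log, const void *buf, size_t len)
  {
    pthread_mutex_lock(&log->lock);        /* 1. Lock the log               */
    off_t my_pos = log->next_write_pos;    /* 2. Reserve the next valid     */
    log->next_write_pos += len;            /*    position for this write    */
    pthread_mutex_unlock(&log->lock);      /* 3. Unlock the log             */

    /* 4. Write the data at the reserved offset.  pwrite() neither uses nor
       moves the shared file offset, so concurrent writers that reserved
       disjoint ranges cannot interleave their data.                        */
    return pwrite(log->fd, buf, len, my_pos);
  }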

By using this technique, we allow several (client) threads to write data to
the binary log simultaneously. Since the operating system keeps the file in
memory until it can be safely written to disk (synced), this technique allows
the operating system to write the data in larger chunks, and lets data be
written into the internal memory buffer at a higher rate.

See also WL#2492, especially the "commit throttling feature":

There is a limit on how many syncs per second the binlog will be doing. If a
new sync would happen earlier than 1/Nth of a second after the previous one,
we wait. The point is that if syncs follow one right after another, they may
exhaust all the available I/O capacity, while a user may want to keep some of
the I/O bandwidth for the actual table reads and writes. Note that this is
different from sync-binlog, because syncs are not skipped and thus durability
is preserved. Overall throughput isn't affected either, only the duration of a
single sync: we do fewer syncs per second, but more work on every sync.
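
A minimal sketch of such a sync throttle, assuming a hypothetical SyncThrottle
structure and POSIX clock/sleep primitives; none of these names come from
WL#2492, they are purely illustrative:

  #include <pthread.h>
  #include <time.h>
  #include <unistd.h>

  /* Illustrative throttle state; max_syncs_per_sec is the "N" above. */
  struct SyncThrottle {
    pthread_mutex_t lock;
    struct timespec last_sync;
    long            max_syncs_per_sec;
  };

  /* Sync the binlog file, but never more often than once per 1/N second.
     If we arrive too early we sleep for the remainder of the interval;
     no sync is ever skipped, so durability is preserved. */
  void throttled_sync(SyncThrottle *t, int fd)
  {
    pthread_mutex_lock(&t->lock);

    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);

    long long min_gap_ns = 1000000000LL / t->max_syncs_per_sec;
    long long elapsed_ns = (now.tv_sec  - t->last_sync.tv_sec)  * 1000000000LL +
                           (now.tv_nsec - t->last_sync.tv_nsec);

    if (elapsed_ns < min_gap_ns)
    {
      long long remaining_ns = min_gap_ns - elapsed_ns;
      struct timespec wait;
      wait.tv_sec  = remaining_ns / 1000000000LL;
      wait.tv_nsec = remaining_ns % 1000000000LL;
      nanosleep(&wait, NULL);            /* wait, do not skip the sync */
    }

    fsync(fd);                           /* the sync itself            */
    clock_gettime(CLOCK_MONOTONIC, &t->last_sync);

    pthread_mutex_unlock(&t->lock);
  }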