WL#5223: Group Commit of Binary Log

Affects: Server-5.6   —   Status: Complete

PROBLEM (SUMMARY)
-----------------
When the binary log is enabled there is a dramatic drop in performance for the
following reasons:

  . the binary log does not exploit the group commit technique;

  . there are several accesses to disk, i.e. writes and flushes, per
    transaction.

MySQL uses write-ahead logging to provide both durability and consistency.
Specifically, the storage engines write redo and, less frequently, undo changes
to the log and make sure that, upon commit, the changes made on behalf of a
transaction are written and flushed to stable storage (e.g. disk).

Notice, however, that the higher the number of transactions committing per
second, the higher the rate at which the logs must be written and flushed. If
nothing is done, the log will eventually become a performance bottleneck. To
circumvent this issue, one postpones access to stable storage for a while in
order to gather in memory as many commits as possible and thus issue a single
write and flush for a whole set of transactions.

This technique is named group commit and is widely used in database systems to
improve their performance.
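
As a rough illustration of the idea (the class and function names below are
made up for this sketch and are not taken from the MySQL source), committing
transactions queue their log records in memory, and a single write pass plus a
single flush then makes the whole group durable:

  #include <cstdio>
  #include <mutex>
  #include <string>
  #include <vector>

  // Hypothetical sketch of group commit; the names are made up for this
  // illustration and are not taken from the MySQL source.
  class GroupCommitLog {
  public:
    // Each committing transaction appends its log record to the group.
    void enqueue(const std::string &record) {
      std::lock_guard<std::mutex> guard(mutex_);
      pending_.push_back(record);
    }

    // Executed once per group (for example by a "leader" thread): a
    // single write pass and a single flush cover every queued commit.
    void flush_group(std::FILE *log_file) {
      std::vector<std::string> group;
      {
        std::lock_guard<std::mutex> guard(mutex_);
        group.swap(pending_);
      }
      for (const std::string &rec : group)
        std::fwrite(rec.data(), 1, rec.size(), log_file);
      std::fflush(log_file);  // stands in for the single fsync of the group
    }

  private:
    std::mutex mutex_;
    std::vector<std::string> pending_;
  };

The larger the group that accumulates between flushes, the fewer disk accesses
are paid per transaction.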


PROPOSED SOLUTION (SUMMARY)
---------------------------
We are going to:

  . use the group commit technique to reduce the number of writes and
    flushes.


PROBLEM (DETAILS)
-----------------
The following describes what happens when one commits a transaction while the
binary log is enabled. The description is based on Harrison's analysis of the
performance problems associated with the current implementation and uses Innodb
as the storage engine, since it is the only truly transactional engine
maintained by MySQL (a code sketch of the same sequence follows the list):

1. Prepare Innodb:

   a) Write prepare record to Innodb's log buffer
   b) Sync log file to disk
   c) Take prepare_commit_mutex

2. "Prepare" binary log:

   a) Write transaction to binary log
   b) Sync binary log based on sync_binlog

3. Commit Innodb:

   a) Write commit record to log
   b) Release prepare_commit_mutex
   c) Sync log file to disk
   d) Innodb locks are released

4. "Commit" binary log:

   a) Nothing needs to be done here.
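
For readability, the same sequence is repeated below as a C++-style sketch; the
function names are placeholders standing in for the real server and Innodb
entry points, which are not named in this description:

  #include <mutex>

  // Placeholder types and functions for illustration only; they are not
  // the real server or Innodb entry points.
  struct Transaction {};
  static void innodb_write_prepare_record(Transaction *) {}
  static void innodb_sync_log() {}                   // fsync of the Innodb log
  static void binlog_write_transaction(Transaction *) {}
  static bool binlog_should_sync() { return true; }  // governed by sync_binlog
  static void binlog_sync() {}                       // fsync of the binary log
  static void innodb_write_commit_record(Transaction *) {}
  static void innodb_release_locks(Transaction *) {}

  static std::mutex prepare_commit_mutex;

  // The commit path described above, executed one transaction at a time.
  void commit_with_binlog(Transaction *trx) {
    // 1. Prepare Innodb.
    innodb_write_prepare_record(trx);   // 1.a
    innodb_sync_log();                  // 1.b
    prepare_commit_mutex.lock();        // 1.c

    // 2. "Prepare" the binary log.
    binlog_write_transaction(trx);      // 2.a
    if (binlog_should_sync())           // 2.b
      binlog_sync();

    // 3. Commit Innodb.
    innodb_write_commit_record(trx);    // 3.a
    prepare_commit_mutex.unlock();      // 3.b
    innodb_sync_log();                  // 3.c
    innodb_release_locks(trx);          // 3.d

    // 4. "Commit" the binary log: nothing left to do.
  }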

There are five problems with this model:

1. Prepare_commit_mutex prevents the binary log and Innodb from
   group committing.

  The prepare_commit_mutex is used to ensure that transactions are committed
  in the binary log in the same order as they are committed in the Innodb logs.
  This is a requirement imposed by Innodb Hot Backup, and we do not intend to
  change it.

2. The binary log is not prepared for group committing.

  Due to this mutex, only one transaction executes step 2 at a time, and as
  such the binary log cannot group a set of transactions to reduce the number
  of writes and flushes. Besides, the code is not prepared to exploit group
  commit. In this WL, we plan to remove the mutex and make the binary log
  exploit the group commit technique.

3. Locks are held for the duration of three fsyncs.

  MySQL uses locks to implement its consistency modes. Clearly, the more locks
  that are held, and the longer they are held, the lower the concurrency level.
  In general, it is only safe to release a transaction's locks once it has
  written its commit record to disk.
  Locks are divided into two distinct sets: shared and exclusive locks. Shared
  locks may be released as soon as it is known that a transaction has finished
  its work and is willing to commit.
  This point is reached after (1.a), when the database has obligated itself to
  commit the transaction if it is asked to do so.

4. There are unnecessary disk accesses, i.e. too many fsyncs.

  Transactions are written to disk three times when the binary log is enabled.
  This can be improved, as the binary log is considered the source of truth and
  is used for recovery.
  Currently, upon recovery, Innodb compiles a list of all transactions that were
  prepared but neither committed nor aborted and checks the binary log to decide
  their fate. If a transaction was written to the binary log, it is committed;
  otherwise, it is rolled back. A sketch of this recovery check is given after
  the list of problems.
  Clearly, one does not need to write and flush the commit record, because it
  will eventually be written either by another transaction in the prepare phase
  or by Innodb's background process, which writes and flushes Innodb's log
  buffer every second.
  In this WL, we postpone the write and flush at the commit phase in order to
  improve performance, and we make the time between periodic writes and flushes
  configurable. The longer the period, the greater the likelihood of increasing
  the recovery time.
  In the future, we should improve this scenario by also avoiding the write and
  flush at the prepare phase and relying on the binary log to replay missing
  transactions. See WL#6305 for further details.

5. The binary log works as both a storage engine and a transaction
   coordinator, making it difficult to maintain and evolve.

  The binary log registers as a handler and gets callbacks when preparing,
  committing, and aborting. This allows it to write cached data to the binary
  log or manipulate the "transaction state" in other ways.
  In addition, the binary log acts as a transaction coordinator, whose purpose
  is to order the commits and handle two-phase commit (2PC) correctly.
  The binary log is unsuitable as a handler and is in reality *only* a
  transaction coordinator. The fact that it registers as a handler causes some
  maintenance (and potentially performance) problems and makes it more
  difficult to implement new features. A sketch of the coordinator-only role is
  given after this list of problems.
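
To make the recovery rule of problem 4 concrete, here is a short sketch; the
identifier type and the function names are hypothetical stand-ins, not the
actual recovery code:

  #include <set>
  #include <string>
  #include <vector>

  // Hypothetical transaction identifier and engine stubs, for
  // illustration only.
  using Xid = std::string;
  static void commit_in_innodb(const Xid &) {}
  static void rollback_in_innodb(const Xid &) {}

  // The recovery rule from problem 4: a transaction that Innodb left in
  // the prepared state is committed if and only if it reached the
  // binary log; otherwise it is rolled back.
  void recover(const std::vector<Xid> &prepared_in_innodb,
               const std::set<Xid> &logged_in_binlog) {
    for (const Xid &xid : prepared_in_innodb) {
      if (logged_in_binlog.count(xid) != 0)
        commit_in_innodb(xid);    // the binary log is the source of truth
      else
        rollback_in_innodb(xid);  // never made it to the binary log
    }
  }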
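
For problem 5, a minimal sketch of the intended split is shown below. The
interface is hypothetical and is not the server's actual class hierarchy; it
only illustrates the binary log acting purely as a transaction coordinator,
with no storage engine (handler) callbacks:

  // Hypothetical interface, not the server's actual class hierarchy: it
  // only illustrates the binary log implementing the coordinator role
  // (ordering commits and driving two-phase commit) without registering
  // itself as a storage engine handler.
  struct Transaction;

  class Transaction_coordinator {
  public:
    virtual ~Transaction_coordinator() = default;
    virtual bool prepare(Transaction *trx) = 0;   // phase one of 2PC
    virtual bool commit(Transaction *trx) = 0;    // phase two, in commit order
    virtual bool rollback(Transaction *trx) = 0;
  };

  // The binary log as a coordinator only: it logs the transaction and
  // decides the commit order, but it stores no table data and needs no
  // handler callbacks.
  class Binlog_coordinator : public Transaction_coordinator {
  public:
    bool prepare(Transaction *) override { return true; }
    bool commit(Transaction *) override { return true; }
    bool rollback(Transaction *) override { return true; }
  };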


PROPOSED SOLUTION (DETAILS)
--------------------------- 
The work of improving performance when the binary log is enabled can be split
into the following tasks:

1. Eliminate prepare_commit_mutex, or the need for it.

  This has to do with the ordering of transactions in the binary log compared
  to the order of transactions in the Innodb logs. One way in which the
  ordering can be preserved without the mutex is sketched after this task list.

2. Flush the binary log properly.

  Preparing and committing a transaction to the binary log does not 
  automatically mean that the binary log is flushed. Indeed, the entire point
  of performing a group commit of the binary log is to not flush the binary
  log with each transaction and instead improve performance by reducing the
  number of flushes per transaction.

3. Handle the release of the read locks so that it is possible to further
   improve performance.

  Releasing locks earlier may improve performance, especially for applications
  that have a high number of reads.

4. Postpone write and flush at the commit phase.

  This will improve performance by reducing the number of writes and flushes
  when the binary log is enabled.

5. Make the binary log be only a transaction coordinator.

  Although this will not bring any performance improvements, it will simplify
  the code and ease the implementation of the changes proposed in this WL.
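
As a rough illustration of how tasks 1 and 2 could fit together (a sketch under
assumed names, not necessarily the design adopted by this WL), transactions can
register in a queue in commit order, and a single "leader" can write, flush,
and commit them in that same order, so that the binary log and Innodb agree on
the ordering without a prepare_commit_mutex:

  #include <mutex>
  #include <string>
  #include <utility>
  #include <vector>

  // Illustration only: a FIFO queue drained by one "leader" preserves
  // the commit order (task 1) while still paying a single flush per
  // group (task 2). The names are assumptions, not the WL's classes.
  struct Pending_commit {
    std::string binlog_events;  // the transaction's cached binary log data
  };

  class Ordered_group_commit {
  public:
    // Each transaction registers in commit order and then waits its turn.
    void enqueue(Pending_commit commit) {
      std::lock_guard<std::mutex> guard(mutex_);
      queue_.push_back(std::move(commit));
    }

    // The leader drains the queue: write every transaction in FIFO
    // order, flush once, then commit in the engine in the same order.
    void process_group() {
      std::vector<Pending_commit> group;
      {
        std::lock_guard<std::mutex> guard(mutex_);
        group.swap(queue_);
      }
      for (const Pending_commit &c : group)
        write_to_binlog(c.binlog_events);  // binary log order == queue order
      flush_binlog();                      // one flush for the whole group
      for (const Pending_commit &c : group)
        commit_in_engine(c);               // engine commits in binlog order
    }

  private:
    // Stubs standing in for the real write, flush, and engine calls.
    void write_to_binlog(const std::string &) {}
    void flush_binlog() {}
    void commit_in_engine(const Pending_commit &) {}

    std::mutex mutex_;
    std::vector<Pending_commit> queue_;
  };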


NOTES/DISCUSSIONS
-----------------
Facebook's proof of concept showed that a more than 20x performance increase is
possible when (1), (2), and (4) are fixed [this was publicized at the MySQL
Conference /Matz]. Fixing just one of the three does not give nearly the same
improvement, so any new model should take all three into account.


FUTURE WORK
------------
We also believe that WL#4925 will improve performance, particularly when hard
drives are used and the operating system supports pre-allocating files with no
overhead. See WL#4925 for further details.

Bugs similar to BUG#11938382 need to be fixed because the dump thread grabs the
LOCK_log mutex to access a hot binary log file, thus harming performance.