WL#5223: Group Commit of Binary Log

Affects: Server-5.6   —   Status: Complete

PROBLEM (SUMMARY)
-----------------
When the binary log is enabled there is a dramatic drop in performance due to
the following reasons:

  . the binary log does not exploit group commit techniques;

  . there are several accesses to disk, i.e. writes and flushes.

MySQL uses write-ahead logging to provide both durability and consistency.
Specifically, it writes redo and, less frequently, undo changes to the log and
makes sure that upon commit the changes made on behalf of a transaction are
written and flushed to stable storage (e.g. disk).

Notice, however, that the higher the number of transactions committing per
second, the higher the rate at which one needs to write and flush the logs. If
nothing is done, the log will eventually become a performance bottleneck. To
circumvent this issue, one postpones access to stable storage for a while in
order to gather in memory as many commits as possible and thus issue one
single write and flush for a set of transactions.

This technique is named group commit and is widely used in database systems to
improve performance.


PROPOSED SOLUTION (SUMMARY)
---------------------------
We are going to:

  . use the group commit technique to reduce the number of writes and
    flushes;


PROBLEM (DETAILS)
-----------------
What follows describes what happens when one wants to commit a transaction and
the binary log is enabled. This description is based on Harrison's analysis of
the performance problems associated with the current implementation and uses
Innodb as the storage engine because it is the only truly transactional engine
shipped with MySQL (a pseudocode sketch of this flow is shown after the steps):

1. Prepare Innodb:

   a) Write prepare record to Innodb's log buffer
   b) Sync log file to disk
   c) Take prepare_commit_mutex

2. "Prepare" binary log:

   a) Write transaction to binary log
   b) Sync binary log based on sync_binlog

3. Commit Innodb:

   a) Write commit record to log
   b) Release prepare_commit_mutex
   c) Sync log file to disk
   d) Innodb locks are released

4. "Commit" binary log:

   a) Nothing necessary to do here.
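
A minimal pseudocode sketch of this flow is shown below. The function names
(innodb_prepare, binlog_write_and_sync, innodb_commit_write, and so on) are
illustrative placeholders, not the actual server calls:

void legacy_commit(THD *thd)
{
  innodb_prepare(thd);                       /* 1.a/1.b: write prepare record, sync Innodb log.   */
  mysql_mutex_lock(&prepare_commit_mutex);   /* 1.c: serializes every committer from here on.     */

  binlog_write_and_sync(thd);                /* 2.a/2.b: write transaction, sync per sync_binlog. */

  innodb_commit_write(thd);                  /* 3.a: write commit record to the Innodb log.       */
  mysql_mutex_unlock(&prepare_commit_mutex); /* 3.b: only now can the next transaction proceed.   */
  innodb_log_sync();                         /* 3.c: third fsync.                                  */
  innodb_release_locks(thd);                 /* 3.d: release Innodb locks.                         */
}

Because the mutex is held across the binary log write and the Innodb commit
record, only one transaction at a time can be inside steps 2 and 3, which is
exactly what prevents grouping.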

There are five problems with this model:

1. Prepare_commit_mutex prevents the binary log and Innodb from
   group committing.

  The prepare_commit_mutex is used to ensure that transactions are committed
  in the binary log in the same order as they are committed in the Innodb
  logs. This is a requirement imposed by Innodb Hot Backup and we do not
  intend to change that.

2. The binary log is not prepared for group committing.

  Due to this mutex, only one transaction executes step 2 at a time and, as
  such, the binary log cannot group a set of transactions to reduce the number
  of writes and flushes. Besides, the code is not prepared to exploit group
  commit. In this WL, we plan to remove the mutex and make the binary log
  exploit group commit techniques.

3. Locks are held for the duration of 3 fsync's.

  MySQL uses locks to implement its consistency modes. Clearly, the higher the
  number of locks, the lower the concurrency level. In general, it is only
  safe to release a transaction's locks when it has written its commit record
  to disk.
  Locks are divided into two distinct sets: shared and exclusive locks. Shared
  locks may be released as soon as it is known that a transaction has finished
  its activities and is willing to commit.
  This point is reached after (1.a), when the database has obligated itself to
  commit the transaction if someone decides to do so.

4. There are unnecessary disk accesses, i.e. too many fsync's.

  Transactions are written to disk three times when the binary log is enabled.
  This may be improved, as the binary log is considered the source of truth
  and is used for recovery.
  Currently, upon recovery, Innodb compiles a list of all transactions that
  were prepared but neither committed nor aborted and checks the binary log to
  decide on their fates. If a transaction was written to the binary log, it is
  committed. Otherwise, it is rolled back. (A sketch of this recovery decision
  appears after this list.)
  Clearly, one does not need to write and flush the commit records because
  eventually they will be written by another transaction in the prepare phase
  or by Innodb's background process, which writes and flushes Innodb's log
  buffer every second.
  In this WL, we postpone the write and flush at the commit phase in order to
  improve performance, and make the time between periodic writes and flushes
  configurable. The greater the period, the greater the likelihood of
  increasing the recovery time.
  In the future, we should improve this scenario by avoiding the write and
  flush at the prepare phase and relying on the binary log to replay missing
  transactions. See WL#6305 for further details.

5. The binary log works as both a storage engine and a transaction
   coordinator, making it difficult to maintain and evolve.

  The binary log registers as a handler and gets callbacks when preparing,
  committing, and aborting. This allows it to write cached data to the binary
  log or manipulate the "transaction state" in other ways.
  In addition, the binary log acts as a transaction coordinator, whose purpose
  is to order the commits and handle 2PC correctly.
  The binary log is unsuitable as a handler and is in reality *only* a
  transaction coordinator. The fact that it registers as a handler causes some
  problems in maintenance (and potentially performance) and makes it more
  difficult to implement new features.
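
The recovery decision referred to in problem 4 can be sketched as follows. The
helper names (collect_prepared_xids, binlog_contains_xid, innodb_commit_by_xid,
innodb_rollback_by_xid) are illustrative, not the actual recovery functions:

void recover_prepared_transactions()
{
  /* Xids that Innodb reports as prepared but neither committed nor aborted. */
  std::vector<XID> prepared= collect_prepared_xids();

  for (size_t i= 0; i < prepared.size(); i++)
  {
    if (binlog_contains_xid(prepared[i]))
      innodb_commit_by_xid(prepared[i]);   /* It reached the source of truth: commit it. */
    else
      innodb_rollback_by_xid(prepared[i]); /* It never reached the binary log: roll it back. */
  }
}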


PROPOSED SOLUTION (DETAILS)
--------------------------- 
The work of improving performance when the binary log is enabled can be split
into the following tasks:

1. Eliminate prepare_commit_mutex, or the need for it.

  This has to do with the ordering of the transactions in the binary log
  compared to the order of transactions in the Innodb logs.

2. Flush the binary log properly.

  Preparing and committing a transaction to the binary log does not 
  automatically mean that the binary log is flushed. Indeed, the entire point
  of performing a group commit of the binary log is to not flush the binary
  log with each transaction and instead improve performance by reducing the
  number of flushes per transaction.

3. Handle the release of the read locks so that it is possible to further
   improve performance.

  Releasing locks earlier may improve performance, especially for applications
  that have a high number of reads.

4. Postpone write and flush at the commit phase.

  This will improve performance by reducing the number of writes and flushes
  when the binary log is enabled.

5. Make the binary log be only a transaction coordinator.

  Although this will not bring any performance improvements, it will simplify
  the code and ease implementation of the changes proposed in this WL.


NOTES/DISCUSSIONS
-----------------
Facebook's proof of concept showed that a more than 20x performance increase
is possible when (1), (2) and (4) are fixed [this was presented at the MySQL
Conference /Matz]. Fixing just one of the three does not give nearly the same
improvement, so any new model should take these three things into account.


FUTURE WORK
------------
We also believe that WL#4925 will improve performance, particularly when hard
drives are used and the operating system provides support to pre-allocate
files with no overhead. See WL#4925 for further details.

Bugs similar to BUG#11938382 need to be fixed because the dump thread grabs
the LOCK_log mutex to access a hot binary log file, thus harming performance.


HIGH LEVEL SOLUTION
-------------------
We propose to execute the following steps to commit a transaction:

1. Ask binary log (i.e. coordinator) to prepare

  a) Request to release locks earlier
  b) Prepare Innodb (Callback to (2.a))

2. Prepare Innodb:

  a) Write prepare record to Innodb log buffer
  b) Sync log file to disk

3. Ask binary log (i.e. coordinator) to commit

  a) Lock access to flush stage
  b) Write a set of transactions to the binary log
  c) Unlock access to flush stage
  d) Lock access to sync stage
  e) Flush the binary log to disk
  f) Unlock access to sync stage
  g) Lock access to commit stage
  h) Commit Innodb (Callback to (4.a))
  i) Unlock access to commit stage

4. Commit Innodb

  a) Write a commit record to Innodb log buffer

Similar steps happen when a transaction is rolled back but still needs to
write its changes to the binary log because non-transactional operations
were executed. The difference is that the transaction is never prepared
and a rollback is called instead of a commit.

For simplicity, we omit the cases for all savepoint calls. See additional
information on these cases in what follows.


MAKING BINARY LOG TRANSACTION COORDINATOR
-----------------------------------------
The solution is, in short, to not register the binary log as a handlerton
and instead extend the =TC_LOG= interface to cover the cases where critical
work is done by the binary log handlerton functions.


There are a few places where some work needs to be done by the binary
log, and these naturally extend to other transaction coordinators:

1. When committing, transaction records can be written to the binary
   log, not only when it is a "real" transaction. This means that we
   always need to have a call to a logging function such as the
   =log_xid= function, and not only when it is an XA transaction. If
   we do that, all the binlog writing from the =binlog_commit= can be
   moved to this logging function.

2. When rolling back a transaction, a transaction record can
   potentially be written to the binary log, or the caches have to be
   cleared to be able to execute a new transaction.

3. When setting a savepoint, the binary log needs to set a marker in
   the transaction cache to be able to truncate written data on a
   rollback.

4. When rolling back to a savepoint, the transaction cache needs to be
   truncated (a sketch of points 3 and 4 follows this list).

5. When the connection is closed, it is necessary to clean up data for
   the session.

6. The commit job for the binary log is done in the =log_xid= replacement
   mentioned in point 1 above, but for symmetry it might make sense to
   introduce a =commit= function as well, or just introduce the
   =commit= function and let the =TC_LOG= do everything there.
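
As a minimal sketch of points 3 and 4 above, the marker can simply be the
current position of the binary log transaction cache. The helpers
binlog_cache_position() and binlog_cache_truncate(), as well as the
sv->binlog_pos field, are hypothetical names used only for illustration:

/* Point 3: remember where the transaction cache was when the savepoint was set. */
int binlog_savepoint_set(THD *thd, SAVEPOINT *sv)
{
  sv->binlog_pos= binlog_cache_position(thd);   /* hypothetical helper */
  return 0;
}

/* Point 4: throw away everything written to the cache after the savepoint. */
int binlog_savepoint_rollback(THD *thd, SAVEPOINT *sv)
{
  binlog_cache_truncate(thd, sv->binlog_pos);   /* hypothetical helper */
  return 0;
}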

Based on the existing functions in =handlerton.cc= (=ha_commit_trans=,
=ha_rollback_trans=, and =ha_prepare=), we stratify the commit
interface into three levels:

1. Low-level transaction functions used by the transaction
   coordinator. These functions commit all handlertons and reset
   transaction data for the thread and are called when the transaction
   is actually committed.

   This means copying information from the other functions in
   =handlerton.cc= and creating separate functions =ha_commit_low=,
   =ha_prepare_low=, and =ha_rollback_low=.

   These functions are used by the transaction coordinator to prepare, commit,
   or rollback the transactions.

2. Transaction coordinator functions: =commit=, =prepare=, and
   =rollback= are added. For =TC_LOG_DUMMY= and =TC_LOG_MMAP= these
   just call the corresponding low-level functions above, but for the
   binary log the low-level functions above will not be called until
   the transaction is successfully written to the binary log.

3. High-level transaction functions used by the server:
   =ha_commit_trans=, =ha_prepare_trans=, and
   =ha_rollback_trans=. These functions will use the transaction
   coordinator interface to execute 2PC, if necessary. By and large,
   they remain unchanged to an external observer (except for the
   renaming of =ha_prepare= to =ha_prepare_trans=).

   The =ha_commit_trans= function contains the 2PC procedure and will
   call the TC_LOG functions at the appropriate times.

A function at level N only ever calls functions at the level directly below it
(N-1), never the other way around; it is, however, possible that a transaction
coordinator decides to abort a transaction instead of committing it.
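
As a sketch of how the three levels fit together (the bodies are illustrative,
not the actual server code; the global =tc_log= pointer is assumed to point to
the active transaction coordinator):

/* Level 3: high-level function used by the server. */
int ha_commit_trans(THD *thd, bool all)
{
  if (ha_prepare_low(thd, all))     /* 2PC prepare phase across the engines. */
    return 1;
  return tc_log->commit(thd, all);  /* Level 3 calls level 2: the coordinator decides. */
}

/* Level 2: the coordinator; for TC_LOG_DUMMY it goes straight to level 1. */
int TC_LOG_DUMMY::commit(THD *thd, bool all)
{
  return ha_commit_low(thd, all);   /* Level 2 calls level 1. */
}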


BINARY LOG GROUP COMMIT
-----------------------

This feature involves three queues associated with three stages for commit
and is the base for binary log group commit.

1. Flush Stage

  In this stage, the session caches are copied to the binary log cache
  and written to the binary log file.

2. Sync Stage

  In this stage, the binary log file is synced to disk.

3. Commit Stage

  In this stage, all transactions that participated in the sync stage are
  committed in the same order as they were written to the binary log.
  As mentioned before, this is a requirement imposed by Innodb Hot Backup.
  There will be an option to enable this behavior; by default, transactions
  may commit in any order, thus further improving performance.

In particular, the first transaction that reaches a stage is elected leader and
the others are followers. The leader will perform the stage on behalf of the
followers and itself.

Followers release all latches and wait on a condition variable that is
signalled when the leader has finished the commit stage.

When entering a stage, the leader grabs the entire queue of sessions that have
queued up for the stage and processes them according to the stage.

When leaving a stage, the leader queues up the full queue for the next stage;
if the queue for the next stage was empty, it will be the leader for that
stage as well, otherwise it changes role and becomes a follower.

With this strategy, stages that take a long time will accumulate many sessions
while still allowing earlier stages to execute batches that are as large as
possible. Even temporary degradations resulting from OS or hardware behaviour
will allow the procedure to adapt and accumulate larger batches.
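
A minimal sketch of the leader election and queue hand-off, assuming a
simplified Stage_manager with one atomically swapped queue per stage (the
names enroll(), wait_done() and fetch_queue() are illustrative):

bool Stage_manager::enroll(StageID stage, THD *thd)
{
  /*
    Append the session to the stage's queue; append() is assumed to return
    true if the queue was empty before the call. The session that finds the
    queue empty becomes the leader for this stage; everybody else is a
    follower.
  */
  bool leader= m_queue[stage].append(thd);
  if (!leader)
    wait_done(thd);   /* Followers block until the commit stage leader signals them. */
  return leader;
}

THD *Stage_manager::fetch_queue(StageID stage)
{
  /* The leader atomically grabs the entire queue and processes it. */
  return m_queue[stage].fetch_and_empty();
}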

For the flush queue, the leader acts a little differently. Instead of taking
the entire queue, the leader reads from the flush queue until the last
transaction is unqueued or a timeout expires.

For each transaction fetched from the flush queue, it writes its contents to
the binary log's memory buffer. If unqueueing a transaction results in an
empty queue, the leader of the flush stage writes the binary log's memory
buffer to disk and proceeds to the sync stage.

It is necessary to take the transactions one by one since emptying the queue
will cause the next thread that enqueues to become a stage leader, which means
that the currently running thread has to leave the flush stage.

The idea is that by taking as many transactions as possible and including them
in the sync, we will increase the number of transactions per sync and thereby
improve the overall performance.

The timeout bounds how long one should wait before proceeding to the next
phase. Otherwise, one could wait until all running transactions had committed,
thus causing performance problems. Notice that when the time has expired, the
entire queue (i.e. the flush queue) has to be fetched, otherwise there will be
no leader to take care of the next batch.
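
A sketch of the flush-stage loop described above, with hypothetical helpers
dequeue_one(), flush_thread_caches(), fetch_remaining(), next_in_queue() and
write_binlog_buffer():

int MYSQL_BIN_LOG::process_flush_stage(ulonglong timeout_us)
{
  const ulonglong deadline= my_micro_time() + timeout_us;
  bool queue_emptied= false;

  /* Take transactions one by one; the thread that empties the queue stops. */
  while (!queue_emptied)
  {
    THD *thd= dequeue_one(&queue_emptied);
    flush_thread_caches(thd);       /* Copy the session's cache into the binlog buffer. */

    if (!queue_emptied && my_micro_time() > deadline)
    {
      /*
        Timeout: fetch the whole remaining queue in one go, otherwise there
        would be no leader left to take care of the next batch.
      */
      for (THD *t= fetch_remaining(); t != NULL; t= next_in_queue(t))
        flush_thread_caches(t);
      queue_emptied= true;
    }
  }
  return write_binlog_buffer();     /* Write the buffer to the file; syncing is the next stage. */
}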

Finally, to further improve performance, shared locks are released in the
prepare phase and commit records are written to memory and eventually flushed
to disk by a transaction in the prepare phase or periodically by Innodb's
checkpoint process.


MAKING BINARY LOG TRANSACTION COORDINATOR
-----------------------------------------
class TC_LOG
{
 public:
  /*
    Prepare the coordinator to handle a new session. This works as a startup
    function.
  */
  virtual int open_connection(THD* thd)= 0;

  /*
    This works as a shutdown function and deinitialize any structure created
    to handle the session.
  */
  virtual int close_connection(THD* thd)= 0;

  /*
     Log a commit record of the transaction to the transaction coordinator log.
     When the function returns, the transaction commit is properly logged to
     the transaction coordinator log and can be committed in the storage
     engines.
  */
  virtual int commit(THD *thd, bool all) = 0;

  /*
     Log a rollback record of the transaction to the transaction coordinator
     log. When the function returns, the transaction has been aborted in the
     transaction coordinator log.
  */
  virtual int rollback(THD *thd, bool all) = 0;

  /*
    Called when a new savepoint is defined and gives the chance to allocate
    any internal structure to keep track of the savepoint, if necessary.
  */
  virtual int savepoint_set(THD* thd, SAVEPOINT *sv)= 0;

  /*
    Called when a savepoint is released and gives the chance to clean up
    any internal structure allocated to keep track of the savepoint, if
    necessary.
  */
  virtual int savepoint_release(THD* thd, SAVEPOINT *sv)= 0;

  /*
    Called when a transaction is rolled back to a previously defined savepoint
    and is used to throw away saved changes that are not necessary after the
    rollback.
  */
  virtual int savepoint_rollback(THD* thd, SAVEPOINT *sv)= 0;
};

class TC_LOG_DUMMY: public TC_LOG
{
 public:
  int open_connection(THD* thd)         { return 0; }
  int close_connection(THD* thd)        { return 0; }
  int commit(THD *thd, bool all)        { return ha_commit_low(thd, all); }
  int rollback(THD *thd, bool all)      { return ha_rollback_low(thd, all); }
  int savepoint_set(THD* thd, SAVEPOINT *sv)      { return 0; }
  int savepoint_release(THD* thd, SAVEPOINT *sv)  { return 0; }
  int savepoint_rollback(THD* thd, SAVEPOINT *sv) { return 0; }
};

class TC_LOG_MMAP: public TC_LOG
{
 public:
  int open_connection(THD* thd)         { return 0; }
  int close_connection(THD* thd)        { return 0; }
  int commit(THD *thd, bool all);
  int rollback(THD *thd, bool all)      { return ha_rollback_low(thd, all); }
  int savepoint_set(THD* thd, SAVEPOINT *sv)      { return 0; }
  int savepoint_release(THD* thd, SAVEPOINT *sv)  { return 0; }
  int savepoint_rollback(THD* thd, SAVEPOINT *sv) { return 0; }
};

int TC_LOG_MMAP::commit(THD *thd, bool all)
{
  /* Get information on the Xid to do 2PC. */
  int error= ha_commit_low(thd, all); /* Call engines to carry on the commit. */
  /* Release information on the Xid. */
  return error;
}

class MYSQL_BIN_LOG: public TC_LOG
{
 public:
  int open_connection(THD* thd);
  int close_connection(THD* thd);
  int commit(THD *thd, bool all);

  int rollback(THD *thd, bool all);
  int savepoint_set(THD* thd, SAVEPOINT *sv);
  int savepoint_release(THD* thd, SAVEPOINT *sv);
  int savepoint_rollback(THD* thd, SAVEPOINT *sv);
};

int MYSQL_BIN_LOG::open_connection(THD* thd)
{
  /* Allocate memory used to store transaction's changes. */
}

int MYSQL_BIN_LOG::close_connection(THD* thd)
{
  /* Deallocate memory used to store transaction's changes. */
}

int MYSQL_BIN_LOG::commit(THD *thd, bool all)
{
  /* Delegate to the group commit machinery. */
  return batch_commit(thd, all);
}

int MYSQL_BIN_LOG::rollback(THD *thd, bool all)
{
  /*
    Call batch_commit() if there is any non-transactional change
    that needs to be written to the binary log.
  */
}

int MYSQL_BIN_LOG::savepoint_set(THD* thd, SAVEPOINT *sv)
{
  /* Sets a savepoint. */
  /*
    In the future, this can be used to improve how the binary log
    handles savepoints. This is, however, out of the scope of
    this WL.
  */
}

int MYSQL_BIN_LOG::savepoint_release(THD* thd, SAVEPOINT *sv)
{
  /* Does nothing. */
  /*
    In the future, this can be used to improve how the binary log
    handles savepoints. This is, however, out of the scope of
    this WL.
  */
}

int MYSQL_BIN_LOG::savepoint_rollback(THD* thd, SAVEPOINT *sv)
{
  /* Rolls back the binary log to a previously defined savepoint. */
  /*
    In the future, this can be used to improve how the binary log
    handles savepoints. This is, however, out of the scope of
    this WL.
  */
}


BINARY LOG GROUP COMMIT
-----------------------
In addition to the aforementioned changes and the removal of the
prepare_commit_mutex, we present in what follows the core of the WL.

int MYSQL_BIN_LOG::batch_commit(THD* thd, bool all)
{
  int error= 0;
  THD *final_queue= NULL;

  /*
     Add the transaction to the flush queue. The first transaction becomes
     the leader and proceeds to the next stages. Followers will block
     and eventually return the commit's status: success or error.

     The stage is executed under the LOCK_log.
  */
  if (change_stage(thd, Stage_manager::FLUSH_STAGE, thd, NULL, &LOCK_log))
    return finish_commit(thd);

  /*
    Write its own and its followers' contents to the binary log.
  */
  THD *wait_queue= NULL; /* Gets a pointer to the flush queue. */
  error= flush_stage_queue(&wait_queue);

  /*
    Before going into this stage, the wait_queue is copied into the sync_queue
    then the LOCK_log is released and the LOCK_sync is acquired.
  */
  if (change_stage(thd, Stage_manager::SYNC_STAGE, wait_queue, &LOCK_log,
                   &LOCK_sync))
    return finish_commit(thd);

  /*
     Sync the binary log according to the option sync_binlog.
  */
  THD *sync_queue= NULL; /* Gets a pointer to the sync_queue */
  error= sync_stage_queue(&sync_queue);

  /*
    This stage is skipped if we do not need to order the commits and
    each thread has to execute the handlerton commit instead.

    However, since we are keeping the lock from the previous stage, we
    need to unlock it if we skip the stage.
  */
  if (opt_binlog_order_commits)
  {
    if (change_stage(thd, Stage_manager::COMMIT_STAGE,
                     final_queue, &LOCK_sync, &LOCK_commit))
      return finish_commit(thd);
    THD *commit_queue= NULL;
    /*
      Commit all transactions in the same order as they were written
      into the binary log.
    */
    error= commit_stage_queue(&commit_queue);
    mysql_mutex_unlock(&LOCK_commit);
    final_queue= commit_queue;
  }
  else
  {
    final_queue= sync_queue;
    mysql_mutex_unlock(&LOCK_sync);
  }

  /*
    Notify followers that they can carry on their activities:
    either commit themselves if opt_binlog_order_commits is
    false or simply return the result to clients.
  */
  stage_manager.signal_done(final_queue);


  return finish_commit(thd);
}

int MYSQL_BIN_LOG::flush_stage_queue(THD** queue)
{
  /* Fetch the flush queue and write the sessions' caches to the binary log. */
}

int MYSQL_BIN_LOG::sync_stage_queue(THD** queue)
{
  /* Sync the binary log file to disk according to the option sync_binlog. */
}

int MYSQL_BIN_LOG::commit_stage_queue(THD** queue)
{
  /* Commit each session in the queue in the order it was written to the binary log. */
}


ADDED OPTIONS
-------------
sql/sys_vars.cc:

static Sys_var_mybool Sys_binlog_order_commits(
  "binlog_order_commits",
  "Issue internal commit calls in the same order as transactions are"
  " written to the binary log.",
  GLOBAL_VAR(opt_binlog_order_commits),
  CMD_LINE(OPT_ARG), DEFAULT(FALSE));

static Sys_var_int32 Sys_binlog_max_flush_queue_time(
       "binlog_max_flush_queue_time",
       "The maximum time that the binary log group commit will keep reading"
       " transactions before it flush the transactions to the binary log (and"
       " optionally sync, depending on the value of sync_binlog).",
       GLOBAL_VAR(opt_binlog_max_flush_queue_time),
       CMD_LINE(REQUIRED_ARG),
       VALID_RANGE(0, 100000), DEFAULT(0), BLOCK_SIZE(1),
       NO_MUTEX_GUARD, NOT_IN_BINLOG);

storage/innobase/handler/ha_innodb.cc:

static MYSQL_SYSVAR_UINT(flush_log_at_timeout, srv_flush_log_at_timeout,
  PLUGIN_VAR_OPCMDARG,
  "Write and flush logs every (n) second.",
  NULL, NULL, 1, 0, 2700, 0);


User Documentation
==================

Changelog entry:
http://dev.mysql.com/doc/refman/5.6/en/news-5-6-6.html

System variable descriptions:
http://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html#sysvar_binlog_order_commits

http://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html#sysvar_binlog_max_flush_queue_time

http://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_timeout