WL#5223: Group Commit of Binary Log
Affects: Server-5.6
Status: Complete
PROBLEM (SUMMARY)
-----------------
When the binary log is enabled there is a dramatic drop in performance due to
the following reasons:
. the binary log does not exploit group commit techniques;
. there are several accesses to disk, i.e. writes and flushes.
MySQL uses write-ahead logging to provide both durability and consistency.
Specifically, the storage engines write redo and, less frequently, undo
changes to the log and make sure that upon commit the changes made on behalf
of a transaction are written and flushed to stable storage (e.g. disk).
Notice however that the higher the number of transactions committing per
second, the higher the rate at which one needs to write and flush the logs.
If nothing is done, the log will eventually become a performance bottleneck.
To circumvent this issue one postpones for a while any access to stable
storage in order to gather in memory as many commits as possible and thus
issue a single write and flush for a set of transactions.
This technique is named group commit and is widely used in database systems
to improve performance.
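To make the technique concrete, here is a minimal, self-contained sketch of
the pattern (illustrative only; the class and member names are ours, not the
server's): threads enqueue their log records, and the first thread to find
the log idle becomes the batch leader and issues a single write and flush on
behalf of everyone queued behind it.

#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <string>
#include <vector>

/*
  Illustrative only: a minimal group-commit pattern. Threads enqueue their
  log records; the first thread to find the log idle becomes the "leader"
  and performs one write and one flush on behalf of the whole batch.
*/
class GroupCommitLog
{
public:
  void commit(const std::string &record)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    const uint64_t my_batch = next_batch_;
    buffer_.push_back(record);

    if (!flush_in_progress_)
    {
      flush_in_progress_ = true;
      /* Leader: flush batches until no more records are queued. Records
         arriving while we are writing are picked up in the next round. */
      while (!buffer_.empty())
      {
        std::vector<std::string> batch;
        batch.swap(buffer_);
        const uint64_t batch_id = next_batch_++;
        lock.unlock();
        write_and_fsync(batch);     /* one write + one fsync for N commits */
        lock.lock();
        flushed_batch_ = batch_id;
        done_.notify_all();         /* wake up the followers of this batch */
      }
      flush_in_progress_ = false;
    }
    else
    {
      /* Follower: our record is queued; wait until its batch is on disk. */
      done_.wait(lock, [&] { return flushed_batch_ >= my_batch; });
    }
  }

private:
  void write_and_fsync(const std::vector<std::string> &batch)
  {
    /* Write all records with a single write() and a single fsync(). */
  }

  std::mutex mutex_;
  std::condition_variable done_;
  std::vector<std::string> buffer_;
  bool flush_in_progress_ = false;
  uint64_t next_batch_ = 1;      /* id of the batch currently being filled */
  uint64_t flushed_batch_ = 0;   /* highest batch id already on disk */
};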
PROPOSED SOLUTION (SUMMARY)
---------------------------
We are going to:
. use the group commit technique to reduce the number of writes and
flushes;
PROBLEM (DETAILS)
-----------------
See in what follows what happens when one commits a transaction while the
binary log is enabled. This description is based on Harrison's analysis of
the performance problems associated with the current implementation and uses
Innodb as the storage engine because it is the only truly transactional
engine maintained by MySQL:
1. Prepare Innodb:
a) Write prepare record to Innodb's log buffer
b) Sync log file to disk
c) Take prepare_commit_mutex
2. "Prepare" binary log:
a) Write transaction to binary log
b) Sync binary log based on sync_binlog
3. Commit Innodb:
a) Write commit record to log
b) Release prepare_commit_mutex
c) Sync log file to disk
d) Innodb locks are released
4. "Commit" binary log:
a) Nothing necessary to do here.
There are five problems with this model:
1. Prepare_commit_mutex prevents the binary log and Innodb from
group committing.
The prepare_commit_mutex is used to ensure that transactions are committed
in the binary log in the same order as they are committed in the Innodb logs.
This is a requirement imposed by Innodb Hot Backup and we do not intend to
change that.
2. The binary log is not prepared for group committing.
Due to this mutex only one transaction executes step 2 at a time and as
such the binary log cannot group a set of transactions to reduce the number
of writes and flushes. Besides, the code is not prepared to exploit group
commit. In this WL, we plan to remove the mutex and make the binary log
exploit the group commit techniques.
3. Locks are held for the duration of 3 fsyncs.
MySQL uses locks to implement its consistency modes. Clearly, the higher the
number of locks, the lower the concurrency level. In general, it is only safe
to release a transaction's locks when it has written its commit record to
disk.
Locks are divided into two distinct sets: shared and exclusive locks. Shared
locks may be released as soon as one finds out that a transaction has
finished its activities and is willing to commit.
This point is reached after (1.a), when the database has obligated itself to
commit the transaction if asked to do so.
4. There are unnecessary disk accesses, i.e. too many fsyncs.
Transactions are written to disk three times when the binary log is enabled.
This may be improved as the binary log is considered the source of truth and
is used for recovery.
Currently, upon recovery, Innodb compiles a list of all transactions that
were prepared but neither committed nor aborted and checks the binary log to
decide on their fates. If a transaction was written to the binary log, it is
committed. Otherwise, it is rolled back.
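The recovery rule just described amounts to a simple per-XID decision. Below
is a minimal sketch of that decision with made-up names and plain integers
for XIDs (the server works on XID structures gathered from Innodb and from
scanning the binary log):

#include <cstdint>
#include <set>
#include <utility>
#include <vector>

typedef uint64_t xid_t;   /* illustrative; the server uses XID structures */

enum recovery_action { RECOVER_COMMIT, RECOVER_ROLLBACK };

/*
  Sketch of the recovery rule described above: a transaction that Innodb
  reports as prepared is committed if its XID was written to the binary log
  before the crash, and rolled back otherwise.
*/
std::vector<std::pair<xid_t, recovery_action> >
decide_prepared_transactions(const std::vector<xid_t> &prepared_in_innodb,
                             const std::set<xid_t> &xids_in_binlog)
{
  std::vector<std::pair<xid_t, recovery_action> > decisions;
  for (size_t i = 0; i < prepared_in_innodb.size(); i++)
  {
    const xid_t xid = prepared_in_innodb[i];
    const bool in_binlog = xids_in_binlog.count(xid) != 0;
    decisions.push_back(std::make_pair(
        xid, in_binlog ? RECOVER_COMMIT : RECOVER_ROLLBACK));
  }
  return decisions;
}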
Clearly, one does not need to write and flush the commit records because
eventually they will be written either by another transaction in the prepare
phase or by Innodb's background process, which writes and flushes Innodb's
log buffer every second.
In this WL, we postpone the write and flush at the commit phase, in order to
improve performance, and make the time between periodic writes and flushes
configurable. The greater the period, the greater the likelihood of
increasing the recovery time.
In the future, we should improve this scenario by avoiding the write and
flush at the prepare phase and relying on the binary log to replay missing
transactions. See WL#6305 for further details.
5. The binary log works as both a storage engine and a transaction
coordinator making it difficult to maintain and evolve.
The binary log registers as a handler and gets callbacks when preparing,
committing, and aborting. This allows it to write cached data to the binary
log or manipulate the "transaction state" in other ways.
In addition, the binary log acts as a transaction coordinator, whose purpose
is to order the commits and handle 2PC commit correctly.
The binary log is unsuitable as a handler and is in reality *only* a
Transaction Coordinator. The fact that it registers as a handler causes some
problems in maintenance (and potentially performance) and makes it more
difficult to implement new features.
PROPOSED SOLUTION (DETAILS)
---------------------------
The work of improving performance when the binary log is enabled can be split
in the following tasks:
1. Eliminate prepare_commit_mutex, or the need for it.
This has to do with the ordering of the transactions in the binary log
compared to the order of transactions in the Innodb logs.
2. Flush the binary log properly.
Preparing and committing a transaction to the binary log does not
automatically mean that the binary log is flushed. Indeed, the entire point
of performing a group commit of the binary log is to not flush the binary
log with each transaction and instead improve performance by reducing the
number of flushes per transaction.
3. Handle the release of the read locks so that it is possible to further
improve performance.
Releasing locks earlier may improve performance, especially for applications
that have a high number of reads.
4. Postpone write and flush at the commit phase.
This will improve performance by reducing the number of writes and flushes
when the binary log is enabled.
5. Make the binary log be only a transaction coordinator.
Although this will not bring any performance improvements, this will simplify
the code and ease implementation of the changes proposed in this WL.
NOTES/DISCUSSIONS
-----------------
Facebook's proof of concept showed that a > 20x performance increase was
possible when (1), (2), and (4) are fixed [this was publicized at the MySQL
Conference /Matz]. Fixing just one of the three does not give nearly the same
improvement, so any new model should take these three things into account.
FUTURE WORK
------------
We also believe that WL#4925 will improve performance, particularly when hard
drives are used and the operating system provides support for pre-allocating
files with no overhead. See WL#4925 for further details.
Bugs similar to BUG#11938382 need to be fixed because the DUMP Thread grabs
the LOCK_log mutex to access a hot binary log file, thus harming performance.
WL#2540: Replication event checksums
WL#4832: Improve scalability of binary logging
WL#5493: Binlog crash-safe when master crashed
HIGH LEVEL SOLUTION
-------------------
We propose to execute the following steps to commit a transaction:
1. Ask the binary log (i.e. the coordinator) to prepare
a) Request to release locks earlier
b) Prepare Innodb (Callback to (2.a))
2. Prepare Innodb:
a) Write prepare record to Innodb log buffer
b) Sync log file to disk
3. Ask the binary log (i.e. the coordinator) to commit
a) Lock access to flush stage
b) Write a set of transactions to the binary log
c) Unlock access to flush stage
d) Lock access to sync stage
e) Flush the binary log to disk
f) Unlock access to sync stage
g) Lock access to commit stage
h) Commit Innodb (Callback to (4.a))
i) Unlock access to commit stage
4. Commit Innodb
a) Write a commit record to Innodb log buffer
Similar steps happen when a transaction is rolled back but still needs to
write its changes to the binary log because non-transactional operations were
executed. The difference is that the transaction is never prepared and a
rollback is called instead of a commit. For simplicity we omit the cases for
all servant calls. See additional information on these cases in what follows.
MAKING BINARY LOG TRANSACTION COORDINATOR
-----------------------------------------
The solution is, in short, to not register the binary log as a handlerton and
instead extend the =TC_LOG= interface to cover the cases where critical work
is done by the binary log handlerton functions. There are a few places where
some work needs to be done by the binary log, and these naturally extend to
other transaction coordinators:
1. When committing, transaction records can be written to the binary log, not
only when it is a "real" transaction. This means that we always need to have
a call to a logging function such as the =log_xid= function, and not only
when it is an XA transaction. If we do that, all the binlog writing from
=binlog_commit= can be moved to this logging function.
2. When rolling back a transaction, a transaction record can potentially be
written to the binary log, or the caches have to be cleared to be able to
execute a new transaction.
3. When setting a savepoint, the binary log needs to set a marker in the
transaction cache to be able to truncate written data on a rollback.
4. When rolling back to a savepoint, the transaction cache needs to be
truncated.
5. When the connection is closed, it is necessary to clean up data for the
session.
6. The commit job for the binary log is done in the =log_xid= replacement
mentioned in point 1 above, but for symmetry it might make sense to introduce
a =commit= function as well, or just introduce the =commit= function and let
the =TC_LOG= do everything there.
Based on the existing functions in =handlerton.cc= (=ha_commit_trans=,
=ha_rollback_trans=, and =ha_prepare=), we stratify the commit interface in
three levels (see the sketch after this list):
1. Low-level transaction functions used by the transaction coordinator.
These functions commit all handlertons and reset transaction data for the
thread and are called when the transaction is actually committed. This means
copying information from the other functions in =handlerton.cc= and creating
separate functions =ha_commit_low=, =ha_prepare_low=, and =ha_rollback_low=.
These functions are used by the transaction coordinator to prepare, commit,
or roll back the transactions.
2. Transaction coordinator functions: =commit=, =prepare=, and =rollback= are
added. For =TC_LOG_DUMMY= and =TC_LOG_MMAP= these just call the corresponding
low-level functions above, but for the binary log the low-level functions
above will not be called until the transaction is successfully written to the
binary log.
3. High-level transaction functions used by the server: =ha_commit_trans=,
=ha_prepare_trans=, and =ha_rollback_trans=. These functions will use the
transaction coordinator interface to execute 2PC, if necessary. At large,
they remain intact to an external observer (except for the name change of
=ha_prepare= to =ha_prepare_trans=). The =ha_commit_trans= function contains
the 2PC procedure and will call the TC_LOG functions at the appropriate
times.
At no time may a function at a lower level call functions at a higher level,
but it is possible that a transaction coordinator decides to abort a
transaction instead of committing it.
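The intended interaction between the three levels can be sketched as follows.
This is a condensed illustration, not the final server code: error handling
and the short cuts for read-only or single-engine transactions are omitted,
and the =TC_LOG= members shown here are reduced to the commit path (the full
proposed interface is listed further below).

class THD;

/* Level 1: low-level helpers that drive the storage engine handlertons. */
int ha_prepare_low(THD *thd, bool all);
int ha_commit_low(THD *thd, bool all);
int ha_rollback_low(THD *thd, bool all);

/* Level 2: the transaction coordinator, reduced to the commit path. */
class TC_LOG
{
public:
  virtual int prepare(THD *thd, bool all)= 0;
  virtual int commit(THD *thd, bool all)= 0;
  virtual int rollback(THD *thd, bool all)= 0;
};
extern TC_LOG *tc_log;   /* TC_LOG_DUMMY, TC_LOG_MMAP or MYSQL_BIN_LOG */

/* Level 3: high-level entry point used by the server to execute 2PC. */
int ha_commit_trans(THD *thd, bool all)
{
  /* Phase one: ask the coordinator to prepare the storage engines. */
  if (tc_log->prepare(thd, all))
  {
    tc_log->rollback(thd, all);  /* aborts the engines via ha_rollback_low() */
    return 1;
  }
  /*
    Phase two: ask the coordinator to commit. For the binary log this is
    where the transaction is written and flushed to the binary log; only
    then are the engines committed through ha_commit_low(). TC_LOG_DUMMY
    and TC_LOG_MMAP call ha_commit_low() directly.
  */
  return tc_log->commit(thd, all);
}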
BINARY LOG GROUP COMMIT
-----------------------
This feature involves three queues associated with three stages for commit
and is the base for binary log group commit.
1. Flush Stage
In this stage, the caches are copied to the binary log cache and written to
disk.
2. Sync Stage
In this stage, the binary log file is synced to disk.
3. Commit Stage
In this stage, all transactions that participated in the sync stage are
committed in the same order as they were written to the binary log. As
aforementioned, this is a requirement imposed by Innodb Hot Backup. There
will be an option to enable this behavior and by default transactions may
commit in any order, thus further improving performance.
In particular, the first transaction that reaches a stage is elected leader
and the others are followers. The leader will perform the stage on behalf of
the followers and itself. Followers release all latches and wait on a
condition variable that is signalled when the leader has finished the commit
stage.
When entering a stage, the leader will grab the entire queue of sessions that
have queued up for the stage and process them according to the stage. When
leaving a stage, the leader queues up the full batch for the next stage, and
if the queue for the next stage was empty, it will be the leader for that
stage as well; otherwise, it will change role and become a follower. With
this strategy, stages that take a long time will accumulate many sessions
while still allowing earlier stages to execute as large batches as possible.
Even temporary degradations resulting from OS or hardware behaviour will
allow the procedure to adapt and accumulate larger batches.
For the flush queue, the leader acts a little differently. Instead of taking
the entire queue, the leader will read from the flush queue until the last
transaction is unqueued or a timeout expires. For each transaction fetched
from the flush queue, it writes its contents to the binary log's memory
buffer. If unqueueing a transaction results in an empty queue, the leader in
the flush phase will write the binary log's memory buffer to disk and proceed
to the sync stage. It is necessary to take the transactions one by one since
emptying the queue will cause the next thread that enqueues to become a stage
leader, which means that the currently running thread has to leave the flush
stage.
The idea is that by taking as many transactions as possible and including
them in the sync, we will increase the number of transactions per sync and
thereby improve the overall performance. The timeout keeps a boundary on how
long one should wait until proceeding to the next phase.
Otherwise, one could wait until all running transactions had committed, thus
causing performance problems. Notice that when the timeout has expired the
entire queue (i.e. the flush queue) has to be fetched, otherwise there will
be no leader to take care of the next batch.
Finally, to further improve performance, shared locks are released in the
prepare phase and commit records are written to memory and eventually to disk
by a transaction in the prepare phase or periodically by Innodb's checkpoint
process.
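The leader election and follower hand-off described above can be sketched as
follows. This is illustrative pseudocode in the style of the rest of this WL,
not the final implementation: the members of Stage_manager and Mutex_queue,
and the THD fields used here (tx_commit_pending, next_to_commit), are
placeholders.

class THD
{
public:
  bool tx_commit_pending;   /* illustrative: true until the batch is done */
  THD *next_to_commit;      /* illustrative: intrusive link for the queues */
  int commit_error;
};

/* A mutex-protected, intrusively linked queue of sessions. */
class Mutex_queue
{
public:
  /*
    Append a session (or a whole batch linked through next_to_commit) and
    return true if the queue was empty before the append, i.e. the caller
    becomes the leader of this stage.
  */
  bool append(THD *first);
  /* Detach and return the entire queue, leaving it empty. */
  THD *fetch_and_empty();
private:
  THD *m_first;
  mysql_mutex_t m_lock;
};

class Stage_manager
{
public:
  enum StageID { FLUSH_STAGE, SYNC_STAGE, COMMIT_STAGE, STAGE_COUNTER };

  /*
    Enroll 'first' (one session, or the batch carried over from the previous
    stage) for 'stage'. The leader acquires 'stage_mutex' and returns true;
    followers block until the commit-stage leader calls signal_done() and
    then return false so that they just report the result to their clients.
  */
  bool enroll_for(StageID stage, THD *first, mysql_mutex_t *stage_mutex)
  {
    bool leader= m_queue[stage].append(first);
    if (leader)
    {
      mysql_mutex_lock(stage_mutex);
      return true;
    }
    mysql_mutex_lock(&m_lock_done);
    while (first->tx_commit_pending)
      mysql_cond_wait(&m_cond_done, &m_lock_done);
    mysql_mutex_unlock(&m_lock_done);
    return false;
  }

  /* Detach the whole queue of a stage; used by the leader of that stage. */
  THD *fetch_queue_for(StageID stage)
  { return m_queue[stage].fetch_and_empty(); }

  /* Clear tx_commit_pending for every session in 'queue' and broadcast. */
  void signal_done(THD *queue);

private:
  Mutex_queue m_queue[STAGE_COUNTER];
  mysql_mutex_t m_lock_done;
  mysql_cond_t m_cond_done;
};

In this sketch, the change_stage() calls used by batch_commit() below would
release the mutex of the stage being left and then call enroll_for() for the
stage being entered.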
MAKING BINARY LOG TRANSACTION COORDINATOR
-----------------------------------------
class TC_LOG
{
/*
Prepare the coordinator to handle a new session. This works as a startup
function.
*/
virtual int open_connection(THD* thd)= 0;
/*
This works as a shutdown function and deinitializes any structures created
to handle the session.
*/
virtual int close_connection(THD* thd)= 0;
/*
Log a commit record of the transaction to the transaction coordinator log.
When the function returns, the transaction commit is properly logged to
the transaction coordinator log and can be committed in the storage
engines.
*/
virtual int commit(THD *thd, bool all) = 0;
/*
Log a rollback record of the transaction to the transaction coordinator
log. When the function returns, the transaction has been aborted in the
transaction coordinator log.
*/
virtual int rollback(THD *thd, bool all) = 0;
/*
Called when a new savepoint is defined and gives the chance to allocate
any internal structure to keep track of the savepoint, if necessary.
*/
virtual int savepoint_set(THD* thd, SAVEPOINT *sv)= 0;
/*
Called when a savepoint is released and gives the chance to clean up
any internal structure allocated to keep track of the savepoint, if
necessary.
*/
virtual int savepoint_release(THD* thd, SAVEPOINT *sv)= 0;
/*
Called when a transaction is rolled back to a previously defined savepoint
and is used to throw away saved changes that are not necessary after the
rollback.
*/
virtual int savepoint_rollback(THD* thd, SAVEPOINT *sv)= 0;
};
class TC_LOG_DUMMY: public TC_LOG
{
int open_connection(THD* thd) { return 0; }
int close_connection(THD* thd) { return 0; }
int commit(THD *thd, bool all) { return ha_commit_low(thd, all); }
int rollback(THD *thd, bool all) { return ha_rollback_low(thd, all); }
int savepoint_set(THD* thd, SAVEPOINT *sv) { return 0; }
int savepoint_release(THD* thd, SAVEPOINT *sv) { return 0; }
int savepoint_rollback(THD* thd, SAVEPOINT *sv) { return 0; }
};
class TC_LOG_MMAP: public TC_LOG
{
int open_connection(THD* thd) { return 0; }
int close_connection(THD* thd) { return 0; }
int commit(THD *thd, bool all);
int rollback(THD *thd, bool all) { return ha_rollback_low(thd, all); }
int savepoint_set(THD* thd, SAVEPOINT *sv) { return 0; }
int savepoint_release(THD* thd, SAVEPOINT *sv) { return 0; }
int savepoint_rollback(THD* thd, SAVEPOINT *sv) { return 0; }
};
int TC_LOG_MMAP::commit(THD *thd, bool all)
{
/* Get information on the Xid to do 2PC. */
ha_commit_low(thd, all); /* Call engines to carry on the commit. */
/* Release information on Xid */
}
class MYSQL_BIN_LOG: public TC_LOG
{
int open_connection(THD* thd);
int close_connection(THD* thd);
int commit(THD *thd, bool all);
int rollback(THD *thd, bool all);
int savepoint_set(THD* thd, SAVEPOINT *sv);
int savepoint_release(THD* thd, SAVEPOINT *sv);
int savepoint_rollback(THD* thd, SAVEPOINT *sv);
};
int MYSQL_BIN_LOG::open_connection(THD* thd)
{
/* Allocate memory used to store transaction's changes. */
}
int MYSQL_BIN_LOG::close_connection(THD* thd)
{
/* Deallocate memory used to store transaction's changes. */
}
int MYSQL_BIN_LOG::commit(THD *thd, bool all)
{
/* Call batch_commit(). */
}
int MYSQL_BIN_LOG::rollback(THD *thd, bool all)
{
/*
Call batch_commit() if there is any non-transactional change that needs
to be written to the binary log.
*/
}
int MYSQL_BIN_LOG::savepoint_set(THD* thd, SAVEPOINT *sv)
{
/* Sets a savepoint. */
/*
In the future, this can be used to improve how the binary log handles
savepoints. This is however out of the scope of this WL.
*/
}
int MYSQL_BIN_LOG::savepoint_release(THD* thd, SAVEPOINT *sv)
{
/* Does nothing. */
/*
In the future, this can be used to improve how the binary log handles
savepoints. This is however out of the scope of this WL.
*/
}
int MYSQL_BIN_LOG::savepoint_rollback(THD* thd, SAVEPOINT *sv)
{
/* Rolls back the binary log to a pre-defined savepoint. */
/*
In the future, this can be used to improve how the binary log handles
savepoints. This is however out of the scope of this WL.
*/
}
BINARY LOG GROUP COMMIT
-----------------------
In addition to the aforementioned changes and the removal of the
prepare_commit_mutex, we present in what follows the core of the WL.
int MYSQL_BIN_LOG::batch_commit(THD* thd, bool all)
{
int error= 0;
THD *final_queue= NULL;
/*
Add the transaction to the flush queue. The first transaction becomes
the leader and proceeds to the next stages. Followers will block
and eventually return the commit's status: success or error.
The stage is executed under the LOCK_log.
*/
if (change_stage(thd, Stage_manager::FLUSH_STAGE, thd, NULL, &LOCK_log))
return finish_commit(thd);
/*
Write the leader's own and its followers' contents to the binary log.
*/
THD *wait_queue= NULL; /* Gets a pointer to the flush queue */
error= flush_stage_queue(&wait_queue);
/*
Before going into this stage, the wait_queue is copied into the sync queue,
then the LOCK_log is released and the LOCK_sync is acquired.
*/
if (change_stage(thd, Stage_manager::SYNC_STAGE, wait_queue, &LOCK_log,
&LOCK_sync))
return finish_commit(thd);
/*
Sync the binary log according to the option sync_binlog.
*/
THD *sync_queue= NULL; /* Gets a pointer to the sync queue */
error= sync_stage_queue(&sync_queue);
/*
This stage is skipped if we do not need to order the commits, in which
case each thread has to execute the handlerton commit itself.
However, since we are keeping the lock from the previous stage, we
need to unlock it if we skip the stage.
*/
if (opt_binlog_order_commits)
{
if (change_stage(thd, Stage_manager::COMMIT_STAGE,
sync_queue, &LOCK_sync, &LOCK_commit))
return finish_commit(thd);
THD *commit_queue= NULL;
/*
Commit all transactions in the same order as they were written
into the binary log.
*/
error= commit_stage_queue(&commit_queue);
mysql_mutex_unlock(&LOCK_commit);
final_queue= commit_queue;
}
else
{
final_queue= sync_queue;
mysql_mutex_unlock(&LOCK_sync);
}
/*
Notify followers that they can carry on their activities:
either commit themselves if opt_binlog_order_commits is
false or simply return the result to their clients.
*/
stage_manager.signal_done(final_queue);
return finish_commit(thd);
}
int MYSQL_BIN_LOG::flush_stage_queue(THD** queue)
{
/*
Fetch sessions from the flush queue one by one, writing each session's
caches to the binary log's memory buffer, until the queue is empty or the
binlog_max_flush_queue_time timeout expires; then write the buffer to the
binary log file.
*/
}
int MYSQL_BIN_LOG::sync_stage_queue(THD** queue)
{
/* Sync the binary log file to disk according to the option sync_binlog. */
}
int MYSQL_BIN_LOG::commit_stage_queue(THD** queue)
{
/* Commit the sessions in the queue in the order they were written to the
binary log. */
}
ADDED OPTIONS
-------------
sql/sys_vars.cc:
static Sys_var_mybool Sys_binlog_order_commits(
"binlog_order_commits",
"Issue internal commit calls in the same order as transactions are"
" written to the binary log.",
GLOBAL_VAR(opt_binlog_order_commits),
CMD_LINE(OPT_ARG), DEFAULT(FALSE));
static Sys_var_int32 Sys_binlog_max_flush_queue_time(
"binlog_max_flush_queue_time",
"The maximum time that the binary log group commit will keep reading"
" transactions before it flushes the transactions to the binary log (and"
" optionally syncs, depending on the value of sync_binlog).",
GLOBAL_VAR(opt_binlog_max_flush_queue_time),
CMD_LINE(REQUIRED_ARG),
VALID_RANGE(0, 100000), DEFAULT(0), BLOCK_SIZE(1),
NO_MUTEX_GUARD, NOT_IN_BINLOG);
storage/innobase/handler/ha_innodb.cc:
static MYSQL_SYSVAR_UINT(flush_log_at_timeout, srv_flush_log_at_timeout,
PLUGIN_VAR_OPCMDARG,
"Write and flush logs every (n) second.",
NULL, NULL, 1, 0, 2700, 0);
User Documentation
==================
Changelog entry:
http://dev.mysql.com/doc/refman/5.6/en/news-5-6-6.html
System variable descriptions:
http://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html#sysvar_binlog_order_commits
http://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html#sysvar_binlog_max_flush_queue_time
http://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_timeout