WL#5223: Group Commit of Binary Log
Affects: Server-5.6
—
Status: Complete
PROBLEM (SUMMARY) ----------------- When the binary log is enabled there is a dramatic drop in performance due to the following reasons: . the binary log does not exploit the group commit techniques; . there are several access to disk, i.e. writes and flushes. MySQL uses the write-ahead logging to provide both durability and consistency. Specifically, they write redo and less frequently undo changes to the log and make sure that upon commit the changes made of behalf of a transaction are written and flushed to a stable storage (e.g. disk). Notice however that the higher the number of transactions committing per second the higher the rate one needs to write and flush the logs. If nothing is done the log will eventually become a performance bottleneck. To circumvent this issue one postpones for a while any access to a stable storage in order to gather in memory as many commits as possible and thus issue one single write and flush for a set of transactions. This technique is named group commit and is widely used in database systems to improve their performances. PROPOSED SOLUTION (SUMMARY) --------------------------- We are going to: . use the group commit technique to reduce the number of writes and flushes; PROBLEM (DETAILS) ----------------- See in what follows what happens when one wants to commit a transaction and the binary log is enabled. This description is based on Harrison's analysis of the performance problems associated with the current implementation and uses Innodb as the storage engine because it is the only truly transactional engine kept by MySQL: 1. Prepare Innodb: a) Write prepare record to Innodb's log buffer b) Sync log file to disk c) Take prepare_commit_mutex 2. "Prepare" binary log: a) Write transaction to binary log b) Sync binary log based on sync_binlog 3. Commit Innodb: a) Write commit record to log b) Release prepare_commit_mutex c) Sync log file to disk d) Innodb locks are released 4. "Commit" binary log: a) Nothing necessary to do here. There are five problems with this model: 1. Prepare_commit_mutex prevents the binary log and Innodb from group committing. The prepare_commit_mutex is used to ensure that transactions are committed in the binary log by the same order they are committed in the Innodb logs. This is a requirement imposed by Innodb Hot Backup and we don't have the intention to change that. 2. The binary log is not prepared for group committing. Due to this mutex only one transaction executes step 2 at a time and as such the binary log cannot group a set of transactions to reduce the number of writes and flushes. Besides, the code is not prepared to exploit group commit. In this WL, we plan to remove the mutex and make the binary log exploit the group commit techniques. 3. Locks are held for duration of 3 fsync's. MySQL uses locks to implement its consistency modes. Clearly, the higher the number of locks the lower the concurrency level. In general, it is only safe to release transaction's locks when a it has written its commit record to disk. Locks are divided in two distinct sets: shared and exclusive locks. Shared locks may be released as soon as one finds out that a transaction has carried on its activities and is willing to commit. This point happens after (1.a) when the database has obligated itself to commit the transaction if someone decides to do so. 4. There are unnecessary disk accesses, i.e. too many fsync's. Transactions are written to disk three times when the binary log is enabled. This may be improved as the binary log is considered the source of truth and used for recovery. Currently, upon recovery, Innodb compiles a list of all transactions that were prepared and not either committed or aborted and checks the binary log to decide on their fates. If a transaction was written to the binary log, it is committed. Otherwise, it is rolled back. Clearly, one does not need to write and flush the commit records because eventually it will be written by another transaction in the prepare phase or Innodb's background process that every second writes and flushes Innodb buffer logs. In this WL, we postpone the write and flush at the commit phase, in order to improve performance and make the time between periodic writes and flushes configurable. The greater the period the greater the likelihood of increasing the recovery time. In the feature, we should improve this scenario by avoiding the write and flush at the prepare phase and relying in the binary log to replay missing transactions. See WL#6305 for further details. 5. The binary log works as both a storage engine and a transaction coordinator making it difficult to maintain and evolve. The binary log registers as a handler and get callbacks when preparing, committing, and aborting. This allow it to write cached data to the binary log or manipulate the "transaction state" in other ways. In addition, the binary log acts as a transaction coordinator, whose purpose is to order the commits and handle 2PC commit correctly. The binary log is unsuitable as a handler and is in reality *only* a Transaction Coordinator. The fact that it registers as a handler causes some problems in maintenance (and potentially performance) and makes it more difficult to implement new features. PROPOSED SOLUTION (DETAILS) --------------------------- The work of improving performance when the binary log is enabled can be split in the following tasks: 1. Eliminate prepare_commit_mutex, or the need for it. This has to do with the ordering of the transactions in the binary log compared to the order of transactions in the Innodb logs. 2. Flush the binary log properly. Preparing and committing a transaction to the binary log does not automatically mean that the binary log is flushed. Indeed, the entire point of performing a group commit of the binary log is to not flush the binary log with each transaction and instead improve performance by reducing the number of flushes per transaction. 3. Handle the release of the read locks so that it is possible to further improve performance. Releasing locks earlier may improve performance specially for applications that have a high number of reads. 4. Postpone write and flush at the commit phase. This will improve performance by reducing the number of writes and flushes when the binary log is enabled. 5. Make the binary log be only a transaction coordinator. Although this will not bring any performance improvements, this will simplify the code and ease implementation of the changes proposed in this WL. NOTES/DISCUSSIONS ----------------- Facebook's proof of concept showed a > 20x performance increase was possible when (1) (2) and (4) are fixed [this was publicized on the MySQL Conference /Matz]. Fixing just one of the three doesn't give nearly the improvement, so any new model should take into account these three things. FUTURE WORK ------------ We also believe that WL#4925 will improve performance particularly when Hard Drives are used and the operating system provides support to pre-allocate files with no overhead. See WL#4925 for further details. Bugs similar to BUG#11938382 need to be fixed because the DUMP Thread grabs mutex LOCK_log to access a hot binary log file thus harming performance.
WL#2540: Replication event checksums
WL#4832: Improve scalability of binary logging
WL#5493: Binlog crash-safe when master crashed
WL#4832: Improve scalability of binary logging
WL#5493: Binlog crash-safe when master crashed
HIGH LEVEL SOLUTION ------------------- We propose to execute the following steps to commit a transaction: 1. Ask binary log (i.e. coordinator to prepare a) Request to release locks earlier b) Prepare Innodb (Callback to (2.a)) 2. Prepare Innodb: a) Write prepare record to Innodb log buffer b) Sync log file to disk 3. Ask binary log (i.e. coordinator) to commit a) Lock access to flush stage b) Write a set of transactions to the binary log c) Unlock access to flush stage d) Lock access to sync stage e) Flush the binary log to disk f) Unlock access to sync stage g) Lock access to commit stage h) Commit Innodb (Callback to (4.a)) i) Unlock access to commit stage 4. Commit Innodb a) Write a commit record to Innodb log buffer Similar steps happen when a transaction is rolled back but still needs to write its changes to the binary log because non-transactional operations were executed. The different is that the transaction is never prepared and a rollback is called instead of a commit. For simplicity we omit the cases for all servant calls. See additional information on these cases in what follows. MAKING BINARY LOG TRANSACTION COORDINATOR ----------------------------------------- The solution is, in short, to not register the binary log as a handlerton and instead extend the =TC_LOG= interface to cover for the cases where critical job is done by the binary log handlerton functions. There are a few places where some job needs to be done by the binary log, and these naturally extend to other transaction coordinators: 1. When committing, transaction records can be written to the binary log, not only when it is a "real" transaction. This means that we always need to have a call to a logging function such as the =log_xid= function, and not only when it is an XA transaction. If we do that, all the binlog writing from the =binlog_commit= can be moved to this logging function. 2. When rolling back a transaction, a transaction record can potentially be written to the binary log, or the caches have to be cleared to be able to execute a new transaction. 3. When setting a savepoint, the binary log need to set a marker in the transaction cache to be able to truncate written data on a rollback. 4. When rolling back to a savepoint, the transaction cache need to be truncated. 5. When the connection is closed, it is necessary to clean up data for the session. 6. The commit job for the binary is done in the =log_xid= replacement mentioned in point 1 above, but for symmetry it might make sense to introduce a =commit= function as well, or just introduce the =commit= function and let the =TC_LOG= do everything there. Based on the existing functions in =handlerton.cc= (=ha_commit_trans=, =ha_rollback_trans=, and =ha_prepare=), we stratify the commit interface in three level: 1. Low-level transaction functions used by the transaction coordinator. These functions commit all handlertons and reset transaction data for the thread and are called when the transaction is actually committed. This means copying information from the other functions in =handlerton.cc= and creating separate functions =ha_commit_low=, =ha_prepare_low=, and =ha_rollback_low=. These functions are used by the transaction coordinator to prepare, commit, or rollback the transactions. 2. Transaction coordinator functions: =commit=, =prepare=, and =rollback= are added. For =TC_LOG_DUMMY= and =TC_LOG_MMAP= these just call the corresponding low-level functions above, but for the binary log the low-level functions above will not be called until the transaction is successfully written to the binary log. 3. High-level transaction functions used by the server: =ha_commit_trans=, =ha_prepare_trans=, and =ha_rollback_trans=. These functions will use the transaction coordinator interface to execute 2PC, if necessary. At large, they remain intact to an external observer (except for the name change of =ha_prepare= to =ha_prepare_trans=). The =ha_commit_trans= function contain the 2PC procedure and will call the TC_LOG functions at the appropriate times. At no time may a function at level N call functions at level N-1, but it is possible that a transaction coordinator decides to abort a transaction instead of committing it. BINARY LOG GROUP COMMIT ----------------------- This feature involves three queues associated with three stages for commit and is the base for binary log group commit. 1. Flush Stage In this stage, the caches are copied to the binary log cache and written to disk. 2. Sync Stage In this stage, the binary log file is synced to disk. 3. Commit Stage In this stage, all transactions that participated in the sync stage are committed by the same order as they were written to the binary log. As aforementioned, this is a requirement imposed by Innodb Hot Backup. There will be an option to enable this behavior and by default transactions may commit in any order thus further improving performance. In particular, the first transaction that reaches a stage is elected leader and the others are followers. The leader will perform the stage on behalf of the followers and itself. Followers release all latches and go waiting on a condition variable that is signalled when the leader has finished the commit stage. When entering a stage, the leader will then grab the entire queue of sessions that have queued up for the stage and process them according to the stage. When leaving a stage, the leader queues up the full queue for the next stage, and if the queue for the next stage was empty, it will be the leader for this stage as well, otherwise, it will change role and become a follower. With this strategy, stages that take long time will accumulate many sessions while still allowing earlier stages to execute as large batches as possible. Even temporary degradations resulting from OS or hardware behaviour will allow the procedure to adapt and accumulate larger batches. For the flush queue, the leader acts a little differently. Instead of taking the entire queue, the leader will read from flush queue until the last transaction is unqueued or a timeout expires. For each transaction fetched from the flush queue, it writes its contents to the binary log's memory buffer. If unqueueing a transaction results in an empty queue, the leader in the flush phase will write the binary log's memory buffer to disk and proceed to the sync stage. It is necessary to take the transactions one by one since emptying the queue will cause the next thread that enqueues to become a stage leader, which means that the currently running thread have to leave the flush stage. The idea is that by taking as many transactions as possible and including them in the sync, we will increase the number of transactions per sync and thereby improve the overall performance. The timeout keeps a boundary on how long one should wait until proceeding to the next phase. Otherwise, one could wait until all running transactions had committed thus causing performance problems. Notice when the time has expired the entire queue (i.e. flush queue) have to be fetched otherwise there will be no leader to take care of the next batch. Finally, to further improve performance, shared locks are released on the prepare phase and commit records are written to memory and eventually to disk by a transaction in the prepare phase or periodically by Innodb's checkpoint process.
MAKING BINARY LOG TRANSACTION COORDINATOR ----------------------------------------- class TC_LOG { /* Prepare the coordinator to handle a new session. This works as a startup function. */ virtual int open_connection(THD* thd)= 0; /* This works as a shutdown function and deinitialize any structure created to handle the session. */ virtual int close_connection(THD* thd)= 0; /* Log a commit record of the transaction to the transaction coordinator log. When the function returns, the transaction commit is properly logged to the transaction coordinator log and can be committed in the storage engines. */ virtual int commit(THD *thd, bool all) = 0; /* Log a rollback record of the transaction to the transaction coordinator log. When the function returns, the transaction have been aborted in the transaction coordinator log. */ virtual int rollback(THD *thd, bool all) = 0; /* Called when a new savepoint is defined and gives the chance to allocate any internal structure to keep track of the savepoint, if necessary. */ virtual int savepoint_set(THD* thd, SAVEPOINT *sv)= 0; /* Called when a savepoint is released and gives the chance to clean up any internal structure allocated to keep track of the savepoint, if necessary. */ virtual int savepoint_release(THD* thd, SAVEPOINT *sv)= 0; /* Called when a transaction is rolled back to a previously defined savepoint and is used to throw away saved changes that are not necessary after the rollback. */ virtual int savepoint_rollback(THD* thd, SAVEPOINT *sv)= 0; }; public class TC_LOG_DUMMY: public TC_LOG { int open_connection(THD* thd) { return 0; } int close_connection(THD* thd) { return 0; } int commit(THD *thd, bool all) { return ha_commit_low(thd, all); } int rollback(THD *thd, bool all) { return ha_rollback_low(thd, all); } int savepoint_set(THD* thd, SAVEPOINT *sv) { return 0; } int savepoint_release(THD* thd, SAVEPOINT *sv) { return 0; } int savepoint_rollback(THD* thd, SAVEPOINT *sv) { return 0; } }; public class TC_LOG_MMAP: public TC_LOG { int open_connection(THD* thd) { return 0; } int close_connection(THD* thd) { return 0; } int commit(THD *thd, bool all); int rollback(THD *thd, bool all) { return ha_rollback_low(thd, all); } int savepoint_set(THD* thd, SAVEPOINT *sv) { return 0; } int savepoint_release(THD* thd, SAVEPOINT *sv) { return 0; } int savepoint_rollback(THD* thd, SAVEPOINT *sv) { return 0; } }; int TC_LOG_MMAP::commit(THD *thd, bool all) { /* Get information on Xid to do a 2-PC*/ ha_commit_low(thd, all); /* Call engines to carry on the commit. */ /* Release information on Xid */ } public class MYSQL_BIN_LOG: public TC_LOG { int open_connection(THD* thd); int close_connection(THD* thd); int commit(THD *thd, bool all); int rollback(THD *thd, bool all); int savepoint_set(THD* thd, SAVEPOINT *sv); int savepoint_release(THD* thd, SAVEPOINT *sv); int savepoint_rollback(THD* thd, SAVEPOINT *sv); }; int MYSQL_BIN_LOG::open_connection(THD* thd) { /* Allocate memory used to store transaction's changes. */ } int MYSQL_BIN_LOG::close_connection(THD* thd) { /* Deallocate memory used to store transaction's changes. */ } int MYSQL_BIN_LOG::commit(THD *thd, bool all) { /* Call batch_commit(). */ } int rollback(THD *thd, bool all) { /* Call batch_commit() if there is any non-transactional change that require to be written to the binary log. */ } int savepoint_set(THD* thd, SAVEPOINT *sv) { /* Sets a savepoint. */ /* In the future, can be used to improve how the binary log handle savepoints. This is however out of the scope of this WL. */ } int savepoint_release(THD* thd, SAVEPOINT *sv) { /* Does nothing. */ /* In the future, can be used to improve how the binary log handle savepoints. This is however out of the scope of this WL. */ } int savepoint_rollback(THD* thd, SAVEPOINT *sv) { /* Rolls back the binary log to a pre-defined savepoint. */ /* In the future, can be used to improve how the binary log handle savepoints. This is however out of the scope of this WL. */ } BINARY LOG GROUP COMMIT ----------------------- Besides doing the aforementioned changes and removing the prepare_commit_mutex, we present in what follows the core of the WL. int MYSQL_BIN_LOG::batch_commit(THD* thd, bool all) { /* Add transactions to the flush queue. The first transaction becomes the leader and proceeds to the next stages. Followers will block and eventually will return the commit's status: success or error. The stage is executed under the LOCK_log. */ if (change_stage(thd, Stage_manager::FLUSH_STAGE, thd, NULL, &LOCK_log)) return finish_commit(thd->commit_error); /* Write itself and follower's contents to the binary log. */ THD *flush_queue= NULL; /* Gets a pointer to the flush_queue */ error= flush_stage_queue(&wait_queue); /* Before going into this stage, the wait_queue is copied into the sync_queue then the LOCK_log is released and the LOCK_sync is acquired. */ if (change_stage(thd, Stage_manager::SYNC_STAGE, wait_queue, &LOCK_log, &LOCK_sync)) return finish_commit(thd->commit_error); /* Sync the binary log according to the option sync_binlog. */ THD *sync_queue= NULL; /* Gets a pointer to the sync_queue */ error= sync_stage_queue(&sync_queue); /* This stage is skipped if we do not need to order the commits and each thread have to execute the handlerton commit instead. However, since we are keeping the lock from the previous stage, we need to unlock it if we skip the stage. */ if (opt_binlog_order_commits) { if (change_stage(thd, Stage_manager::COMMIT_STAGE, final_queue, &LOCK_sync, &LOCK_commit)) return finish_commit(thd); THD *commit_queue= NULL; /* Commit all transactions by the same order they were written into the binary log. */ error= commit_stage_queue(&commit_queue); mysql_mutex_unlock(&LOCK_commit); final_queue= commit_queue; } else { final_queue= sync_queue; mysql_mutex_unlock(&LOCK_sync); } /* Notify followers that the can carry on their activities. Either commit themselves if opt_binlog_order_commits is false or simply return the result to clients. */ stage_manager.signal_done(final_queue); return finish_commit(thd); } int MYSQL_BIN_LOG::flush_stage_queue(THD** queue) { } int MYSQL_BIN_LOG::sync_stage_queue(THD** queue) { } MYSQL_BIN_LOG::commit_stage_queue(THD** queue) { } ADDED OPTIONS ------------- sql/sys_vars.cc: static Sys_var_mybool Sys_binlog_order_commits( "binlog_order_commits", "Issue internal commit calls in the same order as transactions are" " written to the binary log.", GLOBAL_VAR(opt_binlog_order_commits), CMD_LINE(OPT_ARG), DEFAULT(FALSE)); static Sys_var_int32 Sys_binlog_max_flush_queue_time( "binlog_max_flush_queue_time", "The maximum time that the binary log group commit will keep reading" " transactions before it flush the transactions to the binary log (and" " optionally sync, depending on the value of sync_binlog).", GLOBAL_VAR(opt_binlog_max_flush_queue_time), CMD_LINE(REQUIRED_ARG), VALID_RANGE(0, 100000), DEFAULT(0), BLOCK_SIZE(1), NO_MUTEX_GUARD, NOT_IN_BINLOG); storage/innobase/handler/ha_innodb.cc: static MYSQL_SYSVAR_UINT(flush_log_at_timeout, srv_flush_log_at_timeout, PLUGIN_VAR_OPCMDARG, "Write and flush logs every (n) second.", NULL, NULL, 1, 0, 2700, 0); User Documentation ================== Changelog entry: http://dev.mysql.com/doc/refman/5.6/en/news-5-6-6.html System variable descriptions: http://dev.mysql.com/doc/refman/5.6/en/replication-options-binary- log.html#sysvar_binlog_order_commits http://dev.mysql.com/doc/refman/5.6/en/replication-options-binary- log.html#sysvar_binlog_max_flush_queue_time http://dev.mysql.com/doc/refman/5.6/en/innodb- parameters.html#sysvar_innodb_flush_log_at_timeout
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.