sync_binlog=1 is sometimes required for critical applications. If the sync_binlog is not set to 1, there is a risk of corruption of the binary log if the server crashes. But, in many filesystems/OS, sync_binlog=1 is currently way too slow (sometimes 2-4 times slower than set to 0). I have researched the cause of the performance problem then now I believe that one of the big causes is architecture of writing to binlog, "extending file per each commit" Currently binlog file is not pre-allocated, but extending per each transaction commit. In contrast, InnoDB log file is pre-allocated and file size is not changed. In other words, binlog adopts “appending” architecture, InnoDB log adopts “overwriting” architecture. For most of operating systems/file systems (including Solaris ufs and Linux ext3), overwriting is much faster than appending. So binlog should pre-allocate large file size (e.g. 128MB at once), then overwriting per commit. Here is a result of my preallocating binlog patch. mysqlslap insert throughput (InnoDB) 1) preallocate, sync_binlog=1 1 conn: 3,707 commits/sec 20 conn: 7,696 commits/sec 50 conn: 8,156 commits/sec 2) normal, sync_binlog=1 1 conn: 359 commits/sec 20 conn: 798 commits/sec 50 conn: 760 commits/sec The difference between 1) and 2) is huge (10 times faster) so preallocating binlog is really helpful for people using sync_binlog=1 (no outstanding difference when using sync_binlog=0). Here is a low-level benchmarking result on Linux (RHEL5.3, ext3). 1. creating an empty file, then writing 128*1024 times with fdatasync() fdatasync per second: 3085.94321 (Emulating current binlog (sync-binlog=1) behavior) 2. creating a 1GB data file, then writing 128*1024 times with fdatasync() fdatasync per second: 12330.47268 (Emulating current InnoDB log file behavior) 3. zero-filling 1GB, then writing 128*1024 times with fdatasync() immediately fdatasync per second: 6569.00072 (Zero-filling causes massive disk writes so killing application performance) 4. zero-filling 1GB, sleeping 20 seconds, then writing 128*1024 times with fdatasync() fdatasync per second: 11669.81157 Apparently 2 and 4 are much better than 1 or 3.
I. Preallocating algorithm A: Automatic Preallocation: Internally using posix_fallocate() to preallocate binlog B: Static Preallocation: Implementing a C++ tool to generate preallocated binary log file(s) II. How to handle existing binlog files III. New my.cnf parameter I. Preallocating mechanism Two design choices can be considered to implement binlog preallocation. A: Automatic Preallocation: Internally using posix_fallocate() to preallocate binlog Highly depending on filesystems and libc, but on some environments it is possible to preallocate files with no overhead, which doesn't kill fdatasync() performance from applications. posix_fallocate() can be implicitly issued by mysqld so no explicit SQL command is required. Currently posix_fallocate() doesn't work on most Linux distributions including RHEL5.3, but does zero-filling (which causes massive writes) instead. This is not fast. Here is a table about whether posix_fallocate() is supported or not. On Windows, SetFilePointerEx() and SetEndOfFile() can be used as a replacement of posix_fallocate() +------------+----------------------------+--------------+----------+---------+ | OS | Linux (Fedora11) | Linux(RHEL5) | Solaris | Windows | | Filesystem |ext2/3| xfs | btrfs | ocfs2 | ext2/3 |ufs | zfs | NTFS | +------------+------+-----+-------+-------+--------------+----+-----+---------+ | | x | y | y | y | x | n |n(*1)| y | +------------+------+-----+-------+-------+--------------+----+-----+---------+ n: posix_fallocate() is not supported y: posix_fallocate() is supported (SetFilePointerEx() and SetEndOfFile() on Windows) x: posix_fallocate() is supported, but doing zero-filling instead (very slow) (*1): zfs is copy-on-write based filesystem so overwriting performance is almost as equal as appending (This should apply to btrfs though I have not tested benchmark yet). To encapsulate platform differences, we define a new function mysql_fallocate() that does: - Use posix_fallocate() if it is available. - If on windows, define mysql_fallocate() to call SetEndOfFile() to allocate the file. - Otherwise, define it as a no-op, doing nothing. One issue is that in some OS/filesystems (i.e. ext3 on RHEL5) posix_fallocate() is defined but does zero filling internally, which will take a lot of time and massive writes. To avoid this scenario, we add a new configuration parameter so that users can disable automatic preallocation. See "III. New my.cnf parameter" for details. B: Static Preallocation: Implementing a C++ tool to generate preallocated binary log file(s) Allocating one more preallocated binary logs by using a command line client "mysqlbinlogalloc", written in C++ Usage is as follows: ./mysqlbinlogalloc --start-suffix=1 --end-suffix=10 --binlog-size=2G --binlog-name=$MYSQLD_DATADIR/pre-allocated-binlog/empty Then empty.000001 to empty.000010, 2GB are created under $MYSQLD_DATADIR/pre-allocated-binlog directory. Note: --binlog-name must be set to the pre-allocated-binlog subdirectory under mysql data directory. Otherwise, the files will not be picked by the mysql server. Plan A is the simplest approach and DBAs do not have to do anything, but major OS/filesystem doesn't support posix_fallocate() so supporting just plan A is not enough. Plan B requires DBAs additional work, but this enables preallocating binlog to work on all platforms. That's why supporting both plan A and plan B would be the best approach. II. How to handle existing statically pre-allocated binlog files If statically pre-allocated binlog file already exists, mysqlbinlogalloc should exit with error (file exists) at that point. For example, if empty.000006 exists in the above case, empty.000001 to empty.000005 will be created, but empty.000006 will not be overwritten, and empty.000007 to empty.000010 will not be created. When server rotating, the "$MYSQLD_DATADIR/pre-allocated-binlog/" directory is investigated to find one pre-allocated binlog file and then rename it to a new binlog file with the highest number in the sequence of binlogs, which is created if no any pre-allocated binlog file is found. III. New my.cnf parameter Based on I and II, I propose introducing a new my.cnf parameter "uint binlog_preallocate" to do the following. 0: Not doing automatic preallocation, but detecting statically preallocated files. 1: Detecting statically preallocated files, automatically preallocating one binlog file if statically preallocated file does not exist.
1. Implementing a tool to generate preallocated binary log file(s) Allocating one more preallocated binary logs by using a command line client "mysqlbinlogalloc", written in C++ Usage is as follows: ./mysqlbinlogalloc --start-suffix=1 --end-suffix=10 --binlog-size=2G --binlog-name=$MYSQLD_DATADIR/pre-allocated-binlog/empty The generate process: - Pre-allocating a temp file and then rename it to empty.000001, repeat the above process until pre-allocating the tenth temp file and rename it to empty.000010. 2. Automatic preallocation - Using tool requires more DBA work. Automatic preallocation makes DBAs happy. - The problem of automatic preallocation is preallocation itself takes long time and massive writes happen unless posix_fallocate() is supported. Currently posix_fallocate() doesn't work on most of major OS/filesystems. - Zero-filling large file takes time. This means it takes more time to starting up mysqld. This means failover time becomes longer. - General recommendation is using automatic preallocate only if posix_fallocate() is accurately supported. Otherwise using tool is recommended. Be careful that even though posix_fallocate() is available on client side, on some environments it just does zero filling, which takes long time. Automatic preallocation can be done by introducing a server system variable: - uint binlog_preallocate, default is 0 - if binlog_preallocate == 0, not doing automatic preallocation, but detecting statically preallocated files. - if binlog_preallocate == 1, automatically preallocating one new binlog file named with the highest number in the sequence if statically preallocated file does not exist. Otherwise renaming one statically preallocated file to the new binlog file. 3. Truncating preallocated space when closing the binlog Most of existing logics are implemented based on the idea that binlog file is EOF after reading all events. This is not true for preallocated binlog. Preallocated spaces are not EOF, but bogus data. Currently BINLOG DUMP thread and et al terminates processing binlog when detecting failure events. So truncating preallocated space when binlog is closed would be nice. Binlog files are closed by mysqladmin shutdown, FLUSH LOGS, reaching maximum size, etc. Closed binlog is not reused, but new binlog is created, so truncating space is fine. Closing binlog is implemented in MYSQL_LOG::close(). If binlogs are not closed properly (i.e. killing mysqld), preallocated space is not truncated. Behavior depends on tools/clients. The crashed binlog file will be truncated after master server restarts. See WL#5493. - BINLOG DUMP thread can proceed without stopping. When BINLOG DUMP thread can not read events from binlog which is not closed correctly (can be corrupted), BINLOG DUMP thread handles it as EOF, then moving to the next binlog. See mysql_binlog_send(). So this is not a problem. - SHOW BINLOG EVENTS just returns I/O error to mysql client even though it reads all normal events. - mysqlbinlog returns error to STDERR when reading failure events. But mysqlbinlog returns all normal events to STDOUT so you can get all events. In practice this not a problem. 4. Checking actual size when BINLOG DUMP thread is reading active binlog. (This change does not apply to client tools such as mysqlbinlog) Issue: Checking EOF of binlogs is needed. Currently binlogs are read by IO_SIZE (4K) at one time until reaching EOF, but this doesn't work for preallocated binlogs because preallocated binlogs have bogus data after actual events. For BINLOG DUMP thread, when reading the latest active binary log, getting actual size information is needed to check EOF. actual size is got by mysql_bin_log.get_log_file()->pos_in_file. - Reading log event (binlog dump thread) int Log_event::read_log_event(IO_CACHE* file, String* packet, mysql_mutex_t* log_lock, uint8 checksum_alg_arg) BINLOG DUMP thread might read older binlogs. Actual size information is not needed for older binlogs. Just reading until close or rotation event is fine. 5. Checking currently reading binlog is active or not (This change does not apply to client tools such as mysqlbinlog) BINLOG DUMP thread needs to check an actual size when reading active binlog. There is a method to check whether the binlog is active or not. bool MYSQL_BIN_LOG::is_active(const char *log_file_name_arg) This method requires current binlog name. Currently Log_event::read_log_event() does not have any information about currently reading binlog file name. When reading binlog, open_binlog(IO_CACHE, ...) is called before calling Log_event::read_log_event(IO_CACHE, ...). So setting current binlog file information into IO_CACHE->file_name would be fine.