WL#4925: Preallocating binary log for improving write performance

Affects: Server-5.6   —   Status: On-Hold

sync_binlog=1 is sometimes required for critical applications. If the sync_binlog
is not set to 1, there is a risk of corruption of the binary log if the
server crashes. But, in many filesystems/OS, sync_binlog=1 is currently way
too slow (sometimes 2-4 times slower than set to 0). 
I have researched the cause of the performance problem then now I believe that
one of the big causes is architecture of writing to binlog, "extending file per
each commit"
Currently binlog file is not pre-allocated, but extending per each transaction
commit. In contrast, InnoDB log file is pre-allocated and file size is not
changed. In other words, binlog adopts “appending” architecture, InnoDB log
adopts “overwriting” architecture. 
For most of operating systems/file systems (including Solaris ufs and Linux
ext3), overwriting is much faster than appending. So binlog should pre-allocate
large file size (e.g. 128MB at once), then overwriting per commit. 



Here is a result of my preallocating binlog patch.

mysqlslap insert throughput (InnoDB)

1) preallocate, sync_binlog=1
1 conn: 3,707 commits/sec
20 conn: 7,696 commits/sec
50 conn: 8,156 commits/sec


2) normal, sync_binlog=1
1 conn: 359 commits/sec
20 conn: 798 commits/sec
50 conn: 760 commits/sec

The difference between 1) and 2) is huge (10 times faster) so
preallocating binlog is really helpful for people using sync_binlog=1 (no
outstanding difference when using sync_binlog=0).


Here is a low-level benchmarking result on Linux (RHEL5.3, ext3). 

1. creating an empty file, then writing 128*1024 times with fdatasync()
  fdatasync per second: 3085.94321
  (Emulating current binlog (sync-binlog=1) behavior)

2. creating a 1GB data file, then writing 128*1024 times with fdatasync()
  fdatasync per second: 12330.47268
  (Emulating current InnoDB log file behavior)

3. zero-filling 1GB, then writing 128*1024 times with fdatasync()
immediately
  fdatasync per second: 6569.00072
  (Zero-filling causes massive disk writes so killing application performance)

4. zero-filling 1GB, sleeping 20 seconds, then writing 128*1024
times with fdatasync()
  fdatasync per second: 11669.81157

Apparently 2 and 4 are much better than 1 or 3. 

I. Preallocating algorithm
A: Automatic Preallocation: Internally using posix_fallocate() to preallocate binlog
B: Static Preallocation: Implementing a C++ tool to generate preallocated binary
log file(s)
II. How to handle existing binlog files
III. New my.cnf parameter



I. Preallocating mechanism

Two design choices can be considered to implement binlog preallocation.

A: Automatic Preallocation: Internally using posix_fallocate() to preallocate binlog
 Highly depending on filesystems and libc, but on some environments 
it is possible to preallocate files with no overhead, which doesn't kill
fdatasync() performance from applications.
 posix_fallocate() can be implicitly issued by mysqld so no explicit SQL command
is required.
 Currently posix_fallocate() doesn't work on most Linux distributions including
RHEL5.3, but does zero-filling (which causes massive writes) instead. This is
not fast.
 Here is a table about whether posix_fallocate() is supported or not.
 On Windows, SetFilePointerEx() and SetEndOfFile() can be used as a replacement
of posix_fallocate()
+------------+----------------------------+--------------+----------+---------+
| OS         |     Linux (Fedora11)       | Linux(RHEL5) |  Solaris | Windows |
| Filesystem |ext2/3| xfs | btrfs | ocfs2 |    ext2/3    |ufs | zfs |  NTFS   |
+------------+------+-----+-------+-------+--------------+----+-----+---------+
|            |  x   |  y  |   y   |   y   |      x       |  n |n(*1)|    y    |
+------------+------+-----+-------+-------+--------------+----+-----+---------+
n: posix_fallocate() is not supported
y: posix_fallocate() is supported (SetFilePointerEx() and SetEndOfFile() on Windows)
x: posix_fallocate() is supported, but doing zero-filling instead (very slow)
(*1): zfs is copy-on-write based filesystem so overwriting performance is 
almost as equal as appending (This should apply to btrfs though I have not
tested benchmark yet). 

 To encapsulate platform differences, we define a new function mysql_fallocate()
that does:
 - Use posix_fallocate() if it is available.
 - If on windows, define mysql_fallocate() to call SetEndOfFile() to
  allocate the file.
 - Otherwise, define it as a no-op, doing nothing.
 One issue is that in some OS/filesystems (i.e. ext3 on RHEL5) posix_fallocate()
is defined but does zero filling internally, which will take a lot of time and
massive writes. To avoid this scenario, we add a new configuration parameter so
that users can disable automatic preallocation. See "III. New my.cnf parameter"
for details. 


B: Static Preallocation: Implementing a C++ tool to generate preallocated binary
log file(s)
Allocating one more preallocated binary logs by using a command line client
"mysqlbinlogalloc", written in C++
Usage is as follows:
  ./mysqlbinlogalloc  --start-suffix=1 --end-suffix=10 --binlog-size=2G
--binlog-name=$MYSQLD_DATADIR/pre-allocated-binlog/empty
   Then empty.000001 to empty.000010, 2GB are created under
$MYSQLD_DATADIR/pre-allocated-binlog directory.

Note: --binlog-name must be set to the pre-allocated-binlog subdirectory under
mysql data directory. Otherwise, the files will not be picked by the mysql server.

Plan A is the simplest approach and DBAs do not have to do anything, but major
OS/filesystem doesn't support posix_fallocate() so supporting just plan A is not
enough. Plan B requires DBAs additional work, but this enables preallocating
binlog to work on all platforms. That's why supporting both plan A and plan B
would be the best approach. 


II. How to handle existing statically pre-allocated binlog files
If statically pre-allocated binlog file already exists, mysqlbinlogalloc should
exit with error (file exists) at that point. For example, if empty.000006 exists
in the above case, empty.000001 to empty.000005 will be created, but
empty.000006 will not be overwritten, and empty.000007 to empty.000010 will not
be created. 

When server rotating, the "$MYSQLD_DATADIR/pre-allocated-binlog/" directory is
investigated to find one pre-allocated binlog file and then rename it to a new
binlog file with the highest number in the sequence of binlogs, which is created
if no any pre-allocated binlog file is found.


III. New my.cnf parameter
Based on I and II, I propose introducing a new my.cnf parameter "uint
binlog_preallocate" to do the following.
0: Not doing automatic preallocation, but detecting statically preallocated files.
1: Detecting statically preallocated files, automatically preallocating one
binlog file if statically preallocated file does not exist.


1. Implementing a tool to generate preallocated binary log file(s)
Allocating one more preallocated binary logs by using a command line client
"mysqlbinlogalloc", written in C++
Usage is as follows:
  ./mysqlbinlogalloc  --start-suffix=1 --end-suffix=10 --binlog-size=2G
--binlog-name=$MYSQLD_DATADIR/pre-allocated-binlog/empty

The generate process:
- Pre-allocating a temp file and then rename it to empty.000001, repeat
the above process until pre-allocating the tenth temp file and rename it
to empty.000010.


2. Automatic preallocation
- Using tool requires more DBA work. Automatic preallocation makes DBAs happy.
- The problem of automatic preallocation is preallocation itself takes long time
and massive writes happen unless posix_fallocate() is supported. Currently
posix_fallocate() doesn't work on most of major OS/filesystems.
- Zero-filling large file takes time. This means it takes more time to starting
up mysqld. This means failover time becomes longer.
- General recommendation is using automatic preallocate only if
posix_fallocate() is accurately supported. Otherwise using tool is recommended.
Be careful that even though posix_fallocate() is available on client side, on
some environments it just does zero filling, which takes long time.

Automatic preallocation can be done by introducing a server system variable: 
- uint binlog_preallocate, default is 0
- if binlog_preallocate == 0, not doing automatic preallocation, but detecting
statically preallocated files.
- if binlog_preallocate == 1, automatically preallocating one new binlog file
named with the highest number in the sequence if statically preallocated file
does not exist. Otherwise renaming one statically preallocated file to the
new binlog file.


3. Truncating preallocated space when closing the binlog
Most of existing logics are implemented based on the idea that binlog file is
EOF after reading all events. This is not true for preallocated binlog.
Preallocated spaces are not EOF, but bogus data. Currently BINLOG DUMP thread
and et al terminates processing binlog when detecting failure events.

So truncating preallocated space when binlog is closed would be nice. Binlog
files are closed by mysqladmin shutdown, FLUSH LOGS, reaching maximum size, etc.
Closed binlog is not reused, but new binlog is created, so truncating space is
fine. Closing binlog is implemented in MYSQL_LOG::close(). 

If binlogs are not closed properly (i.e. killing mysqld), preallocated space is
not truncated. Behavior depends on tools/clients. The crashed binlog file will
be truncated after master server restarts. See WL#5493.
- BINLOG DUMP thread can proceed without stopping. When BINLOG DUMP thread can
not read events from binlog which is not closed correctly (can be corrupted),
BINLOG DUMP thread handles it as EOF, then moving to the next binlog. See
mysql_binlog_send(). So this is not a problem.
- SHOW BINLOG EVENTS just returns I/O error to mysql client even though it reads
all normal events. 
- mysqlbinlog returns error to STDERR when reading failure events. But
mysqlbinlog returns all normal events to STDOUT so you can get all events. In
practice this not a problem.


4. Checking actual size when BINLOG DUMP thread is reading active binlog.
(This change does not apply to client tools such as mysqlbinlog)
Issue:
Checking EOF of binlogs is needed. Currently binlogs are read by IO_SIZE (4K) at
one time until reaching EOF, but this doesn't work for preallocated binlogs
because preallocated binlogs have bogus data after actual events. 

For BINLOG DUMP thread, when reading the latest active binary log, getting
actual size information is needed to check EOF. actual size is got by
mysql_bin_log.get_log_file()->pos_in_file.


 - Reading log event (binlog dump thread)
 int Log_event::read_log_event(IO_CACHE* file, String* packet,
                               mysql_mutex_t* log_lock,
                               uint8 checksum_alg_arg)

BINLOG DUMP thread might read older binlogs. Actual size information is not
needed for older binlogs. Just reading until close or rotation event is fine.


5. Checking currently reading binlog is active or not
(This change does not apply to client tools such as mysqlbinlog)
BINLOG DUMP thread needs to check an actual size when reading active binlog.
There is a method to check whether the binlog is active or not.
bool MYSQL_BIN_LOG::is_active(const char *log_file_name_arg)

This method requires current binlog name. Currently Log_event::read_log_event()
does not have any information about currently reading binlog file name.

When reading binlog, open_binlog(IO_CACHE, ...) is called before calling
Log_event::read_log_event(IO_CACHE, ...). So setting current binlog file
information into IO_CACHE->file_name would be fine.