WL#13782: InnoDB: Add dynamic config option to use fallocate() on Linux

Affects: Server-8.0   —   Status: Complete

The motivation is to make inserts faster for data loading.

InnoDB explicitly writes zeros to disk when creating new pages in a tablespace or 
when creating a new redo log file.

We take this conservative approach because fallocate() does not appear to be
atomic: it does not guarantee that allocating the space and updating the file
metadata in the file system happen as one atomic step (for example, if power
fails while fallocate() is executing). This is a problem because InnoDB
currently assumes that a freshly created page contains zeros (NUL bytes).

Introduce a dynamic variable: --innodb-extend-and-initialize

When this variable is set to FALSE, InnoDB does not write zeros and uses
fallocate(2) instead.


For additional safety, write a redo log record before allocating new pages so
that recovery can handle the case where the newly created pages were not marked
as zeroed out because of the fallocate() problem described above. Additionally,
we can zero out the page while processing the redo log record, because it is
guaranteed that the page will be reconstructible from any redo log records that
follow the page create redo log record.

Functional requirements
1. FR-1: Provide a global dynamic boolean variable innodb-extend-and-initialize 
to turn on/off initializing the newly created pages with zeros. The default value 
of this variable will be TRUE which means the newly allocated pages will be 
initialized with zeros.

2. FR-2: Do not explicitly zero out the newly created pages if
innodb-extend-and-initialize is set to false.

3. FR-3: Write a redo log record while creating new pages for recovery purposes.

4. FR-4: Initialize the pages to zero while processing the redo log record for 
extending the tablespace.

Non-functional requirements
1. NF-1: INSERT should be faster for data loading.
In the current implementation, new space is allocated and initialized by writing
zeros (NUL bytes) into it. This slows down operations that need more space in
tablespaces. The slowdown gets worse when new pages are added more frequently.

This worklog allows the user to configure the server to not write zeros while
allocating new pages in a tablespace. This will result in faster turnaround
times for operations which frequently allocate new pages in the tablespace.

The new pages are allocated by calling posix_fallocate() which reserves the 
space and updates the file metadata in the filesystem on successful execution. 
All the subsequent calls to access this page will see it as initialized by 
writing 0's in it.
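
The two allocation paths can be sketched as follows. This is a minimal
illustration, not InnoDB code: the helper extend_file(), its parameters and the
64 KiB write buffer are assumptions made for the example. Only
posix_fallocate() and pwrite() are real system calls.

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#include <algorithm>
#include <cerrno>
#include <vector>

// Hypothetical helper (not InnoDB code): grow the file behind `fd` by `len`
// bytes starting at `offset`. Returns 0 on success or an errno-style value.
int extend_file(int fd, off_t offset, off_t len, bool extend_and_initialize) {
  if (!extend_and_initialize) {
    // Reserve the space without writing any data. posix_fallocate()
    // returns the error number directly instead of setting errno.
    return posix_fallocate(fd, offset, len);
  }

  // Conservative path: explicitly write zero-filled buffers.
  std::vector<char> zeros(64 * 1024, 0);
  off_t done = 0;
  while (done < len) {
    const off_t want =
        std::min<off_t>(len - done, static_cast<off_t>(zeros.size()));
    const ssize_t n =
        pwrite(fd, zeros.data(), static_cast<size_t>(want), offset + done);
    if (n < 0) {
      if (errno == EINTR) continue;  // interrupted, retry the same chunk
      return errno;
    }
    if (n == 0) return EIO;  // no progress; avoid looping forever
    done += n;
  }
  return 0;
}
```

Both paths leave the file extended to offset + len; they differ only in whether
the zeros are physically written by the caller.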

However, the call does not appear to be atomic: the system can crash after the
space is reserved but just before the file metadata is updated in the
filesystem. In that situation, an attempt to access the page after recovery may
not see it as initialized. This is a problem because InnoDB assumes a newly
created page to be initialized with 0's.

This worklog also closes the small window in which posix_fallocate() may fail
after reserving the space but before updating the file metadata in the
filesystem. It does so by writing a redo log record for the space extend
operation just before allocating the new space. This log record will have the
following pieces of information recorded:

1) space_id: Space ID of the space being extended
2) Offset: Offset from where the space with given ID will be extended
3) Size: Size by which the file will be extended starting at offset

During recovery, this log record will be replayed to re-initialize this 
space by writing 0's in the space starting at offset recorded in the log record.
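
The record contents and the replay step can be sketched as below. The struct
name, the field types and the in-memory vector standing in for the tablespace
file are assumptions made for illustration. The worklog only specifies the
three fields, not their encoding.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: field names follow the list above, types are assumed.
struct SpaceExtendRecord {
  std::uint32_t space_id;  // space being extended
  std::uint64_t offset;    // byte offset where the extension starts
  std::uint64_t size;      // number of bytes added at `offset`
};

// Replay the record against an in-memory stand-in for the tablespace file:
// make sure the range [offset, offset + size) exists and contains zeros.
void replay_space_extend(const SpaceExtendRecord &rec,
                         std::vector<unsigned char> &file) {
  const std::uint64_t end = rec.offset + rec.size;
  if (file.size() < end) {
    file.resize(static_cast<std::size_t>(end));  // new bytes are zero
  }
  std::fill(file.begin() + static_cast<std::ptrdiff_t>(rec.offset),
            file.begin() + static_cast<std::ptrdiff_t>(end),
            static_cast<unsigned char>(0));
}
```

Replaying the same record more than once gives the same result.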

User Interface changes
A new boolean global dynamic variable innodb_extend_and_initialize will be 
introduced to control the allocation behavior.

The valid values for this parameter will be '0|1' or 'TRUE|FALSE' or 'ON|OFF'.

The variable innodb_extend_and_initialize can be set using the following 


This parameter can also be passed to the command line as follows:

mysqld --innodb-extend-and-initialize=[TRUE|FALSE]

The parameter can also be written in a .cnf file which can be passed to the 
mysqld as a command-line parameter:


The default value for the parameter innodb_extend_and_initialize will be true, 
to maintain backward compatibility.

The current value of this parameter will be shown by the command:


The current value of this parameter can also be queried by querying 


Behavior changes
If the variable is set to TRUE, the server will allocate the space and
initialize it with zeros, and a log entry will be added to the redo log for
recovery purposes.

If the variable is set to FALSE, the server will allocate the space without
initializing it with zeros, and a log entry will be added to the redo log for
recovery purposes.

While processing the redo log during recovery, the space described by each such
log entry will be initialized by writing zeros into it.

Impact on platforms other than Linux
This parameter is Linux-specific.

On all other POSIX and Windows systems, the allocation will happen in the
existing manner, by writing zeros into the allocated space.

Any attempt to change the variable innodb_extend_and_initialize on these systems
will result in the following warning message being returned to the user, and
the variable will keep its default value (TRUE):

"Changing innodb_extend_and_initialize not supported on this platform. Falling 
back to the default."
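
The fallback behavior can be sketched as a small update hook. The function
name, the UpdateResult type and the explicit platform_is_linux parameter are
hypothetical and used only to keep the example self-contained. The actual
server hook is not described in this worklog.

```cpp
#include <string>

// Hypothetical result of an attempt to change the variable.
struct UpdateResult {
  bool value;           // value the variable holds after the attempt
  std::string warning;  // empty when the change was accepted
};

// Sketch of the rule above: on non-Linux platforms every attempt to change
// the variable is rejected with a warning and the default (TRUE) is kept.
UpdateResult update_extend_and_initialize(bool requested,
                                          bool platform_is_linux) {
  if (!platform_is_linux) {
    return {true,
            "Changing innodb_extend_and_initialize not supported on this "
            "platform. Falling back to the default."};
  }
  return {requested, ""};
}
```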

Open items
Q: How does the server ensure that the checkpoint does not advance past the LSN
of this log record until the extend operation is complete?

There is an edge case identified in this implementation as described in the 
illustration given below:

S1 - Call Fil_shard::space_extend
S2 - mtr_start
S3 - Write redo log
S4 - mtr_commit
S5 - fallocate()
S6 - exit space_extend

The mtr is committed at step S4, which means the log entry has been written to
the redo log. Since this entry is not specific to a page, the checkpoint may
advance past this redo log record before the actual operation (S5) is complete.
In such a case, the redo log record may not be replayed during recovery, because
the checkpoint has progressed ahead of it.
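
The edge case can be illustrated with a simplified model. It assumes that
recovery replays only records whose LSN is at or after the checkpoint LSN. The
LogRecord type and the function below are hypothetical and are not InnoDB data
structures.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified, hypothetical model of a redo log entry.
struct LogRecord {
  std::uint64_t lsn;
  bool is_space_extend;
};

// Count the space-extend records that recovery would replay, assuming
// recovery only applies records at or after the checkpoint LSN.
std::size_t replayed_extends(const std::vector<LogRecord> &log,
                             std::uint64_t checkpoint_lsn) {
  std::size_t n = 0;
  for (const LogRecord &rec : log) {
    if (rec.is_space_extend && rec.lsn >= checkpoint_lsn) ++n;
  }
  return n;
}
```

With a space-extend record at LSN 100, a checkpoint at LSN 100 still replays
it, while a checkpoint that has moved to LSN 150 before S5 completes skips it.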