WL#13782: InnoDB: Add dynamic config option to use fallocate() on Linux
Affects: Server-8.0 — Status: Complete
The motivation is to make inserts faster during data loading.

InnoDB explicitly writes zeros to disk when it creates new pages in a tablespace or creates a new redo log file. It takes this conservative approach because fallocate() does not appear to be atomic: there is no guarantee that allocating the space and updating the file metadata in the file system happen as a single step (for example, if power fails while fallocate() is executing). This is a problem because InnoDB currently assumes that a freshly created page contains only zeros (NUL bytes).

Introduce a dynamic variable: --innodb-extend-and-initialize

When this variable is disabled (set to FALSE), InnoDB will not write zeros but will use fallocate(2) instead:
http://man7.org/linux/man-pages/man2/fallocate.2.html

For additional safety, write a redo log record before allocating new pages so that recovery can handle the case where the newly created pages were not marked as zeroed out because of the fallocate() problem described above. Additionally, the pages can be zeroed out while processing that redo log record, because their contents are guaranteed to be reconstructible from the redo log records that follow the page-create redo log record.
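The two ways of extending a file described above can be compared in a small sketch. This is not InnoDB code: the function names, the `extend_and_initialize` parameter, and the use of a temporary file are all assumptions made for illustration. It relies on Python's `os.posix_fallocate()`, which is available only on Unix-like systems.

```python
import os
import tempfile

PAGE_SIZE = 16 * 1024  # illustrative page size (InnoDB's default is 16KB)


def extend_with_zeros(fd: int, offset: int, size: int) -> None:
    """Conservative path: extend the file by explicitly writing zero bytes."""
    zeros = b"\0" * size
    written = 0
    while written < size:
        written += os.pwrite(fd, zeros[written:], offset + written)


def extend_with_fallocate(fd: int, offset: int, size: int) -> None:
    """Fast path: reserve the space without writing any data."""
    os.posix_fallocate(fd, offset, size)


def extend(fd: int, offset: int, size: int, extend_and_initialize: bool = True) -> None:
    """Hypothetical dispatcher mirroring the proposed variable."""
    if extend_and_initialize:
        extend_with_zeros(fd, offset, size)
    else:
        extend_with_fallocate(fd, offset, size)


def demo() -> list:
    """Extend an empty temporary file both ways; report (size, reads_as_zeros)."""
    results = []
    for flag in (True, False):
        with tempfile.TemporaryFile() as f:
            fd = f.fileno()
            extend(fd, 0, 4 * PAGE_SIZE, extend_and_initialize=flag)
            size = os.fstat(fd).st_size
            results.append((size, os.pread(fd, size, 0) == b"\0" * size))
    return results
```

When no crash occurs, both paths leave a file of the same size that reads back as zeros. The difference this worklog is concerned with is the cost of the write and what happens if the system fails partway through the fast path.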
Functional requirements

1. FR-1: Provide a global dynamic boolean variable innodb-extend-and-initialize to turn on/off initializing the newly created pages with zeros. The default value of this variable will be TRUE, which means the newly allocated pages will be initialized with zeros.
2. FR-2: Do not explicitly zero out the newly created pages if innodb-extend-and-initialize is set to FALSE.
3. FR-3: Write a redo log record while creating new pages, for recovery purposes.
4. FR-4: Initialize the pages to zero while processing the redo log record for extending the tablespace.

Non-functional requirements

1. NF-1: INSERT should be faster for data loading.
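FR-3 and FR-4 can be pictured with a toy redo record that is written before the extension and replayed during recovery. The sketch below is an assumption-laden illustration, not the InnoDB record format: the three-integer binary layout, the function names, and the `open_files` mapping are hypothetical. Only the three fields (space ID, offset, size) come from this worklog's design.

```python
import os
import struct
import tempfile

# Hypothetical layout: three little-endian unsigned 64-bit integers
# (space_id, offset, size). The real InnoDB encoding is not shown here.
EXTEND_RECORD = struct.Struct("<QQQ")


def encode_extend_record(space_id: int, offset: int, size: int) -> bytes:
    """FR-3: build the record that is logged before the space is extended."""
    return EXTEND_RECORD.pack(space_id, offset, size)


def replay_extend_record(record: bytes, open_files: dict) -> None:
    """FR-4: during recovery, zero-fill the region named by the record."""
    space_id, offset, size = EXTEND_RECORD.unpack(record)
    fd = open_files[space_id]
    zeros = b"\0" * size
    written = 0
    while written < size:
        written += os.pwrite(fd, zeros[written:], offset + written)


def demo() -> bytes:
    """Simulate an extension whose region is not zeroed, then replay the record."""
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        os.pwrite(fd, b"A" * 4096, 0)               # existing, valid data
        record = encode_extend_record(7, 4096, 8192)
        os.pwrite(fd, b"\xff" * 8192, 4096)         # stand-in for an uninitialized region
        replay_extend_record(record, {7: fd})
        return os.pread(fd, 4096 + 8192, 0)
```

Replaying the record touches only the byte range it names, so data before the recorded offset is left as it was.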
In the current implementation, new space is allocated and initialized by writing zeros (NUL bytes) into it. This slows down operations that need more space in a tablespace, and the slowdown grows as new pages are added more frequently. This worklog allows the user to configure the server not to write zeros while allocating new pages in a tablespace. This results in faster turnaround times for operations that frequently allocate new pages in the tablespace.

The new pages are allocated by calling posix_fallocate(), which reserves the space and updates the file metadata in the file system on successful execution. All subsequent accesses to such a page will see it as if it had been initialized with zeros. However, the call does not seem to be atomic, i.e. the system can crash after the space has been reserved but just before the file metadata in the file system has been updated. This leads to a situation where an attempt to access the page after recovery may not return it as initialized. This is a problem because InnoDB assumes that a newly created page is initialized with zeros.

This worklog also closes this small window, in which posix_fallocate() may fail after reserving the space but before updating the file metadata in the file system, by writing a redo log record for the space extend operation just before allocating the new space. This log record will contain the following pieces of information:

1) space_id: Space ID of the space being extended
2) Offset: Offset from where the space with the given ID will be extended
3) Size: Size by which the file will be extended, starting at offset

During recovery, this log record will be replayed to re-initialize the space by writing zeros into it, starting at the offset recorded in the log record.

User Interface changes
----------------------
A new boolean global dynamic variable innodb_extend_and_initialize will be introduced to control the allocation behavior. The valid values for this parameter will be '0|1' or 'TRUE|FALSE' or 'ON|OFF'.
The variable innodb_extend_and_initialize can be set using the following commands:

    mysql> SET @@GLOBAL.INNODB_EXTEND_AND_INITIALIZE = [TRUE|FALSE];
    mysql> SET GLOBAL INNODB_EXTEND_AND_INITIALIZE = [TRUE|FALSE];
    mysql> SET PERSIST INNODB_EXTEND_AND_INITIALIZE = [TRUE|FALSE];

This parameter can also be passed on the command line as follows:

    mysqld --innodb-extend-and-initialize=[TRUE|FALSE]

The parameter can also be written in a .cnf file that is passed to mysqld as a command-line parameter:

    [mysqld]
    innodb_extend_and_initialize=[TRUE|FALSE]

The default value of innodb_extend_and_initialize will be TRUE, to maintain backward compatibility.

The current value of this parameter will be shown by the command:

    SELECT @@GLOBAL.INNODB_EXTEND_AND_INITIALIZE;

The current value can also be obtained by querying performance_schema.global_variables:

    SELECT * FROM performance_schema.global_variables WHERE VARIABLE_NAME LIKE '%innodb_extend_and_initialize%';

Behavior changes
-----------------
If the variable is set to TRUE, the server will allocate the space and initialize it with zeros, and a log entry will be added to the redo log for recovery purposes.

If the variable is set to FALSE, the server will allocate the space without initializing it with zeros, and a log entry will be added to the redo log for recovery purposes.

While processing the redo log during recovery, the space described by the extend log record will be initialized by writing zeros into it.

Impact on platforms other than Linux
------------------------------------
This parameter is Linux specific. On all other POSIX systems and on Windows, allocation will happen in the existing manner, by writing zeros into the allocated space.
Any attempt to change the variable innodb_extend_and_initialize on these systems will return the following warning message to the user, and the variable will keep its default value (TRUE):

"Changing innodb_extend_and_initialize not supported on this platform. Falling back to the default."

Open items
----------
Q: How does the server ensure that the checkpoint does not advance past the LSN of the extend log record until the extend operation is complete?

There is an edge case identified in this implementation, as illustrated below:

    S1 - Call Fil_shard::space_extend
    S2 - mtr_start
    S3 - Write redo log
    S4 - mtr_commit
    S5 - fallocate()
    S6 - exit space_extend

The mtr is committed at step S4, which means the log entry has been written to the redo log. Since this entry is not specific to a page, the checkpoint may advance past this redo log record before the actual operation (S5) is complete. In such a case, the record may not be replayed during recovery, because the checkpoint has already progressed beyond it.
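The edge case in the open item can be reproduced in a toy model. The sketch below is a deliberately simplified assumption, not InnoDB's redo log: records are numbered with integer LSNs, a checkpoint may advance to the latest record because nothing ties the extend record to a dirty page, and recovery replays only records after the checkpoint. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRedoLog:
    """Toy redo log: recovery replays only records with lsn > checkpoint_lsn."""
    records: list = field(default_factory=list)
    next_lsn: int = 1
    checkpoint_lsn: int = 0

    def append(self, payload: str) -> int:
        """Append a record and return the LSN assigned to it."""
        lsn = self.next_lsn
        self.records.append((lsn, payload))
        self.next_lsn += 1
        return lsn

    def checkpoint(self) -> None:
        """Advance the checkpoint to the newest record. In this model no
        dirty page holds the checkpoint back behind the extend record."""
        self.checkpoint_lsn = self.next_lsn - 1

    def recover(self) -> list:
        """Return the payloads that recovery would replay."""
        return [p for lsn, p in self.records if lsn > self.checkpoint_lsn]


def crash_scenario(checkpoint_before_fallocate: bool) -> list:
    """Run S2-S4, optionally checkpoint, then crash before S5 completes."""
    log = ToyRedoLog()
    log.append("extend space")      # S2-S4: record written, mtr committed
    if checkpoint_before_fallocate:
        log.checkpoint()            # checkpoint slips in between S4 and S5
    # Simulated crash here: fallocate() (S5) never completed.
    return log.recover()
```

In this model, `crash_scenario(False)` still replays the extend record, while `crash_scenario(True)` replays nothing, which is the situation the open item asks how to prevent.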
Copyright (c) 2000, 2022, Oracle Corporation and/or its affiliates. All rights reserved.