MySQL Blog Archive
For the latest blogs go to blogs.oracle.com/mysql
The InnoDB Change Buffer

One of the challenges in storage engine design is random I/O during a write operation. In InnoDB, a table will have one clustered index and zero or more secondary indexes.  Each of these indexes is a B-tree.  When a record is inserted into a table, the record is first inserted into clustered index and then into each of the secondary indexes.  So, the resulting I/O operation will be randomly distributed across the disk.  The I/O pattern is similarly random for update and delete operations. To mitigate this problem, the InnoDB storage engine uses a special data structure called the change buffer (previously known as the insert buffer, which is while you will see ibuf and IBUF used in various internal names).

The change buffer is another B-tree, with the ability to hold the record of any secondary index.  It is also referred to as a universal tree in the source code.  There is only one change buffer within InnoDB and it is persisted in the system tablespace.  The root page of this change buffer tree is fixed at FSP_IBUF_TREE_ROOT_PAGE_NO (which is equal to 4) in the system tablespace (which has space id of 0). When the server is started, the change buffer tree is loaded by making use of this fixed page number.   You can refer to the ibuf_init_at_db_start() function for further details.

The total size of the change buffer is configurable and is designed to ensure that the complete change buffer tree can reside in main memory.   The size of the change buffer is configured using the innodb_change_buffer_max_size system variable.

Overview of Change Buffering

Change buffering is applicable only to non-unique secondary indexes (NUSI).  InnoDB buffers 3 types of operations on NUSI: insert, delete marking, and delete.  These operations are enumerated by ibuf_op_t within InnoDB:

One important point to remember is that the change buffering is leaf page oriented. A particular operation to NUSI is buffered in the change buffer only if the relevant non-root leaf page of the NUSI is not already available in the buffer pool.  This means that the buffered change is predefined to happen in a particular leaf page of a NUSI within the InnoDB system.  This makes it necessary to track the free space available in the NUSI leaf pages.  This tracking is necessary because merging these buffered operations to the NUSI leaf page must not result in a B-tree page split or B-tree page merge.

Special Change Buffer Fields

When a NUSI record is buffered in the change buffer, 4 special change buffer fields are added to the beginnning of the NUSI record.  Each of these 4 fields and their contents are explained below.  The primary key of the change buffer tree is then {space_id, page_no, count}, where the count helps to maintain the order in which the change is buffered for that particular page.  The change buffer row format has evolved over a period of time and the following table provides information for MySQL 5.5+:
[table “12” not found /]

The row format of the change buffer records themselves are always REDUNDANT.

Change Buffer Bitmap Page

The free space information for each page is tracked in predefined pages called the change buffer bitmap page, which is also known as the ibuf bitmap page.  These pages always follow the extent descriptor pages.  The following table gives the predefined page numbers of the change buffer bitmap pages:

Page Size in Kilobytes (KB) Extent Size in Pages Extent Descriptor Page Numbers Change Buffer Bitmap Pages (ibuf bitmap pages) Number of pages described by one ibuf bitmap page
4 256 0, 4096, 8192, 12288, ... 1, 4097, 8193, 12289, ... 4096
8 128 0, 8192,16384, 24576, ... 1, 8193, 16385, 24577, ... 8192
16 64 0, 16384, 32768, 49152, ... 1, 16385, 32769, 49153, ... 16384
32 64 0, 32768, 65536, ... 1, 32769, 65537, ... 32768
64 64 0, 65536, 131072, ... 1, 65537, 131073, ... 65536

The page number 1 is also referred to as the FSP_IBUF_BITMAP_OFFSET. These change buffer bitmap pages help to answer the following questions quickly:

  • Does the given page have any buffered changes in the ibuf (insert/change buffer) tree?  This question will be asked when a page is read into the buffer pool.  The buffered changes will be merged to the actual page before putting it into the buffer pool.
  • Does the given page have enough free space so that a change can be buffered? This question will be asked when we want to modify a leaf page of NUSI and it is not already available in the buffer pool.

In the next section we will see at the information stored in the ibuf bitmap page which helps to answer above questions.

Information Stored in Change Buffer Bitmap Page

The change buffer bitmap page uses 4 bits (IBUF_BITS_PER_PAGE) to describe each page. It contains an array of such 4 bits describing each of the pages.  This whole array is called the “ibuf bitmap” (insert/change buffer bitmap).  This array begins after the page header at an offset equal to IBUF_BITMAP (which is equal to 94). Given a page number, the ibuf bitmap page that contains the 4 bit information on the given page can be calculated as: ulint bitmap_page_no = FSP_IBUF_BITMAP_OFFSET + ((page_no / page_size) * page_size); .

You can refer to the ibuf_bitmap_page_no_calc() function for more details on the complete calculation.  Likewise, given a page number, the offset within the change buffer bitmap page contains the 4 bits that can be used easily for the calculations.  I leave this as an exercise to the reader (refer to function ibuf_bitmap_page_get_bits_low() for further info). The following table provides details about these 4 bits:

Bit Value Description
IBUF_BITMAP_FREE 0 The first two bits are used to represent the free space available in the leaf page of the NUSI.
IBUF_BITMAP_BUFFERED 2 The third bit if set means that the leaf page has buffered entries in the change buffer.
IBUF_BITMAP_IBUF 3 The fourth bit if set means that this page is part of the change buffer.

This means that only 2 bits are available to store the free space information of the page.  There are only 4 possible values: 0, 1, 2, 3.  Using these 2 bits, we try to encode the free space information for a page.  The rule is as follows — there must be at least UNIV_PAGE_SIZE / IBUF_PAGE_SIZE_PER_FREE_SPACE bytes of free space for the change buffer to be used:

Tracking Free Space of Pages

Before an insert operation (IBUF_OP_INSERT) is buffered, the free space available in the target NUSI leaf page is approximately calculated using the information available in the change buffer bitmap page. This conversion is done in the ibuf_index_page_calc_free_from_bits() function and the formula used is:

The following table provides the conversions done from the encoded value found within the change buffer bitmap page to a meaningful value in bytes:

Page Size in Kilo Bytes (KB) Free Space Information from IBUF bitmap page
(ibuf_code)
The free space available in NUSI leaf page (approximately assumed)
4 0 0 bytes
1 128 bytes
2 256 bytes
3 512 bytes
16 0 0 bytes
1 512 bytes
2 1024 bytes
3 2048 bytes

Using this information, we can determine if the record to be buffered will fit into the page or not.  If there is enough space then the insert will be buffered.  Using this approach, we ensure that merging these records to the target NUSI will not result in a page split.

Updating Free Space Information

After buffering an insert or delete operation, the free space information in the change buffer bitmap page must be updated accordingly (a delete mark operation will not change the free space information).  To update the free space information we need to convert the free space in bytes back to the IBUF encoded value.  This is done in the ibuf_index_page_calc_free_bits() function using the following formula:

In the above formula, max_ins_size is the maximum insert size (maximum free space) available in the page after page re-organization.

Record Count in NUSI Leaf Pages

In the case of a purge operation  (IBUF_OP_DELETE),  we need to ensure that the number of records in the target NUSI leaf page doesn’t go to zero because in InnoDB, the leaf pages of B-tree are not allowed to become empty.  They must have at least 1 record. Since the number of records in the target NUSI page is unknown (because it is not loaded into the buffer pool yet), the buffered insert operations are taken into account.  If 1 insert operation is buffered then it is assumed that the target NUSI page has 1 record, and if 2 insert operations are buffered then it is assumed that the target NUSI page has 2 records and so on.  In this way the number of records in the target NUSI page is calculated.  Based on this calculated record count, a purge operation is either buffered or not.  This means that if there are no buffered insert or delete-mark operations, then no purge operations can be buffered.

Merging the Buffered Changes to Target NUSI Page

The changes to NUSI leaf pages are buffered in the change buffer if the NUSI leaf page is not already in the buffer pool. These buffered operations are merged back into the actual NUSI leaf page under various circumstances:

  1. When these NUSI leaf pages are subsequently read into the buffer pool, the buffered operations will be merged.  A NUSI leaf page could be read into the buffer pool during an index lookup, an index scan or because of read ahead.
  2. When the master thread of InnoDB periodically does change buffer merges by calling ibuf_merge_in_background().
  3. When there are too many operations buffered for a particular NUSI leaf page.
  4. When the change buffer tree reaches its maximum allowed size.

The change buffer merge operation is initiated by calling the ibuf_merge_in_background() or ibuf_contract() functions. The change buffer merges are done either in the foreground or in the background. The foreground change buffer merges are done as part of a DML operation and hence will affect the performance experienced by the end user.  The background change buffer merges are instead done periodically when there is less activity in the server.

We ensure that the change buffer merge does not result in a B-tree page split or page merge operation. It also shouldn’t result in an empty leaf page.  Before the target NUSI leaf page are placed into the buffer pool, the buffered changes are applied to them. Once the buffered changes are merged for a page, its associated 4 bits of information in the change buffer bitmap page are also updated.

Conclusion

This article provided an overview of the change buffer subsystem of InnoDB.  It explained the additional fields that are added to a secondary index record before storing it within the change buffer.  It provided information about how the change buffer keeps track of free space information of the NUSI leaf pages by making use of the change buffer bitmap page.  It also explained the need to calculate the number of records in the NUSI leaf pages so that it doesn’t become empty.

Thanks to Marko Makela for reviewing this article and helping to make it more accurate. If you have any questions, please feel free to post a comment below.

That’s all for now. THANK YOU for using MySQL!