WL#5756: InnoDB: Support Page Sizes 4k and 8k
Status: Complete — Priority: Medium
This involves adding a new startup parameter innodb-page-size that can be set to either 4k, 8k, or 16k. This must be set before any InnoDB tablespaces are created, before bootstrap, so that all InnoDB tablespaces have the same page size. Supporting smaller page sizes without recompiling the server has been asked for years by MySQL users, including Facebook (http://mysqlha.blogspot.com/2009/06/alternate-page-sizes-in-innodb.html). Theoretically, SSD could benefit from smaller page sizes. But tuning SSD (with smaller page sizes) should not be part of this worklog.
The tablespace page size must be stored in the tablespace header page so that the value of innodb_page_size can be corrected if the tablepaces already exist. In the current file version, the compressed page size of a tablespace is stored in the FSP_SPACE_FLAGS field of the tablespace header along with other flags that can determine the table row type for file-per-tablespace. The physical tablespace page size can be added to these flags, but once an expanded flags is written to a tablespace, older engines will not be able to open that tablespace. In order to prepare for this, the InnoDB code needs to be refactored in the way that it handles table flags. There needs to be a distinction between table flags and filespace flags. The bit manipulations also need to be isolated to inline functions defined separately for table and tablespace flags. See WL#5873. In order to know if an existing file-per-table tablespace containing a compressed table can be opened by an engine using a fixed UNIV_PAGE_SIZE, we will need to know the uncompressed page size. Once the tablespace flags are handled independently from the table flags, the Logical-Uncompressed Page Size of the tablespace can be added to the tablespace flags. In the future there will be more than one table per tablespace and we plan to support multiple uncompressed page sizes at some point. See WL#5628. When we can get more than one table per tablespace the physical page size will be the same for all tables in that tablespace. So if there are multiple compressed tables, they MUST all have the same KEY_BLOCK_SIZE or compressed page size. At that point, we can continue to rely on the ZIP_SIZE flags currently in bit positions 2-5 in FSP_SPACE_FLAGS to determine the physical page size because they will refer to the tablespace, not just one table in it. There are ~450 occurrences of UNIV_PAGE_SIZE. For this worklog, UNIV_PAGE_SIZE can be set to a global variable instead of the predefined constant. Then there are many places where UNIV_PAGE_SIZE is used in a precompiled macro which need to be replaced. Often, UNIV_PAGE_SIZE can be replaced by UNIV_PAGE_SIZE_MAX or UNIV_PAGE_SIZE_DEF which are still constants. Debugging should look into any problems with the length of records found in non-index pages such as the problem found in BUG#54924. This task may also show problems with index key lengths when tests are run with smaller page sizes. One such problem is Bug #13336585-CHANGE BUFFERING WITH 4K PAGES CAN ASSERT IF SECONDARY KEY IS NEAR MAX. The maximum key length allowed by MySQL is 3072 which is less than 1/4th of the free space on a 16k page. The worst case for a large key length is where both the the primary and secondary key is the max length. Since InnoDB must have at least 2 records per index page, the maximum key length must be less than 1/4th of the free space on a page. InnoDB can calculate the free space on a page and the record overhead, but it needs to know the table metadata first. MySQL asks each engine what the maximum key length is before the CREATE TABLE statement. Since InnoDB had the advantage of a maximum key length that will easily fit 4 to a 16k page, this worklog will implement similar limits on 8k and 4k page sizes. The maximum key length for 16k pages is 3072 bytes, so the limit for 8k pages will be 1536 bytes and the limit for 4k pages will be 768 bytes. These limits easily allow record data plus record overhead to be less than 1/4th of the free space on any page. A node pointer record in a non-unique secondary index has the most overhead, because it will store tuples (sec_col1,sec_col2,...,sec_colN,pk_col1,pk_col2,...,pk_colM,child_page_no). A unique index would only store (sec_col1,sec_col2,...,sec_colM,child_page_no). The maximum number of columns is N+M=16+16 when none of the columns of the secondary index belong to the primary key. The maximum payload length is 3072+3072 bytes (3072 for the secondary index and 3072 for the primary key). The maximum record header size is 2*(N+M)+6 bytes in ROW_FORMAT=COMPACT and one byte less in other formats. The 1/4 is a bit inaccurate (it is a sufficient condition, but sometimes unnecessarily strict). What InnoDB assumes is that each node pointer page must have space for at least two node pointer records. A future effort may be to increase these limits by enforcing them in CREATE TABLE if customers need this. But it is generally not wise to make indexes where the key is so long that there is only 2 or 3 records per page because it increases the depth of the Btree.
Copyright (c) 2000, 2015, Oracle Corporation and/or its affiliates. All rights reserved.