WL#6035: Add native partitioning support to InnoDB
Affects: Server-Prototype Only
—
Status: Complete
Current partitioning support for InnoDB relies on the ha_partition handler, which creates a new handler for each partition, which can waste resources when having many partitions. By supporting native partitioning within InnoDB one can use the resources better. Related bugs: BUG#62536: PARTITIONS USE TOO MUCH MEMORY BUG#37252: PARTITIONING PERFORMANCE DROPS DRASTICALLY WITH HUNDREDS OF PARTITIONS
Functional Requirement 1: No visible change from 5.6 partitioned innodb tables to WL#6035 except for: FR 1.1 No .par file (using internal data dictionary instead). FR 1.2 Minor change of behavior of auto_increment (kept between FLUSH TABLE). I.e the next auto_increment value will not be reset by FLUSH TABLE, but kept and used after FLUSH TABLE. (i.e. if the highest value is deleted before a FLUSH TABLE, it will not be reused again afterwards). FR 1.3 Minor change of statistics resulting in some changed execution plans. FR 1.3.1 Minimum number of rows estimated for a partitioned innodb table will be 1 instead of 2. (The optimizer will not create a const or system table for InnoDB, even if it estimates the number of rows to 1, unless it is an exact PK search?). FR 1.3.2 Minimum number of rows estimated for range read on index will be 0 per partition instead of 1. But the minimal total number of rows for the whole table will still be 1. FR 1.3.3 All partitions will that are in the read set (i.e. after pruning is done) will be included when calculating the matching rows in an index range (records_in_range()) instead of only the biggest partitions. Non functional requirements: NFR 1 No degradation in performance (CPU, Memory) of: Statements IN (SELECT, UPDATE, DELETE) Number of partitions = 300? Compare with mysql-trunk revid that the WL#6035 branch is merged into. NFR 1.1 Point statements (With PK, which are required to be a part of the partitioning expression). NFR 1.2 RANGE statements (With the partitioning column included in the range key) NFR 1.3 RANGE statements (without the partitioning column included in the range key) NFR 1.4 JOINS between N partitioned tables with M partitions (n=5, M=300?) NFR 1.5 High concurrency (normal sysbench with partitioned tables?) NFR 1.6 Less memory usage when having a high number of partitions open, see BUG#70641 which tests with 30 databases with 100 tables containing 100 partitions. What needs to be added to that test is not only a CREATE TABLE for each table, but also at least one PK point select, one secondary index range query, update and a delete query to get better coverage of memory usage.
Main change, Instead of having a 'main' handler (pointed out by TABLE::file) which in turn have one handler per partition where all read/writes are forwarded: +-----------------+ | main handler | | (ha_partition) | +-----------------+ +---------------+ +---------------+ +---------------+ | part1 handler | | part2 handler | .... | partN handler | | ha_innobase | | ha_innobase | | ha_innobase | +---------------+ +---------------+ +---------------+ We encapsulate all partitioning in one single handler: +-----------------+ | handler | | (ha_innopart) | +-----------------+ This can be done since: InnoDB treats every thing as a b-tree/index: - a table is an index clustered on the primary key (If there are no defined primary key, it will use the first non-null unique index. If no such index exists, it will generate an unique id for each row (DB_ROW_ID) and use that as hidden primary key). - each secondary index appends the part of the primary key columns that are not yet defined in the index. This will not change by moving the partitioning logic into the innodb handler. The generic partitioning engine calculate partition id as an own partition-aware handler and forward create/open/read/search/write/update/close/delete operations to the lower level non-partitioning-aware storage engine to execute within the correct partition. This worklog will merge that functionality with the InnoDB handler, so it can natively support partitioning without the need of the generic partitioning engine and having one InnoDB handler per partition. To decrease the risk of regression while still taking advantage of the current InnoDB handler this will be implemented by creating a sub-class of the ha_innobase class, named ha_innopart. This new class will enhance its base class with partitioning support. This also encapsulates the partitioning implementation, without interfering with the non-partitioned InnoDB code. Both ha_innobase and ha_innopart will have the same handlerton, innodb_hton. The main reason for not pushing down the partitioning logic further down into the low-level b-tree index innodb implementation is that the partitioning expression is based on the Item class. Which uses the myisam row-format and normally the table->record[0] data buffer used by the Storage Engine API (handler base class).
Code layout: The new ha_innopart handler is declared in ha_innodb.h and defined/implemented in the new file ha_innopart.cc, with its ALTER TABLE specific code added to hander0alter.cc. Server layer changes: Only small adjustments where needed in the server layer to support this. Server wide partitioning code changes: Some refactoring to move common code from ha_partition engine into partition_info class, to increase code reuse. Like partition name handling. Also creating a common Partition_share class (derived from Handler_share, created in WL#4305) in handler.{h,cc}. To allow ha_partition and ha_innopart to share the same code for auto increment and partition names. InnoDB changes: ha_innopart handlers are created from the innobase_hton if the table is partitioned. Which makes ha_innopart also shares the same handlerton as ha_innobase. If the table is partitioned and the table's engine type is InnoDB, then a ha_innopart handler is created and returned, else an ordinary non-partitioning aware ha_innobase handler is created and returned. This allows old ha_partition based tables (.frm based) to still be created and usable. Since only the handlerton is known during drop and rename table, ha_innobase also needed to handle partitions in delete_table and rename_table. This was implemented by extending/copying row_drop_database_for_mysql. INPLACE alter needs to be adjusted to not update the prebuilt variable in a different handler instance, but only set it in the owning instance. This results in ha_partition based tables will no-longer support INPLACE alter. But since they should migrate to ha_innopart based tables I don't see that as a problem. Some functions must remove the 'inline' declaration, since it will be used outside ha_innodb.cc. And some functions need to be changed from private to protected. And some functions needs to be virtual, so they can be properly overloaded in ha_innopart. Partitioning aware InnoDB handler: A new handler class, ha_innopart, is derived from the ha_innobase class and algorithms from ha_partition is used to support partitioning. Explanation of read/write operations: - write: Check partition id for new row, insert into that partition. Example: write_row(). - update: Check partition id for new and old row, if same partition update within that partition, else insert into new partition and delete from old partition. Example: update_row(). - non-sorted read: start on first used partition, read until it is exhausted, continue with next partition etc. Example rnd_init()/rnd_next() or index_init(idx, sorted = false)/ index_read*()/index_next(). - sorted read (one partition): If only one partition, use that partition. - sorted read (several partitions): In index_init(sorted=true) create a merge sort buffer (one full MySQL record per used partition). For index_read*() calls, read one row from each matching partition into the buffer, sort the buffer and return the first row. For index_next(), replace the previously returned row by reading the next row from the same partition, sort that new row and return the first row from the updated buffer/priority queue. To handle cases with exact search + index_next() (used in some group by) we also must remember which partitions returned HA_ERR_KEY_NOT_FOUND and retry to fill those rows in the first index_next() call. Basic types of information operations: - Table level info - Either already in the handler or possible to extract the info from a single partition. Example: table_flags. - Partition level info - Used by I_S to get info from a specific partition. Example: get_dynamic_partition_info(). ha_partition: Using the Handler_share as base class for Ha_partition_share. This also enables GIS to be used by partitioned innodb tables, with the restriction that it cannot be in the partitioning expression (due to its blob nature). Full Text Search is disabled for partitioned tables. Foreign Keys are disabled for partitioned innodb tables.
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.