WL#6035: Add native partitioning support to InnoDB

Affects: Server-Prototype Only   —   Status: Complete   —   Priority: Medium

Current partitioning support for InnoDB relies on the ha_partition handler,
which creates a new handler for each partition, which can waste resources when
having many partitions.

By supporting native partitioning within InnoDB one can use the resources better.

Related bugs:
BUG#62536: PARTITIONS USE TOO MUCH MEMORY
BUG#37252: PARTITIONING PERFORMANCE DROPS DRASTICALLY WITH HUNDREDS
OF PARTITIONS
Functional Requirement 1:
No visible change from 5.6 partitioned innodb tables to WL#6035 except for:
FR 1.1
No .par file (using internal data dictionary instead).
FR 1.2
Minor change of behavior of auto_increment (kept between FLUSH TABLE).
I.e the next auto_increment value will not be reset by FLUSH TABLE, but kept and
used after FLUSH TABLE. (i.e. if the highest value is deleted before a FLUSH
TABLE, it will not be reused again afterwards).
FR 1.3
Minor change of statistics resulting in some changed execution plans.
FR 1.3.1
Minimum number of rows estimated for a partitioned innodb table will be 1
instead of 2. (The optimizer will not create a const or system table for InnoDB,
even if it estimates the number of rows to 1, unless it is an exact PK search?).
FR 1.3.2
Minimum number of rows estimated for range read on index will be 0 per partition
instead of 1. But the minimal total number of rows for the whole table will
still be 1.
FR 1.3.3
All partitions will that are in the read set (i.e. after pruning is done) will
be included when calculating the matching rows in an index range
(records_in_range()) instead of only the biggest partitions.


Non functional requirements:

NFR 1
No degradation in performance (CPU, Memory) of:
Statements IN (SELECT, UPDATE, DELETE)
Number of partitions = 300?
Compare with mysql-trunk revid that the WL#6035 branch is merged into.
NFR 1.1
Point statements (With PK, which are required to be a part of the partitioning
expression).
NFR 1.2
RANGE statements (With the partitioning column included in the range key)
NFR 1.3
RANGE statements (without the partitioning column included in the range key)

NFR 1.4
JOINS between N partitioned tables with M partitions (n=5, M=300?)
NFR 1.5
High concurrency (normal sysbench with partitioned tables?)
NFR 1.6
Less memory usage when having a high number of partitions open, see BUG#70641 
which tests with 30 databases with 100 tables containing 100 partitions.
What needs to be added to that test is not only a CREATE TABLE for each table,
but also at least one PK point select, one secondary index range query, update
and a delete query to get better coverage of memory usage.
Main change, Instead of having a 'main' handler (pointed out by TABLE::file)
which in turn have one handler per partition where all read/writes are forwarded:

                   +-----------------+
                   |  main handler   |
                   | (ha_partition)  |
                   +-----------------+
+---------------+ +---------------+       +---------------+ 
| part1 handler | | part2 handler | ....  | partN handler |
|  ha_innobase  | |  ha_innobase  |       |  ha_innobase  |
+---------------+ +---------------+       +---------------+ 

We encapsulate all partitioning in one single handler:

+-----------------+
|     handler     |
|  (ha_innopart)  |
+-----------------+


This can be done since:
InnoDB treats every thing as a b-tree/index:
- a table is an index clustered on the primary key
  (If there are no defined primary key, it will use the
   first non-null unique index. If no such index exists,
   it will generate an unique id for each row (DB_ROW_ID)
   and use that as hidden primary key).
- each secondary index appends the part of the primary key
  columns that are not yet defined in the index.

This will not change by moving the partitioning logic into
the innodb handler.

The generic partitioning engine calculate partition id as an
own partition-aware handler and forward
create/open/read/search/write/update/close/delete operations
to the lower level non-partitioning-aware storage engine to
execute within the correct partition.
This worklog will merge that functionality with the InnoDB
handler, so it can natively support partitioning without
the need of the generic partitioning engine and having one
InnoDB handler per partition.

To decrease the risk of regression while still taking advantage
of the current InnoDB handler this will be implemented by creating
a sub-class of the ha_innobase class, named ha_innopart. This new
class will enhance its base class with partitioning support.
This also encapsulates the partitioning implementation, without
interfering with the non-partitioned InnoDB code.
Both ha_innobase and ha_innopart will have the same handlerton,
innodb_hton.

The main reason for not pushing down the partitioning logic further
down into the low-level b-tree index innodb implementation is that
the partitioning expression is based on the Item class.
Which uses the myisam row-format and normally the table->record[0]
data buffer used by the Storage Engine API (handler base class).
Code layout:
The new ha_innopart handler is declared in ha_innodb.h and
defined/implemented in the new file ha_innopart.cc, with
its ALTER TABLE specific code added to hander0alter.cc.


Server layer changes:
Only small adjustments where needed in the server layer to support
this.


Server wide partitioning code changes:
Some refactoring to move common code from ha_partition engine into
partition_info class, to increase code reuse. Like partition name
handling.
Also creating a common Partition_share class (derived from
Handler_share, created in WL#4305) in handler.{h,cc}.
To allow ha_partition and ha_innopart to share the same code
for auto increment and partition names.


InnoDB changes:
ha_innopart handlers are created from the innobase_hton if the table
is partitioned. Which makes ha_innopart also shares the same handlerton
as ha_innobase. If the table is partitioned and the table's engine type
is InnoDB, then a ha_innopart handler is created and returned, else an
ordinary non-partitioning aware ha_innobase handler is created and
returned. This allows old ha_partition based tables (.frm based) to
still be created and usable.

Since only the handlerton is known during drop and rename table,
ha_innobase also needed to handle partitions in delete_table and
rename_table. This was implemented by extending/copying
row_drop_database_for_mysql.

INPLACE alter needs to be adjusted to not update the prebuilt
variable in a different handler instance, but only set it in
the owning instance. This results in ha_partition based tables
will no-longer support INPLACE alter. But since they should
migrate to ha_innopart based tables I don't see that as a problem.

Some functions must remove the 'inline' declaration, since it will
be used outside ha_innodb.cc. And some functions need to be changed
from private to protected. And some functions needs to be virtual,
so they can be properly overloaded in ha_innopart.


Partitioning aware InnoDB handler:
A new handler class, ha_innopart, is derived from the ha_innobase class
and algorithms from ha_partition is used to support partitioning.

Explanation of read/write operations:
- write: Check partition id for new row, insert into that partition.
  Example: write_row().
- update: Check partition id for new and old row, if same partition
  update within that partition, else insert into new partition and
  delete from old partition.
  Example: update_row().
- non-sorted read: start on first used partition, read until it is
  exhausted, continue with next partition etc.
  Example rnd_init()/rnd_next() or index_init(idx, sorted = false)/
  index_read*()/index_next().
- sorted read (one partition): If only one partition, use that
  partition.
- sorted read (several partitions): In index_init(sorted=true)
  create a merge sort buffer (one full MySQL record per used partition).
  For index_read*() calls, read one row from each matching partition
  into the buffer, sort the buffer and return the first row.
  For index_next(), replace the previously returned row by reading the
  next row from the same partition, sort that new row and return the
  first row from the updated buffer/priority queue.
  To handle cases with exact search + index_next() (used in some group
  by) we also must remember which partitions returned
  HA_ERR_KEY_NOT_FOUND and retry to fill those rows in the first
  index_next() call.

Basic types of information operations:
-  Table level info - Either already in the handler or possible
   to extract the info from a single partition. Example: table_flags.
-  Partition level info - Used by I_S to get info from a specific
   partition. Example: get_dynamic_partition_info().


ha_partition:
Using the Handler_share as base class for Ha_partition_share.

This also enables GIS to be used by partitioned innodb tables, with the
restriction that it cannot be in the partitioning expression (due to its
blob nature).

Full Text Search is disabled for partitioned tables.
Foreign Keys are disabled for partitioned innodb tables.