WL#9559: InnoDB_New_DD: Support in-place ALTER PARTITION

Affects: Server-8.0   —   Status: Complete   —   Priority: Medium

InnoDB now supports the native partitioning which is supposed to rely on
ha_innopart handler. However, for some DDL operations like 'ALTER TABLE ...
PARTITION ...' are not implemented in this way, instead, they are done by
fast_alter_partition_table().

This worklog will support the DDLs mentioned above by in-place APIs of
ha_innopart, not to depend on fast_alter_partition_table() any more.
FR1. This worklog should implement these operations by native partitioning
1) ADD PARTITION
2) DROP PARTITION(RANGE/LIST PARITTION only)
3) COALESCE PARTITION(HASH/KEY PARTITION only)
4) REORGANIZE PARTITION
5) REBUILD PARTITION
6) EXCHANGE PARTITION

After this worklog, the native partitioning will not depend on
fast_alter_partition_table().
'EXCHANGE PARITITION' will have its own handler API 
calledha_innopart::exchange_partition_low().


FR2. These following operations would be kept as is, not in this worklog's scope
1) PARTITION BY
2) REMOVE PARTITIONING
3) TRUNCATE PARTITION ... (ALL)
4) OPTIMIZE PARTITION
5) REPAIR PARTITION
6) ANALYZE PARTITION
7) CHECK PARTITION
8) DISCARD PARTITION
9) IMPORT PARTITION

FR3. This worklog can't guarantee the crash-safe DDL which will be done as
separate work.

FR4. As we know, for statements like 'ALTER TABLE ... PARTITION ...', it was
impossible to specify the 'ALGORITHM=', and the 'LOCK='. So there was no
COPY way for this kind of statements. 

Let's take COALESCE PARTITION for example, there are 2 options:

Option 1. Create a brand new partitioned table with less partitions,
and copy all the data from old table to this new one, and finally do rename,
no handler::*inplace_alter_table() APis involved.

Option 2. Through handler::*inplace_alter_table() APIs, create the new
partitions of the table internally, and copy the data inside the SE.

Only one exception here is EXCHANGE PARTITION, which already supports
'ALGORITHM=COPY/INPLACE, LOCK=*', and this would be kept after this worklog.

FR5. According to FR4, some error messages will be deprecated since option 1
is supported. Before this worklog, if we specify ALGORITHM=X or LOCK=X would
result in an error like:

ALTER TABLE t1 ALGORITHM=COPY, ADD PARTITION
ALGORITHM=COPY/INPLACE is not supported. Reason: Partition specific operations
do not yet support LOCK/ALGORITHM. Try ALGORITHM=DEFAULT.

ALTER TABLE t1 ALGORITHM=INPLACE, LOCK=NONE, ADD PARTITION
LOCK=NONE/SHARED/EXCLUSIVE is not supported. Reason: Partition specific
operations do not yet support LOCK/ALGORITHM. Try LOCK=DEFAULT.

After this worklog, there won't be these errors. That is, following combinations
are supported:
1. ALGORITHM=COPY, LOCK=DEFAULT/SHARED/LOCK.
2. ALGORITHM=INPLACE, LOCK=DEFAULT/SHARED/LOCK, this is for operations which
require copying data between partitions, like 'REORGANIZE PARTITION',
'ADD PARTITION' for HASH and KEY partitions, 'COALESCE PARTITION' and
'REBUILD PARTITION'.
3. ALGORITHM=INPLACE, LOCK=NONE, this is for 'ADD PARTITION' for RANGE and LIST
partitions, and 'DROP PARTITION'.

Any invalid combination would result in the same error messages as common
ALTER TABLE operations. For example, ALGORITHM=COPY, LOCK=NONE is not allowed,
and 'LOCK=NONE is not supported. Reason: COPY algorithm requires a lock.
Try LOCK=SHARED.' would be issued too.


FR6. During the altering, the table is MDL locked, so there would be no
concurrent DMLs. Only exceptions are for 'ADD PARTITION' for RANGE and LIST
partitions and 'DROP PARTITION', which do nothing during inplace_alter_table(),
so no need to block concurrent DMLs.


FR7. The autoinc behavior in this worklog would be:
1) The next autoinc counter should not be changed after the DDLs of 1)-5)
in FR1
2) Regarding to 6) in FR1, the next counter of the partitioned table would be
reset to the to be swapped table's counter if the latter is bigger; the next
counter of the to be swapped table would be reset to the to be swapped
partition's counter if the latter is bigger. This promises the counter would
not be re-used.
This worklog will focus on these operations:
1. ADD PARTITION
2. DROP PARTITION(RANGE/LIST)
3. COALESCE PARTITION(HASH/KEY)
4. REORGANIZE PARTITION
5. REBUILD PARTITION
6. EXCHANGE PARTITION


Requisite from server
---------------------

These ALTER PARTITION operations now should be done by the native partitioning
APIs, to make InnoDB partitioning a true native partitioning. However, some
preparation work should still be done by Server, such as:
1. There should be a way to know that the native partitioning can support
inplace alter partition(By introducing a new flag in server)
2. New APIs shall be modified/added for EXCHANGE PARTITION, which is
Partition_helper::exchange_partition_low()
3. REBUILD shall be done through the APIs
4. Server shall let the inplace alter APIs know what do the APIs should do
to every old and new partition. There are some partition states defined
already, the inplace alter APIs shall be able to get these for all partitions.
5. Server shall provide the inplace alter APIs a function as what it already
does in server layer(get_partition_id) to know which partition will a specified
row be inserted to.


Restrictions
------------

In fact, this is the same as it is before this worklog:

1. Only a single PARTITION BY, ADD PARTITION, DROP PARTITION, REORGANIZE
PARTITION, COALESCE PARTITION or EXCHANGE PARTITION clause can be used in a
given ALTER TABLE statement.
2. It is possible for an ALTER TABLE statement to contain a PARTITION BY
or REMOVE PARTITIONING clause in an addition to other alter specifications,
however, this should go like ALGORITHM = COPY, instead of INPLACE


APIs
----

Operations of ADD/DROP/COALESCE/REORGANIZE/REBUILD will be done by
ha_innopart::*inplace_alter_table()
Operation of EXCHANGE PARTITION will be done by
ha_innopart::exchange_partition_low inherited from 
Partition_helper::exchange_partition_low()



ALTER PARTITION
---------------

The partition_state for every partition can be got from the partition_elements
passed to ha_innopart::*inplace_alter_table() APIs, or else, the used
partitions should be got in ha_innopart::*partition_low() APIs.

Notes: Currently, the DROP work would simply drop the partition table, without
considering the rollback. The rollback would be implemented as separate work.

Notes: It's possible that the to be created (sub)partition has the same name
with some existing one. In this case, if there is no conflict with data file,
it should be fine to create the new partition with the new(same) name.
For example, if old partition is in tablespace 'innodb_system' and the new
one will reside in innodb_file_per_table. However, conflict of in-memory
table objects still exist in this scenario. So to make things easy for now,
as long as the name gets no change, a temporary table name/data file name
should be used, and finally rename it.

1. ADD PARTITION(RANGE/LIST)

PREPARE: newly introduced partitions should be created, with either the final
table name or a temporary table name
ALTER: do nothing
COMMIT: If a temporary table name was used, do rename, which should succeed
because the old partition should have been dropped
ROLLBACK: do nothing


2. DROP PARTITION(RANGE/LIST)

PREPARE: Update internal structures for new table
ALTER: do nothing
COMMIT: drop the partitions
ROLLBACK: do nothing


3. ADD PARTITION(HASH/KEY)

PREPARE: newly introduced partitions should be created, including the to
be added partitions N(with the final data file names) and the to be changed
ones T(with temporary data file names); T is necessary since all the data
would be re-arranged, so old partitions can't be used directly

ALTER: copy data from all necessary old partitions(specified by server) of
old table to the partitions of the new table according to the partitioning
rule; server will provide the API for native partitioning to know which
partition every row in the old table belongs to in new table

COMMIT: for those to be changed partitions, the old partitions O have to be
renamed to some other temporary ones; then every new partitions of T should be
renamed to its according one of O; no work on partitions of N; finally,
all partitions of renamed O should be dropped

ROLLBACK: do nothing for now


4. COALESCE PARTITION

This works similarly to above 3. Only exception is that there is no partitions
of N. And renaming of O may not affect all the old partitions, only the
leaving partitions should be renamed


5. REORGANIZE PARTITION

PREPARE: The reorganized into partitions should be created, with either
final name or a temporary table name

ALTER: copy data from all necessary old partitions(specified by server) of
old table to the partitions of the new table according to the partitioning
rule; this is the same as 3

COMMIT: drop the to be reorganized partitions from old table; also make sure
that drop should be done before possible rename for reorganized into partitions

ROLLBACK: do nothing for now


Example:

Supposing we have a table like:

CREATE TABLE tr (
  id INT NOT NULL AUTO_INCREMENT,
  name VARCHAR(50), purchased DATE, KEY(id))
PARTITION BY RANGE(YEAR(purchased)) (
PARTITION p0 VALUES LESS THAN (1990),
PARTITION p1 VALUES LESS THAN (1995),
PARTITION p2 VALUES LESS THAN (2000),
PARTITION p3 VALUES LESS THAN (2005)
);

Let's take REORGANIZE PARTITION for example:

ALTER TABLE tr REORGANIZE PARTITION p2, p3 INTO (
PARTITION p21 VALUES LESS THAN(1997),
PARTITION p22 VALUES LESS THAN(2000),
PARTITION p31 VALUES LESS THAN(2002),
PARTITION p32 VALUES LESS THAN(2005));

In this example, the table contains 4 partitions p0-p3 originally, and
partition p2 and p3 will be reorganized into 4 new partitions. So there will
be 6 partitions after reorganization: p0, p1, p21, p22, p31, p32.

PREPARE: new internal structure for p0 and p1 will also point to the data file
of p0 and p1; p21, p22, p31 and p32 will be created in this step
ALTER: data will be copied from original p2 and p3 to p21, p22, p31 and p32.
COMMIT: old partitions of p2 and p3 will be dropped; all internal updates will
be committed
ROLLBACK: do nothing for now

Also note that ADD PARTITION(HASH/KEY) and COALESCE PARTITION will always
create all partitions with temporary names, and copy data from old partitions
to new ones.
For example:
Assume the table tr is partitioned by hash with 4 partitions, and we issue
'ALTER TABLE tr ADD PARTITION PARTITIONS 2;'
New partitions with temporary name will be created like p0#TMP, p1#TMP ...
p5#TMP. Data from old p0 to p3 will be copied to new partitions.
During COMMIT, p0 to p3 will be renamed to some other temporary table names, 
like #sqlxxxx#p0..p3. Then p0#TMP...p5#TMP would be renamed into p0...p5.
Finally, old partitions #sqlxxxx#p0..p3 can be dropped.

This is similar to COALESCE PARTITION.


6. REBUILD PARTITION

This should work in a similar way to above 3/4/5. Only difference is data are
always copied from the old partitions to the same new partitions.


7. EXCHANGE PARTITION

In ha_innopart::exchange_partition_low(), the to be exchanged partition and
the table can be got. The structures of the partition and table shall be the
same.
We have to exchange the se_private_data/se_private_id/options, etc. so that
the partition will point to the data of table while the table will point to
the data of the partition. Also, since both partition and table's tablespaces
are of implicit(innodb_file_per_table), data files have to be renamed to each
other.



Locking
-------

Currently, all these operations are done with an upgradable shared metadata
lock which blocks all attempts to update table data, allowing reads.
Furthermore, during the change of partitioning, exclusive lock is required to
install the new partitioning. This should still be held in this worklog.
See FR6 too.


Others
------

Since we now support these operations through the native partitioning APIs,
we can remove some invalid code like ha_innopart::*_new_partition(), etc.
which should not be called any more.