WL#11250: Support Instant Add Column

Affects: Server-8.0   —   Status: Complete

Porting Tecent's patch on Instant Add Column.

To summarize the idea:
1. It introduced a new info bits on the record to signify the row has new rows
added instantly

2. If the bit is set, then there will be one or two additional bytes in the row
header to signify the extra columns added

3. A new system table is added to record the "default value" for each added 
column 

4. The rows before instant add column would always have the old number of
columns for its life time (until deleted). Such number of "core" rows will be
stored in se_private_data of the mysql.tables

5. For "old" rows, the default value will be looked up from the new system
tables and appended before return to server.

6. For the duraton of instant adding column, MDL will be placed and maintained
on the table.

In this way, we can instantly add a column with full back version compatibility
with minimum code change.

Some restrictions of this approach:

1. Columns can only be added last, it cannot be added in between existing columns

2. Instantly added columns cannot be PK, unless a table rebuild

3. Instant add column does not and will not support compressed table format

FR1: To support this instant ADD COLUMN, there is one syntax change for ALTER 
TABLE. One option for the ALGORITHM will be introduced, called INSTANT:
ALTER TABLE ... ALGORITHM = INSTANT;

This INSTANT key word will support instant ADD COLUMN in this worklog, along 
with some no-op ALTER TABLE(no change in the table). If user specifies INSTANT 
for other ALTER TABLE(s) than the previous ones, server will return an error 
like HA_ALTER_INPLACE_NOT_SUPPORTED to user without doing anything to the table.

Furthermore, the ALGORITHM = INSTANT would not work with LOCK clause, because 
this is meaningless. So if INSTANT is specified, and LOCK=NONE/SHARED/EXCLUSIVE 
is also specified, ER_WRONG_USAGE would be issued. However LOCK=DEFAULT is fine.

FR2: Support Instant ALTER TABLE ADD COLUMN
FR2.0: This only involves ADD COLUMN, and adding new columns at last(opposite to 
our existing key word FIRST)
FR2.1: Adding columns instantly should work automatically when 
ALGORITHM=DEFAULT/INSTANT
FR2.2: Support adding multiple new columns in one ALTER TABLE clause
FR2.3: Support setting DEFAULT value for new column(s) during instant ADD COLUMN
FR2.4: Support adding virtual columns along with regular columns in one clause
FR2.5: Support ROW_FORMAT=DYNAMIC/COMPACT/REDUNDANT

For example, for a table t1, following should be supported as instant ADD 
COLUMN, if ALGORITHM is INSTANT:
a. ALTER TABLE t1 ADD COLUMN ... [,ALGORITHM=DEFAULT/INSTANT];
b. ALTER TABLE t1 ADD COLUMN ..., ADD COLUMN ..., ...;
c. ALTER TABLE t1 ADD COLUMN ... DEFAULT, ADD COLUMN ... DEFAULT, ...;
d. ALTER TABLE t1 ADD COLUMN ..., ADD COLUMN ... VIRTUAL, ADD COLUMN ...;
e. ALTER TABLE t1 ADD COLUMN ... AFTER (last column) ...;

But if other operations or ALGORITHM are involved, then the behavior for them 
are not changed. For example, following are not supported:
f. ALTER TABLE t1 ADD COLUMN ..., ALGORITHM = INPLACE/COPY;
g. ALTER TABLE t1 ADD COLUMN ..., ADD KEY ..., DROP KEY ...;
h. ALTER TABLE t1 ADD COLUMN ..., DROP COLUMN ...;
i. ALTER TABLE t1 ADD COLUMN FIRST[AFTER ...(but not last column)]...;
j. ALTER TABLE t1 ADD COLUMN ..., FORCE;
k. etc.

FR3: Query of tables with added columns should return correct value

FR4: DMLs on table with newly added columns should work

FR5: Both crash recovery and rollback of the tables with added columns should 
work without problems

FR6: Adding and cascading update/delete of Foreign Keys on newly added columns 
instantly should work

FR7: IMPORT/EXPORT on a table with instantly added columns should work. There is 
also one change for generic IMPORT that after the IMPORT, the clustered index of 
the table would be checked, if the data is corrupted, the IMPORT would fail.

FR8: There should be some relevant information in I_S.innodb_tables and 
I_S.innodb_columns for user to check.
FR8.1 I_S.innodb_tables.instant_cols, which means the number of columns before 
first instant ADD COLUMN. Please note that this instant ADD COLUMN means stored 
columns, adding a virtual column doesn't introduce this number.
FR8.2 I_S.innodb_columns.has_default/default_value, once an instant ADD COLUMN 
is done, the default value of the columns added instantly would be stored here. 
Please note that future update of the default value of these columns won't 
change the values here.

FR9: There should be no requirement for data file upgrade

FR10: Tables with fulltext index are not supported by this worklog

FR11: Temporary tables are always working in ALGORITHM=COPY way, so it's not 
included in this worklog

FR12: All tables in DD tablespace are not supported by this worklog, since there 
is little chance that user will change these tables or create tables in this 
tablespace

FR13: After instant ADD COLUMN, internal record formats could be different, 
since some are not complete. However, after some kind of table rebuild or 
truncation, the record formats should become complete. Theses operations 
include:
a) All ALTER TABLE which require table rebuild, like DROP COLUMN, etc.
b) TRUNCATE TABLE
c) If it's partitioned table, then all new partitions(ALTER TABLE ... PARTITION) 
and all truncated partitions
d) OPTIMIZE TABLE, etc.

FR14: For EXCHANGE PARTITION, to simplify the logic, if either the partition or 
the table to be swapped is instant, then the operation would be rejected with 
error ER_PARTITION_EXCHANGE_DIFFERENT_OPTION.

FR15: There are some operations which may need to check data during altering, 
for example: adding a new geometry column defined as NOT NULL, etc. Because 
currently the data checking has to be done during the altering(copying data), so 
it's not possible to have them as INSTNAT.
0. Definitions
==============

COLUMN: This is used when we say ADD COLUMN in SQL.

FIELD: This is used when we say a column in a row. That is for every single row, 
there is a physical record in clustered index, and a column in the row maps to a 
field in the record in our code.

instant ADD COLUMN: This means all kinds of ADD COLUMN operations supported by 
this worklog which can be done instantly.


1. General
==========

1.1. Problems of current ADD COLUMN
-----------------------------------

Currently, InnoDB supports two types of ALTER TABLE ADD COLUMN, no matter what 
kind of columns and how many columns are added in one clause. There are 
ALGORITHM = INPLACE and ALGORITHM = COPY. Although it's called INPLACE algorithm 
which is online DDL indeed, however, a table rebuild is required to finish it. 
The reason is that the internal record format is not flexible enough for new 
fields(columns). For example, if it's a new style record, length of variable-
length field and SQL-null flags would be stored in the record header. So to add 
every new column, the record header may change too, which applies to all exiting 
rows. So it looks like a simplest way is not reorganize every single record of 
the table, but read the old records and add the new fields, then insert them 
into a new table.

Users have a strong request that some DDL like current ADD COLUMN should be done 
in a very short duration since it looks like only a metadata change. In this 
way, they don't need to wait a long time for the table coming to a final shape 
and lots of IO can be saved.

1.2. Tencent's contribution
---------------------------

To fix this issue, a team from Tencent company figured out a solution to add 
columns instantly, without modifying records of the table. The rough idea has 
been already mentioned in the HLD section.

The benefits of their contribution are:
a) ALTER TABLE .. ADD COLUMN can be done instantly, without too much IO and 
waiting time
b) Row format is compatible with existing data files


2. Operations which can be instant
==================================

As mentioned in FR2, this worklog mainly supports the instant ADD COLUMN 
operations. Multiple columns, either regular columns or virtual columns, with or 
without default values are all supported.

Refer to FR2 to see which kind of SQL is supported and which is not.

2.1 ALGORITHM and LOCK
----------------------

Basically, there are INPLACE and COPY algorithm, both of which would be kept as 
is. Only when user specifies the INSTANT algorithm, current ADD COLUMN could be 
done  instantly with only some metadata changes, rather than an internal table 
rebuilding. So after this worklog, as long as the algorithm is ignored, or 
specified as DEFAULT/INSTANT, InnoDB will try an instant ADD COLUMN is possible.

If users want to still do original online ADD COLUMN with table rebuilding, they 
can use the INPLACE algorithm or FORCE keyword, like described in FR2.

There is no any change to the LOCK for this worklog. We can always specify 
NONE/SHARED/EXCLUSIVE for the LOCK.

2.2 Row format supported
------------------------

The row formats of DYNAMIC/COMPACT/REDUNDANT would be supported in this worklog. 
COMPRESSED is no need to supported.


3. Algorithm
============

To make an ADD COLUMN metadata change only, it is necessary to mark the record 
accordingly, so that later scan can know which record is fit to which table 
definition. One solution is to give every record or page a version number, 
meaning that the record or records in a page are defined by table structure 
version X, so to parse a record, a proper table structure can be picked 
accordingly. But this is more complex and lots of work need to be done.

A simpler solution is how this worklog works. Instead of giving record/page a 
version, a bit in record would be set if there was any instant ADD COLUMN 
happened before. With this bit set, it's possible to distinguish a record is 
inserted before any instant ADD COLUMN or not. At the meantime, it's necessary 
to remember the existing column number when first instant ADD COLUMN happens.

Let's suppose there is a table with one row:
CREATE TABLE t1 (a INT NOT NULL AUTO_INCREMENT PRIMARY KEY, b INT);
INSERT INTO t1 VALUES(0, 1); -- Row 1

When executing an instant ADD COLUMN:
ALTER TABLE t1 ADD COLUMN c INT DEFAULT 10;

The column numbers in old table t1 would be remembered, so the number is 2, 
which is invariable.

Then another INSERT:
INSERT INTO t1 VALUES(0, 2, 20); -- Row 2

When inserting row two, one bit should be set to identify that it's a row after 
any instant ADD COLUMN. The bit in row one is not set. So when parsing row 1, 
only the first 2(as remembered) columns would be used. If row 2, all columns 
would be used.

To make it possible to parse a row after another new instant ADD COLUMN, a 
variable-length number of columns information should be added to the physical 
record if the record has the instant mark set. So in row 2, it's also necessary 
to remember current fields number, it's 3 now. Let's say there is another ALTER 
TABLE and a new INSERT:
ALTER TABLE t1 ADD COLUMN d INT;
INSERT INTO t1 VALUES(0, 3, 20, 10); -- Row 3

In row 3, the number of current fields would be remembered, it's 4 now. So to 
parse row 2, 3 columns are used, while to parse row 3, 4 columns are used.

Default value of the instantly added columns should be remembered too. So in 
above case, for row 1, default values of column c and d would be filled in for 
it; for row 2, default value of column d would be filled in; finally, for row 3, 
no need for default values.

Since there is only one bit to indicate the column change, so the order of 
columns are assumed to be not changed after instant ADD COLUMN. For short, the 
newly instantly added columns can only appended at last of the table.

Also, because there is no version concept here, there is no strong requirement 
to convert the old record format to the new one either in background or by some 
manual command. So physical records would always be left as is.

Only when there is a table rebuild operation, the old table would be destroyed 
and new table would be created, then all records in the new table are according 
to the latest table structure. And in this way, the number of current fields 
information don't need to be stored in record.

So, in a word, instant ADD COLUMN can be applied to a table repeatedly, and 
records in the table may have different number of fields stored. Once there is a 
table rebuild, all the record format would be in a unified and latest format.


4. Row format compatibility
===========================

There are both old style row format and new style row format in use.

4.1 New style row format
------------------------

New style row format includes DYNAMIC and COMPACT format. Different rows can be 
of different length, depending on the nullable fields and variable-length 
fields. Current physical record consists of a record header and record data 
fields.

The detailed record header which is in variable-length looks like:

+--------------------------------+----------------+---------------+
| Non-null variable-length array | SQL-null flags | Extra 5 bytes |
+--------------------------------+----------------+---------------+

Then the detailed 5 extra bytes look like:

+-----------+---------------+----------+-------------+-----------------+
| Info bits | Records owned | Heap No. | Record type | Next record ptr |
+-----------+---------------+----------+-------------+-----------------+

The lengths of above information are:
a) Info bits: 4 bits
b) Record owned: 4 bits
c) Heap No. : 13 bits
d) Record type: 3 bits
e) Next record ptr: 2 bytes

So the total length is 5 bytes.

4.2 How to mark instant bit?
----------------------------

Let's first consider the new style row format.

As it is known, there are un-used bits in both 'Info bits' and 'Record type'. 
Two bits in 'Info bits' can be used further while 4 kinds of values can be used 
in 'Record type'. So one bit in 'Info bits' would be used to indicate the 
instant mark. Also, to remember current field numbers, a variable-length number 
of fields would be added between the SQL-null flags and Extra 5 bytes. So the 
record header now becomes

+--------------------------+----------------+---------------+---------------+
| Non-null variable-length |                |               |               |
| array                    | SQL-null flags | fields number | Extra 5 bytes |
+--------------------------+----------------+---------------+---------------+

4.3 Old style row format
------------------------

Only REDUNDANT row format needs to be considered here. And since there is 
already firlds number in the physical record, there should be no physical format 
change for redundant row format. It's possible to parse the record according to 
the fields number.


5. Integration with DD
======================

What should be remembered in DD are:
a) Column numbers before first instant ADD COLUMN
b) Default values for instant added columns

Please note that if the default value gets changed later, it's still a must to 
remember the original default value, because records without this field should 
be filled in with the original default value.

It's obviously that a) should be stored in dd::Table::se_private_data, as an 
integer. Furthermore, if this is a partitioned table, a) for different 
partitions may differ, so this should be also stored in 
dd::Partition::se_private_data.

About b), considering the new DD, the best place to store it is 
dd::Column::se_private_data. The benefit is that the default value becomes an 
attribute of the dd::Table, also every dd::Column can manage its default value 
separately.

These values would be remembered during ALTER TABLE, and loaded into memory 
during opening table.


6. DMLs
=======

6.1 INSERT
----------
As mentioned above, INSERT after one instant ADD COLUMN would affect the record 
format, the number of fields currently would be remembered in the record. If no 
instant ADD COLUMN, insert doesn't care about the number of current fields.

6.2 DELETE
----------
There is no any change.

6.3 UPDATE
----------
As long as the instantly added column gets changed, it must not use the inplace 
update, instead, it should delete and re-insert the new record.

Furthermore, to handle a rollback problem easily(see next section), it has to 
check if the trailing fields of the updated record can be ignored or not. If the 
updated record has the same default values at last, the default values won't be 
stored in the row.

6.4 SELECT
----------
Default values of instantly added columns should be read and filled into the 
result whenever a record is read


7. Rollback
===========

Because some fields could be not stored in the physical record, update to this 
record may result in the row too big problem if update is done in the old way, 
expanding all default fields in the new record.

One example is:

a) create table t2( id int primary key, c1 varchar(4000), c2 varchar(4000), c3 
varchar(1000)) engine=innodb row_format=compact;
b) insert into t2 values(1, repeat('a', 4000), repeat('b', 4000), repeat('c', 
1));
c) alter table t2 add column d1 varchar(500) not null default repeat('d', 500);
d) begin;
e1) update t2 set c1 = repeat('x', 200) where id = 1;
or e2) update t2 set c4 = 'x' where id = 1;
f) rollback;

In this example, original row doesn't have any field stored externally, if it's 
instant ADD COLUMN, the row would not be changed at all, that is the default 
value of 500 chars are not in the record.

Case e1): The update new row fit in the page too. However, when rollback, the 
original row now has to include the default value of 500 chars. Then the record 
is too long to fit in one page, some fields have to be stored externally. The 
worse is if it's REDUNDANT format with 4K page size, the expanded record would 
become too big to be inserted into the page, thus a rollback failure with crash.

Case e2): The update new row also fit in the page. However, the rollback will 
now have to expand the record with 500 chars, which makes the row too big to be 
rolled back.

So it needs to keep the default values as not stored in record when possible. 
That is in both cases, new record would not include 500 chars for c4 column, 
just mark it as default value in the record.

Shall it just keep the default values as original record? No. In e2) case, the 
updated record doesn't have c4 as default value, if we don't check c4 when 
rollback, it's impossible to roll back it too. So all possible default 
values(after first INSTANT ADD COLUMN) should be checked.

Why this doesn't happen before instant ADD COLUMN? Because the fields of the row 
have already stored externally after the ADD COLUMN. During rollback, the 
existing externally stored value can be used to represent the old value before 
update, thus, no new external fields are necessary.



8. Recovery
===========

For redo recovery, InnoDB will parse the record logged, so it's a must to know 
the instant ADD COLUMN information. The only needed information is that if the 
table is after one instant ADD COLUMN and the column number before its first 
instant ADD COLUMN.

So basically when logging the index information, the number of columns before 
first instant ADD COLUMN would be logged to indicate both above information for 
new style records. If it's old style, no extra logs are necessary.


9. Replication
==============

As long as the DMLs can work correctly, there should be no impact on replication 
at all.


10. IMPORT/EXPORT
================

Since the related information as mentioned in 5. are already all stored in 
dd::Table, dd::Partition and dd::Column(mysql.tables, mysql.partition_tables and  
mysql.columns), so it would be out of box for the table to work after 
exporting/importing with SDI. However, this has to wait for WL#9761 for SDI 
EXPORT/IMPORT to work. So in this worklog, it's necessary to write the DD 
related information into the .cfg file, and read it during IMPORT.

The serialization can be done straightly by writing the metadata out along with 
other table/column metadata.

If the .cfg file is missing, there is no way to know the number of fields on 
first instant add column, so in most cases, it's impossible to use the IBD file. 
To handle the .cfg missing problem, after IMPORT, a check would be done on 
clustered index, to verify all the physical records. A table with instant 
columns would be always reported having corrupted records in clustered index. So 
in this case, the IMPORT would fail. This applies to generic IMPORT too, it's no 
need to import a corrupted table. Once WL#9761 comes in, .cfg missing problem 
would not happen so there would be no this issue.


11. Displayed in I_S.innodb*
===========================

It would be nice to provide the instant ADD COLUMN information to users through 
I_S views. Since there are two types of DD metadata related, one is in 
dd::Table::se_private_data and dd::Partition::se_private_data, others are in 
dd::Column::se_private_data, it's naturally to show these metadata in 
I_S.innodb_tables and I_S.innodb_columns accordingly. So the number of columns 
before first instant ADD COLUMN would be displayed in I_S.innodb_tables for 
every table, and every default value of columns would be displayed in 
I_S.innodb_columns. The default value would only be displayed in a binary format 
which is used in InnoDB internally.


12. Side effects
================

a) Since ADD COLUMN may be done instantly, so we may not expect an instant ADD 
COLUMN will fix a corrupted index

b) Since the INSTANT ADD COLUMN would not copy a new table from old one row by 
row, there is no chance for it to check if the rows from old table along with 
new added columns would become too big or not. So after INSTANT ADD COLUMN, some 
rows may in fact too big for the table, especially for REDUNDANT format, it 
would be detected when the record gets updated.
1. In-memory table objects
==========================

To remember the instant status of a table, new member variables were introduced:

In dict_table_t:

+       /** Number of non-virtual columns before first instant ADD COLUMN,
+       including the system columns like n_cols. */
+       unsigned                                n_instant_cols:10;

In dict_index_t:

+       unsigned        n_instant_nullable:10;
+                               /*!< number of nullable fields before first
+                               instant ADD COLUMN applied to this table.
+                               This is valid only when is_instant() is true */

A new structure to remember the default value in memory:

+/** Data structure for default value of a column in a table */
+struct dict_col_default_t {
+       /** Pointer to the column itself */
+       dict_col_t*     col;
+       /** Default value in bytes */
+       byte*           value;
+       /** Length of default value */
+       size_t          len;
+};

Also a default value pointer in dict_col_t, if this column was instantly added.

+       /** Default value when this column was added instantly.
+       If this is not a instantly added column then this is nullptr. */
+       dict_col_default_t*             instant_default;

There are also some related functions to get instant information etc. from 
table, indexes and columns.


2. Records related
==================

Remember that only the record on clustered index could be affected by this 
worklog, secondary index records don't.

2.1 Instant bit and default value
---------------------------------

To set a bit in info bits to indicate this is a record after instant ADD COLUMN, 
this bit is introduced:

+/* The 0x40UL can also be used in the future */
+/* The instant ADD COLUMN flag. When it is set to 1, it means this record
+was inserted/updated after an instant ADD COLUMN. */
+#define REC_INFO_INSTANT_FLAG  0x80UL

To indicate a field has a default value rather than an inlined value in the 
record, this bit is introduced for the offsets array:

+/* Default value flag in offsets returned by rec_get_offsets() */
+#define REC_OFFS_DEFAULT       ((ulint) 1 << 29)

All instantly add columns will have this bit set in offsets array.

If a record resides on a clustered index whose table has undergone an instant 
ADD COLUMN, then it should be parsed specially. So there should be a check to 
see if this index/table is instant affected, if so, those instantly added 
columns set with REC_OFFS_DEFAULT will ask for default values remembered in 
dict_col_t::instant_default.

2.2 Parse
---------

Since the physical record may have less fields than the ones defined on 
clustered index, if the offsets array is initialized according to the physical 
record only, then the offsets fields will also mismatch with fields on clustered 
index. To make logic and coding simple, the offsets array is always created 
according to the real number of fields on index.

There would be two different physical record format with three meanings, please 
refer to rec_init_null_and_len_comp() to know how to handle all these cases. In 
a word, the checking looks like:
if the index is not instant
    /* This is a complete record without instant ADD COLUMN, parse it as is */
else if the record has instant bit set
    /* This is a record inserted after an instant ADD COLUMN, so it should be 
able to know the number of fields by parsing the length info in the record */
else
    /* This is a record inserted before one instant ADD COLUMN, the number of 
fields is remembered in the table metadata */


3. Identify an instant ADD COLUMN
=================================

Since the keyword INSTANT would be introduced for current instant ADD COLUMN, in 
ha_innodb::check_if_supported_inplace_alter() and 
ha_innopart::check_if_supported_inplace_alter(), if we know the ALGORITHM is 
INSTANT, all relevant tables would be checked if instant ADD COLUMN is 
applicable, like it's not a table with fulltext index, it's not in COMPRESSED 
format etc. Once all are fine, a flag would be set in above variable, to 
indicate this is instant ADD COLUMN, and no rebuild later.


4. Metadata in DD
=================

Basically there are four new SE private data introduced:
a) dd::Table::se_private_data::instant_col, to indicate how many columns exist 
before first instant ADD COLUMN in table level
b) dd::Partition::se_private_data::instant_col, to indicate similarly to a), 
however, different partitions may have different numbers, as long as all are 
bigger than or equal to the one in table level
c) dd::Column::se_private_data::default_null, to indicate the default value is 
NULL
d) dd::Column::se_private_data::default, to indicate the default value if it's 
not NULL

All are easy to understand, except b).

Let's say there is a partitioned table with two partitions, and also an instant 
ADD COLUMN has happened just now. If partition two is truncated, the new 
partition doesn't have to keep the instant ADD COLUMN information and make the 
records in it more complicated. It should only works like a fresh new 
table/partition with all complete new records in the table. This is same to 
other partition related operations which will create new partitions. So a) and 
b) are both remembered, b) is per partition. And b) should be either nothing, or 
always bigger than a)

Since b) should be always bigger than a) if exists, so to open a partition 
table, it can only load default values for last b) columns, instead of a) 
columns.

When the whole table is not instant, all a), b), c) and d) would be cleared. If 
only some partitions are not instant, relevant b) would be cleared with others 
left.


5. I_S display
==============

There would be one more columns in I_S.innodb_tables called INSTANT_COLS, to 
remember the number of columns when the first instant ADD COLUMN happened on 
this table. For partitioned table, this number may differ for each partition.

There would be two more columns in I_S.innodb_columns called HAS_DEFAULT and 
DEFAULT_VALUE. HAS_DEFAULT = 1 means that this is a instant column with default 
value. DEFAULT_VALUE is only valid when HAS_DEFAULT is 1. It shows the internal 
binary for the default value. If necessary, we may also try to display the 
original default value, but this requires more work in translating the values.


6. EXPORT/IMPORT
================

For non-partitioned table, basically the tablespace can be imported to replace 
the existing one.

For partitioned table, every partition may have different instant columns 
recorded, so before one partition tablespace can be imported, it's a must to 
check if the default values remembered in the .cfg file match the default values 
in the table existing in the running server. If not match, saying different 
default values for one instant column, the tablespace can't be imported. Once a 
tablespace gets imported, its default values should be imported too.