WL#11605: Alter table with character set conversion as inplace operation, Step 1, NO INDEX

Affects: Server-8.0   —   Status: Complete   —   Priority: Medium

This WL handles the cases where columns to be changed are not part of any
primary or secondary indexes.

Collation specifies sorting order of an index. If a column is not indexed, any
collation conversion can be done as a pure metadata change,  e.g. from
utf8mb4_general_ci to utf8mb4_0900_ai_ci.

In the following cases, character set conversion can be done as a metadata
change too, because the source character set is a subset of the target character
set. Collation change has no consequence either, since there is no index on the
involved columns.

 ASCII -> UTF8MB4 *)
 utf8mb3 -> UTF8MB4

*) This is true in theory, but not in practice as inserting 8-bit values into an
ASCII column is NOT rejected, but merely produces a warning. So the ASCII column
my contain data which are not legal UTF8MB4 (and possibly not legal in other
1-byte charsets). And it is not even allowed using the COPY algorithm. E.g. 

INSERT INTO t VALUES(_ascii 0xc5);
Warning 1300    Invalid ascii character string: 'C5'
Warning 1300    Invalid ascii character string: '\xC5'
ERROR HY000: Incorrect string value: '\xC5' for column 'v' at row 1

For this we would need an ASCII_7 charset which actually rejected invalid
inserts (or cleared the top bit).
FR 1 - No charset change, except utf8mb3 to utf8mb4 and any charset to binary will
be done inplace.

FR 2 - Collation cannot be changed inplace for indexed columns/Fields

FR 3 - No change (charset or otherwise) which reduce the number of bytes needed
for storage (Field::max_display_length), may be done inplace.

FR 4 - No change (charset or otherwise) which makes Field::max_display_length
exceed 255, can be done inplace. 

FR 5 - No change (charset or otherwise) which alters the char length of a
CHAR(X)/Field_str can be done inplace.

FR 6 - Only data types which have a charset can benefit from an inplace charset

FR 7 - Data type VARCHAR(X)/Field_varstring must support inplace charset change
subject to general restrictions mentioned above.

FR 8 - Data type CHAR(X)/Field_str must support inplace charset change, subject to
general restrictions, AND also only if the char length X does not change.
Currently, 256-border restriction is only enforced for Field_varstring, because
for Field_string there is the even stricter requirement that max_display_length
cannot change at all. To allow any charset changes inplace for Field_string we
will have to relax this, and since the justifictation is the Innodb row format,
it would seem reasonable to also enforce it for Field_string when relaxing the
identical max_display_length requirement, since Innodb uses the same row format.

FR 9 - Data type TINYTEXT/Field_blob must support inplace charset change.

FR 10 - Data type TEXT/Field_blob must support inplace charset change.

FR 11 - Data type MEDIUMTEXT/Field_blob must support inplace charset change.

FR 12 - Data type LONGTEXT/Field_blob must support inplace charset change

FR 13 - Data type ENUM/Field_enum must support inplace charset change. (Note that
checking if new values appear at the end of the enum must be done using the new
charset which, for an inplace charset change, is able to compare potential new
value strings to the existing ones correctly).

Charset changes which preserve encoding:

The following charset changes can be done inplace as the encoding of all legal
characters in the source changeset is also legal in the destination charset:

-> binary, utf8mb4

-> All others -> binary

Source      | binary | utf8mb3 | utf8mb4 | 
utf8mb3     | Yes    | -       | Yes     |
All others  | Yes    | No      | No      |

Data types with charset:

SQL type   | Field subclass  | Comment
VARCHAR(X) | Field_varstring | 
CHAR(X)    | Field_str       | Char length must be fixed due to padding
TINYTEXT   | Field_blob      |
TEXT       | Field_blob      |
MEDIUMTEXT | Field_blob      |
LONGTEXT   | Field_blob      |
ENUM       | Field_enum      | Charset is largely irrelevant

API for checking column changes

The virtual member function int Field::is_equal(const Create_field*) is used to 
check if individual column/field changes prevent an alter from being done inplace.
Sub-classes of Field override this member function as necessary. The returned int 
can be one of three values: IS_EQUAL_NO, IS_EQUAL_YES and IS_EQUAL_PACK_LENGTH. 
The returned value is sometimes implicitly converted to bool, so that IS_EQUAL_NO 
becomes false, and the other two true. The exact meaning of these are, to my 
knowledge, not documented. 

For this WL I will assume the following protocol:

- If column is changed in such a way that we can infer at the server layer that 
the alter cannot be done inplace, is_equal shall return IS_EQUAL_NO

- If the change does not modify existing values, but the storage size needs to be 
increased in order to hold new values allowed after the change, is_equal shall 

- If neither existing values nor storage requirements change, is_equal() shall 
return IS_EQUAL_YES.

Column/field changes which prevent inplace alter

1) SQL type change.

2) Length change.

In general it is not possible to reduce a Field's stored size (pack_length())
inplace as that will require truncation of existing entries which exceed the new
max value.

As alluded to in the HLD, all changes which grow a column (field) from a value
less than 256 to a value greater than or equal to 256, cannot be done inplace, due 
to restrictions imposed by the SE row format. (IMO this check does not belong in 
is_equal() and should instead be performed by SE inside 
handler::check_if_supported_inplace_alter(). But making this change falls outside 
the natural scope of this wl).

As can be seen in LLD, this restriction is currently only enforced for
Field_varstring, and the check is done against max_display_length(). This seems
odd, since one would think that it is the pack_length(), the stored size, which
is relevant. A conjecture here is that max_display_length() was used because it
is equal to the pack_length() minus the size bytes for VARCHAR (which was the
only type which could be extended inplace, and needed this check).

Given that this restriction is imposed by the SE row format, we must assume that 
it applies to all data types which can be changed inplace, and consequently we 
must add enforcement of it whenever we allow a new type to be extended inplace.

Based on the tables for size member functions in LLD, the size restriction can
be expressed generally for all types as 

Field::pack_length() <= Create_field::pack_length &&
Create_field::length >= 256 && Field::row_pack_length() < 256

where Field::row_pack_length() will be the storage requirement excluding the
size bytes for all types. Expressing the restriction this way, without relying on 
max_display_length(), has the advantage that is works unmodified for all Field 

3) Incompatible charset change. As described above.

4) Change collation of an indexed column. 

As mentioned in HLD, it is also not possible to perform a change inplace if the
column is indexed (Field::m_indexed == true) AND the collation is
different(Field::field_charset->coll != Create_field::charset->coll). The latter
implies pointer identity for MY_COLLATION_HANDLE (we assume that no two
MY_COLLATION_HANDLE objects will exist which represents the same collation).
Field class hierarchy

Partial inheritance hierarchy (from sql_field.h):

Field (abstract)
+--Field_str (abstract)
|  +--Field_longstr
|  |  +--Field_string
|  |  +--Field_varstring
|  |  +--Field_blob
|  |     +--Field_geom
|  |     +--Field_json
|  |
|  +--Field_null
|  +--Field_enum
|     +--Field_set

Existing ::is_equal implementations:

The Field base class has the following implementation:

uint Field::is_equal(const Create_field *new_field) {
  return (new_field->sql_type == real_type());

Which is only used by new Create_field_wrapper.

Field_str, CHAR(x)
uint Field_str::is_equal(const Create_field *new_field) {
  return ((new_field->sql_type == real_type()) &&
          new_field->charset == field_charset &&
          new_field->length == max_display_length());

To allow inplace charset changes for CHAR(x) this must be changed to check for
compatible charsets, and the length requirement must be changed to allow growth,
as long as the char_length stays the same (in order to preserve padding). But
since the row format used for CHAR and VARCHAR is similar, it is probably
necessary to add the same 256 byte boundary check that is used for
Field_varsting (VARCHAR) below.

Field_varstring, VARCHAR(x)
uint Field_varstring::is_equal(const Create_field *new_field) {
  if (new_field->sql_type == real_type() &&
      new_field->charset == field_charset) {
    if (new_field->length == max_display_length()) return IS_EQUAL_YES;
    DBUG_ASSERT(0 == (new_field->length % field_charset->mbmaxlen));
    DBUG_ASSERT(0 == (max_display_length() % field_charset->mbmaxlen));
    if (new_field->length > max_display_length() &&
        ((new_field->length <= 255 && max_display_length() <= 255) ||
         (new_field->length > 255 && max_display_length() > 255)))
      return IS_EQUAL_PACK_LENGTH;  // VARCHAR, longer variable length
  return IS_EQUAL_NO;

Here to the pointer-equality for charsets must be replaced by a test for
compatible charsets. The DBUG_ASSERTs must be removed as they typically will not
hold when charsets are not the same.

uint Field_blob::is_equal(const Create_field *new_field) {
  return (
      (new_field->sql_type == get_blob_type_from_length(max_data_length())) &&
      new_field->charset == field_charset &&
      new_field->pack_length == pack_length());

In addition to BLOB, all the *-TEXT types (TINYTEXT, TEXT, MEDIUMTEXT and
LONGTEXT) are represented as Field_blob with the corresponding blob type internally.
So Create_field::sql_type will be _TINY_BLOB for TINYTEXT and so on. Here too it
is necessary to relax the charset condition to check for compatible charsets. It
is worth noting that the two other conditions essentially are equivalent, as
max_data_length() is derived from the packlength, and for blobs the packlength
is really an enum identifying the sub-blob type.

Field_enum, ENUM and SET
uint Field_enum::is_equal(const Create_field *new_field) {
  TYPELIB *values = new_field->interval;

    The fields are compatible if they have the same flags,
    type, charset and have the same underlying length.
  if (new_field->sql_type != real_type() ||
      new_field->charset != field_charset ||
      new_field->pack_length != pack_length())
    return IS_EQUAL_NO;

    Changing the definition of an ENUM or SET column by adding a new
    enumeration or set members to the end of the list of valid member
    values only alters table metadata and not table data.
  if (typelib->count > values->count) return IS_EQUAL_NO;

  /* Check whether there are modification before the end. */
  if (!compare_type_names(field_charset, typelib, new_field->interval))
    return IS_EQUAL_NO;

  return IS_EQUAL_YES;

Field_enum is somewhat of a special case in that only the number of enum (or
set) values affect the storage requirement. The charset used for the enum value
does not affect the the size of the field as the binary value is stored, and the
symbolic names are only stored as meta data.

So charset changes will not affect the store requirement. But we still need to 
check the collation in case of an indexed column, and also check that the charset 
change is one of the allowed ones, since the new charset must be able to compare 
old and new values to determine if the ordering of existing elements is preserved.

As can be seen from the above code snippet, the existing implementation never 
returns IS_EQUAL_PACK_LENGTH for Field_enum, but there is really no reason not to 
do this if the pack_length increases, and the other criteria are otherwise 
satisfied. But allowing this would require us to perform the same 256 boundary 
check also for Field_enum, since the SE uses the same row-format regardless of 
type (assumption). At least this would be the safer option since it is better to 
force a COPY which wasn't needed, than allowing an INPLACE that is incorrect.

Testing has revealed extending a SET column from 1 to 2 bytes inplace, will result 
in asserts when trying to access the altered table. So for now, all pack_length-
increasing changes will need to be done with COPY.

Various length member function in the Field hierarchy

Length memfuns     | Field         | Field_str   | Field_longstr
pack_length        | field_length   |             |  
row_pack_length    | 0             |             |
max_display_length | <pure virt>   | nv field_length |

Length memfuns     | Field_varstring      | Field_string
pack_length        | (field_length +       |
                   | length_bytes)        |
row_pack_length    | field_length          | field_length
max_display_length | Field_str::          |

Length memfuns     | Field_blob           | Field_enum
pack_length        | packlength +         | nv packlength
                   | portable_sizeof_char_ptr |         
row_pack_length    | nv packlength        | nv pack_length()
max_display_length | 1)                   |

uint32 Field_blob::max_display_length() {
  switch (packlength) {
    case 1:
      return 255 * field_charset->mbmaxlen;
    case 2:
      return 65535 * field_charset->mbmaxlen;
    case 3:
      return 16777215 * field_charset->mbmaxlen;
    case 4:
      return (uint32)4294967295U;
      DBUG_ASSERT(0);  // we should never go here
      return 0;