WL#11605: Alter table with character set conversion as inplace operation, Step 1, NO INDEX
Affects: Server-8.0
—
Status: Complete
This WL handles the cases where columns to be changed are not part of any primary or secondary indexes. Collation specifies sorting order of an index. If a column is not indexed, any collation conversion can be done as a pure metadata change, e.g. from utf8mb4_general_ci to utf8mb4_0900_ai_ci. In the following cases, character set conversion can be done as a metadata change too, because the source character set is a subset of the target character set. Collation change has no consequence either, since there is no index on the involved columns. ASCII -> UTF8MB4 *) utf8mb3 -> UTF8MB4 *) This is true in theory, but not in practice as inserting 8-bit values into an ASCII column is NOT rejected, but merely produces a warning. So the ASCII column my contain data which are not legal UTF8MB4 (and possibly not legal in other 1-byte charsets). And it is not even allowed using the COPY algorithm. E.g. CREATE TABLE t(v VARCHAR(10) CHARSET ASCII); INSERT INTO t VALUES(_ascii 0xc5); Warnings: Warning 1300 Invalid ascii character string: 'C5' Warning 1300 Invalid ascii character string: '\xC5' ALTER TABLE t CHANGE COLUMN v v VARCHAR(10) CHARSET LATIN1, ALGORITHM=COPY; ERROR HY000: Incorrect string value: '\xC5' for column 'v' at row 1 For this we would need an ASCII_7 charset which actually rejected invalid inserts (or cleared the top bit).
FR 1 - No charset change, except utf8mb3 to utf8mb4 and any charset to binary will be done inplace. FR 2 - Collation cannot be changed inplace for indexed columns/Fields FR 3 - No change (charset or otherwise) which reduce the number of bytes needed for storage (Field::max_display_length), may be done inplace. FR 4 - No change (charset or otherwise) which makes Field::max_display_length exceed 255, can be done inplace. FR 5 - No change (charset or otherwise) which alters the char length of a CHAR(X)/Field_str can be done inplace. FR 6 - Only data types which have a charset can benefit from an inplace charset change FR 7 - Data type VARCHAR(X)/Field_varstring must support inplace charset change subject to general restrictions mentioned above. FR 8 - Data type CHAR(X)/Field_str must support inplace charset change, subject to general restrictions, AND also only if the char length X does not change. Currently, 256-border restriction is only enforced for Field_varstring, because for Field_string there is the even stricter requirement that max_display_length cannot change at all. To allow any charset changes inplace for Field_string we will have to relax this, and since the justifictation is the Innodb row format, it would seem reasonable to also enforce it for Field_string when relaxing the identical max_display_length requirement, since Innodb uses the same row format. FR 9 - Data type TINYTEXT/Field_blob must support inplace charset change. FR 10 - Data type TEXT/Field_blob must support inplace charset change. FR 11 - Data type MEDIUMTEXT/Field_blob must support inplace charset change. FR 12 - Data type LONGTEXT/Field_blob must support inplace charset change FR 13 - Data type ENUM/Field_enum must support inplace charset change. (Note that checking if new values appear at the end of the enum must be done using the new charset which, for an inplace charset change, is able to compare potential new value strings to the existing ones correctly).
Charset changes which preserve encoding: ---------------------------------------- The following charset changes can be done inplace as the encoding of all legal characters in the source changeset is also legal in the destination charset: utf8mb3: -> binary, utf8mb4 -> All others -> binary Destination -------------------------------------------- Source | binary | utf8mb3 | utf8mb4 | -------------------------------------------- utf8mb3 | Yes | - | Yes | All others | Yes | No | No | Data types with charset: ------------------------ SQL type | Field subclass | Comment ------------------------------------------------------ VARCHAR(X) | Field_varstring | CHAR(X) | Field_str | Char length must be fixed due to padding TINYTEXT | Field_blob | TEXT | Field_blob | MEDIUMTEXT | Field_blob | LONGTEXT | Field_blob | ENUM | Field_enum | Charset is largely irrelevant API for checking column changes ------------------------------- The virtual member function int Field::is_equal(const Create_field*) is used to check if individual column/field changes prevent an alter from being done inplace. Sub-classes of Field override this member function as necessary. The returned int can be one of three values: IS_EQUAL_NO, IS_EQUAL_YES and IS_EQUAL_PACK_LENGTH. The returned value is sometimes implicitly converted to bool, so that IS_EQUAL_NO becomes false, and the other two true. The exact meaning of these are, to my knowledge, not documented. For this WL I will assume the following protocol: - If column is changed in such a way that we can infer at the server layer that the alter cannot be done inplace, is_equal shall return IS_EQUAL_NO - If the change does not modify existing values, but the storage size needs to be increased in order to hold new values allowed after the change, is_equal shall return IS_EQUAL_PACK_LENGTH. - If neither existing values nor storage requirements change, is_equal() shall return IS_EQUAL_YES. Column/field changes which prevent inplace alter ----------------------------------------------------------------------- 1) SQL type change. 2) Length change. In general it is not possible to reduce a Field's stored size (pack_length()) inplace as that will require truncation of existing entries which exceed the new max value. As alluded to in the HLD, all changes which grow a column (field) from a value less than 256 to a value greater than or equal to 256, cannot be done inplace, due to restrictions imposed by the SE row format. (IMO this check does not belong in is_equal() and should instead be performed by SE inside handler::check_if_supported_inplace_alter(). But making this change falls outside the natural scope of this wl). As can be seen in LLD, this restriction is currently only enforced for Field_varstring, and the check is done against max_display_length(). This seems odd, since one would think that it is the pack_length(), the stored size, which is relevant. A conjecture here is that max_display_length() was used because it is equal to the pack_length() minus the size bytes for VARCHAR (which was the only type which could be extended inplace, and needed this check). Given that this restriction is imposed by the SE row format, we must assume that it applies to all data types which can be changed inplace, and consequently we must add enforcement of it whenever we allow a new type to be extended inplace. Based on the tables for size member functions in LLD, the size restriction can be expressed generally for all types as Field::pack_length() <= Create_field::pack_length && Create_field::length >= 256 && Field::row_pack_length() < 256 where Field::row_pack_length() will be the storage requirement excluding the size bytes for all types. Expressing the restriction this way, without relying on max_display_length(), has the advantage that is works unmodified for all Field types. 3) Incompatible charset change. As described above. 4) Change collation of an indexed column. As mentioned in HLD, it is also not possible to perform a change inplace if the column is indexed (Field::m_indexed == true) AND the collation is different(Field::field_charset->coll != Create_field::charset->coll). The latter implies pointer identity for MY_COLLATION_HANDLE (we assume that no two MY_COLLATION_HANDLE objects will exist which represents the same collation).
Field class hierarchy --------------------- Partial inheritance hierarchy (from sql_field.h): Field (abstract) | ... +--Field_str (abstract) | +--Field_longstr | | +--Field_string | | +--Field_varstring | | +--Field_blob | | +--Field_geom | | +--Field_json | | | +--Field_null | +--Field_enum | +--Field_set Existing ::is_equal implementations: ------------------------------------ The Field base class has the following implementation: uint Field::is_equal(const Create_field *new_field) { return (new_field->sql_type == real_type()); } Which is only used by new Create_field_wrapper. Field_str, CHAR(x) ------------------ uint Field_str::is_equal(const Create_field *new_field) { return ((new_field->sql_type == real_type()) && new_field->charset == field_charset && new_field->length == max_display_length()); } To allow inplace charset changes for CHAR(x) this must be changed to check for compatible charsets, and the length requirement must be changed to allow growth, as long as the char_length stays the same (in order to preserve padding). But since the row format used for CHAR and VARCHAR is similar, it is probably necessary to add the same 256 byte boundary check that is used for Field_varsting (VARCHAR) below. Field_varstring, VARCHAR(x) --------------------------- uint Field_varstring::is_equal(const Create_field *new_field) { if (new_field->sql_type == real_type() && new_field->charset == field_charset) { if (new_field->length == max_display_length()) return IS_EQUAL_YES; DBUG_ASSERT(0 == (new_field->length % field_charset->mbmaxlen)); DBUG_ASSERT(0 == (max_display_length() % field_charset->mbmaxlen)); if (new_field->length > max_display_length() && ((new_field->length <= 255 && max_display_length() <= 255) || (new_field->length > 255 && max_display_length() > 255))) return IS_EQUAL_PACK_LENGTH; // VARCHAR, longer variable length } return IS_EQUAL_NO; } Here to the pointer-equality for charsets must be replaced by a test for compatible charsets. The DBUG_ASSERTs must be removed as they typically will not hold when charsets are not the same. Field_blob, BLOB, TINYTEXT, TEXT, MEDIUMTEXT and LONGTEXT --------------------------------------------------------- uint Field_blob::is_equal(const Create_field *new_field) { return ( (new_field->sql_type == get_blob_type_from_length(max_data_length())) && new_field->charset == field_charset && new_field->pack_length == pack_length()); } In addition to BLOB, all the *-TEXT types (TINYTEXT, TEXT, MEDIUMTEXT and LONGTEXT) are represented as Field_blob with the corresponding blob type internally. So Create_field::sql_type will be _TINY_BLOB for TINYTEXT and so on. Here too it is necessary to relax the charset condition to check for compatible charsets. It is worth noting that the two other conditions essentially are equivalent, as max_data_length() is derived from the packlength, and for blobs the packlength is really an enum identifying the sub-blob type. Field_enum, ENUM and SET ------------------------ uint Field_enum::is_equal(const Create_field *new_field) { TYPELIB *values = new_field->interval; /* The fields are compatible if they have the same flags, type, charset and have the same underlying length. */ if (new_field->sql_type != real_type() || new_field->charset != field_charset || new_field->pack_length != pack_length()) return IS_EQUAL_NO; /* Changing the definition of an ENUM or SET column by adding a new enumeration or set members to the end of the list of valid member values only alters table metadata and not table data. */ if (typelib->count > values->count) return IS_EQUAL_NO; /* Check whether there are modification before the end. */ if (!compare_type_names(field_charset, typelib, new_field->interval)) return IS_EQUAL_NO; return IS_EQUAL_YES; } Field_enum is somewhat of a special case in that only the number of enum (or set) values affect the storage requirement. The charset used for the enum value does not affect the the size of the field as the binary value is stored, and the symbolic names are only stored as meta data. So charset changes will not affect the store requirement. But we still need to check the collation in case of an indexed column, and also check that the charset change is one of the allowed ones, since the new charset must be able to compare old and new values to determine if the ordering of existing elements is preserved. As can be seen from the above code snippet, the existing implementation never returns IS_EQUAL_PACK_LENGTH for Field_enum, but there is really no reason not to do this if the pack_length increases, and the other criteria are otherwise satisfied. But allowing this would require us to perform the same 256 boundary check also for Field_enum, since the SE uses the same row-format regardless of type (assumption). At least this would be the safer option since it is better to force a COPY which wasn't needed, than allowing an INPLACE that is incorrect. Testing has revealed extending a SET column from 1 to 2 bytes inplace, will result in asserts when trying to access the altered table. So for now, all pack_length- increasing changes will need to be done with COPY. Various length member function in the Field hierarchy ----------------------------------------------------- Length memfuns | Field | Field_str | Field_longstr ---------------------------------------------------------------------- pack_length | field_length | | row_pack_length | 0 | | max_display_length || nv field_length | Length memfuns | Field_varstring | Field_string ---------------------------------------------------------------------- pack_length | (field_length + | | length_bytes) | row_pack_length | field_length | field_length max_display_length | Field_str:: | Length memfuns | Field_blob | Field_enum ---------------------------------------------------------------------- pack_length | packlength + | nv packlength | portable_sizeof_char_ptr | row_pack_length | nv packlength | nv pack_length() max_display_length | 1) | 1) uint32 Field_blob::max_display_length() { switch (packlength) { case 1: return 255 * field_charset->mbmaxlen; case 2: return 65535 * field_charset->mbmaxlen; case 3: return 16777215 * field_charset->mbmaxlen; case 4: return (uint32)4294967295U; default: DBUG_ASSERT(0); // we should never go here return 0; } }
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.