WL#7554: Switch to new default character set and change mtr test cases
Affects: Server-8.0 — Status: Complete — Priority: Medium
User Story
----------
Steve is the Development Manager at WidgetTech, a retailer with thousands of stores in the USA and expanding fast. Steve maintains an internal application for managing the process of opening new retail stores. He has found that since mobile support was introduced, many of the project managers use the mobile interface as their primary way of interacting with the application. As most updates are brief, the project managers also like to use emoji and shorthand when entering information. In order to offer this, Steve needs to make sure his application and database both use the utf8 character set. Steve has a reasonable expectation that this change should not result in a loss of performance or functionality.

Background
----------
There is some interest in changing the default character set from latin1 to utf8mb4. One strong motivation is that even in English-speaking markets, users are generating multi-byte characters at an increasing rate. ~85.8% of web pages are also now utf8: http://w3techs.com/technologies/details/en-utf8/all/all

We want to be #1 in web and mobile, and because there is some confusion between "utf8" and "utf8mb4", utf8mb4 not being the default causes usability issues. Migrating data is also hard, so defaults are important.

This WL is created to document the issues that were brought up during the discussion, and to record which of them are agreed to be required in scope for the default change and which are only highly desirable.

Issues for changing the default to utf8mb4
----------------------------------------------------
1. Functional issues:

1.1 We lack case and accent sensitive collations.
    Implemented in WL#9109 Add case and accent sensitive collations for utf8mb4 (Xing)

1.2 Collation support. utf8mb4_general_ci and all language specific collations treat all characters in the supplementary multilingual plane (SMP) as if they have the same weight, e.g. all emojis are evaluated to be equal. We want utf8mb4_0900_ai_ci to become the default collation for utf8mb4. The language specific collations corresponding to utf8mb4_0900 will need to be added (~25 collations) in order to assist this switch.
    Implemented in WL#9109 Add case and accent sensitive collations for utf8mb4 (Xing)

2. Performance issues:

2.1 MEMORY SE used for tmp tables does not support variable length columns: The current plan is to replace MEMORY SE with InnoDB In-Memory Intrinsic Tables.

2.2 Filesort can be optimized more for variable length columns:
    - Varlen keys for sorting JSON values
    - Varlen keys for sorting multibyte String values

2.3 ALTER TABLE operations that change character set or collation on a column are currently performed by copying the table contents and affected indexes to new storage. Copying is not always needed.

Upgrade Story
-------------
The new default character set will not apply immediately to existing tables or schemas, which will continue to exist in their previously defined character set. Users wishing to revert to the previous default character set of latin1 can do so by setting character-set-server=latin1.

Test Changes
--------------
To make this change possible, we also need to pay attention to legacy test cases. All these test cases assume that the default character set is latin1 and the default collation is latin1_swedish_ci. With a change this big, we foresee that many test cases will fail. This WL is not only used for tracking the few lines of code that change the default charset and collation; it is mostly used for tracking the test case changes needed to make the tests pass with the new default charset and collation.
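
As a rough illustration of why the emoji in the user story and the variable length columns in items 2.1 and 2.2 matter, the following Python sketch compares how many bytes a few sample characters need under a single-byte encoding and under UTF-8. It is only a sketch under stated assumptions: Python's "cp1252" codec stands in for MySQL's latin1, "utf-8" stands in for utf8mb4, and the helper names are hypothetical, not part of any MySQL API.

```python
# Illustrative sketch only; not MySQL code.
# Assumption: Python's "cp1252" codec approximates MySQL latin1,
# and Python's "utf-8" codec approximates MySQL utf8mb4.

def byte_width(ch, encoding):
    """Return the encoded length of one character, or None if it cannot be encoded."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


def in_supplementary_plane(ch):
    """True if the code point lies above the Basic Multilingual Plane (U+FFFF)."""
    return ord(ch) > 0xFFFF


# 'a', e-acute, euro sign, and an emoji (U+1F600)
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

# Map each character to (single-byte width, UTF-8 width)
widths = {ch: (byte_width(ch, "cp1252"), byte_width(ch, "utf-8")) for ch in samples}

for ch, (single, multi) in widths.items():
    print("U+%04X cp1252=%s utf-8=%s smp_or_above=%s"
          % (ord(ch), single, multi, in_supplementary_plane(ch)))
```

Under these assumptions the single-byte encoding always uses one byte per character and cannot represent the emoji at all, while UTF-8 uses between one and four bytes, with the four-byte case covering code points above U+FFFF.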
WL#9108: Add language specific case insensitive collations of utf8mb4
WL#9125: Add utf8mb4_800_ci_ai
The InnoDB change buffer reserves only 8 bits for the charset-collation code. If we need to introduce a large number of charset-collation codes for each collation of 4-byte UTF-8, then we may need to spend time on refactoring the InnoDB change buffer so that it will look up the metadata from the Global DD instead of storing it internally. For now, we will have to disable change buffering on secondary indexes that are defined on affected utf8mb4 columns. (Similarly, we must disable change buffering on DESC indexes).
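
The constraint described above can be modelled with a few lines of Python. This is a simplified sketch, not InnoDB code: the function names and the sample code values are hypothetical, and the only facts taken from the text are the 8-bit field width and the rule that change buffering is disabled for affected columns and for DESC indexes.

```python
# Simplified model of the 8-bit limit; not InnoDB source code.
# All names and sample values below are hypothetical.

CODE_BITS = 8                       # field width stated in the text
MAX_CODE = (1 << CODE_BITS) - 1     # largest value an 8-bit field can hold


def fits_in_change_buffer(code):
    """True if a charset-collation code can be stored in the 8-bit field."""
    return 0 <= code <= MAX_CODE


def change_buffering_allowed(code, is_desc_index):
    """Change buffering is usable only if the code fits and the index is not DESC."""
    return fits_in_change_buffer(code) and not is_desc_index


print(MAX_CODE)
print(change_buffering_allowed(33, False))    # small code, ascending index
print(change_buffering_allowed(300, False))   # code too large for 8 bits
print(change_buffering_allowed(33, True))     # DESC index
```

An 8-bit field distinguishes at most 256 values, so any code above 255 cannot be recorded, which is why the text falls back to disabling change buffering for the affected indexes.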
Copyright (c) 2000, 2018, Oracle Corporation and/or its affiliates. All rights reserved.