WL#7554: Switch to new default character set and change mtr test cases

Affects: Server-8.0 — Status: Complete


User Story
----------
Steve is the Development Manager at WidgetTech, a fast-expanding retailer with
thousands of stores in the USA.  Steve maintains an internal application for
managing the process of opening new retail stores.

He has found that since mobile support was added to the application, many of
the project managers use the mobile interface as their primary means of
interaction.  As most updates are brief, the project managers also like to use
emoji and shorthand when entering information.  To support this, Steve needs to
make sure his application and database both use a character set that can store
these characters (utf8mb4 in MySQL).

Steve has a reasonable expectation that this change should not result in a loss
of performance or functionality.

Background
----------

There is some interest in changing the default charset from latin1 to utf8mb4.
One strong motivation is that, even in English-speaking markets, users are
generating multi-byte characters at an increasing rate.  ~85.8% of web pages
are also now UTF-8: http://w3techs.com/technologies/details/en-utf8/all/all

We want to be #1 in web and mobile.  Because users confuse "utf8" with
"utf8mb4", not having utf8mb4 as the default causes usability issues.  Migrating
data is also hard, so defaults are important.
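
The difference between latin1, 3-byte "utf8" and utf8mb4 can be illustrated
with a short sketch.  It is not MySQL code: it uses Python's "cp1252" codec as
a stand-in for MySQL's latin1 (which the manual describes as cp1252 West
European), and the helper names are made up for this example.

```python
# Illustrative only: helper names are hypothetical, not MySQL APIs.

def utf8_len(ch: str) -> int:
    """Number of bytes the character needs in UTF-8."""
    return len(ch.encode("utf-8"))

def fits_cp1252(ch: str) -> bool:
    """True if the character is representable in cp1252 (stand-in for latin1)."""
    try:
        ch.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False

def fits_three_byte_utf8(ch: str) -> bool:
    """True if the character fits in at most 3 UTF-8 bytes per character."""
    return utf8_len(ch) <= 3

for ch in ["A", "é", "€", "😀"]:
    print(repr(ch), utf8_len(ch), fits_cp1252(ch), fits_three_byte_utf8(ch))
```

Only the emoji needs four bytes, which is why it fits neither a single-byte
charset nor a 3-byte UTF-8 encoding.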

This WL documents the issues raised during the discussion and records which of
them are agreed to be required for the default change, as opposed to highly
desirable.

Issues for changing the default to utf8mb4
----------------------------------------------------

1. Functional issues:
1.1 We lack case and accent sensitive collations. 
Implemented in WL#9109 Add case and accent sensitive collations for utf8mb4 (Xing)

1.2 Collation support. 
utf8mb4_general_ci and all language specific collations treat all characters in
the supplementary multilingual plane (SMP) as if they have the same weight, e.g.
all emojis are evaluated to be equal. 

We want utf8mb4_0900_ai_ci to become the default collation for utf8mb4.  The
language specific collations corresponding to utf8mb4_0900 will need to be added
(~25 collations) in order to assist this switch. 
Implemented in WL#9109 Add case and accent sensitive collations for utf8mb4 (Xing)
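
The equal-weight behavior described in 1.2 can be modeled with a toy weight
function.  This is a simplified sketch, not the server's collation code: it
assumes that every code point above U+FFFF receives the weight of U+FFFD, and
it approximates case insensitivity for other characters by upper-casing them.

```python
# Toy model of a collation that cannot distinguish supplementary characters.
# toy_weight and toy_key are hypothetical names used only for this example.

REPLACEMENT_WEIGHT = 0xFFFD  # assumed shared weight for supplementary characters

def toy_weight(ch: str) -> int:
    cp = ord(ch)
    if cp > 0xFFFF:
        # Every supplementary character collapses to the same weight.
        return REPLACEMENT_WEIGHT
    upper = ch.upper()
    # Crude case folding; keep the code point if upper-casing expands it.
    return ord(upper) if len(upper) == 1 else cp

def toy_key(s: str) -> tuple:
    """Sort key: the sequence of per-character weights."""
    return tuple(toy_weight(ch) for ch in s)

print(toy_key("😀") == toy_key("🚀"))  # two different emoji compare as equal
print(toy_key("a") == toy_key("A"))    # case-insensitive match
print(toy_key("a") == toy_key("b"))    # distinct letters stay distinct
```

Under this model a unique index or a WHERE clause cannot tell two different
emoji apart, which is the usability problem the 0900 collations address.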


2. Performance issues:  
2.1 MEMORY SE used for tmp tables does not support variable length 
columns: The current plan is to replace MEMORY SE with InnoDB In-Memory
Intrinsic Tables. 
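
The cost of fixed-length storage grows with the maximum bytes per character.
The arithmetic below is a rough sketch under the assumption that a
fixed-length row format reserves max_chars * max_bytes_per_char bytes for a
VARCHAR column; length prefixes and other per-row overhead are ignored, and the
function name is invented for the example.

```python
# Rough arithmetic only; not a model of the actual MEMORY row layout.

def fixed_width_bytes(max_chars: int, max_bytes_per_char: int) -> int:
    """Bytes reserved per value when a VARCHAR is stored at full width."""
    return max_chars * max_bytes_per_char

latin1_width = fixed_width_bytes(255, 1)    # VARCHAR(255) in latin1
utf8mb4_width = fixed_width_bytes(255, 4)   # VARCHAR(255) in utf8mb4

# Bytes actually needed by a short, typical value.
actual = len("New store: Austin 🎉".encode("utf-8"))

print(latin1_width, utf8mb4_width, actual)
```

A short update still occupies the full reserved width per row, so switching
the default from latin1 to utf8mb4 quadruples the space such a column takes in
a fixed-length temporary table.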

2.2 Filesort can be optimized further for variable-length columns: 
- Varlen keys for sorting JSON values
- Varlen keys for sorting multibyte String values
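
The potential saving from variable-length sort keys can be sketched as follows.
This is a toy comparison, not the filesort implementation: it assumes
fixed-length keys are padded with zero bytes to the maximum column width, uses
raw UTF-8 bytes instead of real collation weights, and all names are
hypothetical.

```python
# Toy comparison of padded fixed-length sort keys and variable-length keys.

def padded_key(value: str, max_chars: int, bytes_per_char: int = 4) -> bytes:
    """Fixed-length key: UTF-8 bytes padded with zero bytes to full width."""
    width = max_chars * bytes_per_char
    return value.encode("utf-8").ljust(width, b"\x00")

def varlen_key(value: str) -> bytes:
    """Variable-length key: just the bytes the value needs."""
    return value.encode("utf-8")

values = ["store", "añejo", "😀 ok", "a"]
MAX_CHARS = 255  # e.g. a VARCHAR(255) column

padded_bytes = sum(len(padded_key(v, MAX_CHARS)) for v in values)
varlen_bytes = sum(len(varlen_key(v)) for v in values)

# Both key forms produce the same ordering for these values.
same_order = (sorted(values, key=lambda v: padded_key(v, MAX_CHARS))
              == sorted(values, key=varlen_key))

print(padded_bytes, varlen_bytes, same_order)
```

The ordering is unchanged while the key buffer shrinks considerably, so more
rows fit in the same sort buffer.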

2.3 ALTER TABLE operations that change character set or collation on a column
are currently performed by copying the table contents and affected indexes
to new storage. Copying is not always needed. 

Upgrade Story
-------------

The new default character set does not apply to existing tables or schemas;
they keep the character set with which they were defined.

Users wishing to revert to the previous default character set of latin1 can do
so by setting character-set-server=latin1.

Test Changes
--------------

To make this change possible, we also need to pay attention to legacy test
cases.  These test cases assume that the default character set is latin1 and
the default collation is latin1_swedish_ci, so we expect many of them to fail
after the change.

This WL tracks not only the few lines of code that change the default charset
and collation, but mostly the test case changes needed to make the tests pass
with the new defaults.

WL#9108: Add language specific case insensitive collations of utf8mb4
WL#9125: Add utf8mb4_800_ci_ai

The InnoDB change buffer reserves only 8 bits for the charset-collation code. If
we need to introduce a large number of charset-collation codes for each
collation of 4-byte UTF-8, then we may need to spend time on refactoring the
InnoDB change buffer so that it will look up the metadata from the
Global DD instead of storing it internally.

For now, we will have to disable change buffering on secondary indexes that are
defined on affected utf8mb4 columns. (Similarly, we must disable change
buffering on DESC indexes).
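
The 8-bit limit can be made concrete with a small sketch.  It only demonstrates
the arithmetic of a one-byte field; it does not reflect the actual change
buffer record format, and the function names and sample codes are illustrative
assumptions.

```python
# Illustrative only: shows why a one-byte field cannot hold codes above 255.

CODE_BITS = 8
MAX_CODE = (1 << CODE_BITS) - 1  # largest value an 8-bit field can hold

def fits_in_field(collation_code: int) -> bool:
    """True if the code can be stored in the 8-bit field."""
    return 0 <= collation_code <= MAX_CODE

def pack_code(collation_code: int) -> bytes:
    """Pack the code into one byte, refusing values that would overflow."""
    if not fits_in_field(collation_code):
        raise OverflowError(f"code {collation_code} exceeds {MAX_CODE}")
    return collation_code.to_bytes(1, "big")

# Sample codes chosen for illustration, not taken from the server.
for code in (8, 255, 256, 300):
    print(code, fits_in_field(code))
```

Any collation whose code exceeds 255 cannot be recorded in such a field, which
is why change buffering has to be disabled for indexes on the affected columns
until the metadata is looked up elsewhere.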