WL#9479: Upgrade Unicode data to 9.0.0
Affects: Server-8.0
—
Status: Complete
We added the collation utf8mb4_800_ai_ci in WL#9125 and 20 language-specific collations in WL#9108. All of these collations use Unicode data version 8.0.0. However, the Unicode Consortium released Unicode 9.0.0 on June 21, 2016. To build our new collations on the latest Unicode data, we will upgrade our data and collations as well.
F-1: Make collation weight tables of new collations use latest data
F-2: Make case mapping tables of new collations use latest data
F-3: Make new collations sort characters correctly
What is the difference between Unicode 9.0.0 and 8.0.0?

1. Unicode 9.0.0 added new case mappings for 10 characters.
2. Unicode 9.0.0 added Tangut and a few lesser-used characters.
3. Unicode 9.0.0 added a few emoji characters.

What do we do to upgrade to Unicode 9.0.0?

The differences between Unicode 9.0.0 and 8.0.0 are small. To upgrade to Unicode 9.0.0, we need to:

1. Import all collation weights defined in DUCET 9.0.0 to replace the weight table we use now. The weights of many characters changed in the new DUCET. For example, in DUCET 8.0.0 the weight of 'a' is [.1BC2.0020.0002], but in DUCET 9.0.0 it is [.1C47.0020.0002].

2. Import the case mapping information defined in UnicodeData.txt and CaseFolding.txt, published by Unicode, to replace the case mapping table we use now. Nine Cyrillic characters and one Latin character have new case mappings. These characters are:

   1C80;CYRILLIC SMALL LETTER ROUNDED VE
   1C81;CYRILLIC SMALL LETTER LONG-LEGGED DE
   1C82;CYRILLIC SMALL LETTER NARROW O
   1C83;CYRILLIC SMALL LETTER WIDE ES
   1C84;CYRILLIC SMALL LETTER TALL TE
   1C85;CYRILLIC SMALL LETTER THREE-LEGGED TE
   1C86;CYRILLIC SMALL LETTER TALL HARD SIGN
   1C87;CYRILLIC SMALL LETTER TALL YAT
   1C88;CYRILLIC SMALL LETTER UNBLENDED UK
   A7AE;LATIN CAPITAL LETTER SMALL CAPITAL I

3. Add code to calculate the implicit weight of Tangut characters, because Unicode defines a special algorithm for them. All newly added Tangut characters are in the range U+17000..U+187EC. For these characters, we compose the implicit weight as [.FB00.0020.0002][.BBBB.0000.0000], where BBBB = (codepoint - 0x17000) | 0x8000.

4. Change all collation names to include the correct Unicode version, "0900". The version string changes from "800" to "0900" in anticipation of Unicode 10: with a zero-padded, four-digit version, collation names sort in the right order.

Reference: http://www.unicode.org/versions/Unicode9.0.0/
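The Tangut rule in step 3 can be illustrated with a minimal Python sketch. This is not the server implementation. The function and constant names are hypothetical, and only the range, the FB00 base, and the BBBB formula come from the text above.

```python
# Illustrative sketch only: all names here are hypothetical, not MySQL identifiers.
TANGUT_FIRST = 0x17000  # first Tangut code point, as stated in step 3
TANGUT_LAST = 0x187EC   # last Tangut code point, as stated in step 3
TANGUT_BASE = 0xFB00    # leading primary weight used for Tangut


def tangut_implicit_weight(codepoint):
    """Return two collation elements as (primary, secondary, tertiary) tuples."""
    if not TANGUT_FIRST <= codepoint <= TANGUT_LAST:
        raise ValueError("U+%04X is outside the Tangut range" % codepoint)
    # BBBB = (codepoint - 0x17000) | 0x8000
    bbbb = (codepoint - TANGUT_FIRST) | 0x8000
    return [(TANGUT_BASE, 0x0020, 0x0002), (bbbb, 0x0000, 0x0000)]


def format_weight(elements):
    """Render collation elements in DUCET-style bracket notation."""
    return "".join("[.%04X.%04X.%04X]" % ce for ce in elements)


if __name__ == "__main__":
    print(format_weight(tangut_implicit_weight(0x17000)))
```

Because the offset from 0x17000 never exceeds 0x17EC within this range, setting the high bit with `| 0x8000` keeps every second primary weight within 16 bits.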
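The naming rationale in step 4 can be checked with a short sketch. It assumes collation names are compared as plain strings. The helper `collation_name` and the Unicode 10 name are hypothetical and appear only to show the ordering effect of zero-padding.

```python
# Illustrative sketch only: collation_name is a hypothetical helper, and the
# Unicode 10 names it produces are not actual MySQL collations.
def collation_name(unicode_major, pad):
    """Build a collation name for a Unicode major version, optionally zero-padded."""
    tag = "%d00" % unicode_major
    if pad:
        tag = tag.zfill(4)  # e.g. "900" becomes "0900"
    return "utf8mb4_%s_ai_ci" % tag


# Plain string comparison is assumed as the sort order.
unpadded = sorted(collation_name(v, pad=False) for v in (9, 10))
padded = sorted(collation_name(v, pad=True) for v in (9, 10))

if __name__ == "__main__":
    print(unpadded)
    print(padded)
```

Without padding, the Unicode 10 name sorts before the Unicode 9 name because '1' compares lower than '9'. With four-digit padding, the names sort in version order.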
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.