WL#13054: Add utf8mb4 binary no-pad collation
Affects: Server-8.0
—
Status: Complete
We already have a binary collation for utf8mb4, utf8mb4_bin. However, it has the PAD_SPACE attribute, so strings are compared as if they were padded with trailing spaces, which makes trailing spaces insignificant. JSON needs a binary collation for utf8mb4 that does not pad with spaces. This WL tracks the work of adding a new binary collation for utf8mb4, utf8mb4_0900_bin.
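The difference between the two pad attributes can be sketched as follows. This is a minimal illustration over raw bytes, not the server's code: the function names compare_pad_space() and compare_no_pad() are hypothetical, and the sketch assumes that PAD_SPACE behaves as if the shorter string were extended with spaces before comparing.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>

// Hypothetical helper (not a server function).
// PAD_SPACE style: compare as if the shorter string were extended with spaces.
int compare_pad_space(std::string_view a, std::string_view b) {
  const std::size_t n = std::max(a.size(), b.size());
  for (std::size_t i = 0; i < n; ++i) {
    const unsigned char ca =
        i < a.size() ? static_cast<unsigned char>(a[i]) : ' ';
    const unsigned char cb =
        i < b.size() ? static_cast<unsigned char>(b[i]) : ' ';
    if (ca != cb) return ca < cb ? -1 : 1;
  }
  return 0;
}

// Hypothetical helper (not a server function).
// NO_PAD style: plain byte comparison, so trailing spaces are significant.
int compare_no_pad(std::string_view a, std::string_view b) {
  const std::size_t n = std::min(a.size(), b.size());
  const int r = (n == 0) ? 0 : std::memcmp(a.data(), b.data(), n);
  if (r != 0) return r < 0 ? -1 : 1;
  if (a.size() == b.size()) return 0;
  return a.size() < b.size() ? -1 : 1;
}

// Usage example: the same pair of strings under both rules.
void pad_example() {
  assert(compare_pad_space("abc", "abc  ") == 0);  // trailing spaces ignored
  assert(compare_no_pad("abc", "abc  ") < 0);      // trailing spaces count
}
```

Under the NO_PAD rule, "abc" and "abc  " are different values, which is the behavior this WL wants for JSON.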
F-1: The collation shall be a compiled one.
F-2: The collation shall work for all valid Unicode code points in range [U+0, U+10FFFF].
F-3: The collation sorts Unicode code points by their code point order.
F-4: This new collation will be NO_PAD, which means it won't add trailing space.
Because we only need to change the collation so that it does not add pad spaces, we can reuse most of the code of utf8mb4_bin. There is one more thing we want to change: what is returned as a character's weight.

utf8mb4_bin returns three bytes for every character. These are the bytes of the character's Unicode code point, with leading zero bytes if the code point needs fewer than three bytes. For example, the weight bytes for U+1234 are 0x00, 0x12 and 0x34 (see my_strnxfrm_unicode_full_bin()).

We can make this simpler by returning the character's utf8mb4 encoding bytes as its weight. For example, U+1234 is encoded in utf8mb4 as 0xE1, 0x88, 0xB4, so we can give U+1234 the weight 0xE188B4. Because the utf8mb4 bytes are copied as they are, we don't need to consider endianness and don't need to do any bit shifting. And since the length of the weight is known up front, we don't need to check the boundary of the weight buffer for every byte. This gives some performance improvement.

The collating result does not change, because in utf8mb4 the first byte of a multi-byte character is always greater than its following bytes. The first byte of a utf8mb4 character can be one of the following (we only consider the possible values here, not character validity):

  0xxx xxxx  // First byte of a one-byte character, from 0x00 to 0x7F
  110x xxxx  // First byte of a two-byte character, from 0xC0 to 0xDF
  1110 xxxx  // First byte of a three-byte character, from 0xE0 to 0xEF
  1111 0xxx  // First byte of a four-byte character, from 0xF0 to 0xF7

All bytes other than the first have the same form, 10xx xxxx, with values from 0x80 to 0xBF. There is no overlap between the values of a leading byte and the values of the following bytes, and a character with a longer encoding (that is, a greater code point) has a greater leading byte. So when we compare two characters byte by byte, from the first byte to the last, the collating result is the same as if we used the Unicode code point as the character's weight.
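The two weight schemes and the ordering argument can be sketched as follows. This is a minimal illustration only: code_point_weight(), utf8_weight() and compare_by_utf8_weight() are hypothetical names, not server functions, and the sketch works on a decoded code point for clarity, whereas the approach described above would copy the bytes that are already present in the utf8mb4 string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// utf8mb4_bin style (hypothetical helper): three big-endian bytes of the
// code point, zero-padded on the left. U+1234 -> 0x00 0x12 0x34.
void code_point_weight(std::uint32_t cp, unsigned char out[3]) {
  assert(cp <= 0x10FFFF);
  out[0] = static_cast<unsigned char>((cp >> 16) & 0xFF);
  out[1] = static_cast<unsigned char>((cp >> 8) & 0xFF);
  out[2] = static_cast<unsigned char>(cp & 0xFF);
}

// Proposed style (hypothetical helper): the weight is the utf8mb4 encoding
// of the character. Returns the number of weight bytes (1 to 4).
// Like the text above, this does not check character validity.
std::size_t utf8_weight(std::uint32_t cp, unsigned char out[4]) {
  assert(cp <= 0x10FFFF);
  if (cp < 0x80) {
    out[0] = static_cast<unsigned char>(cp);
    return 1;
  }
  if (cp < 0x800) {
    out[0] = static_cast<unsigned char>(0xC0 | (cp >> 6));
    out[1] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp < 0x10000) {
    out[0] = static_cast<unsigned char>(0xE0 | (cp >> 12));
    out[1] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
    out[2] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
    return 3;
  }
  out[0] = static_cast<unsigned char>(0xF0 | (cp >> 18));
  out[1] = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
  out[2] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
  out[3] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
  return 4;
}

// Compare two characters by their utf8mb4 weight bytes, first byte first.
// Encodings of different lengths already differ in their leading byte.
int compare_by_utf8_weight(std::uint32_t a, std::uint32_t b) {
  unsigned char wa[4], wb[4];
  const std::size_t la = utf8_weight(a, wa);
  const std::size_t lb = utf8_weight(b, wb);
  const int r = std::memcmp(wa, wb, la < lb ? la : lb);
  if (r != 0) return r < 0 ? -1 : 1;
  return (la > lb) - (la < lb);
}
```

For any two code points a and b, the sign of compare_by_utf8_weight(a, b) matches the sign of comparing a and b numerically, which is why the utf8mb4 bytes can stand in for the code point as the weight.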
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.