WL#1386: CTYPE table for unicode character sets
Affects: Server-5.1
—
Status: Complete
We should have a CTYPE table for Unicode characters. This would let the full-text index work more accurately: many Unicode space and punctuation characters are currently treated as letters. This feature has already been requested by users who combine Unicode with full-text search. The CTYPE table will also be useful for Unicode SQL query parsers. As this task is not very big, it's probably worth implementing even in the 4.1.x branch.
CTYPE data for Unicode character sets should be generated from the Unicode Character Database. For example, version 5.0.0d9 can be found here:

  ftp://ftp.unicode.org/Public/5.0.0/ucd/UnicodeData-5.0.0d9.txt

We're interested in the third column (the General Category). For example:

  0020;SPACE;Zs;0;WS;;;;;N;;;;;
  0028;LEFT PARENTHESIS;Ps;0;ON;;;;;Y;OPENING PARENTHESIS;;;;
  0029;RIGHT PARENTHESIS;Pe;0;ON;;;;;Y;CLOSING PARENTHESIS;;;;
  0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
  0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061
  0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041

Zs in SPACE means "Separator, space".
Ps and Pe in PARENTHESIS mean "Punctuation" (open and close).
Nd in DIGIT ZERO means "Number, decimal digit".
Lu and Ll in LETTER A mean uppercase and lowercase letters, respectively.

A new tool should be written to parse a file of this format and dump it as C code that can be incorporated into the MySQL sources. The parse tool itself should also be added to the MySQL source tree, so that the ctype data can be regenerated when necessary (for example, to update to the next Unicode version).
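The parsing step of such a tool can be sketched as follows. This is only an illustration, not the tool's actual code: parse_ucd_line is a hypothetical helper name, and the sketch assumes each input line holds semicolon-separated fields with a hexadecimal code point in the first field and a two-letter category in the third, as in the examples above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
  Extract the code point (field 1) and the two-letter category (field 3)
  from one line of UnicodeData.txt.
  Returns 0 on success, -1 if the line is malformed.
  parse_ucd_line is a hypothetical name, not an existing MySQL function.
*/
static int parse_ucd_line(const char *line, unsigned long *code, char cat[3])
{
  const char *name, *gc, *gc_end;
  char *end;

  assert(line != NULL && code != NULL && cat != NULL);

  /* Field 1: hexadecimal code point, terminated by ';' */
  *code= strtoul(line, &end, 16);
  if (end == line || *end != ';')
    return -1;

  /* Field 2: character name; skip to the next ';' */
  name= end + 1;
  gc= strchr(name, ';');
  if (gc == NULL)
    return -1;
  gc++;

  /* Field 3: category, expected to be exactly two characters */
  gc_end= strchr(gc, ';');
  if (gc_end == NULL || gc_end - gc != 2)
    return -1;
  cat[0]= gc[0];
  cat[1]= gc[1];
  cat[2]= '\0';
  return 0;
}
```

A caller would read the file line by line, pass each line to parse_ucd_line, and then translate the returned category string into a ctype value.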
Unicode to MySQL ctype mapping should be done using these rules:

Letters:
  {"Lu", _MY_U},               /* Letter, Uppercase */
  {"Ll", _MY_L},               /* Letter, Lowercase */
  {"Lt", _MY_U},               /* Letter, Titlecase */
  {"Lm", _MY_L},               /* Letter, Modifier */
  {"Lo", _MY_L},               /* Letter, Other */

Digits and numbers:
  {"Nd", _MY_NMR},             /* Number, Decimal Digit */
  {"Nl", _MY_NMR|_MY_U|_MY_L}, /* Number, Letter */
  {"No", _MY_NMR|_MY_PNT},     /* Number, Other */

Marks:
  {"Mn", _MY_L|_MY_PNT},       /* Mark, Nonspacing */
  {"Mc", _MY_L|_MY_PNT},       /* Mark, Spacing Combining */
  {"Me", _MY_L|_MY_PNT},       /* Mark, Enclosing */

Punctuation characters:
  {"Pc", _MY_PNT},             /* Punctuation, Connector */
  {"Pd", _MY_PNT},             /* Punctuation, Dash */
  {"Ps", _MY_PNT},             /* Punctuation, Open */
  {"Pe", _MY_PNT},             /* Punctuation, Close */
  {"Pi", _MY_PNT},             /* Punctuation, Initial quote */
  {"Pf", _MY_PNT},             /* Punctuation, Final quote */
  {"Po", _MY_PNT},             /* Punctuation, Other */

Symbols:
  {"Sm", _MY_PNT},             /* Symbol, Math */
  {"Sc", _MY_PNT},             /* Symbol, Currency */
  {"Sk", _MY_PNT},             /* Symbol, Modifier */
  {"So", _MY_PNT},             /* Symbol, Other */

Separators:
  {"Zs", _MY_SPC},             /* Separator, Space */
  {"Zl", _MY_SPC},             /* Separator, Line */
  {"Zp", _MY_SPC},             /* Separator, Paragraph */

Other characters:
  {"Cc", _MY_CTR},             /* Other, Control */
  {"Cf", _MY_CTR},             /* Other, Format */
  {"Cs", _MY_CTR},             /* Other, Surrogate */
  {"Co", _MY_CTR},             /* Other, Private Use */
  {"Cn", _MY_CTR},             /* Other, Not Assigned */

Note that the full ctype array for the whole Basic Multilingual Plane (BMP) would consist of 64K elements. It is a good idea to dump the ctype data in a smart way, to save memory rather than waste 64K bytes. We'll compress it as follows:

- break the BMP into 256 pages of 256 elements each
- if a whole page consists of elements of the same type, don't dump the individual characters in this page
- if a page consists of characters of mixed types, dump the individual characters
- there will be an index with pointers to the 256 pages. Many of them will be NULL, meaning that all characters in that page have the same type. A non-NULL pointer means that the page contains characters of mixed types.
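The page scheme above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the generated code itself: the names ctype_page, build_page_index and lookup_ctype are hypothetical, and storing a per-page default type next to the pointer is an assumption about how the shared type of a uniform page would be recorded. In this sketch mixed pages simply point into the flat 64K array, whereas the real tool would dump them as separate C arrays.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_COUNT 256
#define PAGE_SIZE  256

/* Hypothetical index entry: one per 256-character page. */
typedef struct
{
  unsigned char default_ctype; /* type of every character if chars is NULL */
  const unsigned char *chars;  /* per-character types, NULL for a uniform page */
} ctype_page;

/*
  Fill pages[] from a flat array of 64K ctype bytes.
  Returns the number of mixed pages, i.e. pages that need their own data.
*/
static int build_page_index(const unsigned char *flat, ctype_page *pages)
{
  int page, i, mixed= 0;
  for (page= 0; page < PAGE_COUNT; page++)
  {
    const unsigned char *p= flat + page * PAGE_SIZE;
    pages[page].default_ctype= p[0];
    pages[page].chars= NULL;
    for (i= 1; i < PAGE_SIZE; i++)
    {
      if (p[i] != p[0])
      {
        /* At least two different types: keep the individual characters */
        pages[page].chars= p;
        mixed++;
        break;
      }
    }
  }
  return mixed;
}

/* Look up the ctype of a BMP code point through the page index. */
static unsigned char lookup_ctype(const ctype_page *pages, unsigned int code)
{
  const ctype_page *pg;
  assert(code <= 0xFFFF);
  pg= &pages[code >> 8];
  return pg->chars ? pg->chars[code & 0xFF] : pg->default_ctype;
}
```

With this layout the memory cost is one small index entry per page plus 256 bytes for each mixed page, so large uniform ranges such as unassigned or private-use pages cost almost nothing.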