WL#1386: CTYPE table for unicode character sets

Affects: Server-5.1   —   Status: Complete

We should have a CTYPE table for Unicode characters.

The full-text index needs this to work more accurately.
Many Unicode space and punctuation characters are
currently treated as letters. This feature has already been
requested by users of Unicode with full-text search.

This CTYPE table will also be useful for Unicode SQL query parsers.

As this task is not very big, it is probably worth implementing
even in the 4.1.x branch.

CTYPE data for Unicode character sets should be
generated from the Unicode character database.
For example, version 5.0.0d9 can be found here:
ftp://ftp.unicode.org/Public/5.0.0/ucd/UnicodeData-5.0.0d9.txt

We're interested in the third column (the General Category).

For example:

0020;SPACE;Zs;0;WS;;;;;N;;;;;
0028;LEFT PARENTHESIS;Ps;0;ON;;;;;Y;OPENING PARENTHESIS;;;;
0029;RIGHT PARENTHESIS;Pe;0;ON;;;;;Y;CLOSING PARENTHESIS;;;;
0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041

Zs for SPACE means "Separator, Space".
Ps and Pe for the parentheses mean "Punctuation, Open" and "Punctuation, Close".
Nd for DIGIT ZERO means "Number, Decimal Digit".
Lu and Ll for LETTER A mean uppercase and lowercase letters, respectively.
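
As an illustration of the format above, here is a minimal sketch of how
one line could be split to obtain the code point and the third column.
The function name parse_unidata_line is hypothetical and not taken from
the MySQL sources. It assumes the semicolon-separated layout shown in
the examples and does no validation beyond that.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
  Hypothetical helper (not part of the MySQL sources).
  Parses one UnicodeData line and extracts:
    - the code point (field 1, hexadecimal)
    - the two-letter General Category (field 3)
  Returns 0 on success, -1 if the line is malformed.
*/
static int parse_unidata_line(const char *line,
                              unsigned long *code, char cat[3])
{
  const char *name, *catbeg, *catend;
  char *end;

  assert(line != NULL && code != NULL && cat != NULL);

  *code= strtoul(line, &end, 16);        /* field 1: hex code point   */
  if (end == line || *end != ';')
    return -1;

  name= end + 1;                         /* field 2: character name   */
  if (!(catbeg= strchr(name, ';')))
    return -1;
  catbeg++;                              /* field 3: general category */

  if (!(catend= strchr(catbeg, ';')) || catend - catbeg != 2)
    return -1;

  memcpy(cat, catbeg, 2);
  cat[2]= '\0';
  return 0;
}
```

For the LATIN CAPITAL LETTER A line above, this would yield code point
0x41 and category "Lu".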

A new tool should be written to parse a file
of this format and dump it as C code
that can be incorporated into the MySQL sources.

The parse tool itself should be added to the MySQL source tree as well,
so that ctype data can be regenerated when necessary (for example,
to update to the next Unicode version).

Unicode to MySQL ctype mapping should be done using these rules:

Letters:
  {"Lu", _MY_U},                /* Letter, Uppercase          */
  {"Ll", _MY_L},                /* Letter, Lowercase          */
  {"Lt", _MY_U},                /* Letter, Titlecase          */
  {"Lm", _MY_L},                /* Letter, Modifier           */
  {"Lo", _MY_L},                /* Letter, Other              */

Digits and numbers:
  {"Nd", _MY_NMR},              /* Number, Decimal Digit      */
  {"Nl", _MY_NMR|_MY_U|_MY_L},  /* Number, Letter             */
  {"No", _MY_NMR|_MY_PNT},      /* Number, Other              */

Marks:
  {"Mn", _MY_L|_MY_PNT},        /* Mark, Nonspacing           */
  {"Mc", _MY_L|_MY_PNT},        /* Mark, Spacing Combining    */
  {"Me", _MY_L|_MY_PNT},        /* Mark, Enclosing            */

Punctuation characters:
  {"Pc", _MY_PNT},              /* Punctuation, Connector     */
  {"Pd", _MY_PNT},              /* Punctuation, Dash          */
  {"Ps", _MY_PNT},              /* Punctuation, Open          */
  {"Pe", _MY_PNT},              /* Punctuation, Close         */
  {"Pi", _MY_PNT},              /* Punctuation, Initial quote */
  {"Pf", _MY_PNT},              /* Punctuation, Final quote   */
  {"Po", _MY_PNT},              /* Punctuation, Other         */

Symbols:
  {"Sm", _MY_PNT},              /* Symbol, Math               */
  {"Sc", _MY_PNT},              /* Symbol, Currency           */
  {"Sk", _MY_PNT},              /* Symbol, Modifier           */
  {"So", _MY_PNT},              /* Symbol, Other              */

Separators:
  {"Zs", _MY_SPC},              /* Separator, Space           */
  {"Zl", _MY_SPC},              /* Separator, Line            */
  {"Zp", _MY_SPC},              /* Separator, Paragraph       */

Other characters:
  {"Cc", _MY_CTR},              /* Other, Control             */
  {"Cf", _MY_CTR},              /* Other, Format              */
  {"Cs", _MY_CTR},              /* Other, Surrogate           */
  {"Co", _MY_CTR},              /* Other, Private Use         */
  {"Cn", _MY_CTR},              /* Other, Not Assigned        */
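
The rules above can be held in one lookup table. The following is a
sketch only: the struct and function names (cat_map,
ctype_for_category) are hypothetical, and the numeric flag values are
assumed to match the definitions in MySQL's m_ctype.h. They are defined
locally here so that the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Flag values: assumed to match MySQL's m_ctype.h */
#define _MY_U   01      /* Upper case         */
#define _MY_L   02      /* Lower case         */
#define _MY_NMR 04      /* Numeral (digit)    */
#define _MY_SPC 010     /* Spacing character  */
#define _MY_PNT 020     /* Punctuation        */
#define _MY_CTR 040     /* Control character  */

/* Hypothetical mapping entry: General Category -> ctype flags */
struct cat_map
{
  const char *cat;
  int ctype;
};

static const struct cat_map cat_to_ctype[]=
{
  {"Lu", _MY_U},  {"Ll", _MY_L},  {"Lt", _MY_U},
  {"Lm", _MY_L},  {"Lo", _MY_L},
  {"Nd", _MY_NMR},
  {"Nl", _MY_NMR|_MY_U|_MY_L},
  {"No", _MY_NMR|_MY_PNT},
  {"Mn", _MY_L|_MY_PNT}, {"Mc", _MY_L|_MY_PNT}, {"Me", _MY_L|_MY_PNT},
  {"Pc", _MY_PNT}, {"Pd", _MY_PNT}, {"Ps", _MY_PNT}, {"Pe", _MY_PNT},
  {"Pi", _MY_PNT}, {"Pf", _MY_PNT}, {"Po", _MY_PNT},
  {"Sm", _MY_PNT}, {"Sc", _MY_PNT}, {"Sk", _MY_PNT}, {"So", _MY_PNT},
  {"Zs", _MY_SPC}, {"Zl", _MY_SPC}, {"Zp", _MY_SPC},
  {"Cc", _MY_CTR}, {"Cf", _MY_CTR}, {"Cs", _MY_CTR},
  {"Co", _MY_CTR}, {"Cn", _MY_CTR}
};

/*
  Return the ctype flags for a two-letter General Category,
  or -1 if the category is not in the table.
*/
static int ctype_for_category(const char *cat)
{
  size_t i;

  assert(cat != NULL);
  for (i= 0; i < sizeof(cat_to_ctype) / sizeof(cat_to_ctype[0]); i++)
  {
    if (!strcmp(cat_to_ctype[i].cat, cat))
      return cat_to_ctype[i].ctype;
  }
  return -1;
}
```

A linear search is enough here because the table has only 30 entries
and the tool runs offline, when ctype data is regenerated.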


Note that the full ctype array for the whole Basic Multilingual Plane
(BMP) would consist of 64K elements. It is a good idea to dump
ctype data in a compact way to save memory, rather than spend 64K bytes on it.

We'll do compression as follows:
- Break the BMP into 256 pages of 256 elements each.
- If a whole page consists of elements of the same type,
  don't dump individual characters for this page.
- If a page consists of characters of mixed types, dump
  the individual characters.
- There will be an index with pointers to the 256 pages.
  Many of them will be NULL, meaning that all characters
  in that page have the same type. A non-NULL pointer
  means that the page contains characters of mixed types.
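
The page scheme above could look roughly like this in C. This is a
sketch under assumptions, not the final layout: the names
uni_ctype_page, uni_ctype_lookup and page_is_uniform are hypothetical.
The pctype field is also an assumption, added because a NULL pointer
alone does not say which type the whole page has.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page descriptor, one per 256-character page */
struct uni_ctype_page
{
  unsigned char pctype;        /* type of every character if ctype is NULL */
  const unsigned char *ctype;  /* 256 per-character types, or NULL         */
};

/* Look up the ctype flags of a BMP character through the page index */
static unsigned char
uni_ctype_lookup(const struct uni_ctype_page *index, unsigned int wc)
{
  const struct uni_ctype_page *page;

  assert(index != NULL && wc <= 0xFFFF);
  page= &index[wc >> 8];                     /* high byte selects the page */
  return page->ctype ? page->ctype[wc & 0xFF] : page->pctype;
}

/*
  Used by the dump tool: return 1 if all 256 entries of a page are
  equal, so the page can be stored as a single pctype value.
*/
static int page_is_uniform(const unsigned char *types)
{
  int i;

  assert(types != NULL);
  for (i= 1; i < 256; i++)
  {
    if (types[i] != types[0])
      return 0;
  }
  return 1;
}
```

With this layout, a uniform page costs only its index entry, and each
mixed page costs an additional 256 bytes.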