WL#4164: Two-byte collation IDs

Affects: Server-5.5   —   Status: Complete

After adding UTF32, UTF16 and a new version of UTF8 (with MB4 support),
we have used up almost all charset+collation IDs which can fit into 1 byte.

We need to switch to two-byte IDs.


Character set + collation encoding
==================================
It would be convenient to encode the character set ID and the
collation ID separately within these two bytes.
This would make connectors easier to maintain, because usually only
the character set matters on the client side and the collation
does not. Adding new collations to the server would then not
cause an urgent need to recompile all connectors to understand a
new charset/collation pair. It would be enough to put the new collation
ID into the same range as all other collations for the same character set.

Charset+collation IDs can be encoded as:

Proposal 1:
- 7 bits to encode character set (128 character sets)
- 9 bits to encode collation (512 collations)

Proposal 2:
- 6 bits to encode character set (64 character sets)
- 10 bits to encode collation (1024 collations)

Proposal 3:
- 8 bits to encode character set (256 character sets)
- 8 bits to encode collation (256 collations)

Proposal 4:
Alternatively, a floating encoding can be used:
- 32 character sets with 1024 collations (32768 charset+collation pairs total)
- 64 character sets with 512 collations  (32768 charset+collation pairs total)

Floating encoding is preferable, because many character sets
have only a limited number of collations. These are the 8-bit character
sets, the East Asian character sets (Big5, GB2312, SJIS, UJIS, CP932),
and some others.

Only a few character sets can have many collations.
These are the Unicode character sets: UTF8, UCS2, UTF16, UTF32.
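
As an illustration of Proposal 1, here is a minimal C sketch that splits
a 16-bit ID into 7 bits of character set and 9 bits of collation.
The function names and the choice to put the character set into the
high bits are assumptions made for this example; the worklog does not
fix a bit order. A floating encoding (Proposal 4) would need a variable
split and is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for Proposal 1: charset in the high 7 bits,
   collation in the low 9 bits. Names are hypothetical. */
enum { COLLATION_BITS = 9, COLLATION_MASK = (1 << COLLATION_BITS) - 1 };

/* Combine a character set number and a collation number into one ID. */
static uint16_t pack_cs_coll(unsigned charset, unsigned collation)
{
  assert(charset < 128);    /* 7 bits */
  assert(collation < 512);  /* 9 bits */
  return (uint16_t) ((charset << COLLATION_BITS) | collation);
}

/* Extract the character set part of a combined ID. */
static unsigned charset_of(uint16_t id)
{
  return (unsigned) id >> COLLATION_BITS;
}

/* Extract the collation part of a combined ID. */
static unsigned collation_of(uint16_t id)
{
  return (unsigned) id & COLLATION_MASK;
}
```

With such a layout, a connector that only cares about the character set
could ignore the low 9 bits, so a new collation added inside an existing
character set's range would need no connector change.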


User-defined collations
=======================
A special ID range, 1024..2047, is reserved for user-defined collations,
which can be added by editing the PREFIX/share/charsets/*.xml files.

A dedicated user-defined ID range guarantees that IDs of built-in collations
do not conflict with IDs of user-defined collations.
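
A range check of this kind could look like the following C sketch.
The constant and function names are hypothetical; only the bounds
1024 and 2047 come from the text above.

```c
#include <assert.h>
#include <stdbool.h>

/* Bounds of the user-defined collation ID range (from the text above). */
enum { USER_COLLATION_ID_MIN = 1024, USER_COLLATION_ID_MAX = 2047 };

/* True if id falls into the range reserved for user-defined collations. */
static bool is_user_defined_collation_id(unsigned id)
{
  return id >= USER_COLLATION_ID_MIN && id <= USER_COLLATION_ID_MAX;
}

/* Debug-time guard (hypothetical): a built-in collation must never
   be registered with an ID from the user-defined range. */
static void check_builtin_collation_id(unsigned id)
{
  assert(!is_user_defined_collation_id(id));
}
```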


The affected code parts
=======================

- FRM files
- client-server protocol
- connectors
- other parts (TODO: list all other parts here)

Binary log and replication already use 2 bytes per charset+collation ID.
Most likely they won't need any changes.


Old client compatibility
========================
Upgrading the client side is usually painful because
it requires recompiling user applications.
One should be able to upgrade the server without having to upgrade the client.
So old clients should be able to understand a server with the
new charset+collation ID encoding, at least in the old
single-byte ID range.

To reach this goal, the new server can still send old IDs using a single byte,
and send only new IDs using two bytes.

A special prefix can designate a two-byte sequence.
The NUL byte 0x00 can work as this prefix.

For example:

0xAB           - old ID 171, one byte total
0x00 0x01 0x01 - new ID 257, three bytes total
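
The prefix scheme can be sketched in C as follows. This is only an
illustration: the function names are hypothetical, and the byte order
of the two-byte part (low byte first) is an assumption, since the
example ID 257 reads the same in either order. The sketch also assumes
that ID 0 must use the prefixed form, because a lone 0x00 byte is the
prefix itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write id into out (room for 3 bytes required).
   Returns the number of bytes written: 1 for old IDs, 3 otherwise. */
static size_t encode_collation_id(uint16_t id, uint8_t *out)
{
  assert(out != NULL);
  if (id >= 1 && id <= 255)
  {
    out[0]= (uint8_t) id;           /* old single-byte form */
    return 1;
  }
  out[0]= 0x00;                     /* prefix: two-byte ID follows */
  out[1]= (uint8_t) (id & 0xFF);    /* low byte first (assumption) */
  out[2]= (uint8_t) (id >> 8);
  return 3;
}

/* Read one ID from in, where len bytes are available.
   Returns the number of bytes consumed, or 0 if the input is truncated. */
static size_t decode_collation_id(const uint8_t *in, size_t len, uint16_t *id)
{
  assert(in != NULL && id != NULL);
  if (len < 1)
    return 0;
  if (in[0] != 0x00)
  {
    *id= in[0];                     /* old single-byte form */
    return 1;
  }
  if (len < 3)
    return 0;
  *id= (uint16_t) (in[1] | (in[2] << 8));
  return 3;
}
```

An old client that only ever receives IDs in the range 1..255 would see
exactly the same single byte as before.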

Other considerations during upgrade:
The version number byte of the .frm file will change.
The implementor will investigate whether the previous values of the
reused bytes were 0 or were undefined.
We also want to know whether other items, e.g. flags,
might need these spare bytes soon because
their ranges are almost completely used;
maybe the connectors people can tell us.

Time estimate by Bar
====================
FRM and protocol should not take much time.  Maybe one week.

But I'm afraid that some of the engines will be affected,
if they use only one byte for IDs.

Related code parts:

client-server protocol
======================
It seems to be safe - it already uses 2-byte IDs for columns.


client.c - client part:

MYSQL_FIELD * unpack_fields(...)
{
  ...
  field->charsetnr= uint2korr(pos); // Note, two bytes here!
  ...
}


protocol.cc - server part:

bool Protocol::send_fields()
{
  ...
  int2store(pos, field.charsetnr);
  ...
  int2store(pos, thd_charset->number);
  ...
}



client tools
============

frm format
==========
Stores table default character set and column character set.

Table default character set
---------------------------
table.cc:

File create_frm(...)
{
  ...
  fileinfo[32]=0;                             // No filename anymore

  ...

  // Table charset
  fileinfo[38]= (create_info->default_table_charset ?
                create_info->default_table_charset->number : 0);

  ...

  /* Next few bytes were for RAID support */
  fileinfo[41]= 0;
  fileinfo[42]= 0;
  fileinfo[43]= 0;
  fileinfo[44]= 0;
  fileinfo[45]= 0;
  fileinfo[46]= 0;
  ...
}

We can use one of the unused bytes above to store the high
byte of the ID.
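
A minimal C sketch of this idea follows. Offset 38 for the low byte is
taken from create_frm() above; using offset 41 for the high byte is
only one possible choice among the spare bytes and is an assumption of
this example, as are the function names.

```c
#include <assert.h>

typedef unsigned char uchar;

/* Offset 38 comes from create_frm(); 41 is a hypothetical pick
   from the former RAID bytes 41..46. */
enum { FRM_CHARSET_LOW = 38, FRM_CHARSET_HIGH = 41 };

/* Store a two-byte table charset ID into the .frm header buffer. */
static void frm_store_table_charset(uchar *fileinfo, unsigned csid)
{
  assert(csid <= 0xFFFF);
  fileinfo[FRM_CHARSET_LOW]=  (uchar) (csid & 0xFF);
  fileinfo[FRM_CHARSET_HIGH]= (uchar) (csid >> 8);
}

/* Read the table charset ID back from the .frm header buffer. */
static unsigned frm_read_table_charset(const uchar *fileinfo)
{
  return (unsigned) fileinfo[FRM_CHARSET_LOW] |
         ((unsigned) fileinfo[FRM_CHARSET_HIGH] << 8);
}
```

Because create_frm() has been writing 0 into the spare bytes, a header
written by an old server would read back with a high byte of 0, that is,
as the same one-byte ID as before.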

Column character set
--------------------

unireg.cc:

static bool pack_fields()
{
  ...
  int3store(buff+5,recpos);
  int2store(buff+8,field->pack_flag);
  int2store(buff+10,field->unireg_check); // NOTE, write two bytes here
  buff[12]= (uchar) field->interval_id;
  buff[13]= (uchar) field->sql_type; 
  ...
}
 

table.cc:


static int open_binary_frm()
{
  ...
  recpos=       uint3korr(strpos+5);
  pack_flag=    uint2korr(strpos+8);
  unireg_type=  (uint) strpos[10]; //  NOTE, read one byte here
  interval_nr=  (uint) strpos[12];
  uint comment_length=uint2korr(strpos+15);
  field_type=(enum_field_types) (uint) strpos[13];
  ...
}

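pack_fields() writes two bytes at buff+10, but open_binary_frm()
reads only strpos[10]. So strpos[11] is not really used, and it can be
reused for the high byte of the collation ID.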


handler/archive
===============

handler/blackhole
=================

handler/csv
===========

handler/falcon
==============

handler/federated
=================

handler/heap
============

handler/innobase
================

handler/maria
=============
Guilhem wrote:
>if you manage to make new MyISAM code (2-byte
>collation-ID enabled) read old MyISAM tables (with 1-byte
>collation-ID) then we can use the same solution for Maria.
>...
>any change to MyISAM we can propagate to Maria.

handler/myisam
==============
mi_open.c:

These functions need to be changed to be able
to store and read 2-byte IDs:

int mi_keyseg_write();
uchar *mi_keyseg_read();

Possibly, one of the three bytes used for keyseg->bit_start,
keyseg->bit_end and keyseg->bit_length can be used for the high byte
of a collation ID. The BIT type doesn't have a charset, so these members are
most likely equal to 0 for charset- and collation-aware data types like
CHAR, VARCHAR, TEXT, ENUM, SET.

This needs to be checked.

bit_end is only initialized, written to the MyISAM file header
in mi_create(), and loaded in mi_open(). It is not used otherwise.

handler/myisammrg
=================

handler/ndb
===========


libmysqld
=========

meta-data - SHOW and INFORMATION_SCHEMA
=======================================
fill_schema_collation - safe:
- uses "longlong" C type to send collation ID
- uses "bigint" SQL type

Other meta-data tables do not seem to have collation IDs.

replication and binary logging
==============================
Seems to be safe, already uses two bytes for collation IDs
(Thanks to Guilhem).