WL#3759: Optimize identifier conversion in client-server protocol

Affects: Server-5.5 — Status: Complete


Since 4.1, we use utf8 to store identifiers on disk,
in memory, for lookups, for comparison, and so on.
The move to utf8 was a consequence of introducing
multiple character set support under the same
server, in the same database, in the same table,
or even in the same SQL statement: the character set
of identifiers must be a superset of all supported
character sets.

Tests with the "valgrind --cachegrind" profiler detected
some performance degradation between mysqld versions 4.0 and 4.1.
The source of the slowdown is the latin1->utf8->latin1
identifier conversion.

A test client program using the latin1 client character set
sent "SELECT a FROM t1" 100000 times against an
empty heap table:

CREATE TABLE t1 (a int NOT NULL) TYPE=HEAP;

Version 4.1 generated 1,813,494,123 work units on the mysqld side,
while version 4.0 produced only 1,393,268,776.
4.0 needed about 23% fewer work units for this kind of client application
(equivalently, 4.1 needed about 30% more).

There were 420,225,347 extra work units, the most significant being:

     74,502,664 sql_string.cc:String::copy()
     43,901,866 ctype-utf8.c:my_utf8_uni
     42,902,112 ctype-latin1.c:my_wc_mb_latin1
     34,800,812 protocol.cc:Protocol::store_string_aux
     21,600,676 charset.c:my_charset_same
      3,000,060 ctype-utf8.c:my_ismbchar_utf8
      6,000,000 protocol.cc:Protocol::send_fields

This is because the utf8->latin1 conversion is done
during Protocol::send_fields().

This is a very unpleasant performance degradation, especially
for users who want only a single character set (as in 4.0).

WL#1898 proposes compiling a "light" version of mysqld
with a single character set, which means that
no character set conversion is necessary at all, and performance
should return closer to that of 4.0.

However, even in the "full" version, we can improve performance
significantly. In many cases "full featured" conversion
is not really necessary. For example, the test program was
using pure ASCII identifiers, which are encoded identically in
utf8 and latin1.

We can optimize the code for cases like utf8->latin1,
and even for some multi-byte character sets, for example utf8->gbk.

Typical conversion scenarios and the ways of their optimization
===============================================================

1. Conversion from utf8 to an 8-bit character set:
quickly copy the sequence of leading 7-bit (ASCII) values until
the end of the string is reached - then exit,
or until an 8-bit value is met - then switch to the loop
"get utf8 character -> put 8-bit character".

2. Conversion from utf8 to an ASCII-based multi-byte
character set (i.e. with mbminlen=1):
quickly copy the sequence of leading 7-bit (ASCII) values until
the end of the string is reached - then exit,
or until an 8-bit value is met - then switch to the
loop "get utf8 character -> put multi-byte character".

3. Conversion from utf8 to a non-ASCII-based multi-byte
character set (e.g. with mbminlen>1, as in ucs2):
use the traditional (non-optimized) loop
"get utf8 character -> put multi-byte character".

Other optimization ideas
========================

- mb_wc_quick():
  A new function can be added to MY_CHARSET_HANDLER.
  It will work almost like mb_wc, but won't check whether the destination
  string has enough space, assuming that the caller allocated enough
  space for the destination string before calling the conversion routines.
  It will therefore be faster than mb_wc.
  This is meant to optimize conversion of non-ASCII characters in identifiers.

- Cache data in THD:
  Some data can be cached in the THD structure whenever
  thd->variables.character_set_results is changed.
  The cached data can contain pointers to functions,
  for example identifier_to_client_converter(),
  or some flags, for example
  "bool quickly_do_ascii_characters_for_identifiers".
  This needs further investigation.

- Four bytes at once:
  Copying of the "leading pure ASCII part" can be implemented
  using a "copy four bytes at once" approach. This is possible
  on i386 platforms, because the i386 processor allows
  non-aligned data to be read as a 32-bit integer.
  This needs further investigation, because it may slow down
  conversion of short identifiers with a length of 1 to 3 bytes.

- Assembler:
  Copying of the "leading pure ASCII part" can be written in assembler,
  at least for i386.
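
The "four bytes at once" idea can be sketched as follows. This is only an
illustration under stated assumptions, not server code: the function name
copy_leading_ascii() is hypothetical, and it uses memcpy() for the 32-bit
load instead of casting the pointer, so that it stays well-defined on
platforms that do not allow unaligned reads. Whether it actually helps for
short identifiers would still have to be measured.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copies the leading pure-ASCII part of src into dst
// and returns how many bytes were copied. dst must have room for len bytes.
std::size_t copy_leading_ascii(char *dst, const char *src, std::size_t len)
{
  std::size_t i = 0;

  // Four bytes at a time: stop at the first word with any high bit set.
  // The mask is the same for every byte, so byte order does not matter.
  while (len - i >= 4)
  {
    std::uint32_t word;
    std::memcpy(&word, src + i, 4);   // unaligned-safe 32-bit load
    if (word & 0x80808080u)
      break;
    std::memcpy(dst + i, &word, 4);
    i += 4;
  }

  // Remaining bytes (including a word that held an 8-bit value):
  // copy byte by byte until the end or the first 8-bit value.
  while (i < len && static_cast<unsigned char>(src[i]) < 0x80)
  {
    dst[i] = src[i];
    i++;
  }
  return i;
}
```

The return value tells the caller where the regular
"get utf8 character -> put character" loop has to take over; if it equals
len, the whole identifier was ASCII and no further conversion is needed.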

1. A new method for the "Protocol" class will be added:

bool Protocol::net_store_data(const char *from, uint length,
                              CHARSET_INFO *from_cs, CHARSET_INFO *to_cs)

It stores text with character set conversion. It will
quickly copy ASCII-compatible leading characters,
as described in scenarios 1 and 2.
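
The fast path of such a method can be sketched for the utf8->latin1 case
(scenario 1). This is a minimal illustration, not the server's code: the
function name utf8_to_latin1_sketch() is hypothetical, it works on
std::string instead of the protocol packet buffer, it treats latin1 as
plain ISO-8859-1 (a simplification), and it replaces anything it cannot
represent with '?'.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, not a MySQL function: converts UTF-8 to latin1.
// Unrepresentable characters and malformed input become '?'.
std::string utf8_to_latin1_sketch(const std::string &src)
{
  std::string dst;
  dst.reserve(src.size());   // the output is never longer than the input

  std::size_t i = 0;
  const std::size_t n = src.size();

  // Fast path: copy the leading run of 7-bit (ASCII) bytes unchanged.
  while (i < n && static_cast<unsigned char>(src[i]) < 0x80)
    dst.push_back(src[i++]);

  // Slow path: "get utf8 character -> put 8-bit character" for the rest.
  while (i < n)
  {
    const unsigned char c = static_cast<unsigned char>(src[i]);
    if (c < 0x80)
    {
      dst.push_back(static_cast<char>(c));
      i++;
    }
    else if ((c & 0xE0) == 0xC0 && i + 1 < n &&
             (static_cast<unsigned char>(src[i + 1]) & 0xC0) == 0x80)
    {
      // Two-byte sequence: decode it and keep it if it fits into 8 bits.
      const std::uint32_t wc =
          ((c & 0x1Fu) << 6) |
          (static_cast<unsigned char>(src[i + 1]) & 0x3Fu);
      dst.push_back(wc >= 0x80 && wc <= 0xFF
                        ? static_cast<char>(static_cast<unsigned char>(wc))
                        : '?');
      i += 2;
    }
    else
    {
      // Longer sequences are outside latin1; stray bytes are replaced too.
      dst.push_back('?');
      i++;
      while (i < n && (static_cast<unsigned char>(src[i]) & 0xC0) == 0x80)
        i++;                 // skip the continuation bytes
    }
  }
  return dst;
}
```

For a pure ASCII identifier such as "t1" only the first loop runs, which is
the case the profiled test program exercised.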


2. Protocol::store_string_aux() will be extended:

2a. It will detect the cases when fast conversion is
    possible and call the new method.

2b. If fast conversion is not possible,
    it will use the old slow method, with "convert"
    as intermediate storage for the converted data:
    {
      uint dummy_errors;
      return convert->copy(from, length, fromcs, tocs, &dummy_errors) ||
             net_store_data(convert->ptr(), convert->length());
    }
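
The decision in 2a/2b could look roughly like the sketch below. Everything
in it is an assumption made for illustration: CharsetSketch is a
hypothetical, heavily reduced stand-in for CHARSET_INFO, the
ascii_compatible flag and the StorePath enum do not exist in the server,
and the "no conversion" branch (no target character set, or the same one)
is not spelled out in this worklog.

```cpp
#include <cstring>

// Hypothetical, simplified stand-in for CHARSET_INFO; only the fields
// this sketch needs.
struct CharsetSketch
{
  const char *name;
  unsigned mbminlen;        // minimum number of bytes per character
  bool ascii_compatible;    // bytes 0x00..0x7F mean the same as in ASCII
};

enum class StorePath { NoConversion, FastAsciiPrefix, SlowViaBuffer };

// Decides which branch a store_string_aux()-like function would take.
StorePath choose_store_path(const CharsetSketch *from_cs,
                            const CharsetSketch *to_cs)
{
  // Assumption: no target character set, or the same one on both sides,
  // means the bytes are sent as they are.
  if (to_cs == nullptr || std::strcmp(from_cs->name, to_cs->name) == 0)
    return StorePath::NoConversion;

  // Scenarios 1 and 2: both sides are ASCII-based with mbminlen=1,
  // so a leading run of 7-bit bytes can be copied verbatim (2a).
  if (from_cs->ascii_compatible && to_cs->ascii_compatible &&
      from_cs->mbminlen == 1 && to_cs->mbminlen == 1)
    return StorePath::FastAsciiPrefix;

  // Scenario 3 (e.g. ucs2): convert through intermediate storage (2b).
  return StorePath::SlowViaBuffer;
}
```

With this check, utf8->latin1 and utf8->gbk take the fast path, while
utf8->ucs2 falls back to the old method.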