WL#6607: InnoDB FULLTEXT SEARCH: CJK support

Status: Complete

This is the placeholder for CJK (Chinese, Japanese and Korean) support for
InnoDB FULLTEXT search.

1) N-GRAM support
The problem with CJK languages is that a single word can consist of multiple
characters, and there are no fixed separators between individual words. So the
idea is to index contiguous groups of N characters, or n-grams, instead.

This n-gram tokenization is needed during the tokenization phase, which
eventually generates the tokens in the FTS index. It is also needed when we
parse the search phrase, in order to find the matching words.

The n-gram parser is relatively simple to implement: it simply chops phrases
into segments of a fixed size. For example, if the phrase is "abcd", the tokens
after 2-gram segmentation are "ab", "bc", "cd".
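
For illustration, here is a minimal standalone sketch of this segmentation over
UTF-8 text (our own example code, not InnoDB's implementation; ngram_split and
utf8_char_len are illustrative names):

	#include <string>
	#include <vector>

	/* Byte length of the UTF-8 character starting with byte c
	(a minimal decoder, for illustration only). */
	static size_t utf8_char_len(unsigned char c) {
		if (c < 0x80)        return 1;
		if ((c >> 5) == 0x6) return 2;
		if ((c >> 4) == 0xE) return 3;
		return 4;
	}

	/* Split a UTF-8 document into character-based n-grams. */
	std::vector<std::string> ngram_split(const std::string& doc, size_t n) {
		/* Collect the byte offset of each character start. */
		std::vector<size_t> off;
		for (size_t i = 0; i < doc.size(); i += utf8_char_len(doc[i]))
			off.push_back(i);
		off.push_back(doc.size());

		/* Emit every window of n consecutive characters. */
		std::vector<std::string> grams;
		for (size_t i = 0; i + n < off.size(); ++i)
			grams.push_back(doc.substr(off[i], off[i + n] - off[i]));
		return grams;
	}

	/* ngram_split("abcd", 2) yields {"ab", "bc", "cd"}. */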

2) MeCab support
MeCab is a part-of-speech and morphological analyzer for Japanese
(http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html). It is used as a
tokenizer by groonga (groonga.org) and mroonga (mroonga.org), a MySQL storage
engine for fulltext search.

We need to build a fulltext plugin parser based on MeCab. MeCab only supports
three Japanese charsets: ujis, sjis, and utf8.

The mecab parser is superior in time, space and precision, the fraction of
retrieved documents that are relevant. On the other hand, the ngram parser is
superior in recall, the fraction of relevant documents that are retrieved.

3) Stopword

The default stopword list is utf-8 based, and none of its entries are CJK
words. Although users can add their own stopwords in stopword tables, we will
need to:
1) Consider the need for a default CJK stopword list
2) Test and make sure a user-supplied stopword table can support CJK

4) Default Setting
MySQL supports the following CJK character sets and collations:

big5     | Big5 Traditional Chinese        | big5_chinese_ci
gb2312   | GB2312 Simplified Chinese       | gb2312_chinese_ci
ujis     | EUC-JP Japanese                 | ujis_japanese_ci
sjis     | Shift-JIS Japanese              | sjis_japanese_ci
euckr    | EUC-KR Korean                   | euckr_korean_ci
cp932    | SJIS for Windows Japanese       | cp932_japanese_ci
eucjpms  | UJIS for Windows Japanese       | eucjpms_japanese_ci
gbk      | GBK Simplified Chinese          | gbk_chinese_ci
gb18030  | China National Standard GB18030 | gb18030_chinese_ci

mysql> SHOW COLLATION LIKE '%chinese%';
+--------------------+---------+-----+---------+----------+---------+
| Collation          | Charset | Id  | Default | Compiled | Sortlen |
+--------------------+---------+-----+---------+----------+---------+
| big5_chinese_ci    | big5    |   1 | Yes     | Yes      |       1 |
| gb2312_chinese_ci  | gb2312  |  24 | Yes     | Yes      |       1 |
| gbk_chinese_ci     | gbk     |  28 | Yes     | Yes      |       1 |
| gb18030_chinese_ci | gb18030 | 248 | Yes     | Yes      |       2 |
+--------------------+---------+-----+---------+----------+---------+

mysql> SHOW COLLATION LIKE '%japan%';
+---------------------+---------+----+---------+----------+---------+
| Collation           | Charset | Id | Default | Compiled | Sortlen |
+---------------------+---------+----+---------+----------+---------+
| ujis_japanese_ci    | ujis    | 12 | Yes     | Yes      |       1 |
| sjis_japanese_ci    | sjis    | 13 | Yes     | Yes      |       1 |
| cp932_japanese_ci   | cp932   | 95 | Yes     | Yes      |       1 |
| eucjpms_japanese_ci | eucjpms | 97 | Yes     | Yes      |       1 |
+---------------------+---------+----+---------+----------+---------+

mysql> SHOW COLLATION LIKE '%korea%';
+-----------------+---------+----+---------+----------+---------+
| Collation       | Charset | Id | Default | Compiled | Sortlen |
+-----------------+---------+----+---------+----------+---------+
| euckr_korean_ci | euckr   | 19 | Yes     | Yes      |       1 |
+-----------------+---------+----+---------+----------+---------+


When one of these character sets and collations is specified for the column,
the FTS index switches to the following behavior BY DEFAULT:

1) Disable the minimum token length requirement, i.e. make it 0
2) Use the CJK ngram parser
3) Use an appropriate default stopword set.

Note:
- Queries with the '-' operator in boolean mode are expected to be run with a
NOT MATCH clause,
e.g. write NOT MATCH (page_title) AGAINST ("-曖昧 -さ 回" IN BOOLEAN MODE)
INSTEAD OF MATCH (page_title) AGAINST ("-曖昧 -さ 回" IN BOOLEAN MODE)

- Search terms containing both numeric and CJK characters are not matched, e.g.
MATCH (page_title) AGAINST ('2年' IN BOOLEAN MODE);

- If a CJK fulltext index uses ngram as its default parser, but the ngram
parser is not loaded due to a misconfiguration (a rare case), or the index is
created while the ngram parser is not loaded and is later used with it loaded,
we will get inconsistent results, because in both cases the existing data and
the query string are parsed by different parsers.

Functional requirements:
F-1: InnoDB should provide a default fulltext parser for CJK.
F-2: InnoDB should provide a mecab parser for Japanese.

Non-Functional requirements:
NF-1: Users can also have their own fulltext parsers for CJK.
NF-2: Implicit requirements: no new SQL needed; works on all platforms; does
not break replication, backup, partitioning, FK, or any other existing
features.

Problem Statement:
------------------

Write a built-in ngram fulltext plugin parser to support CJK in InnoDB.

Let's review the current implementation in InnoDB:
- It uses space as the delimiter to tokenize documents by default.
  Languages like CJK don't have such delimiters, so this doesn't work for CJK.

- Users can write their own plugin parsers to support CJK.
  There are several issues, even though several open-source plugins exist:
  a. Users need a good understanding of our fts plugin parser framework, and
     certain programming skills.
  b. Users must manage plugin installation and uninstallation.
  c. Users must take care of the plugin library when we do a system backup or
     upgrade, since these plugin parsers are not part of MySQL.

High Level Approach Description:
--------------------------------

- Write an ngram fulltext plugin parser.
We decided to use ngram, with bigram as the default, because bigram is popular
and sufficient for CJK, after investigating the CJK support of products like
Sphinx, Lucene, Sybase, SQL Server, DB2, and Oracle Text.

The ngram parser has the following properties and features.
1. Character based.
Take "abc def" for example: we get "ab", "bc", "de" and "ef". A word-based
bigram yields only one token: "abc def".

2. Built-in plugin.
It's loaded when the server starts and unloaded when the server shuts down,
just like any other built-in plugin.
It's a library, and will be placed in "mysql-install-dir/lib/plugin".

3. Named "ngram".
We can use it as below:
 CREATE TABLE t1
  (
    id INT AUTO_INCREMENT PRIMARY KEY,
    doc CHAR(255),
    FULLTEXT INDEX (doc) WITH PARSER ngram
  ) ENGINE=InnoDB;

 ALTER TABLE articles ADD FULLTEXT INDEX (body) WITH PARSER ngram;

 CREATE FULLTEXT INDEX ft_index ON articles(body) WITH PARSER ngram;

4. Ngram Token Size Configuration
  We have a configuration variable named "ngram_token_size", and its default
value is 2. We can change its value to any number between 1 and 10.
  We ignore fts_min_token_size & fts_max_token_size checking for ngram tokens.
(The minimum value of fts_max_token_size is 10, which is why 10 is the upper
bound for ngram_token_size.)
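
  For example, assuming the variable keeps this name, it could be set at server
startup through the option file:

	[mysqld]
	ngram_token_size=2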

5. Space Elimination
  Spaces are eliminated during parsing.

  For example:
  a. 'ab cd' is parsed to 'ab', 'cd'.
  b. 'a bc' is parsed to 'bc' (the lone 'a' is shorter than the token size and
is dropped).

6. Stopword Handling
  We can put overly frequent, meaningless characters and words in the stopword
list, and also punctuation.

  Stopwords are important in two aspects:
  a. Reducing index size, especially for ngram.
  b. Improving query performance and accuracy.

  We usually check stopwords after tokenization: we compare a token with the
stopwords, and if it is equal to a stopword, we do not index it. But for ngram,
we should use "contains" rather than "equals" to handle stopwords. For example,
if we take ',' as a stopword and a document is 'a,b', then we get 'a,' and
',b'. If we use "equals", neither bigram is a stopword. That's not what we
want: 'a,' and ',b' should be filtered out, because ',' is a stopword.

  The other way is to handle stopwords inside the ngram parser. We would need
to pass the stopword list in, and it makes the internal logic a little
complicated.

  So we prefer the former way to handle stopwords.

7. Query Processing
  Space is still the separator of terms in the query string.

  The examples below use bigram.
  a. Term Search
     (1) Natural Language Mode
     Term search is converted to a union of ngram term searches. e.g. 'abc' is
equivalent to 'ab bc'.

     (2) Boolean Mode
     Term search is converted to an ngrammed phrase search. e.g. 'abc' is
equivalent to '"ab bc"'.

  b. Wildcard Search
     Wildcard searches can return unexpected results, since an ngram text index
contains only n-grams and no information about the beginning of terms. The
following behaviors should be noted:

    (1) If the prefix term is shorter than the ngram token size, the query
returns all indexed rows that contain n-grams starting with the prefix term.

    (2) If the prefix term is longer than the ngram token size, the prefix term
is converted to an n-grammed phrase and the wildcard is dropped. For example:
'abc*' is equivalent to '"ab bc"'.

  c. Phrase Search
     Phrase search is converted to n-grammed phrase search.
     For example:
     (1) Phrase "abc" is converted to "ab bc". Docs "abc" and "ab bc" match.
     (2) Phrase "abc def" is converted to "ab bc de ef". Docs "abc def" and
"ab bc de ef" match, but 'abcdef' doesn't match.

- Write a mecab fulltext plugin parser.
1. Word based.
The mecab parser tokenizes documents into meaningful words. e.g.
'データベース管理' is tokenized to 'データベース' and '管理'.

2. Named "mecab".
To use the mecab plugin parser, we need mecab & mecab-ipadic installed.
We may distribute mecab-ipadic with the server package on some platforms. In
the server package we have a directory named 'mecab' under /lib/plugin, and we
package three dictionaries in it: /mecab/dic/ipadic_ujis,
/mecab/dic/ipadic_sjis, and /mecab/dic/ipadic_utf8. We link the mecab library
statically (linking libmecab.a).

On other platforms, mecab & mecab-ipadic need to be installed separately.

a. Install MeCab Packages
We package mecab & mecab-ipadic into the MySQL binary for those platforms
where mecab has no binary distribution of its own.

Note: we only support mecab 0.993 & 0.996 now.

We have two ways to install mecab & mecab-ipadic.

1) Installing mecab & mecab-ipadic from a Binary Distribution
On a Fedora host, use yum:
yum install mecab-devel

On a Debian or Ubuntu host, use apt-get:
apt-get install mecab
apt-get install mecab-ipadic

2) Building mecab from Source (not suggested)
http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html#install-unix

Note: we support mecab-0.996 (the latest version now).

Download mecab-0.996.tar.gz from
http://code.google.com/p/mecab/downloads/detail?name=mecab-0.996.tar.gz&can=2&q=
and mecab-ipadic-2.7.0-20070801.tar.gz from
http://code.google.com/p/mecab/downloads/detail?name=mecab-ipadic-2.7.0-20070801.tar.gz&can=2&q=

The default installation script is below:
tar zxfv mecab-0.996.tar.gz
cd mecab-0.996
./configure
make
make check
su
make install

tar zxfv mecab-ipadic-2.7.0-20070801.tar.gz
cd mecab-ipadic-2.7.0-20070801
./configure
make
su
make install 

Note:
An important option when configuring mecab-ipadic is:
--with-charset=charset  set the default charset (euc-jp/sjis/utf-8)

b. Compile Mecab Parser.
To build the mecab parser ourselves, we need to use the 'WITH_MECAB' option
for cmake.

Set WITH_MECAB to 'system' when mecab is installed in the default location,
e.g. /usr/local on Linux. Otherwise, set WITH_MECAB to the customized
installation directory of mecab, e.g. cmake -DWITH_MECAB=/opt/mecab. We package
mecab & mecab-ipadic in this case.

c. Run Mecab Parser
We can use the statements below to load and unload the mecab plugin:
INSTALL PLUGIN mecab SONAME 'libpluginmecab.so';
UNINSTALL PLUGIN mecab;

Note:
To load mecab when the server starts, refer to the '--plugin-load' option in
the server options.

We can use the following statements to test whether the mecab parser is loaded
successfully:

 CREATE TABLE t1
  (
    id INT AUTO_INCREMENT PRIMARY KEY,
    doc CHAR(255),
    FULLTEXT INDEX (doc) WITH PARSER mecab
  ) ENGINE=InnoDB DEFAULT CHARACTER SET utf8;

 ALTER TABLE articles ADD FULLTEXT INDEX (body) WITH PARSER mecab;

 CREATE FULLTEXT INDEX ft_index ON articles(body) WITH PARSER mecab;

3. Mecab Status Variables and System Variables
We have a status variable "mecab_charset" that shows the charset supported by
mecab:

SHOW STATUS LIKE 'mecab_charset';
Variable_name	Value
mecab_charset	utf8

We have two system variables:
a. mecab_ipadic_charset
It's effective only when we distribute mecab with the server package. We
package three different dictionaries, so we can set this variable to tell the
mecab parser which dictionary to load.

The valid values are: euc-jp, sjis, utf-8; euc-jp is the default.

We search for '/plugin_dir/mecab/dic/ipadic-charset/'. If we can't find it, we
search for '/plugin_dir/mecab/dic/ipadic/' instead.

b. mecab_rc_file_path
mecabrc is a configuration file of mecab. We can find mecabrc in
/path/to/mecab/etc or /usr/local/etc if we install mecab ourselves, or in
/path/to/mysql/data if we distribute mecab with the server package (in this
case, we overwrite the mecabrc file when the mecab parser is loaded, if
mecab_rc_file_path is not set).

We can set mecab_rc_file_path to tell mecab to use a customized mecabrc file.
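
For example, assuming these variable names, both system variables could be set
through the option file (the values are illustrative):

	[mysqld]
	mecab_ipadic_charset=utf-8
	mecab_rc_file_path=/usr/local/etc/mecabrc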

4. Query Processing
Space is still the separator of terms in the query string.

'データベース管理' is tokenized into two tokens, 'データベース' and '管理', by mecab.

  a. Term Search
     (1) Natural Language Mode
     Term search is converted to a union of token searches. e.g.
'データベース管理' is equivalent to 'データベース 管理'.

     (2) Boolean Mode
     Term search is converted to phrase search. e.g. 'データベース管理' is
equivalent to '"データベース 管理"'.

  b. Wildcard Search
     We don't tokenize the term of a wildcard search. e.g. for
'データベース管理*' we search for the prefix 'データベース管理', which may
return no results.

  c. Phrase Search
     The phrase is tokenized. For example: "データベース管理" is equivalent to
"データベース 管理".


========================================================================

Ngram parser users:

- Users want to use fulltext search for CJK and other languages without word
delimiters.

Query Parser Implementation
---------------------------
   Just like fts0plugin.cc, we reuse the function 'ft_get_word' and related
macros from MyISAM (used by both the ngram and mecab parsers):
	uchar
	query_get_word(
	/*===========*/
		const CHARSET_INFO*	cs,	/*!< in: charset */
		uchar**			start,	/*!< in/out: doc start ptr */
		uchar*			end,	/*!< in/out: doc end ptr */
		FT_WORD*		word,	/*!< in/out: token */
		MYSQL_FTPARSER_BOOLEAN_INFO*	info)	/*!< in/out: token info */

	/** Macros and structs below are from ftdefs.h in MYISAM */
	/** Check if a char is a true word char */
	#define true_word_char(c, ch) ((c) & (_MY_U | _MY_L | _MY_NMR) || (ch) == '_')

	/** Check if a char is a misc word char */
	#define misc_word_char(X)	0

	const char* fts_boolean_syntax = DEFAULT_FTB_SYNTAX;

	/** Boolean search operators */
	#define FTB_YES   (fts_boolean_syntax[0])
	#define FTB_EGAL  (fts_boolean_syntax[1])
	#define FTB_NO    (fts_boolean_syntax[2])
	#define FTB_INC   (fts_boolean_syntax[3])
	#define FTB_DEC   (fts_boolean_syntax[4])
	#define FTB_LBR   (fts_boolean_syntax[5])
	#define FTB_RBR   (fts_boolean_syntax[6])
	#define FTB_NEG   (fts_boolean_syntax[7])
	#define FTB_TRUNC (fts_boolean_syntax[8])
	#define FTB_LQUOT (fts_boolean_syntax[10])
	#define FTB_RQUOT (fts_boolean_syntax[11])

	/** FTS query token */
	typedef struct st_ft_word
	{
		uchar*	pos;	/*!< word start pointer */
		uint	len;	/*!< word len */
		double	weight;	/*!< word weight, unused in innodb */
	} FT_WORD;

Ngram Parser Implementation
---------------------------

We implement the ngram parser as an external parser, and put the source code
under 'plugin/innodb_ngram' rather than making it part of InnoDB. But the ngram
parser will be loaded automatically when the server starts.

The ngram parser has the following functions:

.) ngram parser entry function
   We do ngram tokenization and query parsing here.
	int
	ngram_parser_parse(
	/*===============*/
		MYSQL_FTPARSER_PARAM*	param)	/*!< in: plugin parser param */

.) ngram tokenization function
   We split a doc into ngrams in this function. The ngram algorithm is simple.
	int
	ngram_parse(
	/*========*/
		MYSQL_FTPARSER_PARAM*	param,	/*!< in: plugin parser param */
		const char*		doc,	/*!< in: document to parse */
		int			len,	/*!< in: document length */
		MYSQL_FTPARSER_BOOLEAN_INFO*	bool_info)	/*!< in/out: boolean info */
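
   For illustration, the core loop might look roughly like the sketch below. It
assumes the standard plugin parser callback (param->mysql_add_word), MySQL's
m_ctype.h helpers (my_mbcharlen, my_isspace), and the ngram_token_size variable
from the HLS; it is a sketch, not the final code.

	/* Sketch: step through the document one character at a time and emit
	each window of ngram_token_size characters via the plugin callback.
	Whitespace restarts the window (see "Space Elimination" in the HLS).
	Query-mode bool_info handling is omitted. */
	int
	ngram_parse_sketch(
		MYSQL_FTPARSER_PARAM*	param,	/*!< in: plugin parser param */
		const char*		doc,	/*!< in: document to parse */
		int			len,	/*!< in: document length */
		MYSQL_FTPARSER_BOOLEAN_INFO*	bool_info)
	{
		const CHARSET_INFO*	cs = param->cs;
		const char*		start = doc;	/* window start */
		const char*		next = doc;	/* scan position */
		const char*		end = doc + len;
		int			n_chars = 0;	/* chars in window */

		while (next < end) {
			/* Byte length of the next character; real code must
			also handle invalid sequences. */
			int	char_len = my_mbcharlen(cs, (uchar)*next);

			if (char_len == 1 && my_isspace(cs, *next)) {
				/* Space: drop the current, too-short window. */
				next++;
				start = next;
				n_chars = 0;
				continue;
			}

			next += char_len;
			n_chars++;

			if (n_chars == ngram_token_size) {
				/* Emit one ngram, then slide the window
				forward by one character. */
				param->mysql_add_word(param, (char*)start,
						      (int)(next - start),
						      bool_info);
				start += my_mbcharlen(cs, (uchar)*start);
				n_chars = ngram_token_size - 1;
			}
		}

		return(0);
	}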

.) Term search convert function
   We use this function to convert a term search into a phrase search and to
handle wildcard searches.
   See the conversion rules in 7.a-b of the HLS.
	int
	ngram_term_convert(
		MYSQL_FTPARSER_PARAM*	param,		/*!< in: plugin parser param */
		const uchar*		token,		/*!< in: token to parse */
		int			len,		/*!< in: token length */
		MYSQL_FTPARSER_BOOLEAN_INFO*
					bool_info)	/*!< in/out: boolean info */

Stopword Handling
-----------------

We do stopword handling inside InnoDB, and we distinguish the ngram parser from
other parsers.

.) fts token check function
   We do fts_min_token_size & fts_max_token_size & stopword checking here, and
make this function the single entry point for these checks.
	bool
	fts_check_token(
	/*============*/
		fts_string_t*           token,          /* in: token string */
		ib_rbt_t*               stopwords,      /* in: stopwords */
		bool                    is_ngram,       /* in: is ngram parser 
*/
		CHARSET_INFO*           cs)             /* in: token charset */

   The related functions that need to be modified are:
   a. fts_add_token in fts0fts.cc
   b. fts_tokenizer_word_get in fts0fts.cc
   c. row_merge_fts_doc_tokenize in row0ftsort.cc
   d. fts_query_phrase_split in fts0que.cc

   The stopword check algorithm for ngram:
   a. First we split the token into ngrams, from unigram up to the ngram token
size (bigram, for example).
   b. Then we check each ngram against the stopword list; if any ngram is a
stopword, we treat the whole token as a stopword.
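
   A minimal standalone sketch of this containment check (the names are ours
for illustration; the real fts_check_token above works on an fts_string_t and
an rb-tree of stopwords):

	#include <set>
	#include <string>
	#include <vector>

	/* Byte length of the UTF-8 character starting with byte c. */
	static size_t mb_len(unsigned char c) {
		return c < 0x80 ? 1 : (c >> 5) == 0x6 ? 2
		       : (c >> 4) == 0xE ? 3 : 4;
	}

	/* A token is treated as a stopword if ANY of its character
	substrings of length 1..n appears in the stopword set. */
	bool ngram_contains_stopword(const std::string& token,
				     const std::set<std::string>& stopwords,
				     size_t n) {
		/* Collect character start offsets, plus one past the end. */
		std::vector<size_t> off;
		for (size_t i = 0; i < token.size(); i += mb_len(token[i]))
			off.push_back(i);
		off.push_back(token.size());

		for (size_t len = 1; len <= n; ++len)	/* unigram .. ngram */
			for (size_t i = 0; i + len < off.size(); ++i)
				if (stopwords.count(token.substr(
					off[i], off[i + len] - off[i])))
					return true;
		return false;
	}

	/* With stopwords = {","} and n = 2, both 'a,' and ',b' contain ','
	and are rejected, matching the 'a,b' example in the HLS. */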

.) is_ngram flag check and set
   First we define a macro for ngram parser name in ft_global.h.
    	#define FTS_NGRAM_PARSER_NAME "ngram"

   TODO: the optimizer team can also use this macro to set ngram as the default
parser for CJK if no parser is specified.

   Then we add an 'is_ngram' flag to index_def, dict_index_t and fts_doc_t, and
check for the ngram parser by name in the handler.

   In ha_innobase::open and innobase_create_index_def, we do:

       index->is_ngram = strncmp(plugin_name(parser)->str,
                FTS_NGRAM_PARSER_NAME,
                plugin_name(parser)->length)
                == 0 ? true : false;

Mecab Parser Implementation
---------------------------
.) Mecab parser plugin init function
We create MeCab::Model and MeCab::Tagger objects shared by all threads.
Note:
We create the Model with default options; groonga uses '-Owakati' for quick
parsing, but we cannot, because we need word positions.

int
mecab_parser_plugin_init(
        void*   arg __attribute__((unused)));
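
For illustration, the init might look roughly like the sketch below, using the
MeCab C++ API ('mecab_model' and 'mecab_tagger' are illustrative names for the
shared objects; this is a sketch, not the final code):

	#include <mecab.h>

	/* Shared across all threads; created once at plugin init. */
	static MeCab::Model*	mecab_model = NULL;
	static MeCab::Tagger*	mecab_tagger = NULL;

	int
	mecab_parser_plugin_init_sketch(
		void*	arg __attribute__((unused)))
	{
		/* Model::create() reads the mecabrc / dictionary settings. */
		mecab_model = MeCab::Model::create("");
		if (mecab_model == NULL) {
			return(1);
		}

		/* One tagger shared by all threads; per-call state lives in
		the Lattice created in mecab_parser_parse(). */
		mecab_tagger = mecab_model->createTagger();

		return(mecab_tagger == NULL ? 1 : 0);
	}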

.) Mecab parser plugin deinit function
int
mecab_parser_plugin_deinit(
        void*   arg __attribute__((unused)));

.) Mecab parser entry function
We create a MeCab::Lattice object for parsing. The Lattice contains all the
local state required for the analysis, so we create a lattice object for every
call.

int
mecab_parser_parse(
        MYSQL_FTPARSER_PARAM*   param)  /*!< in: plugin parser param */

.) Mecab parse function
We parse the document and do the term-to-phrase conversion here.

int
mecab_parse(
        MeCab::Lattice*         mecab_lattice,  /*!< in: MeCab lattice */
        MYSQL_FTPARSER_PARAM*   param,          /*!< in: plugin parser param */
        char*                   doc,            /*!< in: document to parse */
        int                     len,            /*!< in: document length */
        MYSQL_FTPARSER_BOOLEAN_INFO*
                                bool_info)      /*!< in/out: boolean info */
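
For illustration, the parse function might walk the lattice roughly as in the
sketch below ('mecab_tagger' is the shared tagger from the init sketch above;
this is a sketch, not the final code):

	/* Sketch: tokenize one document with the shared tagger and a per-call
	lattice, emitting each morpheme through the plugin callback. Error
	handling and the term-to-phrase conversion are omitted. */
	int
	mecab_parse_sketch(
		MeCab::Lattice*		mecab_lattice,	/*!< in: MeCab lattice */
		MYSQL_FTPARSER_PARAM*	param,		/*!< in: plugin parser param */
		char*			doc,		/*!< in: document to parse */
		int			len,		/*!< in: document length */
		MYSQL_FTPARSER_BOOLEAN_INFO*
					bool_info)	/*!< in/out: boolean info */
	{
		mecab_lattice->set_sentence(doc, len);

		if (!mecab_tagger->parse(mecab_lattice)) {
			return(1);
		}

		for (const MeCab::Node* node = mecab_lattice->bos_node();
		     node != NULL; node = node->next) {
			/* Skip the begin/end-of-sentence pseudo nodes. */
			if (node->stat == MECAB_BOS_NODE
			    || node->stat == MECAB_EOS_NODE) {
				continue;
			}

			/* node->surface points into the sentence buffer;
			node->length is the surface length in bytes. */
			param->mysql_add_word(param, (char*)node->surface,
					      (int)node->length, bool_info);
		}

		return(0);
	}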