WL#6607: InnoDB FULLTEXT SEARCH: CJK support
Status: Complete
This is the place holder for CJK (Chinese, Japanese and Korea) support for
InnoDB FULLTEXT search.
1) N-GRAM support
The problem with CJK language is that each word could come from multiple
characters. There is no fixed separators for individual words. So the idea is to
index contiguous groups of N characters, or n-grams, instead.
This N-gram is needed during tokenization phase, which eventually generates
tokens in the FTS index. It is also needed when we parse the search phrase, so
to find the needed words.
The N-gram is relatively simple to implement, it simply chop off phrases with
fixed segment size. For example, if the phrase is "abcd", then the tokens after
2-gram segmentation will be "ab", "bc", "cd".
2) MeCab support
MeCab is a part-of-speech and morphological analyzer for
Japanese.http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html. and it's
used as a tokenizer by groonga(groonga.org) & mroonga(mroonga.org), which is a
mysql storage engine for fulltext search.
We need to build a fulltext plugin parser based on MeCab. MeCab only support
three Japanese charsets: ujis, sjis, and utf8.
Mecab parser is superior in time, space and precision, which is the fraction of
relevant documents in a search result. On the other hand, ngram parser is
superior in recall, which is the fraction of retrieved documents in the perfect
search result.
3) Stopword
The default stopword is utf-8 based, and none of them are CJK word. Although
user can add their own stopword in stopword tables, we will need to
1)Consider the need of default CJK stopword
2)Test and make sure user supplied stopword table can support CJK
4) Default Setting
MySQL support following CJK charset and collation
big5 | Big5 Traditional Chinese | big5_chinese_ci
ujis | EUC-JP Japanese | ujis_japanese_ci
sjis | Shift-JIS Japanese | sjis_japanese_ci
euckr | EUC-KR Korean | euckr_korean_ci
cp932 | SJIS for Windows Japanese | cp932_japanese_ci
eucjpms | UJIS for Windows Japanese | eucjpms_japanese_ci
gbk | GBK Simplified Chinese | gbk_chinese_ci
gb18030 | China National Standard GB18030 | gb18030_chinese_ci
mysql> SHOW COLLATION LIKE '%chinese%';
+--------------------+---------+-----+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+--------------------+---------+-----+---------+----------+---------+
| big5_chinese_ci | big5 | 1 | Yes | Yes | 1 |
| gb2312_chinese_ci | gb2312 | 24 | Yes | Yes | 1 |
| gbk_chinese_ci | gbk | 28 | Yes | Yes | 1 |
| gb18030_chinese_ci | gb18030 | 248 | Yes | Yes | 2 |
+--------------------+---------+-----+---------+----------+---------+
mysql> SHOW COLLATION LIKE '%japan%';
+---------------------+---------+----+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+---------------------+---------+----+---------+----------+---------+
| ujis_japanese_ci | ujis | 12 | Yes | Yes | 1 |
| sjis_japanese_ci | sjis | 13 | Yes | Yes | 1 |
| cp932_japanese_ci | cp932 | 95 | Yes | Yes | 1 |
| eucjpms_japanese_ci | eucjpms | 97 | Yes | Yes | 1 |
+---------------------+---------+----+---------+----------+---------+
mysql> SHOW COLLATION LIKE '%korea%';
+-----------------+---------+----+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+-----------------+---------+----+---------+----------+---------+
| euckr_korean_ci | euckr | 19 | Yes | Yes | 1 |
+-----------------+---------+----+---------+----------+---------+
1 row in set (0.01 sec)
When these charset and collation is specified for the column, the FTS index
switch to following behavior BY DEFAULT:
1) Disable min len requirement, or make it 0
2) Use CJK ngram parser
3) User appropriate default stopword set.
Note:
- Query with '-' operator in boolean mode are expected to run with NOT MATCH
clause
e.g Write , NOT match (page_title) against ("-曖昧 -さ 回" in BOOLEAN MODE)
INSTEAD OF match (page_title) against ("-曖昧 -さ 回" in BOOLEAN MODE)
- Search term having Numeric as well as CJK characters are not searched
match (page_title) against ('2年' in boolean mode);
- If a CJK fulltext index uses ngram as default parser, but ngram parser is not
loaded by wrong configuration(rare case), or the index is created when ngram
parser is not load, and used with ngram parser loaded, we will get inconsistent
results, because the existing data and query string are parsed by different
parsers in both cases.
Functional requirements: F-1: Innodb should provide default fulltext parser for CJK. F-1: Innodb should provide mecab parser for Japanese. Non-Functional requirements: NF-1: User can also have their own fulltext parser for CJK. NF-2: Implicit requirements: No new SQL needed, work on all platforms, do not break replication, backup, partitioning, FK, or any other exiting features.
Problem Statement:
------------------
Write a built-in ngram fulltext plugin parser to support CJK in InnoDB.
Let's understand current implementation in InnoDB:
- Use space as delimiter to tokenize document by default.
Languages like CJK don't have such delimiters, so it doesn't work for CJK.
- User can write their own plugin parser to support CJK.
There are several issues:
a. User should have a good understanding of our fts plugin parser framework,
and certain programming skills.
Even there are several opensource plugins.
b. User should manage plugin installation and uninstallation.
c. User should take care of the plugin library when we do system backup or
upgrade, Since these plugin parser is not part of MySQL.
High Level Approach Description:
--------------------------------
- Write a ngram fulltext plugin parser.
We decide to use ngram, bigram by default. because bigram is popular and enough
for CJK after investigating CJK support of products like Sphinx, Lucence,
Sybase, Sql Server, DB2, and Oracle Text.
There are some properties or features of this ngram parser.
1. Character based.
Take "abc def" for example, we have "ab", "bc", "de" and "ef". Word based bigram
has only one: "abc def".
2. Built-in plugin.
It's loaded when server starts, and is unloaded when server ends, just like any
other built-in plugins.
It's a library, and will be placed in "mysql-install-dir/lib/plugin".
3. Named "ngram".
We can use it as below:
CREATE TABLE t1
(
id INT AUTO_INCREMENT PRIMARY KEY,
doc CHAR(255),
FULLTEXT INDEX (doc) WITH PARSER ngram
) ENGINE=InnoDB;
ALTER TABLE articles ADD FULLTEXT INDEX (body) WITH PARSER ngram;
CREATE FULLTEXT INDEX ft_index ON articles(body) WITH PARSER ngram;
4. Ngram Token Size Configuration
We have a configuration variable named "ngram_token_size", and it's default
value is 2. We can change its value to any number between 1 to 10.
We will ignore fts_min_token_size & fts_max_token_size checking for ngram.
Actually the min value of fts_max_token_size is 10.
5. Space Elimination
Space is eliminated when parsing.
For example:
a. 'ab cd' is parsed to 'ab', 'cd'.
b. 'a bc' is parsed to 'bc'.
6. Stopword Handling
We can have some too frequent and meaningless chars and words in stopword
list, and also punctuation.
Stopword is very important in two aspects:
a. Reduce index size, especially in ngram.
b. Improve query performance and accuracy.
We usually check stopword after tokenization. We compare a token with
stopword, and it's equal to a stopword, we will not index it. But for ngram, we
should use "contain" not "compare" to handle stopwords. For example, we take ','
as a stopword, and a document is 'a,b', then we have 'a,' and ',b'. if we use
"compare", both bigrams are not stopwrods. That's not we want: 'a,' and ',b'
should be filtered out, because ',' is a stopwrod.
the other way is that we handle stopwords inside ngram parser. We need to pass
stopword list in, and it makes the internal logical a little complicated.
So we prefer the former way to handle stopwords.
7. Query Processing
Space is still separator of terms in query string.
Examples blow use bigram.
a. Term Search
(1) Natural Language Mode
Term search is converted to union of ngram term search. e.g. 'abc' is
equivalent to 'ab bc'.
(2) Boolean Mode
Term search is converted to ngrammed phrase search. e.g. 'abc' is
equivalent to '"ab bc"'.
b. Wildcard Search
Wildcard searching can return unexpected results since an ngram text index
contains only n-grams, and contains no information about the beginning of terms.
the following behaviors should be noted:
(1) if the prefix term is shorter than ngram token size, the query returns
all indexed rows that contain n-grams starting with the prefix term.
(2) If the prefix term is longer than ngram token size, the prefix term is
converted to an n-grammed phrase and the wildcard is dropped. For example:
'abc*' is equivalent to '"ab bc"'.
c. Phrase Search
Phrase search is conveted to n-grammed phrase search.
For example:
(1)Phrase "abc" is converted to "ab bc". Doc "abc" and "ab bc"
match.
(2)Phrase "abc def" is converted to "ab bc de ef". Doc "abc def", and "ab
bc de ef" match, but 'abcdef' doesn't match.
- Write a mecab fulltext plugin parser.
1. Word based
Mecab parser tokenize document into meaningful words. e.g. 'データベース管理' is
tokenized to 'データベース' and '管理'.
2. Named "mecab".
If we want to use mecab plugin parser, we need mecab & mecab-ipadic installed.
We might distribute mecab-ipadic with server package in some platforms. In
server package we have a directory named 'mecab' in /lib/plugin, and we package
three dictionaries in mecab:/mecab/dic/ipadic_ujis,/mecab/dic/ipadic_sjis, and
/mecab/dic/ipadic_utf8. we link mecab library statically(linking libmecab.a).
we will need to install mecab & mecab-ipadic in other platforms.
a. Install MeCab Packages
We package mecab & mecab-ipadic into mysql binary for those platforms without a
binary distribution.
Note: we only support mecab 0.993 & 0.996 now.
We have two ways to install mecab & mecab-ipadic.
1) Installing mecab & mecab-ipadic from a Binary Distribution
On Fedora host, use yum:
yum mecab-devel
On a Debian or Ubuntu host, use apt-get:
apt-get install mecab
apt-get install mecab-ipadic
2) Building memcached from Source(not suggested)
http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html#install-unix
Note: we support mecab-0.996(latest version now).
download mecab-0.996.tar.gz from
http://code.google.com/p/mecab/downloads/detail?name=mecab-0.996.tar.gz&can=2&q=
and mecab-ipadic-2.7.0-20070801.tar.gz from
http://code.google.com/p/mecab/downloads/detail?name=mecab-ipadic-2.7.0-
20070801.tar.gz&can=2&q=
The default installation script is below:
tar zxfv mecab-0.996.tar
cd mecab-0.996
./configure
make
make check
su
make install
tar zxfv mecab-ipadic-2.7.0-20070801.tar
cd mecab-ipadic-2.7.0-20070801
./configure
make
su
make install
Note:
A import option for configure mecab-ipadic is
--with-charset=charset set default charset (euc-jp/sjis/utf-8)
b. Compile Mecab Parser.
If we want to build mecab parser by ourselves, we need to use 'WITH_MECAB'
option for cmake.
Set WITH_MECAB to 'system', when we install mecab by default. e.g. /usr/local in
LINUX. Set WITH_MECAB to ', which is the customized
installation directory of mecab. e.g. cmake -DWITH_MECAB=/opt/mecab. We package
mecab & mecab-ipadic in this case.
c. Run Mecab Parser
We can use the script blow to load and unload mecab plugin:
INSTALL PLUGIN mecab SONAME 'libpluginmecab.so';
UNINSTALL PLUGIN mecab;
Note:
If we want to load mecab as server starts, refer to '--plugin-load' option in
server options.
We can use the scripts to test if mecab parser is loaded successfully.
CREATE TABLE t1
(
id INT AUTO_INCREMENT PRIMARY KEY,
doc CHAR(255),
FULLTEXT INDEX (doc) WITH PARSER mecab
) ENGINE=InnoDB DEFAULT CHARACTER SET utf8;
ALTER TABLE articles ADD FULLTEXT INDEX (body) WITH PARSER mecab;
CREATE FULLTEXT INDEX ft_index ON articles(body) WITH PARSER mecab;
3. Mecab Status Variables and System Variables
We have a status variable "mecab_charset" to show the charset supported by
mecab.
SHOW STATUS LIKE 'mecab_charset';
Variable_name Value
mecab_charset utf8
We have two system variables:
a. mecab_ipadic_charset
It's effective only when we distribute mecab with server package. We package
three different dictionaries, so we can set this variable to tell mecab parser
to load which dictionary.
The valid values are: euc-jp, sjis, utf-8. euc-jp is default value.
We search for '/plugin_dir/mecab/dic/ipadic-charset/'. If we can't find it,
search for '/plugin_dir/mecab/dic/ipadic/' instead.
b. mecab_rc_file_path
mecabrc is a configuration file of mecab. We can find mecabrc in
/path/to/mecab/etc or /usr/local/ect if we install mecab by ourselves, or
/path/to/mysql/data if we distribute mecab with server package(in this case, we
overwrite mecabrc file when mecab parser is loaded if mecab_rc_file_path is not
set).
we can set mecab_rc_file_path to tell mecab to use a customized mecabrc file.
4. Query Processing
Space is still separator of terms in query string.
'データベース管理' is tokenized to two tokens: 'データベース' and '管理' by mecab.
a. Term Search
(1) Natural Language Mode
Term search is converted to union of token search. e.g. 'データベース管理' is
equivalent to 'データベース 管理'.
(2) Boolean Mode
Term search is converted to phrase search. e.g. 'データベース管理' is
equivalent to '"データベース 管理"'.
b. Wildcard Search
We don't tokenize the term of wildcard search. e.g. 'データベース管理*' we
will search prefix 'データベース管理', and may have no result.
c. Phrase Search
Phrase is tokenized. For example: "データベース管理" is equivalent to "データ
ベ
ース 管理".
========================================================================
Ngram parser users:
- Users want to use fulltext for CJK and other languages without word
delimiters.
Query Parser Implementation
---------------------------
Just like fts0plugin.cc, we reuse function 'ft_get_word' and related macros
from MyISAM. (Used by both ngram and mecab parser)
uchar
query_get_word(
/*===========*/
>_______const CHARSET_INFO*>____cs,>____>_______/*!< in: charset */
>_______uchar**>>_______>_______start,>_>_______/*!< in/out: doc start
ptr */
>_______uchar*>_>_______>_______end,>___>_______/*!< in/out: doc end ptr
*/
>_______FT_WORD*>_______>_______word,>__>_______/*!< in/out: token */
>_______MYSQL_FTPARSER_BOOLEAN_INFO*>___info)>__/*!< in/out: token info*
/** Macros and structs below are from ftdefs.h in MYISAM */
/** Check a char is true word */
#define true_word_char(c, ch) ((c) & (_MY_U | _MY_L | _MY_NMR) || (ch)
== '_')
/** Check if a char is misc word */
#define misc_word_char(X) 0
const char* fts_boolean_syntax = DEFAULT_FTB_SYNTAX;
/** Boolean search operators */
#define FTB_YES (fts_boolean_syntax[0])
#define FTB_EGAL (fts_boolean_syntax[1])
#define FTB_NO (fts_boolean_syntax[2])
#define FTB_INC (fts_boolean_syntax[3])
#define FTB_DEC (fts_boolean_syntax[4])
#define FTB_LBR (fts_boolean_syntax[5])
#define FTB_RBR (fts_boolean_syntax[6])
#define FTB_NEG (fts_boolean_syntax[7])
#define FTB_TRUNC (fts_boolean_syntax[8])
#define FTB_LQUOT (fts_boolean_syntax[10])
#define FTB_RQUOT (fts_boolean_syntax[11])
/** FTS query token */
typedef struct st_ft_word
{
>_______uchar* pos;>____/*!< word start pointer */
>_______uint len;>____/*!< word len */
>_______double weight;>_/*!< word weight, unused in innodb */
} FT_WORD;
Ngram Parser Implementation
---------------------------
We implement ngram parser as an external parser, and put the source code under
'plugin/innodb_ngram',
other than part of innodb. But ngram parser will be loaded automatically when
server starts.
Ngram parser has the following functions:
.) ngram parser entrance function
we do ngram tokenization and query parsing here.
int
ngram_parser_parse(
/*===============*/
>_______MYSQL_FTPARSER_PARAM* param)>___/*!< in: plugin parser param */
.) ngram tokenization function
We split a doc into ngrams in this function. The ngram algorithm is simple.
int
ngram_parse(
/*========*/
>_______MYSQL_FTPARSER_PARAM*>__param,>_/*!< in: plugin parser param */
>_______const char*>____>_______doc,>___/*!< in: document to parse */
>_______int>____>_______>_______len,>___/*!< in: document length */
>_______MYSQL_FTPARSER_BOOLEAN_INFO* bool_info) /*!< in/out: boolean
info */
.) Term search convert function
We use this function to do convertion from term search to phrase search and
handle wildcard search.
See convert rule in 6.a-b of HLS.
int
ngram_term_convert(
>_______MYSQL_FTPARSER_PARAM*>__param,>_>_______/*!< in: plugin parser
param */
>_______const uchar*>___>_______token,>_>_______/*!< in: token to parse
*/
>_______int>____>_______>_______len,>___>_______/*!< in: token length */
>_______MYSQL_FTPARSER_BOOLEAN_INFO*
>_______>_______>_______>_______bool_info)>_____/*!< in/out: boolean
info */
Stopword Handling
-----------------
We do sopword handling inside innodb, and we distinguish ngram parser from other
parsers.
.) fts token check function
We do fts_min_token_size & fts_max_token_size & stopword checking here, and
make this function as the only
entrance to do the checks.
bool
fts_check_token(
/*============*/
fts_string_t* token, /* in: token string */
ib_rbt_t* stopwords, /* in: stopwords */
bool is_ngram, /* in: is ngram parser
*/
CHARSET_INFO* cs) /* in: token charset */
So ralated functions needed to modify are below:
a. fts_add_token in fts0fts.cc
b. fts_tokenizer_word_get in fts0fts.cc
c. row_merge_fts_doc_tokenize in row0ftsort.cc
d. fts_query_phrase_split in fts0que.cc
The stopword check alogrithm for ngram:
a. first we split the token into ngrams from unigram to bigram(bigram for
example).
b. then we check each ngram in stopwrods, if it's a stopword, we treat the
token as stopword.
.) is_ngram flag check and set
First we define a macro for ngram parser name in ft_global.h.
#define FTS_NGRAM_PARSER_NAME "ngram"
TODO: optimizer team can also use this macro to set ngram as default parser
for CJK if no parser is specified.
Then we add 'is_ngram' in index_def, dict_index_t and fts_doc_t, and check
ngram parser by name in handler.
In ha_innobase::open and innobase_create_index_def, we do:
index->is_ngram = strncmp(plugin_name(parser)->str,
FTS_NGRAM_PARSER_NAME,
plugin_name(parser)->length)
== 0 ? true : false;
Mecab Parser Implementation
---------------------------
.) Mecab parser plugin init function
We create MeCab::Model and MeCab::Tagger objects shared by all threads.
Note:
We create Model by default, but groonga uses '-Owakati' for quick parsing. The
reason is that we need word position.
int
mecab_parser_plugin_init(
void* arg __attribute__((unused)));
.) Mecab parser plugin deinit function
int
mecab_parser_plugin_deinit(
void* arg __attribute__((unused)));
.) Mecab parser entry function
Create MeCab::Lattice object for parsing. Lattice contains all local variables
required for the analysis and we create a lattice object for every call.
int
mecab_parser_parse(
MYSQL_FTPARSER_PARAM* param) /*!< in: plugin parser param */
.) Mecab parse function
We parse document and do term-to-phrase conversion here.
int
mecab_parse(
MeCab::Lattice* mecab_lattice, /*!< in: MeCab lattice */
MYSQL_FTPARSER_PARAM* param, /*!< in: plugin parser param */
char* doc, /*!< in: document to parse */
int len, /*!< in: document length */
MYSQL_FTPARSER_BOOLEAN_INFO*
bool_info) /*!< in/out: boolean info */
Copyright (c) 2000, 2025, Oracle Corporation and/or its affiliates. All rights reserved.