WL#6607: InnoDB FULLTEXT SEARCH: CJK support
Status: Complete
This is the placeholder for CJK (Chinese, Japanese and Korean) support for
InnoDB FULLTEXT search.

1) N-GRAM support
The problem with CJK languages is that a word can consist of multiple
characters, and there are no fixed separators between individual words. So
the idea is to index contiguous groups of N characters, or n-grams, instead.
The n-gram support is needed during the tokenization phase, which eventually
generates the tokens in the FTS index. It is also needed when we parse the
search phrase, so that we can find the needed words. The n-gram parser is
relatively simple to implement: it simply splits a phrase into segments of a
fixed size. For example, if the phrase is "abcd", the tokens after 2-gram
segmentation are "ab", "bc" and "cd" (a standalone sketch of this
segmentation appears at the end of this section).

2) MeCab support
MeCab is a part-of-speech and morphological analyzer for Japanese
(http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html). It is used as
a tokenizer by groonga (groonga.org) and mroonga (mroonga.org), a MySQL
storage engine for fulltext search. We need to build a fulltext plugin
parser based on MeCab. MeCab only supports three Japanese charsets: ujis,
sjis, and utf8. The mecab parser is superior in time, space and precision
(the fraction of retrieved documents that are relevant). On the other hand,
the ngram parser is superior in recall (the fraction of relevant documents
that are retrieved).

3) Stopword
The default stopword list is utf-8 based, and none of its entries are CJK
words. Although users can add their own stopwords in stopword tables, we
will need to:
1) Consider the need for a default CJK stopword list.
2) Test and make sure a user-supplied stopword table can support CJK.

4) Default Setting
MySQL supports the following CJK charsets and collations:

big5    | Big5 Traditional Chinese        | big5_chinese_ci
ujis    | EUC-JP Japanese                 | ujis_japanese_ci
sjis    | Shift-JIS Japanese              | sjis_japanese_ci
euckr   | EUC-KR Korean                   | euckr_korean_ci
cp932   | SJIS for Windows Japanese       | cp932_japanese_ci
eucjpms | UJIS for Windows Japanese       | eucjpms_japanese_ci
gbk     | GBK Simplified Chinese          | gbk_chinese_ci
gb18030 | China National Standard GB18030 | gb18030_chinese_ci

mysql> SHOW COLLATION LIKE '%chinese%';
+--------------------+---------+-----+---------+----------+---------+
| Collation          | Charset | Id  | Default | Compiled | Sortlen |
+--------------------+---------+-----+---------+----------+---------+
| big5_chinese_ci    | big5    |   1 | Yes     | Yes      |       1 |
| gb2312_chinese_ci  | gb2312  |  24 | Yes     | Yes      |       1 |
| gbk_chinese_ci     | gbk     |  28 | Yes     | Yes      |       1 |
| gb18030_chinese_ci | gb18030 | 248 | Yes     | Yes      |       2 |
+--------------------+---------+-----+---------+----------+---------+

mysql> SHOW COLLATION LIKE '%japan%';
+---------------------+---------+----+---------+----------+---------+
| Collation           | Charset | Id | Default | Compiled | Sortlen |
+---------------------+---------+----+---------+----------+---------+
| ujis_japanese_ci    | ujis    | 12 | Yes     | Yes      |       1 |
| sjis_japanese_ci    | sjis    | 13 | Yes     | Yes      |       1 |
| cp932_japanese_ci   | cp932   | 95 | Yes     | Yes      |       1 |
| eucjpms_japanese_ci | eucjpms | 97 | Yes     | Yes      |       1 |
+---------------------+---------+----+---------+----------+---------+

mysql> SHOW COLLATION LIKE '%korea%';
+-----------------+---------+----+---------+----------+---------+
| Collation       | Charset | Id | Default | Compiled | Sortlen |
+-----------------+---------+----+---------+----------+---------+
| euckr_korean_ci | euckr   | 19 | Yes     | Yes      |       1 |
+-----------------+---------+----+---------+----------+---------+
1 row in set (0.01 sec)

When one of these charsets and collations is specified for a column, the
FTS index switches to the following behavior BY DEFAULT:
1) Disable the min token length requirement, or make it 0.
2) Use the CJK ngram parser.
3) Use an appropriate default stopword set.
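To make the segmentation described in 1) concrete, below is a minimal
standalone sketch of character-based bigram splitting. It assumes UTF-8
input and omits the space elimination and stopword handling described later
in this worklog; utf8_char_len() is a hypothetical helper for illustration,
not server code.

#include <iostream>
#include <string>
#include <vector>

/* Hypothetical helper: byte length of a UTF-8 character from its lead
   byte. The server would use the column charset instead. */
static size_t utf8_char_len(unsigned char lead) {
    if (lead < 0x80) return 1;
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    return 4;
}

static std::vector<std::string> bigrams(const std::string& doc) {
    /* Record the byte offset of every character boundary. */
    std::vector<size_t> bounds;
    for (size_t i = 0; i < doc.size(); i += utf8_char_len(doc[i]))
        bounds.push_back(i);
    bounds.push_back(doc.size());

    /* Emit every window of two adjacent characters. */
    std::vector<std::string> out;
    for (size_t i = 0; i + 2 < bounds.size(); i++)
        out.push_back(doc.substr(bounds[i], bounds[i + 2] - bounds[i]));
    return out;
}

int main() {
    for (const std::string& t : bigrams("abcd"))  /* prints ab, bc, cd */
        std::cout << t << '\n';
}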
Note:
- Queries with the '-' operator in boolean mode are expected to be written
  with a NOT MATCH clause, e.g. write
    NOT MATCH (page_title) AGAINST ("-曖昧 -さ 回" IN BOOLEAN MODE)
  instead of
    MATCH (page_title) AGAINST ("-曖昧 -さ 回" IN BOOLEAN MODE)
- Search terms containing numeric as well as CJK characters are not
  searched, e.g.
    MATCH (page_title) AGAINST ('2年' IN BOOLEAN MODE);
- If a CJK fulltext index uses ngram as its default parser, but the ngram
  parser is not loaded because of a wrong configuration (a rare case), or
  the index is created while the ngram parser is not loaded and then used
  with the ngram parser loaded, we will get inconsistent results, because
  in both cases the existing data and the query string are parsed by
  different parsers.
Functional requirements:
F-1: InnoDB should provide a default fulltext parser for CJK.
F-2: InnoDB should provide a mecab parser for Japanese.

Non-functional requirements:
NF-1: Users can also have their own fulltext parser for CJK.
NF-2: Implicit requirements: no new SQL needed; works on all platforms;
      does not break replication, backup, partitioning, FK, or any other
      existing features.
Problem Statement:
------------------
Write a built-in ngram fulltext plugin parser to support CJK in InnoDB.

Let's review the current implementation in InnoDB:
- By default, documents are tokenized using space as the delimiter.
  Languages like CJK do not have such delimiters, so this does not work
  for CJK.
- Users can write their own plugin parser to support CJK. There are several
  issues with that:
  a. Users need a good understanding of our fts plugin parser framework and
     certain programming skills, even though several open-source plugins
     exist.
  b. Users have to manage plugin installation and uninstallation.
  c. Users have to take care of the plugin library during system backup or
     upgrade, since these plugin parsers are not part of MySQL.

High Level Approach Description:
--------------------------------
- Write an ngram fulltext plugin parser.

We decided to make ngram use bigram by default, because bigram is popular
and sufficient for CJK, after investigating the CJK support of products
such as Sphinx, Lucene, Sybase, SQL Server, DB2, and Oracle Text. The ngram
parser has the following properties and features:

1. Character based.
Take "abc def" for example: we get "ab", "bc", "de" and "ef". A word-based
bigram would produce only one token: "abc def".

2. Built-in plugin.
It is loaded when the server starts and unloaded when the server shuts
down, just like any other built-in plugin. It is a library, and will be
placed in "mysql-install-dir/lib/plugin".

3. Named "ngram".
We can use it as below:
CREATE TABLE t1 (
    id INT AUTO_INCREMENT PRIMARY KEY,
    doc CHAR(255),
    FULLTEXT INDEX (doc) WITH PARSER ngram
) ENGINE=InnoDB;
ALTER TABLE articles ADD FULLTEXT INDEX (body) WITH PARSER ngram;
CREATE FULLTEXT INDEX ft_index ON articles(body) WITH PARSER ngram;

4. Ngram Token Size Configuration.
We have a configuration variable named "ngram_token_size", and its default
value is 2. We can change its value to any number between 1 and 10. We
ignore the fts_min_token_size & fts_max_token_size checks for ngram; note
that the minimum value of fts_max_token_size is 10.

5. Space Elimination.
Space is eliminated when parsing. For example:
a. 'ab cd' is parsed to 'ab', 'cd'.
b. 'a bc' is parsed to 'bc'.

6. Stopword Handling.
We can put overly frequent and meaningless characters, words, and
punctuation in the stopword list. Stopwords are important in two aspects:
a. They reduce the index size, especially with ngram.
b. They improve query performance and accuracy.
We usually check stopwords after tokenization: we compare a token with the
stopword list, and if it is equal to a stopword, we do not index it. But
for ngram, we should use "contain" rather than "compare" to handle
stopwords. For example, if we take ',' as a stopword and a document is
'a,b', then we get the bigrams 'a,' and ',b'. With "compare", neither
bigram is a stopword. That is not what we want: 'a,' and ',b' should both
be filtered out, because each contains the stopword ','. The alternative
is to handle stopwords inside the ngram parser itself, but then we would
need to pass the stopword list in, which complicates the internal logic.
So we prefer the former, contain-based approach (see the sketch below).
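The contain-based check can be sketched as follows. This is a minimal
illustration that assumes single-byte characters and uses a std::set for
the stopword list; the server-side check iterates multibyte characters
through the column charset. ngram_is_stopword() is a hypothetical helper,
not the actual server function.

#include <set>
#include <string>

/* A token is rejected if any of its 1..ngram_token_size character
   substrings is a stopword ("contain", not "compare"). */
static bool ngram_is_stopword(const std::string& token,
                              const std::set<std::string>& stopwords,
                              size_t ngram_token_size) {
    for (size_t n = 1; n <= ngram_token_size; n++)
        for (size_t i = 0; i + n <= token.size(); i++)
            if (stopwords.count(token.substr(i, n)))
                return true;
    return false;
}

/* With stopwords = {","}, both bigrams of the document 'a,b' are
   filtered out:
     ngram_is_stopword("a,", stopwords, 2) == true   (contains ",")
     ngram_is_stopword(",b", stopwords, 2) == true */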
7. Query Processing.
Space is still the separator of terms in a query string. The examples below
use bigram.

a. Term Search
(1) Natural Language Mode
Term search is converted to a union of ngram term searches. For example:
'abc' is equivalent to 'ab bc'.
(2) Boolean Mode
Term search is converted to an n-grammed phrase search. For example: 'abc'
is equivalent to '"ab bc"'.

b. Wildcard Search
Wildcard search can return unexpected results, since an ngram text index
contains only n-grams and no information about the beginnings of terms.
The following behaviors should be noted:
(1) If the prefix term is shorter than the ngram token size, the query
returns all indexed rows that contain n-grams starting with the prefix
term.
(2) If the prefix term is longer than the ngram token size, the prefix
term is converted to an n-grammed phrase and the wildcard is dropped. For
example: 'abc*' is equivalent to '"ab bc"'.

c. Phrase Search
Phrase search is converted to an n-grammed phrase search. For example:
(1) Phrase "abc" is converted to "ab bc". Docs "abc" and "ab bc" match.
(2) Phrase "abc def" is converted to "ab bc de ef". Docs "abc def" and
"ab bc de ef" match, but 'abcdef' does not.

A sketch of this boolean-mode conversion follows.
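Below is a minimal sketch of the boolean-mode conversion rules above,
assuming single-byte characters and a fixed token size of 2;
term_to_ngram_phrase() is a hypothetical helper for illustration, not the
server's actual conversion function.

#include <string>

static std::string term_to_ngram_phrase(std::string term) {
    const size_t token_size = 2;  /* bigram */

    bool wildcard = !term.empty() && term.back() == '*';
    if (wildcard) term.pop_back();

    /* Prefix shorter than the token size: keep the prefix search on the
       stored n-grams as-is. */
    if (term.size() < token_size)
        return wildcard ? term + "*" : term;

    /* Otherwise emit the n-grammed phrase; any wildcard is dropped. */
    std::string phrase = "\"";
    for (size_t i = 0; i + token_size <= term.size(); i++) {
        if (i) phrase += ' ';
        phrase += term.substr(i, token_size);
    }
    return phrase + "\"";
}

/* term_to_ngram_phrase("abc")  -> "\"ab bc\""
   term_to_ngram_phrase("abc*") -> "\"ab bc\""  (wildcard dropped)
   term_to_ngram_phrase("a*")   -> "a*"         (prefix search kept) */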
- Write a mecab fulltext plugin parser.

1. Word based.
The mecab parser tokenizes a document into meaningful words. For example:
'データベース管理' is tokenized to 'データベース' and '管理'.

2. Named "mecab".
If we want to use the mecab plugin parser, we need mecab & mecab-ipadic
installed. We might distribute mecab-ipadic with the server package on some
platforms. In the server package we have a directory named 'mecab' under
/lib/plugin, and we package three dictionaries:
/mecab/dic/ipadic_ujis, /mecab/dic/ipadic_sjis, and /mecab/dic/ipadic_utf8.
We link the mecab library statically (linking libmecab.a). On other
platforms we need to install mecab & mecab-ipadic ourselves.

a. Install MeCab Packages
We package mecab & mecab-ipadic into the mysql binary for those platforms
without a mecab binary distribution. Note: we only support mecab 0.993 &
0.996 now. We have two ways to install mecab & mecab-ipadic:

1) Installing mecab & mecab-ipadic from a Binary Distribution
On a Fedora host, use yum:
  yum install mecab-devel
On a Debian or Ubuntu host, use apt-get:
  apt-get install mecab
  apt-get install mecab-ipadic

2) Building mecab from Source (not suggested)
http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html#install-unix
Note: we support mecab-0.996 (the latest version now). Download
mecab-0.996.tar.gz from
http://code.google.com/p/mecab/downloads/detail?name=mecab-0.996.tar.gz&can=2&q=
and mecab-ipadic-2.7.0-20070801.tar.gz from
http://code.google.com/p/mecab/downloads/detail?name=mecab-ipadic-2.7.0-20070801.tar.gz&can=2&q=

The default installation script is below:
  tar zxfv mecab-0.996.tar.gz
  cd mecab-0.996
  ./configure
  make
  make check
  su
  make install

  tar zxfv mecab-ipadic-2.7.0-20070801.tar.gz
  cd mecab-ipadic-2.7.0-20070801
  ./configure
  make
  su
  make install

Note: An important configure option for mecab-ipadic is
  --with-charset=charset   set default charset (euc-jp/sjis/utf-8)

b. Compile Mecab Parser
If we want to build the mecab parser ourselves, we need to use the
'WITH_MECAB' option for cmake:
- Set WITH_MECAB to 'system' when mecab is installed in the default
  location, e.g. /usr/local on Linux.
- Set WITH_MECAB to the customized installation directory of mecab, e.g.
  cmake -DWITH_MECAB=/opt/mecab. We package mecab & mecab-ipadic in this
  case.

c. Run Mecab Parser
We can use the statements below to load and unload the mecab plugin:
INSTALL PLUGIN mecab SONAME 'libpluginmecab.so';
UNINSTALL PLUGIN mecab;
Note: if we want to load mecab at server startup, refer to the
'--plugin-load' server option.

We can use the statements below to test whether the mecab parser is loaded
successfully:
CREATE TABLE t1 (
    id INT AUTO_INCREMENT PRIMARY KEY,
    doc CHAR(255),
    FULLTEXT INDEX (doc) WITH PARSER mecab
) ENGINE=InnoDB DEFAULT CHARACTER SET utf8;
ALTER TABLE articles ADD FULLTEXT INDEX (body) WITH PARSER mecab;
CREATE FULLTEXT INDEX ft_index ON articles(body) WITH PARSER mecab;

3. Mecab Status Variables and System Variables.
We have a status variable "mecab_charset" to show the charset supported by
mecab:

mysql> SHOW STATUS LIKE 'mecab_charset';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| mecab_charset | utf8  |
+---------------+-------+

We have two system variables:
a. mecab_ipadic_charset
It is effective only when we distribute mecab with the server package. We
package three different dictionaries, so we can set this variable to tell
the mecab parser which dictionary to load. The valid values are: euc-jp,
sjis, utf-8; euc-jp is the default. We search for
'plugin_dir/mecab/dic/ipadic_<charset>/'; if we cannot find it, we search
for 'plugin_dir/mecab/dic/ipadic/' instead.
b. mecab_rc_file_path
mecabrc is a configuration file of mecab. We can find mecabrc in
/path/to/mecab/etc or /usr/local/etc if we install mecab ourselves, or in
/path/to/mysql/data if we distribute mecab with the server package (in this
case, we overwrite the mecabrc file when the mecab parser is loaded, if
mecab_rc_file_path is not set). We can set mecab_rc_file_path to tell mecab
to use a customized mecabrc file.

4. Query Processing.
Space is still the separator of terms in a query string. 'データベース管理'
is tokenized into two tokens, 'データベース' and '管理', by mecab.
a. Term Search
(1) Natural Language Mode
Term search is converted to a union of token searches. For example:
'データベース管理' is equivalent to 'データベース 管理'.
(2) Boolean Mode
Term search is converted to a phrase search. For example:
'データベース管理' is equivalent to '"データベース 管理"'.
b. Wildcard Search
We do not tokenize the term of a wildcard search. For example: for
'データベース管理*' we search for the prefix 'データベース管理', which may
return no results.
c. Phrase Search
The phrase is tokenized. For example: "データベース管理" is equivalent to
"データベース 管理".

A standalone sketch of this tokenization follows.
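For illustration, the sketch below tokenizes 'データベース管理' with the
MeCab C++ API directly, assuming mecab and a utf-8 mecab-ipadic dictionary
are installed. This is roughly what the mecab parser does internally before
handing tokens to the fulltext framework.

#include <mecab.h>
#include <iostream>
#include <string>

int main() {
    /* Model/Tagger hold shared state; a Lattice holds per-parse state. */
    MeCab::Model*   model   = MeCab::createModel("");
    MeCab::Tagger*  tagger  = model->createTagger();
    MeCab::Lattice* lattice = model->createLattice();

    std::string doc = "データベース管理";
    lattice->set_sentence(doc.c_str());

    if (tagger->parse(lattice)) {
        for (const MeCab::Node* node = lattice->bos_node(); node;
             node = node->next) {
            /* Skip the sentence-boundary pseudo-nodes. */
            if (node->stat == MECAB_BOS_NODE || node->stat == MECAB_EOS_NODE)
                continue;
            /* Prints 'データベース', then '管理'. */
            std::cout << std::string(node->surface, node->length) << '\n';
        }
    }

    delete lattice;
    delete tagger;
    delete model;
}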
========================================================================
Ngram parser users:
- Users who want to use fulltext search for CJK and other languages without
  word delimiters.

Query Parser Implementation
---------------------------
Just like fts0plugin.cc, we reuse the function 'ft_get_word' and related
macros from MyISAM (used by both the ngram and mecab parsers).

uchar query_get_word(
/*===========*/
	const CHARSET_INFO*	cs,	/*!< in: charset */
	uchar**			start,	/*!< in/out: doc start ptr */
	uchar*			end,	/*!< in/out: doc end ptr */
	FT_WORD*		word,	/*!< in/out: token */
	MYSQL_FTPARSER_BOOLEAN_INFO*	info)	/*!< in/out: token info */

/** Macros and structs below are from ftdefs.h in MyISAM */

/** Check if a char is a true word char */
#define true_word_char(c, ch) ((c) & (_MY_U | _MY_L | _MY_NMR) || (ch) == '_')

/** Check if a char is a misc word char */
#define misc_word_char(X) 0

const char* fts_boolean_syntax = DEFAULT_FTB_SYNTAX;

/** Boolean search operators */
#define FTB_YES   (fts_boolean_syntax[0])
#define FTB_EGAL  (fts_boolean_syntax[1])
#define FTB_NO    (fts_boolean_syntax[2])
#define FTB_INC   (fts_boolean_syntax[3])
#define FTB_DEC   (fts_boolean_syntax[4])
#define FTB_LBR   (fts_boolean_syntax[5])
#define FTB_RBR   (fts_boolean_syntax[6])
#define FTB_NEG   (fts_boolean_syntax[7])
#define FTB_TRUNC (fts_boolean_syntax[8])
#define FTB_LQUOT (fts_boolean_syntax[10])
#define FTB_RQUOT (fts_boolean_syntax[11])

/** FTS query token */
typedef struct st_ft_word {
	uchar*	pos;	/*!< word start pointer */
	uint	len;	/*!< word length */
	double	weight;	/*!< word weight, unused in InnoDB */
} FT_WORD;

Ngram Parser Implementation
---------------------------
We implement the ngram parser as an external parser, and put the source
code under 'plugin/innodb_ngram' rather than inside InnoDB. The ngram
parser is nevertheless loaded automatically when the server starts. It has
the following functions:

.) ngram parser entrance function
We do ngram tokenization and query parsing here.

int ngram_parser_parse(
/*===============*/
	MYSQL_FTPARSER_PARAM*	param)	/*!< in: plugin parser param */

.) ngram tokenization function
We split a document into ngrams in this function; the ngram algorithm is
simple (a simplified sketch of this loop follows below).

int ngram_parse(
/*========*/
	MYSQL_FTPARSER_PARAM*	param,	/*!< in: plugin parser param */
	const char*		doc,	/*!< in: document to parse */
	int			len,	/*!< in: document length */
	MYSQL_FTPARSER_BOOLEAN_INFO*	bool_info)	/*!< in/out: boolean info */

.) term search conversion function
We use this function to convert a term search into a phrase search and to
handle wildcard search. See the conversion rules in the Query Processing
part of the HLS.

int ngram_term_convert(
	MYSQL_FTPARSER_PARAM*	param,		/*!< in: plugin parser param */
	const uchar*		token,		/*!< in: token to parse */
	int			len,		/*!< in: token length */
	MYSQL_FTPARSER_BOOLEAN_INFO*
				bool_info)	/*!< in/out: boolean info */
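Below is a simplified sketch of the ngram_parse() tokenization loop,
assuming a UTF-8 document and a hard-coded token size of 2. The real
function walks characters with the charset's ctype handler, honors
ngram_token_size, and fills bool_info according to the parser mode;
utf8_char_len() is a hypothetical helper.

#include <mysql/plugin_ftparser.h>

/* Hypothetical helper: byte length of a UTF-8 character. */
static size_t utf8_char_len(unsigned char lead) {
    if (lead < 0x80) return 1;
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    return 4;
}

static int ngram_parse_sketch(MYSQL_FTPARSER_PARAM* param,
                              MYSQL_FTPARSER_BOOLEAN_INFO* bool_info) {
    const char* doc = param->doc;
    const char* end = doc + param->length;

    const char* start   = doc;  /* first char of the current bigram */
    const char* next    = doc;  /* one past the char being scanned */
    int         n_chars = 0;    /* chars collected since 'start' */

    while (next < end) {
        size_t mbl = utf8_char_len((unsigned char) *next);
        if (next + mbl > end) break;

        if (*next == ' ') {     /* space: restart the window (so a
                                   leftover single char is dropped,
                                   as in 'a bc' -> 'bc') */
            next   += 1;
            start   = next;
            n_chars = 0;
            continue;
        }

        next += mbl;
        n_chars++;

        if (n_chars == 2) {     /* emit one bigram token */
            param->mysql_add_word(param, (char*) start,
                                  (int) (next - start), bool_info);
            /* slide the window forward by one character */
            start  += utf8_char_len((unsigned char) *start);
            n_chars = 1;
        }
    }
    return 0;
}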
Stopword Handling
-----------------
We do stopword handling inside InnoDB, and we distinguish the ngram parser
from other parsers.

.) fts token check function
We do the fts_min_token_size & fts_max_token_size & stopword checking here,
and make this function the single entry point for these checks.

bool fts_check_token(
/*============*/
	fts_string_t*	token,		/* in: token string */
	ib_rbt_t*	stopwords,	/* in: stopwords */
	bool		is_ngram,	/* in: is ngram parser */
	CHARSET_INFO*	cs)		/* in: token charset */

The related functions that need to be modified are:
a. fts_add_token in fts0fts.cc
b. fts_tokenizer_word_get in fts0fts.cc
c. row_merge_fts_doc_tokenize in row0ftsort.cc
d. fts_query_phrase_split in fts0que.cc

The stopword check algorithm for ngram:
a. First we split the token into ngrams, from unigram up to the ngram token
   size (bigram, for example).
b. Then we check each ngram against the stopwords; if any of them is a
   stopword, we treat the whole token as a stopword.

.) is_ngram flag check and set
First we define a macro for the ngram parser name in ft_global.h:

#define FTS_NGRAM_PARSER_NAME "ngram"

TODO: the optimizer team can also use this macro to set ngram as the
default parser for CJK if no parser is specified.

Then we add 'is_ngram' to index_def, dict_index_t and fts_doc_t, and check
for the ngram parser by name in the handler. In ha_innobase::open and
innobase_create_index_def, we do:

index->is_ngram = strncmp(plugin_name(parser)->str,
			  FTS_NGRAM_PARSER_NAME,
			  plugin_name(parser)->length) == 0
		  ? true : false;

Mecab Parser Implementation
---------------------------
.) Mecab parser plugin init function
We create the MeCab::Model and MeCab::Tagger objects, shared by all
threads. Note: groonga uses '-Owakati' for quick parsing, but we create the
Model with default options because we need word positions.

int mecab_parser_plugin_init(
	void*	arg __attribute__((unused)));

.) Mecab parser plugin deinit function

int mecab_parser_plugin_deinit(
	void*	arg __attribute__((unused)));

.) Mecab parser entry function
We create a MeCab::Lattice object for parsing. The Lattice contains all the
local state required for the analysis, so we create a lattice object for
every call.

int mecab_parser_parse(
	MYSQL_FTPARSER_PARAM*	param)	/*!< in: plugin parser param */

.) Mecab parse function
We parse the document and do the term-to-phrase conversion here.

int mecab_parse(
	MeCab::Lattice*		mecab_lattice,	/*!< in: MeCab lattice */
	MYSQL_FTPARSER_PARAM*	param,		/*!< in: plugin parser param */
	char*			doc,		/*!< in: document to parse */
	int			len,		/*!< in: document length */
	MYSQL_FTPARSER_BOOLEAN_INFO*
				bool_info)	/*!< in/out: boolean info */

A sketch of these object lifetimes follows.
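Below is a sketch of the object lifetimes described above, with the plugin
glue omitted: the Model and Tagger are created once at plugin init and
shared by all threads, while a Lattice holds per-parse state and is created
and destroyed on every call. The *_sketch function names are illustrative,
not the plugin's actual symbols.

#include <mecab.h>
#include <cstddef>

/* Shared by all threads for the lifetime of the plugin. */
static MeCab::Model*  mecab_model  = NULL;
static MeCab::Tagger* mecab_tagger = NULL;

static int mecab_parser_plugin_init_sketch() {
    mecab_model = MeCab::createModel("");  /* reads mecabrc/dictionary */
    if (mecab_model == NULL) return 1;
    mecab_tagger = mecab_model->createTagger();
    return mecab_tagger == NULL;
}

static void mecab_parser_plugin_deinit_sketch() {
    delete mecab_tagger;
    delete mecab_model;
}

/* Called once per document or query string. */
static int mecab_parse_sketch(const char* doc, size_t len) {
    /* Per-call state lives in the lattice, so parsing is thread-safe
       even though the Model and Tagger are shared. */
    MeCab::Lattice* lattice = mecab_model->createLattice();
    lattice->set_sentence(doc, len);

    int err = 0;
    if (mecab_tagger->parse(lattice)) {
        for (const MeCab::Node* node = lattice->bos_node(); node;
             node = node->next) {
            if (node->stat == MECAB_BOS_NODE || node->stat == MECAB_EOS_NODE)
                continue;
            /* Here the real parser hands (node->surface, node->length)
               to param->mysql_add_word(). */
        }
    } else {
        err = 1;
    }

    delete lattice;
    return err;
}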