WL#6943: InnoDB FULLTEXT INDEX: support external parser
Status: Complete — Priority: Medium
Full-text indexes on MyISAM support external parsers. InnoDB needs the same support to be functionally compatible with MyISAM.

The syntax for specifying an external parser is:

  CREATE TABLE t1 (
    id INT AUTO_INCREMENT PRIMARY KEY,
    doc CHAR(255),
    FULLTEXT INDEX (doc) WITH PARSER my_parser
  ) ENGINE=InnoDB;

  ALTER TABLE articles ADD FULLTEXT INDEX (body) WITH PARSER my_parser;

  CREATE FULLTEXT INDEX ft_index ON articles(body) WITH PARSER my_parser;

We can also create a MyISAM table with a parser and then alter it to InnoDB.

http://dev.mysql.com/doc/refman/5.5/en/writing-full-text-plugins.html
http://dev.mysql.com/doc/refman/5.6/en/fulltext-boolean.html

Differences with MyISAM:

1. The plugin parser should return word position information when tokenizing. See High-Level Specification for more details.

2. The return value of 'mysql_add_word' should be checked. See High-Level Specification for more details.

3. Stopword handling. MyISAM checks stopwords, ft_min_token_size and ft_max_token_size inside the tokenizer or plugin parser. InnoDB performs these checks outside the tokenizer, so a plugin parser does not need to check stopwords itself; it should simply return every word to InnoDB (strongly suggested). If a plugin parser does want to handle stopwords, that is fine in mode 'MYSQL_FTPARSER_SIMPLE_MODE', which is used for FTS index build and natural language search. In modes 'MYSQL_FTPARSER_WITH_STOPWORDS' and 'MYSQL_FTPARSER_FULL_BOOLEAN_INFO', however, the parser should return every word, including stopwords, to InnoDB, because phrase search may otherwise return unexpected results.

Limitations:

1. Proximity search is not supported. "@ distance" is not part of the current plugin parser framework, since MyISAM has no proximity search. 'st_mysql_ftparser_boolean_info' carries no proximity information, so InnoDB cannot recognize a proximity query: the distance is treated as a word and '@' is ignored. For example, AGAINST('("mysql database")@3' IN BOOLEAN MODE) is equivalent to the phrase search ('"mysql database" 3').
If we want to support proximity search, we need to extend the plugin parser framework.
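
As an illustration of differences 1 and 2 above, here is a minimal sketch in C of a parser that reports a position with every word and stops as soon as the word callback returns non-zero. The types and names (sketch_param, sketch_bool_info, sketch_parse, collect_add_word) are hypothetical stand-ins and are not the server's plugin header. The position is assumed to be a byte offset into the document, and words are assumed to be runs of ASCII alphanumerics.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the plugin structures.
   They are NOT the definitions from the server's plugin header. */
typedef struct sketch_bool_info {
    int position;               /* assumed: byte offset of the word in the doc */
} sketch_bool_info;

typedef struct sketch_param sketch_param;
struct sketch_param {
    /* Callback supplied by the engine; non-zero means "stop tokenizing". */
    int (*add_word)(sketch_param *param, const char *word, int len,
                    sketch_bool_info *info);
    const char *doc;            /* document to tokenize */
    int length;                 /* document length in bytes */
    void *state;                /* engine-private data */
};

/* Split the document on non-alphanumeric bytes. Each word is passed to
   add_word together with its offset; a non-zero return ends the scan. */
static int sketch_parse(sketch_param *param)
{
    const char *doc = param->doc;
    int i = 0;

    while (i < param->length) {
        while (i < param->length && !isalnum((unsigned char)doc[i]))
            i++;
        int start = i;
        while (i < param->length && isalnum((unsigned char)doc[i]))
            i++;
        if (i > start) {
            sketch_bool_info info = { start };
            int ret = param->add_word(param, doc + start, i - start, &info);
            if (ret != 0)
                return ret;     /* the engine asked us to stop early */
        }
    }
    return 0;
}

/* Example engine-side callback: record the last position seen and ask
   the parser to stop once `limit` words have been received. */
typedef struct collect_state {
    int count;
    int limit;
    int last_position;
} collect_state;

static int collect_add_word(sketch_param *param, const char *word, int len,
                            sketch_bool_info *info)
{
    collect_state *st = param->state;
    (void)word;
    (void)len;
    st->count++;
    st->last_position = info->position;
    return st->count >= st->limit ? 1 : 0;
}
```

In this sketch the early stop plays the role that a decided phrase match, or a syntax error during query parsing, would play on the engine side.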
Problem Statement:
------------------
Support full-text plugin parsers for tokenization and query parsing in InnoDB.

The current implementation in InnoDB:
- Uses 'innobase_mysql_fts_get_token' to tokenize documents directly.
- Uses bison/flex to parse queries.
- Has no interface or framework for full-text plugin parsers.

High Level Approach Description:
--------------------------------
- Extend the full-text framework to support word position.
  MyISAM does not care about a word's position, but InnoDB does: it uses word positions to speed up phrase and proximity search.

- Check the return value of 'mysql_add_word' to end tokenization when necessary.
  MyISAM does not check the return value, but InnoDB needs to. For phrase search, InnoDB uses a different algorithm from MyISAM (based on word positions) to match the phrase. As soon as a match or a mismatch is determined, tokenization can stop. The return value must also be checked during query parsing, so that parsing stops when a syntax error is found.

- Get the parser for full-text indexes in the data dictionary (dict_index_t).
  Currently InnoDB does not check whether a full-text index was created with a plugin parser. So first we need to get the parser object from MySQL and store it as a member of the index object; then we can use it to tokenize documents or parse queries.

- Use the full-text framework to tokenize documents or parse queries.
  Note that when we use the parser to get tokens, we should follow the same logic as without the parser, such as the stopword, fts_min_token_size and fts_max_token_size checks. There are four places where the plugin parser is used:
  a. Tokenizing a document when doing INSERT.
  b. Tokenizing a document when building an index in parallel.
  c. Tokenizing a document when matching a PHRASE.
  d. Parsing a query when doing a full-text SELECT.

- Provide a default/internal parser.
  MyISAM uses its default parser to tokenize documents and parse queries, but InnoDB does not. We still provide a default parser because a plugin parser may call it and because it is useful for testing. For now, the default parser does not replace InnoDB's current tokenizer and query parser.

========================================================================
Plugin parser users:
- InnoDB developers (InnoDB does not support CJK in full-text search now; a plugin parser is the best framework for this.)
- MySQL end users (users who want customized tokenization, for example treating 'full-text' as a single word rather than as 'full' and 'text'; they can also write their own CJK plugin parsers.)
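
To make the division of work concrete, the sketch below combines the 'full-text' example with the checks that InnoDB applies outside the parser. The parser side keeps a single hyphen between two alphanumeric bytes and returns every word. The engine side then drops stopwords and tokens outside the size limits. All names (sketch_is_word_byte, sketch_keep_token, sketch_count_indexed), the stopword list and the limit values are assumptions made for illustration, not InnoDB's actual code or configuration.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical engine-side limits; the values are illustrative only. */
enum { SKETCH_MIN_TOKEN_SIZE = 3, SKETCH_MAX_TOKEN_SIZE = 84 };

/* Hypothetical stopword list, compared case-sensitively for simplicity. */
static const char *const sketch_stopwords[] = { "the", "and", "for" };

/* Parser side: a word byte is an ASCII alphanumeric, or a hyphen that
   sits directly between two alphanumeric bytes. */
static int sketch_is_word_byte(const char *doc, int len, int i)
{
    unsigned char c = (unsigned char)doc[i];

    if (isalnum(c))
        return 1;
    return c == '-' && i > 0 && i + 1 < len
        && isalnum((unsigned char)doc[i - 1])
        && isalnum((unsigned char)doc[i + 1]);
}

/* Engine side: decide whether a token returned by the parser is indexed. */
static int sketch_keep_token(const char *word, int len)
{
    size_t n = sizeof sketch_stopwords / sizeof sketch_stopwords[0];

    if (len < SKETCH_MIN_TOKEN_SIZE || len > SKETCH_MAX_TOKEN_SIZE)
        return 0;
    for (size_t k = 0; k < n; k++) {
        if (strlen(sketch_stopwords[k]) == (size_t)len
            && strncmp(sketch_stopwords[k], word, (size_t)len) == 0)
            return 0;
    }
    return 1;
}

/* Tokenize the document and count the tokens that survive the
   engine-side checks. *total receives the number of tokens the
   parser returned before filtering. */
static int sketch_count_indexed(const char *doc, int len, int *total)
{
    int kept = 0;
    int i = 0;

    *total = 0;
    while (i < len) {
        while (i < len && !sketch_is_word_byte(doc, len, i))
            i++;
        int start = i;
        while (i < len && sketch_is_word_byte(doc, len, i))
            i++;
        if (i > start) {
            (*total)++;         /* the parser returns every word */
            kept += sketch_keep_token(doc + start, i - start);
        }
    }
    return kept;
}
```

Because the parser returns every word and leaves filtering to the engine, the same tokenizer can serve both index build and phrase search, where stopwords still have to be seen.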
Copyright (c) 2000, 2017, Oracle Corporation and/or its affiliates. All rights reserved.