WL#6943: InnoDB FULLTEXT INDEX: support external parser

Status: Complete   —   Priority: Medium

Fulltext indexes on MyISAM support external parsers. InnoDB needs to support
them as well to be functionally compatible.

The syntax for specifying an external parser is:

  CREATE TABLE tbl_name (
    doc CHAR(255),
    FULLTEXT INDEX (doc) WITH PARSER my_parser
  ) ENGINE=InnoDB;


  CREATE FULLTEXT INDEX ft_index ON articles(body) WITH PARSER my_parser;

We can also create a MyISAM table with a parser and then alter it to InnoDB.


Differences from MyISAM:
1. The plugin parser should return word position info when tokenizing.
See High-Level Specification for more details.

2. The return value of 'mysql_add_word' should be checked.
See High-Level Specification for more details.

3. Stopword handling
MyISAM checks stopwords, ft_min_token_size and ft_max_token_size inside the
tokenizer or plugin parser. InnoDB performs these checks outside the tokenizer,
so a plugin parser does not need to check stopwords itself. We strongly suggest
that it simply return every word to InnoDB.
If a plugin parser does want to handle stopwords, that is fine in mode
'MYSQL_FTPARSER_SIMPLE_MODE', which is used for FTS index build and natural
language search. In mode 'MYSQL_FTPARSER_FULL_BOOLEAN_INFO', however, the
parser should return every word, including stopwords, to InnoDB, because a
phrase search may otherwise return unexpected results.
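
The split of responsibilities in item 3 can be illustrated with a minimal
sketch. This is not InnoDB's code: the function names, the stopword list and
the size limits are all hypothetical. The point is only that the parser hands
over every token and the engine side decides what gets indexed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical limits standing in for the engine's min/max token size. */
enum { MIN_TOKEN_SIZE = 3, MAX_TOKEN_SIZE = 84 };

/* Hypothetical stopword list; a real engine would load it from a table. */
static const char *const stopwords[] = { "the", "and", "for" };

/* Returns true if the token (not NUL-terminated) is in the stopword list. */
bool is_stopword(const char *word, size_t len)
{
    for (size_t i = 0; i < sizeof stopwords / sizeof stopwords[0]; i++) {
        if (strlen(stopwords[i]) == len &&
            memcmp(stopwords[i], word, len) == 0)
            return true;
    }
    return false;
}

/* Engine-side check applied to every token the parser returns.
   The parser itself does no filtering. */
bool token_is_indexable(const char *word, size_t len)
{
    if (len < MIN_TOKEN_SIZE || len > MAX_TOKEN_SIZE)
        return false;
    return !is_stopword(word, len);
}
```

With this arrangement a phrase matcher can still see stopwords, since it
receives the unfiltered token stream and filtering is applied only where the
engine chooses to apply it.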

Limitations:
1. Proximity search is not supported.
   "@ distance" is not part of the current plugin parser framework, and MyISAM
has no such proximity search. 'st_mysql_ftparser_boolean_info' carries no
proximity info, so InnoDB cannot recognize a proximity request: the distance is
treated as a word and '@' is ignored.
   For example, AGAINST('("mysql database")@3' IN BOOLEAN MODE) is equivalent
to the phrase search ('"mysql database" 3').
   To support proximity search, we would need to extend the plugin parser
framework.

Problem Statement:

Support fulltext plugin parsers for tokenization and query parsing in InnoDB.

The current implementation in InnoDB:
- Uses 'innobase_mysql_fts_get_token' to tokenize documents directly.
- Uses bison/flex to parse queries.
- Has no interface or framework for fulltext plugin parsers.

High Level Approach Description:

- Extend the fulltext framework to support word position. MyISAM doesn't care
about a word's position, but InnoDB does: it uses word positions to speed up
phrase and proximity search.

- Check the return value of 'mysql_add_word' to end tokenization when
necessary. MyISAM doesn't check the return value, but InnoDB needs to. For
phrase search, InnoDB uses a different algorithm from MyISAM (based on word
position) to match the phrase, so tokenization can stop as soon as the first
match or mismatch is found. The return value also needs to be checked in query
parsing, so that parsing stops when a syntax error is found.

- Get the parser for fulltext indexes in the data dictionary (dict_index_t).
InnoDB currently doesn't check whether a fulltext index was created with a
plugin parser, so first we need to get the parser object from MySQL and store
it as a member of the index object. Then we can use it to tokenize documents or
parse queries.

- Use the fulltext framework to tokenize documents and parse queries. Note that
when we use the parser to get tokens, we should follow the same logic as we do
without the parser, such as the stopword, fts_min_token_size and
fts_max_token_size checks.
  There are four places where we need to use the plugin parser:
  a. Tokenizing a document when doing INSERT.
  b. Tokenizing a document when building an index in parallel.
  c. Tokenizing a document when matching a phrase.
  d. Parsing a query when doing a fulltext SELECT.

- Provide a default/internal parser. MyISAM uses its default parser to tokenize
documents and parse queries, but InnoDB doesn't. The reason we still provide a
default parser is that plugin parsers may use it, and we can use it for
testing. For now, we don't use the default parser to replace our current
tokenizer and parser.
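
The first two points above (word position and checking the callback's return
value) can be sketched as follows. This is a simplified stand-in, not the real
plugin interface: the callback type 'add_word_fn', the whitespace tokenizer and
'struct phrase_ctx' are all hypothetical, and the position is assumed to be a
byte offset into the document. The sketch only shows the early stop on a match,
not the early stop on a mismatch.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: receives one token and its byte offset in the
   document; a non-zero return asks the tokenizer to stop. */
typedef int (*add_word_fn)(void *ctx, const char *word, size_t len, size_t pos);

/* Hypothetical whitespace tokenizer standing in for a plugin parser.
   Returns how many tokens were handed to the callback. */
size_t tokenize(const char *doc, add_word_fn add_word, void *ctx)
{
    size_t n = strlen(doc), i = 0, handed = 0;

    while (i < n) {
        while (i < n && isspace((unsigned char)doc[i]))
            i++;
        size_t start = i;
        while (i < n && !isspace((unsigned char)doc[i]))
            i++;
        if (i > start) {
            handed++;
            /* The return value is checked: stop once the consumer is done. */
            if (add_word(ctx, doc + start, i - start, start) != 0)
                break;
        }
    }
    return handed;
}

enum { MAX_PHRASE_WORDS = 8 };

/* State for matching a phrase against the last n_words tokens seen. */
struct phrase_ctx {
    const char *const *words;   /* phrase words, in order */
    size_t n_words;             /* 1..MAX_PHRASE_WORDS */
    const char *win_word[MAX_PHRASE_WORDS];
    size_t win_len[MAX_PHRASE_WORDS];
    size_t win_pos[MAX_PHRASE_WORDS];
    size_t seen;                /* tokens received so far */
    int found;
    size_t match_pos;           /* offset of the first matched phrase word */
};

/* Callback: keeps a sliding window of tokens and stops on the first match. */
int phrase_add_word(void *arg, const char *word, size_t len, size_t pos)
{
    struct phrase_ctx *ctx = arg;
    size_t slot = ctx->seen % ctx->n_words;

    ctx->win_word[slot] = word;
    ctx->win_len[slot] = len;
    ctx->win_pos[slot] = pos;
    ctx->seen++;
    if (ctx->seen < ctx->n_words)
        return 0;               /* window not full yet */

    size_t first = ctx->seen % ctx->n_words;    /* oldest token in window */
    for (size_t k = 0; k < ctx->n_words; k++) {
        size_t s = (first + k) % ctx->n_words;
        if (strlen(ctx->words[k]) != ctx->win_len[s] ||
            memcmp(ctx->words[k], ctx->win_word[s], ctx->win_len[s]) != 0)
            return 0;           /* no match here, keep tokenizing */
    }
    ctx->found = 1;
    ctx->match_pos = ctx->win_pos[first];
    return 1;                   /* phrase found, ask the tokenizer to stop */
}
```

Tokenizing "we use mysql database every day" with the phrase words "mysql" and
"database" stops after the fourth token, so the rest of the document is never
scanned. A tokenizer that ignored the return value would keep going to the end
of the document.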


Plugin parser users:

- InnoDB developers (we don't support CJK in fulltext search now, and a plugin
parser is the best framework for this).
- MySQL end users (these users want customized tokenization, for example
treating 'full-text' as a single word rather than 'full' and 'text', and they
can also have their own CJK plugin parsers).
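
As an illustration of the end-user case, here is a minimal sketch of a
tokenization rule that keeps 'full-text' as one word. It is not tied to the
plugin API: 'next_token' and its rule (a hyphen belongs to a token only when
it sits between two alphanumeric characters) are assumptions made for this
example, and it handles single-byte characters only.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical rule: alphanumerics are word characters, and a hyphen is a
   word character only inside a token and when followed by an alphanumeric. */
static int is_word_char(const char *s, size_t i, size_t start)
{
    unsigned char c = (unsigned char)s[i];

    if (isalnum(c))
        return 1;
    return c == '-' && i > start && isalnum((unsigned char)s[i + 1]);
}

/* Scans doc from *cursor. On success stores the token's byte offset and
   length, advances *cursor past the token and returns 1. Returns 0 when
   no token is left. */
int next_token(const char *doc, size_t *cursor, size_t *offset, size_t *len)
{
    size_t i = *cursor;

    /* Skip separators; a token always starts with an alphanumeric. */
    while (doc[i] != '\0' && !isalnum((unsigned char)doc[i]))
        i++;
    if (doc[i] == '\0') {
        *cursor = i;
        return 0;
    }

    size_t start = i;
    while (doc[i] != '\0' && is_word_char(doc, i, start))
        i++;

    *offset = start;
    *len = i - start;
    *cursor = i;
    return 1;
}
```

For the input "full-text search - MySQL" this rule yields the tokens
"full-text", "search" and "MySQL"; the free-standing hyphen is treated as a
separator. A real plugin would pass each token, together with its offset, to
the server's add-word callback.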