WL#2575: Fulltext: Parser plugin for FTS

Affects: Server-5.1   —   Status: Complete   —   Priority: Medium

Let a user provide a function that acts as an input filter on the indexed
text - the fulltext parser sees the text only after this function has processed it.

This can be used for anything. E.g. stopwords, always-index words, stemming,
thesaurus, fuzzy matching, CJK parser can be implemented completely in this
function. Microsoft Word, PDF, RTF, HTML, XML parsers can be implemented in this

It could be a UDF - that is, the UDF framework and API can be reused - if we
define that this preparser can return only what a UDF can return. That is, a
string of non-space sequences separated by spaces.

On the other hand, if it works as an iterator - something a UDF can hardly do
- it could avoid huge mallocs and memcpys.
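
The iterator shape can be sketched roughly as follows. This is only an
illustration of the idea: word_cb, for_each_word and sum_lengths are
hypothetical names, not part of any MySQL API, and splitting on whitespace is
an assumption. Each word is handed over as a pointer and a length into the
original buffer, so no joined result string has to be allocated or copied.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical callback type: receives one word as (pointer, length)
   into the original buffer; nothing is copied. */
typedef int (*word_cb)(const char *word, size_t len, void *arg);

/* Iterator-style parser: walks the document once and hands each
   whitespace-delimited word to the callback. Returns the number of
   words delivered, or -1 if the callback reported an error. */
static int for_each_word(const char *doc, size_t length, word_cb cb, void *arg)
{
  size_t i = 0, start;
  int words = 0;

  assert(cb != NULL);
  while (i < length)
  {
    while (i < length && isspace((unsigned char) doc[i]))
      i++;                                  /* skip separators */
    start = i;
    while (i < length && !isspace((unsigned char) doc[i]))
      i++;                                  /* scan one word */
    if (i > start)
    {
      if (cb(doc + start, i - start, arg))
        return -1;
      words++;
    }
  }
  return words;
}

/* Example callback: accumulates the total length of all words. */
static int sum_lengths(const char *word, size_t len, void *arg)
{
  (void) word;
  *(size_t *) arg += len;
  return 0;
}
```

A UDF-style filter would instead have to build and return the whole
space-separated string before the caller could look at the first word.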

We have agreed as follows:
"MySQL will allow client to customize the function that extracts indexed 
items from a document.  This function will be used at indexing, to parse 
documents, and at search as well, to parse search queries. The parser 
will be used to extract words only, e.g. always-index and synonyms feature 
will still be usable even if the parser used for a particular index is not 
the default one.  Current Interface for this feature is: 

   CREATE TABLE t1 (title VARCHAR(255), keywords VARCHAR(255), 
                    body VARCHAR(3000)), 
                    FULLTEXT KEY k1 (title, keywords, body) 
                    WITH PARSER {udf_function_name});"
-- Trudy Pelzer
It won't be a UDF - the fulltext [pre-]parser will be a MySQL plugin (WL#2761).

struct st_fulltext_parser {
  int interface_version;
  const char *name;
  int (*parse)(void *state, int mode, CHARSET_INFO *cs, byte *doc, uint length,
               void *param);
  int (*init)(void *state);
  int (*deinit)(void *state);
  int (*init_once)();
  int (*deinit_once)();
};

Functions are:

  init_once() - it's called when the plugin is loaded.
  deinit_once() - when it's unloaded.

  init() - is called in the beginning of the query
  deinit() - at the end

  void *state - is just a place where [pre-]parser can store anything it likes,
and be sure to get these data back in parse() and deinit() calls.
[de]init_once() family does not need a 'state', as they can use global variables
for that.
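
The call order described above could look roughly like this. It is a sketch
only: demo_parser is a cut-down stand-in for the descriptor (lifecycle hooks
only), and demo_state, demo_loaded and run_query are hypothetical names. It
also assumes that 'state' points to a void* slot owned by the caller, which
this worklog does not spell out.

```c
#include <assert.h>
#include <stdlib.h>

/* Cut-down stand-in for the descriptor: lifecycle hooks only. */
struct demo_parser
{
  int (*init)(void *state);
  int (*deinit)(void *state);
  int (*init_once)(void);
  int (*deinit_once)(void);
};

/* The *_once() pair needs no 'state': a global is enough. */
static int demo_loaded = 0;

/* Per-query data kept behind 'state'.
   ASSUMPTION: 'state' points to a void* slot owned by the caller. */
struct demo_state { int documents_seen; };

static int demo_init_once(void)   { demo_loaded = 1; return 0; }
static int demo_deinit_once(void) { demo_loaded = 0; return 0; }

static int demo_init(void *state)
{
  struct demo_state *st = calloc(1, sizeof(*st));
  if (!st)
    return 1;
  *(void **) state = st;
  return 0;
}

static int demo_deinit(void *state)
{
  free(*(void **) state);
  *(void **) state = NULL;
  return 0;
}

static const struct demo_parser demo =
{ demo_init, demo_deinit, demo_init_once, demo_deinit_once };

/* Hypothetical driver for one query: init(), some parse() calls, deinit().
   Returns the number of documents seen, or -1 if init() failed. */
static int run_query(const struct demo_parser *p, int ndocs)
{
  void *slot = NULL;
  int i, seen;

  if (p->init(&slot))
    return -1;
  for (i = 0; i < ndocs; i++)
    ((struct demo_state *) slot)->documents_seen++;  /* stands in for parse() */
  seen = ((struct demo_state *) slot)->documents_seen;
  p->deinit(&slot);
  assert(slot == NULL);
  return seen;
}
```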

  parse() does the real job. It is called by MySQL to, well, parse a document.
It has then two possibilities:

    - if it's a preparser that wants MySQL fulltext parser to do the parsing
itself (example: pdf-to-text converter) it calls ft_parse() with the converted text.

    - if it's a complete parser that splits the text into the stream of words,
it calls ft_parse_get_word() for every word in the document.

ft_parse() is basically the current ft_parse() in ft_parser.c
ft_parse_get_word() is basically the body of the inner loop of the current
ft_parse() - the code that adds the word to the TREE.
The void *param argument of the parse() method is something that the method
should pass unchanged to the ft_parse/ft_parse_get_word functions.

ft_parse/ft_parse_get_word are tentative names, better names are welcome.
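
The two kinds of parse() can be sketched as below. Everything here is
illustrative: stub_ft_parse() and stub_ft_parse_get_word() are local stand-ins
with assumed signatures for the tentative ft_parse/ft_parse_get_word functions,
demo_param stands in for the opaque 'param', and the parse() signature is
simplified (no state, mode or charset).

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Stand-in for the opaque 'param': just counts the words that reach the index. */
struct demo_param { int words; };

/* Stub for ft_parse_get_word(): the real one would add the word to the TREE. */
static int stub_ft_parse_get_word(void *param, const char *word, size_t len)
{
  assert(param != NULL);
  (void) word; (void) len;
  ((struct demo_param *) param)->words++;
  return 0;
}

/* Stub for ft_parse(): the built-in parser, here splitting on
   non-alphanumeric characters. */
static int stub_ft_parse(void *param, const char *doc, size_t length)
{
  size_t i = 0, start;
  while (i < length)
  {
    while (i < length && !isalnum((unsigned char) doc[i]))
      i++;
    start = i;
    while (i < length && isalnum((unsigned char) doc[i]))
      i++;
    if (i > start && stub_ft_parse_get_word(param, doc + start, i - start))
      return 1;
  }
  return 0;
}

/* Preparser: converts the document (strips <...> tags) and lets the
   built-in parser split the converted text. */
static int tag_stripper_parse(const char *doc, size_t length, void *param)
{
  char text[256];
  size_t i, out = 0;
  int in_tag = 0;

  if (length > sizeof(text))
    return 1;                         /* the sketch handles short documents only */
  for (i = 0; i < length; i++)
  {
    if (doc[i] == '<')      { in_tag = 1; text[out++] = ' '; }
    else if (doc[i] == '>') in_tag = 0;
    else if (!in_tag)       text[out++] = doc[i];
  }
  return stub_ft_parse(param, text, out);
}

/* Complete parser: splits a comma-separated list itself and reports
   every non-empty item as one word. */
static int comma_list_parse(const char *doc, size_t length, void *param)
{
  size_t i, start = 0;
  for (i = 0; i <= length; i++)
  {
    if (i == length || doc[i] == ',')
    {
      if (i > start && stub_ft_parse_get_word(param, doc + start, i - start))
        return 1;
      start = i + 1;
    }
  }
  return 0;
}
```

In both cases 'param' is never inspected by the plugin, only handed back.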


Q & A

* what to do if a plugin is dropped ?
  (udf [pre]parser, but applies to storage engine, etc)
- do nothing, just drop it. on access of the table that has that ft index
  complain, and refuse to open.
* one can drop a [pre]parser and load a new one with the same name, that
  parses differently
- do nothing, not our problem. we cannot control plugin code, we cannot
  even guarantee that it'll parse the same document always the same way.
  we cannot control plugins - so we won't even try to.
* when to call a [pre]parser ? document only or document+query ?
- strictly speaking, we need both. how to do it ?
- simple solution, for now, - call parser both for the query and document.
  for the pdf->txt case, the plugin should pass plain text through unchanged.
* there should be no direct calls to ft_get_word, ft_simple_word, ft_parse
  currently they are called in:
    _ftb_parse_query - simple. it should be converted to a function that
                       will be called by plugin/ft_parse instead of
    ft_parse()        - leave it where it is
    ft_stopwords.c    - leave it where it is. no easy way to define what
                        parser should be used
    _ftb_check_phrase - no choice, should be rewritten to be called as
    ft_boolean_find_relevance - same here, inner loop should be moved to a
                        function that'll be called instead of
    ft_update.c       - call plugin
    ft_nlq_search.c   - call plugin

myisamchk apparently won't be able to check/repair table that has a fulltext
index with a parser plugin, so it'll refuse to do so (one will have to use
CHECK/REPAIR from within MySQL server).

MI_KEYDEF type needs to be extended with:
uint32 ftparser_nr - distinct parser number starting with 1. 0 is for
boolean search w/o index. Initially set to 0.

MI_INFO type needs to be extended with:
MYSQL_FTPARSER_PARAM *ftparser_param. Initially set to 0.
uint32 ftparsers. Number of distinct parsers.

In mi_open() ftparser_nr must be renumbered from 0 to N, where N is
number of distinct parsers. Set info->ftparsers to N.
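
A possible shape for the renumbering, as a sketch only. demo_key and
renumber_parsers are hypothetical names. The sketch assumes each key carries a
pointer identifying its parser, that keys sharing a parser share a number, and
that numbers run from 1 to N with 0 left for "no parser", as described for
MI_KEYDEF above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cut-down key definition. */
struct demo_key
{
  const void *parser;      /* NULL when the key has no fulltext parser */
  unsigned ftparser_nr;    /* 0 = none, otherwise 1..N */
};

/* Assigns the same number to keys that share a parser and returns N,
   the number of distinct parsers. */
static unsigned renumber_parsers(struct demo_key *keys, size_t nkeys)
{
  unsigned distinct = 0;
  size_t i, j;

  assert(keys != NULL || nkeys == 0);
  for (i = 0; i < nkeys; i++)
  {
    keys[i].ftparser_nr = 0;
    if (!keys[i].parser)
      continue;
    for (j = 0; j < i; j++)           /* parser already seen on an earlier key? */
    {
      if (keys[j].parser == keys[i].parser)
      {
        keys[i].ftparser_nr = keys[j].ftparser_nr;
        break;
      }
    }
    if (!keys[i].ftparser_nr)
      keys[i].ftparser_nr = ++distinct;
  }
  return distinct;
}
```

The return value is what would be stored in info->ftparsers.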

Two additional functions must be added in ft_parser.c.

The first one initializes the parser for a given ftparser_nr:
  if (! ftparser_param)
    - allocate ftparser_param - sizeof(MYSQL_FTPARSER_PARAM) * (N + 1);
    - bzero ftparser_param;
  if (! ftparser_param[ftparser_nr].mysql_add_word)
  {
    /* non-zero mysql_add_word marks the slot as initialized */
    ftparser_param[ftparser_nr].mysql_add_word= 1;
    if (parser->init && parser->init(&ftparser_param[ftparser_nr]))
      return 0;
  }
  return &ftparser_param[ftparser_nr];

The second one releases everything:
  - call deinit for each parser;
  - free ftparser_param;
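
The pseudocode above could be fleshed out along these lines. This is a sketch
under stated assumptions, not the actual implementation: all demo_* names are
hypothetical, demo_ftparam is a cut-down stand-in for MYSQL_FTPARSER_PARAM in
which mysql_add_word is a plain int used only as an "initialized" marker, and
the parsers are passed in as an array indexed by ftparser_nr.

```c
#include <assert.h>
#include <stdlib.h>

/* Cut-down stand-ins for MYSQL_FTPARSER_PARAM, the plugin and MI_INFO. */
struct demo_ftparam { int mysql_add_word; void *state; };
struct demo_plugin
{
  int (*init)(struct demo_ftparam *param);
  int (*deinit)(struct demo_ftparam *param);
};
struct demo_info
{
  struct demo_ftparam *ftparser_param;   /* N + 1 slots, allocated lazily */
  unsigned ftparsers;                    /* N, number of distinct parsers */
};

/* Allocates the param array on first use and calls init() at most once
   per slot. Returns the slot, or NULL on failure. */
static struct demo_ftparam *
demo_call_initializer(struct demo_info *info, const struct demo_plugin *parser,
                      unsigned ftparser_nr)
{
  assert(ftparser_nr <= info->ftparsers);
  if (!info->ftparser_param)
  {
    /* calloc() both allocates and zero-fills (the "bzero" step) */
    info->ftparser_param = calloc(info->ftparsers + 1,
                                  sizeof(*info->ftparser_param));
    if (!info->ftparser_param)
      return NULL;
  }
  if (!info->ftparser_param[ftparser_nr].mysql_add_word)
  {
    info->ftparser_param[ftparser_nr].mysql_add_word = 1;
    if (parser->init && parser->init(&info->ftparser_param[ftparser_nr]))
      return NULL;
  }
  return &info->ftparser_param[ftparser_nr];
}

/* Calls deinit() for every initialized slot, then frees the array.
   parsers[] is indexed by ftparser_nr and has N + 1 entries. */
static void demo_call_deinitializer(struct demo_info *info,
                                    const struct demo_plugin **parsers)
{
  unsigned i;
  if (!info->ftparser_param)
    return;
  for (i = 0; i <= info->ftparsers; i++)
  {
    if (info->ftparser_param[i].mysql_add_word &&
        parsers[i] && parsers[i]->deinit)
      parsers[i]->deinit(&info->ftparser_param[i]);
  }
  free(info->ftparser_param);
  info->ftparser_param = NULL;
}

/* Counting hooks, only to make the call pattern observable. */
static int demo_inits = 0, demo_deinits = 0;
static int count_init(struct demo_ftparam *p)   { (void) p; demo_inits++; return 0; }
static int count_deinit(struct demo_ftparam *p) { (void) p; demo_deinits++; return 0; }
```

Calling the initializer twice for the same ftparser_nr returns the same slot
and runs init() only once.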

parser->init() must be called in ft_init_boolean_search(),
ft_init_nlq_search(), _mi_ft_parserecord() before a call to parser->parse().

parser->deinit() function must be called in mi_lock_database(F_UNLCK).

Each call to parser->parse() needs to be modified to use
ftparser_param[ftparser_nr] as an argument.

Relies on sql layer flag:
Additional flags to be added into include/my_base.h:
#define HA_OPTION_RELIES_ON_SQL_LAYER 512 /* for the table */
#define HA_OPEN_FROM_SQL_LAYER 64 /* for the mi_open() */
#define HA_CREATE_RELIES_ON_SQL_LAYER 128 /* for the mi_create() */

As soon as the options are read (in mi_open()), an additional check must be performed:
if (! (open_flags & HA_OPEN_FROM_SQL_LAYER) &&
    share->options & HA_OPTION_RELIES_ON_SQL_LAYER)
  goto err;

ha_myisam::open() must pass HA_OPEN_FROM_SQL_LAYER flag to mi_open():
mi_open(name, mode, test_if_locked | HA_OPEN_FROM_SQL_LAYER);
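
The effect of the check can be tried out in isolation. In this sketch the two
flag values are the ones proposed above; demo_open_allowed and
demo_open_selfcheck are hypothetical helpers, not functions in MyISAM.

```c
#include <assert.h>

#define HA_OPTION_RELIES_ON_SQL_LAYER 512 /* for the table */
#define HA_OPEN_FROM_SQL_LAYER 64         /* for the mi_open() */

/* Returns 1 if the open may proceed, 0 where mi_open() would "goto err". */
static int demo_open_allowed(unsigned open_flags, unsigned table_options)
{
  if (!(open_flags & HA_OPEN_FROM_SQL_LAYER) &&
      (table_options & HA_OPTION_RELIES_ON_SQL_LAYER))
    return 0;
  return 1;
}

/* Walks the four combinations: only a table that relies on the SQL layer
   and is opened from outside it (e.g. by myisamchk) is refused. */
static void demo_open_selfcheck(void)
{
  assert(demo_open_allowed(0, 0));
  assert(!demo_open_allowed(0, HA_OPTION_RELIES_ON_SQL_LAYER));
  assert(demo_open_allowed(HA_OPEN_FROM_SQL_LAYER, 0));
  assert(demo_open_allowed(HA_OPEN_FROM_SQL_LAYER,
                           HA_OPTION_RELIES_ON_SQL_LAYER));
}
```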

ha_myisam::create() must pass the HA_CREATE_RELIES_ON_SQL_LAYER flag to
mi_create():
if (keydef[i].flag & HA_USES_PARSER)

mi_create() must save HA_OPTION_RELIES_ON_SQL_LAYER: