WL#2575: Fulltext: Parser plugin for FTS
Affects: Server-5.1
—
Status: Complete
Let a user provide a function that acts as an input filter on the indexed text: the fulltext parser sees the text only after this function has processed it. This can be used for anything. E.g. stopwords, always-index words, stemming, thesaurus, fuzzy matching, or a CJK parser can be implemented completely in this function. Microsoft Word, PDF, RTF, HTML, and XML parsers can be implemented in this function as well.

It could be a UDF (that is, the UDF framework and API can be reused) if we define that this preparser returns something a UDF can return, namely a string of non-space sequences separated by spaces. On the other hand, if it works as an iterator (something a UDF can hardly do), it could avoid huge mallocs and memcpys.

2005-06-22
We have agreed as follows:

"MySQL will allow the client to customize the function that extracts indexed items from a document. This function will be used at indexing, to parse documents, and at search as well, to parse search queries. The parser will be used to extract words only, e.g. the always-index and synonyms features will still be usable even if the parser used for a particular index is not the default one.

The current interface for this feature is:

  CREATE TABLE t1 (title VARCHAR(255), keywords VARCHAR(255),
    body VARCHAR(3000),
    FULLTEXT KEY k1 (title, keywords, body)
    WITH PARSER {udf_function_name});"
-- Trudy Pelzer
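The trade-off between returning one space-separated string and working as an iterator can be illustrated with a callback-driven splitter. This is only a sketch under stated assumptions: split_words(), word_cb, and collect_stats() are hypothetical names invented for the illustration and are not part of MySQL. Each word is reported as a pointer and length into the original buffer, so nothing is allocated or copied.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical callback type: receives one word as (pointer, length)
   into the original document buffer. A non-zero return stops the scan. */
typedef int (*word_cb)(void *arg, const char *word, size_t len);

/* Walk `doc` and report every run of non-space bytes to `cb`.
   No memory is allocated and no byte is copied. Returns the first
   non-zero value returned by the callback, or 0 if the scan completed. */
int split_words(const char *doc, size_t len, word_cb cb, void *arg)
{
  size_t i = 0;
  assert(doc != NULL && cb != NULL);
  while (i < len) {
    size_t start;
    int rc;
    while (i < len && isspace((unsigned char) doc[i]))
      i++;                               /* skip separators */
    if (i == len)
      break;                             /* only trailing spaces were left */
    start = i;
    while (i < len && !isspace((unsigned char) doc[i]))
      i++;                               /* consume one word */
    if ((rc = cb(arg, doc + start, i - start)) != 0)
      return rc;
  }
  return 0;
}

/* Example consumer: counts words and remembers the longest length. */
struct word_stats { size_t count; size_t longest; };

int collect_stats(void *arg, const char *word, size_t len)
{
  struct word_stats *st = arg;
  (void) word;                           /* the word text is not needed here */
  st->count++;
  if (len > st->longest)
    st->longest = len;
  return 0;
}
```

A caller would pass a zero-initialized struct word_stats as `arg` and read the totals after split_words() returns. A UDF-style interface would instead have to build and return the whole space-separated string before any word could be consumed.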
It won't be a UDF: the fulltext [pre-]parser will be a MySQL plugin (WL#2761).

  #define FTPARSER_INTERFACE_VERSION 0x0000

  #define FTPARSER_BOOLEAN_MODE 1
  #define FTPARSER_SIMPLE_MODE  0

  struct st_fulltext_parser {
    int interface_version;
    const char *name;
    int (*parse)(void *state, int mode, CHARSET_INFO *cs,
                 byte *doc, uint length, void *param);
    int (*init)(void *state);
    int (*deinit)(void *state);
    int (*init_once)();
    int (*deinit_once)();
  };

Functions are:

  init_once()   - called when the plugin is loaded.
  deinit_once() - called when the plugin is unloaded.
  init()        - called at the beginning of the query.
  deinit()      - called at the end of the query.

void *state is just a place where the [pre-]parser can store anything it likes and be sure to get that data back in the parse() and deinit() calls. The [de]init_once() family does not need a 'state', as these functions can use global variables instead.

parse() does the real job. It is called by MySQL to, well, parse a document. It then has two possibilities:

  - if it's a preparser that wants the MySQL fulltext parser to do the parsing itself (example: a pdf-to-text converter), it calls ft_parse() with the converted text.
  - if it's a complete parser that splits the text into a stream of words, it calls ft_parse_add_word() for every word in the document.

ft_parse() is basically the current ft_parse() in ft_parser.c.
ft_parse_get_word() is basically the body of the inner loop of the current ft_parse(): the code that adds the word to the TREE.

The void *param argument of the parse() method is something that this method should pass directly to the ft_parse/ft_parse_get_word functions.

ft_parse/ft_parse_get_word are tentative names; better names are welcome, e.g.
  mysql_ft_parse_text()
  mysql_ft_parse_store_word()

Q & A
^^^^^

* what to do if a plugin is dropped? (a udf [pre]parser, but this applies to storage engines, etc.)
  - do nothing, just drop it. On access to a table that has that ft index, complain and refuse to open it.

* one can drop a [pre]parser and load a new one with the same name that parses differently
  - do nothing, not our problem. We cannot control plugin code; we cannot even guarantee that a plugin will always parse the same document the same way. We cannot control plugins, so we won't even try to.

* when to call a [pre]parser? document only or document+query?
  - strictly speaking, we need both. How to do it?
  - simple solution, for now: call the parser both for the query and for the document. For the pdf->txt case, the plugin should pass through plain text.

* there should be no direct calls to ft_get_word, ft_simple_word, ft_parse. Currently they are called in:

  ft_get_word:
    _ftb_parse_query - simple. It should be converted to a function that will be called by plugin/ft_parse instead of ft_parse_store_word.
  ft_simple_word:
    ft_parse() - leave it where it is.
    ft_stopwords.c - leave it where it is; there is no easy way to define which parser should be used.
    _ftb_check_phrase - no choice, it should be rewritten to be called as ft_parse_store_word.
    ft_boolean_find_relevance - same here, the inner loop should be moved to a function that'll be called instead of ft_parse_store_word.
  ft_parse:
    ft_update.c - call plugin.
    ft_nlq_search.c - call plugin.

myisamchk apparently won't be able to check/repair a table that has a fulltext index with a parser plugin, so it will refuse to do so (one will have to use CHECK/REPAIR from within the MySQL server).
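The two parse() styles described above (a preparser that hands converted text back to the built-in parser, and a complete parser that emits words itself) might look roughly like this. It is a sketch, not server code: the CHARSET_INFO, byte, and uint typedefs, struct word_sink, and the bodies of ft_parse() and ft_parse_add_word() are stand-ins assumed only so the example compiles on its own, and both example parsers (csv_parse, tag_strip_parse) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for server types; assumptions made only for this sketch. */
typedef unsigned char byte;
typedef unsigned int uint;
typedef struct charset_info_stub { int unused; } CHARSET_INFO;

#define FTPARSER_INTERFACE_VERSION 0x0000
#define FTPARSER_BOOLEAN_MODE 1
#define FTPARSER_SIMPLE_MODE  0

struct st_fulltext_parser {
  int interface_version;
  const char *name;
  int (*parse)(void *state, int mode, CHARSET_INFO *cs,
               byte *doc, uint length, void *param);
  int (*init)(void *state);
  int (*deinit)(void *state);
  int (*init_once)(void);
  int (*deinit_once)(void);
};

/* Stub for whatever the server hides behind `param`: it records words. */
struct word_sink { uint count; char last[32]; };

/* Stub for ft_parse_add_word(): store one word (truncated to fit). */
int ft_parse_add_word(void *param, byte *word, uint length)
{
  struct word_sink *sink = param;
  uint max = (uint) sizeof(sink->last) - 1;
  uint n = length < max ? length : max;
  assert(sink != NULL);
  memcpy(sink->last, word, n);
  sink->last[n] = '\0';
  sink->count++;
  return 0;
}

/* Stub for the built-in ft_parse(): splits on single spaces only. */
int ft_parse(void *param, byte *doc, uint length)
{
  uint i = 0;
  while (i < length) {
    uint start;
    while (i < length && doc[i] == ' ') i++;
    start = i;
    while (i < length && doc[i] != ' ') i++;
    if (i > start && ft_parse_add_word(param, doc + start, i - start))
      return 1;
  }
  return 0;
}

/* Complete parser: treats the document as a comma-separated word list. */
static int csv_parse(void *state, int mode, CHARSET_INFO *cs,
                     byte *doc, uint length, void *param)
{
  uint start = 0, i;
  (void) state; (void) mode; (void) cs;
  for (i = 0; i <= length; i++) {
    if (i == length || doc[i] == ',') {
      if (i > start && ft_parse_add_word(param, doc + start, i - start))
        return 1;
      start = i + 1;
    }
  }
  return 0;
}

/* Preparser: replaces <...> tags with spaces, then lets ft_parse() split.
   Input beyond the 256-byte scratch buffer is dropped in this sketch. */
static int tag_strip_parse(void *state, int mode, CHARSET_INFO *cs,
                           byte *doc, uint length, void *param)
{
  byte text[256];
  uint i, n = 0;
  int in_tag = 0;
  (void) state; (void) mode; (void) cs;
  for (i = 0; i < length && n < sizeof(text); i++) {
    if (doc[i] == '<') in_tag = 1;
    else if (doc[i] == '>') { in_tag = 0; text[n++] = ' '; }
    else if (!in_tag) text[n++] = doc[i];
  }
  return ft_parse(param, text, n);
}

struct st_fulltext_parser csv_parser = {
  FTPARSER_INTERFACE_VERSION, "csv_example", csv_parse, NULL, NULL, NULL, NULL
};
struct st_fulltext_parser tag_strip_parser = {
  FTPARSER_INTERFACE_VERSION, "tag_strip_example", tag_strip_parse,
  NULL, NULL, NULL, NULL
};
```

In both cases the plugin treats `param` as opaque and only forwards it, which is what lets the server decide where the words end up (an index tree at indexing time, a query structure at search time).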
init()/deinit():

The MI_KEYDEF type needs to be extended with:
  uint32 ftparser_nr - distinct parser number starting with 1. 0 is for boolean search w/o index. Initially set to 0.

The MI_INFO type needs to be extended with:
  MYSQL_FTPARSER_PARAM *ftparser_param - initially set to 0.
  uint32 ftparsers - number of distinct parsers.

In mi_open() ftparser_nr must be renumbered from 0 to N, where N is the number of distinct parsers. Set info->ftparsers to N.

Two additional functions must be added in ft_parser.c:

  ftparser_call_initializer()
  {
    if (! ftparser_param)
    {
      - allocate ftparser_param - sizeof(MYSQL_FTPARSER_PARAM) * (N + 1);
      - bzero ftparser_param;
    }
    if (! ftparser_param[ftparser_nr].mysql_add_word)
    {
      ftparser_param[ftparser_nr].mysql_add_word= 1;
      if (parser->init && parser->init(&ftparser_param[ftparser_nr]))
        return 0;
    }
    return &ftparser_param[ftparser_nr];
  }

  ftparser_call_deinitializer()
  {
    - call deinit for each parser;
    - free ftparser_param;
  }

parser->init() must be called in ft_init_boolean_search(), ft_init_nlq_search(), and _mi_ft_parserecord() before a call to parser->parse().

parser->deinit() must be called in mi_lock_database(F_UNLCK).

Each call to parser->parse() needs to be modified to use ftparser_param[ftparser_nr] as an argument.

Relies on sql layer flag:

Additional flags to be added to include/my_base.h:

  #define HA_OPTION_RELIES_ON_SQL_LAYER 512 /* for the table */
  #define HA_OPEN_FROM_SQL_LAYER        64  /* for the mi_open() */
  #define HA_CREATE_RELIES_ON_SQL_LAYER 128 /* for the mi_create() */

As soon as the options are read (mi_open), an additional check must be performed:

  if (! (open_flags & HA_OPEN_FROM_SQL_LAYER) &&
      share->options & HA_OPTION_RELIES_ON_SQL_LAYER)
  {
    my_errno= HA_ERR_UNSUPPORTED;
    goto err;
  }

ha_myisam::open() must pass the HA_OPEN_FROM_SQL_LAYER flag to mi_open():

  mi_open(name, mode, test_if_locked | HA_OPEN_FROM_SQL_LAYER);

ha_myisam::create() must pass HA_CREATE_RELIES_ON_SQL_LAYER to mi_create():

  if (keydef[i].flag & HA_USES_PARSER)
    create_flags|= HA_CREATE_RELIES_ON_SQL_LAYER;

mi_create() must save HA_OPTION_RELIES_ON_SQL_LAYER:

  if (flags & HA_CREATE_RELIES_ON_SQL_LAYER)
    options|= HA_OPTION_RELIES_ON_SQL_LAYER;
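The flag handling and the lazy per-parser initialization described in this section could be modelled as below. This is a simplified sketch: the three flag values are taken from the text, but FTPARSER_PARAM, struct parser_stub, struct table_stub, and every function name are hypothetical stand-ins for MYSQL_FTPARSER_PARAM, the plugin descriptor, and MI_INFO. Two further assumptions are that mysql_add_word is modelled as a plain int "initialized" marker and that deinit is called only for slots that were actually initialized.

```c
#include <assert.h>
#include <stdlib.h>

/* Flag values as listed in the text. */
#define HA_OPTION_RELIES_ON_SQL_LAYER 512  /* stored with the table */
#define HA_OPEN_FROM_SQL_LAYER        64   /* passed to mi_open() */
#define HA_CREATE_RELIES_ON_SQL_LAYER 128  /* passed to mi_create() */

/* mi_create() side: turn the create flag into a persistent table option. */
unsigned int options_from_create_flags(unsigned int options,
                                       unsigned int create_flags)
{
  if (create_flags & HA_CREATE_RELIES_ON_SQL_LAYER)
    options |= HA_OPTION_RELIES_ON_SQL_LAYER;
  return options;
}

/* mi_open() side: returns 0 when the open must be refused (the table
   relies on the SQL layer but the caller is not the SQL layer), else 1. */
int open_allowed(unsigned int open_flags, unsigned int table_options)
{
  if (!(open_flags & HA_OPEN_FROM_SQL_LAYER) &&
      (table_options & HA_OPTION_RELIES_ON_SQL_LAYER))
    return 0;
  return 1;
}

/* Simplified stand-ins (assumptions) for the server structures. */
typedef struct ftparser_param_stub { int mysql_add_word; } FTPARSER_PARAM;
struct parser_stub {
  int (*init)(FTPARSER_PARAM *param);
  int (*deinit)(FTPARSER_PARAM *param);
};
struct table_stub {
  unsigned int ftparsers;             /* N, number of distinct parsers */
  FTPARSER_PARAM *ftparser_param;     /* N + 1 slots, allocated lazily */
  const struct parser_stub **parsers; /* indexed by ftparser_nr, 0..N */
};

/* Allocate the slot array on first use and run init() once per parser. */
FTPARSER_PARAM *call_initializer(struct table_stub *t, unsigned int nr)
{
  assert(t != NULL && nr <= t->ftparsers);
  if (!t->ftparser_param) {
    /* calloc zero-fills, which replaces the separate bzero step */
    t->ftparser_param = calloc(t->ftparsers + 1, sizeof(FTPARSER_PARAM));
    if (!t->ftparser_param)
      return NULL;
  }
  if (!t->ftparser_param[nr].mysql_add_word) {
    const struct parser_stub *p = t->parsers[nr];
    t->ftparser_param[nr].mysql_add_word = 1;   /* mark slot initialized */
    if (p->init && p->init(&t->ftparser_param[nr]))
      return NULL;
  }
  return &t->ftparser_param[nr];
}

/* Run deinit() for every initialized slot, then free the array. */
void call_deinitializer(struct table_stub *t)
{
  unsigned int i;
  assert(t != NULL);
  if (!t->ftparser_param)
    return;
  for (i = 0; i <= t->ftparsers; i++) {
    const struct parser_stub *p = t->parsers[i];
    if (t->ftparser_param[i].mysql_add_word && p->deinit)
      p->deinit(&t->ftparser_param[i]);
  }
  free(t->ftparser_param);
  t->ftparser_param = NULL;
}

/* Example parser callbacks that only count how often they run. */
int init_calls = 0, deinit_calls = 0;
int counting_init(FTPARSER_PARAM *p)   { (void) p; init_calls++;   return 0; }
int counting_deinit(FTPARSER_PARAM *p) { (void) p; deinit_calls++; return 0; }
```

With this shape, repeated calls for the same ftparser_nr within one lock period reuse the already initialized slot, and a tool that opens the table without HA_OPEN_FROM_SQL_LAYER is turned away before any parser is needed.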