WL#7123: Additional query optimization for Fulltext Search
Status: Complete
This is a placeholder on some optimization needed from optimizer side for InnoDB fulltext searches. It has 2 areas: Area 1. Some condition push down from server layer to InnoDB a) hint on no ranking is needed. For example, select count(*) b) hint on "select limit" with no where clause (asc/desc), so we only need to keep top/bottom ranked items c) hint on "order by match" (ranking), whether it is asc or desc d) hint on "select limit" with where clause on match (ranking) > X (asc/desc) Area 2. Simplify post search result processing: I notice there are additional call to innobase_fts_retrieve_ranking() for a simple select match call: SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('Database' IN NATURAL LANGUAGE MODE); Stack: innobase_fts_retrieve_ranking Item_func_match::val_real .. evaluate_join_record sub_select JOIN::exec mysql_execute_select Please notice when it comes to ha_innobase::ft_read to retrieve result data, we sort the result (doc_id) by fts_query_sort_result_on_rank(). So the resulting Doc ID is already sorted with ranking order. Why there is additional request for ranking? it seems to be overhead to do such retrieval. So I would like this result retrieval process could be checked and straightened out. User Documentation ================== None required.
Functional requirements: None Non-Functional requirements:
Simplified algorithm for processing InnoDB FTS index (for better understanding of possible optimization): 1. ha_innobase::ft_init_ext() -> fts_query() Executed at optimization stage, initialize FTS index. Steps in fts_query(): 1.1 Build DOC_ID tree depending on key phrase. excluding deleted list and ignore list DOC_IDs ( fts_ast_visit(), fts_expand_query() ) 1.2 Calculate the inverse document frequency, necessary for ranking ( fts_query_calculate_idf() ) 1.3 Populate rank values for DOC_ID tree ( fts_query_get_result() -> fts_query_prepare_result() ) 2. ha_innobase::ft_read() Executed at execution stage. 2.1 Sort DOC_ID tree on first call by rank, created alternative tree sorted by rank ( fts_query_sort_result_on_rank() ) 2.2 Subsequent call of ft_read() walks through the tree RANK tree. There are several cases when some steps of this algorithm can be omitted or improved. I will use following test case: CREATE TABLE t1 ( FTS_DOC_ID BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT, title VARCHAR(255) NOT NULL DEFAULT '', text MEDIUMTEXT NOT NULL, PRIMARY KEY (FTS_DOC_ID), UNIQUE KEY FTS_DOC_ID_INDEX (FTS_DOC_ID), FULLTEXT KEY ft_idx (title,text) ) ENGINE=InnoDB DEFAULT CHARSET=latin1; INSERT INTO t1 (title, text) VALUES ('MySQL Tutorial','DBMS stands for MySQL DataBase ...'), ('How To Use MySQL Well','After you went through a ...'), ('Optimizing MySQL','In this tutorial we will show ...'), ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'), ('MySQL vs. YourSQL','In the following database to database comparison ...'), ('MySQL Security','When configured properly, MySQL ...'), ('InnoDB', 'InnoDB is a transaction-safe (ACID compliant) storage engine'), ('MySQL is a database management system', 'A database is a structured collection of data...'), ('MySQL databases are relational', 'A relational database stores data in separate tables rather than putting all the data in one big storeroom...'), ('MySQL software is Open Source', 'Open Source means that it is possible for anyone to use and modify the software...'), ('The MySQL Database Server is very fast, reliable, scalable, and easy to use', 'MySQL Server can run comfortably on a desktop or laptop...'), ('MySQL Server works in client/server or embedded systems', 'The MySQL Database Software is a client/server system...'), ('MyISAM', 'MyISAM is based on the older (and no longer available) ISAM storage engine but has many useful extensions'), ('A large amount of contributed MySQL software is available', 'MySQL Server has a practical set of features developed in close cooperation with our users'); Possible optimization: 1. No DOC_ID sorting by rank(skip 2.1 step in the algorithm above) In some cases DOC_ID sorting is redundant. Examples: SELECT FTS_DOC_ID, TITLE FROM t1 WHERE MATCH (title, text) AGAINST ('fast database' IN NATURAL LANGUAGE MODE) ORDER BY title; Note: redundant sorting as we should sort result by 'title' field anyway. Conditions: ORDER exists and this order is not by rank SELECT FTS_DOC_ID, MATCH(title, text) AGAINST ('+fast +database' IN BOOLEAN MODE) FROM t1 WHERE MATCH(title, text) AGAINST ('+fast +database' IN BOOLEAN MODE) Note: redundant sorting as manual states: 'They do not automatically sort rows in order of decreasing relevance' Condition: No ORDER, BOOLEAN MODE This behavior is controlled by FT_SORTED flag but it's not used in InnoDB handler(fix it). 2. No ranking(automatically means no sorting by rank too) In some cases rank calculation is not necessary and can be omitted. Examples: SELECT count(*) FROM t1 WHERE MATCH (title, text) AGAINST ('fast database' IN NATURAL LANGUAGE MODE); Note: DOC_ID tree after 1.1 step of the algorithm above. already have necessary information (number of docs satisfied WHERE condition), so we can skip 1.2, 1.3 steps. Corresponds to HLD 1a) item. Conditions: WHERE has single MATCH expression, no MATCH in SELECT list, COUNT. SELECT count(*) FROM t1 WHERE MATCH (title, text) AGAINST ('data*' IN BOOLEAN MODE) ORDER BY title; Note: Despite ORDER presence 'no ranking' optimization is still applicable. Corresponds to HLD 1a) item. Conditions: WHERE has single MATCH expression, no MATCH in SELECT list, COUNT. SELECT FTS_DOC_ID, title FROM t1 WHERE MATCH(title, text) AGAINST ('+fast +database' IN BOOLEAN MODE) Note: Manual states that internal sorting is performed only for NATURAL LANGUAGE MODE. Boolean mode does not require sorting, rank value is not used in any expression or as return value. So ranking is not needed. Conditions: WHERE has single MATCH expression, BOOLEAN MODE, no rank field in SELECT list, no order by rank New flag FT_NO_RANKING is necessary. 3. LIMIT optimization. If no ranking is supported then we can add optimization using LIMIT clause. Examples: SELECT count(*) FROM t1 WHERE MATCH (title, text) AGAINST ('fast database' IN NATURAL LANGUAGE MODE) LIMIT 3; Note: we don't need to build whole DOC_ID tree. we can interrupt the building when number of elements exceeds LIMIT value (1.1 in the algorithm above). Conditions: no ranking is supported, LIMIT presence if ranking is needed and number of matched FTS rows is more or equal than LIMIT value then LIMIT optimization can be used for partial sorting(only top 'limit value' should be sorted). Example: SELECT FTS_DOC_ID, title, MATCH(title, text) AGAINST ('+fast +database' IN BOOLEAN MODE) AS rank FROM t1 ORDER BY rank LIMIT 10; Note: Corresponds to HLD 1b) item. 4. WHERE optimization on MATCH > some_rank. if WHERE condition is MATCH > 'some_rank' we can use some_rank to filter out unnecessary documents before result sorting(2.1 in the algorithm above) Examples: SELECT MATCH (title, text) AGAINST ('fast database' IN NATURAL LANGUAGE MODE) FROM t1 WHERE MATCH (title, text) AGAINST ('fast database' IN NATURAL LANGUAGE MODE) > 0.5; Note: Corresponds to HLD 1d) item. Conditions: presence of MATCH > 'some_rank' condition. 5. innobase_fts_retrieve_ranking() function innobase_fts_retrieve_ranking() returns rank value. It can return current element rank if read_just_key is set. In other cases it uses fts_retrieve_ranking() function. fts_retrieve_ranking() is looking for doc_id in doc_id tree and returns appropriate rank. SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('Database' IN NATURAL LANGUAGE MODE); In example above read_just_key is not set and fts_retrieve_ranking() is used but actually we can return current element rank, no need to call fts_retrieve_ranking(). So read_just_key check should be changed to allow return current element rank for example above. Actually if result->rankings_by_rank exists we can avoid fts_retrieve_ranking() call and just return current rank. So instead of ft_prebuilt->read_just_key check we should use result->rankings_by_rank. It solves the problem. Taking into account future optimization(skip result sorting) it can happen that this check will be extended to be able to. work with unsorted result(in other words with DOC_ID tree). 6. HLD requirements which are not covered by WL implementation: 6.1. ASC order optimization does not supported as not important (no considerable use cases). 6.2 HLD 1c) item is not covered by this WL. Since result is not limited in this case(no WHERE, no LIMIT), there is no way to optimize it. Following structure is introduced: typedef struct st_ft_hints FT_HINTS; struct st_ft_hints { /** FULLTEXT flags, see FT_NL, etc */ uint ft_flags; /** Operation type */ enum ft_operation ft_min_rank_op; /** Operation value */ double ft_min_rank; /** LIMIT value, HA_POS_ERROR if not set */ ha_rows ft_limit; }; ft_flags variable has two special flags: FT_SORTED means that internal sorting should be performed. FT_NO_RANKING means that ranking is not necessary ft_min_rank_op variable has following values: FT_OP_NO, No operation, single MATCH function FT_OP_GT, 'Greater than' operation FT_OP_GE 'Greater than or equal to' operation ft_min_rank variable is rank value extracted from WHERE condition. ft_limit variable is limit value if set, otherwise HA_POS_ERROR. FT_HINTS is passed to the engine at Item_func_match::init_search() stage. Following rules should be used on engine side: 1. If FT_SORTED is set then 1.1 Preform internal sorting, otherwise skip internal sorting. 1.2 If LIMIT is set it can be used at internal sorting stage. 2. If FT_NO_RANKING is set then 2.1 Skip ranking procedure. 2.2 If LIMIT is set it can be used to interrupt DOC_ID tree building when number of DOC_ID tree elements exceeds LIMIT value. 3. ft_min_rank_op & ft_min_rank can be used before internal sorting to filter out elements which have lesser rank. Another place for use is filtering of DOC_ID tree elements in ::ft_read(). On optimizer level FT_HINTS values are set using following criteria: 1. FT_SORTED is set if any of the following conditions is true: 1.1 No ORDER BY and MATCH uses NATURAL LANGUAGE MODE 1.2 Engine supports extended API and ORDER BY is a single MATCH expression. 2. FT_NO_RANKING is set if FT_SORTED is not set and MATCH is a single MATCH expression which used only in WHERE condition. 3. ft_min_rank_op & ft_min_rank are set if there is appropriate MATCH expression in WHERE condition. Condition should be part of AND condition and should be of the following form: MATCH > 'const_value' or MATCH >= 'const_value'. 4. LIMIT is applicable if we optimize single table and there is no WHERE condition or WHERE condition is a single MATCH function. Addon(some bug fixed): #Crash EXPLAIN SELECT * FROM t1 ORDER BY MATCH (title, text) AGAINST ('fast database' IN NATURAL LANGUAGE MODE) DESC LIMIT 10; #Wrong result SELECT FTS_DOC_ID FROM t1 WHERE MATCH (title, text) AGAINST ('fast database' IN NATURAL LANGUAGE MODE) AND MATCH (title, text) AGAINST ('mysql' IN NATURAL LANGUAGE MODE) LIMIT 6; #Wrong result, MyISAM(see innodb_fts_opt.test) SELECT FTS_DOC_ID docid, MATCH(title, text) AGAINST ('database') AS score FROM wp ORDER BY score DESC LIMIT 10;
There are three functions responsible for FT optimization: 1. add_ft_keys() Function parses WHERE condition and add key_use for FT index into key_use array if appropriate MATCH function is found. If no MATCH function is found in WHERE then check ORDER BY clause. If ORDER BY clause has single MATCH expression then add key_use for FT index into key_use array. It also sets FT_HINTS values(ft_min_rank_op, ft_min_rank). 2. JOIN::pre_init_optimize_fts_query() Function sets FT_SORTED, FT_NO_RANKING depending on conditions for FT_HINTS structure. It also sets LIMIT hint. FT_SORTED flag is set if 1. No ORDER BY clause and MATCH is not in BOOLEAN MODE 2. There is a single MATCH function in ORDER BY clause FT_NO_RANKING flag is set if 1. FT_SORTED flags is not set and MATCH function is used only in WHERE condition LIMIT hint is set if we optimize single table and there is no WHERE condition or WHERE condition is a single MATCH function. 3. JOIN::post_init_optimize_fts_query() Old name JOIN::optimize_fts_query(). Meaning is not changed, the function is just adapted according to new code. Folowing changes are done: sql/handler.h Added new handlerton flag HTON_CAN_FULLTEXT_EXT. Removed handler flag HTON_CAN_FULLTEXT_EXT. Changed parameters in handler::ft_init_ext() function: removed 'uint flags' parameter. added FT_HINTS *ft_hints parameter. sql/item_func.cc Item_func_match::flags is replaced with Item_func_match::ft_hints.ft_flags. sql/item_func.h Item_func_match changes: Added new parameter(parsing context) to the constructor. It's used to determine where MATCH function is used. Added new class function: Item_func_match *get_master(); void init_used_place(enum enum_parsing_context place); void init_ft_hints(uint ft_flags, enum ft_operation ft_min_rank_op, double ft_min_rank, ha_rows ft_limit); void set_ft_hints_op(enum ft_operation ft_min_rank_op, double ft_min_rank); void set_ft_hints_limit(ha_rows ft_limit); bool extension_supported(); bool can_optimize_sorting(); bool can_no_ranking(); bool is_same(Item *item); sql/opt_sum.cc opt_sum_query() changes: Added new arguments: List- &fields_list - used to determine if count() function is the only element in select list. ha_rows limit - used for FT_HINTS. sql/sql_base.cc setup_ftfuncs() change: Added populating of parsing context. sql/sql_optimizer.cc Removed: optimize_fts_limit_query(). Added: JOIN::pre_init_optimize_fts_query(); static Item_func_match *get_ft_order_match(JOIN_TAB *stat); Renamed: JOIN::optimize_fts_query() to JOIN::post_init_optimize_fts_query(); sql/sql_optimizer.h optimize_fts_limit_query(). Added: JOIN::pre_init_optimize_fts_query(); Renamed: JOIN::optimize_fts_query() to JOIN::post_init_optimize_fts_query(); sql/sql_select.h opt_sum_query() changes: Added new arguments. sql/sql_yacc.yy Item_func_match changes: Added new parameter(parsing context) to the constructor. storage/innobase/handler/ha_innodb.cc Removed HA_CAN_FULLTEXT_EXT flag from handler. Added HTON_CAN_FULLTEXT_EXT flag initialization to handlerton. Changed parameters in ::ft_init_ext() function. innobase_fts_retrieve_ranking() changes: now function does not use fts_retrieve_ranking() at all and always returns current element rank. storage/innobase/handler/ha_innodb.h Changed parameters in ::ft_init_ext() function. storage/innobase/include/fts0fts.h added FTS_NO_RANKING flag. storage/myisam/ha_myisam.h Changed parameters in ::ft_init_ext() function.
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.