WL#7123: Additional query optimization for Fulltext Search

Status: Complete

Description
Requirements
High Level Architecture
Low Level Design

This is a placeholder on some optimization needed from optimizer side for InnoDB
fulltext searches.

It has 2 areas:

Area 1. Some condition push down from server layer to InnoDB

  a) hint on no ranking is needed. For example, select count(*)
  b) hint on "select limit" with no where clause (asc/desc), so we only need to
keep top/bottom ranked items
  c) hint on "order by match" (ranking), whether it is asc or desc
  d) hint on "select limit" with where clause on match (ranking) > X (asc/desc)


Area 2. Simplify post search result processing:

I notice there are additional call to innobase_fts_retrieve_ranking() for a
simple select match call:

SELECT * FROM articles
        WHERE MATCH (title,body)
        AGAINST ('Database' IN NATURAL LANGUAGE MODE);

Stack:

innobase_fts_retrieve_ranking
Item_func_match::val_real
..
evaluate_join_record
sub_select
JOIN::exec
mysql_execute_select


Please notice when it comes to ha_innobase::ft_read to retrieve result data, we
sort the result (doc_id) by fts_query_sort_result_on_rank(). So the resulting
Doc ID is already sorted with ranking order. Why there is additional request for
ranking? it seems to be overhead to do such retrieval.

So I would like this result retrieval process could be checked and straightened 
out.

User Documentation
==================

None required.

Functional requirements:

None

Non-Functional requirements:

Simplified algorithm for processing InnoDB FTS index
(for better understanding of possible optimization):

1. ha_innobase::ft_init_ext() -> fts_query()
   Executed at optimization stage, initialize FTS index.

   Steps in fts_query():

   1.1 Build DOC_ID tree depending on key phrase.
       excluding deleted list and ignore list DOC_IDs
       ( fts_ast_visit(), fts_expand_query() )

   1.2 Calculate the inverse document frequency, necessary for ranking
       ( fts_query_calculate_idf() )

   1.3 Populate rank values for DOC_ID tree
       ( fts_query_get_result() -> fts_query_prepare_result() )

2. ha_innobase::ft_read()
   Executed at execution stage.

   2.1 Sort DOC_ID tree on first call by rank,
       created alternative tree sorted by rank
       ( fts_query_sort_result_on_rank() )

   2.2 Subsequent call of ft_read() walks through the tree
       RANK tree.

There are several cases when some steps of this algorithm
can be omitted or improved.

I will use following test case:

CREATE TABLE t1
(
  FTS_DOC_ID BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
  title VARCHAR(255) NOT NULL DEFAULT '',
  text MEDIUMTEXT NOT NULL,
  PRIMARY KEY (FTS_DOC_ID),
  UNIQUE KEY FTS_DOC_ID_INDEX (FTS_DOC_ID),
  FULLTEXT KEY ft_idx (title,text)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

   
INSERT INTO t1 (title, text) VALUES
  ('MySQL Tutorial','DBMS stands for MySQL DataBase ...'),
  ('How To Use MySQL Well','After you went through a ...'),
  ('Optimizing MySQL','In this tutorial we will show ...'),
  ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
  ('MySQL vs. YourSQL','In the following database to database comparison ...'),
  ('MySQL Security','When configured properly, MySQL ...'),
  ('InnoDB', 'InnoDB is a transaction-safe (ACID compliant) storage engine'),
  ('MySQL is a database management system', 'A database is a structured
   collection of data...'),
  ('MySQL databases are relational', 'A relational database stores data in
    separate tables rather than putting all the data in one big storeroom...'),
  ('MySQL software is Open Source', 'Open Source means that it is possible
   for anyone to use and modify the software...'),
  ('The MySQL Database Server is very fast, reliable, scalable, and easy to
   use', 'MySQL Server can run comfortably on a desktop or laptop...'),
  ('MySQL Server works in client/server or embedded systems', 'The MySQL
   Database Software is a client/server system...'),
  ('MyISAM', 'MyISAM is based on the older (and no longer available) ISAM 
   storage engine but has many useful extensions'),
  ('A large amount of contributed MySQL software is available', 'MySQL Server 
    has a practical set of features developed in close cooperation with our 
    users');


Possible optimization:

1. No DOC_ID sorting by rank(skip 2.1 step in the algorithm above)
   In some cases DOC_ID sorting is redundant.

   Examples:

   SELECT FTS_DOC_ID, TITLE FROM t1
   WHERE MATCH (title, text) AGAINST ('fast database' IN NATURAL LANGUAGE MODE)
   ORDER BY title;

   Note: redundant sorting as we should sort result by 'title' field anyway.
   Conditions: ORDER exists and this order is not by rank

   SELECT FTS_DOC_ID,
     MATCH(title, text) AGAINST ('+fast +database' IN BOOLEAN MODE) FROM t1
   WHERE MATCH(title, text) AGAINST ('+fast +database' IN BOOLEAN MODE)

   Note: redundant sorting as manual states:
   'They do not automatically sort rows in order of decreasing relevance'
   Condition: No ORDER, BOOLEAN MODE

This behavior is controlled by FT_SORTED flag but it's not used in
InnoDB handler(fix it).


2. No ranking(automatically means no sorting by rank too)
   In some cases rank calculation is not necessary and can be omitted.

   Examples:

   SELECT count(*) FROM t1
   WHERE MATCH (title, text) AGAINST ('fast database' IN NATURAL LANGUAGE MODE);

   Note: DOC_ID tree after 1.1 step of the algorithm above.
         already have necessary information
         (number of docs satisfied WHERE condition), so we can
         skip 1.2, 1.3 steps. Corresponds to HLD 1a) item.
   Conditions: WHERE has single MATCH expression, no MATCH in SELECT list,
               COUNT.

   SELECT count(*) FROM t1
   WHERE MATCH (title, text) AGAINST ('data*' IN BOOLEAN MODE)
   ORDER BY title;

   Note: Despite ORDER presence 'no ranking' optimization is still
         applicable. Corresponds to HLD 1a) item.
   Conditions: WHERE has single MATCH expression, no MATCH in SELECT list,
               COUNT.

   SELECT FTS_DOC_ID, title FROM t1
   WHERE MATCH(title, text) AGAINST ('+fast +database' IN BOOLEAN MODE)


   Note: Manual states that internal sorting is performed only for
         NATURAL LANGUAGE MODE. Boolean mode does not require sorting,
         rank value is not used in any expression or as return value.
         So ranking is not needed. 

   Conditions: WHERE has single MATCH expression, BOOLEAN MODE,
               no rank field in SELECT list, no order by rank

New flag FT_NO_RANKING is necessary.


3. LIMIT optimization.

   If no ranking is supported then we can add optimization
   using LIMIT clause.

   Examples:

   SELECT count(*) FROM t1
   WHERE MATCH (title, text) AGAINST ('fast database' IN NATURAL LANGUAGE MODE)
   LIMIT 3;

   Note: we don't need to build whole DOC_ID tree. we can interrupt the building
         when number of elements exceeds LIMIT value
         (1.1 in the algorithm above).
   Conditions: no ranking is supported, LIMIT presence


   if ranking is needed and number of matched FTS rows is more or equal
   than LIMIT value then LIMIT optimization can be used for partial
   sorting(only top 'limit value' should be sorted).

   Example:

   SELECT FTS_DOC_ID, title,
       MATCH(title, text) AGAINST ('+fast +database' IN BOOLEAN MODE) AS rank
   FROM t1
   ORDER BY rank
   LIMIT 10; 

   Note: Corresponds to HLD 1b) item.

4. WHERE optimization on MATCH > some_rank.

   if WHERE condition is MATCH > 'some_rank' we can use
   some_rank to filter out unnecessary documents before
   result sorting(2.1 in the algorithm above)

   Examples:

   SELECT
     MATCH (title, text) AGAINST ('fast database' IN NATURAL LANGUAGE MODE) 
   FROM t1
   WHERE
     MATCH (title, text) AGAINST ('fast database' IN NATURAL LANGUAGE MODE) >
       0.5;

   Note: Corresponds to HLD 1d) item.  

   Conditions: presence of MATCH > 'some_rank' condition.


5. innobase_fts_retrieve_ranking() function

   innobase_fts_retrieve_ranking() returns rank value.
   It can return current element rank if read_just_key is set.
   In other cases it uses fts_retrieve_ranking() function.
   fts_retrieve_ranking() is looking for doc_id in doc_id tree
   and returns appropriate rank.

   SELECT * FROM articles
   WHERE MATCH (title,body)
   AGAINST ('Database' IN NATURAL LANGUAGE MODE);

   In example above read_just_key is not set and fts_retrieve_ranking() is
   used but actually we can return current element rank, no need to call
   fts_retrieve_ranking(). So read_just_key check should be changed to
   allow return current element rank for example above.

   Actually if result->rankings_by_rank exists
   we can avoid fts_retrieve_ranking() call and just
   return current rank.

   So instead of ft_prebuilt->read_just_key check we should
   use result->rankings_by_rank. It solves the problem.

   Taking into account future optimization(skip result sorting)
   it can happen that this check will be extended to be able to.
   work with unsorted result(in other words with DOC_ID tree).


6. HLD requirements which are not covered by WL implementation:

   6.1. ASC order optimization does not supported as not important
        (no considerable use cases).

   6.2 HLD 1c) item is not covered by this WL. Since result is
       not limited in this case(no WHERE, no LIMIT), there is no
       way to optimize it.


Following structure is introduced:

typedef struct st_ft_hints FT_HINTS;
struct st_ft_hints
{
  /** FULLTEXT flags, see FT_NL, etc */
  uint              ft_flags;
  /** Operation type */
  enum ft_operation ft_min_rank_op;
  /** Operation value */
  double            ft_min_rank;
  /** LIMIT value, HA_POS_ERROR if not set */
  ha_rows           ft_limit;
};

ft_flags variable has two special flags:

 FT_SORTED      means that internal sorting should be performed.
 FT_NO_RANKING  means that ranking is not necessary

ft_min_rank_op variable has following values:

  FT_OP_NO,       No operation, single MATCH function
  FT_OP_GT,       'Greater than' operation
  FT_OP_GE        'Greater than or equal to' operation

ft_min_rank variable is rank value extracted from WHERE condition.

ft_limit variable is limit value if set, otherwise HA_POS_ERROR.

FT_HINTS is passed to the engine at Item_func_match::init_search() stage.
Following rules should be used on engine side:

1. If FT_SORTED is set then

   1.1 Preform internal sorting, otherwise skip internal sorting.
   1.2 If LIMIT is set it can be used at internal sorting stage.

2. If FT_NO_RANKING is set then

   2.1 Skip ranking procedure.
   2.2 If LIMIT is set it can be used to interrupt DOC_ID tree building
       when number of DOC_ID tree elements exceeds LIMIT value.

3. ft_min_rank_op & ft_min_rank can be used before internal sorting
   to filter out elements which have lesser rank. Another place for use
   is filtering of DOC_ID tree elements in ::ft_read().

On optimizer level FT_HINTS values are set using following criteria:

1. FT_SORTED is set if any of the following conditions is true:

   1.1 No ORDER BY and MATCH uses NATURAL LANGUAGE MODE
   1.2 Engine supports extended API and ORDER BY is a
       single MATCH expression.

2. FT_NO_RANKING is set if FT_SORTED is not set and
   MATCH is a single MATCH expression which used only in
   WHERE condition.

3. ft_min_rank_op & ft_min_rank are set if there is
   appropriate MATCH expression in WHERE condition.
   Condition should be part of AND condition and 
   should be of the following form: 

   MATCH >  'const_value'

   or

   MATCH >= 'const_value'.


4. LIMIT is applicable if we optimize single table and
   there is no WHERE condition or WHERE condition is a
   single MATCH function.



Addon(some bug fixed):

#Crash
EXPLAIN
SELECT * FROM t1
ORDER BY MATCH (title, text) AGAINST ('fast database' IN NATURAL LANGUAGE MODE)
DESC LIMIT 10;

#Wrong result
SELECT FTS_DOC_ID FROM t1
WHERE MATCH (title, text) AGAINST ('fast database' IN NATURAL LANGUAGE MODE)
      AND
      MATCH (title, text) AGAINST ('mysql' IN NATURAL LANGUAGE MODE)
LIMIT 6;

#Wrong result, MyISAM(see innodb_fts_opt.test)
SELECT FTS_DOC_ID docid, MATCH(title, text) AGAINST ('database') AS score 
FROM wp 
ORDER BY score DESC LIMIT 10;

There are three functions responsible for FT optimization:

1. add_ft_keys()

  Function parses WHERE condition and add key_use for FT index
  into key_use array if appropriate MATCH function is found.
  If no MATCH function is found in WHERE then check ORDER BY
  clause. If ORDER BY clause has single MATCH expression then
  add key_use for FT index into key_use array.

  It also sets FT_HINTS values(ft_min_rank_op, ft_min_rank).

2. JOIN::pre_init_optimize_fts_query()

  Function sets FT_SORTED, FT_NO_RANKING depending on conditions for
  FT_HINTS structure. It also sets LIMIT hint. 
  
  FT_SORTED flag is set if
  1. No ORDER BY clause and MATCH is not in BOOLEAN MODE 
  2. There is a single MATCH function in ORDER BY clause

  FT_NO_RANKING flag is set if
  1. FT_SORTED flags is not set and MATCH function is used
     only in WHERE condition

  LIMIT hint is set if we optimize single table and there is
  no WHERE condition or WHERE condition is a single MATCH function.

3. JOIN::post_init_optimize_fts_query() 

   Old name JOIN::optimize_fts_query().
   Meaning is not changed, the function is just adapted
   according to new code.


Folowing changes are done:

sql/handler.h

Added new handlerton flag HTON_CAN_FULLTEXT_EXT.
Removed handler flag HTON_CAN_FULLTEXT_EXT.
Changed parameters in handler::ft_init_ext() function:
  removed 'uint flags' parameter.
  added FT_HINTS *ft_hints parameter.

sql/item_func.cc

Item_func_match::flags is replaced with Item_func_match::ft_hints.ft_flags.

sql/item_func.h

Item_func_match changes:

Added new parameter(parsing context) to the constructor.
It's used to determine where MATCH function is used.

Added new class function:
Item_func_match *get_master();
void init_used_place(enum enum_parsing_context place);
void init_ft_hints(uint ft_flags, enum ft_operation ft_min_rank_op,
                   double ft_min_rank, ha_rows ft_limit);
void set_ft_hints_op(enum ft_operation ft_min_rank_op, double ft_min_rank);
void set_ft_hints_limit(ha_rows ft_limit);
bool extension_supported();
bool can_optimize_sorting();
bool can_no_ranking();
bool is_same(Item *item);

sql/opt_sum.cc

opt_sum_query() changes:

Added new arguments:
List &fields_list - used to determine if count() function
                          is the only element in select list.
ha_rows limit - used for FT_HINTS.

sql/sql_base.cc

setup_ftfuncs() change:
Added populating of parsing context.

sql/sql_optimizer.cc

Removed:
optimize_fts_limit_query().

Added:
JOIN::pre_init_optimize_fts_query();
static Item_func_match *get_ft_order_match(JOIN_TAB *stat);

Renamed:
JOIN::optimize_fts_query() to JOIN::post_init_optimize_fts_query();

sql/sql_optimizer.h

optimize_fts_limit_query().

Added:
JOIN::pre_init_optimize_fts_query();

Renamed:
JOIN::optimize_fts_query() to JOIN::post_init_optimize_fts_query();

sql/sql_select.h

opt_sum_query() changes:

Added new arguments.

sql/sql_yacc.yy

Item_func_match changes:

Added new parameter(parsing context) to the constructor.


storage/innobase/handler/ha_innodb.cc

Removed HA_CAN_FULLTEXT_EXT flag from handler.
Added HTON_CAN_FULLTEXT_EXT flag initialization to handlerton.
Changed parameters in ::ft_init_ext() function.
innobase_fts_retrieve_ranking() changes:
now function does not use fts_retrieve_ranking() at all
and always returns current element rank.

storage/innobase/handler/ha_innodb.h

Changed parameters in ::ft_init_ext() function.

storage/innobase/include/fts0fts.h

added FTS_NO_RANKING flag.

storage/myisam/ha_myisam.h

Changed parameters in ::ft_init_ext() function.