WL#5538: InnoDB Full-Text Search Support

Status: Complete

Description
High Level Architecture

This is the placeholder worklog for the ongoing development of the Full Text
Search (FTS) project in InnoDB.

The purpose of the Project is to provide user the ability to build Full Text
Index on Text Document stored in the InnoDB storage engine, and provide them
fast and accurate search on the Document Content.

The project design is based on a thesis work (see attached FTSProject.htm) by
Osku. In essence, the non-relational text data are tokenized and stored into a
set of relational tables (auxiliary tables) along with their Doc ID and
position, and thus we build the so called "inverted index" on the tokenized word
and their positions.

In brief summary, we will provide following FTS functionalities in the first
release of the feature (please refer to attached function spec "FTS_FUNC" for
more info):

1) Create FTS through FIC (Fast Index Creation) interface

2) FTS search with following options:
    Natural Language Full-Text Searches
    Boolean Full-Text Searches
    Full-Text Searches with Query Expansion
    Proximity Search

3)Default and User supplied Full-Text Stopwords

The initial release shall provide these basic FTS functionality support with
efficiency and robustness.


=============== DOC ID =============

To create reverted index, we will need an unique DOC ID to identify the
document. Since we use a delayed delete mechanism, the deleted document could
still be in the index, and we use a DELETED table to register deleted Doc IDs,
so we need to make sure Doc ID is never reused. So our Doc ID have following
properties:

1) If user do not supply Doc ID, we will generate an internal Doc ID column (of
name "FTS_DOC_ID") for them. This would require a rebuild of primary index, so
it could be time consuming

2) User can supply Doc IDs (just like many other FT search engine). As of now,
it needs to be a 8 bytes unique non-null BIGINT column. Currently it must have
the reserved name "FTS_DOC_ID". If such column exists, we will not need to add
Doc ID column. And no primary index is needed.

3) We require the Doc ID to be monotonically increasing. That is, we register
latest (largest) Doc ID in our internal CONFIG table and its internal structure.
So for each inserting Doc, we will check the ID. If there is violation, the
insertion can be either rejected or be warned (in the later case user takes his
own risk).

4) If server crashes, and reboot, again, the inserting doc's Doc ID needs to be
larger than the value we registered in the config table

5) If users do want to reuse ID, then he should check the existing table and the
deleted table do not have IDs larger than where he restarted.  But the burden
can leave to him. Since there is unique index on the Doc ID, he can't inserting
anything already in the table, but he could inserting some Doc with Doc ID in
our deleted table. The end result would be such Doc will not shown as result and
could be optimized out of our inverted dictionary.

6) User can manage Doc ID in any way they want. Auto-increment column is one way
to do so.

User Interface
The current user Interfaces and Syntax mostly mirror with those provided by
MyISAM MySQL v5.5 FTS Documentation. 

Natural Language Full-Text Searches
By default or with the IN NATURAL LANGUAGE MODE modifier, the MATCH() function
performs a natural language search for a string against a text collection. A
collection is a set of one or more columns included in a FULLTEXT index. The
search string is given as the argument to AGAINST(). For each row in the table,
MATCH() returns a relevance value; that is, a similarity measure between the
search string and the text in that row in the columns named in the MATCH() list. 

mysql> CREATE TABLE articles (
    ->   id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    ->   title VARCHAR(200),
    ->   body TEXT,
    ->   FULLTEXT (title,body)
    -> ) ENGINE=InnoDB;
Query OK, 0 rows affected (0.00 sec)
mysql> INSERT INTO articles (title,body) VALUES
    -> ('MySQL Tutorial','DBMS stands for DataBase ...'),
    -> ('How To Use MySQL Well','After you went through a ...'),
    -> ('Optimizing MySQL','In this tutorial we will show ...'),
    -> ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
    -> ('MySQL vs. YourSQL','In the following database comparison ...'),
    -> ('MySQL Security','When configured properly, MySQL ...');
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0
mysql> SELECT * FROM articles
    -> WHERE MATCH (title,body)
    -> AGAINST ('database' IN NATURAL LANGUAGE MODE);
+----+-------------------+------------------------------------------+
| id | title             | body                                     |
+----+-------------------+------------------------------------------+
|  5 | MySQL vs. YourSQL | In the following database comparison ... |
|  1 | MySQL Tutorial    | DBMS stands for DataBase ...             |
+----+-------------------+------------------------------------------+
2 rows in set (0.00 sec)

By default, the search is performed in case-insensitive fashion. However, you
can perform a case-sensitive full-text search by using a binary collation for
the indexed columns. For example, a column that uses the latin1 character set of
can be assigned a collation of latin1_bin to make it case sensitive for
full-text searches. 

When MATCH() is used in a WHERE clause, as in the example shown earlier, the
rows returned are automatically sorted with the highest relevance first.
Relevance values are non-negative floating-point numbers. Zero relevance means
no similarity. Relevance is computed based on the number of words in the row,
the number of unique words in that row, the total number of words in the
collection, and the number of documents (rows) that contain a particular word. 

To simply count matches, you could use a query like this: 

mysql> SELECT COUNT(*) FROM articles
    -> WHERE MATCH (title,body)
    -> AGAINST ('database' IN NATURAL LANGUAGE MODE);
+----------+
| COUNT(*) |
+----------+
|        2 |
+----------+
1 row in set (0.00 sec)

However, you might find it quicker to rewrite the query as follows: 

mysql> SELECT
    -> COUNT(IF(MATCH (title,body) AGAINST ('database' IN NATURAL LANGUAGE
MODE), 1, NULL))
    -> AS count
    -> FROM articles;
+-------+
| count |
+-------+
|     2 |
+-------+
1 row in set (0.03 sec)

The first query sorts the results by relevance whereas the second does not.
However, the second query performs a full table scan and the first does not. The
first may be faster if the search matches few rows; otherwise, the second may be
faster because it would read many rows anyway. 

For natural-language full-text searches, it is a requirement that the columns
named in the MATCH() function be the same columns included in some FULLTEXT
index in your table. For the preceding query, note that the columns named in the
MATCH() function (title and body) are the same as those named in the definition
of the article table's FULLTEXT index. If you wanted to search the title or body
separately, you would need to create separate FULLTEXT indexes for each column. 

It is also possible to perform a boolean search or a search with query
expansion. These search types are described in Section "Boolean Full-Text
Searches", and Section "Full-Text Searches with Query Expansion". 

A full-text search that uses an index can name columns only from a single table
in the MATCH() clause because an index cannot span multiple tables. A boolean
search can be done in the absence of an index (albeit more slowly), in which
case it is possible to name columns from multiple tables. 

The preceding example is a basic illustration that shows how to use the MATCH()
function where rows are returned in order of decreasing relevance. The next
example shows how to retrieve the relevance values explicitly. Returned rows are
not ordered because the SELECT statement includes neither WHERE nor ORDER BY
clauses: 

mysql> SELECT id, MATCH (title,body)
    -> AGAINST ('Tutorial' IN NATURAL LANGUAGE MODE) AS score
    -> FROM articles;
+----+------------------+
| id | score            |
+----+------------------+
|  1 | 0.65545833110809 |
|  2 |                0 |
|  3 | 0.66266459226608 |
|  4 |                0 |
|  5 |                0 |
|  6 |                0 |
+----+------------------+
6 rows in set (0.00 sec)

The following example is more complex. The query returns the relevance values
and it also sorts the rows in order of decreasing relevance. To achieve this
result, you should specify MATCH() twice: once in the SELECT list and once in
the WHERE clause. This causes no additional overhead, because the MySQL
optimizer notices that the two MATCH() calls are identical and invokes the
full-text search code only once. 

mysql> SELECT id, body, MATCH (title,body) AGAINST
    -> ('Security implications of running MySQL as root'
    -> IN NATURAL LANGUAGE MODE) AS score
    -> FROM articles WHERE MATCH (title,body) AGAINST
    -> ('Security implications of running MySQL as root'
    -> IN NATURAL LANGUAGE MODE);
+----+-------------------------------------+-----------------+
| id | body                                | score           |
+----+-------------------------------------+-----------------+
|  4 | 1. Never run mysqld as root. 2. ... | 1.5219271183014 |
|  6 | When configured properly, MySQL ... | 1.3114095926285 |
+----+-------------------------------------+-----------------+
2 rows in set (0.00 sec)

The MySQL FULLTEXT implementation regards any sequence of true word characters
(letters, digits, and underscores) as a word. That sequence may also contain
apostrophes (`'`), but not more than one in a row. This means that aaa'bbb is
regarded as one word, but aaa''bbb is regarded as two words. Apostrophes at the
beginning or the end of a word are stripped by the FULLTEXT parser; 'aaa'bbb'
would be parsed as aaa'bbb. 

The FULLTEXT parser determines where words start and end by looking for certain
delimiter characters; for example, ` ` (space), `,` (comma), and `.` (period).
If words are not separated by delimiters (as in, for example, Chinese), the
FULLTEXT parser cannot determine where a word begins or ends. To be able to add
words or other indexed terms in such languages to a FULLTEXT index, you must
preprocess them so that they are separated by some arbitrary delimiter such as `"`. 

Starting at MySQL 5.1, you can write a plugin that replaces the built-in
full-text parser. For details, see Section 22.2, "The MySQL Plugin Interface".
For example parser plugin source code, see the plugin/fulltext directory of a
MySQL source distribution. 

Some words are ignored in full-text searches: 
Words in the stopword list are ignored. A stopword is a word such as "the" or
"some" that is so common that it is considered to have zero semantic value.
There is a built-in stopword list, but it can be overwritten by a user.

Every correct word in the collection and in the query is weighted according to
its significance in the collection or query. Consequently, a word that is
present in many documents has a lower weight (and may even have a zero weight),
because it has lower semantic value in this particular collection. Conversely,
if the word is rare, it receives a higher weight. The weights of the words are
combined to compute the relevance of the row. 

Such a technique works best with large collections (in fact, it was carefully
tuned this way). For very small tables, word distribution does not adequately
reflect their semantic value, and this model may sometimes produce bizarre
results. For example, although the word "MySQL" is present in every row of the
articles table shown earlier, a search for the word produces no results: 

mysql> SELECT * FROM articles
    -> WHERE MATCH (title,body)
    -> AGAINST ('MySQL' IN NATURAL LANGUAGE MODE);
Empty set (0.00 sec)

The search result is empty because the word "MySQL" is present in at least 50%
of the rows. As such, it is effectively treated as a stopword. For large data
sets, this is the most desirable behavior: A natural language query should not
return every second row from a 1GB table. For small data sets, it may be less
desirable. 

A word that matches half of the rows in a table is less likely to locate
relevant documents. In fact, it most likely finds plenty of irrelevant
documents. We all know this happens far too often when we are trying to find
something on the Internet with a search engine. It is with this reasoning that
rows containing the word are assigned a low semantic value for the particular
data set in which they occur. A given word may reach the 50% threshold in one
data set but not another. 

The 50% threshold has a significant implication when you first try full-text
searching to see how it works: If you create a table and insert only one or two
rows of text into it, every word in the text occurs in at least 50% of the rows.
As a result, no search returns any results. Be sure to insert at least three
rows, and preferably many more. Users who need to bypass the 50% limitation can
use the boolean search mode; see Section 10.8.2, "Boolean Full-Text Searches". 

Boolean Full-Text Searches
MySQL can perform boolean full-text searches using the IN BOOLEAN MODE modifier: 

mysql> SELECT * FROM articles WHERE MATCH (title,body)
    -> AGAINST ('+MySQL -YourSQL' IN BOOLEAN MODE);
+----+-----------------------+-------------------------------------+
| id | title                 | body                                |
+----+-----------------------+-------------------------------------+
|  1 | MySQL Tutorial        | DBMS stands for DataBase ...        |
|  2 | How To Use MySQL Well | After you went through a ...        |
|  3 | Optimizing MySQL      | In this tutorial we will show ...   |
|  4 | 1001 MySQL Tricks     | 1. Never run mysqld as root. 2. ... |
|  6 | MySQL Security        | When configured properly, MySQL ... |
+----+-----------------------+-------------------------------------+

The + and - operators indicate that a word is required to be present or absent,
respectively, for a match to occur. Thus, this query retrieves all the rows that
contain the word "MySQL" but that do not contain the word "YourSQL".
 
In implementing this feature, MySQL uses what is sometimes referred to as
implied Boolean logic, in which 
 * + stands for AND 
 * - stands for NOT 
 * [no operator] implies OR 

Boolean full-text searches have these characteristics: 
 * They do not use the 50% threshold. 
 * They do not automatically sort rows in order of decreasing relevance. You can
see this from the preceding query result: The row with the highest relevance is
the one that contains "MySQL" twice, but it is listed last, not first. 
 * They can work even without a FULLTEXT index, although a search executed in
this fashion would be quite slow. 
 * The minimum and maximum word length full-text parameters apply. 
 * The stopword list applies. 

The boolean full-text search capability supports the following operators: 
 * + A leading plus sign indicates that this word must be present in each row
that is returned. 
 * - A leading minus sign indicates that this word must not be present in any of
the rows that are returned. Note: The - operator acts only to exclude rows that
are otherwise matched by other search terms. Thus, a boolean-mode search that
contains only terms preceded by - returns an empty result. It does not return
"all rows except those containing any of the excluded terms." 
 * (no operator) By default (when neither + nor - is specified) the word is
optional, but the rows that contain it are rated higher. This mimics the
behavior of MATCH() ... AGAINST() without the IN BOOLEAN MODE modifier. 
 * > < These two operators are used to change a word's contribution to the
relevance value that is assigned to a row. The > operator increases the
contribution and the < operator decreases it. See the example following this list. 
 * ( ) Parentheses group words into subexpressions. Parenthesized groups can be
nested. 
 * @ The proximity operator "hello world"@10 Proximity distance is specified in
words, adjusted for word count, and applies to all words within quotes. For
instance, "cat dog mouse"@5 query means that there must be less than 8-word span
which contains all 3 words, ie. "CAT aaa bbb ccc DOG eee fff MOUSE" document
will not match this query, because this span is exactly 8 words long. 
 * ~ A leading tilde acts as a negation operator, causing the word's
contribution to the row's relevance to be negative. This is useful for marking
"noise" words. A row containing such a word is rated lower than others, but is
not excluded altogether, as it would be with the - operator. 
 * * The asterisk serves as the truncation (or wildcard) operator. Unlike the
other operators, it should be appended to the word to be affected. Words match
if they begin with the word preceding the * operator. 
    ** If a stopword or too-short word is specified with the truncation
operator, it will not be stripped from a boolean query. For example, a search
for '+word +stopword*' will likely return fewer rows than a search for '+word
+stopword' because the former query remains as is and requires stopword* to be
present in a document. The latter query is transformed to +word. 
 * "A phrase that is enclosed within double quote (`"`) characters matches only
rows that contain the phrase literally, as it was typed. The full-text engine
splits the phrase into words, performs a search in the FULLTEXT index for the
words. Non-word characters need not be matched exactly: Phrase searching
requires only that matches contain exactly the same words as the phrase and in
the same order. For example, "test phrase" matches "test, phrase". 
    ** If the phrase contains no words that are in the index, the result is
empty. For example, if all words are either stopwords or shorter than the
minimum length of indexed words, the result is empty. 

The following examples demonstrate some search strings that use boolean
full-text operators: 
 * 'apple banana' - Find rows that contain at least one of the two words. 
 * '+apple +juice' - Find rows that contain both words. 
 * '+apple macintosh' - Find rows that contain the word "apple", but rank rows
higher if they also contain "macintosh". 
 * '+apple -macintosh' - Find rows that contain the word "apple" but not
"macintosh". 
 * '+apple ~macintosh' - Find rows that contain the word "apple", but if the row
also contains the word "macintosh", rate it lower than if row does not. This is
"softer" than a search for '+apple -macintosh', for which the presence of
"macintosh" causes the row not to be returned at all. 
 * '+apple +(>turnover  SELECT * FROM articles
    -> WHERE MATCH (title,body)
    -> AGAINST ('database' IN NATURAL LANGUAGE MODE);
+----+-------------------+------------------------------------------+
| id | title             | body                                     |
+----+-------------------+------------------------------------------+
|  5 | MySQL vs. YourSQL | In the following database comparison ... |
|  1 | MySQL Tutorial    | DBMS stands for DataBase ...             |
+----+-------------------+------------------------------------------+
2 rows in set (0.00 sec)
mysql> SELECT * FROM articles
    -> WHERE MATCH (title,body)
    -> AGAINST ('database' WITH QUERY EXPANSION);
+----+-------------------+------------------------------------------+
| id | title             | body                                     |
+----+-------------------+------------------------------------------+
|  1 | MySQL Tutorial    | DBMS stands for DataBase ...             |
|  5 | MySQL vs. YourSQL | In the following database comparison ... |
|  3 | Optimizing MySQL  | In this tutorial we will show ...        |
+----+-------------------+------------------------------------------+
3 rows in set (0.00 sec)

Another example could be searching for books by Georges Simenon about Maigret,
when a user is not sure how to spell “Maigret”. A search for “Megre and the
reluctant witnesses” finds only “Maigret and the Reluctant Witnesses” without
query expansion. A search with query expansion finds all books with the word
“Maigret” on the second pass.

Because blind query expansion tends to increase noise significantly by returning
non-relevant documents, it is meaningful to use only when a search phrase is
rather short. 

Full-Text Stopwords
For FTS stopwords, we provide two sources of stopwords: 
1) Default stopwords provided by the server - this is a static list of stopword
come with server. If no user stopword is defined, then this default stopword
list will be used. 
2) User defined stopwords - User can define his/her own stop word by creating an
user table, and use global variable "innodb_stopword_table_name" to point to
this stopword list. Server will load stopword from this user table, rather than
the default stopword list when we create the FTS index. 
There are following rules for the InnoDB FTS stopwords: 
1) stopword list is server wide. One server will have one "official" stopword
list at one time. 
2) words in stopword are excluded from indexing. So they will not be indexed in
the first place. 
3) stopword list will be loaded when each index created, and they will remain in
cache for the life time of FTS cache. That is, if the stopword list changes in
between, it will not affect stopword for tables create prior to the change. It
only be effective for stopword created after. 
4) If server crashes/shutdown after stopword list changes, when rebooting the
server, FTS indexes could all load the "current" stopword list, which means, if
user do not rebuild their FTS index after stopword changed, they could possibly
have FTS index with partially indexed stopwords. 
For the user table specified for the stopword, there is only one requirement for
the format of the table, its first column must be of varchar type and not null. 

Here is the current list of stopword for our prototype: 
 * InnoDB default stopword list. 
    ** There are different versions of stopwords, the stop words listed below is
so called "Google Stop Word" list. 

const char *fts_default_stopword[] = { 
 * "a", "about", "an", "are", "as", "at", "be", "by", "com", "de", "en", "for",
"from", "how", "i", "in", "is ", "it", "la", "of", "on", "or", "that", "the",
"this", "to", "was", "what", "when", "where", "who", "will", "with", "und",
"the", "www", NULL 
}; 

Here are some examples: 
# Create FTS table
CREATE TABLE articles (
        id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
        title VARCHAR(200),
        body TEXT,
        FULLTEXT (title,body)
        ) ENGINE=InnoDB;

# Insert six rows
INSERT INTO articles (title,body) VALUES
        ('MySQL Tutorial','DBMS stands for DataBase ...')  ,
        ('How To Use MySQL Well','After you went through a ...'),
        ('Optimizing MySQL','In this tutorial we will show ...'),
        ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
        ('MySQL vs. YourSQL','In the following database comparison ...'),
        ('MySQL Security','When configured properly, MySQL ...');

# This is to test the stopword functionality. The word "this" and
# "in" are in the default stopword list. It should not show up
mysql> SELECT * FROM articles WHERE MATCH (title,body)
    ->         AGAINST ('this' IN NATURAL LANGUAGE MODE);
Empty set (0.00 sec)

# Provide user defined stopword table, if not (correctly) defined,
# it will be rejected
set global innodb_stopword_table_name = "not_defined";
ERROR 1210 (HY000): Incorrect arguments to SET

# Define a correct formated user stopword table
mysql> create table user_stopword(value varchar(30)) engine = innodb;
Query OK, 0 rows affected (0.01 sec)

mysql> set global innodb_stopword_table_name = "test/user_stopword";
Query OK, 0 rows affected (5.55 sec)

# Nothing inserted into the default stopword, so essentially
# nothing get screened. The new stopword could only be
# effective for table created thereafter
mysql> CREATE TABLE articles_2 (
    ->         id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    ->         title VARCHAR(200),
    ->         body TEXT,
    ->         FULLTEXT (title,body)
    ->         ) ENGINE=InnoDB;
Query OK, 0 rows affected (5.31 sec)

mysql> INSERT INTO articles_2 (title,body) VALUES ('test for stopwords','this is
it...');
Query OK, 1 row affected (0.01 sec)

mysql> SELECT * FROM articles_2 WHERE MATCH (title,body)
    ->         AGAINST ('this' IN NATURAL LANGUAGE MODE);
+----+--------------------+---------------+
| id | title              | body          |
+----+--------------------+---------------+
|  1 | test for stopwords | this is it... |
+----+--------------------+---------------+
1 row in set (0.00 sec)


mysql> drop table articles_2;
Query OK, 0 rows affected (7.43 sec)

# Ok, let's instantiate some value into user supplied stop word
# table
insert into user_stopword values("this");

mysql> CREATE TABLE articles_2 (
    ->         id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    ->         title VARCHAR(200),
    ->         body TEXT,
    ->         FULLTEXT (title,body)
    ->         ) ENGINE=InnoDB;
Query OK, 0 rows affected (4.57 sec)

mysql> INSERT INTO articles_2 (title, body)
    ->         VALUES ('test for stopwords','this is it...');
Query OK, 1 row affected (0.00 sec)

mysql>
mysql> SELECT * FROM articles_2 WHERE MATCH (title,body)
    ->         AGAINST ('this' IN NATURAL LANGUAGE MODE);
Empty set (0.00 sec)


Full-Text Restrictions

 * In terms of InnoDB DML, the tokenization of the document happens when the DML
is committed. Before the commit, you will not be able to use FTS to search
document as it is not tokenized and put in the FullText Index.
 * Ideographic languages such as Chinese and Japanese do not have word
delimiters. Therefore, the FULLTEXT parser cannot determine where words begin
and end in these and other such languages. The implications of this and some
workarounds for the problem are described in Section 10.8, "Full-Text Search
Functions". 
 * Although the use of multiple character sets within a single table is
supported, all columns in a FULLTEXT index must use the same character set and
collation. 
 * The MATCH() column list must match exactly the column list in some FULLTEXT
index definition for the table, unless this MATCH() is IN BOOLEAN MODE.
Boolean-mode searches can be done on non-indexed columns, although they are
likely to be slow. 
 * The argument to AGAINST() must be a constant string. 

Fine-Tuning Full-Text Search

Values used by the built-in parser and document manager: 
 * innodb_cache_size - server wide configure variable  (default: 32MB) in bytes.
The size of the cache used to store parsed document in memory. 
 * innodb_min_token_size - server wide configure variable  (default: 3) Ignore
tokens that have a length less than this value. The user must rebuild the FTS
indexes after modifying this value. This change is persistent. 
 * innodb_max_token_size - server wide configure variable  (default: 84) Ignore
tokens that have a length larger than  this value. The user must rebuild the FTS
indexes after modifying this value. This change is persistent. 

Values used by the optimize thread: 
 * innodb_optimize_add_threshold - per table value (default: 20,000 documents)
Signal an optimize when the number of added documents exceeds this threshold. 
 * innodb_optimize_del_threshold - per table value (default: 20,000 documents)
Signal an optimize when the number of deleted documents exceeds this threshold.