WL#5538: InnoDB Full-Text Search Support
Status: Complete
This is the placeholder worklog for the ongoing development of the Full Text Search (FTS) project in InnoDB. The purpose of the Project is to provide user the ability to build Full Text Index on Text Document stored in the InnoDB storage engine, and provide them fast and accurate search on the Document Content. The project design is based on a thesis work (see attached FTSProject.htm) by Osku. In essence, the non-relational text data are tokenized and stored into a set of relational tables (auxiliary tables) along with their Doc ID and position, and thus we build the so called "inverted index" on the tokenized word and their positions. In brief summary, we will provide following FTS functionalities in the first release of the feature (please refer to attached function spec "FTS_FUNC" for more info): 1) Create FTS through FIC (Fast Index Creation) interface 2) FTS search with following options: Natural Language Full-Text Searches Boolean Full-Text Searches Full-Text Searches with Query Expansion Proximity Search 3)Default and User supplied Full-Text Stopwords The initial release shall provide these basic FTS functionality support with efficiency and robustness. =============== DOC ID ============= To create reverted index, we will need an unique DOC ID to identify the document. Since we use a delayed delete mechanism, the deleted document could still be in the index, and we use a DELETED table to register deleted Doc IDs, so we need to make sure Doc ID is never reused. So our Doc ID have following properties: 1) If user do not supply Doc ID, we will generate an internal Doc ID column (of name "FTS_DOC_ID") for them. This would require a rebuild of primary index, so it could be time consuming 2) User can supply Doc IDs (just like many other FT search engine). As of now, it needs to be a 8 bytes unique non-null BIGINT column. Currently it must have the reserved name "FTS_DOC_ID". If such column exists, we will not need to add Doc ID column. And no primary index is needed. 3) We require the Doc ID to be monotonically increasing. That is, we register latest (largest) Doc ID in our internal CONFIG table and its internal structure. So for each inserting Doc, we will check the ID. If there is violation, the insertion can be either rejected or be warned (in the later case user takes his own risk). 4) If server crashes, and reboot, again, the inserting doc's Doc ID needs to be larger than the value we registered in the config table 5) If users do want to reuse ID, then he should check the existing table and the deleted table do not have IDs larger than where he restarted. But the burden can leave to him. Since there is unique index on the Doc ID, he can't inserting anything already in the table, but he could inserting some Doc with Doc ID in our deleted table. The end result would be such Doc will not shown as result and could be optimized out of our inverted dictionary. 6) User can manage Doc ID in any way they want. Auto-increment column is one way to do so.
User Interface The current user Interfaces and Syntax mostly mirror with those provided by MyISAM MySQL v5.5 FTS Documentation. Natural Language Full-Text Searches By default or with the IN NATURAL LANGUAGE MODE modifier, the MATCH() function performs a natural language search for a string against a text collection. A collection is a set of one or more columns included in a FULLTEXT index. The search string is given as the argument to AGAINST(). For each row in the table, MATCH() returns a relevance value; that is, a similarity measure between the search string and the text in that row in the columns named in the MATCH() list. mysql> CREATE TABLE articles ( -> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, -> title VARCHAR(200), -> body TEXT, -> FULLTEXT (title,body) -> ) ENGINE=InnoDB; Query OK, 0 rows affected (0.00 sec) mysql> INSERT INTO articles (title,body) VALUES -> ('MySQL Tutorial','DBMS stands for DataBase ...'), -> ('How To Use MySQL Well','After you went through a ...'), -> ('Optimizing MySQL','In this tutorial we will show ...'), -> ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'), -> ('MySQL vs. YourSQL','In the following database comparison ...'), -> ('MySQL Security','When configured properly, MySQL ...'); Query OK, 6 rows affected (0.00 sec) Records: 6 Duplicates: 0 Warnings: 0 mysql> SELECT * FROM articles -> WHERE MATCH (title,body) -> AGAINST ('database' IN NATURAL LANGUAGE MODE); +----+-------------------+------------------------------------------+ | id | title | body | +----+-------------------+------------------------------------------+ | 5 | MySQL vs. YourSQL | In the following database comparison ... | | 1 | MySQL Tutorial | DBMS stands for DataBase ... | +----+-------------------+------------------------------------------+ 2 rows in set (0.00 sec) By default, the search is performed in case-insensitive fashion. However, you can perform a case-sensitive full-text search by using a binary collation for the indexed columns. For example, a column that uses the latin1 character set of can be assigned a collation of latin1_bin to make it case sensitive for full-text searches. When MATCH() is used in a WHERE clause, as in the example shown earlier, the rows returned are automatically sorted with the highest relevance first. Relevance values are non-negative floating-point numbers. Zero relevance means no similarity. Relevance is computed based on the number of words in the row, the number of unique words in that row, the total number of words in the collection, and the number of documents (rows) that contain a particular word. To simply count matches, you could use a query like this: mysql> SELECT COUNT(*) FROM articles -> WHERE MATCH (title,body) -> AGAINST ('database' IN NATURAL LANGUAGE MODE); +----------+ | COUNT(*) | +----------+ | 2 | +----------+ 1 row in set (0.00 sec) However, you might find it quicker to rewrite the query as follows: mysql> SELECT -> COUNT(IF(MATCH (title,body) AGAINST ('database' IN NATURAL LANGUAGE MODE), 1, NULL)) -> AS count -> FROM articles; +-------+ | count | +-------+ | 2 | +-------+ 1 row in set (0.03 sec) The first query sorts the results by relevance whereas the second does not. However, the second query performs a full table scan and the first does not. The first may be faster if the search matches few rows; otherwise, the second may be faster because it would read many rows anyway. For natural-language full-text searches, it is a requirement that the columns named in the MATCH() function be the same columns included in some FULLTEXT index in your table. For the preceding query, note that the columns named in the MATCH() function (title and body) are the same as those named in the definition of the article table's FULLTEXT index. If you wanted to search the title or body separately, you would need to create separate FULLTEXT indexes for each column. It is also possible to perform a boolean search or a search with query expansion. These search types are described in Section "Boolean Full-Text Searches", and Section "Full-Text Searches with Query Expansion". A full-text search that uses an index can name columns only from a single table in the MATCH() clause because an index cannot span multiple tables. A boolean search can be done in the absence of an index (albeit more slowly), in which case it is possible to name columns from multiple tables. The preceding example is a basic illustration that shows how to use the MATCH() function where rows are returned in order of decreasing relevance. The next example shows how to retrieve the relevance values explicitly. Returned rows are not ordered because the SELECT statement includes neither WHERE nor ORDER BY clauses: mysql> SELECT id, MATCH (title,body) -> AGAINST ('Tutorial' IN NATURAL LANGUAGE MODE) AS score -> FROM articles; +----+------------------+ | id | score | +----+------------------+ | 1 | 0.65545833110809 | | 2 | 0 | | 3 | 0.66266459226608 | | 4 | 0 | | 5 | 0 | | 6 | 0 | +----+------------------+ 6 rows in set (0.00 sec) The following example is more complex. The query returns the relevance values and it also sorts the rows in order of decreasing relevance. To achieve this result, you should specify MATCH() twice: once in the SELECT list and once in the WHERE clause. This causes no additional overhead, because the MySQL optimizer notices that the two MATCH() calls are identical and invokes the full-text search code only once. mysql> SELECT id, body, MATCH (title,body) AGAINST -> ('Security implications of running MySQL as root' -> IN NATURAL LANGUAGE MODE) AS score -> FROM articles WHERE MATCH (title,body) AGAINST -> ('Security implications of running MySQL as root' -> IN NATURAL LANGUAGE MODE); +----+-------------------------------------+-----------------+ | id | body | score | +----+-------------------------------------+-----------------+ | 4 | 1. Never run mysqld as root. 2. ... | 1.5219271183014 | | 6 | When configured properly, MySQL ... | 1.3114095926285 | +----+-------------------------------------+-----------------+ 2 rows in set (0.00 sec) The MySQL FULLTEXT implementation regards any sequence of true word characters (letters, digits, and underscores) as a word. That sequence may also contain apostrophes (`'`), but not more than one in a row. This means that aaa'bbb is regarded as one word, but aaa''bbb is regarded as two words. Apostrophes at the beginning or the end of a word are stripped by the FULLTEXT parser; 'aaa'bbb' would be parsed as aaa'bbb. The FULLTEXT parser determines where words start and end by looking for certain delimiter characters; for example, ` ` (space), `,` (comma), and `.` (period). If words are not separated by delimiters (as in, for example, Chinese), the FULLTEXT parser cannot determine where a word begins or ends. To be able to add words or other indexed terms in such languages to a FULLTEXT index, you must preprocess them so that they are separated by some arbitrary delimiter such as `"`. Starting at MySQL 5.1, you can write a plugin that replaces the built-in full-text parser. For details, see Section 22.2, "The MySQL Plugin Interface". For example parser plugin source code, see the plugin/fulltext directory of a MySQL source distribution. Some words are ignored in full-text searches: Words in the stopword list are ignored. A stopword is a word such as "the" or "some" that is so common that it is considered to have zero semantic value. There is a built-in stopword list, but it can be overwritten by a user. Every correct word in the collection and in the query is weighted according to its significance in the collection or query. Consequently, a word that is present in many documents has a lower weight (and may even have a zero weight), because it has lower semantic value in this particular collection. Conversely, if the word is rare, it receives a higher weight. The weights of the words are combined to compute the relevance of the row. Such a technique works best with large collections (in fact, it was carefully tuned this way). For very small tables, word distribution does not adequately reflect their semantic value, and this model may sometimes produce bizarre results. For example, although the word "MySQL" is present in every row of the articles table shown earlier, a search for the word produces no results: mysql> SELECT * FROM articles -> WHERE MATCH (title,body) -> AGAINST ('MySQL' IN NATURAL LANGUAGE MODE); Empty set (0.00 sec) The search result is empty because the word "MySQL" is present in at least 50% of the rows. As such, it is effectively treated as a stopword. For large data sets, this is the most desirable behavior: A natural language query should not return every second row from a 1GB table. For small data sets, it may be less desirable. A word that matches half of the rows in a table is less likely to locate relevant documents. In fact, it most likely finds plenty of irrelevant documents. We all know this happens far too often when we are trying to find something on the Internet with a search engine. It is with this reasoning that rows containing the word are assigned a low semantic value for the particular data set in which they occur. A given word may reach the 50% threshold in one data set but not another. The 50% threshold has a significant implication when you first try full-text searching to see how it works: If you create a table and insert only one or two rows of text into it, every word in the text occurs in at least 50% of the rows. As a result, no search returns any results. Be sure to insert at least three rows, and preferably many more. Users who need to bypass the 50% limitation can use the boolean search mode; see Section 10.8.2, "Boolean Full-Text Searches". Boolean Full-Text Searches MySQL can perform boolean full-text searches using the IN BOOLEAN MODE modifier: mysql> SELECT * FROM articles WHERE MATCH (title,body) -> AGAINST ('+MySQL -YourSQL' IN BOOLEAN MODE); +----+-----------------------+-------------------------------------+ | id | title | body | +----+-----------------------+-------------------------------------+ | 1 | MySQL Tutorial | DBMS stands for DataBase ... | | 2 | How To Use MySQL Well | After you went through a ... | | 3 | Optimizing MySQL | In this tutorial we will show ... | | 4 | 1001 MySQL Tricks | 1. Never run mysqld as root. 2. ... | | 6 | MySQL Security | When configured properly, MySQL ... | +----+-----------------------+-------------------------------------+ The + and - operators indicate that a word is required to be present or absent, respectively, for a match to occur. Thus, this query retrieves all the rows that contain the word "MySQL" but that do not contain the word "YourSQL". In implementing this feature, MySQL uses what is sometimes referred to as implied Boolean logic, in which * + stands for AND * - stands for NOT * [no operator] implies OR Boolean full-text searches have these characteristics: * They do not use the 50% threshold. * They do not automatically sort rows in order of decreasing relevance. You can see this from the preceding query result: The row with the highest relevance is the one that contains "MySQL" twice, but it is listed last, not first. * They can work even without a FULLTEXT index, although a search executed in this fashion would be quite slow. * The minimum and maximum word length full-text parameters apply. * The stopword list applies. The boolean full-text search capability supports the following operators: * + A leading plus sign indicates that this word must be present in each row that is returned. * - A leading minus sign indicates that this word must not be present in any of the rows that are returned. Note: The - operator acts only to exclude rows that are otherwise matched by other search terms. Thus, a boolean-mode search that contains only terms preceded by - returns an empty result. It does not return "all rows except those containing any of the excluded terms." * (no operator) By default (when neither + nor - is specified) the word is optional, but the rows that contain it are rated higher. This mimics the behavior of MATCH() ... AGAINST() without the IN BOOLEAN MODE modifier. * > < These two operators are used to change a word's contribution to the relevance value that is assigned to a row. The > operator increases the contribution and the < operator decreases it. See the example following this list. * ( ) Parentheses group words into subexpressions. Parenthesized groups can be nested. * @ The proximity operator "hello world"@10 Proximity distance is specified in words, adjusted for word count, and applies to all words within quotes. For instance, "cat dog mouse"@5 query means that there must be less than 8-word span which contains all 3 words, ie. "CAT aaa bbb ccc DOG eee fff MOUSE" document will not match this query, because this span is exactly 8 words long. * ~ A leading tilde acts as a negation operator, causing the word's contribution to the row's relevance to be negative. This is useful for marking "noise" words. A row containing such a word is rated lower than others, but is not excluded altogether, as it would be with the - operator. * * The asterisk serves as the truncation (or wildcard) operator. Unlike the other operators, it should be appended to the word to be affected. Words match if they begin with the word preceding the * operator. ** If a stopword or too-short word is specified with the truncation operator, it will not be stripped from a boolean query. For example, a search for '+word +stopword*' will likely return fewer rows than a search for '+word +stopword' because the former query remains as is and requires stopword* to be present in a document. The latter query is transformed to +word. * "A phrase that is enclosed within double quote (`"`) characters matches only rows that contain the phrase literally, as it was typed. The full-text engine splits the phrase into words, performs a search in the FULLTEXT index for the words. Non-word characters need not be matched exactly: Phrase searching requires only that matches contain exactly the same words as the phrase and in the same order. For example, "test phrase" matches "test, phrase". ** If the phrase contains no words that are in the index, the result is empty. For example, if all words are either stopwords or shorter than the minimum length of indexed words, the result is empty. The following examples demonstrate some search strings that use boolean full-text operators: * 'apple banana' - Find rows that contain at least one of the two words. * '+apple +juice' - Find rows that contain both words. * '+apple macintosh' - Find rows that contain the word "apple", but rank rows higher if they also contain "macintosh". * '+apple -macintosh' - Find rows that contain the word "apple" but not "macintosh". * '+apple ~macintosh' - Find rows that contain the word "apple", but if the row also contains the word "macintosh", rate it lower than if row does not. This is "softer" than a search for '+apple -macintosh', for which the presence of "macintosh" causes the row not to be returned at all. * '+apple +(>turnoverSELECT * FROM articles -> WHERE MATCH (title,body) -> AGAINST ('database' IN NATURAL LANGUAGE MODE); +----+-------------------+------------------------------------------+ | id | title | body | +----+-------------------+------------------------------------------+ | 5 | MySQL vs. YourSQL | In the following database comparison ... | | 1 | MySQL Tutorial | DBMS stands for DataBase ... | +----+-------------------+------------------------------------------+ 2 rows in set (0.00 sec) mysql> SELECT * FROM articles -> WHERE MATCH (title,body) -> AGAINST ('database' WITH QUERY EXPANSION); +----+-------------------+------------------------------------------+ | id | title | body | +----+-------------------+------------------------------------------+ | 1 | MySQL Tutorial | DBMS stands for DataBase ... | | 5 | MySQL vs. YourSQL | In the following database comparison ... | | 3 | Optimizing MySQL | In this tutorial we will show ... | +----+-------------------+------------------------------------------+ 3 rows in set (0.00 sec) Another example could be searching for books by Georges Simenon about Maigret, when a user is not sure how to spell “Maigret”. A search for “Megre and the reluctant witnesses” finds only “Maigret and the Reluctant Witnesses” without query expansion. A search with query expansion finds all books with the word “Maigret” on the second pass. Because blind query expansion tends to increase noise significantly by returning non-relevant documents, it is meaningful to use only when a search phrase is rather short. Full-Text Stopwords For FTS stopwords, we provide two sources of stopwords: 1) Default stopwords provided by the server - this is a static list of stopword come with server. If no user stopword is defined, then this default stopword list will be used. 2) User defined stopwords - User can define his/her own stop word by creating an user table, and use global variable "innodb_stopword_table_name" to point to this stopword list. Server will load stopword from this user table, rather than the default stopword list when we create the FTS index. There are following rules for the InnoDB FTS stopwords: 1) stopword list is server wide. One server will have one "official" stopword list at one time. 2) words in stopword are excluded from indexing. So they will not be indexed in the first place. 3) stopword list will be loaded when each index created, and they will remain in cache for the life time of FTS cache. That is, if the stopword list changes in between, it will not affect stopword for tables create prior to the change. It only be effective for stopword created after. 4) If server crashes/shutdown after stopword list changes, when rebooting the server, FTS indexes could all load the "current" stopword list, which means, if user do not rebuild their FTS index after stopword changed, they could possibly have FTS index with partially indexed stopwords. For the user table specified for the stopword, there is only one requirement for the format of the table, its first column must be of varchar type and not null. Here is the current list of stopword for our prototype: * InnoDB default stopword list. ** There are different versions of stopwords, the stop words listed below is so called "Google Stop Word" list. const char *fts_default_stopword[] = { * "a", "about", "an", "are", "as", "at", "be", "by", "com", "de", "en", "for", "from", "how", "i", "in", "is ", "it", "la", "of", "on", "or", "that", "the", "this", "to", "was", "what", "when", "where", "who", "will", "with", "und", "the", "www", NULL }; Here are some examples: # Create FTS table CREATE TABLE articles ( id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, title VARCHAR(200), body TEXT, FULLTEXT (title,body) ) ENGINE=InnoDB; # Insert six rows INSERT INTO articles (title,body) VALUES ('MySQL Tutorial','DBMS stands for DataBase ...') , ('How To Use MySQL Well','After you went through a ...'), ('Optimizing MySQL','In this tutorial we will show ...'), ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'), ('MySQL vs. YourSQL','In the following database comparison ...'), ('MySQL Security','When configured properly, MySQL ...'); # This is to test the stopword functionality. The word "this" and # "in" are in the default stopword list. It should not show up mysql> SELECT * FROM articles WHERE MATCH (title,body) -> AGAINST ('this' IN NATURAL LANGUAGE MODE); Empty set (0.00 sec) # Provide user defined stopword table, if not (correctly) defined, # it will be rejected set global innodb_stopword_table_name = "not_defined"; ERROR 1210 (HY000): Incorrect arguments to SET # Define a correct formated user stopword table mysql> create table user_stopword(value varchar(30)) engine = innodb; Query OK, 0 rows affected (0.01 sec) mysql> set global innodb_stopword_table_name = "test/user_stopword"; Query OK, 0 rows affected (5.55 sec) # Nothing inserted into the default stopword, so essentially # nothing get screened. The new stopword could only be # effective for table created thereafter mysql> CREATE TABLE articles_2 ( -> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, -> title VARCHAR(200), -> body TEXT, -> FULLTEXT (title,body) -> ) ENGINE=InnoDB; Query OK, 0 rows affected (5.31 sec) mysql> INSERT INTO articles_2 (title,body) VALUES ('test for stopwords','this is it...'); Query OK, 1 row affected (0.01 sec) mysql> SELECT * FROM articles_2 WHERE MATCH (title,body) -> AGAINST ('this' IN NATURAL LANGUAGE MODE); +----+--------------------+---------------+ | id | title | body | +----+--------------------+---------------+ | 1 | test for stopwords | this is it... | +----+--------------------+---------------+ 1 row in set (0.00 sec) mysql> drop table articles_2; Query OK, 0 rows affected (7.43 sec) # Ok, let's instantiate some value into user supplied stop word # table insert into user_stopword values("this"); mysql> CREATE TABLE articles_2 ( -> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, -> title VARCHAR(200), -> body TEXT, -> FULLTEXT (title,body) -> ) ENGINE=InnoDB; Query OK, 0 rows affected (4.57 sec) mysql> INSERT INTO articles_2 (title, body) -> VALUES ('test for stopwords','this is it...'); Query OK, 1 row affected (0.00 sec) mysql> mysql> SELECT * FROM articles_2 WHERE MATCH (title,body) -> AGAINST ('this' IN NATURAL LANGUAGE MODE); Empty set (0.00 sec) Full-Text Restrictions * In terms of InnoDB DML, the tokenization of the document happens when the DML is committed. Before the commit, you will not be able to use FTS to search document as it is not tokenized and put in the FullText Index. * Ideographic languages such as Chinese and Japanese do not have word delimiters. Therefore, the FULLTEXT parser cannot determine where words begin and end in these and other such languages. The implications of this and some workarounds for the problem are described in Section 10.8, "Full-Text Search Functions". * Although the use of multiple character sets within a single table is supported, all columns in a FULLTEXT index must use the same character set and collation. * The MATCH() column list must match exactly the column list in some FULLTEXT index definition for the table, unless this MATCH() is IN BOOLEAN MODE. Boolean-mode searches can be done on non-indexed columns, although they are likely to be slow. * The argument to AGAINST() must be a constant string. Fine-Tuning Full-Text Search Values used by the built-in parser and document manager: * innodb_cache_size - server wide configure variable (default: 32MB) in bytes. The size of the cache used to store parsed document in memory. * innodb_min_token_size - server wide configure variable (default: 3) Ignore tokens that have a length less than this value. The user must rebuild the FTS indexes after modifying this value. This change is persistent. * innodb_max_token_size - server wide configure variable (default: 84) Ignore tokens that have a length larger than this value. The user must rebuild the FTS indexes after modifying this value. This change is persistent. Values used by the optimize thread: * innodb_optimize_add_threshold - per table value (default: 20,000 documents) Signal an optimize when the number of added documents exceeds this threshold. * innodb_optimize_del_threshold - per table value (default: 20,000 documents) Signal an optimize when the number of deleted documents exceeds this threshold.
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.