WL#5538: InnoDB Full-Text Search Support

Status: Complete

This is the placeholder worklog for the ongoing development of the Full Text
Search (FTS) project in InnoDB.

The purpose of the project is to give users the ability to build a Full-Text
Index on text documents stored in the InnoDB storage engine, and to provide
fast and accurate search over the document content.

The project design is based on a thesis work (see attached FTSProject.htm) by
Osku. In essence, the non-relational text data is tokenized and stored in a
set of relational tables (auxiliary tables) along with the Doc ID and position
of each word, and thus we build the so-called "inverted index" on the tokenized
words and their positions.
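
As an illustration of the idea only, the following Python sketch tokenizes a
few documents and records each word together with its Doc ID and position. It
is not the InnoDB implementation: the whitespace tokenizer, the use of word
offsets as positions, and the names tokenize and build_inverted_index are
assumptions made for this example. In the actual design these rows live in the
auxiliary tables.

```python
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words (hypothetical whitespace-only tokenizer)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each word to a list of (doc_id, position) pairs.

    `docs` is a dict of {doc_id: text}. Positions are word offsets here,
    which is an assumption of this sketch.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for position, word in enumerate(tokenize(docs[doc_id])):
            index[word].append((doc_id, position))
    return dict(index)


docs = {1: "InnoDB full text search", 2: "full text index"}
index = build_inverted_index(docs)
print(index["text"])  # [(1, 2), (2, 1)]
```

Looking up a word returns every document that contains it and where, which is
what makes the index "inverted" relative to the original document-to-text
layout.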

In brief, we will provide the following FTS functionality in the first
release of the feature (please refer to the attached function spec "FTS_FUNC"
for more info):

1) Create FTS through FIC (Fast Index Creation) interface

2) FTS search with the following options:
    Natural Language Full-Text Searches
    Boolean Full-Text Searches
    Full-Text Searches with Query Expansion
    Proximity Search

3) Default and user-supplied Full-Text Stopwords

The initial release shall provide this basic FTS functionality with efficiency
and robustness.


=============== DOC ID =============

To create the inverted index, we need a unique Doc ID to identify each
document. Since we use a delayed delete mechanism, a deleted document could
still be in the index, and we use a DELETED table to register deleted Doc IDs.
We therefore need to make sure a Doc ID is never reused. So our Doc ID has the
following properties:

1) If the user does not supply a Doc ID, we will generate an internal Doc ID
column (named "FTS_DOC_ID") for them. This requires a rebuild of the primary
index, so it could be time consuming.

2) Users can supply Doc IDs (as in many other FT search engines). As of now,
it needs to be an 8-byte unique non-null BIGINT column. Currently it must have
the reserved name "FTS_DOC_ID". If such a column exists, we do not need to add
a Doc ID column, and no primary index is needed.

3) We require the Doc ID to be monotonically increasing. That is, we register
the latest (largest) Doc ID in our internal CONFIG table and its internal
structure, and for each inserted document we check the ID against it. If there
is a violation, the insertion can be either rejected or allowed with a warning
(in the latter case the user proceeds at their own risk).

4) If the server crashes and restarts, the Doc ID of each inserted document
must again be larger than the value we registered in the CONFIG table.

5) If a user does want to reuse IDs, they should check that the existing table
and the DELETED table do not have IDs larger than the point where they
restarted. This burden is left to the user. Since there is a unique index on
the Doc ID, they cannot insert anything already in the table, but they could
insert a document with a Doc ID that is in our DELETED table. The end result
would be that such a document is not shown in results and could be optimized
out of our inverted dictionary.

6) Users can manage Doc IDs in any way they want. An auto-increment column is
one way to do so.
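
The rules above can be modelled with a small in-memory Python sketch. It is a
toy under stated assumptions, not the InnoDB code: the class DocIdTracker, its
methods, and the reject_violations switch are hypothetical names, and plain
Python sets stand in for the table's unique index, the CONFIG table value, and
the DELETED table.

```python
class DocIdTracker:
    """Toy in-memory model of the Doc ID rules (hypothetical names throughout)."""

    def __init__(self, reject_violations=True):
        self.max_doc_id = 0    # stands in for the value kept in the CONFIG table
        self.live = set()      # Doc IDs currently in the table (unique index)
        self.deleted = set()   # stands in for the DELETED table
        self.reject_violations = reject_violations  # True: reject, False: warn

    def insert(self, doc_id):
        """Insert a Doc ID; return warnings, or raise ValueError if rejected."""
        if doc_id in self.live:
            raise ValueError(f"Doc ID {doc_id} already exists")
        warnings = []
        if doc_id <= self.max_doc_id:
            message = f"Doc ID {doc_id} is not larger than {self.max_doc_id}"
            if self.reject_violations:
                raise ValueError(message)
            warnings.append(message)
        self.live.add(doc_id)
        self.max_doc_id = max(self.max_doc_id, doc_id)
        return warnings

    def delete(self, doc_id):
        """Delayed delete: the ID is only registered as deleted."""
        self.live.remove(doc_id)
        self.deleted.add(doc_id)

    def visible(self, doc_ids):
        """Filter search hits; IDs registered as deleted are never returned."""
        return [d for d in doc_ids if d in self.live and d not in self.deleted]


tracker = DocIdTracker(reject_violations=False)
tracker.insert(1)
tracker.insert(2)
tracker.delete(1)
reuse_warnings = tracker.insert(1)   # reuses a deleted Doc ID, warning only
print(reuse_warnings)                # ['Doc ID 1 is not larger than 2']
print(tracker.visible([1, 2]))       # [2]
```

The last two lines show the case from item 5: the reused Doc ID passes the
unique check, but because it is still registered as deleted, the new document
does not appear in search results.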