
WL#2466: Fulltext: "always-index" words

Affects: Server-5.7 — Status: Assigned


The opposite of a "stopword list": a list of strings that are always
treated as words and indexed.

These strings may include non-letters, be shorter than min_word_length,
or appear in the stopword list.

Examples: C++, TCP/IP, HP-UX, IBM (otherwise too short).

Because "always-index" words do not respect word boundaries, they may overlap with
other words. In that case, the word that starts first wins.

Example: given the two "always-index" words "c++" and "++c", the string "c++c" will
be parsed as the word "c++" and the word "c" (the latter being discarded as too short),
while the string "++c++" will be parsed as the word "++c".
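
The overlap rule can be illustrated with a naive left-to-right scan. This is only a sketch and not the proposed parser: it checks every always-index word at every position instead of using a trie. The word list, MIN_WORD_LEN, the buffer limits, and the names match_always_index, emit and tokenize are all assumptions made for the illustration. It also assumes that an ordinary word which has already started is read to its end before always-index words are considered again.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MIN_WORD_LEN 4    /* assumed stand-in for min_word_length */
#define MAX_TOKENS 16     /* illustration-only limits */
#define MAX_TOKEN_LEN 32

/* Hypothetical always-index list; the spec loads it from a file. */
static const char *const always_index[] = { "c++", "++c" };
#define N_ALWAYS (sizeof always_index / sizeof always_index[0])

/* Length of the always-index word that starts exactly at s, or 0. */
static size_t match_always_index(const char *s)
{
  size_t i;
  for (i = 0; i < N_ALWAYS; i++) {
    size_t len = strlen(always_index[i]);
    if (strncmp(s, always_index[i], len) == 0)
      return len;
  }
  return 0;
}

/* Copies one token into the output array if there is room for it. */
static void emit(char out[][MAX_TOKEN_LEN], size_t *n,
                 const char *s, size_t len)
{
  if (*n < MAX_TOKENS && len < MAX_TOKEN_LEN) {
    memcpy(out[*n], s, len);
    out[*n][len] = '\0';
    (*n)++;
  }
}

/* One pass over text; returns the number of tokens stored in out. */
static size_t tokenize(const char *text, char out[][MAX_TOKEN_LEN])
{
  size_t n = 0;
  const char *p = text;
  assert(text != NULL);
  while (*p) {
    size_t len = match_always_index(p);
    if (len > 0) {
      /* An always-index word starts here, so it wins over later ones. */
      emit(out, &n, p, len);
      p += len;
    } else if (isalnum((unsigned char) *p)) {
      /* Ordinary word: a run of alphanumerics, dropped if too short. */
      const char *start = p;
      while (isalnum((unsigned char) *p))
        p++;
      if ((size_t) (p - start) >= MIN_WORD_LEN)
        emit(out, &n, start, (size_t) (p - start));
    } else {
      p++;  /* separator */
    }
  }
  return n;
}
```

With these assumptions, tokenize("c++c", out) stores the single token "c++" (the trailing "c" is dropped as too short), and tokenize("++c++", out) stores the single token "++c".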

We will use the Aho-Corasick algorithm to quickly match a string against a set of
patterns. It should be plugged into the current one-pass ft parser, so it won't
require additional passes.

Always-index words will be loaded from a file, just as stopwords are;
no built-in list of always-index words is needed.

This will be a new data structure, TRIE, in include/trie.h and mysys/trie.c,
with a set of functions to work with it:

trie_init, trie_insert, trie_free, trie_search, trie_prepare

Since there will be no way to delete a single element from a trie (a trie can only be
freed as a whole), the trie should use the "root memory allocator" (MEM_ROOT).

The trie_prepare function will compute failure links so that the trie can be used
for pattern-set searches, as in the Aho-Corasick algorithm.

Once a trie is "prepared", it should not be modified. If modification is really needed,
trie_prepare must be called again after inserting new elements.

Without trie_prepare, the trie can still be used for keyword lookups.
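
The following is a minimal sketch of how such a trie might look. It is not the worklog's implementation. The function names come from the list above, but every signature and struct field here is an assumption. A single malloc'ed node block stands in for MEM_ROOT, and trie_scan is a hypothetical helper added only to exercise the failure links. Note that trie_scan reports the match that ends first in the text; it does not attempt to implement the "first word wins" rule for overlapping words.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
  int next[256];  /* child index per byte value, -1 if absent */
  int fail;       /* failure link, filled in by trie_prepare */
  int word_len;   /* length of the word ending here, 0 if none */
  int out_len;    /* longest word ending at this state (after prepare) */
} TRIE_NODE;

typedef struct { TRIE_NODE *nodes; int used, capacity; } TRIE;

/* Nodes come from one block that is only released as a whole. */
static int trie_new_node(TRIE *t) {
  TRIE_NODE *n;
  if (t->used == t->capacity) return -1;
  n = &t->nodes[t->used];
  memset(n->next, -1, sizeof n->next);  /* every int becomes -1 */
  n->fail = n->word_len = n->out_len = 0;
  return t->used++;
}

static int trie_init(TRIE *t, int capacity) {
  assert(capacity > 0);
  t->nodes = malloc((size_t) capacity * sizeof *t->nodes);
  t->used = 0; t->capacity = capacity;
  return (t->nodes && trie_new_node(t) == 0) ? 0 : -1;  /* node 0: root */
}

static void trie_free(TRIE *t) { free(t->nodes); t->nodes = NULL; }

static int trie_insert(TRIE *t, const char *word) {
  const unsigned char *p = (const unsigned char *) word;
  int cur = 0;
  for (; *p; p++) {
    if (t->nodes[cur].next[*p] < 0) {
      int child = trie_new_node(t);
      if (child < 0) return -1;  /* pool exhausted */
      t->nodes[cur].next[*p] = child;
    }
    cur = t->nodes[cur].next[*p];
  }
  t->nodes[cur].word_len = (int) strlen(word);
  return 0;
}

/* Exact keyword lookup; works with or without trie_prepare. */
static int trie_search(const TRIE *t, const char *word) {
  const unsigned char *p = (const unsigned char *) word;
  int cur = 0;
  for (; *p && cur >= 0; p++)
    cur = t->nodes[cur].next[*p];
  return cur >= 0 && t->nodes[cur].word_len > 0;
}

/* Breadth-first pass that computes the failure link of every node. */
static int trie_prepare(TRIE *t) {
  int head = 0, tail = 0, c;
  int *queue = malloc((size_t) t->used * sizeof *queue);
  if (!queue) return -1;
  queue[tail++] = 0;
  while (head < tail) {
    int u = queue[head++];
    for (c = 0; c < 256; c++) {
      int v = t->nodes[u].next[c], f = t->nodes[u].fail;
      if (v < 0) continue;
      while (f != 0 && t->nodes[f].next[c] < 0)
        f = t->nodes[f].fail;
      f = t->nodes[f].next[c];
      t->nodes[v].fail = (f >= 0 && f != v) ? f : 0;
      t->nodes[v].out_len = t->nodes[v].word_len ? t->nodes[v].word_len
                          : t->nodes[t->nodes[v].fail].out_len;
      queue[tail++] = v;
    }
  }
  free(queue);
  return 0;
}

/* Hypothetical helper: reports the match that ends first in text. */
static int trie_scan(const TRIE *t, const char *text, size_t *start, size_t *len) {
  int s = 0;
  size_t i;
  for (i = 0; text[i]; i++) {
    unsigned char c = (unsigned char) text[i];
    while (s != 0 && t->nodes[s].next[c] < 0)
      s = t->nodes[s].fail;
    if (t->nodes[s].next[c] >= 0)
      s = t->nodes[s].next[c];
    if (t->nodes[s].out_len > 0) {
      *len = (size_t) t->nodes[s].out_len;
      *start = i + 1 - *len;
      return 1;
    }
  }
  return 0;
}
```

In this sketch a caller would run trie_init, insert each always-index word with trie_insert, call trie_prepare once, and then use trie_scan on the text. trie_search needs no failure links, which matches the note that keyword lookups work without trie_prepare. Calling trie_prepare again after a later trie_insert recomputes all links from scratch.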
