WL#2466: Fulltext: "always-index" words
Affects: Server-5.7
—
Status: Assigned
The opposite of a "stopword list": a list of strings that are always treated as words and indexed. These strings may include non-letters, be shorter than min_word_length, or be listed in the stopword list. Examples: C++, TCP/IP, HP-UX, IBM (otherwise too short).
Because "always-index" words do not respect word boundaries, they may overlap with other words. In that case, the first word wins. Example: given the two "always-index" words "c++" and "++c", the string "c++c" will be parsed as the word "c++" and the word "c" (the latter being discarded as too short), and the string "++c++" will be parsed as the word "++c".
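The "first word wins" rule can be illustrated with a small sketch. It is not the planned implementation: it uses a naive scan that tries every always-index word at every position, not Aho-Corasick. The names token, match_always_index, emit and scan_first_wins are hypothetical. The sketch also assumes that the longest word wins when two always-index words start at the same position, that ordinary words are runs of ASCII alphanumerics, and that comparison is byte-wise; none of this is specified above. Stopword filtering is left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A token is a slice of the scanned text, not a NUL-terminated copy. */
typedef struct { const char *start; size_t len; } token;

/* Length of the longest always-index word starting at s, or 0 if none.
   (Longest-wins on ties is an assumption of this sketch.) */
static size_t match_always_index(const char *s, const char *const *words,
                                 size_t nwords) {
  size_t best = 0;
  for (size_t i = 0; i < nwords; i++) {
    size_t len = strlen(words[i]);
    if (len > best && strncmp(s, words[i], len) == 0) best = len;
  }
  return best;
}

/* Store a token if there is room; returns the updated token count. */
static size_t emit(token *out, size_t n, size_t max_out,
                   const char *s, size_t len) {
  if (n < max_out) { out[n].start = s; out[n].len = len; n++; }
  return n;
}

/* Scan text left to right. An always-index word that starts at the current
   position is taken whole and scanning resumes after it, so an overlapping
   word that starts later loses. Returns the number of tokens stored. */
size_t scan_first_wins(const char *text, const char *const *words,
                       size_t nwords, size_t min_len,
                       token *out, size_t max_out) {
  size_t n = 0;
  const char *word = NULL;  /* start of the ordinary word in progress */
  assert(text != NULL && out != NULL);
  for (const char *p = text; ; ) {
    size_t hit = *p ? match_always_index(p, words, nwords) : 0;
    int in_word = *p && !hit && isalnum((unsigned char)*p);
    if (in_word) {
      if (!word) word = p;
    } else if (word) {
      /* The ordinary word ends here; keep it only if it is long enough. */
      if ((size_t)(p - word) >= min_len)
        n = emit(out, n, max_out, word, (size_t)(p - word));
      word = NULL;
    }
    if (!*p) break;
    if (hit) n = emit(out, n, max_out, p, hit);  /* exempt from min_len */
    p += hit ? hit : 1;
  }
  return n;
}
```

With the words "c++" and "++c" and a minimum length of 4, scan_first_wins stores the single token "c++" for the text "c++c", and the single token "++c" for the text "++c++".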
We will use the Aho-Corasick algorithm to quickly match a string against a set of patterns. It should plug into the current one-pass ft parser without requiring additional passes. Always-index words will be loaded from a file, just as stopwords are; no built-in list of always-index words is needed.
This will be a new data structure, TRIE, in include/trie.h and mysys/trie.c, with a set of functions to work with it: trie_init, trie_insert, trie_free, trie_search, trie_prepare.
Since there will be no way to delete a single element from a trie (a trie can only be freed as a whole), the trie should use the "root memory allocator" (MEM_ROOT).
The trie_prepare function will compute failure links so that the trie can be used for pattern-set searches, as in the Aho-Corasick algorithm. Once a trie is "prepared", it should not be modified; if modification is really needed, trie_prepare must be called again after inserting new elements. Without trie_prepare, the trie can still be used for keyword lookups.
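A minimal sketch of how such a trie might look follows. Only the five function names come from the description above; the signatures, the trie_node layout, and the extra trie_scan function are assumptions made for illustration. A fixed-size node pool allocated once in trie_init stands in for MEM_ROOT, and trie_search is assumed to be an exact keyword lookup. trie_scan reports the match that ends earliest, which can differ from the leftmost match when one pattern contains another.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
  int next[256];  /* child index per byte value, -1 if absent */
  int fail;       /* failure link; meaningful only after trie_prepare */
  int word_len;   /* length of the pattern ending exactly here, 0 if none */
  int match_len;  /* longest pattern that is a suffix of this node's path */
} trie_node;

typedef struct { trie_node *nodes; int used, cap; } TRIE;

/* Take a blank node from the pool; returns its index or -1 when full. */
static int trie_new_node(TRIE *t) {
  if (t->used == t->cap) return -1;
  trie_node *n = &t->nodes[t->used];
  for (int c = 0; c < 256; c++) n->next[c] = -1;
  n->fail = n->word_len = n->match_len = 0;
  return t->used++;
}

/* One malloc for the whole pool (a stand-in for MEM_ROOT, by assumption). */
int trie_init(TRIE *t, int max_nodes) {
  assert(t != NULL && max_nodes > 0);
  t->nodes = malloc((size_t)max_nodes * sizeof *t->nodes);
  t->used = 0;
  t->cap = t->nodes ? max_nodes : 0;
  return trie_new_node(t) == 0 ? 0 : -1;  /* node 0 is the root */
}

/* There is no per-key delete: the whole trie is released at once. */
void trie_free(TRIE *t) {
  free(t->nodes);
  t->nodes = NULL;
  t->used = t->cap = 0;
}

int trie_insert(TRIE *t, const char *key) {
  int cur = 0;
  for (const unsigned char *p = (const unsigned char *)key; *p; p++) {
    int *slot = &t->nodes[cur].next[*p];
    if (*slot < 0 && (*slot = trie_new_node(t)) < 0) return -1;
    cur = *slot;
  }
  t->nodes[cur].word_len = (int)strlen(key);
  return 0;
}

/* Exact keyword lookup; works with or without trie_prepare. */
int trie_search(const TRIE *t, const char *key) {
  int cur = 0;
  for (const unsigned char *p = (const unsigned char *)key; *p && cur >= 0; p++)
    cur = t->nodes[cur].next[*p];
  return cur >= 0 && t->nodes[cur].word_len > 0;
}

/* Breadth-first pass that (re)computes failure links for every node. */
int trie_prepare(TRIE *t) {
  int *queue = malloc((size_t)t->used * sizeof *queue);
  int head = 0, tail = 0;
  if (!queue) return -1;
  queue[tail++] = 0;
  while (head < tail) {
    int cur = queue[head++];
    for (int c = 0; c < 256; c++) {
      int child = t->nodes[cur].next[c];
      if (child < 0) continue;
      /* Follow the parent's failure chain to a node that can extend by c. */
      int f = t->nodes[cur].fail;
      while (f != 0 && t->nodes[f].next[c] < 0) f = t->nodes[f].fail;
      if (cur != 0 && t->nodes[f].next[c] >= 0) f = t->nodes[f].next[c];
      t->nodes[child].fail = f;
      t->nodes[child].match_len = t->nodes[child].word_len
          ? t->nodes[child].word_len : t->nodes[f].match_len;
      queue[tail++] = child;
    }
  }
  free(queue);
  return 0;
}

/* One pass over text; reports the earliest-ending match. Needs trie_prepare. */
int trie_scan(const TRIE *t, const char *text, size_t *start, size_t *len) {
  int state = 0;
  for (size_t i = 0; text[i]; i++) {
    int c = (unsigned char)text[i];
    while (state != 0 && t->nodes[state].next[c] < 0) state = t->nodes[state].fail;
    if (t->nodes[state].next[c] >= 0) state = t->nodes[state].next[c];
    if (t->nodes[state].match_len > 0) {
      *len = (size_t)t->nodes[state].match_len;
      *start = i + 1 - *len;
      return 1;
    }
  }
  return 0;
}
```

Calling trie_prepare again after further trie_insert calls recomputes every link from scratch, which matches the rule that a modified trie must be prepared again before it is used for pattern-set searches. The 256-entry child table per node keeps the sketch short at the cost of memory; a real implementation would likely use a more compact layout.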
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.