WL#1171: Unsafe optimization in Natural Language Full-Text Search

Affects: Server-7.1   —   Status: Un-Assigned

Implement unsafe optimization in NL FTS

It's both a speed optimization and an effectiveness optimization
(if we'll look at it as "dynamic automatic stopword list" where fixed stopword
list is empty)

Background:
20% of people drink 80% of beer.
The same is true for full-text index - 20% of the words occupy 80% of the space.
Most of the words in the index appear only in a few documents, and there are a
small set of words that appear in LOTS of documents.
On search, MySQL needs to scan and merge lists of documents for every word -
these very popular words take the most time, naturally.
But because the weight of the word is decreasing when its popularity is
increasing, the impact of these words on the total search result is very small.
"Unsafe optimization" skips the words whose impact is less than some threshold.
It's called "unsafe" because it changes the end result - the relevance - albeit
negligibly.

Math: words are distributed according to Zipf's law. It's easy to show that 20/80
ratio will happen at about 3000 distinct words in the collection. 10/90 split
happens at about 1e10 distinct words.

See also:
http://en.wikipedia.org/wiki/Zipf's_law
http://www.georgetown.edu/faculty/wilsong/IR/WD2.html