WL#1171: Unsafe optimization in Natural Language Full-Text Search
Affects: Server-7.1 — Status: Un-Assigned — Priority: Medium
Implement unsafe optimization in NL FTS It's both a speed optimization and an effectiveness optimization (if we'll look at it as "dynamic automatic stopword list" where fixed stopword list is empty) Background: 20% of people drink 80% of beer. The same is true for full-text index - 20% of the words occupy 80% of the space. Most of the words in the index appear only in a few documents, and there are a small set of words that appear in LOTS of documents. On search, MySQL needs to scan and merge lists of documents for every word - these very popular words take the most time, naturally. But because the weight of the word is decreasing when its popularity is increasing, the impact of these words on the total search result is very small. "Unsafe optimization" skips the words whose impact is less than some threshold. It's called "unsafe" because it changes the end result - the relevance - albeit negligibly. Math: words are distributed according to Zipf's law. It's easy to show that 20/80 ratio will happen at about 3000 distinct words in the collection. 10/90 split happens at about 1e10 distinct words. See also: http://en.wikipedia.org/wiki/Zipf's_law http://www.georgetown.edu/faculty/wilsong/IR/WD2.html
Copyright (c) 2000, 2015, Oracle Corporation and/or its affiliates. All rights reserved.