WL#2423: Stemming for full-text

Affects: Server-7.1   —   Status: Un-Assigned   —   Priority: Medium

A 'root' is the part of a word that is common to all    
inflected forms, an 'affix' is the part of a word that    
cannot stand alone but modifies the root (e.g. a    
prefix or suffix), and two words are 'stems' if they    
have the same root but different affixes. For example,    
"rain" is a root, "y" and "ing" and "ed" are affixes,    
so "rain" and "rainy" and "raining" and "rained" are stems.    
The ability to search for all the stems of "rainy"    
instead of just "rainy" is a feature named "stemming".    
It's available in PostgreSQL, Oracle, SQL Server, and    
DB2 fulltext. Stemming differs from thesaurus searching    
(which would return "pluvial"), from fuzzy searching    
(which might return "reign"), and from wildcarding    
(which would return "Rainier's").    
We will support stemming based on the rules of English,    
without using a dictionary. We promised stemming in the    
MySQL Reference Manual, section Full-Text Search TODO.    
We will preserve MySQL's MATCH ... AGAINST syntax for    
the time being, changing it is not part of this task.    
To request stemming, one adds    
in the AGAINST expression. The words STEMMED    
and ENGLISH are optional noise. For example:    
SELECT * FROM articles    
WHERE MATCH (title)    
We will not use the Oracle-like syntax    
CONTAINS (title, '$rainy',1) > 0    
We will be slightly closer to DB2 and SQL/MM syntax:    
CONTAINS (handle, 'FORM OF "rainy"') = 1    
The stemmed word ("rainy" in the example) may not contain    
white space or a wildcard character. Support for phrase    
searching with stems is something for the far future.    
The string may be in any MySQL-supported character set.    
Sergei Golubchik says that something like  
AGAINST ('$rainy')  
would be more compatible with the current syntax.  
Finding the root    
We can find a root with an "affix removal" algorithm.    
The main alternative is a "table lookup" algorithm,    
which requires a word list. We need word lists for    
thesaurus searching, and there will be another task    
for incorporating them in MySQL. But due to space    
worries we will defer table lookup and call it a    
version-7 non-default option.    
Affix removal (usually just suffix removal) algorithms:    
Dawson        1974    
Krovets       1993    
Lovins        1968    
Paice/Husk    1990    
Porter        1980    
Salton        1968    
If someone has a preference for another algorithm, fine,    
it's something to discuss. If not, we'll go with the    
"Porter stemming algorithm" (also known as the "Porter    
stemmer") because it's popular.    
The Porter stemmer removes affixes from English words.    
The number of vowels followed by consonants must be > 1    
(words which are very short probably have no affixes).    
Step 1: remove affixes for plurals and participles.    
Examples: "rained" to "rain", "rains" to "rain".    
Step 2: remove common suffixes.    
Example: "raininess" to "raini"    
Step 3: remove special word endings    
Example: "bountiful" to "bounti"    
Step 4: Repeat steps 1-3 until no more affixes appear.    
Step 5: If result ends in a vowel or doublet, possibly remove.    
Example: after stripping "kidnapped" to "kidnapp", remove a "p".    
Example: after stripping "raininess" to "raini", change to "rain".    
For the full C code, see the References section of this task.    
Algorithms like this are not particularly good if    
the variation is not done with affixes, e.g. "sang"/"sung".    
Since Oracle can handle "sang"/"sung" and "mouse"/"mice"    
and so on, it will look better.    
Indexing the roots    
Once you have the stemming code, what do you do with it?    
If we had a dictionary, we could say at search time:    
given 'rain', find 'rain' in the dictionary,    
which gives a list of stems. Then look up each stem.    
But we have no dictionary. Instead, we will store both    
root and stem in the index. So, given "raining", we    
figure out the root, then look up the root ("rain"),    
 which points to all rows containing any stem of    
"rain". But the word "raining" is also indexed,    
separately, because non-stem searching is the default.    
This suggestion would double the index size. An   
alternative, which works only if the stemming   
algorithm removes suffixes and nothing else,   
is to search for LIKE 'rain%'. Using this alternative,   
one could find 'paint' when searching for 'pain',   
but we can eliminate such errors by checking   
whether 'paint' and 'pain' are stems.    
First we'll need some tests to see what happens   
with relevance rankings -- they're affected    
because a root might appear with a different   
frequency than a stem.   
Initially we will support English only. Supporting    
other languages would of course be a nice thing,    
but should be a separate task.    
"Fulltext (formerly inverted index)"    
"Glossary of Linguistic Terms"    
- We use this for the definition of 'root', 'affix', and 'stem'.    
- an add-on which claims to use ispell affixes    
(MySQL's Izhevsk employees know about it)    
"Oracle Text"    
"The Porter Stemming Algorithm"    
"Snowball: a language for stemming algorithms"    
"stem ($)"    
- Notes on Oracle's stem operator 
"University of Neuchatel" 
- Links to stemmers and stopword lists for several languages 
WL#2428 Dictionary for fulltext