WL#2423: Stemming for full-text
Affects: Server-7.1
—
Status: Un-Assigned
A 'root' is the part of a word that is common to all
inflected forms, an 'affix' is the part of a word that
cannot stand alone but modifies the root (e.g. a
prefix or suffix), and two words are 'stems' if they
have the same root but different affixes. For example,
"rain" is a root, "y" and "ing" and "ed" are affixes,
so "rain" and "rainy" and "raining" and "rained" are stems.
The ability to search for all the stems of "rainy"
instead of just "rainy" is a feature named "stemming".
It's available in PostgreSQL, Oracle, SQL Server, and
DB2 fulltext. Stemming differs from thesaurus searching
(which would return "pluvial"), from fuzzy searching
(which might return "reign"), and from wildcarding
(which would return "Rainier's").
We will support stemming based on the rules of English,
without using a dictionary. We promised stemming in the
MySQL Reference Manual, section Full-Text Search TODO.
Syntax
------
We will preserve MySQL's MATCH ... AGAINST syntax for
the time being, changing it is not part of this task.
To request stemming, one adds
[STEMMED] FORM OF [ENGLISH]
in the AGAINST expression. The words STEMMED
and ENGLISH are optional noise. For example:
SELECT * FROM articles
WHERE MATCH (title)
AGAINST ('STEMMED FORM OF ENGLISH "rainy"');
We will not use the Oracle-like syntax
CONTAINS (title, '$rainy',1) > 0
We will be slightly closer to DB2 and SQL/MM syntax:
CONTAINS (handle, 'FORM OF "rainy"') = 1
The stemmed word ("rainy" in the example) may not contain
white space or a wildcard character. Support for phrase
searching with stems is something for the far future.
The string may be in any MySQL-supported character set.
Sergei Golubchik says that something like
AGAINST ('$rainy')
would be more compatible with the current syntax.
Finding the root
----------------
We can find a root with an "affix removal" algorithm.
The main alternative is a "table lookup" algorithm,
which requires a word list. We need word lists for
thesaurus searching, and there will be another task
for incorporating them in MySQL. But due to space
worries we will defer table lookup and call it a
version-7 non-default option.
Affix removal (usually just suffix removal) algorithms:
Dawson 1974
Krovets 1993
Lovins 1968
Paice/Husk 1990
Porter 1980
Salton 1968
If someone has a preference for another algorithm, fine,
it's something to discuss. If not, we'll go with the
"Porter stemming algorithm" (also known as the "Porter
stemmer") because it's popular.
The Porter stemmer removes affixes from English words.
The number of vowels followed by consonants must be > 1
(words which are very short probably have no affixes).
Step 1: remove affixes for plurals and participles.
Examples: "rained" to "rain", "rains" to "rain".
Step 2: remove common suffixes.
Example: "raininess" to "raini"
Step 3: remove special word endings
Example: "bountiful" to "bounti"
Step 4: Repeat steps 1-3 until no more affixes appear.
Step 5: If result ends in a vowel or doublet, possibly remove.
Example: after stripping "kidnapped" to "kidnapp", remove a "p".
Example: after stripping "raininess" to "raini", change to "rain".
For the full C code, see the References section of this task.
Algorithms like this are not particularly good if
the variation is not done with affixes, e.g. "sang"/"sung".
Since Oracle can handle "sang"/"sung" and "mouse"/"mice"
and so on, it will look better.
Indexing the roots
------------------
Once you have the stemming code, what do you do with it?
If we had a dictionary, we could say at search time:
given 'rain', find 'rain' in the dictionary,
which gives a list of stems. Then look up each stem.
But we have no dictionary. Instead, we will store both
root and stem in the index. So, given "raining", we
figure out the root, then look up the root ("rain"),
which points to all rows containing any stem of
"rain". But the word "raining" is also indexed,
separately, because non-stem searching is the default.
This suggestion would double the index size. An
alternative, which works only if the stemming
algorithm removes suffixes and nothing else,
is to search for LIKE 'rain%'. Using this alternative,
one could find 'paint' when searching for 'pain',
but we can eliminate such errors by checking
whether 'paint' and 'pain' are stems.
First we'll need some tests to see what happens
with relevance rankings -- they're affected
because a root might appear with a different
frequency than a stem.
Multi-lingual
-------------
Initially we will support English only. Supporting
other languages would of course be a nice thing,
but should be a separate task.
References
----------
"Fulltext (formerly inverted index)"
https://intranet.mysql.com/secure/mailarchive/mail.php?folder=4&mail=9754
"Glossary of Linguistic Terms"
http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/Index.htm
- We use this for the definition of 'root', 'affix', and 'stem'.
"mnoGoSearch"
mnogosearch.org
- an add-on which claims to use ispell affixes
(MySQL's Izhevsk employees know about it)
"Oracle Text"
http://www.oracle.com/technology/products/text/index.html
"The Porter Stemming Algorithm"
http://www.tartarus.org/~martin/PorterStemmer/
"Snowball: a language for stemming algorithms"
http://snowball.tartarus.org/texts/introduction.html
"stem ($)"
http://www.lc.leidenuniv.nl/awcourse/oracle/text.920/a96518/cqoper.htm#14499
- Notes on Oracle's stem operator
"University of Neuchatel"
http://www.unine.ch/info/clef/
- Links to stemmers and stopword lists for several languages
WL#2428 Dictionary for fulltext
Copyright (c) 2000, 2025, Oracle Corporation and/or its affiliates. All rights reserved.