WL#2423: Stemming for full-text
Affects: Server-7.1 — Status: Un-Assigned — Priority: Medium
A 'root' is the part of a word that is common to all inflected forms, an 'affix' is the part of a word that cannot stand alone but modifies the root (e.g. a prefix or suffix), and two words are 'stems' if they have the same root but different affixes. For example, "rain" is a root, "y" and "ing" and "ed" are affixes, so "rain" and "rainy" and "raining" and "rained" are stems. The ability to search for all the stems of "rainy" instead of just "rainy" is a feature named "stemming". It's available in PostgreSQL, Oracle, SQL Server, and DB2 fulltext. Stemming differs from thesaurus searching (which would return "pluvial"), from fuzzy searching (which might return "reign"), and from wildcarding (which would return "Rainier's"). We will support stemming based on the rules of English, without using a dictionary. We promised stemming in the MySQL Reference Manual, section Full-Text Search TODO. Syntax ------ We will preserve MySQL's MATCH ... AGAINST syntax for the time being, changing it is not part of this task. To request stemming, one adds [STEMMED] FORM OF [ENGLISH] in the AGAINST expression. The words STEMMED and ENGLISH are optional noise. For example: SELECT * FROM articles WHERE MATCH (title) AGAINST ('STEMMED FORM OF ENGLISH "rainy"'); We will not use the Oracle-like syntax CONTAINS (title, '$rainy',1) > 0 We will be slightly closer to DB2 and SQL/MM syntax: CONTAINS (handle, 'FORM OF "rainy"') = 1 The stemmed word ("rainy" in the example) may not contain white space or a wildcard character. Support for phrase searching with stems is something for the far future. The string may be in any MySQL-supported character set. Sergei Golubchik says that something like AGAINST ('$rainy') would be more compatible with the current syntax. Finding the root ---------------- We can find a root with an "affix removal" algorithm. The main alternative is a "table lookup" algorithm, which requires a word list. We need word lists for thesaurus searching, and there will be another task for incorporating them in MySQL. But due to space worries we will defer table lookup and call it a version-7 non-default option. Affix removal (usually just suffix removal) algorithms: Dawson 1974 Krovets 1993 Lovins 1968 Paice/Husk 1990 Porter 1980 Salton 1968 If someone has a preference for another algorithm, fine, it's something to discuss. If not, we'll go with the "Porter stemming algorithm" (also known as the "Porter stemmer") because it's popular. The Porter stemmer removes affixes from English words. The number of vowels followed by consonants must be > 1 (words which are very short probably have no affixes). Step 1: remove affixes for plurals and participles. Examples: "rained" to "rain", "rains" to "rain". Step 2: remove common suffixes. Example: "raininess" to "raini" Step 3: remove special word endings Example: "bountiful" to "bounti" Step 4: Repeat steps 1-3 until no more affixes appear. Step 5: If result ends in a vowel or doublet, possibly remove. Example: after stripping "kidnapped" to "kidnapp", remove a "p". Example: after stripping "raininess" to "raini", change to "rain". For the full C code, see the References section of this task. Algorithms like this are not particularly good if the variation is not done with affixes, e.g. "sang"/"sung". Since Oracle can handle "sang"/"sung" and "mouse"/"mice" and so on, it will look better. Indexing the roots ------------------ Once you have the stemming code, what do you do with it? If we had a dictionary, we could say at search time: given 'rain', find 'rain' in the dictionary, which gives a list of stems. Then look up each stem. But we have no dictionary. Instead, we will store both root and stem in the index. So, given "raining", we figure out the root, then look up the root ("rain"), which points to all rows containing any stem of "rain". But the word "raining" is also indexed, separately, because non-stem searching is the default. This suggestion would double the index size. An alternative, which works only if the stemming algorithm removes suffixes and nothing else, is to search for LIKE 'rain%'. Using this alternative, one could find 'paint' when searching for 'pain', but we can eliminate such errors by checking whether 'paint' and 'pain' are stems. First we'll need some tests to see what happens with relevance rankings -- they're affected because a root might appear with a different frequency than a stem. Multi-lingual ------------- Initially we will support English only. Supporting other languages would of course be a nice thing, but should be a separate task. References ---------- "Fulltext (formerly inverted index)" https://intranet.mysql.com/secure/mailarchive/mail.php?folder=4&mail=9754 "Glossary of Linguistic Terms" http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/Index.htm - We use this for the definition of 'root', 'affix', and 'stem'. "mnoGoSearch" mnogosearch.org - an add-on which claims to use ispell affixes (MySQL's Izhevsk employees know about it) "Oracle Text" http://www.oracle.com/technology/products/text/index.html "The Porter Stemming Algorithm" http://www.tartarus.org/~martin/PorterStemmer/ "Snowball: a language for stemming algorithms" http://snowball.tartarus.org/texts/introduction.html "stem ($)" http://www.lc.leidenuniv.nl/awcourse/oracle/text.920/a96518/cqoper.htm#14499 - Notes on Oracle's stem operator "University of Neuchatel" http://www.unine.ch/info/clef/ - Links to stemmers and stopword lists for several languages WL#2428 Dictionary for fulltext
Copyright (c) 2000, 2016, Oracle Corporation and/or its affiliates. All rights reserved.