WL#2428: Dictionary for fulltext

Affects: Server-7.1   —   Status: Un-Assigned   —   Priority: Very High

Supply an English dictionary with MySQL.   
With a dictionary we can have better support for   
full-text searches with stemming,    
and a thesaurus for "synonym expansion".   
   
Syntax   
------   
   
THESAURUS <thesaurus specification>   
EXPAND SYNONYM TERM OF <text literal>   
   
Actually the SQL/MM syntax is more like:   
THESAURUS <thesaurus specification>   
EXPAND   
{ SYNONYM | PREFERRED | RELATED | BROADER | NARROWER | TOP }   
etc.   
But I think SYNONYM alone is enough.   
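
To make the short form concrete, here is a minimal sketch of how the search string might be recognized. It is an illustration, not a proposed implementation: the function name parse_thesaurus_query is hypothetical, and it assumes that both the thesaurus specification and the text literal are written in double quotes.

```python
import re

# Hypothetical pattern for the short form only:
#   THESAURUS "<name>" EXPAND SYNONYM TERM OF "<term>"
# Assumption: thesaurus name and term are both double-quoted.
_THESAURUS_RE = re.compile(
    r'^\s*THESAURUS\s+"(?P<thesaurus>[^"]+)"\s+'
    r'EXPAND\s+SYNONYM\s+TERM\s+OF\s+"(?P<term>[^"]+)"\s*$',
    re.IGNORECASE,
)


def parse_thesaurus_query(search_string):
    """Return (thesaurus_name, term), or None if the string is a plain search."""
    match = _THESAURUS_RE.match(search_string)
    if match is None:
        return None
    return match.group("thesaurus"), match.group("term")
```

A string that does not use the THESAURUS form returns None, so it could fall through to the ordinary full-text parser.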
   
Example:   
SELECT * FROM articles   
WHERE MATCH (title,body)   
AGAINST   
(   
'THESAURUS "english" EXPAND SYNONYM TERM OF "rainy"'   
)   
   
This returns hits for "rainy", "wet", "showery",
and the phrase "abounding with rain". The term itself
is always included, i.e. "rainy" is a synonym for
"rainy" even if "rainy" isn't in the thesaurus.
   
If a row being searched contains any of the synonyms,
then it's a match. Occurrences of all synonyms are
counted for frequency ranking algorithms.
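
The matching rules above can be sketched as follows. This is only an illustration under stated assumptions: the in-memory THESAURUS dictionary and the function names are hypothetical, matching is case-insensitive on whole words or phrases, and the real ranking algorithm is not modelled, only the raw occurrence count that would feed it.

```python
import re

# Hypothetical in-memory thesaurus, using the "rainy" example.
THESAURUS = {
    "rainy": ["wet", "showery", "abounding with rain"],
}


def expand_synonyms(term, thesaurus=THESAURUS):
    """Return the term itself followed by its synonyms (self always included)."""
    term = term.lower()
    return [term] + [s for s in thesaurus.get(term, []) if s != term]


def count_hits(text, term, thesaurus=THESAURUS):
    """Count occurrences of the term or any of its synonyms in text."""
    lowered = text.lower()
    total = 0
    for candidate in expand_synonyms(term, thesaurus):
        # \b keeps "wet" from matching inside a longer word such as "wetland".
        pattern = r"\b" + re.escape(candidate) + r"\b"
        total += len(re.findall(pattern, lowered))
    return total


def is_match(text, term, thesaurus=THESAURUS):
    """A row matches if it contains the term or at least one synonym."""
    return count_hits(text, term, thesaurus) > 0
```

A term that is missing from the thesaurus still expands to itself, so a search for it behaves like an ordinary single-word search.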
   
You can't combine SYNONYM with STEMMED (WL#2423).   
   
The IBM syntax, which we won't use, is:   
SYNONYM FORM OF "rainy"   
or   
SYNONYM FORM OF ENGLISH "rainy"   
It's closer to the STEMMED proposal,   
WL#2423.   
But if ENGLISH is a fixed keyword, it's not extensible.   
   
The Oracle syntax, which we won't use, is:   
SYN(rain)   
   
Based on prior correspondence with Sergei Golubchik,
I believe he would prefer a symbol at the start of
the word, for example:
AGAINST ('}rainy')   
This suggestion needs to be considered. 
Jim Winstead sympathizes with Sergei Golubchik, 
but suggests that we could allow the "wordy" syntax 
with a different operator (CONTAINS?) or a different 
mode (IN WORDY_BOOLEAN MODE?). 
   
Source   
------   
   
Where do we get the words from?   
There are three possible sources:   
Roget's 1911 edition   
Webster's 1913 edition, via GNU   
WordNet (Princeton)   
I could find no dictionaries that are modern,
machine-readable, and in the public domain or under the GPL.
   
Here is an example of a Webster's/GNU entry, for "rainy":   
<p><hw>Rain"y</hw>   
<pr>(r<amac/n"<ycr/)</pr>,   
<pos>a.</pos>   
<ety>[AS. <ets>regenig</ets>.]</ety>   
<def>Abounding with rain; wet; showery;   
<as>as,   
<ex>rainy</ex> weather;   
a <ex>rainy</ex> day or season   
</as>.</def><br/   
[<source>1913 Webster</source>]</p>   
   
With some difficulty, we can see from this that the root   
is "Rain", that it's an adjective, and that the synonyms   
(taken from the definition) are:   
"abounding with rain",  "wet", "showery".   
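
As a rough sketch of that extraction, the following pulls the same fields out of the entry above. It is a heuristic that works for this one entry, not a GCIDE parser: the function name parse_entry is hypothetical, the root is assumed to be the part of the headword before the '"' mark (following the reading above), and each semicolon-separated fragment of the definition is assumed to be a synonym, which will not hold for every entry.

```python
import re

# The Webster's/GCIDE entry for "rainy", copied from the example above.
ENTRY = '''<p><hw>Rain"y</hw>
<pr>(r<amac/n"<ycr/)</pr>,
<pos>a.</pos>
<ety>[AS. <ets>regenig</ets>.]</ety>
<def>Abounding with rain; wet; showery;
<as>as,
<ex>rainy</ex> weather;
a <ex>rainy</ex> day or season
</as>.</def><br/
[<source>1913 Webster</source>]</p>'''


def parse_entry(entry):
    """Extract term, root, part of speech and candidate synonyms from an entry."""
    headword = re.search(r"<hw>(.*?)</hw>", entry, re.S).group(1)
    pos = re.search(r"<pos>(.*?)</pos>", entry, re.S).group(1)
    definition = re.search(r"<def>(.*?)</def>", entry, re.S).group(1)
    # Drop the <as>...</as> usage examples; they are not synonyms.
    definition = re.sub(r"<as>.*?</as>", "", definition, flags=re.S)
    # Assumption: the text before the '"' mark in the headword is the root.
    root = headword.split('"')[0]
    term = headword.replace('"', "").lower()
    # Assumption: each ';'-separated fragment of the definition is a synonym.
    fragments = [part.strip().strip(".").lower() for part in definition.split(";")]
    synonyms = [fragment for fragment in fragments if fragment]
    return {"term": term, "root": root, "pos": pos, "synonyms": synonyms}
```

Calling parse_entry(ENTRY) gives the term "rainy", the root "Rain", the part of speech "a." and the three synonyms listed above.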
   
Roget's might be okay for synonyms, but   
I didn't see how it could be useful for stemming too.   
   
WordNet (Princeton) contains only nouns, verbs, adjectives,
and adverbs (no pronouns). The license is generous but not GPL:
http://wordnet.princeton.edu/license
   
Table   
-----   
   
We make a MySQL database table, named   
"ENGLISH" -- the name that follows the word   
THESAURUS according to the proposed syntax.   
It contains (at least)   
Term        CHAR      /* marked up so 'root' is clear */   
Synonym     CHAR   
   
If it is too large to supply with the regular download,   
we will supply it separately.   
   
If somebody wants to add support for another language,   
they can add their own table. Provided it's in   
the mysql database, and has the same format as   
"ENGLISH", MySQL should be able to read it the   
same way that it reads "ENGLISH".   
   
This table is not read-only.   
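
A minimal sketch of the table and the lookup follows. It uses Python's sqlite3 module purely as a stand-in for a MySQL table, so that it can run on its own. The helper names are hypothetical, the Term column holds the plain word (the root markup is left out), and the thesaurus name from the query is assumed to map directly to a table name.

```python
import sqlite3


def _check_name(table):
    # The table name comes from the search string, so accept only plain
    # identifiers before putting it into SQL text.
    if not table.isidentifier():
        raise ValueError("invalid thesaurus table name: %r" % table)
    return table


def create_thesaurus(conn, table):
    """Create a thesaurus table with the two columns described above."""
    conn.execute('CREATE TABLE "%s" (Term CHAR, Synonym CHAR)' % _check_name(table))


def add_synonym(conn, table, term, synonym):
    """Add one (term, synonym) pair; the table is not read-only."""
    conn.execute(
        'INSERT INTO "%s" (Term, Synonym) VALUES (?, ?)' % _check_name(table),
        (term, synonym),
    )


def synonyms_of(conn, table, term):
    """Return the term itself plus every synonym stored for it."""
    rows = conn.execute(
        'SELECT Synonym FROM "%s" WHERE Term = ? ORDER BY rowid' % _check_name(table),
        (term,),
    )
    return [term] + [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_thesaurus(conn, "ENGLISH")
for synonym in ("wet", "showery", "abounding with rain"):
    add_synonym(conn, "ENGLISH", "rainy", synonym)
```

Because the lookup takes the table name as a parameter, a table for another language with the same two columns would be read in exactly the same way.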
   
References   
----------   
   
WL#2423 Stemming   
   
"Format of WordNet database files"   
http://wordnet.princeton.edu/man/wndb.5WN   
-- The WordNet (Princeton) word list
   
"Text Extender Administration and Programming"   
http://www-306.ibm.com/software/data/db2/extenders/text3/publications/v7_os400/h1267200.pdf
-- Source for the statement about IBM syntax
   
"dict.org resources"   
http://www.dict.org/links.html   
-- Some links to bilingual dictionaries   
   
"The letter r" from Webster's dictionary / GCIDE   
http://ftp.gnu.org/gnu/gcide/dictionary-0.43/uncompressed/cide.r   
see also http://ftp.gnu.org/gnu/gcide/dictionary-0.43/uncompressed/   
   
"Dictionaries for International Ispell"   
http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html   
-- A list of useless ispell dictionaries
   
"Roget's thesaurus"   
http://www.gutenberg.org/etext/10681   
-- The Gutenberg project also has Webster's dictionary   
at http://www.gutenberg.org/etext/673. See also promo.net/pg   
   
"SYNonym (SYN)"   
http://www.lc.leidenuniv.nl/awcourse/oracle/text.920/a96518/cqoper.htm#18628   
-- the Oracle SYN operator