WL#2428: Dictionary for fulltext
Affects: Server-7.1
—
Status: Un-Assigned
Supply an English dictionary with MySQL. With a dictionary we can have better support for full-text searches with stemming, and a thesaurus for "synonym expansion".

Syntax
------

  THESAURUS <thesaurus name> EXPAND SYNONYM TERM OF <term>

Actually the SQL/MM syntax is more like:

  THESAURUS <thesaurus name> EXPAND { SYNONYM | PREFERRED | RELATED | BROADER | NARROWER | TOP } etc.

But I think SYNONYM alone is enough.

Example:

  SELECT * FROM articles
  WHERE MATCH (title,body)
  AGAINST ( 'THESAURUS "english" EXPAND SYNONYM TERM OF "rainy"' )

This returns hits for "rainy", "wet", "showery", and the phrase "abounding with rain". A term always matches itself, i.e. "rain" is a synonym for "rain" even if "rain" isn't in the thesaurus. If the row being matched contains any synonym, then it's a match. All synonyms are counted by the frequency-based ranking algorithms. You can't combine SYNONYM with STEMMED (WL#2423).

The IBM syntax, which we won't use, is:

  SYNONYM FORM OF "rainy"

or

  SYNONYM FORM OF ENGLISH "rainy"

It's closer to the STEMMED proposal, WL#2423. But if ENGLISH is a fixed keyword, it's not extensible.

The Oracle syntax, which we won't use, is:

  SYN(rain)

From prior correspondence with Sergei Golubchik, I believe he would prefer a symbol at the start of the word, for example:

  AGAINST ('}rainy')

This suggestion needs to be considered.

Jim Winstead sympathizes with Sergei Golubchik, but suggests that we could allow the "wordy" syntax with a different operator (CONTAINS?) or a different mode (IN WORDY_BOOLEAN MODE?).

Source
------

Where do we get the words from? There are three possible sources:

  Roget's 1911 edition
  Webster's 1913 edition, via GNU
  WordNet (Princeton)

I could find no dictionaries that are modern, machine-readable, and in the public domain or under the GPL.

An example of a Webster's/GNU entry, for "rainy", appears at the end of this worklog entry. With some difficulty, we can see from it that the root is "Rain", that it's an adjective, and that the synonyms (taken from the definition) are: "abounding with rain", "wet", "showery".
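As a rough illustration of the matching rules described under Syntax, the Python sketch below expands a term into itself plus its synonyms, treats a row as a match if it contains any of them, and counts every occurrence for ranking purposes. The THESAURUS dict, the function names, and the regular-expression word matching are assumptions made for illustration; none of this is MySQL's actual full-text implementation.

```python
import re

# Hypothetical in-memory thesaurus. The entries are the synonyms derived
# from the Webster's definition of "rainy".
THESAURUS = {
    "rainy": ["abounding with rain", "wet", "showery"],
}

def expand_synonyms(term, thesaurus=THESAURUS):
    """Return the term itself followed by its synonyms.

    The term is always included, even when the thesaurus has no entry for it.
    """
    term = term.lower()
    return [term] + [s for s in thesaurus.get(term, []) if s != term]

def count_hits(text, term, thesaurus=THESAURUS):
    """Count whole-word or whole-phrase occurrences of the term or any synonym."""
    text = text.lower()
    total = 0
    for candidate in expand_synonyms(term, thesaurus):
        pattern = r"\b" + re.escape(candidate) + r"\b"
        total += len(re.findall(pattern, text))
    return total

def matches(text, term, thesaurus=THESAURUS):
    """A row matches if it contains the term or at least one synonym."""
    return count_hits(text, term, thesaurus) > 0
```

With this sketch, expand_synonyms("rain") yields only "rain" itself, while a row containing "wet" matches a search for "rainy".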
Roget's might be okay for synonyms, but I didn't see how it could be useful for stemming too. Princeton's WordNet contains only nouns, verbs, adjectives, and adverbs (no pronouns). The license is generous but not GPL: http://wordnet.princeton.edu/license

Table
-----

We make a MySQL database table, named "ENGLISH" -- the name that follows the word THESAURUS according to the proposed syntax. It contains (at least):

  Term CHAR      /* marked up so 'root' is clear */
  Synonym CHAR

If it is too large to supply with the regular download, we will supply it separately. If somebody wants to add support for another language, they can add their own table. Provided it's in the mysql database and has the same format as "ENGLISH", MySQL should be able to read it the same way that it reads "ENGLISH". This table is not read-only.

References
----------

WL#2423 Stemming

"Format of WordNet database files"
http://wordnet.princeton.edu/man/wndb.5WN
-- The Princeton list

"Text Extender Administration and Programming"
http://www-306.ibm.com/software/data/db2/extenders/text3/publications/v7_os400/h1267200.pdf
-- Source for the statement about IBM syntax

"dict.org resources"
http://www.dict.org/links.html
-- Some links to bilingual dictionaries

"The letter r" from Webster's dictionary / GCIDE
http://ftp.gnu.org/gnu/gcide/dictionary-0.43/uncompressed/cide.r
see also http://ftp.gnu.org/gnu/gcide/dictionary-0.43/uncompressed/

"Dictionaries for International Ispell"
http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html
-- A list of useless ispell dictionaries

"Roget's thesaurus"
http://www.gutenberg.org/etext/10681
-- The Gutenberg project also has Webster's dictionary at http://www.gutenberg.org/etext/673. See also promo.net/pg

"SYNonym (SYN)"
http://www.lc.leidenuniv.nl/awcourse/oracle/text.920/a96518/cqoper.htm#18628
-- the Oracle SYN operator
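The thesaurus table described in the Table section might look roughly like the following. The sketch uses Python's built-in sqlite3 module purely as a stand-in for a table in the mysql database; the column widths, the helper names, and the plain (unmarked) Term values are assumptions, since the worklog does not specify them.

```python
import sqlite3

def build_thesaurus(conn, table="ENGLISH"):
    """Create a two-column thesaurus table and load the "rainy" example rows.

    The table name is interpolated into the statement because it plays the
    role of the thesaurus name in the proposed syntax; only trusted names
    should be passed here.
    """
    conn.execute(f'CREATE TABLE "{table}" (term CHAR(64), synonym CHAR(64))')
    rows = [
        ("rainy", "abounding with rain"),
        ("rainy", "wet"),
        ("rainy", "showery"),
    ]
    conn.executemany(
        f'INSERT INTO "{table}" (term, synonym) VALUES (?, ?)', rows
    )

def lookup(conn, term, table="ENGLISH"):
    """Return the term itself plus every synonym stored for it."""
    cur = conn.execute(
        f'SELECT synonym FROM "{table}" WHERE term = ? ORDER BY rowid', (term,)
    )
    return [term] + [row[0] for row in cur]

conn = sqlite3.connect(":memory:")
build_thesaurus(conn)
```

Because lookup() takes the table name as a parameter, a table for another language with the same two columns would be read the same way as "ENGLISH".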
Webster's/GNU entry for "rainy" (the example referred to under Source):

  Rain"y (r , a. [AS. regenig.] Abounding with rain; wet; showery; as, rainy weather; a rainy day or season. [1913 Webster]
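The reading of the "rainy" entry described under Source (root, part of speech, synonyms taken from the definition) can be sketched in Python as follows. This is a minimal sketch for this one entry: the ENTRY string is a lightly cleaned copy of the text above, the abbreviation table is a hypothetical subset, and taking the text before the accent mark as the root is an assumption that happens to hold here, not a general rule for Webster's entries.

```python
import re

# Lightly cleaned copy of the Webster's/GNU entry (pronunciation left as found).
ENTRY = ('Rain"y (r , a. [AS. regenig.] Abounding with rain; wet; showery; '
         'as, rainy weather; a rainy day or season. [1913 Webster]')

# Hypothetical subset of part-of-speech abbreviations.
PARTS_OF_SPEECH = {"a.": "adjective", "n.": "noun", "v.": "verb", "adv.": "adverb"}

def parse_entry(entry):
    """Extract term, root, part of speech, and synonyms from one entry."""
    headword = entry.split(" ", 1)[0]        # 'Rain"y'
    root = headword.split('"', 1)[0]         # text before the accent mark

    # Look for a part-of-speech abbreviation before the etymology bracket.
    before_etymology = entry.split("[", 1)[0]
    pos = next(
        (name for abbr, name in PARTS_OF_SPEECH.items()
         if f" {abbr}" in before_etymology),
        None,
    )

    # The definition runs from the end of the etymology to the "as," examples.
    definition = entry.split("]", 1)[1]
    definition = re.split(r";\s*as,", definition, maxsplit=1)[0]
    synonyms = [s.strip().lower() for s in definition.split(";") if s.strip()]

    return {
        "term": headword.replace('"', "").lower(),
        "root": root,
        "pos": pos,
        "synonyms": synonyms,
    }
```

The result could then be turned into (Term, Synonym) rows for the "ENGLISH" table, one row per synonym.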
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.