WL#2476: Fuzzy operations for fulltext

Affects: Server-7.1   —   Status: Assigned

Support fuzzy full-text searches. 
Maybe support soundex full-text searches too. 
 
This is not in the Full-Text Search TODO. I added it 
because FUZZY FORM OF is an SQL/MM operation, which 
both DB2 and Oracle support. 
    
Syntax    
------    
    
As stated already for other worklog entries: 
We will preserve MySQL's MATCH ... AGAINST syntax for    
the time being, changing it is not part of this task.    
 
To request fuzziness, one adds    
FUZZY FORM OF     
in the AGAINST expression. For example:    
    
SELECT * FROM articles    
WHERE MATCH (title)    
AGAINST ('FUZZY FORM OF "rain"');    
 
Perhaps this should match "rail" or "raid" 
(one letter differing at end). 
 
The SQL/MM syntax, which we won't support, would be: 
CONTAINS (handle, 'FUZZY FORM OF "rain"') 
 
The DB2 syntax, which we won't support, would be: 
CONTAINS (handle, 'FUZZY FORM OF [n] "rain"') 
where n is a number that indicates how close a 
match must be, with 100 meaning "must be exactly 
the same". 
 
The Oracle syntax, which we won't support, would be: 
fuzzy(term, score, numresults, weight) 
The Oracle manual's example is: 
...CONTAINS(TEXT, 'fuzzy(government, 70, 6, weight)', 1) > 0; 
 
The Fuzzy operation is applicable only for words, 
not phrases. 
 
For all worklog fulltext tasks which propose some 
wordy syntax, there will probably be an alternative proposal 
that requires a single special character. For example: 
AGAINST ('?rain') 
Since ? was the fuzzy operator in an early Oracle 
version, there might be some people who recognize it. 
Indeed this would be more compatible with current 
syntax, but I think somebody has already observed 
that we'll run out of special characters someday. 
 
The criterion 
------------- 
 
SQL/MM doesn't state what makes FUZZY FORM OF 
true or false. The SQL/MM document merely says 
"The equivalence is language dependent and 
implementation dependent." 
 
So let's follow DB2. The DB2 documentation says: 
 
"A [fuzzy search] is a search for terms that have a similar 
spelling to the search term. This is particularly useful 
when searching in documents that were created by an Optical 
Character Recognition (OCR) program. Such documents often 
include misspelled words. For example, the word economy 
could be recognized by an OCR program as econony. Note that 
the first three characters must match ..." 
 
So, when comparing word in query to word in document: 
(1) Word length must be the same 
(2) First three letters must be the same 
(3) 90% of following letters must be the same. 
 
To arrive at "90%", I picked a number out of the air. 
Remember that DB2 allows you to specify the minimum 
score as part of the query. 
 
Oracle has more capability, and apparently has a different 
search algorithm that involves storing fuzzy forms in the 
index, but I think imitating DB2 is enough. 
 
Soundex 
------- 
 
If it's convenient, do Soundex at the same time as fuzzy. 
If it's not convenient, move Soundex to a separate task. 
 
The syntax for Soundex is: 
SOUNDS LIKE  
However, Oracle allows '!word'. 
 
MySQL supports SOUNDEX(expr) and expr SOUNDS LIKE expr 
as described in the String Functions section of the manual, 
http://dev.mysql.com/doc/mysql/en/string-functions.html. 
Doubtless one could use the same algorithm for full-text. 
 
References    
----------    
 
"[DB2] Net Search Extender Administration and User's Guide" 
http://safariexamples.informit.com/0131007726/PDFDOC/desf9e80.pdf 
 
"Oracle Text Reference: Indexing" 
http://www.stanford.edu/dept/itss/docs/oracle/9i/text.920/a96518/cdatadic.htm#CCREF0200 
 
"Fuzzy Lookup ... for SQL Server 2005" (not a full-text function description) 
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql90/html/FzDTSSQL05.asp 
 
WL#2423 Stemming for fulltext 
WL#2428 Dictionary for fulltext 
WL#526 Implement DIFFERENCE() function (easy)