WL#2476: Fuzzy operations for fulltext
Affects: Server-7.1 — Status: Assigned — Priority: Very High
Support fuzzy full-text searches. Maybe support soundex full-text searches too. This is not in the Full-Text Search TODO. I added it because FUZZY FORM OF is an SQL/MM operation, which both DB2 and Oracle support. Syntax ------ As stated already for other worklog entries: We will preserve MySQL's MATCH ... AGAINST syntax for the time being, changing it is not part of this task. To request fuzziness, one adds FUZZY FORM OF <word> in the AGAINST expression. For example: SELECT * FROM articles WHERE MATCH (title) AGAINST ('FUZZY FORM OF "rain"'); Perhaps this should match "rail" or "raid" (one letter differing at end). The SQL/MM syntax, which we won't support, would be: CONTAINS (handle, 'FUZZY FORM OF "rain"') The DB2 syntax, which we won't support, would be: CONTAINS (handle, 'FUZZY FORM OF [n] "rain"') where n is a number that indicates how close a match must be, with 100 meaning "must be exactly the same". The Oracle syntax, which we won't support, would be: fuzzy(term, score, numresults, weight) The Oracle manual's example is: ...CONTAINS(TEXT, 'fuzzy(government, 70, 6, weight)', 1) > 0; The Fuzzy operation is applicable only for words, not phrases. For all worklog fulltext tasks which propose some wordy syntax, there will probably be an alternative proposal that requires a single special character. For example: AGAINST ('?rain') Since ? was the fuzzy operator in an early Oracle version, there might be some people who recognize it. Indeed this would be more compatible with current syntax, but I think somebody has already observed that we'll run out of special characters someday. The criterion ------------- SQL/MM doesn't state what makes FUZZY FORM OF true or false. The SQL/MM document merely says "The equivalence is language dependent and implementation dependent." So let's follow DB2. The DB2 documentation says: "A [fuzzy search] is a search for terms that have a similar spelling to the search term. This is particularly useful when searching in documents that were created by an Optical Character Recognition (OCR) program. Such documents often include misspelled words. For example, the word economy could be recognized by an OCR program as econony. Note that the first three characters must match ..." So, when comparing word in query to word in document: (1) Word length must be the same (2) First three letters must be the same (3) 90% of following letters must be the same. To arrive at "90%", I picked a number out of the air. Remember that DB2 allows you to specify the minimum score as part of the query. Oracle has more capability, and apparently has a different search algorithm that involves storing fuzzy forms in the index, but I think imitating DB2 is enough. Soundex ------- If it's convenient, do Soundex at the same time as fuzzy. If it's not convenient, move Soundex to a separate task. The syntax for Soundex is: SOUNDS LIKE <word> However, Oracle allows '!word'. MySQL supports SOUNDEX(expr) and expr SOUNDS LIKE expr as described in the String Functions section of the manual, http://dev.mysql.com/doc/mysql/en/string-functions.html. Doubtless one could use the same algorithm for full-text. References ---------- "[DB2] Net Search Extender Administration and User's Guide" http://safariexamples.informit.com/0131007726/PDFDOC/desf9e80.pdf "Oracle Text Reference: Indexing" http://www.stanford.edu/dept/itss/docs/oracle/9i/text.920/a96518/cdatadic.htm#CCREF0200 "Fuzzy Lookup ... for SQL Server 2005" (not a full-text function description) http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql90/html/FzDTSSQL05.asp WL#2423 Stemming for fulltext WL#2428 Dictionary for fulltext WL#526 Implement DIFFERENCE() function (easy)
Copyright (c) 2000, 2015, Oracle Corporation and/or its affiliates. All rights reserved.