Documentation Home
MySQL 5.6 リファレンスマニュアル
Download this Manual
EPUB - 7.5Mb
HTML Download (TGZ) - 7.2Mb
HTML Download (Zip) - 7.2Mb


12.9.2 ブール全文検索

MySQL では、IN BOOLEAN MODE 修飾子を使用することでブール全文検索を実行できます。この修飾子を使用すると、検索文字列の先頭または末尾にある特定の文字が特別な意味を持ちます。次のクエリーでは、+ および - 演算子は、一致が発生するために単語が存在しなければならないことと、単語が存在してはならないことをそれぞれ示します。したがって、このクエリーでは、MySQL という単語は含まれるが、YourSQL という単語は含まれないすべての行が取得されます。

mysql> SELECT * FROM articles WHERE MATCH (title,body)
    AGAINST ('+MySQL -YourSQL' IN BOOLEAN MODE);
+----+-----------------------+-------------------------------------+
| id | title                 | body                                |
+----+-----------------------+-------------------------------------+
|  1 | MySQL Tutorial        | DBMS stands for DataBase ...        |
|  2 | How To Use MySQL Well | After you went through a ...        |
|  3 | Optimizing MySQL      | In this tutorial we will show ...   |
|  4 | 1001 MySQL Tricks     | 1. Never run mysqld as root. 2. ... |
|  6 | MySQL Security        | When configured properly, MySQL ... |
+----+-----------------------+-------------------------------------+
注記

この機能を実装すると、MySQL では暗黙的ブール論理とも呼ばれる次のようなものが使用されます。

  • +AND を表します

  • -NOT を表します

  • [演算子なし] は暗黙的に OR を表します

ブール全文検索には、次のような特徴があります。

  • 行は自動的に関連性の降順にソートされません。

  • InnoDB テーブルでブールクエリーを実行するには、MATCH() 式のすべてのカラム上に FULLTEXT インデックスが必要です。MyISAM 検索インデックスに対するブールクエリーは、FULLTEXT インデックスなしでも機能します。ただし、この方法で実行される検索の速度は、非常に遅くなります。

  • 単語の最小長および最大長の全文パラメータは、InnoDB 検索インデックスの場合は innodb_ft_min_token_size および innodb_ft_max_token_sizeMyISAM 検索インデックスの場合は ft_min_word_len および ft_max_word_len に適用されます。

  • ストップワードリストが適用されます。これらは、InnoDB 検索インデックスの場合は innodb_ft_enable_stopwordinnodb_ft_server_stopword_table、および innodb_ft_user_stopword_tableMyISAM 検索インデックスの場合は ft_stopword_file によって制御されます。

  • InnoDB の全文検索では、'++apple' の例と同様に、単一の検索単語で複数の演算子を使用するようサポートされていません。MyISAM の全文検索では、同じ検索が正常に処理され、検索単語に隣接する演算子を除くすべての演算子が無視されます。

  • InnoDB の全文検索では、先頭のプラス記号またはマイナス記号のみがサポートされています。たとえば、InnoDB では '+apple' がサポートされますが、'apple+' はサポートされていません。末尾にプラス記号またはマイナス記号を指定すると、InnoDB で構文エラーがレポートされます。

  • InnoDB の全文検索では、ワイルドカード ('+*') を使用した先頭のプラス記号、プラス記号とマイナス記号の組み合わせ ('+-')、または先頭のプラス記号とマイナス記号の組み合わせ ('+-apple') はサポートされていません。このような無効なクエリーでは、構文エラーが返されます。

  • MyISAM 検索インデックスに適用される 50% のしきい値は使用されません。

ブール全文検索機能では、次の演算子がサポートされています。

  • +

    先頭または末尾のプラス記号は、この単語が返される各行に存在しなければならないことを示します。InnoDB では、先頭のプラス記号のみがサポートされています。

  • -

    先頭または末尾のマイナス記号は、この単語が返される行のいずれにも存在してはならないことを示します。InnoDB では、先頭のマイナス記号のみがサポートされています。

    注: - 演算子は、本来ならほかの検索語句で一致が行われる行を除外することのみに使用します。したがって、- の前にある検索語句のみを含むブールモードの検索では、空の結果が返されます。除外された検索語句のいずれかを含む行を除いたすべての行が返されるわけではありません。

  • (演算子なし)

    デフォルトでは (+- のどちらも指定されてない場合)、この単語はオプションですが、それを含む行の評価は高くなります。これは、IN BOOLEAN MODE 修飾子なしの MATCH() ... AGAINST() の動作と似ています。

  • @distance

    この演算子は、InnoDB テーブルでのみ機能します。2 つ以上の単語がすべて、相互に指定された距離内で始まっているかどうかが単語単位でテストされます。@distance 演算子の直前に、二重引用符で囲まれた文字列内の検索単語を指定します (たとえば、MATCH(col1) AGAINST('"word1 word2 word3" @8' IN BOOLEAN MODE))。

  • > <

    これらの 2 つの演算子は、行に割り当てられた関連性の値への単語の貢献度を変更する際に使用されます。> 演算子は貢献度を上げ、< 演算子は貢献度を下げます。次のリストのあとに示す例を参照してください。

  • ( )

    丸括弧は、単語を部分式にグループ化します。丸括弧で囲まれたグループはネストできます。

  • ~

    先頭のチルダは否定演算子として機能するため、行の関連性への単語の貢献度がマイナスになります。これは、ノイズ単語にマークを付ける際に便利です。このような単語を含む行は、その他よりも低く評価されますが、- 演算子を使用した場合のように、完全に除外されることはありません。

  • *

    アスタリスクは、切り捨て (またはワイルドカード) 演算子として機能します。その他の演算子とは異なり、影響を受ける単語に追加されます。* 演算子の前の単語で始まれば、単語が一致します。

    切り捨て演算子を付けて単語が指定されている場合は、その単語が短すぎたり、ストップワードであったりしても、ブールクエリーから削除されません。単語が短すぎるかどうかは、InnoDB テーブルの場合は innodb_ft_min_token_size 設定、MyISAM テーブルの場合は ft_min_word_len によって判断されます。ワイルドカード単語は、1 つ以上の単語の先頭に存在しなければならないプリフィクスとみなされます。単語の最小長が 4 である場合は、'+word +the*' の検索では、2 番目のクエリーで短すぎる検索語句 the が無視されるため、'+word +the' の検索よりも少ない行が返される可能性があります。

  • "

    二重引用符 (") 文字内で囲まれたフレーズは、入力されたそのままのフレーズを含む行にのみ一致します。全文エンジンでは、フレーズが複数の単語に分割され、それらの単語の FULLTEXT インデックス内で検索が実行されます。単語以外の文字は、正確に一致する必要がありません。フレーズ検索では、そのフレーズとまったく同じ単語が同じ順序で一致に含まれることのみが必要です。たとえば、"test phrase""test, phrase" と一致します。

    フレーズにインデックス内にある単語が含まれない場合は、結果が空になります。単語がテキスト内に存在しない場合、ストップワードである場合、またはインデックス付きの単語の最小長よりも短い場合の組み合わせが原因で、単語がインデックス内に存在しない可能性があります。

次の例では、ブール全文演算子を使用する一部の検索文字列を実演します。

  • 'apple banana'

    2 つの単語の 1 つ以上を含む行を検索します。

  • '+apple +juice'

    両方の単語を含む行を検索します。

  • '+apple macintosh'

    単語 apple を含む行を検索しますが、行に macintosh も含まれる場合は行を高く評価されます。

  • '+apple -macintosh'

    単語 apple を含むが、macintosh は含まない行を検索します。

  • '+apple ~macintosh'

    単語 apple を含む行を検索しますが、行に単語 macintosh も含まれる場合は、行に含まれない場合よりも低く評価されます。これは、macintosh が存在すると、行がまったく返されない '+apple -macintosh' の検索よりもソフトです。

  • '+apple +(>turnover <strudel)'

    単語 appleturnover、または applestrudel (順序は不問) を含む行を検索しますが、apple turnoverapple strudel よりも高く評価します。

  • 'apple*'

    appleapplesapplesauceapplet などの単語を含む行を検索します。

  • '"some words"'

    some words とまったく同じフレーズを含む行を検索します (たとえば、some words of wisdom を含む行は検索しますが、some noise words は検索しません)。

    注記

    フレーズを囲む " 文字は、フレーズを区切る演算子文字です。検索文字列自体を囲む引用符ではありません。

InnoDB ブールモード検索の関連性ランキング

InnoDB の全文検索は、Sphinx の全文検索エンジンをモデルにし、使用されるアルゴリズムは、BM25 および TF-IDF のランキングアルゴリズムに基づいています。このような理由のため、InnoDB のブール全文検索の関連性ランキングは、MyISAM の関連性ランキングと異なる場合があります。

InnoDB では、term frequency-inverse document frequency (TF-IDF) 重み付けシステムの偏差を使用して、指定された全文検索クエリーのドキュメントの関連性にランクが付けられます。TF-IDF の重み付けは、ドキュメントで単語が出現する頻度に基づき、コレクション内のすべてのドキュメントで単語が出現する頻度によってオフセットされます。言い換えると、ある単語がドキュメントで出現する頻度が高くなるほど、その単語がドキュメントコレクションで出現する頻度が低くなり、ドキュメントのランクが高くなります。

関連性ランキングの計算方法

単語の出現頻度 (TF) 値は、単語がドキュメントで出現する回数です。単語の逆文書頻度 (IDF) 値は、次の公式を使用して計算されます。ここで、total_records はコレクション内のレコード数、matching_records は検索語句が表示されるレコード数です。

${IDF} = log10( ${total_records} / ${matching_records} )  

ドキュメントに単語が複数回含まれる場合は、IDF 値が TF 値で乗算されます。

${TF} * ${IDF}

TF および IDF 値を使用する場合は、ドキュメントの関連性ランキングが次の公式を使用して計算されます。

${rank} = ${TF} * ${IDF} * ${IDF}

公式については、次の例で実演されています。

単一単語検索の関連性ランキング

この例では、単一単語検索の関連性ランキングの計算を実演します。

mysql> CREATE TABLE articles (
id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
title VARCHAR(200),
body TEXT,
FULLTEXT (title,body)
) ENGINE=InnoDB;
Query OK, 0 rows affected (1.04 sec)
 
mysql> INSERT INTO articles (title,body) VALUES 
('MySQL Tutorial','This database tutorial ...'),
("How To Use MySQL",'After you went through a ...'),
('Optimizing Your Database','In this database tutorial ...'),
('MySQL vs. YourSQL','When comparing databases ...'),
('MySQL Security','When configured properly, MySQL ...'),
('Database, Database, Database','database database database'),
('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
('MySQL Full-Text Indexes', 'MySQL fulltext indexes use a ..');                    
Query OK, 8 rows affected (0.06 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> SELECT id, title, body, MATCH (title,body)  AGAINST ('database' IN BOOLEAN MODE) 
AS score FROM articles ORDER BY score DESC;
+----+------------------------------+-------------------------------------+---------------------+
| id | title                        | body                                | score               |
+----+------------------------------+-------------------------------------+---------------------+
|  6 | Database, Database, Database | database database database          |  1.0886961221694946 |
|  3 | Optimizing Your Database     | In this database tutorial ...       | 0.36289870738983154 |
|  1 | MySQL Tutorial               | This database tutorial ...          | 0.18144935369491577 |
|  2 | How To Use MySQL             | After you went through a ...        |                   0 |
|  4 | MySQL vs. YourSQL            | When comparing databases ...        |                   0 |
|  5 | MySQL Security               | When configured properly, MySQL ... |                   0 |
|  7 | 1001 MySQL Tricks            | 1. Never run mysqld as root. 2. ... |                   0 |
|  8 | MySQL Full-Text Indexes      | MySQL fulltext indexes use a ..     |                   0 |
+----+------------------------------+-------------------------------------+---------------------+
8 rows in set (0.00 sec)

合計で 8 つのレコードがあり、そのうち 3 つが database という検索語句に一致します。1 つ目のレコード (id 6) には、検索語句が 6 回含まれ、関連性ランキングは 1.0886961221694946 です。このランキング値は、6 の TF 値 (レコード id 6 には database という検索語句が 6 回出現します)、および次のように計算される 0.42596873216370745 の IDF 値 (ここで、8 はレコードの合計数、3 は検索語句が出現するレコードの数) を使用して計算されます。

${IDF} = log10( 8 / 3 ) = 0.42596873216370745

その後、TF および IDF 値はランキング公式に入力されます。

${rank} = ${TF} * ${IDF} * ${IDF}

MySQL コマンド行クライアントで計算を実行すると、1.088696164686938 のランキング値が返されます。

mysql> SELECT 6*log10(8/3)*log10(8/3);
+-------------------------+
| 6*log10(8/3)*log10(8/3) |
+-------------------------+
|       1.088696164686938 |
+-------------------------+
1 row in set (0.00 sec)
注記

SELECT ... MATCH ... AGAINST ステートメントと MySQL コマンド行クライアントで返されるランキング値 (1.08869612216949461.088696164686938) に、わずかな相違がある場合があります。この相違は、整数と浮動小数点/倍精度間のキャストが (関連する精度および丸めの決定とともに) InnoDB によって内部で実行される方法、およびその他の場所 (MySQL コマンド行クライアントやその他のタイプの計算機など) で実行される方法が原因で発生します。

複数単語検索の関連性ランキング

この例では、以前の例で使用された articles テーブルおよびデータに基づいて、複数単語の全文検索の関連性ランキングの計算を実演します。

複数の単語で検索する場合は、次の公式に示すように、関連性ランキングの値が各単語の関連性ランキングの合計になります。

${rank} = ${TF} * ${IDF} * ${IDF} + ${TF} * ${IDF} * ${IDF}

2 つの語句 ('mysql tutorial') で検索を実行すると、次の結果が返されます。

mysql> SELECT id, title, body, MATCH (title,body)  AGAINST ('mysql tutorial' IN BOOLEAN MODE) 
    AS score FROM articles ORDER BY score DESC;
+----+------------------------------+-------------------------------------+----------------------+
| id | title                        | body                                | score                |
+----+------------------------------+-------------------------------------+----------------------+
|  1 | MySQL Tutorial               | This database tutorial ...          |   0.7405621409416199 |
|  3 | Optimizing Your Database     | In this database tutorial ...       |   0.3624762296676636 |
|  5 | MySQL Security               | When configured properly, MySQL ... | 0.031219376251101494 |
|  8 | MySQL Full-Text Indexes      | MySQL fulltext indexes use a ..     | 0.031219376251101494 |
|  2 | How To Use MySQL             | After you went through a ...        | 0.015609688125550747 |
|  4 | MySQL vs. YourSQL            | When comparing databases ...        | 0.015609688125550747 |
|  7 | 1001 MySQL Tricks            | 1. Never run mysqld as root. 2. ... | 0.015609688125550747 |
|  6 | Database, Database, Database | database database database          |                    0 |
+----+------------------------------+-------------------------------------+----------------------+
8 rows in set (0.00 sec)

1 つ目のレコード (id 8) では、'mysql' が 1 回出現し、'tutorial' が 2 回出現します。'mysql' に一致するレコードは 6 つ、'tutorial' に一致するレコードは 2 つあります。MySQL コマンド行クライアントでは、これらの値を複数単語検索のランキング公式に挿入するときに、予期されるランキング値が返されます。

mysql> SELECT (1*log10(8/6)*log10(8/6)) + (2*log10(8/2)*log10(8/2));
+-------------------------------------------------------+
| (1*log10(8/6)*log10(8/6)) + (2*log10(8/2)*log10(8/2)) |
+-------------------------------------------------------+
|                                    0.7405621541938003 |
+-------------------------------------------------------+
1 row in set (0.00 sec)
注記

上記の例では、SELECT ... MATCH ... AGAINST ステートメントと MySQL コマンド行クライアントで返されるランキング値に、わずかな相違があることについて説明しました。


User Comments
  Posted by Jeff Smith on May 27, 2004
Keep in mind that although MATCH() AGAINST() is case-insensitive, it also is basically **accent-insensitive**. In other words, if you do not want _mangé_ to match with _mange_ (this example is in French), you have no choice but to use the BOOLEAN MODE with the double quote operator. This is the only way that MATCH() AGAINST() will make accent-sensitive matches.

E.g.:

SELECT * FROM quotes_table WHERE MATCH (quote) AGAINST ('"mangé"' IN BOOLEAN MODE)

For multiword searches:

SELECT * FROM quotes_table MATCH (quote) AGAINST ('"mangé" "pensé"' IN BOOLEAN MODE)

SELECT * FROM quotes_table MATCH (quote) AGAINST ('+"mangé" +"pensé"' IN BOOLEAN MODE)

Although the double quotes are intended to enable phrase searching, just like any web search engine for example, you can also use them to signify single words where accents and other diacritics matter.

The only drawback to this method seems to be that the asterisk operator is mutually exclusive with the double quote. Or I just haven't been able to combine both effectively.

  Posted by Rainer Typke on September 8, 2004
Be careful with the phrase search when short words are involved!
Words that are shorter than the minimum word length (by default, words with up to 3 characters) are sometimes taken into consideration when you search for phrases, but sometimes not!

Example 1:
A search for the phrase "the creation" will find all records that really contain this phrase, and only those. So, a record containing only "la creation du monde", even without the accent aigu on the e in creation, won't be found. This is just fine and what one would expect.

Example 2: A search for the phrase "let it be" won't find any record, not even records containing something like "The Beatles: Let It Be". According to the MySQL team, this is not a bug.
I personally find it very counterintuitive to sometimes take short words into consideration for phrase searches, but only if there is at least one properly long word in the search phrase.
  Posted by Markus Loponen on October 13, 2004
For those of you who interface MySQL with PHP and wonder what the problem is with getting "exact phrases" working properly, here's the way to go.

$query= "SELECT code, category, header, date FROM articles WHERE MATCH (text,header,summary) AGAINST ('" . stripslashes (str_replace ("&quot;", "\"", ($_POST['keywords']))) . "' IN BOOLEAN MODE)";

PHP, or some setups or with some browsers, convert double quotes from POST data to their HTML-equivalents even without being asked to do that. The above will fix the issue. Stripslashes() is optional, I prefer to keep it in to keep things looking clean, though the \" doesn't seem to break the boolean literal search.
  Posted by Rob Thorpe on November 14, 2004
It's also possible to create a prioritized boolean query with the following SQL:

SELECT id, text, MATCH (text) AGAINST ('word1 word2 word3' in boolean mode)
AS score FROM table1
WHERE MATCH (text) AGAINST ('word1 word2 word3' in boolean mode) order by score desc;

  Posted by Adam George on December 13, 2004
According to the last comment by Rob Thorpe it's possible to prioritize the boolean query like so:

SELECT id, text, MATCH (text) AGAINST ('word1 word2 word3' in boolean mode)
AS score FROM table1
WHERE MATCH (text) AGAINST ('word1 word2 word3' in boolean mode) order by score desc;

I tried this and it failed to work, i.e. all the scores turned out to be '1' even though the number of matches differed from record to record.
  Posted by Brad Satoris on December 13, 2004
Boolean searching has two deficiencies: 1) results are not sorted by relevance and; 2) no method by which to weigh certain columns. There is a way around both of these problems. For example, if I have a table of articles and want to weigh the title more heavily than the text, I can do the following:

SELECT *, ( (1.3 * (MATCH(title) AGAINST ('+term +term2' IN BOOLEAN MODE))) + (0.6 * (MATCH(text) AGAINST ('+term +term2' IN BOOLEAN MODE))) ) AS relevance FROM [table_name] WHERE ( MATCH(title,text) AGAINST ('+term +term2' IN BOOLEAN MODE) ) HAVING relevance > 0 ORDER BY relevance DESC;

Here we artificially manipulate the relevancy score to give title more weight by multiplying by the constant 1.3. In the above query, it doesn't seem to matter whether I have 3 fulltext indexes or just one comprising the title and text columns. From my testing, the results appear to be the same.
  Posted by Joe Laffey on January 5, 2005
In response to the note above Posted by Adam George on December 13 2004 7:32pm:

In my tests it would seem that the score returned is an integer equal to the number of words matched. So if you match on 3 words the scores will range from 1 to 3. If you match only on one word, or only one word is matched in any document, then the scores would all be 1.
  Posted by Richard on February 23, 2005
In response to Joe Laffey and Adam George:
To enhance sorting of the results in boolean mode you can do the following:

SELECT id, text, MATCH (text) AGAINST ('word1 word2 word3')
AS score FROM table1
WHERE MATCH (text) AGAINST ('+word1 +word2 +word3' in boolean mode) order by score desc;

Using the first MATCH() we get the score in non-boolean search mode (more distinctive). The second MATCH() ensures we really get back only the results we want (with all 3 words). If you want to do 'any of the words' search only, it's better to use non-boolean search instead (unless you are using boolean in order to get rid of 50% treshold limit).
  Posted by kim markegard on June 24, 2005
I'm not sure why MATCH/AGAINST uses a different scoring method when in boolean mode and when it's not. As stated above, if searching 3 terms in boolean mode, the score will be between 1 and 3 (integer). However, if not in boolean mode, the score is a floating point value.

It seems that non-boolean mode returns a "real" relevancy (based on how often each term was found I presume). In boolean mode it only returns how many terms were found. To me, this is not really relevancy. For instance, if searching on 2 terms, one result may have 20 occurrences of each term and another may have only 1 occurrence of each word, yet they will both return "2" as their relevance.
  Posted by Ben Allfree on September 1, 2005
This seems to work well for ranking relevance in boolean queries:

select products_id,match(products_model) against ('printer' ) as Relevance

from products

where match(products_model) against ('+"printer"' in boolean mode)
  Posted by Ferenc Fogarasi on October 19, 2005
Hi,
if You want to combine the phrase search with the *, simply search for the words separately and apply a having clause.

For example:
If You wish to find `bird cathcing`, `bird cathcer`, `bird cathers`

try this

SELECT column, MATCH( column ) AGAINST ('bird catch' IN BOOLEAN MODE) AS rank
FROM mytable
WHERE MATCH( column ) AGAINST ('bird catch' IN BOOLEAN MODE) > 0
HAVING column LIKE '%bird catch%'
ORDER BY rank DESC

I know the HAVING clause is slow, but it is only allpied to the rows that match the search criteira.

Have'n tested on big tables, but I have a feeling it works just fine.
  Posted by Robert Collins on November 8, 2005
This works for me so I get a score and the benefits of a boolean search. However, it's doing two different fulltext searches so it may slow things down a lot:

SELECT *, MATCH(post_content, post_title) AGAINST('string') AS `score` FROM posts MATCH(post_content, post_title) AGAINST('string' IN BOOLEAN MODE) ORDER BY score DESC LIMIT 10

The boolean mode after the from statement automatically pulls out the 0's and then the Match statement in the SELECT clause allows me to get the relevance score so I can sort by it.
  Posted by Martin Halford on December 2, 2005
Following on from Robert's comment, I've been playing around using his technique trying to get 'exact' and 'any word' searches to work. I've been having trouble with plurals, e.g. searching for 'anchor bolts' when the fulltext index includes 'anchors' and 'bolt'. This can be solved with the boolean part of the Match statement by trimming and searching for +anchor* +bolt*, but the non-boolean part of the Match statement for relevance is still a problem returning zeros in some instances. My 'fix' for this is to search for 'anchor anchors bolt bolts' in the non-boolean part (fairly easy to program in php), which seems to solve the problem. Any easier solutions such as like '%anchor%'?
  Posted by Justin Laing on October 4, 2006
I needed to be able to take a user search that might have words in it that are less then the min word length and return only results that contained all words.

What I did was break the query up into two sets of words, one set contained all the words that were >= ft_min_word_len, the other set contained all the shorter words. I did a fulltext search on the words that met the fulltext search length requirement and augmented it with an AND clause that used LIKE '%<shortword1>%'. MySQL uses the fulltext index to narrow down the results and then applies the LIKE conditions, so it stays fast.

This way you don't need to change your ft_min_word_len to a smaller number, which will make your indexes bigger.

Here's the regex I used to break things up:
\b(\w{4,})\b
where ft_min_word_len = 4, that will match all words of 4 or more letters.

Example
Search for the string "axle hub nut" will result in:
WHERE MATCH (col) AGAINST ("+axle*" IN BOOLEAN MODE) AND col LIKE '%hub%' AND col LIKE '%nut%'

  Posted by Joost on March 30, 2007
Fulltext boolean mode search returning (relevance):
SELECT MATCH (x) AGAINST ('word1 word2' IN BOOLEAN MODE) AS relevance
The returned relevance is 0,1 or 2. 0 = no match, 1 = one of the words is matched (word1 or word2), 2 = both word are matched.
When using eg. '+word1 word2 word3' it returns a floating point number (double) which is the relevance (all returned rows contain word1, some are more relevant (they return word2 and word3, others less..they contain word1 and (word2 or word3).. others even less..containing only word1 (relevance = 1).
  Posted by Lee Clemmer on August 21, 2008
Markus Loponen, awesome tip about the PHP statement, I was wondering why the quotes didn't seem to do the trick. Thanks for sharing!
  Posted by Sean Cannon on November 12, 2008
Here's a query I use to return relevance-based data while still using boolean mode searches, and using weights for different columns, and even weights for specific results:

SELECT id,
store,
name,
(((MATCH(name)
AGAINST(?) * 1.2) +
MATCH(description, keywords)
AGAINST (?) +
((MATCH(creators)
AGAINST (?) * 1.2)) / 3) +
(((store IN ('xfx','w3d')) * .12) * ((store IN ('iv', '3da', 'vp')) * .1))) *
MATCH(description,keywords,requirements,creators,name)
AGAINST(? IN BOOLEAN MODE)
AS sort_rel
FROM prod_text
WHERE MATCH(description,keywords,requirements,creators,name)
AGAINST(? IN BOOLEAN MODE)
AND store != 'pp'
HAVING sort_rel > 0.2
ORDER BY sort_rel DESC

All placeholders (?s) take the search terms entered by the user in this case.
  Posted by Sean Cannon on November 12, 2008
For the record, I also like to use three letter (and smaller) words, but I also understand the need to not index all of them (it would simply be absurd).

Instead of enabling a smaller size, what I do is this:

I preprocess against a list of three letter words I want to be matchable. For any row I find, I insert a different string into the indexed text the same number of times (this column need not actually be the one holding returned text mind you, it can be a duplicate column used only for indexing and searching).

When a user searches, I preprocess his or her search query in the same way, replacing any of the words I want matchable with the longer string.

For instance, if I wanted the search to be able to match 'GUN' I would search (using LIKE) for all rows matching and get back the results into a perlscript:

my $check_for_tlw_q = <<"EOF";
SELECT id, fulltext_indexed
FROM some_table
WHERE fulltext_displayed RLIKE '(^|[[:blank:]])GUN([[:blank:]]|$)'
AND fulltext_indexed NOT RLIKE '(^|[[:blank:]])TLWGUNTLW([[:blank:]]|$)'
EOF
my $check_for_tlw = $dbh->prepare($check_for_tlw_q);

my $update_tlw_q = <<<"EOF";
UPDATE some_table
SET fulltext_indexed = ?
WHERE id = ?
EOF
my $update_tlw = $dbh->prepare($update_tlw_q);

$check_for_tlw->execute;
while (my $row = $check_for_tlw->fetchrow_hashref) {
$row->{fulltext_indexed} =~ s/\bgun\b/TLWGUNTLW/igsm;
$update_tlw->execute($row->{fulltext_indexed}, $row->{id});
}

Then when a user searches for 'machine gun' the search is substituted to be 'machine TLWGUNTLW' behind the scenes. This has the effect of making 'gun' matchable without enabling words like 'the' and 'and' and 'les' and 'how' and 'brb' and thus not bogging down the index except where necessary.

  Posted by masi pay on February 20, 2009
Using > < can give some weight but if you have more than two words to match against like match(..) against('>frist second third') and you want to give weight, seems like there is no way. adding '>' will give you the higher weight to 'first' in this case but how to give second and third.. weight is not clear.

This is impportant specially if you have more than two fields to match against, field artist, field album, field producer, you want to list with artist then album then producer.

  Posted by Roel Van de Paar on October 13, 2009
Note that you can use myisam_ftdump utility to dump the fulltext index.
  Posted by Andrew Upton on March 10, 2010
Another possibility for returning results based on relevance is to construct your search table with a score/value already built in. This is not a perfect solution by any means but will certainly return results much quicker with extremely large tables and help to sort the results a little better.

1. Build your search table

CREATE TABLE my_search
(
id int(10),
category smallint(5),
score smallint(5),
text_data text
)
ENGINE=MyISAM
DEFAULT CHARSET=latin1
ROW_FORMAT=COMPRESSED;

2. Fill the table with data

Fill the search table with data giving a greater score value, to more important data.

For example with a product database:

Insert the product names, make and model into the "text_data" column and give it a score of 5 (or similarly higher score). Insert the product specification and give this a score of 3 (medium value) and then insert the product description and any other related text with a score of 1 (low value).

Therefore, in this example, we insert 3 rows for the three items of data with differing scores relating to the data's importance.

3. Create indexes

CREATE FULLTEXT INDEX idx_1 ON my_search (text_data);
CREATE INDEX idx_2 ON my_search (id);

You may wish to create indexes for the other columns, such as category, if they are included in your proposed SELECT statement.

4. Execute your SELECT

The following Select statement will return your search results sorted with matches against product name towards the top. If the match is against product name *and* specification then this has an even higher score (relevance) and appears higher in the list.

SELECT id, sum(score)
FROM my_search
WHERE MATCH(text_data) AGAINST ('search for this text' IN BOOLEAN MODE)
GROUP BY id
ORDER BY sum(score) DESC;

--

As I said at the beginning, this is not a perfectly accurate list by *real* computed relevance. However, this may provide help in some circumstances with extremely large databases (mine has 2.2 million rows).

  Posted by Richard Shea on May 20, 2010
I found the default settings of the server/database I was working within ...

SHOW VARIABLES LIKE 'collation%'

Variable_name Value
collation_connection utf8_unicode_ci
collation_database latin1_general_cs
collation_server latin1_swedish_ci

... meant the above example did not return both rows but instead only the row which contained the 'database' string.

I was able to resolve this by explicilty setting the character set/collation to one that that is case in-sensitive

CREATE TABLE articles(
id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY ,
title VARCHAR( 200 ) ,
body TEXT,
FULLTEXT (
title,
body
)
) DEFAULT CHARACTER SET latin1 COLLATE latin1_general_ci;

  Posted by Roman Partyka on March 18, 2011
I have noticed strange behavior. I have a table 'mytable' with two columns: id (int), article (text). Column 'article' has FULLTEXT index on it. I want to find articles containing both 'word1' and 'word2'. 'word1' is very common (90% of the articles contain it).

Straightforward query:
SELECT id, article FROM mytable WHERE MATCH(article) AGAINST ('+word1 +word2' IN BOOLEAN MODE)
does return result, but is very slow.

At the same time, another query
SELECT id, article FROM mytable WHERE MATCH(article) AGAINST ('+word1' IN BOOLEAN MODE) AND MATCH(article) AGAINST ('+word2' IN BOOLEAN MODE)
returns the same result, but could be 100 times faster...

This is true for version 4.1

  Posted by Micah Stevenson on October 12, 2012
Another option for using full text boolean searches w/php and getting "exact phrases" working is to use the native php function html_entity_decode() to reverse the html entities on the user input.

Or, you can use Markus' example. Don't forget to sanitize user input :-)
Sign Up Login You must be logged in to post a comment.