Skip navigation links

User Comments

Posted by John Navratil on May 10 2007 2:58pm[Delete] [Edit]

How I added '-' to the list of word characters:

The documentation is weak in two regards: (1) it doesn't explain how to modify the map and (2) it doesn't touch on the implications of doing so. I'll try to solve (1), but cannot begin to speak to (2)

The charsets files exist at the location specified by the "character_sets_dir" system variable (use SHOW VARIABLES to see this) and is typically compiled in as "/usr/share/mysql/charsets". The name of the file is given by the "character_set_...' variables. Typically the default is "latin1". Thus the file I needed to change was /usr/share/mysql/charsets/latin1.xml

The <ctype><map> is the one we are after (other maps are "upper", "lower", "unicode" and the various collation maps).

The "ctype" map differs from the others in that is has a leading 0x00 before the character map, the meaning of which is unclear to me. Each entry of the map appears to classify the corresponding character according to the following bitmask:

0x01 Upper-case word character
0x02 Lower-case word character
0x04 Decimal digit
0x08 Printer control (Space/TAB/VT/FF/CR)
0x10 Not-white, not a word
0x20 Control-char (0x00 - 0x1F)
0x40 Space
0x80 Hex digit (0-9, a-f, A-F)

In my case, I needed the dash '-', but nothing else, so I altered the corresponding character position (0x2D - third row, third from the right) from 0x10 (Not-white, not a word) to 0x01 (Upper-case word).

There is little on the web to address this, but some commentary in the forums suggested that this was NOT the way to do this, but rather to write ones own full-text engine as the changing of the <ctype> map has implications for the SQL parser. This may be true, but I suspect SQL parsing would require a stricter classification of characters. The SQL statement "SELECT a-b FROM test" worked for me after this change.

Altering latin1.xml and restarting the server had the desired result.

Finally, there does not appear to be a way to create a new character set or collation without recompiling. If this is true, it might be desirable for the standard distribution to include a "custom" character set for just this sort of thing.

Posted by Sebastien Salou on July 4 2007 2:20pm[Delete] [Edit]

Based on your example with the dash `-`, I had a look to make the single quote `'` (which is a word character by default), a non word character.
Thus,
I had a look on a ascii table, the single quote is corresponding to the hexadecimal value 27.
I opened the file share/mysql/charsets/latin1.xml, I went to the upper map (0x27 is actually on the 3rd rows, 8th col from the left).
I went to this position in the ctype map, and surprised !!! This character is already set to 0x10 Not-white, not a word whereas it is a word character during tests !

From there, I'm pretty lost. Why the single quote is not detected as a non word as it should be ?

Modifying the mysql source in myisam/ftdefs.h works.
I modified the line #define misc_word_char(X) ((X)=='\'')
Is it the only way ?

Sebastien Salou

Posted by Scott Noyes on November 19 2009 2:31pm[Delete] [Edit]

John Navratil very nearly has it, but to get fulltext to treat characters as words, you also have to add a new collation. I've written it up at http://www.thenoyes.com/littlenoise/?p=91