MySQL :: MySQL 8.0 Regular expressions and character properties

In MySQL 8.0 we have replaced the old regular expression library with the ICU regex library. See Martin’s blog on the topic. The main goal is to get full Unicode support for regular expressions, but in addition we get a lot of neat features. The documentation of regular expressions in MySQL 8.0 is found here, and the documentation of ICU regular expressions is found here. This blog will highlight some of the Unicode related features that are new to MySQL 8.0.

A sample table

We will in this blog work with a table called demo with the following content:

+------+-----------------------+
| v    | descr                 |
+------+-----------------------+
|     | Device Control 3      |
| A    | Latin A               |
| a    | Latin a               |
| А    | Cyrillic A            |
| 🍣     | Sushi Emoji           |
| 9    | Digit 9               |
| ⅲ    | Small Roman numeral 3 |
| 中   | CJK middle            |
| -    | Hyphen                |
|      | Non-breakable space   |
| ۳    | Arabic Three          |
| ३    | Devanagari 3          |
| अ    | Devanagari A          |
| パ   | Katakana PA           |
| ぱ   | Hiragana PA           |
+------+-----------------------+

+------+-----------------------+

| v | descr |

+------+-----------------------+

| | Device Control 3 |

| A | Latin A |

| a | Latin a |

| А | Cyrillic A |

| 🍣 | Sushi Emoji |

| 9 | Digit 9 |

| ⅲ | Small Roman numeral 3 |

| 中 | CJK middle |

| - | Hyphen |

| | Non-breakable space |

| ۳ | Arabic Three |

| ३ | Devanagari 3 |

| अ | Devanagari A |

| パ | Katakana PA |

| ぱ | Hiragana PA |

+------+-----------------------+

Character properties

Character properties (or character classes as they are called in the MySQL documentation) have been greatly extended in 8.0.

The main properties you can check for are 1) what Unicode calls general properties and 2) script names.

General properties are L (Letter), N (Numerals), C (Controls), Z (Separators), P (punctuation), S (Symbols) and M (Marking). All these have subcategories, like e.g. Lu (Uppercase letters), Ll (Lowercase letters) etc. See “General category” in “Unicode File Format” for an exhaustive list.

Properties can be checked with the patterns \p{property} or [:property:] while negative properties (not having the property) can be checked with the pattern \P{property} .

Example 1: List the letters in the demo table.

mysql> select * from demo where v regexp '[:L:]';
+------+--------------+
| v    | descr        |
+------+--------------+
| A    | Latin A      |
| a    | Latin a      |
| А    | Cyrillic A   |
| 中   | CJK middle   |
| अ    | Devanagari A |
| パ   | Katakana PA  |
| ぱ   | Hiragana PA  |
+------+--------------+
7 rows in set (0.00 sec)

mysql> select * from demo where v regexp '[:L:]';

+------+--------------+

| v | descr |

+------+--------------+

| A | Latin A |

| a | Latin a |

| А | Cyrillic A |

| 中 | CJK middle |

| अ | Devanagari A |

| パ | Katakana PA |

| ぱ | Hiragana PA |

+------+--------------+

7 rows in set (0.00 sec)

Example 2: List all the non-letters in the demo table.

mysql> select * from demo where v regexp '\\P{L}';
+------+-----------------------+
| v    | descr                 |
+------+-----------------------+
|     | Device Control 3      |
| 🍣     | Sushi Emoji           |
| 9    | Digit 9               |
| ⅲ    | Small Roman numeral 3 |
| -    | Hyphen                |
|      | Non-breakable space   |
| ۳    | Arabic Three          |
| ३    | Devanagari 3          |
+------+-----------------------+
8 rows in set (0.02 sec)

mysql> select * from demo where v regexp '\\P{L}';

+------+-----------------------+

| v | descr |

+------+-----------------------+

| | Device Control 3 |

| 🍣 | Sushi Emoji |

| 9 | Digit 9 |

| ⅲ | Small Roman numeral 3 |

| - | Hyphen |

| | Non-breakable space |

| ۳ | Arabic Three |

| ३ | Devanagari 3 |

+------+-----------------------+

8 rows in set (0.02 sec)

Scripts may be specified by the ISO-15924 code or a property value alias. Use the code or the alias as a property or more specific script=code or script=alias .

Example 3: List all Devanagari characters:

mysql> select * from demo where v regexp '\\p{script=Devanagari}';
+------+--------------+
| v    | descr        |
+------+--------------+
| ३    | Devanagari 3 |
| अ    | Devanagari A |
+------+--------------+
2 rows in set (0.00 sec)

mysql> select * from demo where v regexp '\\p{script=Devanagari}';

+------+--------------+

| v | descr |

+------+--------------+

| ३ | Devanagari 3 |

| अ | Devanagari A |

+------+--------------+

2 rows in set (0.00 sec)

Example 4: The Japanese use Kanji, Katakana and Hiragana as well as Romaji (Latin characters), let’s list those:

mysql> select * from demo where v regexp '[\\p{Hani}\\p{Kana}\\p{Hira}\\p{Latn}]';
+------+-----------------------+
| v    | descr                 |
+------+-----------------------+
| A    | Latin A               |
| a    | Latin a               |
| ⅲ    | Small Roman numeral 3 |
| 中   | CJK middle            |
| パ   | Katakana PA           |
| ぱ   | Hiragana PA           |
+------+-----------------------+
6 rows in set (0.01 sec)

mysql> select * from demo where v regexp '[\\p{Hani}\\p{Kana}\\p{Hira}\\p{Latn}]';

+------+-----------------------+

| v | descr |

+------+-----------------------+

| A | Latin A |

| a | Latin a |

| ⅲ | Small Roman numeral 3 |

| 中 | CJK middle |

| パ | Katakana PA |

| ぱ | Hiragana PA |

+------+-----------------------+

6 rows in set (0.01 sec)

Example 5: That’s all fine, but we want to exclude Roman numerals from the list. That can be done with the && operator like this (Numerals that are letters have the character category Nl):

mysql> select * from demo where v regexp '[\\p{Hani}\\p{Kana}\\p{Hira}\\p{Latn}&&\\P{Nl}]';
+------+-------------+
| v    | descr       |
+------+-------------+
| A    | Latin A     |
| a    | Latin a     |
| 中   | CJK middle  |
| パ   | Katakana PA |
| ぱ   | Hiragana PA |
+------+-------------+
5 rows in set (0.00 sec)

mysql> select * from demo where v regexp '[\\p{Hani}\\p{Kana}\\p{Hira}\\p{Latn}&&\\P{Nl}]';

+------+-------------+

| v | descr |

+------+-------------+

| A | Latin A |

| a | Latin a |

| 中 | CJK middle |

| パ | Katakana PA |

| ぱ | Hiragana PA |

+------+-------------+

5 rows in set (0.00 sec)

Example 6: And finally, while talking about numerals, we can also search for numerals with a certain numeric value:

mysql> select * from demo where v regexp '[\\p{numeric_value=3}]';
+------+-----------------------+
| v | descr                    |
+------+-----------------------+
| ⅲ | Small Roman numeral 3    |
| ۳ | Arabic Three             |
| ३ | Devanagari 3             |
+------+-----------------------+
3 rows in set (0.00 sec)

mysql> select * from demo where v regexp '[\\p{numeric_value=3}]';

+------+-----------------------+

| v | descr |

+------+-----------------------+

| ⅲ | Small Roman numeral 3 |

| ۳ | Arabic Three |

| ३ | Devanagari 3 |

+------+-----------------------+

3 rows in set (0.00 sec)

Character codepoint values.

Instead of characters, you may use Unicode codepoint values in patterns. Use \uxxxx for 16-bit values and \U00xxxxxx for 32-bit values:

mysql> select * from demo where v regexp '[\\U0001F363\\u0041]';
+------+-------------+
| v    | descr       |
+------+-------------+
| A    | Latin A     |
| a    | Latin a     |
| 🍣     | Sushi Emoji |
+------+-------------+
3 rows in set (0,00 sec)

mysql> select * from demo where v regexp '[\\U0001F363\\u0041]';

+------+-------------+

| v | descr |

+------+-------------+

| A | Latin A |

| a | Latin a |

| 🍣 | Sushi Emoji |

+------+-------------+

3 rows in set (0,00 sec)

Those codepoint values may be used in ranges too, so in order to find all characters that have a codepoint value outside BMP (have a code point value larger than 0xFFFF):

mysql> select v,descr from demo where v regexp '[\\U00010000-\\U0010ffff]';
+------+-------------+
| v    | descr       |
+------+-------------+
| 🍣     | Sushi Emoji |
+------+-------------+
1 row in set (0.02 sec)

mysql> select v,descr from demo where v regexp '[\\U00010000-\\U0010ffff]';

+------+-------------+

| v | descr |

+------+-------------+

| 🍣 | Sushi Emoji |

+------+-------------+

1 row in set (0.02 sec)

There are vast possibilities, so check out the MySQL and ICU/Unicode documentation on the topic and play with it.

Thanks for using MySQL!