In MySQL 8.0 we introduce the ICU library to handle our regular expression support. This library is maintained by the Unicode Consortium and is used on a wide variety on platforms. The reason we switch libraries is threefold: security, Unicode support and functionality. The ICU library offers excellent security by limiting both memory and cpu time. You can configure it to be more liberal, but by default the settings are rather conservative. The Unicode support, as one would expect, given its origins, is the best you could get. And finally the functionality it offers goes beyond simply a yes/no answer whether or not a string matches a regular expressions.
New functions
MySQL is now equipped with the following new functions:
REGEXP_LIKE
REGEXP_INSTR
REGEXP_REPLACE
REGEXP_SUBSTR
REGEXP_LIKE
simply takes a “subject” string, a pattern that gets compiled into a regular expression, and an optional parameter string, more on that later. The full signature is as follows:
REGEXP_LIKE( subject, pattern[, match_parameter ] )
In its simplest forms, you use it like this:
1
2
3
4
5
6
|
mysql> SELECT regexp_like( 'aba', 'b+' ) 1 mysql> SELECT regexp_like( 'aba', 'b{2}' ) 0 mysql> SELECT regexp_like( 'abba', 'b{2}' ) 1 |
Or even
1
2
3
4
|
mysql> SELECT regexp_like( '🍣🍣🍣🍺🍺', '🍣🍺' ); 1 mysql> SELECT regexp_like( '🍣🍣🍣🍺🍺', '🍺🍣' ); 0 |
As you can see, MySQL regular expression search can distinguish sushi from beer!
You will be right in thinking that this looks suspiciously similar to RLIKE
and REGEXP
. Those in-fix operators are still there, but they are now just syntactic sugar for REGEXP_LIKE
.
The match parameter
All regular expression functions have a common argument, which is always the last in the list: the match_parameter
is a set of flags that let you control how matching is done. It is similar to command line flags in that the flags are binary, processed left-to-right, and you can specify any number of them, but when they conflict, it’s the last one that counts. For REGEXP_LIKE
, the available flags are
-
c
meaning case-sensitive matching -
i
meaning case-insensitive matching -
m
Turn on multi-line matching. -
n
allows the.
to match newline. This is not the case by default. -
u
Recognize only Unix line endings.
Here are some examples of casing control:
1
2
3
4
5
6
|
mysql> SELECT regexp_like ( 'Abba', 'ABBA', 'i' ); 1 mysql> SELECT regexp_like ( 'Abba', 'ABBA', 'c' ); 0 mysql> SELECT regexp_like ( 'Abba', 'ABBA', 'ci' ); 1 |
And here are some of newline and multiline matching:
1
2
3
4
5
6
7
8
|
mysql> SELECT regexp_like ( 'a\nb\nc', '^b$', ); 0 mysql> SELECT regexp_like ( 'a\nb\nc', '^b$', 'm' ); 1 mysql> SELECT regexp_like ( 'a\nb\nc', '.b.' ); 0 mysql> SELECT regexp_like ( 'a\nb\nc', '.b.', 'n' ); 1 |
How to control case sensitivity
All regular expression functions take a subject and a pattern, obviously. They are typically a character type such as CHAR
, VARCHAR
, TEXT
or BLOB
(although MySQL will attempt to interpret a value of any type, as usual). The matching will search for the type of characters defined by the subject’s and pattern’s characters sets and collations. Like I already said, full Unicode is supported, for instance UTF-8 mb4. The collation defines case sensitivity. Hence we have three ways to define case sensitivity, and the order of precedence is
- The regular expression itself, i.e. the
(?i)
construction, meaning case-insensitive matching. - The
match_parameter
. - The collation.
REGEXP_INSTR
is very similar, but it doesn’t just return a 1 or 0. Rather, it gives you the position of the match, or 0 if there was none. Plus, it gives you a few more options. The full signature is:
REGEXP_INSTR( subject, pattern[, position[, occurrence[, return_option[, match_parameter ]]]] )
For instance, you can choose on which position you wish the search to start:
1
2
3
4
|
mysql> SELECT regexp_instr( 'abba', 'b{2}', 2 ); 2 mysql> SELECT regexp_instr( 'abba', 'b{2}', 3 ); 0 |
You can also choose which occurrence of the match it is that counts:
1
2
3
4
|
mysql> SELECT regexp_instr( 'abbabba', 'b{2}', 1, 2 ); 5 mysql> SELECT regexp_instr( 'abbabba', 'b{2}', 1, 3 ); 0 |
Finally, you can tweak what is actually returned:
1
2
3
4
|
mysql> SELECT regexp_instr( 'abbabba', 'b{2}', 1, 2, 0 ); 5 mysql> SELECT regexp_instr( 'abbabba', 'b{2}', 1, 3, 1 ); 7 |
In case you didn’t guess it already, 0 means that MySQL should return the first position of the match, while 1 means the position after the match.
REGEXP_REPLACE
does a full search-and-replace operation. The full signature is REGEXP_REPLACE( subject, pattern, replacement [, position[, occurrence[, match_parameter ]]] )
Security
It is possible to write regular expressions that, left to their own devices, would consume a lot of CPU time or memory, especially using computation-heavy constructions like back-references. This is a well-known fact. That’s why we offer two ways of controlling execution space/time. Say hello to regexp_stack_limit
and regexp_time_limit
. These integer-typed global variables make the regular expression engine stop after a limit has been reached, so it won’t hog the CPU indefinitely or exhaust the memory. Here are some examples.
1
2
3
|
mysql> SET GLOBAL regexp_time_limit = 1000; mysql> SELECT regexp_instr( 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC', '(A+)+B' ); ERROR 3699 (HY000): Timeout exceeded in regular expression match. |
1
2
3
|
mysql> SET GLOBAL regexp_stack_limit = 239; mysql> SELECT regexp_instr( '', '(((((((){120}){11}){11}){11}){80}){11}){4}' ); ERROR 3698 (HY000): Overflow in the regular expression backtrack stack. |
Happy searching, and as always, thank you for using MySQL!