MySQL Blog Archive
For the latest blogs go to blogs.oracle.com/mysql
New Regular Expression Functions in MySQL 8.0

In MySQL 8.0 we introduce the ICU library to handle our regular expression support. This library is maintained by the Unicode Consortium and is used on a wide variety on platforms. The reason we switch libraries is threefold: security, Unicode support and functionality. The ICU library offers excellent security by limiting both memory and cpu time. You can configure it to be more liberal, but by default the settings are rather conservative. The Unicode support, as one would expect, given its origins, is the best you could get. And finally the functionality it offers goes beyond simply a yes/no answer whether or not a string matches a regular expressions.

New functions

MySQL is now equipped with the following new functions:

  • REGEXP_LIKE
  • REGEXP_INSTR
  • REGEXP_REPLACE
  • REGEXP_SUBSTR

REGEXP_LIKE simply takes a “subject” string, a pattern that gets compiled into a regular expression, and an optional parameter string, more on that later. The full signature is as follows:

REGEXP_LIKE( subject, pattern[, match_parameter ] )

In its simplest forms, you use it like this:

Or even

As you can see, MySQL regular expression search can distinguish sushi from beer!

You will be right in thinking that this looks suspiciously similar to RLIKE and REGEXP. Those in-fix operators are still there, but they are now  just syntactic sugar for REGEXP_LIKE.

The match parameter

All regular expression functions have a common argument, which is always the last in the list: the match_parameter is a set of flags that let you control how matching is done. It is similar to command line flags in that the flags are binary, processed left-to-right, and you can specify any number of them, but when they conflict, it’s the last one that counts. For REGEXP_LIKE, the available flags are

  • c meaning case-sensitive matching
  • i meaning case-insensitive matching
  • m Turn on multi-line matching.
  • n allows the . to match newline. This is not the case by default.
  • u Recognize only Unix line endings.

Here are some examples of casing control:

And here are some of newline and multiline matching:

How to control case sensitivity

All regular expression functions take a subject and a pattern, obviously. They are typically a character type such as CHAR, VARCHAR, TEXT or BLOB (although MySQL will attempt to interpret a value of any type, as usual). The matching will search for the type of characters defined by the subject’s and pattern’s characters sets and collations. Like I already said, full Unicode is supported, for instance UTF-8 mb4. The collation defines case sensitivity. Hence we have three ways to define case sensitivity, and the order of precedence is

  1. The regular expression itself, i.e. the (?i) construction, meaning case-insensitive matching.
  2. The match_parameter.
  3. The collation.

REGEXP_INSTR is very similar, but it doesn’t just return a 1 or 0. Rather, it gives you the position of the match, or 0 if there was none. Plus, it gives you a few more options. The full signature is:

REGEXP_INSTR( subject, pattern[, position[, occurrence[, return_option[, match_parameter ]]]] )

For instance, you can choose on which position you wish the search to start:

You can also choose which occurrence of the match it is that counts:

Finally, you can tweak what is actually returned:

In case you didn’t guess it already, 0 means that MySQL should return the first position of the match, while 1 means the position after the match.

REGEXP_REPLACE does a full search-and-replace operation. The full signature is REGEXP_REPLACE( subject, pattern, replacement [, position[, occurrence[, match_parameter ]]] )

Security

It is possible to write regular expressions that, left to their own devices, would consume a lot of CPU time or memory, especially using computation-heavy constructions like back-references. This is a well-known fact. That’s why we offer two ways of controlling execution space/time. Say hello to regexp_stack_limit and regexp_time_limit. These integer-typed global variables make the regular expression engine stop after a limit has been reached, so it won’t hog the CPU indefinitely or exhaust the memory. Here are some examples.

 

Happy searching, and as always, thank you for using MySQL!