WL#8987: Add the ICU library to handle RLIKE/REGEXP.

Affects: Server-8.0 — Status: Complete

Description
Requirements
High Level Architecture
Low Level Design

There have been requests from customers about a REGEXP_REPLACE function. It is the #2 requested feature of all time, behind only CTEs: BUG#27389 - regular expression substitute. This bug is also duplicated in BUG#79428, BUG#29781, BUG#16357, BUG#9105

There are smaller requests for additional features:

BUG#29590 - Support for back-references (0 affected, 2 subscribed) BUG#8098 - REGEX match count (0 a, 2 s) BUG#70413 - \d is not working in REGEXP for a MySQL query (1a, 5s) BUG#68271 - allow for reverse regex (0a, 1s)

F-1: All regular expression functions must handle the full Unicode reportoire according to Unicode Technical Standard #18.

F-2: The new function REGEXP_LIKE will handle regular expression comparison.

F-3: The operators RLIKE and REGEXP will be synonyms for REGEXP_LIKE().

F-4: We will support build both with a bundled ICU library and with ICU installed in a system location through the build flag WITH_ICU=system|bundled.

F-5: Except when explicitly mentioned, the new behavior of RLIKE and REGEXP should follow the old behavior for ASCII strings.

F-6: The running time of regular expression matches is limited by a system variable.

F-7: The memory consumption of regular expression matches is limited by a system variable.

Functionality
- New Functions
- New System Variables
Security
- User Visible Security Features
- ICU Security Features
Compatibility with Old Library
Libraries

Functionality

New Functions

This work will add four new functions to MySQL. They are somewhat based on the Oracle regular expression functions. For all functions, indexes are 1-based, i.e. 1 is the first index.

REGEXP_INSTR ( subject, pattern [, position [, occurence [, match_parameter]]] ): Returns the index of the start of the substring of subject matching pattern, or 0 if no match was found. The following optional arguments are accepted:

position specifies the position in the string to start the search. Default is 1.
occurence specifies which occurence of a match should be searched for. Default is 1.
match_parameter is a string specifying how to do the matching:
- 'c' Means case sensitive matching.
- 'i' Means case insensitive matching.
- 'm' Means that line terminators can be recognized within the string. Default behavios is to match only at start and end of the input string.
- 'n' If set, '.' matches line terminators, otherwise '.' matching stops at line end.
- 'u' Unix-only line endings. When this mode is enabled, only \\u000a is recognized as a line ending in the behavior of ., ^, and $.

If contradictory options are specified, the last (rightmost) one takes precedence. Any or all may be specified.

REGEXP_LIKE ( subject, pattern [, match_parameter] ): Returns 1 if the subject string has a match for the pattern, otherwise 0. The following optional argument is accepted:

match_parameter, see REGEXP_INSTR.

REGEXP_REPLACE ( subject, pattern, replacement [, position [, occurence [, match_parameter]]] ): Replaces substrings matching the pattern with the replacement. The following optional arguments are accepted:

position specifies the position in the string to start the search. Default is 1.
occurence specifies which occurence of a match should be searched for. Default is 1.
match_parameter, see REGEXP_INSTR.

REGEXP_SUBSTR ( subject, pattern [, position [, occurence [, match_parameter]]] ): Returns the substring of the subject string matching the pattern. The following optional arguments are accepted:

position specifies the position in the string to start the search. Default is 1.
occurence specifies which occurence of a match should be searched for. Default is 1.
match_parameter, see REGEXP_INSTR.

New System Variables

These system variables are added:

regexp_stack_limit: See User Visible Security Features.
regexp_time_limit: See User Visible Security Features.

Security

User Visible Security Features

This work adds two system variables that can aid in avoiding DoS attacks:

regexp_stack_limit: Stack size limit for regular expressions matches
regexp_time_limit: Timeout for regular expressions matches, in steps of the match engine, typically on the order of milliseconds.

ICU Security Features

ICU offers the following ways to control the regular expression matching:

Maximum execution time: This is not expressed in seconds or milliseconds, but rather in "steps of the match engine". According to the API reference: (http://www.icu-project.org/apiref/icu4c/classicu_1_1RegexMatcher.html#a6ebcfcab4fe6a38678c0291643a03a00). This is what regexp_time_limit uses.

"The units of the limit are steps of the match engine. Correspondence with actual processor time will depend on the speed of the processor and the details of the specific pattern, but will typically be on the order of milliseconds."

Maximum memory consumption: ICU uses an internal stack during execution. The amount of memory is set in bytes. This is what regexp_stack_limit uses.

Callback during match operations: So far, a use for this callback has not been found, but it may fix bugs that the two above limits cannot.

Callback during find operations: Ditto.

Compatibility with Old Library

The Henry Spencer library does not have a rich documentation. This section will fill out as we find places where additional tweaks are necessary for backward compatibility.

Escaping

Both the H.S. library are able to consume up to three leading backslashes. However, in ICU they may collide with metacharacters (http://userguide.icu-project.org/strings/regexp#TOC-Regular-Expression-Metacharacters). The following example behaves the same in both ICU and HS. 'i' is not part of a metacharacter.

 mysql> SELECT 'i' rlike '\i' as 'one', 'i' rlike '\\i' as 'two ', 'i' rlike '\\\i' as 'three', 'i' rlike '\\\\i' as 'four';
 +-----+------+-------+------+
 | one | two  | three | four |
 +-----+------+-------+------+
 |   1 |    1 |     1 |    0 |
 +-----+------+-------+------+

ICU also differs in how escaping works in the subject string as compared to the pattern. In the subject string, only a single backslash may be used for escaping:

 mysql> SELECT regexp_like( 'a\nb', '(?m)^b' );
 +---------------------------------+
 | regexp_like( 'a\nb', '(?m)^b' ) |
 +---------------------------------+
 |                               1 |
 +---------------------------------+

 mysql> SELECT regexp_like( 'a\\nb', '(?m)^b' );
 +----------------------------------+
 | regexp_like( 'a\\nb', '(?m)^b' ) |
 +----------------------------------+
 |                                0 |
 +----------------------------------+

Different Modes

The Henry Spencer library has various modes that ICU lacks, e.g. basic vs. extended regular expression, not-end/beginning-of line, and others. But since these are not exposed to the SQL clients, they are ignored here.

Features Missing in ICU

There is one feature that ICU doesn't have, which the Henry Spencer library has: Start-of-word and end-of-word boundaries. This has been reported to ICU as a feature request with the ticket number 12916: http://bugs.icu-project.org/trac/ticket/12916.

What is Considered an Error?

The libraries differ in which regular expressions are considered erroneous. However, short of missing features, this turns out to be only in corner cases.

Cases Where the Henry Spencer Library Raises Error and ICU Doesn't

  +---------+-----------------------------------------------+
  | Pattern | Error                                         |
  +---------+-----------------------------------------------+
  | '^$'    | empty (sub)expression(only in Multiline mode) |
  | '|'     | empty (sub)expression                         |
  | 'a||b'  | -"-                                           |
  | '|ab'   | -"-                                           |
  | 'ab|'   | -"-                                           |
  | '(|a)b' | -"-                                           |
  | '^*'    | -"-                                           |
  | '^+'    | -"-                                           |
  | '^?'    | -"-                                           |
  | '^{n}'  | -"-                                           |
  +---------+-----------------------------------------------+

The Henry Spencer library has a configurable maximum repetition count, by default 255, i.e.

 mysql> SELECT '' rlike 'a{255}';
 +-------------------+
 | '' rlike 'a{255}' |
 +-------------------+
 |                 0 |
 +-------------------+
 1 row in set (0,00 sec)

 mysql> SELECT '' rlike 'a{256}';
 ERROR 1139 (42000): Got error 'invalid repetition count(s)' from regexp

ICU has no such feature, however "steps of the matching engine" can be limited, see ICU Security Features.

Cases Where ICU Raises Error and the Henry Spencer Library Doesn't

  +---------+-----------------------------------------------+
  | Pattern | Error                                         |
  +---------+-----------------------------------------------+
  | '{'     | U_REGEX_RULE_SYNTAX                           +
  | 'a{,n}' | U_REGEX_RULE_SYNTAX                           +
  | 'a{,}'  | U_REGEX_RULE_SYNTAX                           +
  +---------+-----------------------------------------------+

Libraries

ICU regular expressions needs the following libraries in order to work:

i18n: This is the regular expression engine.
uc: This is a common library for all ICU code.
data: This is not code, but information about character sets and collations. Most of it is likely not needed for regular expression matching.

= The Build System

Build Options
Removal of Henry Spencer Library

API Usages
Class diagram
Versions Supported

Build Options

The build scripts will get a new option WITH_ICU which can have the following options:

bundled: Means that we build a version of ICU that is included in our source distribution.

system: Means to use the system-installed libraries. Will obviously not work on systems without standardized package management, i.e. Windows, Mac OS.

path to library: This option allows to build the server with a path to a custom location for the ICU libraries. Feedback is needed from the build team whether this options should be supported on Windows. Otherwise the only supported option on Windows will be 'bundled'.

Removal of Henry Spencer Library

The Henry Spencer library will still remain in the source code as the rapid plugin and mysqltest binary link against it. However, it has been removed from INCLUDE_DIRECTORIES, so no other component of the source code can include it without explicitly adding it.

The Item_func_regex class has been completely removed.

API Usages

MySQL will use the C API exclusively, for the following reasons.

The C API is stable, meaning that users can upgrade their ICU libraries to a newer version than their OS distribution supplies. This is something we will encourage.
MySQL is built with C++11, while the ICU C++ headers use C++03. These options are not compatible on Sun Studio. Only C linkage can be used.
The C++ API takes UTF-8 strings, converting them internally to UTF-16. They are then converted back to UTF-8 upon return. With the C API we only convert to UTF-16 once, and the resulting UTF-16 strings can be handled by MySQL and only get converted when necessary.

Class diagram

The new functions will get a common base class in Item_regexp_func, and each function is represented by a subclass of this function.

Versions Supported

The ICU libraries that are part of package systems are compiled with the option U_ICU_ENTRY_POINT_RENAME, meaning that all entry points are extended with the major version number, e.g. uregex_start becomes uregex_start_55. In practice this means that MySQL can never load a different ICU library version than the one it was built for. This is perfectly normal and is handled by the package maintainers.