WL#353: Better REGEXP package

Affects: Server-Prototype Only   —   Status: Un-Assigned

We should try to replace the current REGEXP package (Henry Spencer's) with a
newer one because the current one is extremely slow when you search for
multiple alternative strings:

text REGEXP 'tom|david|monty'

Sergei: Not only because it's slow; in 4.1 the main reason is that it only works
with simple character sets: no multi-byte support, no complex (German) collation
rules.

The challenge is to find a compatible but faster REGEXP library that we can
modify to work with our character set handling...

Examples of multi-byte aware regex libraries: pcre, tre.

References:
- BUG#30241 Regular expression problems

Possible candidates:

From Matt (pcre):

In my experience, many of the people who really love regexp are also fairly
heavy users of Perl, so my first thought would be:
  http://www.pcre.org/

Perhaps that's a bit too big/heavy for our purposes though? 

Looks like it uses the standard BSD license, which AFAIK is good for us. 
Attached is the LICENCE (sic) file from the pcre-8.33.tar.gz file.


The README also notes how you can avoid the GPL by building pcretest against
libedit instead of libreadline:

"
. It is possible to compile pcretest so that it links with the libreadline
  or libedit libraries, by specifying, respectively,

  --enable-pcretest-libreadline or --enable-pcretest-libedit

  If this is done, when pcretest's input is from a terminal, it reads it using
  the readline() function. This provides line-editing and history facilities.
  Note that libreadline is GPL-licenced, so if you distribute a binary of
  pcretest linked in this way, there may be licensing issues. These can be
  avoided by linking with libedit (which has a BSD licence) instead.
"

More good news: it built very quickly and without a hitch on my OS X 10.8 laptop:

noob:pcre-8.33 matt$ make test
make  check-am
make
make  all-am
make[3]: Nothing to be done for `all-am'.
make  check-TESTS
PASS: pcrecpp_unittest
PASS: pcre_scanner_unittest
PASS: pcre_stringpiece_unittest
PASS: RunTest
PASS: RunGrepTest
make  all-am
make[5]: Nothing to be done for `all-am'.
============================================================================
Testsuite summary for PCRE 8.33
============================================================================
# TOTAL: 5
# PASS:  5
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0
============================================================================


So we know the source is not terribly Linux-centric, and it will hopefully not
require much work on our part to get it to build on all supported platforms.
Windows builds may require Cygwin though...

Objective:

To replace the current regex library with a new one, as the current one is slow
and doesn't have multi-byte support.
Make the regex engine pluggable, so that any library can be added as a plugin.

Multi-byte aware REGEX libraries:

PCRE        --> BSD license. Widely used. Used by Apache.
TRE         --> 2-clause BSD-like license. Supports approximate matching.
RE2         --> BSD license. Uses automata theory instead of backtracking.
                Very fast for large inputs. Useful in multithreaded environments
                where thread stacks cannot grow arbitrarily large. Developed and
                used by Google.
Oniguruma   --> BSD license. Comparable speeds to RE2 and PCRE. Looks promising.
ICU Regex   --> ICU license. Inferior speeds compared to most of the libraries
                above.
Boost.Regex --> Boost license. Pro: supports different locales. Con: has no
                native UTF-8 support; it must be combined with the ICU library
                to enable UTF support.

Conclusions from the above:

I have spoken to Tor regarding Boost.Regex. We cannot use it at the moment,
probably early next year.
TRE and ICU Regex are slow.
RE2 and PCRE seem like really good candidates.

For a general comparison of all the regex libraries, these are a few good
links:

http://lh3lh3.users.sourceforge.net/reb.shtml

http://softwareramblings.com/2008/08/c-regular-expression-performance.html

http://www.boost.org/doc/libs/1_41_0/libs/regex/doc/gcc-performance.html

https://github.com/ellzey/libevhtp/issues/31

A good comparison (in terms of speed) of PCRE and RE2 is detailed here.

http://swtch.com/~rsc/regexp/regexp3.html

At this point, I will be narrowing down my research to only PCRE and RE2.

Some notes on PCRE and RE2:

Charsets:

-> RE2 supports only UTF-8 and Latin-1. Collating elements and collation classes
   are not supported.
-> PCRE supports UTF-8, UTF-16 and UTF-32.
-> PCRE has complete Unicode property support, while RE2 has only partial support.

Since the structures used are different from what we currently use (CHARSET_INFO),
our collation rules for a particular collation (like case insensitivity, maybe
multi-line behaviour, etc.) need to be mapped so that the new library understands
them.

  A few approaches that can be taken:
  1) Convert everything to UTF-8 or a standard wide character set, send it to the
     new RE library and convert the result back. Slow and not-so-good approach.
  2) Check the rules of the specific collation and supply them as arguments. For
     example: utf8_general_ci is case-insensitive, so we can pass the
     PCRE_CASELESS compile option, and so on (see the sketch after this
     discussion).

  But method (2) may not be applicable to all collation rules, so we have to
  fall back to method (1) in those cases. This may end up hurting performance at
  times.

  A combination of the above two approaches is a pretty good and doable approach.
  
  But there is a better approach, with some limitations: method (3).

  3) Change PCRE/RE2 so that it complies with our charsets. But how feasible is
     this? Changing any of these libraries, following a similar approach to the
     one Bar took to adapt the original Henry Spencer regex to suit our needs, is
     possibly the best option, but it will require lots of code changes (and
     time). Also, the new libraries are much bigger than the current one, and a
     drawback of this would be that integrating future versions will become a
     pain.
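
  To illustrate method (2): a minimal sketch of mapping a collation to compile
  options against the classic PCRE 8.x C API. The helper name and the crude
  string check on the collation name are hypothetical; a real version would
  derive the flags from our CHARSET_INFO.

#include <pcre.h>
#include <cstring>
#include <cstdio>

/* Hypothetical helper: derive PCRE compile options from a collation name.
   A real version would inspect CHARSET_INFO instead of string matching. */
static int options_from_collation(const char *collation)
{
  int options = 0;
  if (strstr(collation, "utf8") != NULL)
    options |= PCRE_UTF8;       /* pattern and subject are UTF-8 */
  if (strstr(collation, "_ci") != NULL)
    options |= PCRE_CASELESS;   /* case-insensitive collation */
  return options;
}

int main()
{
  const char *error;
  int erroffset;
  int options = options_from_collation("utf8_general_ci");

  pcre *re = pcre_compile("tom|david|monty", options, &error, &erroffset, NULL);
  if (re == NULL)
  {
    fprintf(stderr, "compile failed at offset %d: %s\n", erroffset, error);
    return 1;
  }

  const char *subject = "Hello DAVID";
  int ovector[30];
  int rc = pcre_exec(re, NULL, subject, (int) strlen(subject), 0, 0, ovector, 30);
  printf("match: %s\n", rc >= 0 ? "yes" : "no");

  pcre_free(re);
  return 0;
}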

Continuing with the comparison between PCRE and RE2,

Development Status and Stability:

Both have stable versions released, and further development and enhancement is
ongoing.

Multithreading, global variables and locks:

PCRE uses no global variables except, the call to functions pcre_free, 
pcre_malloc and of the kind.
The compiled form of a regular expression is not altered during  matching, so 
the same compiled pattern 
can safely be used by several threads at once.

If the just-in-time (JIT) optimization feature is being used, it needs  separate  
memory stack areas for each thread. Here pthreads are used.
(I dont think we will be using the JIT feature so I didn't read up about that)

RE2 only uses global mutexes and calls to mutex-related functions.

Each of the two has its own malloc/free/realloc functions. I guess they can be
tweaked to suit our purposes.

Both libraries are thread safe.
Quoting from one of the links above: RE2's DFA is more thread-friendly.

In both cases, matching can run simultaneously in different threads with the same
compiled regular expressions. Only during compilation is some locking involved in
RE2, most of which is around memory allocation and deallocation. Matching on a
shared compiled pattern is illustrated by the sketch below.
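
A minimal sketch of that shared-pattern usage, using RE2 and std::thread; the
pattern and subjects are just illustrative:

#include <re2/re2.h>
#include <thread>
#include <vector>
#include <cstdio>

int main()
{
  /* One compiled pattern, built once; matching does not modify it, so the
     same object can safely be used by several threads at the same time. */
  RE2 re("tom|david|monty");

  const char *subjects[] = { "hello tom", "no match here",
                             "david was here", "monty python" };

  std::vector<std::thread> workers;
  for (const char *s : subjects)
  {
    workers.emplace_back([&re, s]() {
      bool found = RE2::PartialMatch(s, re);
      std::printf("%-16s -> %s\n", s, found ? "match" : "no match");
    });
  }
  for (auto &t : workers)
    t.join();
  return 0;
}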

Neither library uses setjmp()/longjmp(), as they slow things down, but PCRE does
use them when NO_RECURSE is set.

Memory Usage:

PCRE can use a large recursive stack and have exponential runtime on certain 
patterns. RE2 uses a fixed stack and guarantees that run-time increases linearly 
with the size of the input. The maximum memory allocated with RE2 can be 
configured if you have good knowledge of the workings of its code.
Google's RE2 has a slightly smaller set of features than PCRE, but has very 
predictable run-time and a maximum memory allotment. Making it suitable for use 
in server applications which require precise boundaries on memory usage and 
computational time. PCRE on the other hand, has almost all of the features that 
a regular expression library can have. But PCRE's run-time and memory usage is 
not predictable and can grow unbounded.

The compiled form of an RE2 is bigger than that of a PCRE object, a few 
kilobytes vs a few hundred bytes for a typical small regular expression. RE2 
does more analysis of the regexp during compilation and stores a richer form 
than PCRE. Thus RE2's compilation time is 3-4x that of PCRE.
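
For reference, RE2's memory ceiling and encoding are configurable through
RE2::Options. A minimal sketch; the 8 MB cap and the pattern are just example
values:

#include <re2/re2.h>
#include <cstdio>

int main()
{
  RE2::Options opt;
  opt.set_max_mem(8 << 20);         /* cap memory used by the compiled forms */
  opt.set_case_sensitive(false);    /* rough analogue of a _ci collation */
  opt.set_encoding(RE2::Options::EncodingLatin1);  /* UTF-8 (default) or Latin-1 */

  RE2 re("tom|david|monty", opt);
  if (!re.ok())
  {
    std::printf("compile error: %s\n", re.error().c_str());
    return 1;
  }
  std::printf("matches: %d\n", (int) RE2::PartialMatch("DAVID said hi", re));
  return 0;
}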

Error code:

Both of them have their own error codes, which can be tweaked to conform to
MySQL error code standards.

Signal and exception Handling:

As far as I have read, nothing much is done along these lines in the libraries.
We can handle it on the server side, just as we are doing now.

Pros and Cons:

-> RE2 doesn't support back references.
-> RE2 is slow compared to PCRE for small strings, but really fast for long
   strings.
-> RE2 is completely automata based and does everything in linear time; PCRE may
   end up taking exponential time.
-> With RE2 there is no need to worry about stack overflow, contrary to PCRE,
   where a minimum amount of stack space is always required (unless NO_RECURSE is
   set, which may slow things down).

Small tests performed by me:

I compiled PCRE with UTF support (--enable-utf) and RE2.
BUG#30241 seems to be solved.

Also, I wrote a small API which calls either of the two libraries based on
command-line arguments.

Some benchmark test results (performed by me):

--------------------------------------------------------------------------------
                                                   Time for 1,000,000 runs (seconds)
                                                      my_regex    |    PCRE
--------------------------------------------------------------------------------
RE:     '^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)'
String: 'http://www.linux.com/'                         6.95      |    3.03
--------------------------------------------------------------------------------
RE:     'usd [+-]?[0-9]+.[0-9][0-9]'
String: 'same same same'                                2.03      |    0.46
--------------------------------------------------------------------------------
RE:     '\b(\w+)(\s+\1)+\b'
String: 'http://www.thelinuxshow.com/main.php3'         0.17      |   23.06
--------------------------------------------------------------------------------
PCRE is really slow for the third regex regardless of the input string; RE2 takes
about 10.06 seconds.

To the best of my knowledge, I have done everything correctly, but I may have
made mistakes without realizing it.

ONIGURUMA:

I didn't research this much until it suddenly looked quite promising. It is used
by TextMate, Sublime Text and PHP.

The characteristic of this library is that a different character encoding can be
specified for every regular expression object (see the sketch below).
The code base and files are similar to what we have. Maybe this is modifiable?
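
A minimal sketch of that per-object encoding, assuming the standard Oniguruma C
API; the helper, patterns and encodings here are just illustrative:

#include <oniguruma.h>
#include <cstring>
#include <cstdio>

/* Hypothetical helper: compile one pattern with its own encoding. */
static OnigRegex compile_with_encoding(const char *pattern, OnigEncoding enc)
{
  OnigRegex reg;
  OnigErrorInfo einfo;
  int r = onig_new(&reg,
                   (const OnigUChar *) pattern,
                   (const OnigUChar *) pattern + strlen(pattern),
                   ONIG_OPTION_IGNORECASE,   /* e.g. for a _ci collation */
                   enc, ONIG_SYNTAX_DEFAULT, &einfo);
  return (r == ONIG_NORMAL) ? reg : NULL;
}

int main()
{
  /* Two regex objects, each with its own character encoding. */
  OnigRegex re_utf8   = compile_with_encoding("tom|david|monty", ONIG_ENCODING_UTF8);
  OnigRegex re_latin1 = compile_with_encoding("tom|david|monty", ONIG_ENCODING_ISO_8859_1);

  const char *subject = "DAVID was here";
  const OnigUChar *s = (const OnigUChar *) subject;
  const OnigUChar *e = s + strlen(subject);

  OnigRegion *region = onig_region_new();
  int pos = onig_search(re_utf8, s, e, s, e, region, ONIG_OPTION_NONE);
  printf("utf8 match at: %d\n", pos);   /* >= 0 means a match was found */

  onig_region_free(region, 1);
  onig_free(re_utf8);
  onig_free(re_latin1);
  onig_end();
  return 0;
}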

This is the ONIG character class.

http://pastebin.no.oracle.com/index.php/view/90543385

This too is a multi-byte library, it solves BUG#30241, and multithreading support
can be enabled; in that case no global variables are used.
It has almost the same set of features as PCRE and uses backtracking as well.
Error codes are again easily mappable. Memory allocation and deallocation are
similar to what we currently have, and can be tweaked.
There is not much you can find on Google about Onig compared to PCRE; there are
loads of documents/tests/comparisons about PCRE but not so much for Onig.

Making things pluggable:

What can be done here is: isolate the current regex/ library code and place it in
the plugins folder.
Make a generic interface on the server side, with one or maybe two generic
function calls.

This generic function call will speak to the required library (based on the .so
file included).
Now, for every new library we add as a plugin, a 'bridge' / interface has to be
written which will do the required communication from the server to the regex
engine. A sketch of such an interface follows below.

This bridge can either convert everything to UTF-8 (or a wide character set) by
calling the SQL string service functions, or maybe something else.

If we follow method (3) as mentioned above, every library has to be tweaked to
support our CHARSET_INFO structure. Then this bridge concept is not required.
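
A minimal sketch of what such a generic server-side interface could look like;
all names here (regex_engine, regexp_match, the PCRE bridge) are hypothetical,
not an existing MySQL plugin API:

#include <string>
#include <memory>

/* Hypothetical server-side interface; each regex library gets its own bridge. */
class regex_engine
{
public:
  virtual ~regex_engine() {}
  /* Compile a pattern for a given collation; return false on error. */
  virtual bool compile(const std::string &pattern, const char *collation) = 0;
  /* Return true if the already-compiled pattern matches the subject. */
  virtual bool match(const std::string &subject) = 0;
};

/* One bridge per library, e.g. a PCRE-backed one (bodies stubbed out here). */
class pcre_engine : public regex_engine
{
public:
  bool compile(const std::string &pattern, const char *collation) override
  {
    /* ... map the collation to PCRE options and call pcre_compile() ... */
    return true;
  }
  bool match(const std::string &subject) override
  {
    /* ... call pcre_exec() on the compiled pattern ... */
    return true;
  }
};

/* The server would pick the engine from whichever plugin (.so) is loaded;
   hard-coding the PCRE bridge keeps this sketch self-contained. */
static std::unique_ptr<regex_engine> make_engine()
{
  return std::unique_ptr<regex_engine>(new pcre_engine());
}

bool regexp_match(const std::string &subject, const std::string &pattern,
                  const char *collation)
{
  std::unique_ptr<regex_engine> engine = make_engine();
  return engine->compile(pattern, collation) && engine->match(subject);
}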
 
Conclusions:

In my opinion, since we will make it pluggable somehow (by somehow I mean either
charset conversion or writing an interface), let us first change the library to
Onig or PCRE, or maybe both, and then add support for the others. We will also
keep the one we have.
I don't think RE2 will be a great choice for us. The major advantages of RE2 are:
-> regexes which have '|'
-> linear time in the worst case and no stack overflow.

We can make a slight trade-off on point (1) and check for stack overflow as we
are doing now.
RE2 has cons as well, as mentioned above.

TODO:

Make calls to regexp_match multithreaded.