WL#353: Better REGEXP package
Affects: Server-Prototype Only
—
Status: Un-Assigned
We should try to replace the current REGEXP package (Henry Spencer's) with a newer one because the current one is extremely slow when you search after multiple strings: text REGEXP 'tom|david|monty' Sergei: Not only because it's slow, but in 4.1 the main reason is - because it only works with simple charsets, no multi-byte, no complex (german) collation rules. The challenge is to find a compatible but faster REGEXP library that we can modify to work with our character set handling... Examples of multi-byte aware regex libraries: pcre, tre. References: - BUG#30241 Regular expression problems
Possible candidates : From Matt (pcre) : In my experience, many of the people who really love regexp are also fairly heavy users of perl. So my first thought would be: http://www.pcre.org/ Perhaps that's a bit too big/heavy for our purposes though? Looks like it uses the standard BSD license, which AFAIK is good for us. Attached is the LICENCE (sic) file from the pcre-8.33.tar.gz file. They also note how you can avoid the GPL by building against libedit in the README file: " . It is possible to compile pcretest so that it links with the libreadline or libedit libraries, by specifying, respectively, --enable-pcretest-libreadline or --enable-pcretest-libedit If this is done, when pcretest's input is from a terminal, it reads it using the readline() function. This provides line-editing and history facilities. Note that libreadline is GPL-licenced, so if you distribute a binary of pcretest linked in this way, there may be licensing issues. These can be avoided by linking with libedit (which has a BSD licence) instead. " More good news, it built very quickly and w/o a hitch on my OS X 10.8 laptop: noob:pcre-8.33 matt$ make test make check-am make make all-am make[3]: Nothing to be done for `all-am'. make check-TESTS PASS: pcrecpp_unittest PASS: pcre_scanner_unittest PASS: pcre_stringpiece_unittest PASS: RunTest PASS: RunGrepTest make all-am make[5]: Nothing to be done for `all-am'. ============================================================================ Testsuite summary for PCRE 8.33 ============================================================================ # TOTAL: 5 # PASS: 5 # SKIP: 0 # XFAIL: 0 # FAIL: 0 # XPASS: 0 # ERROR: 0 ============================================================================ So we know it's not terribly linux-centric source, and it will hopefully not require much work on our part to get it to build on all supported platforms. Windows builds may require cygwin though...
Objective: To replace the current regex library with a new one as the current one is slow and doesnt have multi-byte support. Make the regex engine pluggable, so that any library can be added as a plugin. Multi-byte aware REGEX libraries: PCRE --> BSD license. Widely Used. Used by Apache. TRE --> 2-clause BSD-like license. Supports Approximate matching. RE2 --> BSD License. Uses automata theory instead of backtracking. very fast for large inputs. Useful in multithreaded environments where thread stacks cannot grow arbitrarily large. developed and used by Google. Oniguruma --> BSD license. Comparable speeds to RE2 and PCRE. Looks promising. ICU Regex --> ICU License. Inferior speeds compared to most of the libraries above. Boost:Regex-> Boost License. Pro: Supports different locales. Con: Has no native UTF-8 support.Must be used over ICU Regex library to enable UTF support. Conclusions from above. I have spoken to Tor regarding Boost:Regex. We cannot use it ATM, probably early next year. TRU, ICU Regex is slow. RE2, PCRE seem really good candidates. For general comparison among all the regex libraries, these are a few good links. http://lh3lh3.users.sourceforge.net/reb.shtml http://softwareramblings.com/2008/08/c-regular-expression-performance.html http://www.boost.org/doc/libs/1_41_0/libs/regex/doc/gcc-performance.html https://github.com/ellzey/libevhtp/issues/31 A good comparison (in terms of speed) of PCRE and RE2 is detailed here. http://swtch.com/~rsc/regexp/regexp3.html At this point, I will be narrowing down my research to only PCRE and RE2. Some notes on PCRE and RE2: Charsets: -> RE2 supports only UTF-8 and Latin-1.Collating elements and collation classes are not supported. -> PCRE supports UTF 8,16,32. -> PCRE has complete UNICODE property support, while RE2 has some support. Since the structures used are different from what we current use (CHARSET_INFO), mapping of our collation rules for a particular collation (like case insensitivity, may be multi-line etc) needs to be done so that the new library understands it. A few approaches that can be taken: 1)Convert everything to UTF-8 or a standard wide character set, send it to the new RE library and convert back the result. Slow and not-so-good approach. 2)Check out the rules in specific collations and supply them as arguments.For ex: utf_general_ci is case-insensitive. So we can pass in PCRE_CASEINSENSITIVE and so on. But method (2) may not be applicable to all collation rules. So we have to shift back to method (1), in those cases. This may end up deteriorating performance at times. Combination of the above 2 approaches is a pretty good and doable approach. But there is a better approach with some limitations. And thats method (3). 3)Change PCRE/RE2 so that it complies with our charset.But how feasible is this?Changing any of these libraries following a similar approach as Bar had taken to change original Henry Spencers Regex to suit our needs, is possibly the best option, but it will require lots of code changes (and time). Also the new libraries are much bigger than the current one and a drawback of this would be that integrating future versions will become a pain. Continuing with the comparison between PCRE and RE2, Development Status and Stability: Both have stable versions released and further development and enhancement is going on. Multithreading, global variables and locks: PCRE uses no global variables except, the call to functions pcre_free, pcre_malloc and of the kind. The compiled form of a regular expression is not altered during matching, so the same compiled pattern can safely be used by several threads at once. If the just-in-time (JIT) optimization feature is being used, it needs separate memory stack areas for each thread. Here pthreads are used. (I dont think we will be using the JIT feature so I didn't read up about that) RE2 only uses global mutexs and call to mutex related functions. Each of the 2 has their own malloc free realloc functions.I guess they can be tweaked to suit our purposes. Both the libraries are thread safe. Quoting from one of the links above: RE2's DFA is more thread friendly. In both the cases, matching can run simultaneosuly in different threads with the compiled regular expressions.Only during compilation some locking is involved in RE2, most of ehich is around memeory allocation and deallocation. Both the libraries do not use setjmp/longjmp() as they slow down things.But PCRE uses it when NO_RECURSE is it. Memory Usage: PCRE can use a large recursive stack and have exponential runtime on certain patterns. RE2 uses a fixed stack and guarantees that run-time increases linearly with the size of the input. The maximum memory allocated with RE2 can be configured if you have good knowledge of the workings of its code. Google's RE2 has a slightly smaller set of features than PCRE, but has very predictable run-time and a maximum memory allotment. Making it suitable for use in server applications which require precise boundaries on memory usage and computational time. PCRE on the other hand, has almost all of the features that a regular expression library can have. But PCRE's run-time and memory usage is not predictable and can grow unbounded. The compiled form of an RE2 is bigger than that of a PCRE object, a few kilobytes vs a few hundred bytes for a typical small regular expression. RE2 does more analysis of the regexp during compilation and stores a richer form than PCRE. Thus compilation time of RE2 is 3-4x more than PCRE. Error code: Both of them have their own error codes which can be tweaked to conform to mysql error code standards. Signal and exception Handling: As per my reading up, nothing much is done on these lines in the libraries.We can handle it on the server side, just as we are doing now. Pros and Cons: -> RE2 doesnt support back references. -> It is slow compared to PCRE for small strings. Real fast for long strings. -> It is completely automata based.Does everything in linear time. PCRE may end up taking exponential time. -> In case of RE2, no need to worry about stack overflow, contrary to PCRE where a minimum amount of stack space is always required (unless NO_RECURSE is set, which may slow down things). Small tests performed by me. I compiled PCRE with UTF support (--enable-utf) and RE2. The BUG#30241 seems to be solved. Also I wrote a small API which calls either of the 2 libraries based on command line arguments. Some BenchMark test Results (performed by me): BenchMark test Results: -------------------------------------------------------------------------------- ---------------- | Time taken for 1000000 runs (in seconds) | | my_regex | PCRE --------------------------------------------------| | | RE: '^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)' | 6.95 | 3.03 String: 'http://www.linux.com/' | | -------------------------------------------------------------------------------- ---------------- RE: 'usd [+-]?[0-9]+.[0-9][0-9]' | 2.03 | 0.46 String: 'same same same' | | -------------------------------------------------------------------------------- ---------------- RE: '\b(\w+)(\s+\1)+\b' | 0.17 | 23.06 String: 'http://www.thelinuxshow.com/main.php3' | | -------------------------------------------------------------------------------- ---------------- PCRE is real slow for the third regex under all strings.RE2 is about 10.06 seconds. To the best of my knowledge, I have done everything correctly, but unknowing mistakes may exist. ONIGURUMA: I didnt research much upon this unless suddenly it looked quite promising. USed by TextMate, SublimeText, PHP. The characteristics of this library is that different character encoding for every regular expression object can be specified. The code base and files are similar to what we have.May be this is modifiable? This is the ONIG character class. http://pastebin.no.oracle.com/index.php/view/90543385 This too is a multi-byte library, solves the BUG#30241 and multithreading support can be enabled.In that case no global variables are used. It has almost the same set of features than PCRE and uses backtracking as well. Error code, again easily mappable. Memory allocation adn deallocation is similar to what we have currently, and can be tweaked. There is not much you can find in google about Onig, compared to PCRE. There are loads of documents/ test/ comparisons about PCRE but not so much for oNIG. Making things pluggable: What can be done here is, isolate the current library regex/ code and place it in the plugins folder. Make a generic interface on the server side, with one or may be two generic function calls. This generic function call will speak to the required library (based on the .so file included). Now, for every new library we add as a plugin, a 'bridge' / interface has to be written which will do the required communication from server to the regex engine. Now this bridge, can either convert everything to UTF-8 (or a wide character set) by calling the sql string service functions, or amy be something else. If we follow method (3) as mentioned above, every library has to be tweaked to support our CHARSET_INFO structure.Then this bridge concept is not required. Conclusins: In my opinion, since we will make it pluggable somehow (by somehow I mean, either charset conversion or writing an interface), let us first change the library to ONIG or PCRE or may be both and then add support to the other.We will keep the one we have also. I dont think RE2 will be great choice to us.The major advantages of RE2 being -> Regex which have '|' -> Linear time in worst case and no stack overflow. We can do a slight trade off on point number (1) and check for stack overflow as we are doing. RE2 has cons also as mentioned above. TODO: Make calls to regexp_match multithreaded.
Copyright (c) 2000, 2024, Oracle Corporation and/or its affiliates. All rights reserved.