SourceForge.net Logo

Performance comparison of regular expression engines

Introduction

Processing text or raw byte-sequence are among the common tasks performed by most software tools. These tasks usually involve pattern matching algorithms, and the most popular tool for such purpose is regular expressions. Regular expressions has been evolved a lot since Kleene defined the regular sets in the 1950's. Today we have several widely used regular expression engines which have different features which makes any performance comparison a difficult task, since a faster engine is not necessary better. Depending on the use case it might be enough to know whether a POSIX compatible regular expression matches to a line, even the position of the match is unneeded (grep utility). On the contrary other use cases require the position of capturing brackets, unicode support, conditional and atomic block (handling a byte sequence as a single character, like 'sch' in German language) support just to name a few. The former case needs a less sophisticated algorithm, which is likely be much faster than the latter, but again, that does not mean the former is better. More about these engine types can be found here.

Participants

The following popular engines were choosen: Before anyone jump to any conclusions, I should note the followings:
  • The engines were not fine tuned (because of my lack of knowledge about their internal workings). I just compiled them with the default options. I know enabling or disabling some features can heavily affect the results. If you feel that you have a better configuration just drop me an e-mail and I will update the results (hzmester(at)freemail(dot)hu).
  • The regular expression engines are compiled with -O3 to allow the best performance.
  • This comparison page was inspired by the work of John Maddock (See his own regex comparison here). The input is also the same he used before: mtent12.zip. It is a text file (e-book) which size is about 20 Mbytes.
  • Only common patterns are selected, they are not pathological cases nor have any PERL specific features. The comparison was caseful.

Results

x86-64 bit mode on an Intel Xeon 2.67GHz (GCC 4.4.5, Linux)

Regular expressionPCREPCRE
-DFA
TREOnig-
uruma
RE2PCRE
-JIT
Twain30 ms40 ms570 ms20 ms10 ms20 ms
^Twain160 ms190 ms330 ms20 ms60 ms20 ms
Twain$20 ms20 ms580 ms20 ms10 ms10 ms
Huck[a-zA-Z]+|Finn[a-zA-Z]+40 ms30 ms930 ms50 ms70 ms30 ms
a[^x]{20}b100 ms680 ms1040 ms540 ms740 ms60 ms
Tom|Sawyer|Huckleberry|Finn50 ms40 ms1510 ms60 ms70 ms30 ms
.{0,3}(Tom|Sawyer|Huckleberry|Finn)7430 ms9140 ms6590 ms240 ms70 ms730 ms
[a-zA-Z]+ing1510 ms2560 ms980 ms1860 ms90 ms360 ms
^[a-zA-Z]{0,4}ing[^a-zA-Z]200 ms250 ms500 ms50 ms70 ms50 ms
[a-zA-Z]+ing$1590 ms2700 ms970 ms1900 ms70 ms370 ms
^[a-zA-Z ]{5,}$260 ms520 ms660 ms420 ms80 ms80 ms
^.{16,20}$250 ms420 ms590 ms960 ms70 ms60 ms
([a-f](.[d-m].){0,2}[h-n]){2}1010 ms1510 ms1560 ms810 ms80 ms180 ms
([A-Za-z]awyer|[A-Za-z]inn)[^a-zA-Z]1130 ms1710 ms1540 ms290 ms70 ms190 ms
"[^"]{0,30}[?!\.]"40 ms70 ms650 ms100 ms20 ms30 ms
Tom.{10,25}river|river.{10,25}Tom90 ms120 ms1070 ms110 ms70 ms40 ms

x86-32 bit mode on an Intel Xeon 2.67GHz (GCC 4.4.5, Linux)

Regular expressionPCREPCRE
-DFA
TREOnig-
uruma
RE2PCRE
-JIT
Twain20 ms20 ms580 ms20 ms20 ms20 ms
^Twain200 ms220 ms340 ms20 ms60 ms20 ms
Twain$20 ms20 ms620 ms20 ms10 ms20 ms
Huck[a-zA-Z]+|Finn[a-zA-Z]+20 ms40 ms840 ms40 ms60 ms20 ms
a[^x]{20}b120 ms720 ms1100 ms560 ms940 ms80 ms
Tom|Sawyer|Huckleberry|Finn40 ms60 ms1320 ms60 ms60 ms20 ms
.{0,3}(Tom|Sawyer|Huckleberry|Finn)9180 ms8900 ms6120 ms220 ms60 ms760 ms
[a-zA-Z]+ing1760 ms2740 ms1080 ms1540 ms80 ms340 ms
^[a-zA-Z]{0,4}ing[^a-zA-Z]240 ms280 ms520 ms40 ms60 ms60 ms
[a-zA-Z]+ing$1860 ms2880 ms1060 ms1580 ms60 ms340 ms
^[a-zA-Z ]{5,}$320 ms580 ms680 ms540 ms80 ms100 ms
^.{16,20}$280 ms480 ms620 ms740 ms60 ms80 ms
([a-f](.[d-m].){0,2}[h-n]){2}1240 ms1620 ms1540 ms880 ms80 ms180 ms
([A-Za-z]awyer|[A-Za-z]inn)[^a-zA-Z]1480 ms1780 ms1580 ms300 ms60 ms180 ms
"[^"]{0,30}[?!\.]"40 ms60 ms680 ms100 ms10 ms20 ms
Tom.{10,25}river|river.{10,25}Tom80 ms120 ms920 ms100 ms60 ms40 ms

x86-64 bit mode on an Intel Core2 Duo 1.86GHz (GCC 4.4.7, Windows)

Regular expressionPCREPCRE
-DFA
TREOnig-
uruma
RE2PCRE
-JIT
Twain31 ms31 ms1029 ms15 ms15 ms15 ms
^Twain249 ms312 ms405 ms15 ms249 ms15 ms
Twain$31 ms31 ms1014 ms16 ms16 ms15 ms
Huck[a-zA-Z]+|Finn[a-zA-Z]+47 ms47 ms2231 ms78 ms249 ms46 ms
a[^x]{20}b140 ms936 ms1684 ms795 ms1435 ms78 ms
Tom|Sawyer|Huckleberry|Finn78 ms78 ms3479 ms93 ms265 ms47 ms
.{0,3}(Tom|Sawyer|Huckleberry|Finn)12433 ms13712 ms10499 ms296 ms265 ms1201 ms
[a-zA-Z]+ing2293 ms4071 ms1545 ms2168 ms359 ms670 ms
^[a-zA-Z]{0,4}ing[^a-zA-Z]312 ms405 ms686 ms62 ms265 ms78 ms
[a-zA-Z]+ing$2418 ms4274 ms1528 ms2230 ms265 ms733 ms
^[a-zA-Z ]{5,}$405 ms764 ms889 ms702 ms297 ms125 ms
^.{16,20}$374 ms608 ms873 ms1216 ms265 ms93 ms
([a-f](.[d-m].){0,2}[h-n]){2}1544 ms2044 ms1934 ms1233 ms343 ms265 ms
([A-Za-z]awyer|[A-Za-z]inn)[^a-zA-Z]1950 ms2808 ms2137 ms405 ms265 ms265 ms
"[^"]{0,30}[?!\.]"62 ms94 ms1342 ms125 ms46 ms31 ms
Tom.{10,25}river|river.{10,25}Tom125 ms187 ms2106 ms156 ms250 ms62 ms

The testing environment can be downloaded from here. Just extract, type make. Download the mtent12.zip, rename the mtent12.txt inside to mark.txt and run runtest.

Last modification: 20.8.2013