SourceForge.net Logo

Performance comparison of regular expression engines

Introduction

Processing text or raw byte-sequence are among the common tasks performed by most software tools. These tasks usually involve pattern matching algorithms, and the most popular tool for such purpose is regular expressions. Regular expressions has been evolved a lot since Kleene defined the regular sets in the 1950's. Today we have several widely used regular expression engines which have different features which makes any performance comparison a difficult task, since a faster engine is not necessary better. Depending on the use case it might be enough to know whether a POSIX compatible regular expression matches to a line, even the position of the match is unneeded (grep utility). On the contrary other use cases require the position of capturing brackets, unicode support, conditional and atomic block (handling a byte sequence as a single character, like 'sch' in German language) support just to name a few. The former case needs a less sophisticated algorithm, which is likely be much faster than the latter, but again, that does not mean the former is better. More about these engine types can be found here.

Participants

The following popular engines were choosen: Before anyone jump to any conclusions, I should note the followings:
  • The engines were not fine tuned (because of my lack of knowledge about their internal workings). I just compiled them with the default options. I know enabling or disabling some features can heavily affect the results. If you feel that you have a better configuration just drop me an e-mail and I will update the results (hzmester(at)freemail(dot)hu).
  • The regular expression engines are compiled with -O3 to allow the best performance.
  • This comparison page was inspired by the work of John Maddock (See his own regex comparison here). The input is also the same he used before: mtent12.zip. It is a text file (e-book) which size is about 20 Mbytes.
  • Only common patterns are selected, they are not pathological cases nor have any PERL specific features. The comparison was caseful.

Results

x86-64 bit mode on an Intel Xeon 2.67GHz (GCC 4.8.2, Linux)

Regular expressionPCREPCRE
-DFA
TREOnig-
uruma
RE2PCRE
-JIT
Twain18 ms20 ms492 ms16 ms3 ms16 ms
^Twain148 ms168 ms274 ms16 ms59 ms16 ms
Twain$15 ms17 ms504 ms16 ms2 ms16 ms
Huck[a-zA-Z]+|Finn[a-zA-Z]+29 ms30 ms708 ms48 ms59 ms20 ms
a[^x]{20}b91 ms606 ms947 ms466 ms553 ms58 ms
Tom|Sawyer|Huckleberry|Finn40 ms42 ms1179 ms57 ms61 ms41 ms
.{0,3}(Tom|Sawyer|Huckleberry|Finn)6543 ms6962 ms5011 ms185 ms67 ms669 ms
[a-zA-Z]+ing1360 ms2194 ms952 ms1374 ms129 ms314 ms
^[a-zA-Z]{0,4}ing[^a-zA-Z]185 ms220 ms407 ms37 ms60 ms43 ms
[a-zA-Z]+ing$1428 ms2313 ms940 ms1418 ms112 ms321 ms
^[a-zA-Z ]{5,}$237 ms467 ms530 ms357 ms68 ms66 ms
^.{16,20}$215 ms378 ms527 ms603 ms60 ms46 ms
([a-f](.[d-m].){0,2}[h-n]){2}940 ms1298 ms1276 ms672 ms122 ms142 ms
([A-Za-z]awyer|[A-Za-z]inn)[^a-zA-Z]1108 ms1468 ms1275 ms246 ms111 ms27 ms
"[^"]{0,30}[?!\.]"32 ms59 ms567 ms79 ms7 ms26 ms
Tom.{10,25}river|river.{10,25}Tom81 ms107 ms804 ms97 ms68 ms25 ms

x86-32 bit mode on an Intel Xeon 2.67GHz (GCC 4.8.2, Linux)

Regular expressionPCREPCRE
-DFA
TREOnig-
uruma
RE2PCRE
-JIT
Twain18 ms20 ms555 ms16 ms3 ms18 ms
^Twain185 ms204 ms293 ms16 ms60 ms16 ms
Twain$15 ms17 ms569 ms16 ms3 ms16 ms
Huck[a-zA-Z]+|Finn[a-zA-Z]+24 ms25 ms820 ms48 ms60 ms20 ms
a[^x]{20}b97 ms648 ms936 ms423 ms454 ms77 ms
Tom|Sawyer|Huckleberry|Finn37 ms38 ms1347 ms58 ms61 ms40 ms
.{0,3}(Tom|Sawyer|Huckleberry|Finn)8142 ms7297 ms5380 ms193 ms70 ms593 ms
[a-zA-Z]+ing1701 ms2367 ms894 ms1382 ms125 ms287 ms
^[a-zA-Z]{0,4}ing[^a-zA-Z]208 ms239 ms449 ms38 ms60 ms41 ms
[a-zA-Z]+ing$1787 ms2494 ms881 ms1416 ms107 ms302 ms
^[a-zA-Z ]{5,}$294 ms501 ms564 ms362 ms68 ms66 ms
^.{16,20}$238 ms422 ms514 ms615 ms60 ms48 ms
([a-f](.[d-m].){0,2}[h-n]){2}1113 ms1390 ms1303 ms728 ms133 ms140 ms
([A-Za-z]awyer|[A-Za-z]inn)[^a-zA-Z]1263 ms1488 ms1557 ms255 ms104 ms27 ms
"[^"]{0,30}[?!\.]"36 ms63 ms633 ms85 ms7 ms26 ms
Tom.{10,25}river|river.{10,25}Tom86 ms102 ms904 ms98 ms69 ms25 ms

x86-64 bit mode on an Intel Core i5 3.2GHz (GCC 4.4.7, Windows)

Regular expressionPCREPCRE
-DFA
TREOnig-
uruma
RE2PCRE
-JIT
Twain10 ms12 ms313 ms20 ms8 ms16 ms
^Twain93 ms118 ms159 ms20 ms131 ms16 ms
Twain$10 ms12 ms327 ms20 ms7 ms16 ms
Huck[a-zA-Z]+|Finn[a-zA-Z]+19 ms20 ms522 ms37 ms130 ms20 ms
a[^x]{20}b58 ms377 ms574 ms347 ms526 ms38 ms
Tom|Sawyer|Huckleberry|Finn26 ms30 ms874 ms43 ms132 ms37 ms
.{0,3}(Tom|Sawyer|Huckleberry|Finn)3794 ms4958 ms3016 ms112 ms132 ms375 ms
[a-zA-Z]+ing747 ms1594 ms547 ms833 ms176 ms166 ms
^[a-zA-Z]{0,4}ing[^a-zA-Z]113 ms150 ms242 ms42 ms133 ms23 ms
[a-zA-Z]+ing$789 ms1683 ms540 ms839 ms132 ms182 ms
^[a-zA-Z ]{5,}$152 ms288 ms322 ms273 ms152 ms48 ms
^.{16,20}$135 ms239 ms298 ms490 ms133 ms24 ms
([a-f](.[d-m].){0,2}[h-n]){2}458 ms698 ms746 ms478 ms166 ms103 ms
([A-Za-z]awyer|[A-Za-z]inn)[^a-zA-Z]663 ms1052 ms826 ms173 ms131 ms27 ms
"[^"]{0,30}[?!\.]"21 ms37 ms365 ms67 ms18 ms14 ms
Tom.{10,25}river|river.{10,25}Tom55 ms76 ms645 ms75 ms131 ms26 ms

The testing environment can be downloaded from here. Just extract, type make. Download the mtent12.zip, rename the mtent12.txt inside to mark.txt and run runtest.

Last modification: 6.6.2015