123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195 |
- <html>
- <head>
- <title>pcreperform specification</title>
- </head>
- <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
- <h1>pcreperform man page</h1>
- <p>
- Return to the <a href="index.html">PCRE index page</a>.
- </p>
- <p>
- This page is part of the PCRE HTML documentation. It was generated automatically
- from the original man page. If there is any nonsense in it, please consult the
- man page, in case the conversion went wrong.
- <br>
- <br><b>
- PCRE PERFORMANCE
- </b><br>
- <P>
- Two aspects of performance are discussed below: memory usage and processing
- time. The way you express your pattern as a regular expression can affect both
- of them.
- </P>
- <br><b>
- COMPILED PATTERN MEMORY USAGE
- </b><br>
- <P>
- Patterns are compiled by PCRE into a reasonably efficient interpretive code, so
- that most simple patterns do not use much memory. However, there is one case
- where the memory usage of a compiled pattern can be unexpectedly large. If a
- parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
- a limited maximum, the whole subpattern is repeated in the compiled code. For
- example, the pattern
- <pre>
- (abc|def){2,4}
- </pre>
- is compiled as if it were
- <pre>
- (abc|def)(abc|def)((abc|def)(abc|def)?)?
- </pre>
- (Technical aside: It is done this way so that backtrack points within each of
- the repetitions can be independently maintained.)
- </P>
- <P>
- For regular expressions whose quantifiers use only small numbers, this is not
- usually a problem. However, if the numbers are large, and particularly if such
- repetitions are nested, the memory usage can become an embarrassment. For
- example, the very simple pattern
- <pre>
- ((ab){1,1000}c){1,3}
- </pre>
- uses 51K bytes when compiled using the 8-bit library. When PCRE is compiled
- with its default internal pointer size of two bytes, the size limit on a
- compiled pattern is 64K data units, and this is reached with the above pattern
- if the outer repetition is increased from 3 to 4. PCRE can be compiled to use
- larger internal pointers and thus handle larger compiled patterns, but it is
- better to try to rewrite your pattern to use less memory if you can.
- </P>
- <P>
- One way of reducing the memory usage for such patterns is to make use of PCRE's
- <a href="pcrepattern.html#subpatternsassubroutines">"subroutine"</a>
- facility. Re-writing the above pattern as
- <pre>
- ((ab)(?2){0,999}c)(?1){0,2}
- </pre>
- reduces the memory requirements to 18K, and indeed it remains under 20K even
- with the outer repetition increased to 100. However, this pattern is not
- exactly equivalent, because the "subroutine" calls are treated as
- <a href="pcrepattern.html#atomicgroup">atomic groups</a>
- into which there can be no backtracking if there is a subsequent matching
- failure. Therefore, PCRE cannot do this kind of rewriting automatically.
- Furthermore, there is a noticeable loss of speed when executing the modified
- pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
- speed is acceptable, this kind of rewriting will allow you to process patterns
- that PCRE cannot otherwise handle.
- </P>
- <br><b>
- STACK USAGE AT RUN TIME
- </b><br>
- <P>
- When <b>pcre_exec()</b> or <b>pcre[16|32]_exec()</b> is used for matching, certain
- kinds of pattern can cause it to use large amounts of the process stack. In
- some environments the default process stack is quite small, and if it runs out
- the result is often SIGSEGV. This issue is probably the most frequently raised
- problem with PCRE. Rewriting your pattern can often help. The
- <a href="pcrestack.html"><b>pcrestack</b></a>
- documentation discusses this issue in detail.
- </P>
- <br><b>
- PROCESSING TIME
- </b><br>
- <P>
- Certain items in regular expression patterns are processed more efficiently
- than others. It is more efficient to use a character class like [aeiou] than a
- set of single-character alternatives such as (a|e|i|o|u). In general, the
- simplest construction that provides the required behaviour is usually the most
- efficient. Jeffrey Friedl's book contains a lot of useful general discussion
- about optimizing regular expressions for efficient performance. This document
- contains a few observations about PCRE.
- </P>
- <P>
- Using Unicode character properties (the \p, \P, and \X escapes) is slow,
- because PCRE has to use a multi-stage table lookup whenever it needs a
- character's property. If you can find an alternative pattern that does not use
- character properties, it will probably be faster.
- </P>
- <P>
- By default, the escape sequences \b, \d, \s, and \w, and the POSIX
- character classes such as [:alpha:] do not use Unicode properties, partly for
- backwards compatibility, and partly for performance reasons. However, you can
- set PCRE_UCP if you want Unicode character properties to be used. This can
- double the matching time for items such as \d, when matched with
- a traditional matching function; the performance loss is less with
- a DFA matching function, and in both cases there is not much difference for
- \b.
- </P>
- <P>
- When a pattern begins with .* not in parentheses, or in parentheses that are
- not the subject of a backreference, and the PCRE_DOTALL option is set, the
- pattern is implicitly anchored by PCRE, since it can match only at the start of
- a subject string. However, if PCRE_DOTALL is not set, PCRE cannot make this
- optimization, because the . metacharacter does not then match a newline, and if
- the subject string contains newlines, the pattern may match from the character
- immediately following one of them instead of from the very start. For example,
- the pattern
- <pre>
- .*second
- </pre>
- matches the subject "first\nand second" (where \n stands for a newline
- character), with the match starting at the seventh character. In order to do
- this, PCRE has to retry the match starting after every newline in the subject.
- </P>
- <P>
- If you are using such a pattern with subject strings that do not contain
- newlines, the best performance is obtained by setting PCRE_DOTALL, or starting
- the pattern with ^.* or ^.*? to indicate explicit anchoring. That saves PCRE
- from having to scan along the subject looking for a newline to restart at.
- </P>
- <P>
- Beware of patterns that contain nested indefinite repeats. These can take a
- long time to run when applied to a string that does not match. Consider the
- pattern fragment
- <pre>
- ^(a+)*
- </pre>
- This can match "aaaa" in 16 different ways, and this number increases very
- rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4
- times, and for each of those cases other than 0 or 4, the + repeats can match
- different numbers of times.) When the remainder of the pattern is such that the
- entire match is going to fail, PCRE has in principle to try every possible
- variation, and this can take an extremely long time, even for relatively short
- strings.
- </P>
- <P>
- An optimization catches some of the more simple cases such as
- <pre>
- (a+)*b
- </pre>
- where a literal character follows. Before embarking on the standard matching
- procedure, PCRE checks that there is a "b" later in the subject string, and if
- there is not, it fails the match immediately. However, when there is no
- following literal this optimization cannot be used. You can see the difference
- by comparing the behaviour of
- <pre>
- (a+)*\d
- </pre>
- with the pattern above. The former gives a failure almost instantly when
- applied to a whole line of "a" characters, whereas the latter takes an
- appreciable time with strings longer than about 20 characters.
- </P>
- <P>
- In many cases, the solution to this kind of performance issue is to use an
- atomic group or a possessive quantifier.
- </P>
- <br><b>
- AUTHOR
- </b><br>
- <P>
- Philip Hazel
- <br>
- University Computing Service
- <br>
- Cambridge CB2 3QH, England.
- <br>
- </P>
- <br><b>
- REVISION
- </b><br>
- <P>
- Last updated: 25 August 2012
- <br>
- Copyright © 1997-2012 University of Cambridge.
- <br>
- <p>
- Return to the <a href="index.html">PCRE index page</a>.
- </p>
|