- .TH PCRECPP 3 "08 January 2012" "PCRE 8.30"
- .SH NAME
- PCRE - Perl-compatible regular expressions.
- .SH "SYNOPSIS OF C++ WRAPPER"
- .rs
- .sp
- .B #include <pcrecpp.h>
- .
- .SH DESCRIPTION
- .rs
- .sp
- The C++ wrapper for PCRE was provided by Google Inc. Some additional
- functionality was added by Giuseppe Maxia. This brief man page was constructed
- from the notes in the \fIpcrecpp.h\fP file, which should be consulted for
- further details. Note that the C++ wrapper supports only the original 8-bit
- PCRE library. There is no 16-bit or 32-bit support at present.
- .
- .
- .SH "MATCHING INTERFACE"
- .rs
- .sp
- The "FullMatch" operation checks that supplied text matches a supplied pattern
- exactly. If pointer arguments are supplied, it copies matched sub-strings that
- match sub-patterns into them.
- .sp
- Example: successful match
- pcrecpp::RE re("h.*o");
- re.FullMatch("hello");
- .sp
- Example: unsuccessful match (requires full match):
- pcrecpp::RE re("e");
- !re.FullMatch("hello");
- .sp
- Example: creating a temporary RE object:
- pcrecpp::RE("h.*o").FullMatch("hello");
- .sp
- You can pass in a "const char*" or a "string" for "text". The examples below
- tend to use a const char*. As the examples above show, you can either store
- the RE object in a variable or use a temporary RE object. The examples below
- use one form or the other arbitrarily; either would work in any of them.
- .P
- You must supply extra pointer arguments to extract matched subpieces.
- .sp
- Example: extracts "ruby" into "s" and 1234 into "i"
- int i;
- string s;
- pcrecpp::RE re("(\e\ew+):(\e\ed+)");
- re.FullMatch("ruby:1234", &s, &i);
- .sp
- Example: does not try to extract any extra sub-patterns
- re.FullMatch("ruby:1234", &s);
- .sp
- Example: does not try to extract into NULL
- re.FullMatch("ruby:1234", NULL, &i);
- .sp
- Example: integer overflow causes failure
- !re.FullMatch("ruby:1234567891234", NULL, &i);
- .sp
- Example: fails because there aren't enough sub-patterns:
- !pcrecpp::RE("\e\ew+:\e\ed+").FullMatch("ruby:1234", &s);
- .sp
- Example: fails because string cannot be stored in integer
- !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
- .sp
- The provided pointer arguments can be pointers to any scalar numeric
- type, or one of:
- .sp
- string (matched piece is copied to string)
- StringPiece (StringPiece is mutated to point to matched piece)
- T (where "bool T::ParseFrom(const char*, int)" exists)
- NULL (the corresponding matched sub-pattern is not copied)
- .sp
- The function returns true iff all of the following conditions are satisfied:
- .sp
- a. "text" matches "pattern" exactly;
- .sp
- b. The number of matched sub-patterns is >= number of supplied
- pointers;
- .sp
- c. The "i"th argument has a suitable type for holding the
- string captured as the "i"th sub-pattern. If you pass in
- void * NULL for the "i"th argument, or a non-void * NULL
- of the correct type, or pass fewer arguments than the
- number of sub-patterns, the "i"th captured sub-pattern is
- ignored.
- .sp
- CAVEAT: An optional sub-pattern that does not exist in the matched
- string is assigned the empty string. Therefore, the following will
- return false (because the empty string is not a valid number):
- .sp
- int number;
- pcrecpp::RE("[a-z]+(\e\ed+)?").FullMatch("abc", &number);
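- .P
- The numeric failure cases above (overflow, non-numeric text, and the empty
- string from an unmatched optional sub-pattern) can be pictured with a small
- standalone parser. This is only a sketch of the behaviour described here, not
- the wrapper's own code; the helper name parse_int_sketch is hypothetical.

```cpp
#include <cctype>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper, not part of pcrecpp: parse a captured piece of text
// as a base-10 int, accepting it only if the whole piece is one in-range number.
bool parse_int_sketch(const std::string& piece, int* out) {
  if (piece.empty()) return false;  // empty capture is not a valid number
  // strtol would silently skip leading whitespace, so reject it up front.
  if (std::isspace(static_cast<unsigned char>(piece[0]))) return false;
  errno = 0;
  char* end = nullptr;
  const long value = std::strtol(piece.c_str(), &end, 10);
  if (end != piece.c_str() + piece.size()) return false;  // e.g. "ruby"
  if (errno == ERANGE) return false;                       // too big for long
  if (value < INT_MIN || value > INT_MAX) return false;    // too big for int
  *out = static_cast<int>(value);
  return true;
}
```

- Under these assumptions "1234" is accepted, while "1234567891234", "ruby",
- and the empty string are all rejected.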
- .sp
- The matching interface supports at most 16 arguments per call.
- If you need more, consider using the more general interface
- \fBpcrecpp::RE::DoMatch\fP. See \fBpcrecpp.h\fP for the signature for
- \fBDoMatch\fP.
- .P
- NOTE: Do not use \fBno_arg\fP, which is used internally to mark the end of a
- list of optional arguments, as a placeholder for missing arguments, as this can
- lead to segfaults.
- .
- .
- .SH "QUOTING METACHARACTERS"
- .rs
- .sp
- You can use the "QuoteMeta" operation to insert backslashes before all
- potentially meaningful characters in a string. The returned string, used as a
- regular expression, will exactly match the original string.
- .sp
- Example:
- string quoted = RE::QuoteMeta(unquoted);
- .sp
- Note that it's legal to escape a character even if it has no special meaning in
- a regular expression -- so this function does that. (This also makes it
- identical to the Perl function of the same name; see "perldoc -f quotemeta".)
- For example, "1.5-2.0?" becomes "1\e.5\e-2\e.0\e?".
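- .P
- The escaping rule can be sketched in a few lines of standard C++. This is a
- simplified illustration that assumes every byte other than an ASCII letter,
- digit, or underscore is escaped; the library's own implementation may treat
- some bytes (such as NUL or non-ASCII bytes) differently. The name
- quote_meta_sketch is hypothetical.

```cpp
#include <string>

// Hypothetical helper, not the real RE::QuoteMeta: put a backslash in front
// of every byte that is not an ASCII letter, digit, or underscore.
std::string quote_meta_sketch(const std::string& unquoted) {
  std::string result;
  result.reserve(unquoted.size() * 2);  // worst case: every byte is escaped
  for (const char c : unquoted) {
    const bool is_word = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                         (c >= '0' && c <= '9') || c == '_';
    if (!is_word) result += '\\';
    result += c;
  }
  return result;
}
```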
- .
- .SH "PARTIAL MATCHES"
- .rs
- .sp
- You can use the "PartialMatch" operation when you want the pattern
- to match any substring of the text.
- .sp
- Example: simple search for a string:
- pcrecpp::RE("ell").PartialMatch("hello");
- .sp
- Example: find first number in a string:
- int number;
- pcrecpp::RE re("(\e\ed+)");
- re.PartialMatch("x*100 + 20", &number);
- assert(number == 100);
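- .P
- The difference between "FullMatch" and "PartialMatch" can be illustrated
- without PCRE. The following minimal sketch assumes the standard std::regex
- library (ECMAScript syntax) as a stand-in for pcrecpp::RE; the helper names
- full_match_sketch and partial_match_sketch are hypothetical and are not part
- of the wrapper.

```cpp
#include <regex>
#include <string>

// Sketch only: std::regex stands in for pcrecpp::RE.

// True only if the pattern accounts for the entire text (FullMatch semantics).
bool full_match_sketch(const std::string& pattern, const std::string& text) {
  return std::regex_match(text, std::regex(pattern));
}

// True if the pattern matches any substring of the text (PartialMatch semantics).
bool partial_match_sketch(const std::string& pattern, const std::string& text) {
  return std::regex_search(text, std::regex(pattern));
}
```

- With these helpers, the pattern "e" fails a full match against "hello" but
- succeeds as a partial match.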
- .
- .
- .SH "UTF-8 AND THE MATCHING INTERFACE"
- .rs
- .sp
- By default, pattern and text are plain text, one byte per character. The UTF8
- flag, passed to the constructor, causes both pattern and string to be treated
- as UTF-8 text, still a byte stream but potentially multiple bytes per
- character. In practice, the text is likelier to be UTF-8 than the pattern, but
- the match returned may depend on the UTF8 flag, so always use it when matching
- UTF-8 text. For example, "." will match one byte normally but with UTF8 set may
- match all the bytes of a multi-byte character (up to four in standard UTF-8).
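- .P
- The reason the number of bytes consumed by "." can change is that, in UTF-8,
- the first byte of a character encodes how many bytes the character occupies.
- A minimal sketch of that rule, independent of PCRE (the function name
- utf8_char_length_sketch is hypothetical):

```cpp
#include <cstddef>

// Hypothetical helper: number of bytes in the UTF-8 character that starts
// with the given lead byte, or 0 if the byte cannot start a character.
std::size_t utf8_char_length_sketch(unsigned char lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx: plain ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;                             // continuation byte or invalid lead
}
```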
- .sp
- Example:
- pcrecpp::RE_Options options;
- options.set_utf8();
- pcrecpp::RE re(utf8_pattern, options);
- re.FullMatch(utf8_string);
- .sp
- Example: using the convenience function UTF8():
- pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
- re.FullMatch(utf8_string);
- .sp
- NOTE: The UTF8 flag is ignored if PCRE was not configured with the
- --enable-utf8 flag.
- .
- .
- .SH "PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE"
- .rs
- .sp
- PCRE defines some modifiers to change the behavior of the regular expression
- engine. The C++ wrapper defines an auxiliary class, RE_Options, as a vehicle to
- pass such modifiers to a RE class. Currently, the following modifiers are
- supported:
- .sp
-    modifier              description                Perl equivalent
- .sp
-    PCRE_CASELESS         case-insensitive match     /i
-    PCRE_MULTILINE        multiple-line match        /m
-    PCRE_DOTALL           dot matches newlines       /s
-    PCRE_DOLLAR_ENDONLY   $ matches only at end      N/A
-    PCRE_EXTRA            strict escape parsing      N/A
-    PCRE_EXTENDED         ignore whitespace          /x
-    PCRE_UTF8             handles UTF-8 chars        built-in
-    PCRE_UNGREEDY         reverses * and *?          N/A
-    PCRE_NO_AUTO_CAPTURE  disables capturing parens  N/A (*)
- .sp
- (*) Both Perl and PCRE allow non-capturing parentheses by means of the
- "?:" modifier within the pattern itself. For example, (?:ab|cd) does not
- capture, while (ab|cd) does.
- .P
- For a full account of how each modifier works, please check the
- PCRE API reference page.
- .P
- For each modifier, there are two member functions whose names are derived
- from the modifier name in lowercase, without the "PCRE_" prefix. For
- instance, PCRE_CASELESS is handled by
- .sp
- bool caseless()
- .sp
- which returns true if the modifier is set, and
- .sp
- RE_Options & set_caseless(bool)
- .sp
- which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can be
- accessed through the \fBset_match_limit()\fP and \fBmatch_limit()\fP member
- functions. Setting \fImatch_limit\fP to a non-zero value limits the work that
- PCRE does on a single match, which keeps it from overflowing the stack or
- taking an excessively long time to return a result. A value of 5000 is good
- enough to stop stack blowup in a 2MB thread stack. Setting \fImatch_limit\fP
- to zero disables match limiting. Alternatively, you can call
- \fBset_match_limit_recursion()\fP, which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION
- to limit how much PCRE recurses. \fBmatch_limit()\fP limits the number of
- calls PCRE makes to its internal matching function;
- \fBmatch_limit_recursion()\fP limits the depth of internal recursion, and
- therefore the amount of stack that is used.
- .P
- Normally, to pass one or more modifiers to a RE class, you declare
- a \fIRE_Options\fP object, set the appropriate options, and pass this
- object to a RE constructor. Example:
- .sp
- RE_Options opt;
- opt.set_caseless(true);
- if (RE("HELLO", opt).PartialMatch("hello world")) ...
- .sp
- RE_Options has two constructors. The default constructor takes no arguments and
- creates a set of flags that are all off. The other takes an \fIoption_flags\fP
- parameter, which is intended to ease the porting of legacy code from C programs.
- This lets you do
- .sp
- RE(pattern,
- RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
- .sp
- However, new code is better off doing
- .sp
- RE(pattern,
- RE_Options().set_caseless(true).set_multiline(true))
- .PartialMatch(str);
- .sp
- If you are going to pass one of the most used modifiers, there are some
- convenience functions that return a RE_Options object with the
- appropriate modifier already set: \fBCASELESS()\fP, \fBUTF8()\fP,
- \fBMULTILINE()\fP, \fBDOTALL()\fP, and \fBEXTENDED()\fP.
- .P
- If you need to set several options at once and do not want to declare a
- separate RE_Options object first, you can set them on the fly by chaining
- several \fBset_xxxxx()\fP member functions, since each of them returns a
- reference to its class object. For example, to pass PCRE_CASELESS,
- PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one statement, you may write:
- .sp
- RE(" ^ xyz \e\es+ .* blah$",
- RE_Options()
- .set_caseless(true)
- .set_extended(true)
- .set_multiline(true)).PartialMatch(sometext);
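- .P
- Chaining works because each setter returns a reference to the object it was
- called on. A minimal sketch of that pattern follows, using a hypothetical
- OptionsSketch class with made-up flag values; it is not the real RE_Options,
- whose implementation may differ.

```cpp
// Hypothetical stand-in for RE_Options; the flag values are invented.
class OptionsSketch {
 public:
  enum Flag { kCaseless = 1, kMultiline = 2, kExtended = 4 };

  // Getter: reports whether the modifier is currently set.
  bool caseless() const { return (flags_ & kCaseless) != 0; }

  // Setters: each returns *this so that calls can be chained.
  OptionsSketch& set_caseless(bool on) { return set(kCaseless, on); }
  OptionsSketch& set_multiline(bool on) { return set(kMultiline, on); }
  OptionsSketch& set_extended(bool on) { return set(kExtended, on); }

  // All flags combined into one integer.
  int all_options() const { return flags_; }

 private:
  OptionsSketch& set(Flag flag, bool on) {
    if (on) {
      flags_ |= flag;
    } else {
      flags_ &= ~flag;
    }
    return *this;
  }

  int flags_ = 0;  // every modifier is off by default
};
```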
- .sp
- .
- .
- .SH "SCANNING TEXT INCREMENTALLY"
- .rs
- .sp
- The "Consume" operation may be useful if you want to repeatedly
- match regular expressions at the front of a string and skip over
- them as they match. This requires use of the "StringPiece" type,
- which represents a sub-range of a real string. Like RE, StringPiece
- is defined in the pcrecpp namespace.
- .sp
- Example: read lines of the form "var = value" from a string.
- string contents = ...; // Fill string somehow
- pcrecpp::StringPiece input(contents); // Wrap in a StringPiece
- .sp
- string var;
- int value;
- pcrecpp::RE re("(\e\ew+) = (\e\ed+)\en");
- while (re.Consume(&input, &var, &value)) {
- ...;
- }
- .sp
- Each successful call to "Consume" sets "var" and "value", and also
- advances "input" so it points past the matched text.
- .P
- The "FindAndConsume" operation is similar to "Consume" but does not
- anchor your match at the beginning of the string. For example, you
- could extract all words from a string by repeatedly calling
- .sp
- pcrecpp::RE("(\e\ew+)").FindAndConsume(&input, &word)
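- .P
- The contrast between anchored and unanchored scanning can be sketched with
- the standard library. This sketch assumes std::regex as a stand-in for
- pcrecpp::RE and an iterator as a stand-in for the StringPiece that is
- advanced; the function names consume_assignments_sketch and
- find_words_sketch are hypothetical.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Anchored scanning, in the spirit of Consume: each match must begin exactly
// where the previous one ended, so the loop stops at the first bad line.
// Note: std::stoi throws if the digits do not fit in an int.
std::vector<std::pair<std::string, int>> consume_assignments_sketch(
    const std::string& contents) {
  static const std::regex re("(\\w+) = (\\d+)\\n");
  std::vector<std::pair<std::string, int>> out;
  std::string::const_iterator pos = contents.begin();
  std::smatch m;
  while (std::regex_search(pos, contents.end(), m, re,
                           std::regex_constants::match_continuous)) {
    out.emplace_back(m[1].str(), std::stoi(m[2].str()));
    pos = m[0].second;  // advance past the matched text
  }
  return out;
}

// Unanchored scanning, in the spirit of FindAndConsume: each search may skip
// over non-matching text before the next match.
std::vector<std::string> find_words_sketch(const std::string& contents) {
  static const std::regex re("(\\w+)");
  std::vector<std::string> out;
  std::string::const_iterator pos = contents.begin();
  std::smatch m;
  while (std::regex_search(pos, contents.end(), m, re)) {
    out.push_back(m[1].str());
    pos = m[0].second;
  }
  return out;
}
```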
- .
- .
- .SH "PARSING HEX/OCTAL/C-RADIX NUMBERS"
- .rs
- .sp
- By default, if you pass a pointer to a numeric value, the
- corresponding text is interpreted as a base-10 number. You can
- instead wrap the pointer with a call to one of the operators Hex(),
- Octal(), or CRadix() to interpret the text in another base. The
- CRadix operator interprets C-style "0" (base-8) and "0x" (base-16)
- prefixes, but defaults to base-10.
- .sp
- Example:
- int a, b, c, d;
- pcrecpp::RE re("(.*) (.*) (.*) (.*)");
- re.FullMatch("100 40 0100 0x40",
- pcrecpp::Octal(&a), pcrecpp::Hex(&b),
- pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
- .sp
- will leave 64 in a, b, c, and d.
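- .P
- The arithmetic behind this example can be checked with the C library alone.
- The sketch below assumes that std::strtol with base 8, base 16, and base 0
- (prefix-driven) behaves like Octal(), Hex(), and CRadix() respectively for
- these inputs; the helper name parse_radix_sketch is hypothetical, and the
- sketch omits the range checks a real parser would need.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: parse the whole piece in the given base.
// Base 0 lets a leading "0" or "0x" select octal or hex, as in C literals.
bool parse_radix_sketch(const std::string& piece, int base, long* out) {
  if (piece.empty()) return false;
  char* end = nullptr;
  const long value = std::strtol(piece.c_str(), &end, base);
  if (end != piece.c_str() + piece.size()) return false;  // unparsed leftovers
  *out = value;
  return true;
}
```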
- .
- .
- .SH "REPLACING PARTS OF STRINGS"
- .rs
- .sp
- You can replace the first match of "pattern" in "str" with "rewrite".
- Within "rewrite", backslash-escaped digits (\e1 to \e9) can be
- used to insert the text matched by the corresponding parenthesized
- group in the pattern. \e0 in "rewrite" refers to the entire matching
- text. For example:
- .sp
- string s = "yabba dabba doo";
- pcrecpp::RE("b+").Replace("d", &s);
- .sp
- will leave "s" containing "yada dabba doo". The result is true if the pattern
- matches and a replacement occurs, false otherwise.
- .P
- \fBGlobalReplace\fP is like \fBReplace\fP except that it replaces all
- occurrences of the pattern in the string with the rewrite. Replacements are
- not subject to re-matching. For example:
- .sp
- string s = "yabba dabba doo";
- pcrecpp::RE("b+").GlobalReplace("d", &s);
- .sp
- will leave "s" containing "yada dada doo". It returns the number of
- replacements made.
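- .P
- The two results above can be reproduced with the standard library as a
- cross-check. This sketch assumes std::regex_replace as a stand-in for
- Replace and GlobalReplace; unlike the wrapper, it returns a new string
- rather than modifying its argument, and its rewrite strings refer to groups
- as $1 rather than with a backslash. The helper names are hypothetical.

```cpp
#include <regex>
#include <string>

// Replace only the first match, as Replace does.
std::string replace_first_sketch(const std::string& text,
                                 const std::string& pattern,
                                 const std::string& rewrite) {
  return std::regex_replace(text, std::regex(pattern), rewrite,
                            std::regex_constants::format_first_only);
}

// Replace every non-overlapping match, as GlobalReplace does.
std::string replace_all_sketch(const std::string& text,
                               const std::string& pattern,
                               const std::string& rewrite) {
  return std::regex_replace(text, std::regex(pattern), rewrite);
}
```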
- .P
- \fBExtract\fP is like \fBReplace\fP, except that if the pattern matches,
- "rewrite" is copied into "out" (an additional argument) with substitutions.
- The non-matching portions of "text" are ignored. Returns true iff a match
- occurred and the extraction happened successfully; if no match occurs, the
- string is left unaffected.
- .
- .
- .SH AUTHOR
- .rs
- .sp
- .nf
- The C++ wrapper was contributed by Google Inc.
- Copyright (c) 2007 Google Inc.
- .fi
- .
- .
- .SH REVISION
- .rs
- .sp
- .nf
- Last updated: 08 January 2012
- .fi