- <html>
- <head>
- <title>pcresyntax specification</title>
- </head>
- <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
- <h1>pcresyntax man page</h1>
- <p>
- Return to the <a href="index.html">PCRE index page</a>.
- </p>
- <p>
- This page is part of the PCRE HTML documentation. It was generated automatically
- from the original man page. If there is any nonsense in it, please consult the
- man page, in case the conversion went wrong.
- <br>
- <ul>
- <li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
- <li><a name="TOC2" href="#SEC2">QUOTING</a>
- <li><a name="TOC3" href="#SEC3">CHARACTERS</a>
- <li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
- <li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
- <li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
- <li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
- <li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
- <li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
- <li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
- <li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
- <li><a name="TOC12" href="#SEC12">ALTERNATION</a>
- <li><a name="TOC13" href="#SEC13">CAPTURING</a>
- <li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
- <li><a name="TOC15" href="#SEC15">COMMENT</a>
- <li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
- <li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
- <li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
- <li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
- <li><a name="TOC20" href="#SEC20">BACKREFERENCES</a>
- <li><a name="TOC21" href="#SEC21">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
- <li><a name="TOC22" href="#SEC22">CONDITIONAL PATTERNS</a>
- <li><a name="TOC23" href="#SEC23">BACKTRACKING CONTROL</a>
- <li><a name="TOC24" href="#SEC24">CALLOUTS</a>
- <li><a name="TOC25" href="#SEC25">SEE ALSO</a>
- <li><a name="TOC26" href="#SEC26">AUTHOR</a>
- <li><a name="TOC27" href="#SEC27">REVISION</a>
- </ul>
- <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
- <P>
- The full syntax and semantics of the regular expressions that are supported by
- PCRE are described in the
- <a href="pcrepattern.html"><b>pcrepattern</b></a>
- documentation. This document contains a quick-reference summary of the syntax.
- </P>
- <br><a name="SEC2" href="#TOC1">QUOTING</a><br>
- <P>
- <pre>
- \x where x is non-alphanumeric is a literal x
- \Q...\E treat enclosed characters as literal
- </PRE>
- </P>
- <br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
- <P>
- <pre>
- \a alarm, that is, the BEL character (hex 07)
- \cx "control-x", where x is any ASCII character
- \e escape (hex 1B)
- \f form feed (hex 0C)
- \n newline (hex 0A)
- \r carriage return (hex 0D)
- \t tab (hex 09)
- \0dd character with octal code 0dd
- \ddd character with octal code ddd, or backreference
- \o{ddd..} character with octal code ddd..
- \xhh character with hex code hh
- \x{hhh..} character with hex code hhh..
- </pre>
- Note that \0dd is always an octal code, and that \8 and \9 are the literal
- characters "8" and "9".
- </P>
- <br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
- <P>
- <pre>
- . any character except newline;
- in dotall mode, any character whatsoever
- \C one data unit, even in UTF mode (best avoided)
- \d a decimal digit
- \D a character that is not a decimal digit
- \h a horizontal white space character
- \H a character that is not a horizontal white space character
- \N a character that is not a newline
- \p{<i>xx</i>} a character with the <i>xx</i> property
- \P{<i>xx</i>} a character without the <i>xx</i> property
- \R a newline sequence
- \s a white space character
- \S a character that is not a white space character
- \v a vertical white space character
- \V a character that is not a vertical white space character
- \w a "word" character
- \W a "non-word" character
- \X a Unicode extended grapheme cluster
- </pre>
- By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
- or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
- happening, \s and \w may also match characters with code points in the range
- 128-255. If the PCRE_UCP option is set, the behaviour of these escape sequences
- is changed to use Unicode properties and they match many more characters.
- </P>
- <br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
- <P>
- <pre>
- C Other
- Cc Control
- Cf Format
- Cn Unassigned
- Co Private use
- Cs Surrogate
- L Letter
- Ll Lower case letter
- Lm Modifier letter
- Lo Other letter
- Lt Title case letter
- Lu Upper case letter
- L& Ll, Lu, or Lt
- M Mark
- Mc Spacing mark
- Me Enclosing mark
- Mn Non-spacing mark
- N Number
- Nd Decimal number
- Nl Letter number
- No Other number
- P Punctuation
- Pc Connector punctuation
- Pd Dash punctuation
- Pe Close punctuation
- Pf Final punctuation
- Pi Initial punctuation
- Po Other punctuation
- Ps Open punctuation
- S Symbol
- Sc Currency symbol
- Sk Modifier symbol
- Sm Mathematical symbol
- So Other symbol
- Z Separator
- Zl Line separator
- Zp Paragraph separator
- Zs Space separator
- </PRE>
- </P>
- <br><a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
- <P>
- <pre>
- Xan Alphanumeric: union of properties L and N
- Xps POSIX space: property Z or tab, NL, VT, FF, CR
- Xsp Perl space: property Z or tab, NL, VT, FF, CR
- Xuc Universally-named character: one that can be
- represented by a Universal Character Name
- Xwd Perl word: property Xan or underscore
- </pre>
- Perl and POSIX space are now the same. Perl added VT to its space character set
- at release 5.18 and PCRE changed at release 8.34.
- </P>
- <br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
- <P>
- Arabic,
- Armenian,
- Avestan,
- Balinese,
- Bamum,
- Bassa_Vah,
- Batak,
- Bengali,
- Bopomofo,
- Brahmi,
- Braille,
- Buginese,
- Buhid,
- Canadian_Aboriginal,
- Carian,
- Caucasian_Albanian,
- Chakma,
- Cham,
- Cherokee,
- Common,
- Coptic,
- Cuneiform,
- Cypriot,
- Cyrillic,
- Deseret,
- Devanagari,
- Duployan,
- Egyptian_Hieroglyphs,
- Elbasan,
- Ethiopic,
- Georgian,
- Glagolitic,
- Gothic,
- Grantha,
- Greek,
- Gujarati,
- Gurmukhi,
- Han,
- Hangul,
- Hanunoo,
- Hebrew,
- Hiragana,
- Imperial_Aramaic,
- Inherited,
- Inscriptional_Pahlavi,
- Inscriptional_Parthian,
- Javanese,
- Kaithi,
- Kannada,
- Katakana,
- Kayah_Li,
- Kharoshthi,
- Khmer,
- Khojki,
- Khudawadi,
- Lao,
- Latin,
- Lepcha,
- Limbu,
- Linear_A,
- Linear_B,
- Lisu,
- Lycian,
- Lydian,
- Mahajani,
- Malayalam,
- Mandaic,
- Manichaean,
- Meetei_Mayek,
- Mende_Kikakui,
- Meroitic_Cursive,
- Meroitic_Hieroglyphs,
- Miao,
- Modi,
- Mongolian,
- Mro,
- Myanmar,
- Nabataean,
- New_Tai_Lue,
- Nko,
- Ogham,
- Ol_Chiki,
- Old_Italic,
- Old_North_Arabian,
- Old_Permic,
- Old_Persian,
- Old_South_Arabian,
- Old_Turkic,
- Oriya,
- Osmanya,
- Pahawh_Hmong,
- Palmyrene,
- Pau_Cin_Hau,
- Phags_Pa,
- Phoenician,
- Psalter_Pahlavi,
- Rejang,
- Runic,
- Samaritan,
- Saurashtra,
- Sharada,
- Shavian,
- Siddham,
- Sinhala,
- Sora_Sompeng,
- Sundanese,
- Syloti_Nagri,
- Syriac,
- Tagalog,
- Tagbanwa,
- Tai_Le,
- Tai_Tham,
- Tai_Viet,
- Takri,
- Tamil,
- Telugu,
- Thaana,
- Thai,
- Tibetan,
- Tifinagh,
- Tirhuta,
- Ugaritic,
- Vai,
- Warang_Citi,
- Yi.
- </P>
- <br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
- <P>
- <pre>
- [...] positive character class
- [^...] negative character class
- [x-y] range (can be used for hex characters)
- [[:xxx:]] positive POSIX named set
- [[:^xxx:]] negative POSIX named set
- alnum alphanumeric
- alpha alphabetic
- ascii 0-127
- blank space or tab
- cntrl control character
- digit decimal digit
- graph printing, excluding space
- lower lower case letter
- print printing, including space
- punct printing, excluding alphanumeric
- space white space
- upper upper case letter
- word same as \w
- xdigit hexadecimal digit
- </pre>
- In PCRE, POSIX character set names recognize only ASCII characters by default,
- but some of them use Unicode properties if PCRE_UCP is set. You can use
- \Q...\E inside a character class.
- </P>
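<P>
As an illustration of the bracket syntax above, here is a minimal sketch using
Python's <b>re</b> module. That module is a different engine from PCRE and is
used here only as an assumption of convenience; it does not support the POSIX
named sets, so the sketch covers positive classes, negated classes, and ranges
only. The subject string and variable names are made up for the example.
</P>

```python
import re

subject = "7f3a-zz-00"

# Positive class built from two ranges: runs of lower-case hex digits.
hex_runs = re.findall(r"[0-9a-f]+", subject)

# Negative class: runs of anything that is not a hyphen.
fields = re.findall(r"[^-]+", subject)

# A range may be written with hex escapes: \x30-\x39 is the same as 0-9.
digit_runs = re.findall(r"[\x30-\x39]+", subject)

print(hex_runs)    # ['7f3a', '00']
print(fields)      # ['7f3a', 'zz', '00']
print(digit_runs)  # ['7', '3', '00']
```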
- <br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
- <P>
- <pre>
- ? 0 or 1, greedy
- ?+ 0 or 1, possessive
- ?? 0 or 1, lazy
- * 0 or more, greedy
- *+ 0 or more, possessive
- *? 0 or more, lazy
- + 1 or more, greedy
- ++ 1 or more, possessive
- +? 1 or more, lazy
- {n} exactly n
- {n,m} at least n, no more than m, greedy
- {n,m}+ at least n, no more than m, possessive
- {n,m}? at least n, no more than m, lazy
- {n,} n or more, greedy
- {n,}+ n or more, possessive
- {n,}? n or more, lazy
- </PRE>
- </P>
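<P>
The difference between greedy and lazy quantifiers is easiest to see on a
concrete subject. The following is a minimal sketch using Python's <b>re</b>
module, which is not PCRE but shares the greedy and lazy forms listed above.
Possessive quantifiers are left out because older Python versions do not accept
them. The subject strings are made up for the example.
</P>

```python
import re

subject = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ takes as much as it can, so the match runs to the last '>'.
greedy = re.search(r"<.+>", subject).group()

# Lazy: .+? takes as little as it can, so the match stops at the first '>'.
lazy = re.search(r"<.+?>", subject).group()

# {n,m} bounds the repetition; the lazy form prefers the lower bound.
bounded_greedy = re.search(r"a{2,4}", "aaaaaa").group()
bounded_lazy = re.search(r"a{2,4}?", "aaaaaa").group()

print(greedy)          # <b>bold</b> and <i>italic</i>
print(lazy)            # <b>
print(bounded_greedy)  # aaaa
print(bounded_lazy)    # aa
```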
- <br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
- <P>
- <pre>
- \b word boundary
- \B not a word boundary
- ^ start of subject
- also after internal newline in multiline mode
- \A start of subject
- $ end of subject
- also before newline at end of subject
- also before internal newline in multiline mode
- \Z end of subject
- also before newline at end of subject
- \z end of subject
- \G first matching position in subject
- </PRE>
- </P>
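<P>
The effect of multiline mode on ^ and $ can be sketched as follows. The example
uses Python's <b>re</b> module as a stand-in, on the assumption that it agrees
with PCRE for the anchors shown; \Z, \z, and \G are deliberately omitted
because Python treats them differently or does not provide them.
</P>

```python
import re

subject = "first line\nsecond line\n"

# Without multiline mode, ^ matches only at the start of the subject.
starts_default = re.findall(r"^\w+", subject)

# In multiline mode, ^ also matches after each internal newline.
starts_multiline = re.findall(r"^\w+", subject, re.MULTILINE)

# \A is unaffected by multiline mode: it is always the start of the subject.
starts_absolute = re.findall(r"\A\w+", subject, re.MULTILINE)

# $ matches before the newline at the very end of the subject, so the
# match is the second "line", not the first.
end_default = re.search(r"line$", subject).span()

# In multiline mode, $ also matches before an internal newline.
end_multiline = re.search(r"line$", subject, re.MULTILINE).span()

# \b: "line" inside "inline" is not at a word boundary.
whole_words = len(re.findall(r"\bline\b", "line inline line"))

print(starts_default)    # ['first']
print(starts_multiline)  # ['first', 'second']
print(starts_absolute)   # ['first']
print(end_default)       # (18, 22)
print(end_multiline)     # (6, 10)
print(whole_words)       # 2
```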
- <br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
- <P>
- <pre>
- \K reset start of match
- </pre>
- \K is honoured in positive assertions, but ignored in negative ones.
- </P>
- <br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
- <P>
- <pre>
- expr|expr|expr...
- </PRE>
- </P>
- <br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
- <P>
- <pre>
- (...) capturing group
- (?<name>...) named capturing group (Perl)
- (?'name'...) named capturing group (Perl)
- (?P<name>...) named capturing group (Python)
- (?:...) non-capturing group
- (?|...) non-capturing group; reset group numbers for
- capturing groups in each alternative
- </PRE>
- </P>
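<P>
A short sketch of capturing, named, and non-capturing groups follows. It uses
Python's <b>re</b> module, so only the Python-style named group and the plain
and non-capturing forms are shown; the (?|...) form is not available there. The
pattern and subject are made up for the example.
</P>

```python
import re

# Python-style named groups, one per date component.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
m = date.search("Last updated: 2014-01-08")

# Named groups are numbered as well, so both access styles work.
year_by_name = m.group("year")
year_by_number = m.group(1)

# (?:...) groups for repetition without creating a capture,
# so the only captured group is the trailing (c).
captures = re.search(r"(?:ab)+(c)", "ababc").groups()

print(year_by_name, year_by_number)  # 2014 2014
print(m.groupdict())  # {'year': '2014', 'month': '01', 'day': '08'}
print(captures)       # ('c',)
```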
- <br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
- <P>
- <pre>
- (?>...) atomic, non-capturing group
- </PRE>
- </P>
- <br><a name="SEC15" href="#TOC1">COMMENT</a><br>
- <P>
- <pre>
- (?#....) comment (not nestable)
- </PRE>
- </P>
- <br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
- <P>
- <pre>
- (?i) caseless
- (?J) allow duplicate names
- (?m) multiline
- (?s) single line (dotall)
- (?U) default ungreedy (lazy)
- (?x) extended (ignore white space)
- (?-...) unset option(s)
- </pre>
- The following are recognized only at the very start of a pattern or after one
- of the newline or \R options with similar syntax. More than one of them may
- appear.
- <pre>
- (*LIMIT_MATCH=d) set the match limit to d (decimal number)
- (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
- (*NO_AUTO_POSSESS) no auto-possessification (PCRE_NO_AUTO_POSSESS)
- (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
- (*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8)
- (*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16)
- (*UTF32) set UTF-32 mode: 32-bit library (PCRE_UTF32)
- (*UTF) set appropriate UTF mode for the library in use
- (*UCP) set PCRE_UCP (use Unicode properties for \d etc)
- </pre>
- Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
- limits set by the caller of pcre_exec(), not increase them.
- </P>
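<P>
The single-letter options can be tried out with the sketch below. It uses
Python's <b>re</b> module, on the assumption that (?i), (?m), (?s), and (?x)
behave there as described above when placed at the start of the pattern. The
(?J) and (?U) options and the (*...) settings are specific to PCRE and are not
shown.
</P>

```python
import re

# (?i) caseless: lower-case pattern, upper-case subject.
caseless = re.search(r"(?i)pcre", "The PCRE library").group()

# (?s) dotall: the dot matches a newline only when the option is set.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"(?s)a.b", "a\nb")

# (?m) multiline: ^ and $ match at internal line boundaries.
lines = re.findall(r"(?m)^\w+$", "one\ntwo")

# (?x) extended: white space in the pattern is ignored and '#' starts
# a comment, so the pattern below is equivalent to \d{3}-\d{4}.
extended = re.search(r"(?x) \d{3} - \d{4}  # local number", "call 555-1234").group()

print(caseless)             # PCRE
print(dot_default is None)  # True
print(dot_all is not None)  # True
print(lines)                # ['one', 'two']
print(extended)             # 555-1234
```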
- <br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
- <P>
- These are recognized only at the very start of the pattern or after option
- settings with a similar syntax.
- <pre>
- (*CR) carriage return only
- (*LF) linefeed only
- (*CRLF) carriage return followed by linefeed
- (*ANYCRLF) all three of the above
- (*ANY) any Unicode newline sequence
- </PRE>
- </P>
- <br><a name="SEC18" href="#TOC1">WHAT \R MATCHES</a><br>
- <P>
- These are recognized only at the very start of the pattern or after option
- settings with a similar syntax.
- <pre>
- (*BSR_ANYCRLF) CR, LF, or CRLF
- (*BSR_UNICODE) any Unicode newline sequence
- </PRE>
- </P>
- <br><a name="SEC19" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
- <P>
- <pre>
- (?=...) positive look ahead
- (?!...) negative look ahead
- (?<=...) positive look behind
- (?<!...) negative look behind
- </pre>
- Each top-level branch of a look behind must be of a fixed length.
- </P>
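<P>
The four assertions can be compared side by side in the sketch below, which
uses Python's <b>re</b> module as a stand-in for PCRE. Every lookbehind in it
has a single fixed length, which both engines accept. The subject string is
made up for the example.
</P>

```python
import re

subject = "cost $42, weight 17kg, fee $8"

# Positive lookbehind: digits immediately preceded by '$'.
after_dollar = re.findall(r"(?<=\$)\d+", subject)

# Negative lookbehind: digits not preceded by '$' or by another digit.
not_after_dollar = re.findall(r"(?<![$\d])\d+", subject)

# Positive lookahead: digits immediately followed by "kg".
before_kg = re.findall(r"\d+(?=kg)", subject)

# Negative lookahead: digits not followed by 'k' or by another digit.
not_before_kg = re.findall(r"\d+(?![\dk])", subject)

print(after_dollar)      # ['42', '8']
print(not_after_dollar)  # ['17']
print(before_kg)         # ['17']
print(not_before_kg)     # ['42', '8']
```

<P>
The assertions consume no characters, which is why the '$' and "kg" never
appear in the results.
</P>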
- <br><a name="SEC20" href="#TOC1">BACKREFERENCES</a><br>
- <P>
- <pre>
- \n reference by number (can be ambiguous)
- \gn reference by number
- \g{n} reference by number
- \g{-n} relative reference by number
- \k<name> reference by name (Perl)
- \k'name' reference by name (Perl)
- \g{name} reference by name (Perl)
- \k{name} reference by name (.NET)
- (?P=name) reference by name (Python)
- </PRE>
- </P>
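<P>
As a sketch of how a backreference differs from repeating the group, the
example below finds doubled words and matching quote pairs. It uses Python's
<b>re</b> module, so only the numeric \n form and the Python-style (?P=name)
form are shown; the \g and \k forms are not available there. The subjects are
made up for the example.
</P>

```python
import re

# \1 must match the same text that group 1 matched, not merely
# something of the same shape, so this finds doubled words only.
doubled = re.findall(r"\b(\w+) \1\b", "this is is a test test")

# A named backreference: the closing quote must be the same character
# as the opening one, so the apostrophe does not end the match.
quoted = re.search(r"""(?P<q>['"]).*?(?P=q)""", 'say "it\'s" now').group()

print(doubled)  # ['is', 'test']
print(quoted)   # "it's"
```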
- <br><a name="SEC21" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
- <P>
- <pre>
- (?R) recurse whole pattern
- (?n) call subpattern by absolute number
- (?+n) call subpattern by relative number
- (?-n) call subpattern by relative number
- (?&name) call subpattern by name (Perl)
- (?P>name) call subpattern by name (Python)
- \g<name> call subpattern by name (Oniguruma)
- \g'name' call subpattern by name (Oniguruma)
- \g<n> call subpattern by absolute number (Oniguruma)
- \g'n' call subpattern by absolute number (Oniguruma)
- \g<+n> call subpattern by relative number (PCRE extension)
- \g'+n' call subpattern by relative number (PCRE extension)
- \g<-n> call subpattern by relative number (PCRE extension)
- \g'-n' call subpattern by relative number (PCRE extension)
- </PRE>
- </P>
- <br><a name="SEC22" href="#TOC1">CONDITIONAL PATTERNS</a><br>
- <P>
- <pre>
- (?(condition)yes-pattern)
- (?(condition)yes-pattern|no-pattern)
- (?(n)... absolute reference condition
- (?(+n)... relative reference condition
- (?(-n)... relative reference condition
- (?(<name>)... named reference condition (Perl)
- (?('name')... named reference condition (Perl)
- (?(name)... named reference condition (PCRE)
- (?(R)... overall recursion condition
- (?(Rn)... specific group recursion condition
- (?(R&name)... specific recursion condition
- (?(DEFINE)... define subpattern for reference
- (?(assert)... assertion condition
- </PRE>
- </P>
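<P>
The absolute and named reference conditions can be sketched as follows, using
Python's <b>re</b> module, which accepts the (?(n)...) and (?(name)...) forms.
The recursion, DEFINE, and assertion conditions are specific to PCRE and are
not shown. The patterns and subjects are made up for the example.
</P>

```python
import re

# Group 1 optionally matches '('. The condition (?(1)\)) demands a
# closing ')' only if group 1 took part in the match.
balanced = re.compile(r"^(\()?\d+(?(1)\))$")
balanced_results = {s: bool(balanced.match(s)) for s in ["12", "(12)", "(12", "12)"]}

# With a no-pattern: a leading '-' requires the suffix " DR",
# and its absence requires " CR".
ledger = re.compile(r"^(?P<sign>-)?\d+(?(sign) DR| CR)$")
ledger_results = {s: bool(ledger.match(s)) for s in ["-5 DR", "5 CR", "5 DR", "-5 CR"]}

print(balanced_results)
# {'12': True, '(12)': True, '(12': False, '12)': False}
print(ledger_results)
# {'-5 DR': True, '5 CR': True, '5 DR': False, '-5 CR': False}
```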
- <br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
- <P>
- The following act as soon as they are reached:
- <pre>
- (*ACCEPT) force successful match
- (*FAIL) force backtrack; synonym (*F)
- (*MARK:NAME) set name to be passed back; synonym (*:NAME)
- </pre>
- The following act only when a subsequent match failure causes a backtrack to
- reach them. They all force a match failure, but they differ in what happens
- afterwards. Those that advance the start-of-match point do so only if the
- pattern is not anchored.
- <pre>
- (*COMMIT) overall failure, no advance of starting point
- (*PRUNE) advance to next starting character
- (*PRUNE:NAME) equivalent to (*MARK:NAME)(*PRUNE)
- (*SKIP) advance to current matching position
- (*SKIP:NAME) advance to position corresponding to an earlier
- (*MARK:NAME); if not found, the (*SKIP) is ignored
- (*THEN) local failure, backtrack to next alternation
- (*THEN:NAME) equivalent to (*MARK:NAME)(*THEN)
- </PRE>
- </P>
- <br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
- <P>
- <pre>
- (?C) callout
- (?Cn) callout with data n
- </PRE>
- </P>
- <br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
- <P>
- <b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
- <b>pcrematching</b>(3), <b>pcre</b>(3).
- </P>
- <br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
- <P>
- Philip Hazel
- <br>
- University Computing Service
- <br>
- Cambridge CB2 3QH, England.
- <br>
- </P>
- <br><a name="SEC27" href="#TOC1">REVISION</a><br>
- <P>
- Last updated: 08 January 2014
- <br>
- Copyright © 1997-2014 University of Cambridge.
- <br>
- <p>
- Return to the <a href="index.html">PCRE index page</a>.
- </p>