123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540 |
- .TH PCRESYNTAX 3 "08 January 2014" "PCRE 8.35"
- .SH NAME
- PCRE - Perl-compatible regular expressions
- .SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
- .rs
- .sp
- The full syntax and semantics of the regular expressions that are supported by
- PCRE are described in the
- .\" HREF
- \fBpcrepattern\fP
- .\"
- documentation. This document contains a quick-reference summary of the syntax.
- .
- .
- .SH "QUOTING"
- .rs
- .sp
- \ex where x is non-alphanumeric is a literal x
- \eQ...\eE treat enclosed characters as literal
- .
- .
- .SH "CHARACTERS"
- .rs
- .sp
- \ea alarm, that is, the BEL character (hex 07)
- \ecx "control-x", where x is any ASCII character
- \ee escape (hex 1B)
- \ef form feed (hex 0C)
- \en newline (hex 0A)
- \er carriage return (hex 0D)
- \et tab (hex 09)
- \e0dd character with octal code 0dd
- \eddd character with octal code ddd, or backreference
- \eo{ddd..} character with octal code ddd..
- \exhh character with hex code hh
- \ex{hhh..} character with hex code hhh..
- .sp
- Note that \e0dd is always an octal code, and that \e8 and \e9 are the literal
- characters "8" and "9".
- .
- .
- .SH "CHARACTER TYPES"
- .rs
- .sp
- . any character except newline;
- in dotall mode, any character whatsoever
- \eC one data unit, even in UTF mode (best avoided)
- \ed a decimal digit
- \eD a character that is not a decimal digit
- \eh a horizontal white space character
- \eH a character that is not a horizontal white space character
- \eN a character that is not a newline
- \ep{\fIxx\fP} a character with the \fIxx\fP property
- \eP{\fIxx\fP} a character without the \fIxx\fP property
- \eR a newline sequence
- \es a white space character
- \eS a character that is not a white space character
- \ev a vertical white space character
- \eV a character that is not a vertical white space character
- \ew a "word" character
- \eW a "non-word" character
- \eX a Unicode extended grapheme cluster
- .sp
- By default, \ed, \es, and \ew match only ASCII characters, even in UTF-8 mode
- or in the 16- bit and 32-bit libraries. However, if locale-specific matching is
- happening, \es and \ew may also match characters with code points in the range
- 128-255. If the PCRE_UCP option is set, the behaviour of these escape sequences
- is changed to use Unicode properties and they match many more characters.
- .
- .
- .SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
- .rs
- .sp
- C Other
- Cc Control
- Cf Format
- Cn Unassigned
- Co Private use
- Cs Surrogate
- .sp
- L Letter
- Ll Lower case letter
- Lm Modifier letter
- Lo Other letter
- Lt Title case letter
- Lu Upper case letter
- L& Ll, Lu, or Lt
- .sp
- M Mark
- Mc Spacing mark
- Me Enclosing mark
- Mn Non-spacing mark
- .sp
- N Number
- Nd Decimal number
- Nl Letter number
- No Other number
- .sp
- P Punctuation
- Pc Connector punctuation
- Pd Dash punctuation
- Pe Close punctuation
- Pf Final punctuation
- Pi Initial punctuation
- Po Other punctuation
- Ps Open punctuation
- .sp
- S Symbol
- Sc Currency symbol
- Sk Modifier symbol
- Sm Mathematical symbol
- So Other symbol
- .sp
- Z Separator
- Zl Line separator
- Zp Paragraph separator
- Zs Space separator
- .
- .
- .SH "PCRE SPECIAL CATEGORY PROPERTIES FOR \ep and \eP"
- .rs
- .sp
- Xan Alphanumeric: union of properties L and N
- Xps POSIX space: property Z or tab, NL, VT, FF, CR
- Xsp Perl space: property Z or tab, NL, VT, FF, CR
- Xuc Univerally-named character: one that can be
- represented by a Universal Character Name
- Xwd Perl word: property Xan or underscore
- .sp
- Perl and POSIX space are now the same. Perl added VT to its space character set
- at release 5.18 and PCRE changed at release 8.34.
- .
- .
- .SH "SCRIPT NAMES FOR \ep AND \eP"
- .rs
- .sp
- Arabic,
- Armenian,
- Avestan,
- Balinese,
- Bamum,
- Bassa_Vah,
- Batak,
- Bengali,
- Bopomofo,
- Brahmi,
- Braille,
- Buginese,
- Buhid,
- Canadian_Aboriginal,
- Carian,
- Caucasian_Albanian,
- Chakma,
- Cham,
- Cherokee,
- Common,
- Coptic,
- Cuneiform,
- Cypriot,
- Cyrillic,
- Deseret,
- Devanagari,
- Duployan,
- Egyptian_Hieroglyphs,
- Elbasan,
- Ethiopic,
- Georgian,
- Glagolitic,
- Gothic,
- Grantha,
- Greek,
- Gujarati,
- Gurmukhi,
- Han,
- Hangul,
- Hanunoo,
- Hebrew,
- Hiragana,
- Imperial_Aramaic,
- Inherited,
- Inscriptional_Pahlavi,
- Inscriptional_Parthian,
- Javanese,
- Kaithi,
- Kannada,
- Katakana,
- Kayah_Li,
- Kharoshthi,
- Khmer,
- Khojki,
- Khudawadi,
- Lao,
- Latin,
- Lepcha,
- Limbu,
- Linear_A,
- Linear_B,
- Lisu,
- Lycian,
- Lydian,
- Mahajani,
- Malayalam,
- Mandaic,
- Manichaean,
- Meetei_Mayek,
- Mende_Kikakui,
- Meroitic_Cursive,
- Meroitic_Hieroglyphs,
- Miao,
- Modi,
- Mongolian,
- Mro,
- Myanmar,
- Nabataean,
- New_Tai_Lue,
- Nko,
- Ogham,
- Ol_Chiki,
- Old_Italic,
- Old_North_Arabian,
- Old_Permic,
- Old_Persian,
- Old_South_Arabian,
- Old_Turkic,
- Oriya,
- Osmanya,
- Pahawh_Hmong,
- Palmyrene,
- Pau_Cin_Hau,
- Phags_Pa,
- Phoenician,
- Psalter_Pahlavi,
- Rejang,
- Runic,
- Samaritan,
- Saurashtra,
- Sharada,
- Shavian,
- Siddham,
- Sinhala,
- Sora_Sompeng,
- Sundanese,
- Syloti_Nagri,
- Syriac,
- Tagalog,
- Tagbanwa,
- Tai_Le,
- Tai_Tham,
- Tai_Viet,
- Takri,
- Tamil,
- Telugu,
- Thaana,
- Thai,
- Tibetan,
- Tifinagh,
- Tirhuta,
- Ugaritic,
- Vai,
- Warang_Citi,
- Yi.
- .
- .
- .SH "CHARACTER CLASSES"
- .rs
- .sp
- [...] positive character class
- [^...] negative character class
- [x-y] range (can be used for hex characters)
- [[:xxx:]] positive POSIX named set
- [[:^xxx:]] negative POSIX named set
- .sp
- alnum alphanumeric
- alpha alphabetic
- ascii 0-127
- blank space or tab
- cntrl control character
- digit decimal digit
- graph printing, excluding space
- lower lower case letter
- print printing, including space
- punct printing, excluding alphanumeric
- space white space
- upper upper case letter
- word same as \ew
- xdigit hexadecimal digit
- .sp
- In PCRE, POSIX character set names recognize only ASCII characters by default,
- but some of them use Unicode properties if PCRE_UCP is set. You can use
- \eQ...\eE inside a character class.
- .
- .
- .SH "QUANTIFIERS"
- .rs
- .sp
- ? 0 or 1, greedy
- ?+ 0 or 1, possessive
- ?? 0 or 1, lazy
- * 0 or more, greedy
- *+ 0 or more, possessive
- *? 0 or more, lazy
- + 1 or more, greedy
- ++ 1 or more, possessive
- +? 1 or more, lazy
- {n} exactly n
- {n,m} at least n, no more than m, greedy
- {n,m}+ at least n, no more than m, possessive
- {n,m}? at least n, no more than m, lazy
- {n,} n or more, greedy
- {n,}+ n or more, possessive
- {n,}? n or more, lazy
- .
- .
- .SH "ANCHORS AND SIMPLE ASSERTIONS"
- .rs
- .sp
- \eb word boundary
- \eB not a word boundary
- ^ start of subject
- also after internal newline in multiline mode
- \eA start of subject
- $ end of subject
- also before newline at end of subject
- also before internal newline in multiline mode
- \eZ end of subject
- also before newline at end of subject
- \ez end of subject
- \eG first matching position in subject
- .
- .
- .SH "MATCH POINT RESET"
- .rs
- .sp
- \eK reset start of match
- .sp
- \eK is honoured in positive assertions, but ignored in negative ones.
- .
- .
- .SH "ALTERNATION"
- .rs
- .sp
- expr|expr|expr...
- .
- .
- .SH "CAPTURING"
- .rs
- .sp
- (...) capturing group
- (?<name>...) named capturing group (Perl)
- (?'name'...) named capturing group (Perl)
- (?P<name>...) named capturing group (Python)
- (?:...) non-capturing group
- (?|...) non-capturing group; reset group numbers for
- capturing groups in each alternative
- .
- .
- .SH "ATOMIC GROUPS"
- .rs
- .sp
- (?>...) atomic, non-capturing group
- .
- .
- .
- .
- .SH "COMMENT"
- .rs
- .sp
- (?#....) comment (not nestable)
- .
- .
- .SH "OPTION SETTING"
- .rs
- .sp
- (?i) caseless
- (?J) allow duplicate names
- (?m) multiline
- (?s) single line (dotall)
- (?U) default ungreedy (lazy)
- (?x) extended (ignore white space)
- (?-...) unset option(s)
- .sp
- The following are recognized only at the very start of a pattern or after one
- of the newline or \eR options with similar syntax. More than one of them may
- appear.
- .sp
- (*LIMIT_MATCH=d) set the match limit to d (decimal number)
- (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
- (*NO_AUTO_POSSESS) no auto-possessification (PCRE_NO_AUTO_POSSESS)
- (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
- (*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8)
- (*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16)
- (*UTF32) set UTF-32 mode: 32-bit library (PCRE_UTF32)
- (*UTF) set appropriate UTF mode for the library in use
- (*UCP) set PCRE_UCP (use Unicode properties for \ed etc)
- .sp
- Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
- limits set by the caller of pcre_exec(), not increase them.
- .
- .
- .SH "NEWLINE CONVENTION"
- .rs
- .sp
- These are recognized only at the very start of the pattern or after option
- settings with a similar syntax.
- .sp
- (*CR) carriage return only
- (*LF) linefeed only
- (*CRLF) carriage return followed by linefeed
- (*ANYCRLF) all three of the above
- (*ANY) any Unicode newline sequence
- .
- .
- .SH "WHAT \eR MATCHES"
- .rs
- .sp
- These are recognized only at the very start of the pattern or after option
- setting with a similar syntax.
- .sp
- (*BSR_ANYCRLF) CR, LF, or CRLF
- (*BSR_UNICODE) any Unicode newline sequence
- .
- .
- .SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
- .rs
- .sp
- (?=...) positive look ahead
- (?!...) negative look ahead
- (?<=...) positive look behind
- (?<!...) negative look behind
- .sp
- Each top-level branch of a look behind must be of a fixed length.
- .
- .
- .SH "BACKREFERENCES"
- .rs
- .sp
- \en reference by number (can be ambiguous)
- \egn reference by number
- \eg{n} reference by number
- \eg{-n} relative reference by number
- \ek<name> reference by name (Perl)
- \ek'name' reference by name (Perl)
- \eg{name} reference by name (Perl)
- \ek{name} reference by name (.NET)
- (?P=name) reference by name (Python)
- .
- .
- .SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
- .rs
- .sp
- (?R) recurse whole pattern
- (?n) call subpattern by absolute number
- (?+n) call subpattern by relative number
- (?-n) call subpattern by relative number
- (?&name) call subpattern by name (Perl)
- (?P>name) call subpattern by name (Python)
- \eg<name> call subpattern by name (Oniguruma)
- \eg'name' call subpattern by name (Oniguruma)
- \eg<n> call subpattern by absolute number (Oniguruma)
- \eg'n' call subpattern by absolute number (Oniguruma)
- \eg<+n> call subpattern by relative number (PCRE extension)
- \eg'+n' call subpattern by relative number (PCRE extension)
- \eg<-n> call subpattern by relative number (PCRE extension)
- \eg'-n' call subpattern by relative number (PCRE extension)
- .
- .
- .SH "CONDITIONAL PATTERNS"
- .rs
- .sp
- (?(condition)yes-pattern)
- (?(condition)yes-pattern|no-pattern)
- .sp
- (?(n)... absolute reference condition
- (?(+n)... relative reference condition
- (?(-n)... relative reference condition
- (?(<name>)... named reference condition (Perl)
- (?('name')... named reference condition (Perl)
- (?(name)... named reference condition (PCRE)
- (?(R)... overall recursion condition
- (?(Rn)... specific group recursion condition
- (?(R&name)... specific recursion condition
- (?(DEFINE)... define subpattern for reference
- (?(assert)... assertion condition
- .
- .
- .SH "BACKTRACKING CONTROL"
- .rs
- .sp
- The following act immediately they are reached:
- .sp
- (*ACCEPT) force successful match
- (*FAIL) force backtrack; synonym (*F)
- (*MARK:NAME) set name to be passed back; synonym (*:NAME)
- .sp
- The following act only when a subsequent match failure causes a backtrack to
- reach them. They all force a match failure, but they differ in what happens
- afterwards. Those that advance the start-of-match point do so only if the
- pattern is not anchored.
- .sp
- (*COMMIT) overall failure, no advance of starting point
- (*PRUNE) advance to next starting character
- (*PRUNE:NAME) equivalent to (*MARK:NAME)(*PRUNE)
- (*SKIP) advance to current matching position
- (*SKIP:NAME) advance to position corresponding to an earlier
- (*MARK:NAME); if not found, the (*SKIP) is ignored
- (*THEN) local failure, backtrack to next alternation
- (*THEN:NAME) equivalent to (*MARK:NAME)(*THEN)
- .
- .
- .SH "CALLOUTS"
- .rs
- .sp
- (?C) callout
- (?Cn) callout with data n
- .
- .
- .SH "SEE ALSO"
- .rs
- .sp
- \fBpcrepattern\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3),
- \fBpcrematching\fP(3), \fBpcre\fP(3).
- .
- .
- .SH AUTHOR
- .rs
- .sp
- .nf
- Philip Hazel
- University Computing Service
- Cambridge CB2 3QH, England.
- .fi
- .
- .
- .SH REVISION
- .rs
- .sp
- .nf
- Last updated: 08 January 2014
- Copyright (c) 1997-2014 University of Cambridge.
- .fi
|