
  1. .TH PCRESYNTAX 3 "08 January 2014" "PCRE 8.35"
  2. .SH NAME
  3. PCRE - Perl-compatible regular expressions
  4. .SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
  5. .rs
  6. .sp
  7. The full syntax and semantics of the regular expressions that are supported by
  8. PCRE are described in the
  9. .\" HREF
  10. \fBpcrepattern\fP
  11. .\"
  12. documentation. This document contains a quick-reference summary of the syntax.
  13. .
  14. .
  15. .SH "QUOTING"
  16. .rs
  17. .sp
  18. \ex where x is non-alphanumeric is a literal x
  19. \eQ...\eE treat enclosed characters as literal
  20. .
  21. .
  22. .SH "CHARACTERS"
  23. .rs
  24. .sp
  25. \ea alarm, that is, the BEL character (hex 07)
  26. \ecx "control-x", where x is any ASCII character
  27. \ee escape (hex 1B)
  28. \ef form feed (hex 0C)
  29. \en newline (hex 0A)
  30. \er carriage return (hex 0D)
  31. \et tab (hex 09)
  32. \e0dd character with octal code 0dd
  33. \eddd character with octal code ddd, or backreference
  34. \eo{ddd..} character with octal code ddd..
  35. \exhh character with hex code hh
  36. \ex{hhh..} character with hex code hhh..
  37. .sp
  38. Note that \e0dd is always an octal code, and that \e8 and \e9 are the literal
  39. characters "8" and "9".
  40. .
  41. .
  42. .SH "CHARACTER TYPES"
  43. .rs
  44. .sp
  45. . any character except newline;
  46. in dotall mode, any character whatsoever
  47. \eC one data unit, even in UTF mode (best avoided)
  48. \ed a decimal digit
  49. \eD a character that is not a decimal digit
  50. \eh a horizontal white space character
  51. \eH a character that is not a horizontal white space character
  52. \eN a character that is not a newline
  53. \ep{\fIxx\fP} a character with the \fIxx\fP property
  54. \eP{\fIxx\fP} a character without the \fIxx\fP property
  55. \eR a newline sequence
  56. \es a white space character
  57. \eS a character that is not a white space character
  58. \ev a vertical white space character
  59. \eV a character that is not a vertical white space character
  60. \ew a "word" character
  61. \eW a "non-word" character
  62. \eX a Unicode extended grapheme cluster
  63. .sp
  64. By default, \ed, \es, and \ew match only ASCII characters, even in UTF-8 mode
  65. or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
  66. happening, \es and \ew may also match characters with code points in the range
  67. 128-255. If the PCRE_UCP option is set, the behaviour of these escape sequences
  68. is changed to use Unicode properties and they match many more characters.
  69. .
  70. .
  71. .SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
  72. .rs
  73. .sp
  74. C Other
  75. Cc Control
  76. Cf Format
  77. Cn Unassigned
  78. Co Private use
  79. Cs Surrogate
  80. .sp
  81. L Letter
  82. Ll Lower case letter
  83. Lm Modifier letter
  84. Lo Other letter
  85. Lt Title case letter
  86. Lu Upper case letter
  87. L& Ll, Lu, or Lt
  88. .sp
  89. M Mark
  90. Mc Spacing mark
  91. Me Enclosing mark
  92. Mn Non-spacing mark
  93. .sp
  94. N Number
  95. Nd Decimal number
  96. Nl Letter number
  97. No Other number
  98. .sp
  99. P Punctuation
  100. Pc Connector punctuation
  101. Pd Dash punctuation
  102. Pe Close punctuation
  103. Pf Final punctuation
  104. Pi Initial punctuation
  105. Po Other punctuation
  106. Ps Open punctuation
  107. .sp
  108. S Symbol
  109. Sc Currency symbol
  110. Sk Modifier symbol
  111. Sm Mathematical symbol
  112. So Other symbol
  113. .sp
  114. Z Separator
  115. Zl Line separator
  116. Zp Paragraph separator
  117. Zs Space separator
  118. .
  119. .
  120. .SH "PCRE SPECIAL CATEGORY PROPERTIES FOR \ep and \eP"
  121. .rs
  122. .sp
  123. Xan Alphanumeric: union of properties L and N
  124. Xps POSIX space: property Z or tab, NL, VT, FF, CR
  125. Xsp Perl space: property Z or tab, NL, VT, FF, CR
  126. Xuc Universally-named character: one that can be
  127. represented by a Universal Character Name
  128. Xwd Perl word: property Xan or underscore
  129. .sp
  130. Perl and POSIX space are now the same. Perl added VT to its space character set
  131. at release 5.18 and PCRE changed at release 8.34.
  132. .
  133. .
  134. .SH "SCRIPT NAMES FOR \ep AND \eP"
  135. .rs
  136. .sp
  137. Arabic,
  138. Armenian,
  139. Avestan,
  140. Balinese,
  141. Bamum,
  142. Bassa_Vah,
  143. Batak,
  144. Bengali,
  145. Bopomofo,
  146. Brahmi,
  147. Braille,
  148. Buginese,
  149. Buhid,
  150. Canadian_Aboriginal,
  151. Carian,
  152. Caucasian_Albanian,
  153. Chakma,
  154. Cham,
  155. Cherokee,
  156. Common,
  157. Coptic,
  158. Cuneiform,
  159. Cypriot,
  160. Cyrillic,
  161. Deseret,
  162. Devanagari,
  163. Duployan,
  164. Egyptian_Hieroglyphs,
  165. Elbasan,
  166. Ethiopic,
  167. Georgian,
  168. Glagolitic,
  169. Gothic,
  170. Grantha,
  171. Greek,
  172. Gujarati,
  173. Gurmukhi,
  174. Han,
  175. Hangul,
  176. Hanunoo,
  177. Hebrew,
  178. Hiragana,
  179. Imperial_Aramaic,
  180. Inherited,
  181. Inscriptional_Pahlavi,
  182. Inscriptional_Parthian,
  183. Javanese,
  184. Kaithi,
  185. Kannada,
  186. Katakana,
  187. Kayah_Li,
  188. Kharoshthi,
  189. Khmer,
  190. Khojki,
  191. Khudawadi,
  192. Lao,
  193. Latin,
  194. Lepcha,
  195. Limbu,
  196. Linear_A,
  197. Linear_B,
  198. Lisu,
  199. Lycian,
  200. Lydian,
  201. Mahajani,
  202. Malayalam,
  203. Mandaic,
  204. Manichaean,
  205. Meetei_Mayek,
  206. Mende_Kikakui,
  207. Meroitic_Cursive,
  208. Meroitic_Hieroglyphs,
  209. Miao,
  210. Modi,
  211. Mongolian,
  212. Mro,
  213. Myanmar,
  214. Nabataean,
  215. New_Tai_Lue,
  216. Nko,
  217. Ogham,
  218. Ol_Chiki,
  219. Old_Italic,
  220. Old_North_Arabian,
  221. Old_Permic,
  222. Old_Persian,
  223. Old_South_Arabian,
  224. Old_Turkic,
  225. Oriya,
  226. Osmanya,
  227. Pahawh_Hmong,
  228. Palmyrene,
  229. Pau_Cin_Hau,
  230. Phags_Pa,
  231. Phoenician,
  232. Psalter_Pahlavi,
  233. Rejang,
  234. Runic,
  235. Samaritan,
  236. Saurashtra,
  237. Sharada,
  238. Shavian,
  239. Siddham,
  240. Sinhala,
  241. Sora_Sompeng,
  242. Sundanese,
  243. Syloti_Nagri,
  244. Syriac,
  245. Tagalog,
  246. Tagbanwa,
  247. Tai_Le,
  248. Tai_Tham,
  249. Tai_Viet,
  250. Takri,
  251. Tamil,
  252. Telugu,
  253. Thaana,
  254. Thai,
  255. Tibetan,
  256. Tifinagh,
  257. Tirhuta,
  258. Ugaritic,
  259. Vai,
  260. Warang_Citi,
  261. Yi.
  262. .
  263. .
  264. .SH "CHARACTER CLASSES"
  265. .rs
  266. .sp
  267. [...] positive character class
  268. [^...] negative character class
  269. [x-y] range (can be used for hex characters)
  270. [[:xxx:]] positive POSIX named set
  271. [[:^xxx:]] negative POSIX named set
  272. .sp
  273. alnum alphanumeric
  274. alpha alphabetic
  275. ascii 0-127
  276. blank space or tab
  277. cntrl control character
  278. digit decimal digit
  279. graph printing, excluding space
  280. lower lower case letter
  281. print printing, including space
  282. punct printing, excluding alphanumeric
  283. space white space
  284. upper upper case letter
  285. word same as \ew
  286. xdigit hexadecimal digit
  287. .sp
  288. In PCRE, POSIX character set names recognize only ASCII characters by default,
  289. but some of them use Unicode properties if PCRE_UCP is set. You can use
  290. \eQ...\eE inside a character class.
  291. .
  292. .
  293. .SH "QUANTIFIERS"
  294. .rs
  295. .sp
  296. ? 0 or 1, greedy
  297. ?+ 0 or 1, possessive
  298. ?? 0 or 1, lazy
  299. * 0 or more, greedy
  300. *+ 0 or more, possessive
  301. *? 0 or more, lazy
  302. + 1 or more, greedy
  303. ++ 1 or more, possessive
  304. +? 1 or more, lazy
  305. {n} exactly n
  306. {n,m} at least n, no more than m, greedy
  307. {n,m}+ at least n, no more than m, possessive
  308. {n,m}? at least n, no more than m, lazy
  309. {n,} n or more, greedy
  310. {n,}+ n or more, possessive
  311. {n,}? n or more, lazy
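.sp
The difference between greedy and lazy repetition is easiest to see on a small
subject. The following is a minimal sketch that uses Python's standard re
module as a stand-in: it is a different engine from PCRE, and the sketch only
assumes that it treats these particular quantifiers in the same way. The
subject strings and variable names are illustrative. Possessive forms are left
out because older Python versions do not accept them.

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as it can, then gives back only what it must.
greedy = re.match(r"<.+>", subject).group()

# Lazy: .+? takes as little as it can, extending only when forced to.
lazy = re.match(r"<.+?>", subject).group()

# {n,m} bounds the repeat count; greedy by default, lazy with a trailing ?.
bounded_greedy = re.match(r"a{2,4}", "aaaaa").group()
bounded_lazy = re.match(r"a{2,4}?", "aaaaa").group()

print(greedy, lazy, bounded_greedy, bounded_lazy)
```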
  312. .
  313. .
  314. .SH "ANCHORS AND SIMPLE ASSERTIONS"
  315. .rs
  316. .sp
  317. \eb word boundary
  318. \eB not a word boundary
  319. ^ start of subject
  320. also after internal newline in multiline mode
  321. \eA start of subject
  322. $ end of subject
  323. also before newline at end of subject
  324. also before internal newline in multiline mode
  325. \eZ end of subject
  326. also before newline at end of subject
  327. \ez end of subject
  328. \eG first matching position in subject
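.sp
As a hedged illustration of the circumflex, dollar, and word-boundary rows
above, the sketch below uses Python's re module rather than PCRE itself. It
assumes only that the two engines agree on these three assertions; the
upper-case Z escape is deliberately avoided because Python gives it a
different meaning. All subject strings are made up for the example.

```python
import re

subject = "one\ntwo\n"

# Default mode: ^ matches only at the very start of the subject.
starts_default = re.findall(r"^\w+", subject)

# Multiline mode: ^ also matches after an internal newline.
starts_multiline = re.findall(r"^\w+", subject, re.MULTILINE)

# $ matches at the end of the subject and also before a final newline.
end_match = re.search(r"two$", subject)

# \b matches between a word character and a non-word character,
# so "cat" inside "concat" is not reported.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(starts_default, starts_multiline, end_match is not None, whole_words)
```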
  329. .
  330. .
  331. .SH "MATCH POINT RESET"
  332. .rs
  333. .sp
  334. \eK reset start of match
  335. .sp
  336. \eK is honoured in positive assertions, but ignored in negative ones.
  337. .
  338. .
  339. .SH "ALTERNATION"
  340. .rs
  341. .sp
  342. expr|expr|expr...
  343. .
  344. .
  345. .SH "CAPTURING"
  346. .rs
  347. .sp
  348. (...) capturing group
  349. (?<name>...) named capturing group (Perl)
  350. (?'name'...) named capturing group (Perl)
  351. (?P<name>...) named capturing group (Python)
  352. (?:...) non-capturing group
  353. (?|...) non-capturing group; reset group numbers for
  354. capturing groups in each alternative
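.sp
A short sketch of named and non-capturing groups follows. It uses the
Python-style named group from the table above, run through Python's re module;
that module is not PCRE, so treat this as an illustration of the syntax rather
than of PCRE's API. The pattern and group names are hypothetical.

```python
import re

# Two named capturing groups, written in the Python style.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.match("2014-01")

# Named groups are still numbered, so both access styles work.
by_name = (m.group("year"), m.group("month"))
by_number = (m.group(1), m.group(2))

# (?:...) groups without capturing, so only (c) receives a number.
m2 = re.match(r"(?:ab)+(c)", "ababc")
group_count = m2.re.groups
captured = m2.group(1)

print(by_name, by_number, group_count, captured)
```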
  355. .
  356. .
  357. .SH "ATOMIC GROUPS"
  358. .rs
  359. .sp
  360. (?>...) atomic, non-capturing group
  361. .
  362. .
  363. .
  364. .
  365. .SH "COMMENT"
  366. .rs
  367. .sp
  368. (?#....) comment (not nestable)
  369. .
  370. .
  371. .SH "OPTION SETTING"
  372. .rs
  373. .sp
  374. (?i) caseless
  375. (?J) allow duplicate names
  376. (?m) multiline
  377. (?s) single line (dotall)
  378. (?U) default ungreedy (lazy)
  379. (?x) extended (ignore white space)
  380. (?-...) unset option(s)
  381. .sp
  382. The following are recognized only at the very start of a pattern or after one
  383. of the newline or \eR options with similar syntax. More than one of them may
  384. appear.
  385. .sp
  386. (*LIMIT_MATCH=d) set the match limit to d (decimal number)
  387. (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
  388. (*NO_AUTO_POSSESS) no auto-possessification (PCRE_NO_AUTO_POSSESS)
  389. (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
  390. (*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8)
  391. (*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16)
  392. (*UTF32) set UTF-32 mode: 32-bit library (PCRE_UTF32)
  393. (*UTF) set appropriate UTF mode for the library in use
  394. (*UCP) set PCRE_UCP (use Unicode properties for \ed etc)
  395. .sp
  396. Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
  397. limits set by the caller of pcre_exec(), not increase them.
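.sp
The single-letter options at the top of this section can be tried out with the
sketch below. It runs on Python's re module, which is assumed here to accept
the i, s, and x settings with the same meaning when they appear at the start
of a pattern; the (*...) items are PCRE-specific and are not shown. The
patterns are illustrative only.

```python
import re

# (?i) makes the match caseless.
caseless = re.fullmatch(r"(?i)pcre", "PcRe") is not None

# Without (?s) the dot does not match a newline; with it, the dot matches anything.
dot_default = re.match(r"a.b", "a\nb") is not None
dot_all = re.match(r"(?s)a.b", "a\nb") is not None

# (?x) ignores unescaped white space in the pattern, so it can be spaced out.
extended = re.fullmatch(r"(?x) \d+ \. \d+", "8.35") is not None

print(caseless, dot_default, dot_all, extended)
```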
  398. .
  399. .
  400. .SH "NEWLINE CONVENTION"
  401. .rs
  402. .sp
  403. These are recognized only at the very start of the pattern or after option
  404. settings with a similar syntax.
  405. .sp
  406. (*CR) carriage return only
  407. (*LF) linefeed only
  408. (*CRLF) carriage return followed by linefeed
  409. (*ANYCRLF) all three of the above
  410. (*ANY) any Unicode newline sequence
  411. .
  412. .
  413. .SH "WHAT \eR MATCHES"
  414. .rs
  415. .sp
  416. These are recognized only at the very start of the pattern or after option
  417. settings with a similar syntax.
  418. .sp
  419. (*BSR_ANYCRLF) CR, LF, or CRLF
  420. (*BSR_UNICODE) any Unicode newline sequence
  421. .
  422. .
  423. .SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
  424. .rs
  425. .sp
  426. (?=...) positive look ahead
  427. (?!...) negative look ahead
  428. (?<=...) positive look behind
  429. (?<!...) negative look behind
  430. .sp
  431. Each top-level branch of a look behind must be of a fixed length.
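.sp
The four assertions can be sketched as follows, again using Python's re module
as an assumed stand-in for PCRE. Python's fixed-length rule for look behind is
stricter than the one stated above, so the sketch only shows a case that is
of variable length under either rule. Subject strings and names are invented
for the example.

```python
import re

prices = "cost $42, qty 7, fee $5"

# Positive look behind: digits preceded by a dollar sign (the sign is not consumed).
dollar_amounts = re.findall(r"(?<=\$)\d+", prices)

# Negative look behind: digits not preceded by a dollar sign or another digit.
plain_numbers = re.findall(r"(?<![$\d])\d+", prices)

# Positive and negative look ahead on the same subject.
followed_by_bar = [m.start() for m in re.finditer(r"foo(?=bar)", "foobar foobaz")]
not_followed_by_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# A look behind whose branch has no fixed length is rejected at compile time.
try:
    re.compile(r"(?<=a+)b")
    variable_length_accepted = True
except re.error:
    variable_length_accepted = False

print(dollar_amounts, plain_numbers, followed_by_bar, not_followed_by_bar)
```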
  432. .
  433. .
  434. .SH "BACKREFERENCES"
  435. .rs
  436. .sp
  437. \en reference by number (can be ambiguous)
  438. \egn reference by number
  439. \eg{n} reference by number
  440. \eg{-n} relative reference by number
  441. \ek<name> reference by name (Perl)
  442. \ek'name' reference by name (Perl)
  443. \eg{name} reference by name (Perl)
  444. \ek{name} reference by name (.NET)
  445. (?P=name) reference by name (Python)
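.sp
A minimal sketch of a numbered and a named backreference is given below. It
uses only the two forms from the table that Python's re module also accepts
(the plain numbered form and the Python-style named form); the other forms are
not assumed to work outside PCRE. The subjects and the group name q are
illustrative.

```python
import re

# \1 must match the same text that group 1 matched: here, a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it was the the best").group(1)

# (?P=q) must match the same quote character that the group named q matched,
# so a string opened with ' and closed with " is not reported.
quoted = re.compile(r"""(?P<q>['"]).*?(?P=q)""")
subject = "'hi' \"bye\" 'oops\""
matches = [m.group(0) for m in quoted.finditer(subject)]

print(doubled, matches)
```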
  446. .
  447. .
  448. .SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
  449. .rs
  450. .sp
  451. (?R) recurse whole pattern
  452. (?n) call subpattern by absolute number
  453. (?+n) call subpattern by relative number
  454. (?-n) call subpattern by relative number
  455. (?&name) call subpattern by name (Perl)
  456. (?P>name) call subpattern by name (Python)
  457. \eg<name> call subpattern by name (Oniguruma)
  458. \eg'name' call subpattern by name (Oniguruma)
  459. \eg<n> call subpattern by absolute number (Oniguruma)
  460. \eg'n' call subpattern by absolute number (Oniguruma)
  461. \eg<+n> call subpattern by relative number (PCRE extension)
  462. \eg'+n' call subpattern by relative number (PCRE extension)
  463. \eg<-n> call subpattern by relative number (PCRE extension)
  464. \eg'-n' call subpattern by relative number (PCRE extension)
  465. .
  466. .
  467. .SH "CONDITIONAL PATTERNS"
  468. .rs
  469. .sp
  470. (?(condition)yes-pattern)
  471. (?(condition)yes-pattern|no-pattern)
  472. .sp
  473. (?(n)... absolute reference condition
  474. (?(+n)... relative reference condition
  475. (?(-n)... relative reference condition
  476. (?(<name>)... named reference condition (Perl)
  477. (?('name')... named reference condition (Perl)
  478. (?(name)... named reference condition (PCRE)
  479. (?(R)... overall recursion condition
  480. (?(Rn)... specific group recursion condition
  481. (?(R&name)... specific recursion condition
  482. (?(DEFINE)... define subpattern for reference
  483. (?(assert)... assertion condition
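.sp
The absolute reference condition can be illustrated with the sketch below,
which assumes that Python's re module handles this one form in the same way as
PCRE; the recursion, DEFINE, and assertion conditions are not covered. The
patterns are hypothetical examples, not taken from the PCRE documentation.

```python
import re

# Group 1 optionally matches "<". The condition (?(1)>) demands a closing ">"
# only when group 1 took part in the match.
tag = re.compile(r"^(<)?\w+(?(1)>)$")
tag_results = {s: tag.match(s) is not None for s in ["<abc>", "abc", "<abc", "abc>"]}

# With both branches: "b" is required after a matched "a", otherwise "c".
either = re.compile(r"^(a)?(?(1)b|c)$")
either_results = {s: either.match(s) is not None for s in ["ab", "c", "ac", "b"]}

print(tag_results)
print(either_results)
```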
  484. .
  485. .
  486. .SH "BACKTRACKING CONTROL"
  487. .rs
  488. .sp
  489. The following act immediately when they are reached:
  490. .sp
  491. (*ACCEPT) force successful match
  492. (*FAIL) force backtrack; synonym (*F)
  493. (*MARK:NAME) set name to be passed back; synonym (*:NAME)
  494. .sp
  495. The following act only when a subsequent match failure causes a backtrack to
  496. reach them. They all force a match failure, but they differ in what happens
  497. afterwards. Those that advance the start-of-match point do so only if the
  498. pattern is not anchored.
  499. .sp
  500. (*COMMIT) overall failure, no advance of starting point
  501. (*PRUNE) advance to next starting character
  502. (*PRUNE:NAME) equivalent to (*MARK:NAME)(*PRUNE)
  503. (*SKIP) advance to current matching position
  504. (*SKIP:NAME) advance to position corresponding to an earlier
  505. (*MARK:NAME); if not found, the (*SKIP) is ignored
  506. (*THEN) local failure, backtrack to next alternation
  507. (*THEN:NAME) equivalent to (*MARK:NAME)(*THEN)
  508. .
  509. .
  510. .SH "CALLOUTS"
  511. .rs
  512. .sp
  513. (?C) callout
  514. (?Cn) callout with data n
  515. .
  516. .
  517. .SH "SEE ALSO"
  518. .rs
  519. .sp
  520. \fBpcrepattern\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3),
  521. \fBpcrematching\fP(3), \fBpcre\fP(3).
  522. .
  523. .
  524. .SH AUTHOR
  525. .rs
  526. .sp
  527. .nf
  528. Philip Hazel
  529. University Computing Service
  530. Cambridge CB2 3QH, England.
  531. .fi
  532. .
  533. .
  534. .SH REVISION
  535. .rs
  536. .sp
  537. .nf
  538. Last updated: 08 January 2014
  539. Copyright (c) 1997-2014 University of Cambridge.
  540. .fi