<html>
<head>
<title>pcresyntax specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcresyntax man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
<li><a name="TOC2" href="#SEC2">QUOTING</a>
<li><a name="TOC3" href="#SEC3">CHARACTERS</a>
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
<li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
<li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
<li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
<li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
<li><a name="TOC12" href="#SEC12">ALTERNATION</a>
<li><a name="TOC13" href="#SEC13">CAPTURING</a>
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
<li><a name="TOC15" href="#SEC15">COMMENT</a>
<li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC20" href="#SEC20">BACKREFERENCES</a>
<li><a name="TOC21" href="#SEC21">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC22" href="#SEC22">CONDITIONAL PATTERNS</a>
<li><a name="TOC23" href="#SEC23">BACKTRACKING CONTROL</a>
<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
<li><a name="TOC27" href="#SEC27">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P>
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
documentation. This document contains a quick-reference summary of the syntax.
</P>
<br><a name="SEC2" href="#TOC1">QUOTING</a><br>
<P>
<pre>
  \x         where x is non-alphanumeric is a literal x
  \Q...\E    treat enclosed characters as literal
</PRE>
</P>
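<P>
As a rough illustration of quoting, the sketch below uses Python's <b>re</b>
module as a stand-in; it is not PCRE. The backslash form behaves the same way
there, but Python has no \Q...\E, so <b>re.escape()</b> is used as the nearest
equivalent. The sample strings are made up for the example.
</P>

```python
import re

# Python's re module is a stand-in here; it is not PCRE.
# A backslash before a non-alphanumeric character makes it literal.
dot_literal = re.fullmatch(r"a\.b", "a.b") is not None
dot_not_wild = re.fullmatch(r"a\.b", "axb") is None

# Python does not accept \Q...\E. re.escape() plays a similar role by
# backslash-quoting every metacharacter in a string (a substitute, not
# PCRE syntax).
raw = "1+1=2 (really?)"
quoted = re.escape(raw)
literal_match = re.search(quoted, "so 1+1=2 (really?) yes")

print(dot_literal, dot_not_wild, literal_match.group())
```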
<br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
<P>
<pre>
  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any ASCII character
  \e         escape (hex 1B)
  \f         form feed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \0dd       character with octal code 0dd
  \ddd       character with octal code ddd, or backreference
  \o{ddd..}  character with octal code ddd..
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh..
</pre>
Note that \0dd is always an octal code, and that \8 and \9 are the literal
characters "8" and "9".
</P>
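<P>
The hex and octal escapes can be tried out with a short sketch. It assumes
Python's <b>re</b> module as a stand-in (not PCRE) and uses only the forms that
module shares with the table above: \t, \xhh, \0dd and three-digit \ddd.
</P>

```python
import re

# Python's re module is a stand-in here; it is not PCRE.
# \x41 (hex) and \101 (three octal digits) both denote "A"; \t is a tab.
pattern = re.compile(r"\x41\t\101")
hex_and_octal = pattern.fullmatch("A\tA") is not None

# \0dd is always an octal code: \012 is the newline character (hex 0A).
octal_newline = re.fullmatch(r"\012", "\n") is not None

print(hex_and_octal, octal_newline)
```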
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
<P>
<pre>
  .          any character except newline;
               in dotall mode, any character whatsoever
  \C         one data unit, even in UTF mode (best avoided)
  \d         a decimal digit
  \D         a character that is not a decimal digit
  \h         a horizontal white space character
  \H         a character that is not a horizontal white space character
  \N         a character that is not a newline
  \p{<i>xx</i>}     a character with the <i>xx</i> property
  \P{<i>xx</i>}     a character without the <i>xx</i> property
  \R         a newline sequence
  \s         a white space character
  \S         a character that is not a white space character
  \v         a vertical white space character
  \V         a character that is not a vertical white space character
  \w         a "word" character
  \W         a "non-word" character
  \X         a Unicode extended grapheme cluster
</pre>
By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
happening, \s and \w may also match characters with code points in the range
128-255. If the PCRE_UCP option is set, the behaviour of these escape sequences
is changed to use Unicode properties and they match many more characters.
</P>
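<P>
The difference between ASCII-only and Unicode-property matching for \d can be
sketched as follows. Python's <b>re</b> module is used as a stand-in, and its
defaults are the reverse of PCRE's: string patterns are Unicode-aware unless
<b>re.ASCII</b> is given. The flag names belong to Python, not to PCRE.
</P>

```python
import re

# Python's re module is a stand-in here; it is not PCRE.
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, general category Nd

# ASCII-only matching, analogous to PCRE's default behaviour for \d.
ascii_only = re.fullmatch(r"\d", arabic_three, re.ASCII)

# Unicode-aware matching, analogous to PCRE with PCRE_UCP set.
unicode_aware = re.fullmatch(r"\d", arabic_three)

# Outside dotall mode "." does not match a newline; in dotall mode it does.
dot_default = re.fullmatch(r"a.b", "a\nb")
dot_all = re.fullmatch(r"a.b", "a\nb", re.DOTALL)

print(ascii_only is None, unicode_aware is not None)
print(dot_default is None, dot_all is not None)
```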
<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate

  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt

  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark

  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number

  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation

  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol

  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
</PRE>
</P>
<br><a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, VT, FF, CR
  Xuc        Universally-named character: one that can be
               represented by a Universal Character Name
  Xwd        Perl word: property Xan or underscore
</pre>
Perl and POSIX space are now the same. Perl added VT to its space character set
at release 5.18 and PCRE changed at release 8.34.
</P>
<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
<P>
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Bassa_Vah,
Batak,
Bengali,
Bopomofo,
Brahmi,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Caucasian_Albanian,
Chakma,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Duployan,
Egyptian_Hieroglyphs,
Elbasan,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Grantha,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Khojki,
Khudawadi,
Lao,
Latin,
Lepcha,
Limbu,
Linear_A,
Linear_B,
Lisu,
Lycian,
Lydian,
Mahajani,
Malayalam,
Mandaic,
Manichaean,
Meetei_Mayek,
Mende_Kikakui,
Meroitic_Cursive,
Meroitic_Hieroglyphs,
Miao,
Modi,
Mongolian,
Mro,
Myanmar,
Nabataean,
New_Tai_Lue,
Nko,
Ogham,
Ol_Chiki,
Old_Italic,
Old_North_Arabian,
Old_Permic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Oriya,
Osmanya,
Pahawh_Hmong,
Palmyrene,
Pau_Cin_Hau,
Phags_Pa,
Phoenician,
Psalter_Pahlavi,
Rejang,
Runic,
Samaritan,
Saurashtra,
Sharada,
Shavian,
Siddham,
Sinhala,
Sora_Sompeng,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Takri,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Tirhuta,
Ugaritic,
Vai,
Warang_Citi,
Yi.
</P>
<br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
<P>
<pre>
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       white space
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit
</pre>
In PCRE, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE_UCP is set. You can use
\Q...\E inside a character class.
</P>
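<P>
Positive and negative classes with a range written in hex escapes might look
like the sketch below. It assumes Python's <b>re</b> module as a stand-in (not
PCRE). That module does not accept POSIX names such as [[:xdigit:]], so the
last line spells the ASCII set out by hand.
</P>

```python
import re

# Python's re module is a stand-in here; it is not PCRE.
text = "ABCXYZF"

# A range whose endpoints are hex escapes: \x41-\x46 is A-F.
in_range = re.findall(r"[\x41-\x46]", text)

# The negative class matches everything outside A-F.
out_of_range = re.findall(r"[^\x41-\x46]", text)

# Hand-written ASCII equivalent of the POSIX xdigit set.
all_hex = re.fullmatch(r"[0-9A-Fa-f]+", "DeadBeef01") is not None

print(in_range, out_of_range, all_hex)
```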
<br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
<P>
<pre>
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
</PRE>
</P>
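<P>
The greedy and lazy forms can be compared with a small sketch. Python's
<b>re</b> module is assumed as a stand-in (not PCRE). Possessive quantifiers are
left out because older Python versions do not accept them. The subject strings
are invented for the example.
</P>

```python
import re

# Python's re module is a stand-in here; it is not PCRE.
subject = "<a><b>"

# Greedy: .+ takes as much as it can and gives back only what it must.
greedy = re.match(r"<.+>", subject).group()

# Lazy: .+? takes as little as it can.
lazy = re.match(r"<.+?>", subject).group()

# The same distinction applies to counted repeats.
bounded_greedy = re.match(r"a{2,3}", "aaaa").group()
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group()

print(greedy, lazy, bounded_greedy, bounded_lazy)
```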
<br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
<P>
<pre>
  \b          word boundary
  \B          not a word boundary
  ^           start of subject
               also after internal newline in multiline mode
  \A          start of subject
  $           end of subject
               also before newline at end of subject
               also before internal newline in multiline mode
  \Z          end of subject
               also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject
</PRE>
</P>
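<P>
The effect of multiline mode on ^ and $ can be sketched as below, assuming
Python's <b>re</b> module as a stand-in (not PCRE). \Z and \z are left out on
purpose, because Python's \Z matches only at the very end of the subject and so
does not correspond to PCRE's \Z.
</P>

```python
import re

# Python's re module is a stand-in here; it is not PCRE.
subject = "one\ntwo\n"

# Without multiline mode, ^ matches only at the start of the subject.
first_only = re.findall(r"^\w+", subject)

# In multiline mode, ^ also matches after each internal newline.
every_line = re.findall(r"^\w+", subject, re.MULTILINE)

# $ matches before a newline at the very end of the subject ...
before_final_newline = re.search(r"two$", subject) is not None

# ... but before an internal newline only in multiline mode.
internal_default = re.search(r"one$", subject) is None
internal_multiline = re.search(r"one$", subject, re.MULTILINE) is not None

# \b requires a word boundary, so the "t" inside "at" is skipped.
words_with_t = re.findall(r"\bt\w*", "two at ten")

print(first_only, every_line, words_with_t)
```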
<br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
<P>
<pre>
  \K          reset start of match
</pre>
\K is honoured in positive assertions, but ignored in negative ones.
</P>
<br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
<P>
<pre>
  expr|expr|expr...
</PRE>
</P>
<br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
<P>
<pre>
  (...)           capturing group
  (?&#60;name&#62;...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P&#60;name&#62;...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
</PRE>
</P>
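<P>
How numbered, named and non-capturing groups interact might be sketched as
follows. Python's <b>re</b> module is assumed as a stand-in (not PCRE), so only
the Python-style named group and the plain non-capturing group are shown. The
date string is just sample input.
</P>

```python
import re

# Python's re module is a stand-in here; it is not PCRE.
# Group 1 is named "year", the month is non-capturing, the day is group 2.
pattern = re.compile(r"(?P<year>\d{4})-(?:\d{2})-(\d{2})")
m = pattern.fullmatch("2014-01-08")

year_by_name = m.group("year")
year_by_number = m.group(1)   # a named group also has a number
day = m.group(2)              # the non-capturing group used no number
all_groups = m.groups()

print(year_by_name, year_by_number, day, all_groups)
```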
<br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
<P>
<pre>
  (?&#62;...)         atomic, non-capturing group
</PRE>
</P>
<br><a name="SEC15" href="#TOC1">COMMENT</a><br>
<P>
<pre>
  (?#....)        comment (not nestable)
</PRE>
</P>
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
<P>
<pre>
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
</pre>
The following are recognized only at the very start of a pattern or after one
of the newline or \R options with similar syntax. More than one of them may
appear.
<pre>
  (*LIMIT_MATCH=d)      set the match limit to d (decimal number)
  (*LIMIT_RECURSION=d)  set the recursion limit to d (decimal number)
  (*NO_AUTO_POSSESS)    no auto-possessification (PCRE_NO_AUTO_POSSESS)
  (*NO_START_OPT)       no start-match optimization (PCRE_NO_START_OPTIMIZE)
  (*UTF8)               set UTF-8 mode: 8-bit library (PCRE_UTF8)
  (*UTF16)              set UTF-16 mode: 16-bit library (PCRE_UTF16)
  (*UTF32)              set UTF-32 mode: 32-bit library (PCRE_UTF32)
  (*UTF)                set appropriate UTF mode for the library in use
  (*UCP)                set PCRE_UCP (use Unicode properties for \d etc)
</pre>
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
limits set by the caller of pcre_exec(), not increase them.
</P>
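<P>
The inline option letters i, m, s and x can be exercised with a brief sketch.
It assumes Python's <b>re</b> module as a stand-in (not PCRE). That module
accepts these four letters at the start of a pattern, but not (?J), (?U) or the
(*...) settings listed above.
</P>

```python
import re

# Python's re module is a stand-in here; it is not PCRE.
# (?i) caseless
caseless = re.fullmatch(r"(?i)pcre", "PCRE") is not None

# (?s) single line (dotall): "." also matches a newline
dotall = re.fullmatch(r"(?s)a.b", "a\nb") is not None

# (?m) multiline: ^ and $ also match at internal newlines
multiline = re.findall(r"(?m)^\w+$", "one\ntwo")

# (?x) extended: white space is ignored and # starts a comment
extended = re.fullmatch(r"(?x) \d{4} - \d{2}  # year and month", "2014-01") is not None

# Several option letters may be combined in one setting.
combined = re.fullmatch(r"(?is)a.B", "A\nb") is not None

print(caseless, dotall, multiline, extended, combined)
```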
<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
<P>
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
<pre>
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC18" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
<pre>
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC19" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P>
<pre>
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?&#60;=...)        positive look behind
  (?&#60;!...)        negative look behind
</pre>
Each top-level branch of a look behind must be of a fixed length.
</P>
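<P>
The four assertions and the fixed-length rule can be illustrated with the
sketch below, which assumes Python's <b>re</b> module as a stand-in (not PCRE).
Python is stricter than PCRE about look behind: it requires the whole assertion
to be of fixed width, not just each top-level branch. The sample strings are
made up.
</P>

```python
import re

# Python's re module is a stand-in here; it is not PCRE.
prices = "10 USD, 20 EUR, 30 USD"

# Positive look ahead: digits that are followed by " USD".
usd = re.findall(r"\d+(?= USD)", prices)

# Negative look ahead: whole numbers that are not followed by " USD".
not_usd = re.findall(r"\d+\b(?! USD)", prices)

money = "$5 and 7 and $12"

# Positive look behind: digits preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", money)

# Negative look behind: whole numbers not preceded by a dollar sign.
no_dollar = re.findall(r"(?<!\$)\b\d+", money)

# A look behind of variable length is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=a+)b")
    variable_rejected = False
except re.error:
    variable_rejected = True

print(usd, not_usd, after_dollar, no_dollar, variable_rejected)
```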
<br><a name="SEC20" href="#TOC1">BACKREFERENCES</a><br>
<P>
<pre>
  \n              reference by number (can be ambiguous)
  \gn             reference by number
  \g{n}           reference by number
  \g{-n}          relative reference by number
  \k&#60;name&#62;        reference by name (Perl)
  \k'name'        reference by name (Perl)
  \g{name}        reference by name (Perl)
  \k{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
</PRE>
</P>
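<P>
A backreference matches the same text that the referenced group captured. A
minimal sketch, assuming Python's <b>re</b> module as a stand-in (not PCRE),
which accepts the \n and (?P=name) forms but not \g{n} or \k{name} inside a
pattern:
</P>

```python
import re

# Python's re module is a stand-in here; it is not PCRE.
# Reference by number: find a word that is immediately repeated.
doubled = re.search(r"\b(\w+) \1\b", "it is is fine")
repeated_word = doubled.group(1)

# Reference by name: the closing quote must equal the opening quote.
quoted = re.compile(r"(?P<q>['\"])(.*?)(?P=q)")
balanced = quoted.search("say 'hi' now")
unbalanced = quoted.search("say 'hi\" now")

print(repeated_word, balanced.group(2), unbalanced)
```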
<br><a name="SEC21" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P>
<pre>
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P&#62;name)       call subpattern by name (Python)
  \g&#60;name&#62;        call subpattern by name (Oniguruma)
  \g'name'        call subpattern by name (Oniguruma)
  \g&#60;n&#62;           call subpattern by absolute number (Oniguruma)
  \g'n'           call subpattern by absolute number (Oniguruma)
  \g&#60;+n&#62;          call subpattern by relative number (PCRE extension)
  \g'+n'          call subpattern by relative number (PCRE extension)
  \g&#60;-n&#62;          call subpattern by relative number (PCRE extension)
  \g'-n'          call subpattern by relative number (PCRE extension)
</PRE>
</P>
<br><a name="SEC22" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P>
<pre>
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

  (?(n)...        absolute reference condition
  (?(+n)...       relative reference condition
  (?(-n)...       relative reference condition
  (?(&#60;name&#62;)...   named reference condition (Perl)
  (?('name')...   named reference condition (Perl)
  (?(name)...     named reference condition (PCRE)
  (?(R)...        overall recursion condition
  (?(Rn)...       specific group recursion condition
  (?(R&name)...   specific recursion condition
  (?(DEFINE)...   define subpattern for reference
  (?(assert)...   assertion condition
</PRE>
</P>
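<P>
The absolute reference condition, with and without a no-pattern, might be used
as in the sketch below. It assumes Python's <b>re</b> module as a stand-in (not
PCRE), which accepts only the (?(n)...) and (?(name)...) forms. The patterns
themselves are invented for the example.
</P>

```python
import re

# Python's re module is a stand-in here; it is not PCRE.
# yes-pattern only: require ">" exactly when group 1 matched "<".
tag = re.compile(r"(<)?\w+(?(1)>)")
tag_results = [tag.fullmatch(s) is not None
               for s in ("<tag>", "tag", "<tag", "tag>")]

# yes-pattern|no-pattern: ")" if "(" was seen, otherwise ";".
number = re.compile(r"(\()?\d+(?(1)\)|;)")
number_results = [number.fullmatch(s) is not None
                  for s in ("(42)", "42;", "(42;", "42)")]

print(tag_results, number_results)
```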
<br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
The following act as soon as they are reached:
<pre>
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
</pre>
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
<pre>
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                  (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation
  (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
</PRE>
</P>
<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
<P>
<pre>
  (?C)      callout
  (?Cn)     callout with data n
</PRE>
</P>
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
<b>pcrematching</b>(3), <b>pcre</b>(3).
</P>
<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 08 January 2014
<br>
Copyright &copy; 1997-2014 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>