- Oniguruma Regular Expressions Version 5.9.1 2007/09/05
- syntax: ONIG_SYNTAX_RUBY (default)
- 1. Syntax elements
- \ escape (enable or disable meta character meaning)
- | alternation
- (...) group
- [...] character class
- 2. Characters
- \t horizontal tab (0x09)
- \v vertical tab (0x0B)
- \n newline (0x0A)
- \r return (0x0D)
- \b backspace (0x08)
- \f form feed (0x0C)
- \a bell (0x07)
- \e escape (0x1B)
- \nnn octal char (encoded byte value)
- \xHH hexadecimal char (encoded byte value)
- \x{7HHHHHHH} wide hexadecimal char (character code point value)
- \cx control char (character code point value)
- \C-x control char (character code point value)
- \M-x meta (x|0x80) (character code point value)
- \M-\C-x meta control char (character code point value)
- (* \b is effective in character class [...] only)
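The notations above can be tried from Ruby. This is a minimal sketch, assuming Ruby's built-in Regexp (whose engine descends from Oniguruma) follows the rules listed here; the constant names are illustrative only.

```ruby
# \xHH and \nnn name a character by its encoded byte value.
TAB_HEX   = /\x09/
TAB_OCTAL = /\011/

# Inside a character class \b means backspace (0x08);
# outside a class it is the word-boundary anchor.
BACKSPACE     = /[\b]/
WORD_BOUNDARY = /\bcat/

p TAB_HEX =~ "a\tb"
p TAB_OCTAL =~ "a\tb"
p BACKSPACE =~ "a\x08b"
p WORD_BOUNDARY =~ "a cat"
p WORD_BOUNDARY =~ "concat"
```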
- 3. Character types
- . any character (except newline)
- \w word character
- Not Unicode:
- alphanumeric, "_" and multibyte char.
- Unicode:
- General_Category -- (Letter|Mark|Number|Connector_Punctuation)
- \W non word char
- \s whitespace char
- Not Unicode:
- \t, \n, \v, \f, \r, \x20
- Unicode:
- 0009, 000A, 000B, 000C, 000D, 0085(NEL),
- General_Category -- Line_Separator
- -- Paragraph_Separator
- -- Space_Separator
- \S non whitespace char
- \d decimal digit char
- Unicode: General_Category -- Decimal_Number
- \D non decimal digit char
- \h hexadecimal digit char [0-9a-fA-F]
- \H non hexadecimal digit char
- Character Property
- * \p{property-name}
- * \p{^property-name} (negative)
- * \P{property-name} (negative)
- property-name:
- + works on all encodings
- Alnum, Alpha, Blank, Cntrl, Digit, Graph, Lower,
- Print, Punct, Space, Upper, XDigit, Word, ASCII,
- + works on EUC_JP, Shift_JIS
- Hiragana, Katakana
- + works on UTF8, UTF16, UTF32
- Any, Assigned, C, Cc, Cf, Cn, Co, Cs, L, Ll, Lm, Lo, Lt, Lu,
- M, Mc, Me, Mn, N, Nd, Nl, No, P, Pc, Pd, Pe, Pf, Pi, Po, Ps,
- S, Sc, Sk, Sm, So, Z, Zl, Zp, Zs,
- Arabic, Armenian, Bengali, Bopomofo, Braille, Buginese,
- Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic,
- Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian,
- Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul,
- Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana,
- Kharoshthi, Khmer, Lao, Latin, Limbu, Linear_B, Malayalam,
- Mongolian, Myanmar, New_Tai_Lue, Ogham, Old_Italic, Old_Persian,
- Oriya, Osmanya, Runic, Shavian, Sinhala, Syloti_Nagri, Syriac,
- Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan,
- Tifinagh, Ugaritic, Yi
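A short Ruby sketch of a few character types and properties from this section. It assumes Ruby's Regexp and UTF-8 strings; the constant names are made up for the example.

```ruby
HEX_ONLY  = /\A\h+\z/        # \h is [0-9a-fA-F]
NON_SPACE = /\S+/            # run of non-whitespace chars
ALPHA     = /\p{Alpha}+/     # property usable on all encodings
NOT_UPPER = /\p{^Upper}+/    # negative form, same as \P{Upper}+
HIRAGANA  = /\p{Hiragana}+/  # script property (UTF-8 assumed here)

p HEX_ONLY =~ "deadBEEF"
p HEX_ONLY =~ "xyz"
p "  hi there"[NON_SPACE]
p "abc123"[ALPHA]
p "ABCdefGHI"[NOT_UPPER]
p "AB\u3042\u3044CD"[HIRAGANA]   # \u3042\u3044 are two Hiragana letters
```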
- 4. Quantifier
- greedy
- ? 1 or 0 times
- * 0 or more times
- + 1 or more times
- {n,m} at least n but not more than m times
- {n,} at least n times
- {,n} at least 0 but not more than n times ({0,n})
- {n} n times
- reluctant
- ?? 1 or 0 times
- *? 0 or more times
- +? 1 or more times
- {n,m}? at least n but not more than m times
- {n,}? at least n times
- {,n}? at least 0 but not more than n times (== {0,n}?)
- possessive (greedy, and does not backtrack once the repeat has matched)
- ?+ 1 or 0 times
- *+ 0 or more times
- ++ 1 or more times
- ({n,m}+, {n,}+, {n}+ are possessive operators in ONIG_SYNTAX_JAVA only)
- ex. /a*+/ === /(?>a*)/
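The difference between the three quantifier kinds shows up when the rest of the pattern needs characters back. A small Ruby sketch, assuming Ruby's Regexp; the constant names are illustrative.

```ruby
TAGS = "<a><b>"

GREEDY     = /<.+>/     # takes as much as possible
RELUCTANT  = /<.+?>/    # takes as little as possible
BACKTRACK  = /a*a/      # a* gives one "a" back so the last "a" can match
POSSESSIVE = /a*+a/     # a*+ never gives anything back
ATOMIC     = /(?>a*)a/  # the same behavior written as an atomic group

p TAGS[GREEDY]
p TAGS[RELUCTANT]
p BACKTRACK  =~ "aaa"
p POSSESSIVE =~ "aaa"
p ATOMIC     =~ "aaa"
```

Because the possessive form keeps every "a" it has consumed, the trailing "a" can never match, so the last two patterns fail on "aaa".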
- 5. Anchors
- ^ beginning of the line
- $ end of the line
- \b word boundary
- \B not word boundary
- \A beginning of string
- \Z end of string, or before newline at the end
- \z end of string
- \G matching start position
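The line anchors and string anchors differ only when the subject contains newlines. A minimal Ruby sketch, assuming Ruby's Regexp; TEXT is an invented sample.

```ruby
TEXT = "first\nsecond\n"

p TEXT =~ /^second/     # ^ matches at the beginning of any line
p TEXT =~ /\Asecond/    # \A matches only at the beginning of the string
p TEXT =~ /second$/     # $ matches at the end of any line
p TEXT =~ /second\Z/    # \Z also matches before a final newline
p TEXT =~ /second\z/    # \z matches only at the very end
p TEXT =~ /second\n\z/

# \G anchors each match at the position where the search starts,
# so scan stops at the first gap.
p "aab a".scan(/\Ga/)
```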
- 6. Character class
- ^... negative class (lowest precedence operator)
- x-y range from x to y
- [...] set (character class in character class)
- ..&&.. intersection (low precedence, second only to ^)
-
- ex. [a-w&&[^c-g]z] ==> ([a-w] AND ([^c-g] OR z)) ==> [abh-w]
- * To use '[', '-' or ']' as a normal character in a character class,
- escape it with '\'.
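The intersection example above can be checked by listing which letters the class accepts. A Ruby sketch, assuming Ruby's Regexp; the constant names are illustrative.

```ruby
# [a-w] AND ([^c-g] OR z): "z" is outside a-w, so it drops out again.
INTERSECT = /[a-w&&[^c-g]z]/

# '[', '-' and ']' escaped so that they are ordinary members of the class.
LITERALS = /[\[\-\]]/

accepted = ("a".."z").select { |c| c =~ INTERSECT }.join
p accepted
p "a-b" =~ LITERALS
p "x[y" =~ LITERALS
```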
- POSIX bracket ([:xxxxx:], negate [:^xxxxx:])
- Not Unicode Case:
- alnum alphabet or digit char
- alpha alphabet
- ascii code value: [0 - 127]
- blank \t, \x20
- cntrl
- digit 0-9
- graph include all of multibyte encoded characters
- lower
- print include all of multibyte encoded characters
- punct
- space \t, \n, \v, \f, \r, \x20
- upper
- xdigit 0-9, a-f, A-F
- word alphanumeric, "_" and multibyte characters
- Unicode Case:
- alnum Letter | Mark | Decimal_Number
- alpha Letter | Mark
- ascii 0000 - 007F
- blank Space_Separator | 0009
- cntrl Control | Format | Unassigned | Private_Use | Surrogate
- digit Decimal_Number
- graph [[:^space:]] && ^Control && ^Unassigned && ^Surrogate
- lower Lowercase_Letter
- print [[:graph:]] | [[:space:]]
- punct Connector_Punctuation | Dash_Punctuation | Close_Punctuation |
- Final_Punctuation | Initial_Punctuation | Other_Punctuation |
- Open_Punctuation
- space Space_Separator | Line_Separator | Paragraph_Separator |
- 0009 | 000A | 000B | 000C | 000D | 0085
- upper Uppercase_Letter
- xdigit 0030 - 0039 | 0041 - 0046 | 0061 - 0066
- (0-9, a-f, A-F)
- word Letter | Mark | Decimal_Number | Connector_Punctuation
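POSIX brackets are written inside a character class, so the brackets are doubled. A brief Ruby sketch with ASCII input, assuming Ruby's Regexp; the constant names are illustrative.

```ruby
ALPHA_RUN  = /[[:alpha:]]+/
NON_DIGITS = /[[:^digit:]]+/       # negated POSIX bracket
HEX_RUN    = /[[:xdigit:]]+/
ID_CHARS   = /[[:alnum:]_]+/       # a bracket combined with another member

p "abc123"[ALPHA_RUN]
p "123abc"[NON_DIGITS]
p "ffzz"[HEX_RUN]
p "foo_1 bar"[ID_CHARS]
```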
- 7. Extended groups
- (?#...) comment
- (?imx-imx) option on/off
- i: ignore case
- m: multi-line (dot(.) matches newline)
- x: extended form
- (?imx-imx:subexp) option on/off for subexp
- (?:subexp) not captured group
- (subexp) captured group
- (?=subexp) look-ahead
- (?!subexp) negative look-ahead
- (?<=subexp) look-behind
- (?<!subexp) negative look-behind
- The subexp of a look-behind must have a fixed character length.
- Differing character lengths are allowed only among top-level
- alternatives.
- ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.
- In a negative look-behind, a captured group is not allowed,
- but a shy group (?:) is allowed.
- (?>subexp) atomic group
- don't backtrack in subexp.
- (?<name>subexp), (?'name'subexp)
- define named group
- (Every character of the name must be a word character.)
- Like a captured group, a named group is assigned a number
- as well as a name.
- Assigning the same name to two or more subexps is allowed.
- In this case a subexp call cannot be performed, although
- a back reference is possible.
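A few of the extended groups in use. This is a sketch assuming Ruby's Regexp; the constant names and sample strings are invented for illustration.

```ruby
PRICE      = /(?<=\$)\d+/   # look-behind: digits preceded by "$"
NOT_BAR    = /foo(?!bar)/   # negative look-ahead
ALT_BEHIND = /(?<=a|bc)d/   # top-level alternatives of different lengths
MIXED_CASE = /a(?i:b)c/     # ignore case for the subexp only
DATE       = /(?<year>\d{4})-(?<month>\d{2})/

p "cost $42"[PRICE]
p NOT_BAR =~ "foobar foobaz"
p ALT_BEHIND =~ "bcd"
p MIXED_CASE =~ "aBc"
p MIXED_CASE =~ "ABc"

m = DATE.match("2007-09")
p m[:year]   # access by name
p m[1]       # the same group also has a number
```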
- 8. Back reference
- \n back reference by group number (n >= 1)
- \k<n> back reference by group number (n >= 1)
- \k'n' back reference by group number (n >= 1)
- \k<-n> back reference by relative group number (n >= 1)
- \k'-n' back reference by relative group number (n >= 1)
- \k<name> back reference by group name
- \k'name' back reference by group name
- When a name is defined more than once, a back reference by that name
- refers preferentially to the subexp with the larger number.
- (When that group has not matched, a group with a smaller number is referred to.)
- * Back reference by group number is forbidden if a named group is defined
- in the pattern and ONIG_OPTION_CAPTURE_GROUP is not set.
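The forms of back reference can be compared side by side. A Ruby sketch, assuming Ruby's Regexp applies the rules above; `compiles?` is a hypothetical helper written for this example.

```ruby
QUOTED   = /(?<q>['"]).*?\k<q>/   # by name: closing quote must equal the opening one
DOUBLED  = /(\w)\1/               # by number
RELATIVE = /(a)(b)\k<-1>/         # relative: the nearest group before this point

# Hypothetical helper: true if the pattern source compiles.
def compiles?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

p 'say "hi" now'[QUOTED]
p "hello"[DOUBLED]
p RELATIVE =~ "abb"
p RELATIVE =~ "aba"

# A numbered back reference next to a named group is rejected.
p compiles?('(?<x>a)\1')
p compiles?('(?<x>a)\k<x>')
```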
- back reference with nest level
- level: 0, 1, 2, ...
- \k<n+level> (n >= 1)
- \k<n-level> (n >= 1)
- \k'n+level' (n >= 1)
- \k'n-level' (n >= 1)
- \k<name+level>
- \k<name-level>
- \k'name+level'
- \k'name-level'
- Specifies the nest level relative to the back reference position.
- ex 1.
- /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/.match("reer")
- ex 2.
- r = Regexp.compile(<<'__REGEXP__'.strip, Regexp::EXTENDED)
- (?<element> \g<stag> \g<content>* \g<etag> ){0}
- (?<stag> < \g<name> \s* > ){0}
- (?<name> [a-zA-Z_:]+ ){0}
- (?<content> [^<&]+ (\g<element> | [^<&]+)* ){0}
- (?<etag> </ \k<name+1> >){0}
- \g<element>
- __REGEXP__
- p r.match('<foo>f<bar>bbb</bar>f</foo>').captures
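Example 1 can be read as a palindrome matcher: at each nesting level, \k<b+0> must repeat the character that group b captured at that same level. A small Ruby sketch, assuming Ruby's Regexp supports back references with nest level as described; `palindrome?` is a hypothetical helper.

```ruby
PALINDROME = /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/

# Hypothetical helper: true if the whole string reads the same both ways.
def palindrome?(string)
  !PALINDROME.match(string).nil?
end

p palindrome?("reer")
p palindrome?("racecar")
p palindrome?("reed")
```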
- 9. Subexp call ("Tanaka Akira special")
- \g<name> call by group name
- \g'name' call by group name
- \g<n> call by group number (n >= 1)
- \g'n' call by group number (n >= 1)
- \g<-n> call by relative group number (n >= 1)
- \g'-n' call by relative group number (n >= 1)
- * left-most recursive call is not allowed.
- ex. (?<name>a|\g<name>b) => error
- (?<name>a|b\g<name>c) => OK
- * Call by group number is forbidden if a named group is defined in the pattern
- and ONIG_OPTION_CAPTURE_GROUP is not set.
- * If the option status of the called group differs from that of the calling
- position, the called group's option is effective.
- ex. (?-i:\g<name>)(?i:(?<name>a)){0} matches "A"
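A subexp call lets a group refer to itself, which is enough to match nested brackets. A Ruby sketch, assuming Ruby's Regexp; `compiles?` is a hypothetical helper for this example.

```ruby
# A parenthesized unit: "(" then any mix of plain chars and nested units, then ")".
BALANCED = /\A(?<paren>\((?:[^()]|\g<paren>)*\))\z/

# The allowed recursive form from the rule above.
NESTED = /\A(?<name>a|b\g<name>c)\z/

# Hypothetical helper: true if the pattern source compiles.
def compiles?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

p BALANCED =~ "(a(b)c)"
p BALANCED =~ "(a(b)c"
p NESTED =~ "bbacc"

# The left-most recursive call is rejected when the pattern is compiled.
p compiles?('(?<name>a|\g<name>b)')
```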
- 10. Captured group
- The behavior of an unnamed group (...) changes with the following conditions.
- (The behavior of a named group does not change.)
- case 1. /.../ (named group is not used, no option)
- (...) is treated as a captured group.
- case 2. /.../g (named group is not used, 'g' option)
- (...) is treated as a non-captured group (?:...).
- case 3. /..(?<name>..)../ (named group is used, no option)
- (...) is treated as a non-captured group (?:...).
- numbered-backref/call is not allowed.
- case 4. /..(?<name>..)../G (named group is used, 'G' option)
- (...) is treated as a captured group.
- numbered-backref/call is allowed.
- where
- g: ONIG_OPTION_DONT_CAPTURE_GROUP
- G: ONIG_OPTION_CAPTURE_GROUP
- (the 'g' and 'G' options are under discussion on the ruby-dev ML)
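Cases 1 and 3 can be observed from Ruby, which compiles patterns without the 'g' or 'G' option. A sketch assuming Ruby's Regexp; the constant names are illustrative.

```ruby
PLAIN = /(\d+)-(\d+)/           # case 1: no named group, (...) captures
MIXED = /(\d+)-(?<year>\d+)/    # case 3: a named group is present

p PLAIN.match("12-2007").captures
p MIXED.match("12-2007").captures   # the unnamed (...) no longer captures
p MIXED.match("12-2007")[:year]
```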
- -----------------------------
- A-1. Syntax-dependent options
- + ONIG_SYNTAX_RUBY
- (?m): dot(.) matches newline
- + ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA
- (?s): dot(.) matches newline
- (?m): ^ matches after newline, $ matches before newline
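Under the Ruby syntax, the m option only changes what the dot matches; ^ and $ work on lines with or without it. A short Ruby sketch, assuming Ruby's Regexp; the constant names are illustrative.

```ruby
DOT_DEFAULT   = /a.b/       # dot does not match newline
DOT_MULTILINE = /a.b/m      # m option: dot matches newline
DOT_INLINE    = /(?m:a.b)/  # the same option switched on for a subexp
LINE_ANCHORS  = /^b$/       # no option needed for line-wise ^ and $

p DOT_DEFAULT   =~ "a\nb"
p DOT_MULTILINE =~ "a\nb"
p DOT_INLINE    =~ "a\nb"
p LINE_ANCHORS  =~ "a\nb\nc"
```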
- A-2. Original extensions
- + hexadecimal digit char type \h, \H
- + named group (?<name>...), (?'name'...)
- + named backref \k<name>
- + subexp call \g<name>, \g<group-num>
- A-3. Features missing compared with Perl 5.8.0
- + \N{name}
- + \l,\u,\L,\U, \X, \C
- + (?{code})
- + (??{code})
- + (?(condition)yes-pat|no-pat)
- * \Q...\E
- This is effective on ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA.
- A-4. Differences from Japanized GNU regex (version 0.12) of Ruby 1.8
- + add character property (\p{property}, \P{property})
- + add hexadecimal digit char type (\h, \H)
- + add look-behind
- (?<=fixed-char-length-pattern), (?<!fixed-char-length-pattern)
- + add possessive quantifier. ?+, *+, ++
- + add operations in character class. [], &&
- ('[' must be escaped to be used as an ordinary char in a character class.)
- + add named group and subexp call.
- + octal or hexadecimal number sequence can be treated as
- a multibyte code char in character class if multibyte encoding
- is specified.
- (ex. [\xa1\xa2], [\xa1\xa7-\xa4\xa1])
- + allow the range of single byte char and multibyte char in character
- class.
- ex. /[a-<<any EUC-JP character>>]/ in EUC-JP encoding.
- + the scope of an isolated option extends to the next ')'.
- ex. (?:(?i)a|b) is interpreted as (?:(?i:a|b)), not (?:(?i:a)|b).
- + an isolated option is not transparent to the preceding pattern.
- ex. a(?i)* is a syntax error pattern.
- + an incomplete left brace is allowed as an ordinary string.
- ex. /{/, /({)/, /a{2,3/ etc...
- + negative POSIX bracket [:^xxxx:] is supported.
- + POSIX bracket [:ascii:] is added.
- + a look-ahead can not be repeated.
- ex. /(?=a)*/, /(?!b){5}/
- + The ignore-case option also applies to characters written as numeric codes.
- ex. /\x61/i =~ "A"
- + In the range quantifier, the minimum can be omitted.
- /a{,n}/ == /a{0,n}/
- Omitting both the minimum and the maximum at the same time
- is not allowed. (/a{,}/)
- + /a{n}?/ is not a non-greedy operator.
- /a{n}?/ == /(?:a{n})?/
- + an invalid back reference is detected and causes an error.
- /\1/, /(a)\2/
- + A zero-length match in an infinite repeat stops the repeat;
- whether the capture group status changed is then checked as a stop condition.
- /(?:()|())*\1\2/ =~ ""
- /(?:\1a|())*/ =~ "a"
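Some of the differences listed above are easy to observe. A Ruby sketch, assuming a current Ruby Regexp still follows these rules; `compiles?` is a hypothetical helper for this example.

```ruby
AT_MOST_TWO  = /a{,2}/   # minimum omitted: same as a{0,2}
OPTIONAL_TWO = /a{2}?/   # read as (?:a{2})?, not as a reluctant repeat
CODE_NOCASE  = /\x61/i   # ignore case applies to a char given by code
LONE_BRACE   = Regexp.new('{')   # incomplete left brace is an ordinary char

# Hypothetical helper: true if the pattern source compiles.
def compiles?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

p "aaa"[AT_MOST_TWO]
p "aaa"[OPTIONAL_TWO]
p "a"[OPTIONAL_TWO]      # the optional group matches the empty string here
p CODE_NOCASE =~ "A"
p LONE_BRACE =~ "a{b"
p compiles?('(a)\2')     # invalid back reference is an error
```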
- A-5. Features disabled in the default syntax
- + capture history
- (?@...) and (?@<name>...)
- ex. /(?@a)*/.match("aaa") ==> [<0-1>, <1-2>, <2-3>]
- see sample/listcap.c file.
- A-6. Problems
- + Invalid encoding byte sequence is not checked.
- ex. UTF-8
- * Invalid first byte is treated as a character.
- /./u =~ "\xa3"
- * Incomplete byte sequence is not checked.
- /\w+/ =~ "a\xf3\x8ec"
- // END