Ruby's regex engine defines a lot of shortcut character classes. Besides the common meta characters (\w, etc.), there are also the POSIX-style expressions and the Unicode property syntax. This is an overview of all character classes:
Meta Chars
Char | Negation | ASCII | Unicode |
---|---|---|---|
. | - | ¹ Any | ¹ Any |
\X | - | Any | Grapheme clusters (\P{M}\p{M}*) |
\d | \D | [0-9] | ² ASCII plus Decimal_Number (Nd) |
\h | \H | [0-9a-fA-F] | Like ASCII |
\w | \W | [0-9a-zA-Z_] | ² ASCII plus Letter (LC / Ll / Lm / Lo / Lt / Lu), Mark (Mc / Me / Mn), Number (Nd / Nl / No), Connector_Punctuation (Pc) |
\s | \S | [ \t\r\v\n\f] | ² ASCII plus Separator (Zl / Zp / Zs) |
\R | - | [\n\v\f\r], \r\n | ASCII plus \u0085, Line_Separator (Zl), Paragraph_Separator (Zp) |
¹ Will only match linebreaks with the /m flag
² You'll need to manually turn on unicode matching for these to work
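
A minimal sketch of how the footnotes and a few of the rows play out in practice. The method names are made up for illustration, and the inline (?u) option is assumed here to be the way Unicode matching gets switched on (it is described in the Onigmo documentation; behavior may differ on older Ruby versions):

```ruby
# Footnote 1: "." skips linebreaks unless the /m flag is set.
def dot_matches_newline?(multiline:)
  pattern = multiline ? /a.b/m : /a.b/
  "a\nb".match?(pattern)
end

# Footnote 2: \d is ASCII-only by default; (?u) opts in to Unicode matching.
def digit?(char, unicode: false)
  pattern = unicode ? /(?u)\d/ : /\d/
  char.match?(pattern)
end

# \X consumes a whole grapheme cluster ...
def grapheme_count(string)
  string.scan(/\X/).size
end

# ... while "." consumes a single codepoint.
def codepoint_count(string)
  string.scan(/./).size
end

# \R treats "\r\n" as one linebreak.
def split_lines(string)
  string.split(/\R/)
end

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE (Nd)
e_acute      = "e\u0301" # "e" + COMBINING ACUTE ACCENT

dot_matches_newline?(multiline: false) # => false
dot_matches_newline?(multiline: true)  # => true
digit?(arabic_three)                   # => false
digit?(arabic_three, unicode: true)    # => true
codepoint_count(e_acute)               # => 2
grapheme_count(e_acute)                # => 1
split_lines("a\r\nb\nc")               # => ["a", "b", "c"]
```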
POSIX and Unicode Property Style
POSIX | Negation | Property | Negation³ | ASCII | Unicode |
---|---|---|---|---|---|
[:alnum:] | [:^alnum:] | \p{Alnum} | \p{^Alnum} | [0-9a-zA-Z] | Letter (LC / Ll / Lm / Lo / Lt / Lu), Mark (Mc / Me / Mn), Decimal_Number (Nd) |
[:alpha:] | [:^alpha:] | \p{Alpha} | \p{^Alpha} | [a-zA-Z] | Letter (LC / Ll / Lm / Lo / Lt / Lu), Mark (Mc / Me / Mn) |
[:ascii:] | [:^ascii:] | \p{ASCII} | \p{^ASCII} | [\x00-\x7F] | Like ASCII |
[:blank:] | [:^blank:] | \p{Blank} | \p{^Blank} | [ \t] | \t, Space_Separator (Zs) |
[:cntrl:] | [:^cntrl:] | \p{Cntrl} | \p{^Cntrl} | [\x00-\x1F], \x7F | Other (Cc / Cf / Cn / Co / Cs) |
[:digit:] | [:^digit:] | \p{Digit} | \p{^Digit} | [0-9] | ASCII plus Decimal_Number (Nd) |
[:graph:] | [:^graph:] | \p{Graph} | \p{^Graph} | [\x21-\x7E] | ALL, EXCEPT: Separator (Zl / Zp / Zs), Control (Cc), Unassigned (Cn), Surrogate (Cs) |
[:lower:] | [:^lower:] | \p{Lower} | \p{^Lower} | [a-z] | Lowercase_Letter (Ll) |
[:print:] | [:^print:] | \p{Print} | \p{^Print} | [\x20-\x7E] | ALL, EXCEPT: Line_Separator (Zl), Paragraph_Separator (Zp), Control (Cc), Unassigned (Cn), Surrogate (Cs) |
[:punct:] | [:^punct:] | \p{Punct} | \p{^Punct} | [!-/:-@\[-`{-~] | Punctuation (Pc / Pd / Pe / Pf / Pi / Po / Ps) |
[:space:] | [:^space:] | \p{Space} | \p{^Space} | [ \t\r\v\n\f] | ASCII plus Separator (Zl / Zp / Zs) |
[:upper:] | [:^upper:] | \p{Upper} | \p{^Upper} | [A-Z] | Uppercase_Letter (Lu) |
[:xdigit:] | [:^xdigit:] | \p{XDigit} | \p{^XDigit} | [0-9a-fA-F] | Like ASCII |
[:word:] | [:^word:] | \p{Word} | \p{^Word} | [0-9a-zA-Z_] | ASCII plus Letter (LC / Ll / Lm / Lo / Lt / Lu), Mark (Mc / Me / Mn), Number (Nd / Nl / No), Connector_Punctuation (Pc) |
³ An alternative way of negating unicode properties is \P{Property}
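
As a small sketch of the syntax in this table: POSIX classes have to sit inside a bracket expression (hence the double brackets), and the three negated spellings select the same characters. The method names are assumptions made for illustration:

```ruby
# Characters of the string matched by the POSIX bracket form.
def uppercase_posix(string)
  string.scan(/[[:upper:]]/)
end

# Same class, written with the property syntax.
def uppercase_property(string)
  string.scan(/\p{Upper}/)
end

# Three spellings of "anything that is not a digit".
def non_digit_variants(string)
  [/[[:^digit:]]/, /\p{^Digit}/, /\P{Digit}/].map { |pattern| string.scan(pattern) }
end

sample = "\u00DCber 42" # "Über 42"

uppercase_posix(sample)         # => ["Ü"]
uppercase_property(sample)      # => ["Ü"]
non_digit_variants(sample).uniq # => [["Ü", "b", "e", "r", " "]]
```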
More Properties
The above groups are only the tip of the iceberg. Using the \p{} syntax, you can match many more Unicode properties; see Episode 41: Proper Unicoding for details!
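
For illustration, here are two of the many additional properties: a script (Greek) and a general category (Sc, Currency_Symbol). This is only a rough sketch, and the method names are hypothetical:

```ruby
# Extracts runs of Greek-script characters.
def greek_words(text)
  text.scan(/\p{Greek}+/)
end

# Extracts currency symbols (general category Sc).
def currency_symbols(text)
  text.scan(/\p{Sc}/)
end

greek_words("alpha \u03B1\u03B2\u03B3 beta") # => ["αβγ"]
currency_symbols("\u20AC5 or $6")            # => ["€", "$"]
```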
Further Reading
- Onigmo Documentation
- Unicode Character Property Model
- RDoc: Regexp (Character Properties)
- Unicode Data
- Unicode Property List
- Unicode Property Aliases
- Unicode Property Values Aliases