Ruby's regex engine defines a lot of shortcut character classes. Besides the common meta characters (\w, etc.), there are also POSIX-style expressions and the Unicode property syntax. This is an overview of all character classes:
Meta Chars
| Char | Negation | ASCII | Unicode |
|---|---|---|---|
| . | - | ¹ Any | ¹ Any |
| \X | - | Any | Grapheme clusters (\P{M}\p{M}*) |
| \d | \D | [0-9] | ² ASCII plus Decimal_Number (Nd) |
| \h | \H | [0-9a-fA-F] | Like ASCII |
| \w | \W | [0-9a-zA-Z_] | ² ASCII plus Letter (LC / Ll / Lm / Lo / Lt / Lu), Mark (Mc / Me / Mn), Number (Nd / Nl / No), Connector_Punctuation (Pc) |
| \s | \S | [ \t\r\v\n\f] | ² ASCII plus Separator (Zl / Zp / Zs) |
| \R | - | [\n\v\f\r], \r\n | ASCII plus \u0085 (NEL), Line_Separator (Zl), Paragraph_Separator (Zp) |
¹ Will only match linebreaks with /m flag
² You'll need to manually turn on unicode matching for these to work
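
The two footnotes can be tried out in a short script. This is a minimal sketch, assuming Ruby 2.4 or later (for `match?`) and assuming the inline `(?u)` option as the way to turn on Unicode matching. The constant names and sample strings are made up for illustration.

```ruby
# ¹ The dot only matches a linebreak when the /m flag is set.
DOT_DEFAULT   = "a\nb".match?(/a.b/)   # => false
DOT_MULTILINE = "a\nb".match?(/a.b/m)  # => true

# ² \d and \w stay ASCII-only unless Unicode matching is switched on.
# Assumption: the inline (?u) option is used for that here.
ARABIC_THREE  = "\u0663"  # ARABIC-INDIC DIGIT THREE, category Nd
A_UMLAUT      = "\u00E4"  # "ä", category Ll
DIGIT_DEFAULT = ARABIC_THREE.match?(/\d/)      # => false
DIGIT_UNICODE = ARABIC_THREE.match?(/(?u)\d/)  # => true
WORD_DEFAULT  = A_UMLAUT.match?(/\w/)          # => false
WORD_UNICODE  = A_UMLAUT.match?(/(?u)\w/)      # => true

# \R treats "\r\n" as a single linebreak.
LINEBREAKS = "a\r\nb\nc".scan(/\R/)  # => ["\r\n", "\n"]

# \X treats a base character plus combining mark as one unit, "." does not.
GRAPHEMES  = "e\u0301".scan(/\X/)  # one element
CODEPOINTS = "e\u0301".scan(/./)   # two elements

puts "dot: #{DOT_DEFAULT} / #{DOT_MULTILINE}"
puts "\\d:  #{DIGIT_DEFAULT} / #{DIGIT_UNICODE}"
puts "\\w:  #{WORD_DEFAULT} / #{WORD_UNICODE}"
puts "\\R:  #{LINEBREAKS.inspect}"
puts "\\X:  #{GRAPHEMES.size} grapheme, #{CODEPOINTS.size} codepoints"
```

Writing the non-ASCII samples as `\u` escapes keeps the script independent of the source file's encoding.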
POSIX and Unicode Property Style
| POSIX | Negation | Property | Negation³ | ASCII | Unicode |
|---|---|---|---|---|---|
| [:alnum:] | [:^alnum:] | \p{Alnum} | \p{^Alnum} | [0-9a-zA-Z] | Letter (LC / Ll / Lm / Lo / Lt / Lu), Mark (Mc / Me / Mn), Decimal_Number (Nd) |
| [:alpha:] | [:^alpha:] | \p{Alpha} | \p{^Alpha} | [a-zA-Z] | Letter (LC / Ll / Lm / Lo / Lt / Lu), Mark (Mc / Me / Mn) |
| [:ascii:] | [:^ascii:] | \p{ASCII} | \p{^ASCII} | [\x00-\x7F] | Like ASCII |
| [:blank:] | [:^blank:] | \p{Blank} | \p{^Blank} | [ \t] | \t, Space_Separator (Zs) |
| [:cntrl:] | [:^cntrl:] | \p{Cntrl} | \p{^Cntrl} | [\x00-\x1F], \x7F | Other (Cc / Cf / Cn / Co / Cs) |
| [:digit:] | [:^digit:] | \p{Digit} | \p{^Digit} | [0-9] | ASCII plus Decimal_Number (Nd) |
| [:graph:] | [:^graph:] | \p{Graph} | \p{^Graph} | [\x21-\x7E] | ALL, EXCEPT: Separator (Zl / Zp / Zs), Control (Cc), Unassigned (Cn), Surrogate (Cs) |
| [:lower:] | [:^lower:] | \p{Lower} | \p{^Lower} | [a-z] | Lowercase_Letter (Ll) |
| [:print:] | [:^print:] | \p{Print} | \p{^Print} | [\x20-\x7E] | ALL, EXCEPT: Line_Separator (Zl), Paragraph_Separator (Zp), Control (Cc), Unassigned (Cn), Surrogate (Cs) |
| [:punct:] | [:^punct:] | \p{Punct} | \p{^Punct} | [!-/:-@\[-`{-~] | Punctuation (Pc / Pd / Pe / Pf / Pi / Po / Ps) |
| [:space:] | [:^space:] | \p{Space} | \p{^Space} | [ \t\r\v\n\f] | ASCII plus Separator (Zl / Zp / Zs) |
| [:upper:] | [:^upper:] | \p{Upper} | \p{^Upper} | [A-Z] | Uppercase_Letter (Lu) |
| [:xdigit:] | [:^xdigit:] | \p{XDigit} | \p{^XDigit} | [0-9a-fA-F] | Like ASCII |
| [:word:] | [:^word:] | \p{Word} | \p{^Word} | [0-9a-zA-Z_] | ASCII plus Letter (LC / Ll / Lm / Lo / Lt / Lu), Mark (Mc / Me / Mn), Number (Nd / Nl / No), Connector_Punctuation (Pc) |
³ An alternative way of negating unicode properties is \P{Property}
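
To see how the POSIX and property spellings line up, here is a small sketch that runs each form of the alpha class, plus its three negations, against a UTF-8 string. Note that a POSIX expression has to sit inside a bracket expression, so it is written `[[:alpha:]]`. The constant names and sample characters are only illustrative, and `transform_values` assumes Ruby 2.4 or later.

```ruby
A_UMLAUT = "\u00E4"  # "ä", a non-ASCII letter (UTF-8 string)

# Two spellings of the same class.
FORMS = {
  posix:    /[[:alpha:]]/,
  property: /\p{Alpha}/,
}

# Three spellings of its negation.
NEGATED_FORMS = {
  posix:          /[[:^alpha:]]/,
  property_caret: /\p{^Alpha}/,
  property_upper: /\P{Alpha}/,
}

# Every positive form matches the letter, no negated form does.
LETTER_POSITIVE = FORMS.transform_values { |re| A_UMLAUT.match?(re) }
LETTER_NEGATED  = NEGATED_FORMS.transform_values { |re| A_UMLAUT.match?(re) }

# For a digit it is the other way around.
DIGIT_POSITIVE = FORMS.transform_values { |re| "1".match?(re) }
DIGIT_NEGATED  = NEGATED_FORMS.transform_values { |re| "1".match?(re) }

puts "letter: #{LETTER_POSITIVE} #{LETTER_NEGATED}"
puts "digit:  #{DIGIT_POSITIVE} #{DIGIT_NEGATED}"
```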
More Properties
The above groups are only the tip of the iceberg. Using the \p{} syntax, you can match many more Unicode properties. See Episode 41: Proper Unicoding for details!
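
As a quick taste, the sketch below uses one general category (`Lu`) and two script names (`Greek`, `Han`) with the same \p{} syntax. The sample text and constant names are assumptions chosen for illustration.

```ruby
# "Ruby λ 漢字", written with escapes so the file encoding does not matter.
TEXT = "Ruby \u03BB \u6F22\u5B57"

UPPERCASE = TEXT.scan(/\p{Lu}/)      # general category Uppercase_Letter
GREEK     = TEXT.scan(/\p{Greek}+/)  # script: Greek
HAN       = TEXT.scan(/\p{Han}+/)    # script: Han

puts "Lu:    #{UPPERCASE.inspect}"
puts "Greek: #{GREEK.inspect}"
puts "Han:   #{HAN.inspect}"
```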
Further Reading
- Onigmo Documentation
- Unicode Character Property Model
- RDoc: Regexp (Character Properties)
- Unicode Data
- Unicode Property List
- Unicode Property Aliases
- Unicode Property Values Aliases
