Idiosyncratic Ruby: Proper Unicoding

Ruby's Regexp engine has a powerful feature built in: It can match for Unicode character properties. But what exactly are properties you can match for?

The Unicode consortium not only assigns all codepoints, it also publishes additional data about their assigned characters. When searching through a string, Ruby allows you to utilize some of this extra knowledge.

Property Regexp Syntax

Within a regular expression, use the \p directive:

/\p{ PROPERTY NAME }/

To invert the property (matching characters that do not fit), you can either use a big \P:

/\P{ PROPERTY NAME }/

Or add the ^ sign:

/\p{ ^PROPERTY NAME }/

Ruby will strip all spaces, dashes, underscores from the given property and convert it to a lowercased string. So the following examples are all valid syntax:

/\p{AGE = 6.3}/
/\p{^In Supplementary Private Use Area-B}/
/\p{In_Egyptian_Hieroglyphs}/
/\P{inemoticons}/
/\P{inno block}/
/\p{^zzzz}/
/\p{ z___ y-y-y }/

Supported Unicode Versions

See table at Episode 73: Unicode Version Mapping

List of Properties as of Ruby 3.0 / Unicode 12.1

General Category

Each code point has a General Category, one of the most basic categorizations. Codepoints without an explicit general category will implicitly get Cn (Unassigned):

"Find decimal numbers (like 2 or 3)".scan(/\p{Nd}+/) # => ["2", "3"]

See the Unicode::Categories micro gem for a way find all general categories a string belongs to and a list of possible categories.

Major Category

The Major category is basically the first letter of the general category:

L: Letter
M: Mark
N: Number
P: Punctuation
S: Symbol
Z: Separator
C: Other

Example:

"Find punctuation characters (like : or ;)".scan(/\p{P}+/) # => ["(", ":", ";)"]

Block

Unicode codepoints are also structured as contiguous blocks: Each codepoint is part of one or has the special value No_Block. To make the block name a Unicode property, you have to prefix it with "in":

"Do not look directly into the ☼".scan /\p{In Miscellaneous Symbols}/ # => ["☼"]

See the Unicode::Blocks micro gem for a way to retrieve the blocks of a string and a list of all valid block names.

Script

The script of a character can also be matched:

"ᴦ".scan/\p{Greek}/ # => "ᴦ"

See the Unicode::Scripts micro gem for a way to find all scripts a string contains and a list of valid script names. A great way to explore the different scripts is codepoints.net.

Age

The age property lets you find out the required Unicode version to display a string:

"Train: 🛲 " =~ /\A\p{age=3.1}*\z/ # => nil
"Train: 🛲 " =~ /\A\p{age=7.0}*\z/ # => 0

Combined/POSIX like Properties

All properties of the POSIX brackets syntax are available with the \p syntax: For example, [[:print:]] simply becomes \p{print}. You can find the full list of properties in Episode 30: Regex with Class.

Generic Properties

Any
Assigned

While \p{Any} will just match any representable codepoint, \p{Assigned} will ignore Reserved codepoints and Non-Characters

Derived Core Properties

These can be found in DerivedCoreProperties.txt (explanation), along with a comment how the property gets constructed. Possible values are (short form in parenthesis):

Math
Alphabetic (Alpha)
Lowercase (Lower)
Uppercase (Upper)
Cased
Case Ignorable
Changes When Lowercased (CWL)
Changes When Uppercased (CWU)
Changes When Titlecased (CWT)
Changes When Casefolded (CWCF)
Changes When Casemapped (CWCF)
ID Start (IDS)
ID Continue (IDC)
XID Start (XIDS)
XID Continue (XIDC)
Default Ignorable Code Point (DI)
Grapheme Extend (Gr Ext)
Grapheme Base (Gr Base)
Grapheme Link (Gr Link)

Ruby's regex engine supports matching for grapheme clusters using \X. But it can also match for very specific grapheme related properties:

Grapheme Cluster Break = Prepend
Grapheme Cluster Break = CR
Grapheme Cluster Break = LF
Grapheme Cluster Break = Control
Grapheme Cluster Break = Extend
Grapheme Cluster Break = Regional Indicator
Grapheme Cluster Break = SpacingMark
Grapheme Cluster Break = L
Grapheme Cluster Break = V
Grapheme Cluster Break = T
Grapheme Cluster Break = LV
Grapheme Cluster Break = LVT
Grapheme Cluster Break = ZWJ

Binary Properties

Other matchable character properties are:

White Space (W Space)
Bidi Control (Bidi C)
Join Control (Join C)
Dash
Hyphen
Quotation Mark (Q Mark)
Terminal Punctuation (Term)
Other Math (O Math)
Hex Digit (Hex)
ASCII Hex Digit (A Hex)
Other Alphabetic (O Alpha)
Ideographic (Ideo)
Diacritic (Dia)
Extender (Ext)
Other Lowercase (O Lower)
Other Uppercase (O Upper)
Noncharacter Code Point (N Char)
Other Grapheme Extend (O Gr Ext)
IDS Binary Operator (IDSB)
IDS Trinary Operator (IDST)
Radical
Unified Ideograph (U Ideo)
Other Default Ignorable Code Point (ODI)
Deprecated (Dep)
Soft Dotted (SD)
Logical Order Exception (LOE)
Other ID Start (OIDS)
Other ID Continue (OIDC)
Sentence Terminal (S Term)
Variation Selector (VS)
Pattern White Space (Pat WS)
Pattern Syntax (Pat Syn)
Prepended Concatenation Mark (PCM)
Regional Indicator (RI)

Emoji Properties

Also see: unicode-emoji

Emoji
Emoji Presentation
Emoji Modifier
Emoji Modifier_Base
Emoji Component
Extended Pictographic

Proper Unicoding

Property Regexp Syntax

Supported Unicode Versions

List of Properties as of Ruby 3.0 / Unicode 12.1

General Category

Major Category

Block

Script

Age

Combined/POSIX like Properties

Generic Properties

Derived Core Properties

Binary Properties

Emoji Properties

Resources

More Idiosyncratic Ruby

Proper Unicoding

Property Regexp Syntax

Supported Unicode Versions

List of Properties as of Ruby 3.0 / Unicode 12.1

General Category

Major Category

Block

Script

Age

Combined/POSIX like Properties

Generic Properties

Derived Core Properties

Grapheme Related

Binary Properties

Emoji Properties

Resources

More Idiosyncratic Ruby