Proper Unicoding

Ruby's Regexp engine has a powerful feature built in: it can match Unicode character properties. But which properties exactly can you match?

The Unicode Consortium not only assigns all codepoints, it also publishes additional data about the characters assigned to them. When searching through a string, Ruby allows you to utilize some of this extra knowledge.

Property Regexp Syntax

Within a regular expression, use the \p directive:
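A minimal sketch of the directive, using the POSIX-like Upper property as an arbitrary example (the sample string is made up for illustration):

```ruby
# \p{...} matches any character that has the named Unicode property.
# Upper (uppercase letters) is just an illustrative choice of property.
uppercase = "Idiosyncratic Ruby".scan(/\p{Upper}/)
p uppercase # => ["I", "R"]
```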

To invert the property (matching characters that do not have it), you can either use an uppercase \P:
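For instance, a small sketch with the Upper property as an illustrative choice:

```ruby
# \P{...} (capital P) matches every character that lacks the property.
not_upper = "Ruby".scan(/\P{Upper}/)
p not_upper # => ["u", "b", "y"]
```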

Or add the ^ sign:
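A sketch of the caret form, again with Upper as an illustrative property:

```ruby
# \p{^...} negates the property, just like \P{...} does.
negated = "Ruby".scan(/\p{^Upper}/)
p negated # => ["u", "b", "y"]
```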

Ruby strips all spaces, dashes, and underscores from the given property name and converts it to lowercase. So the following examples are all valid syntax:
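The sketch below checks this normalization with several spellings of the Miscellaneous Symbols block property. The particular spellings are illustrative choices, not an exhaustive list:

```ruby
sun = "☼" # U+263C, part of the Miscellaneous Symbols block

# Spaces, dashes, underscores, and letter case do not matter
# inside the curly braces.
patterns = [
  /\p{In Miscellaneous Symbols}/,
  /\p{In_Miscellaneous_Symbols}/,
  /\p{in-miscellaneous-symbols}/,
  /\p{INMISCELLANEOUSSYMBOLS}/,
]

all_match = patterns.all? { |pattern| pattern.match?(sun) }
p all_match # => true
```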

Supported Unicode Versions

See table at Episode 73: Unicode Version Mapping

List of Properties as of Ruby 3.0 / Unicode 12.1

General Category

Each codepoint has a General Category, one of the most basic categorizations. Codepoints without an explicit general category implicitly get Cn (Unassigned):

"Find decimal numbers (like 2 or 3)".scan(/\p{Nd}+/) # => ["2", "3"]

See the Unicode::Categories micro gem for a way to find all general categories that occur in a string, and for a list of possible categories.

Major Category

The Major category is basically the first letter of the general category:

Example:

"Find punctuation characters (like : or ;)".scan(/\p{P}+/) # => ["(", ":", ";)"]

Block

Unicode codepoints are also structured into contiguous blocks: Each codepoint is part of exactly one block or has the special value No_Block. To turn a block name into a Unicode property, you have to prefix it with "In":

"Do not look directly into the ☼".scan(/\p{In Miscellaneous Symbols}/) # => ["☼"]

See the Unicode::Blocks micro gem for a way to retrieve the blocks of a string and a list of all valid block names.

Script

The script of a character can also be matched:

"ᴦ".scan(/\p{Greek}/) # => ["ᴦ"]

See the Unicode::Scripts micro gem for a way to find all scripts a string contains and a list of valid script names. A great way to explore the different scripts is codepoints.net.

Age

The age property lets you find out which Unicode version a string requires. \p{age=X.Y} matches every codepoint that was assigned in Unicode X.Y or earlier:

"Train: 🛲 " =~ /\A\p{age=3.1}*\z/ # => nil
"Train: 🛲 " =~ /\A\p{age=7.0}*\z/ # => 0

Combined/POSIX like Properties

All properties of the POSIX brackets syntax are available with the \p syntax: For example, [[:print:]] simply becomes \p{print}. You can find the full list of properties in Episode 30: Regex with Class.
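As a small sketch of this equivalence (the sample string is made up for illustration), both spellings of the print class split a string at a tab, which is a control character and therefore not printable:

```ruby
text = "tab\there"

# POSIX bracket syntax and \p syntax select the same characters.
posix    = text.scan(/[[:print:]]+/)
property = text.scan(/\p{print}+/)

p posix    # => ["tab", "here"]
p property # => ["tab", "here"]
```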

Generic Properties

While \p{Any} matches any representable codepoint, \p{Assigned} excludes unassigned codepoints (general category Cn), that is, reserved codepoints and non-characters.
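A minimal sketch of the difference, using U+FFFF (a permanent non-character) as the test codepoint:

```ruby
noncharacter = "\u{FFFF}" # a non-character, general category Cn

any_matches      = noncharacter.match?(/\p{Any}/)
assigned_matches = noncharacter.match?(/\p{Assigned}/)
letter_assigned  = "A".match?(/\p{Assigned}/)

p any_matches      # => true
p assigned_matches # => false
p letter_assigned  # => true
```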

Derived Core Properties

These can be found in DerivedCoreProperties.txt (explanation), along with a comment on how each property gets constructed. Possible values are (short form in parentheses):
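Math is one of the properties defined in DerivedCoreProperties.txt. A small sketch of matching it, with a made-up sample string:

```ruby
# Math covers mathematical symbols such as + and =
math_symbols = "a + b = c".scan(/\p{Math}/)
p math_symbols # => ["+", "="]
```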

Ruby's regex engine supports matching for grapheme clusters using \X. But it can also match for very specific grapheme related properties:
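To illustrate the \X part, here is a sketch that compares codepoints with grapheme clusters, using "e" followed by a combining acute accent (U+0301) as sample input:

```ruby
word = "ne\u0301e" # "n", "e" + combining acute accent, "e"

codepoints = word.chars      # splits into single codepoints
clusters   = word.scan(/\X/) # splits into grapheme clusters

p codepoints.size # => 4
p clusters.size   # => 3
```

The accent stays attached to its base letter when splitting with \X.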

Binary Properties

Other matchable character properties are:
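Two binary properties defined by Unicode are White_Space and Dash. A minimal sketch of matching them, with made-up sample strings:

```ruby
# White_Space also covers non-ASCII spaces like U+00A0 (no-break space)
has_space = "a\u00A0b".match?(/\p{White_Space}/)
p has_space # => true

# Dash includes the ASCII hyphen-minus
dashes = "2001-12-31".scan(/\p{Dash}/)
p dashes # => ["-", "-"]
```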

Emoji Properties

Also see: unicode-emoji
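A small sketch of two emoji properties, Emoji and Emoji_Presentation, assuming a Ruby version whose regex engine ships Unicode's emoji data (the sample strings are made up):

```ruby
# Emoji_Presentation: characters displayed as emoji by default
trains = "Train: 🚂".scan(/\p{Emoji_Presentation}/)
p trains # => ["🚂"]

# Emoji is broader: ASCII digits carry it too, because they can
# start keycap sequences
digit_is_emoji = "1".match?(/\p{Emoji}/)
p digit_is_emoji # => true
```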
