Proper Unicoding

Ruby's Regexp engine has a powerful feature built in: it can match Unicode character properties. But which properties exactly can you match?

The Unicode Consortium not only assigns all codepoints, it also publishes additional data about the assigned characters. When searching through a string, Ruby lets you use some of this extra knowledge.

Property Regexp Syntax

Within a regular expression, use the \p directive with the property name in curly braces, as in \p{Alpha}.
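A minimal sketch of the basic form. It uses the POSIX-like Alpha property (see the section on POSIX-like properties further down) purely for illustration, and the sample string is made up:

```ruby
# \p{Alpha} matches any alphabetic character.
sample = "Ruby 2.4"
letters = sample.scan(/\p{Alpha}+/)
p letters # => ["Ruby"]
```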

To invert the property (matching characters that do not have it), you can use an uppercase \P, as in \P{Alpha}.
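As a small illustration with a made-up sample string, the uppercase form picks up everything that is not alphabetic:

```ruby
# \P{Alpha} matches any character that is NOT alphabetic.
sample = "Ruby 2.4"
non_letters = sample.scan(/\P{Alpha}+/)
p non_letters # => [" 2.4"]
```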

Or put a ^ sign in front of the property name, as in \p{^Alpha}.
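A short sketch, again with a made-up sample string, showing that the ^ form gives the same result as the uppercase \P form:

```ruby
sample = "Ruby 2.4"
with_caret = sample.scan(/\p{^Alpha}+/) # inverted via ^
with_big_p = sample.scan(/\P{Alpha}+/)  # inverted via uppercase \P
p with_caret               # => [" 2.4"]
p with_caret == with_big_p # => true
```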

Ruby strips all spaces, dashes, and underscores from the given property name and converts it to lowercase. So \p{Lowercase_Letter}, \p{lowercase letter}, and \p{LOWERCASE-LETTER} are all valid syntax for the same property.
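The normalization described above can be sketched like this. The spelling variants are chosen for illustration only; each one should resolve to the same general category (Ll):

```ruby
# Four spellings of the same property name, differing only in
# spaces, dashes, underscores, and letter case.
variants = [
  /\p{Lowercase_Letter}/,
  /\p{lowercase letter}/,
  /\p{LOWERCASE-LETTER}/,
  /\p{lowercaseletter}/,
]

# Each variant should match the lowercase letter at index 0.
results = variants.map { |regexp| "a" =~ regexp }
p results # => [0, 0, 0, 0]
```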

Supported Unicode Versions

Ruby Version    Unicode Version
2.4             9.0.0
2.3             8.0.0
2.2             7.0.0
2.1             6.1.0
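To check which Unicode version your own Ruby ships, you can try the following sketch. It assumes that RbConfig::CONFIG has a "UNICODE_VERSION" entry, which may not be the case on older Rubies, so it falls back to "unknown":

```ruby
require "rbconfig"

# Assumption: the "UNICODE_VERSION" key exists in RbConfig::CONFIG.
# On Rubies without it, the lookup simply returns nil.
unicode_version = RbConfig::CONFIG["UNICODE_VERSION"]
puts "Ruby #{RUBY_VERSION} supports Unicode #{unicode_version || 'unknown'}"
```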

List of Properties

General Category

Each codepoint has a General Category, one of the most basic categorizations. Codepoints without an explicit general category implicitly get Cn (Unassigned):

"Find decimal numbers (like 2 or 3)".scan(/\p{Nd}+/) # => ["2", "3"]

See the Unicode::Categories micro gem for a way to find all general categories that occur in a string, and for a list of possible categories.
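To illustrate the implicit Cn category mentioned above, here is a small sketch. It assumes that U+03A2, a long-standing gap in the Greek block, is still unassigned in the Unicode version your Ruby supports:

```ruby
# Assumption: U+03A2 has no character assigned to it.
unassigned = "\u03A2"

unassigned_index = unassigned =~ /\p{Cn}/ # matches at index 0
assigned_index   = "A" =~ /\p{Cn}/        # no match for an assigned letter

p unassigned_index # => 0
p assigned_index   # => nil
```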

Major Category

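The major category is basically the first letter of the general category. For instance, Lu (Uppercase Letter) and Ll (Lowercase Letter) both belong to the major category L (Letter).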
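The relationship between general and major categories can be sketched with a made-up sample string: the one-letter form matches everything its two-letter subcategories match.

```ruby
sample = "Ab1"

upper      = sample.scan(/\p{Lu}/) # general category: Uppercase Letter
lower      = sample.scan(/\p{Ll}/) # general category: Lowercase Letter
any_letter = sample.scan(/\p{L}/)  # major category: any Letter

p upper      # => ["A"]
p lower      # => ["b"]
p any_letter # => ["A", "b"]
```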

Example:

"Find punctuation characters (like : or ;)".scan(/\p{P}+/) # => ["(", ":", ";)"]

Block

Unicode codepoints are also organized into contiguous blocks: each codepoint either belongs to exactly one block or has the special value No_Block. To use a block name as a Unicode property, you have to prefix it with "In":

"Do not look directly into the ☼".scan(/\p{In Miscellaneous Symbols}/) # => ["☼"]

See the Unicode::Blocks micro gem for a way to retrieve the blocks of a string and a list of all valid block names.

Script

The script of a character can also be matched:

"ᴦ".scan(/\p{Greek}/) # => ["ᴦ"]

See the Unicode::Scripts micro gem for a way to find all scripts a string contains and a list of valid script names. A great way to explore the different scripts is codepoints.net.

Age

The age property lets you find out which Unicode version is required to display a string. \p{age=X} matches every character that was assigned in Unicode version X or earlier:

"Train: 🛲 " =~ /\A\p{age=3.1}*\z/ # => nil
"Train: 🛲 " =~ /\A\p{age=7.0}*\z/ # => 0

Combined/POSIX-like Properties

All properties of the POSIX brackets syntax are available with the \p syntax: For example, [[:print:]] simply becomes \p{print}. You can find the full list of properties in Episode 30: Regex with Class.
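A brief sketch of that equivalence, using a made-up string that contains a newline (which is not a printable character):

```ruby
sample = "a\nb"

bracket  = sample.scan(/[[:print:]]/) # POSIX bracket syntax
property = sample.scan(/\p{print}/)   # same class via \p

p bracket             # => ["a", "b"]
p property == bracket # => true
```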

Derived Core Properties

These can be found in the Unicode data file DerivedCoreProperties.txt, along with a comment on how each property is constructed. Possible values include Math, Alphabetic, Lowercase, and Uppercase.
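A few derived core properties in action. This is only a sketch with made-up sample strings, not a full list of what DerivedCoreProperties.txt defines:

```ruby
# Math: mathematical symbols such as + and =, but not digits.
math = "1 + 2 = 3".scan(/\p{Math}/)

# Alphabetic: letters, including non-ASCII ones such as U+00DC.
alphabetic = "\u00DCber 42".scan(/\p{Alphabetic}+/)

# Lowercase: characters with the derived Lowercase property.
lowercase = "Ruby".scan(/\p{Lowercase}/)

p math       # => ["+", "="]
p alphabetic # => ["Über"]
p lowercase  # => ["u", "b", "y"]
```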

Binary Properties

Other matchable character properties include binary ones such as White_Space and Hex_Digit.
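As a hedged illustration, here are two binary properties that the regexp engine should accept, Hex_Digit and White_Space. The sample strings are made up:

```ruby
# Hex_Digit: 0-9, a-f, A-F (and their fullwidth forms); "x" is not included.
hex_runs = "0xCAFE".scan(/\p{Hex_Digit}+/)

# White_Space also covers non-ASCII spaces such as U+00A0 (no-break space).
nbsp_index = "a\u00A0b" =~ /\p{White_Space}/

p hex_runs   # => ["0", "CAFE"]
p nbsp_index # => 1
```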
