Idiosyncratic Ruby: Ruby has Character

Ruby comes with good support for Unicode-related features. Read on if you want to learn more about important Unicode fundamentals and how to use them in Ruby…

…or just watch my talk from RubyConf 2017:
⑩ Unicode Characters You Should Know About as a 👩‍💻

Ruby ♡ Unicode

Characters in Unicode
- Codepoints & Encodings
- Grapheme Clusters
Normalization
- Confusables
Case-Mapping
- Case-Folding
Regex Unicode Properties
- Emoji Regex
Monospace Display Width
Unicode Special Codepoints
CLI Tools for Codepoint Analysis

Characters in Unicode

Unicode has come a long way and is now available in version 13.0 (core specification). The standard defines a lot of things related to characters; however, it is not always easy to grasp what a character actually is. Is Ǆ a single character or not? What about non-Latin languages?

We will need some more fine-grained concepts to distinguish and talk about characters in Unicode:

Codepoint: A base unit to construct characters from. Often this maps directly to a single character. Depending on the encoding, a codepoint might require multiple bytes.
Grapheme cluster: Smallest linguistic unit, a user-perceived character, constructed out of one or multiple codepoints.
Glyph: The actual rendered shape which represents the grapheme cluster
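To make the distinction concrete, here is a minimal sketch that measures the same string on three levels. It assumes a UTF-8 string and Ruby 2.5 or newer (for String#grapheme_clusters); the variable name is only for illustration:

```ruby
# "Ä" written as a base letter followed by a combining diaeresis
decomposed = "A\u{308}"

decomposed.bytesize                 # => 3 (bytes in UTF-8)
decomposed.length                   # => 2 (codepoints)
decomposed.grapheme_clusters.length # => 1 (user-perceived character)
```

How the single grapheme cluster is finally drawn (the glyph) is up to the font and the rendering software, so it cannot be inspected from within Ruby.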

Codepoints & Encodings

Codepoints are the base unit of Unicode: It is a number mapped to some meaning. Often this resolves to a single character:

"\u{41}" # => "A"
"\u{ABCD}" # => "ꯍ"
"\u{1F6A1}" # => "🚡"

There are 1114112 (in hexadecimal: 0x110000) different codepoints. On byte-level, a codepoint can be represented in different ways, which depends on the encoding used. Popular encodings for Unicode are UTF-8, UTF-16, and UTF-32, which all have different mechanisms of representing codepoints:

Codepoint	Decimal	Glyph	Bytes UTF-8	Bytes UTF-16LE	Bytes UTF-32LE
U+0041	65	`A`	41	41 00	41 00 00 00
U+ABCD	43981	`ꯍ`	EA AF 8D	CD AB	CD AB 00 00
U+1F6A1	128673	`🚡`	F0 9F 9A A1	3D D8 A1 DE	A1 F6 01 00

Here is an overview, without going into too much detail:

UTF-8 uses a dynamic number of bytes: While ASCII characters fit into a single byte, it can use up to 4 bytes for higher codepoints.
UTF-16 uses 2 bytes, if possible, but has a 4 byte mechanism to represent higher codepoints.
UTF-32 is a direct representation of the codepoint and always uses 4 bytes, no logic is involved. It is also a little lavish, because even the largest codepoint U+10FFFF only uses 21 bits of information. As a consequence, the most significant byte (the last one in UTF-32LE) is always 00.
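The byte columns of the table above can be reproduced with String#encode and String#bytes. The following is a small sketch; hex_bytes is a hypothetical helper name, not part of Ruby:

```ruby
# Returns the bytes of a string in the given encoding as a hex dump.
# (hex_bytes is a made-up helper for this example)
def hex_bytes(string, encoding)
  string.encode(encoding).bytes.map { |byte| "%02X" % byte }.join(" ")
end

cable_car = "\u{1F6A1}" # 🚡

hex_bytes(cable_car, "UTF-8")    # => "F0 9F 9A A1"
hex_bytes(cable_car, "UTF-16LE") # => "3D D8 A1 DE"
hex_bytes(cable_car, "UTF-32LE") # => "A1 F6 01 00"
```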

You can visualize and learn about encodings on the command-line with the unibits CLI utility.

The rest of this blog post will not deal with encodings and byte representations, but use codepoints as the smallest unit.

Grapheme Clusters

A user-perceived character might be constructed out of multiple codepoints. There are a lot of combining characters (like diacritics) which get combined with the previous character to form a new one:

"Ä" = U+0041 "A" + U+0308 "◌̈"

An example from the Thai language:

"กำ" = U+0E01 "ก" + U+0E33 " ำ"

Emoji are another example of grapheme clusters that require multiple codepoints:

"👨🏻‍🍳"¹ = U+1F468 "👨" + U+1F3FB "🏻" + U+200D "‍" + U+1F373 "🍳"

Ruby 2.5 introduced a convenient way to iterate through all grapheme clusters:

"abกำcd".grapheme_clusters # => ["a", "b", "กำ", "c", "d"]

There is also /\X/², a regex feature that you can use instead of the default /./ to match for grapheme clusters instead of codepoints:

"abกำcd".scan(/./) # => ["a", "b", "ก", "ำ", "c", "d"]
"abกำcd".scan(/\X/) # => ["a", "b", "กำ", "c", "d"]

¹ Depending on the recentness of your rendering software, this is displayed as a single male cook
² This regex matcher was already introduced in earlier versions of Ruby
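One place where this matters in practice is shortening strings: cutting by codepoint can separate a base letter from its combining mark. A minimal sketch, assuming Ruby 2.5 or newer; truncate_graphemes is a hypothetical helper name:

```ruby
# Keeps the first `max` grapheme clusters of a string.
# (truncate_graphemes is a made-up helper for this example)
def truncate_graphemes(string, max)
  string.each_grapheme_cluster.first(max).join
end

word = "A\u{308}rger" # "Ärger", with Ä as A + U+0308

word[0, 1]                  # => "A" (the diaeresis got cut off)
truncate_graphemes(word, 1) # => "Ä" (still A + U+0308)
```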

Normalization

Sometimes, the Unicode standard defines multiple ways to describe the same (or a very similar) glyph. Let us revisit the example from above: the German letter "Ä", which is an "A" with two dots above. It is defined as codepoint U+00C4. At the same time, there is a mechanism to put two dots above just any letter using the combining codepoint U+0308. Combine it with "A" and you get "Ä": a different representation, although semantically, it is the same character.

However, sometimes you need one canonical representation of a string. This is why the Unicode consortium came up with a normalization algorithm. It is included in Ruby's standard library and required automatically. There are several types of normalization forms:

Form	Description
*NFC*	Default. The C stands for composed, it uses the composed format for graphemes (if available).
*NFD*	The D stands for decomposed, it uses separate codepoints for such graphemes
*NFKC*	Like *NFC*, but uses compatibility mode, instead of canonical mode
*NFKD*	Like *NFD*, but uses compatibility mode, instead of canonical mode

NFC

"Ä".codepoints.map{ |c| "U+%04X"%c } # => ["U+00C4"]
"Ä".unicode_normalize.codepoints.map{|c| "U+%04X"%c }
# => ["U+00C4"]

"Ä".codepoints.map{ |c| "U+%04X"%c } # => ["U+0041", "U+0308"]
"Ä".unicode_normalize.codepoints.map{ |c| "U+%04X"%c }
# => ["U+00C4"]

"A²".codepoints.map{ |c| "U+%04X"%c } # => ["U+0041", "U+00B2"]
"A²".unicode_normalize.codepoints.map{ |c| "U+%04X"%c }
# =>  ["U+0041", "U+00B2"]

NFD

"Ä".codepoints.map{ |c| "U+%04X"%c } # => ["U+00C4"]
"Ä".unicode_normalize(:nfd).codepoints.map{ |c| "U+%04X"%c }
# => ["U+0041", "U+0308"]

"Ä".codepoints.map{ |c| "U+%04X"%c } # => ["U+0041", "U+0308"]
"Ä".unicode_normalize(:nfd).codepoints.map{ |c| "U+%04X"%c }
# => ["U+0041", "U+0308"]

"A²".codepoints.map{ |c| "U+%04X"%c } # => ["U+0041", "U+00B2"]
"A²".unicode_normalize(:nfd).codepoints.map{ |c| "U+%04X"%c }
# =>  ["U+0041", "U+00B2"]

NFKC

"Ä".codepoints.map{ |c| "U+%04X"%c } # => ["U+00C4"]
"Ä".unicode_normalize(:nfkc).codepoints.map{ |c| "U+%04X"%c }
# => ["U+00C4"]

"Ä".codepoints.map{ |c| "U+%04X"%c } # => ["U+0041", "U+0308"]
"Ä".unicode_normalize(:nfkc).codepoints.map{ |c| "U+%04X"%c }
# => ["U+00C4"]

"A²".codepoints.map{ |c| "U+%04X"%c } # => ["U+0041", "U+00B2"]
"A²".unicode_normalize(:nfkc).codepoints.map{ |c| "U+%04X"%c }
# =>  ["U+0041", "U+0032"]

NFKD

"Ä".codepoints.map{ |c| "U+%04X"%c } # => ["U+00C4"]
"Ä".unicode_normalize(:nfkd).codepoints.map{ |c| "U+%04X"%c }
# => ["U+0041", "U+0308"]

"Ä".codepoints.map{ |c| "U+%04X"%c } # => ["U+0041", "U+0308"]
"Ä".unicode_normalize(:nfkd).codepoints.map{ |c| "U+%04X"%c }
# => ["U+0041", "U+0308"]

"A²".codepoints.map{ |c| "U+%04X"%c } # => ["U+0041", "U+00B2"]
"A²".unicode_normalize(:nfkd).codepoints.map{ |c| "U+%04X"%c }
# =>  ["U+0041", "U+0032"]
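A typical reason to normalize is comparing strings that may come from different sources. The following sketch uses escape sequences to make both representations of "Ä" visible; the variable names are only for illustration:

```ruby
composed   = "\u{C4}"   # Ä as a single codepoint
decomposed = "A\u{308}" # Ä as A + combining diaeresis

composed == decomposed # => false

# Normalizing both sides (NFC by default) makes them comparable
composed.unicode_normalize == decomposed.unicode_normalize # => true

# You can also ask whether a string already is in a given form
decomposed.unicode_normalized?       # => false
decomposed.unicode_normalized?(:nfd) # => true
```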

See the standard and documentation for more details, including the differences between the normalization forms.

Special Case: Visual Confusable Characters

Even in normalization form, there are characters which look very similar (sometimes even identical):

Codepoints A	String A	String B	Codepoints B
U+003F + U+003F	`??`	`⁇`	U+2047
U+0043	`C`	`С`	U+0421
U+0031	`1`	`l`	U+006C

The record holder is LATIN SMALL LETTER O which is currently linked to 75 other characters that it could be confused with:

ం ಂ ം ං ० ੦ ૦ ௦ ౦ ೦ ൦ ๐ ໐ ၀ ‎٥‎ ۵ ｏ ℴ 𝐨 𝑜 𝒐 𝓸 𝔬 𝕠 𝖔 𝗈 𝗼 𝘰 𝙤 𝚘 ᴏ ᴑ ꬽ ο 𝛐 𝜊 𝝄 𝝾 𝞸 σ 𝛔 𝜎 𝝈 𝞂 𝞼 ⲟ о ჿ օ ‎ס‎ ‎ه‎ ‎𞸤‎ ‎𞹤‎ ‎𞺄‎ ‎ﻫ‎ ‎ﻬ‎ ﻪ‎ ‎ﻩ‎ ‎ھ‎ ‎ﮬ‎ ‎ﮭ‎ ‎ﮫ‎ ‎ﮪ‎ ‎ہ‎ ‎ﮨ‎ ‎ﮩ‎ ‎ﮧ‎ ‎ﮦ‎ ‎ە‎ ഠ ဝ 𐓪 𑣈 𑣗 𐐬

Detecting confusable characters is not built-in, but it is possible with the help of the unicode-confusable gem:

require "unicode/confusable"
Unicode::Confusable.confusable? "ℜ𝘂ᖯʏ", "Ruby" # => true

Case-Mapping

Another Unicode topic is converting a word from lowercase to uppercase or vice versa. Up until Ruby 2.3, string methods like #upcase, #capitalize, #downcase, or #swapcase would just not work with non-ASCII characters:

"ä".upcase # => "ä" # Ruby 2.3

This has been fixed and more recent versions of Ruby are able to do this out of the box:

"ä".upcase # => "Ä"

The old, ASCII-only behavior can be achieved by passing the :ascii option:

"ä".upcase(:ascii) # => "ä"

This is already much better than before, however, keep in mind that case-mapping is a locale-dependent operation! Not all languages use the same rules for converting between lower- and uppercase. For example, in most languages, the uppercase version of letter i is I:

"i".upcase # => "I"

However, in Turkic languages, it's the letter İ:

"i".upcase(:turkic) # => "İ"

Although Ruby supports locale-specific case-mapping rules, as of Ruby 2.5.1, only :turkic is supported. More options might be supported in the future.
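Two more things worth keeping in mind, shown as a short sketch for Ruby 2.4 or newer: full case-mapping can change the length of a string, and the :turkic option also affects the opposite direction:

```ruby
# One lowercase letter can map to two uppercase letters
"ß".upcase        # => "SS"
"ß".upcase.length # => 2

# Turkic languages pair dotted i with İ and dotless ı with I
"i".upcase(:turkic)   # => "İ" (U+0130)
"I".downcase(:turkic) # => "ı" (U+0131)
```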

Unicode® Standard Annex #44: Unicode Character Database / Section 5.6 (overview, see the respective sections in the Unicode standard itself)
RDoc: String#downcase

Special Case: Case-Folding

There is another special option that you can pass to the String#downcase method: the :fold symbol. It will turn on case-folding, which should be used instead of the default case-mapping behavior if you are interested in comparing/ordering strings. The case-folding algorithm might produce a different output than the case-mapping one. For example, the German letter sharp s should be treated like two normal s letters in comparisons:

"ẞ".downcase # => "ß"
"ẞ".downcase(:fold) # => "ss"

There is another String method in Ruby core which makes use of case-folding: String#casecmp?¹ which compares two strings ignoring their case:

"A".casecmp? "a" # => true
"ẞ".casecmp? "ss" # => true

¹ You should pay attention that its sister method String#casecmp only uses ASCII, despite the similar naming.
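The difference between the three approaches becomes visible with a word that contains a sharp s. This is only a sketch; same_ignoring_case? is a hypothetical helper name:

```ruby
# Compares two strings by their case-folded form.
# (same_ignoring_case? is a made-up helper for this example)
def same_ignoring_case?(a, b)
  a.downcase(:fold) == b.downcase(:fold)
end

"Straße".downcase(:fold)                 # => "strasse"
same_ignoring_case?("Straße", "STRASSE") # => true
"Straße".casecmp?("STRASSE")             # => true

# String#casecmp only folds ASCII letters, so it reports a difference
"Straße".casecmp("STRASSE") == 0         # => false
```

Storing the folded form (for example, as a hash key) is useful when you need to look up many strings case-insensitively.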

Regex Unicode Property Matching

Ruby's regex engine supports matching of Unicode characteristics, like a character's general purpose (general category), its script, or in which codepoint range it is defined (block):

"String, with: punctuation.".scan(/\p{P}/) # => [",", ":", "."]
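Scripts and other general categories work the same way. A small sketch with a mixed-script sample string (the string itself is just made up for this example):

```ruby
text = "abc αβγ абв 123"

# Script properties
text.scan(/\p{Latin}+/)    # => ["abc"]
text.scan(/\p{Greek}+/)    # => ["αβγ"]
text.scan(/\p{Cyrillic}+/) # => ["абв"]

# General category Nd (decimal digit numbers)
text.scan(/\p{Nd}+/)       # => ["123"]
```

Note that the digits are not matched by \p{Latin}, since they belong to the script "Common".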

See my previous articles for more details:

Episode 41: Proper Unicoding - More about the Unicode property syntax
Episode 30: Regex with Class - Unicode behavior of regex matchers & POSIX-style character classes

Special Case: Emoji Matching

Detecting emoji is especially complicated, because there are multiple mechanisms to build up the final emoji glyph. You can use the unicode-emoji gem to find all kinds of emoji:

require "unicode/emoji"
"😴 🛌🏽 🇵🇹 🤾🏽‍♀️".scan(Unicode::Emoji::REGEX) # => ["😴", "🛌🏽", "🇵🇹", "🤾🏽‍♀️"]

Monospace Display-Width

Sometimes, you might find yourself in a situation where you would like to know the width of a character. But this is not easily possible, because the character width is just not defined! This, of course, leads to problems in fixed-width environments like terminals.

If you don't believe me, here are some wide characters for you to check out:

Codepoint	Glyph	Name
U+1242B	𒐫	CUNEIFORM NUMERIC SIGN NINE SHAR2
U+12219	𒈙	CUNEIFORM SIGN LUGAL OPPOSING LUGAL
U+A9C4	꧄	JAVANESE PADA MADYA
U+2E3B	⸻	THREE-EM DASH
U+2031	‱	PER TEN THOUSAND SIGN

To complicate things further, some characters are marked as ambiguous (in terms of East Asian width) and get displayed wide or narrow, depending on the software displaying them. The unicode-display_width gem can help:

require "unicode/display_width"

Unicode::DisplayWidth.of("⚀") # => 1
Unicode::DisplayWidth.of("一") # => 2

# Ambiguous example
Unicode::DisplayWidth.of("·", 1) # => 1
Unicode::DisplayWidth.of("·", 2) # => 2

Unicode Special Codepoints

The last section will put the focus on four types of codepoints that require some attention. This is just a selection. There are many more notable codepoints, and a good starting point to dig deeper is the Awesome Codepoints list!

Invalid Codepoints

There are two kinds of codepoints which are invalid. If you have these in your data, the data is invalid and String#valid_encoding? will return false. Both of them are encoding-related:

UTF-16 Surrogates

The four byte mechanism that UTF-16 uses to represent codepoints higher than U+FFFF (= 65 535) needs auxiliary codepoints. These are U+D800..U+DFFF and they are strictly forbidden in UTF-8 and UTF-32.
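The arithmetic behind this mechanism is short enough to sketch: subtract 0x10000, then split the remaining 20 bits into two halves of 10 bits each. surrogate_pair is a hypothetical helper name and expects a codepoint above U+FFFF:

```ruby
# Splits a codepoint above U+FFFF into its UTF-16 surrogate pair.
# (surrogate_pair is a made-up helper for this example)
def surrogate_pair(codepoint)
  offset = codepoint - 0x10000
  high = 0xD800 + (offset >> 10)   # upper 10 bits
  low  = 0xDC00 + (offset & 0x3FF) # lower 10 bits
  [high, low]
end

surrogate_pair(0x1F6A1).map { |unit| "%04X" % unit }
# => ["D83D", "DEA1"]

# Ruby's own UTF-16 encoder produces the same two code units
"\u{1F6A1}".encode("UTF-16BE").unpack("n*") == surrogate_pair(0x1F6A1)
# => true
```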

Too Large Codepoints

Any codepoint above U+10FFFF (= 1 114 111) is not allowed. The theoretical UTF-32 maximum is U+FFFFFFFF (= 4 294 967 295) and four-byte UTF-8 could represent codepoints up to U+1FFFFF (= 2 097 151).

Ruby does not let you create these from literals:

"\u{D800}" # => SyntaxError: (irb):52: invalid Unicode codepoint
"\u{110000}" # => SyntaxError: (irb):54: invalid Unicode codepoint (too large)

But if you really need to, you can use Array#pack:

[0xD800].pack("U") # => "\xED\xA0\x80"
[0x110000].pack("U") # => "\xF4\x90\x80\x80"

Ruby also includes a useful method that removes all invalid bytes, for example, surrogates:

"a\xED\xA0\x80b" # => "a\xED\xA0\x80b"
"a\xED\xA0\x80b".scrub # => "a���b"
"a\xED\xA0\x80b".scrub("") # => "ab"
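Combined with String#valid_encoding?, this gives you a simple way to check incoming data before processing it further. A minimal sketch; the variable names are only for illustration:

```ruby
raw = "a\xED\xA0\x80b" # contains an encoded surrogate (U+D800)

raw.valid_encoding? # => false

# Only scrub when needed, so valid strings are passed through unchanged
clean = raw.valid_encoding? ? raw : raw.scrub("?")

clean.valid_encoding? # => true
```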

Unstandardized Codepoints

Another group of codepoints that require extra care are the unstandardized ones. When you look at the following diagram, you will see that a lot of codepoints actually do not have a meaning assigned by the consortium (yet):

Codepoint Distribution as of Unicode 10

Types of Unstandardized Codepoints

Private-Use Codepoints: Meant for custom allocations by anyone. You will find vendor logos here, for example, U+F8FF for the Apple logo and U+F200 for the Ubuntu logo (both may only display correctly on the respective operating system with a proper font). Other uses of the private plane include assigning codepoints to fantasy languages like Tengwar by J.R.R. Tolkien.
Non-Characters: A handful of codepoints that will never be assigned. Unlike invalid codepoints, they are allowed to be used in your data, but they have no meaning.
Reserved Codepoints: Will (or might) be assigned in a later version of Unicode.

Type	Count	Codepoints	Ruby Regex
Private-Use	137 468¹	U+E000..U+F8FF, U+F0000..U+FFFFD, U+100000..U+10FFFD	`/\p{private use}/`
Non-Characters	66	U+FDD0..U+FDEF and the last two codepoints of each plane: U+XFFFE, U+XFFFF	`/\p{nchar}/`
Reserved	837 775	(not yet assigned)	`/\p{unassigned}(?<!\p{nchar})/`

¹ Two additional private-use codepoints are U+0091 and U+0092, but they are counted as control characters (see next section)
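The regexes from the table can be combined into a small classifier. This is only a sketch: codepoint_kind is a hypothetical helper name, and it uses the general category shorthands \p{Co} (private use) and \p{Cn} (unassigned) next to \p{nchar}. Non-characters have to be checked before \p{Cn}, because they are part of that category:

```ruby
# Roughly classifies the first codepoint of a string.
# (codepoint_kind is a made-up helper for this example)
def codepoint_kind(char)
  case char
  when /\A\p{Co}/    then :private_use
  when /\A\p{nchar}/ then :non_character
  when /\A\p{Cn}/    then :reserved
  else :assigned
  end
end

codepoint_kind("\u{F8FF}") # => :private_use
codepoint_kind("\u{FDD0}") # => :non_character
codepoint_kind("A")        # => :assigned
```

Which codepoints count as reserved depends on the Unicode version your Ruby was built with.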

Unicode: Private-Use Characters, Noncharacters & Sentinels FAQ

Control Characters

For historical reasons, Unicode includes a set of 65 control codepoints. They were not defined by the Unicode Consortium and a lot of them are not universally standardized. However, some of them are extremely common, such as U+0009, the tab-stop character. The set also contains the newline characters U+000A "\n" and U+000D "\r"; depending on your operating system, one or both of them are used for a newline.

Control characters are divided into the two sections C0, covering U+0000..U+001F, and C1, covering U+0080..U+009F. Furthermore, the delete character U+007F ␡ is also considered to be a control character.

In regexes, you can match for control characters with \p{control} or just \p{cc}.
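A short sketch to try this out: since all control characters live below U+0100, scanning that range is enough to confirm the number 65 (32 in C0, 32 in C1, plus the delete character). The variable name is only for illustration:

```ruby
"a\tb\r\n".scan(/\p{Cc}/) # => ["\t", "\r", "\n"]

# Collect every control codepoint in U+0000..U+00FF
control_codepoints = (0..0xFF).select { |codepoint|
  codepoint.chr(Encoding::UTF_8).match?(/\p{Cc}/)
}

control_codepoints.size          # => 65
control_codepoints.include?(0x7F) # => true (DELETE)
control_codepoints.include?(0x20) # => false (SPACE)
```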

List of C0 Control Codepoints

Codepoint	Symbol	Ruby Escape	Name
U+0000	␀ NUL	`\0`	NULL
U+0001	␁ SOH	`\u{1}`	START OF HEADING
U+0002	␂ STX	`\u{2}`	START OF TEXT
U+0003	␃ ETX	`\u{3}`	END OF TEXT
U+0004	␄ EOT	`\u{4}`	END OF TRANSMISSION
U+0005	␅ ENQ	`\u{5}`	ENQUIRY
U+0006	␆ ACK	`\u{6}`	ACKNOWLEDGE
U+0007	␇ BEL	`\a`	ALERT
U+0008	␈ BS	`\b`	BACKSPACE
U+0009	␉ HT	`\t`	CHARACTER TABULATION
U+000A	␊ LF	`\n`	LINE FEED
U+000B	␋ VT	`\v`	LINE TABULATION
U+000C	␌ FF	`\f`	FORM FEED
U+000D	␍ CR	`\r`	CARRIAGE RETURN
U+000E	␎ SO	`\u{e}`	SHIFT OUT
U+000F	␏ SI	`\u{f}`	SHIFT IN
U+0010	␐ DLE	`\u{10}`	DATA LINK ESCAPE
U+0011	␑ DC1	`\u{11}`	DEVICE CONTROL ONE
U+0012	␒ DC2	`\u{12}`	DEVICE CONTROL TWO
U+0013	␓ DC3	`\u{13}`	DEVICE CONTROL THREE
U+0014	␔ DC4	`\u{14}`	DEVICE CONTROL FOUR
U+0015	␕ NAK	`\u{15}`	NEGATIVE ACKNOWLEDGE
U+0016	␖ SYN	`\u{16}`	SYNCHRONOUS IDLE
U+0017	␗ ETB	`\u{17}`	END OF TRANSMISSION BLOCK
U+0018	␘ CAN	`\u{18}`	CANCEL
U+0019	␙ EM	`\u{19}`	END OF MEDIUM
U+001A	␚ SUB	`\u{1a}`	SUBSTITUTE
U+001B	␛ ESC	`\e`	ESCAPE
U+001C	␜ FS	`\u{1c}`	INFORMATION SEPARATOR FOUR
U+001D	␝ GS	`\u{1d}`	INFORMATION SEPARATOR THREE
U+001E	␞ RS	`\u{1e}`	INFORMATION SEPARATOR TWO
U+001F	␟ US	`\u{1f}`	INFORMATION SEPARATOR ONE

List of C1 Control Codepoints

Codepoint	Symbol	Ruby Escape	Name
U+0080	PAD	`\u{80}`	PADDING CHARACTER
U+0081	HOP	`\u{81}`	HIGH OCTET PRESET
U+0082	BPH	`\u{82}`	BREAK PERMITTED HERE
U+0083	NBH	`\u{83}`	NO BREAK HERE
U+0084	IND	`\u{84}`	INDEX
U+0085	NEL¹	`\u{85}`	NEXT LINE¹
U+0086	SSA	`\u{86}`	START OF SELECTED AREA
U+0087	ESA	`\u{87}`	END OF SELECTED AREA
U+0088	HTS	`\u{88}`	CHARACTER TABULATION SET
U+0089	HTJ	`\u{89}`	CHARACTER TABULATION WITH JUSTIFICATION
U+008A	VTS	`\u{8a}`	LINE TABULATION SET
U+008B	PLD	`\u{8b}`	PARTIAL LINE FORWARD
U+008C	PLU	`\u{8c}`	PARTIAL LINE BACKWARD
U+008D	RI	`\u{8d}`	REVERSE LINE FEED
U+008E	SS2	`\u{8e}`	SINGLE SHIFT TWO
U+008F	SS3	`\u{8f}`	SINGLE SHIFT THREE
U+0090	DCS	`\u{90}`	DEVICE CONTROL STRING
U+0091	PU1	`\u{91}`	PRIVATE USE ONE
U+0092	PU2	`\u{92}`	PRIVATE USE TWO
U+0093	STS	`\u{93}`	SET TRANSMIT STATE
U+0094	CCH	`\u{94}`	CANCEL CHARACTER
U+0095	MW	`\u{95}`	MESSAGE WAITING
U+0096	SPA	`\u{96}`	START OF GUARDED AREA
U+0097	EPA	`\u{97}`	END OF GUARDED AREA
U+0098	SOS	`\u{98}`	START OF STRING
U+0099	SGC	`\u{99}`	SINGLE GRAPHIC CHARACTER INTRODUCER
U+009A	SCI	`\u{9a}`	SINGLE CHARACTER INTRODUCER
U+009B	CSI	`\u{9b}`	CONTROL SEQUENCE INTRODUCER
U+009C	ST	`\u{9c}`	STRING TERMINATOR
U+009D	OSC	`\u{9d}`	OPERATING SYSTEM COMMAND
U+009E	PM	`\u{9e}`	PRIVACY MESSAGE
U+009F	APC	`\u{9f}`	APPLICATION PROGRAM COMMAND

¹ The NEXT LINE control character was introduced to have a universal codepoint for newlines. This goal was not reached. Still, on some systems (for example, my Ubuntu machine), it will actually create a newline!

The characteristics gem lets you check if a codepoint belongs to a specific control group:

require "characteristics"

Characteristics.create("\u{80}").c0? # => false
Characteristics.create("\u{80}").c1? # => true

Ignorable Codepoints

My last example of special codepoints are the so-called ignorable codepoints. Their meaning varies, but most of them are invisible and they are often not treated as whitespace by Unicode. They are ignorable in the sense that if your Unicode rendering engine does not know how to display one, it should just display nothing. The ignorable property is even given to some ranges of unassigned codepoints¹ (which is usually not done).

You can check for ignorable codepoints using the /\p{default ignorable code point}/ (or its shorthand \p{di}) regex.

For example, the following piece of code is made out of tag characters, which resemble all ASCII characters, but as ignorable characters:

eval "󠁰󠁵󠁴󠁳󠀠󠀧󠁉󠁤󠁩󠁯󠁳󠁹󠁮󠁣󠁲󠁡󠁴󠁩󠁣󠀠󠁕󠁮󠁩󠁣󠁯󠁤󠁥󠀧".codepoints.map{ |c| c - 0xE0000 }.pack("U*")

This program will output Idiosyncratic Unicode

¹ The whole range of U+E0000..U+E0FFF is ignorable!
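If you would like to build such an invisible string yourself, the mapping is a plain offset of 0xE0000 in both directions. A minimal sketch; to_tags and from_tags are hypothetical helper names and expect ASCII-only input:

```ruby
TAG_OFFSET = 0xE0000

# Shifts every ASCII codepoint into the tag character range.
# (to_tags and from_tags are made-up helpers for this example)
def to_tags(ascii)
  ascii.codepoints.map { |codepoint| codepoint + TAG_OFFSET }.pack("U*")
end

# Reverses the shift to recover the ASCII text
def from_tags(tags)
  tags.codepoints.map { |codepoint| codepoint - TAG_OFFSET }.pack("U*")
end

hidden = to_tags("puts 'hi'")

hidden.match?(/\A\p{Default_Ignorable_Code_Point}+\z/) # => true
from_tags(hidden)                                      # => "puts 'hi'"
```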

CLI Tools for Codepoint Analysis

I hope that you are now ready to closely inspect your own Unicode data! To help you do so, I made a few command-line tools, which I hope you will like:

uniscribe for codepoint analysis
unibits for encoding analysis, also supports a lot of non-Unicode encodings
unicopy for converting & copying codepoints

Also See

character.construction
