Idiosyncratic Ruby: US-ASCII-8BIT

How come Ruby has two ASCII encodings?

Encoding.name_list.grep(/ASCII/)
# => ["ASCII-8BIT", "US-ASCII"]

Which one is the normal one you should use for ASCII?

Aliases

Encoding	Aliases
ASCII-8BIT	BINARY
US-ASCII	ASCII, ANSI_X3.4-1968, 646
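
One way to double-check these aliases yourself is to look the names up with Encoding.find. A small sketch (the variable names are only for illustration):

```ruby
# BINARY is a constant pointing at the very same encoding as ASCII_8BIT.
binary_is_alias = Encoding::BINARY == Encoding::ASCII_8BIT # => true

# Every US-ASCII alias resolves to the US-ASCII encoding object.
us_ascii_aliases = ["ASCII", "ANSI_X3.4-1968", "646"]
all_us_ascii = us_ascii_aliases.all? do |name|
  Encoding.find(name) == Encoding::US_ASCII
end
# => true
```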

So ASCII is just an alias for US-ASCII, but then what is ASCII-8BIT for? The RDoc for Encoding offers some help:

Encoding::ASCII_8BIT is a special encoding that is usually
used for a byte string, not a character string. But as the name insists,
its characters in the range of ASCII are considered as ASCII characters.
This is useful when you use ASCII-8BIT characters with other ASCII
compatible characters.

So basically, it is not a real character encoding, but represents an arbitrary stream of bytes (each with a value between 0 and 255). It is used for raw byte streams, or when you want to make clear that you do not know a string's encoding!
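
A rough sketch of what that means in practice, assuming a UTF-8 string as the starting point (all variable names are made up for this example): String#b returns a copy tagged as ASCII-8BIT, which is then counted in bytes instead of characters. Since bytes in the ASCII range still count as ASCII characters, such a string can be combined with a UTF-8 string as long as one side is ASCII-only.

```ruby
text = "\u00E9"                  # "é" in UTF-8: one character, two bytes
raw  = text.b                    # copy of the same bytes, tagged as ASCII-8BIT

char_count = text.length         # => 1
byte_count = raw.length          # => 2
raw_valid  = raw.valid_encoding? # => true, any byte sequence is valid here

# ASCII-only binary data mixes fine with other ASCII-compatible strings ...
mixed = "abc".b + text

# ... but non-ASCII bytes on both sides cannot be reconciled.
begin
  "\xFF".b + text
  clash = false
rescue Encoding::CompatibilityError
  clash = true                   # => true
end
```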

The ASCII charset only takes 7 bits, so in strict ASCII, the 8th bit should never be set: the allowed byte values range from 0 to 127. This is what the US-ASCII encoding is all about. It is used when dealing with ASCII-encoded strings. Think: "ASCII-7BIT".

A simple example illustrating the difference:

out_of_ascii_range = 128.chr # => "\x80"
# force_encoding only relabels the string in place, the bytes stay untouched
out_of_ascii_range.force_encoding("US-ASCII").valid_encoding? # => false
out_of_ascii_range.force_encoding("ASCII-8BIT").valid_encoding? # => true
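
The same check can be stretched over every possible byte value. This is only a sketch with illustrative variable names: it builds each single-byte string with Array#pack and relabels it as US-ASCII or ASCII-8BIT.

```ruby
# Split all byte values by whether they are valid once relabeled as US-ASCII.
valid, invalid = (0..255).partition do |byte|
  [byte].pack("C").force_encoding(Encoding::US_ASCII).valid_encoding?
end

valid.minmax   # => [0, 127]
invalid.minmax # => [128, 255]

# Relabeled as ASCII-8BIT, every byte value is valid.
all_valid_as_binary = (0..255).all? do |byte|
  [byte].pack("C").force_encoding(Encoding::ASCII_8BIT).valid_encoding?
end
# => true
```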
