You are here for a collection of 10 advanced features of regular expressions in Ruby!
Regex Conditionals
Regular expressions can have embedded conditionals (if-then-else) with (?ref)then|else
. "ref" stands for a group reference (number or name of a capture group):
# will match everything if string contains "ä", or only match first two chars
regex = /(.*ä)?(?(1).*|..)/
"Ruby"[regex] #=> "Ru"
"Idiosyncrätic"[regex] #=> "Idiosyncrätic"
Keep Expressions
The possible ways to look around within a regex are:
Syntax | Description | Example |
---|---|---|
(?=X) |
Positive lookahead | "Ruby"[/.(?=b)/] #=> "u" |
(?!X) |
Negative lookahead | "Ruby"[/.(?!u)/] #=> "u" |
(?<=X) |
Positive lookbehind | "Ruby"[/(?<=u)./] #=> "b" |
(?<!X) |
Negative lookbehind | "Ruby"[/(?<!R|^)./] #=> "b" |
But Ruby also has "Keep Expressions", an additional shortcut syntax to do positive lookbehinds using \K
:
"Ruby"[/Ru\Kby/] #=> "by"
"Ruby"[/ru\Kby/] #=> nil
Character Class Intersections
You can nest character classes and AND-connect them with &&
. Matching all non-vowels here:
"Idiosyncratic".scan /[[a-z]&&[^aeiou]]+/
# => ["d", "syncr", "t", "c"]
Regex Sub-Expressions
You can recursively apply regex groups again with \g<ref>
. "ref" stands for a group reference (number or name of a capture group). This is different from back-references (\1
.. \9
), which will re-match the already matched string, instead of executing the regex again:
# match any number of sequences of 3 identical chars
regex = /((.)\2{2})\g<1>*/
"aaa"[regex] #=> "aaa"
"abc"[regex] #=> nil
"aaab"[regex] #=> "aaa"
"aaabbb"[regex] #=> "aaabbb"
"aaabbbc"[regex] #=> "aaabbb"
"aaabbbccc"[regex] #=> "aaabbbccc"
Match Characters that Belong Together
\X
treats combined characters as a single character. See grapheme clusters for more information.
string = "R\u{030A}uby"
string[/./] #=> "R"
string[/.../] #=> "R̊u"
string[/\X\X/] #=> "R̊u"
Relative Back-References
Back-refs can be relatively referenced from the current position via \k<-n>
:
"Ruby by"[/(R)(u)(by) \k<-1>/] #=> "Ruby by"
Deactivate Backtracking
Atomic groups, defined via (?>X)
, will always try to match the first of all alternatives:
"Rüby"[/R(u*|ü)by/] #=> "Rüby"
"Rüby"[/R(?>u*|ü)by/] #=> nil
Turn On Unicode-Matching for \w
, \d
, \s
, and \b
"Rüby"[/\w*/] #=> "R"
"Rüby"[/(?u)\w*/] #=> "Rüby"
Continue Matching at Last Match Position
When using a method that matches a regex multiple times against a string (like String#gsub
or String#scan
), you can reference the position of the last match via \G
:
"abc1abc22abc333".scan /\Gabc./ # => ["abc1", "abc2"]
String#split
with Capture Groups
The normal way of using String#split
is this:
"0-0".split(/-/) #=> ["0", "0"]
But if you want to make your code as hard to read as possible, remember that captured groups will be added to the resulting array:
"0-0".split(/(-)/) #=> ["0", "-", "0"]
"0-0".split(/-(?=(.))/) #=> ["0", "0", "0"]
"0-0".split(/(((-)))/) #=> ["0", "-", "-", "-", "0"]
Resources
More Idiosyncratic Ruby
- Please Comment on GitHub
- Next Article: More Inspections
- Previous Article: Know your Environment