Ruby's URI standard library contains a very sophisticated regex for matching URLs:
"At https://idiosyncratic-ruby.com you can learn about the " \
"obscure parts of Ruby"[URI.regexp]
# => "https://idiosyncratic-ruby.com"
This regex is built in uri/rfc2396_parser.rb and looks like this:
$ ruby -v
ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-linux]
$ ruby -r uri -e 'p URI.regexp'
/
([a-zA-Z][\-+.a-zA-Z\d]*): (?# 1: scheme)
(?:
((?:[\-_.!~*'()a-zA-Z\d;?:@&=+$,]|%[a-fA-F\d]{2})(?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*) (?# 2: opaque)
|
(?:(?:
\/\/(?:
(?:(?:((?:[\-_.!~*'()a-zA-Z\d;:&=+$,]|%[a-fA-F\d]{2})*)@)? (?# 3: userinfo)
(?:((?:(?:[a-zA-Z0-9\-.]|%\h\h)+|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:(?:[a-fA-F\d]{1,4}:)*[a-fA-F\d]{1,4})?::(?:(?:[a-fA-F\d]{1,4}:)*(?:[a-fA-F\d]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))?)\]))(?::(\d*))?))? (?# 4: host, 5: port)
|
((?:[\-_.!~*'()a-zA-Z\d$,;:@&=+]|%[a-fA-F\d]{2})+) (?# 6: registry)
)
|
(?!\/\/)) (?# XXX: '\/\/' is the mark for hostport)
(\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*(?:\/(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*(?:;(?:[\-_.!~*'()a-zA-Z\d:@&=+$,]|%[a-fA-F\d]{2})*)*)*)? (?# 7: path)
)(?:\?((?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))? (?# 8: query)
)
(?:\#((?:[\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]|%[a-fA-F\d]{2})*))? (?# 9: fragment)
/x
Further Reading
- Simpler regex to match a lot of URIs (you should increase the maximum length of TLDs, though)
- RFC 2396
- RFC 3986
More Idiosyncratic Ruby
- Please Comment on GitHub
- Next Article: Literate Ruby
- Previous Article: Better Standards