Special Characters: Matching the Protocol and Hostname

See how to write regular expressions in JavaScript that match the protocol and hostname of a URL.

Regular expression to match the protocol

This part is easy. Our testing strings are quite simple and only use a handful of protocols, so we can say a protocol is either the string http, https, or ftp, followed by the “://” portion. This is how we can write that in a RegExp:

/(https?|ftp):\/\//g

To double-check what we’ve seen so far (and the new part with the OR operator), let’s break it down:

  • We’re matching both http and https by adding the ? character after the “s” in https, which makes that last “s” optional.

  • We’re using the OR operator (the | character) to let the parser know we’re good with the expression on either side of it inside the parentheses. In other words, groups can contain subexpressions, and you can let the parser know you’re looking for one of several options inside a string.

  • Finally, we’re matching the “://” portion of the protocol by escaping each slash because, if you remember correctly, the slash is the character that delimits the start and end of a regular expression literal in JavaScript (and in other languages as well). If we didn’t escape them, the parser would treat the first unescaped slash as the end of the expression instead of as part of the string to match.
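As a quick check of the breakdown above, here is a minimal sketch that runs the protocol expression against a few strings. The sample URLs are made up for illustration and are not part of the lesson's test data.

```javascript
// Protocol expression from the lesson.
const protocolRegex = /(https?|ftp):\/\//g;

// Hypothetical sample strings, chosen only to exercise each branch.
const sampleUrls = [
  "http://www.myhost.com",
  "https://my-host.com",
  "ftp://files.myhost.com",
  "mailto:someone@myhost.com", // no "://" protocol part, so no match is expected
];

// With the g flag, match() returns an array of full matches, or null if there are none.
const protocolMatches = sampleUrls.map((url) => url.match(protocolRegex));

console.log(protocolMatches);
```

The first three strings each yield a single match (the protocol plus “://”), and the last one yields null because mailto is not one of the alternatives in the group.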

Regular expression to match the hostname

Now the hostname should have an optional www, the name of your server, and a .com to complete it (hostnames are a bit more complicated than that, but, for the sake of simplicity, we’re keeping the list of conditions to a minimum). So, an expression like the following might help:

/(www\.)?([a-z-]+)\.com/g

Again, breaking things down:

  • The optional www is represented by making the first group optional (remember that adding the ? after it will let the parser know we’re looking for either one or zero instances of that group).

  • The server name is represented by the character class in the middle. We’re allowing one or more characters (that’s the + after the class), each of which can be any letter from a to z or a dash (-).

  • The final mandatory .com is represented by the ending, \.com.
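To see which part of a string each group captures, here is a small sketch. It assumes two made-up hostnames and drops the g flag, because without it match() also returns the captured groups.

```javascript
// Same hostname expression as in the lesson, minus the g flag,
// so that match() exposes the capture groups.
const hostnameRegex = /(www\.)?([a-z-]+)\.com/;

// Hypothetical hostnames used only for illustration.
const withWww = "www.my-host.com".match(hostnameRegex);
const withoutWww = "myhost.com".match(hostnameRegex);

// Index 0 is the full match, 1 is the optional www group, 2 is the server name.
console.log(withWww[0], withWww[1], withWww[2]);
console.log(withoutWww[0], withoutWww[1], withoutWww[2]);
```

When the optional www group doesn’t take part in the match, its slot in the result is undefined, while the server name group is always filled in.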

Notice how we had to escape the dot in both places. If we hadn’t, the expression would still have matched valid hostnames because the dot, as you might remember, is a wildcard for almost any character. The problem is that it would have matched too much: without the escaping, the following string would’ve also matched our expression (which is something we don’t want): wwwmyhostcom.
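The difference is easy to verify with a short sketch that compares the escaped expression against a variant with the dots left unescaped. The variable names are only illustrative.

```javascript
// The lesson's expression, and the same expression with the dots left unescaped.
const escapedDots = /(www\.)?([a-z-]+)\.com/;
const unescapedDots = /(www.)?([a-z-]+).com/;

const validHostname = "www.myhost.com";
const notAHostname = "wwwmyhostcom";

// Both versions accept the valid hostname.
console.log(escapedDots.test(validHostname));   // true
console.log(unescapedDots.test(validHostname)); // true

// Only the unescaped version accepts the string without any dots,
// because each bare dot matches an ordinary letter.
console.log(escapedDots.test(notAHostname));    // false
console.log(unescapedDots.test(notAHostname));  // true
```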

Look at the following screenshot and see how regex101.com helps you understand which parts of your expression match different parts of your testing strings.
