HTML Encoding

HTML encoding can help us defend against XSS attacks. Let's see how.

Now let’s consider how we can defend against XSS. A frequently suggested defense that doesn’t work is to strip out the < and > characters. One problem with this defense is that people sometimes need to discuss dangerous inputs. Readers of this course, for example, may want to discuss XSS payloads on a web-based forum, and stripping out < and > would mangle those conversations. Also, we’ll see that not every XSS attack needs < or >.
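Both problems with stripping can be seen in a small sketch. The function name strip_angle_brackets and the two sample inputs are made up for illustration, and the second input assumes the value ends up inside a double-quoted HTML attribute, which is only one of several contexts where no < or > is needed.

```python
def strip_angle_brackets(text: str) -> str:
    """Naive filter (hypothetical): delete every < and > from the input."""
    return text.replace("<", "").replace(">", "")


# A legitimate forum comment that discusses markup is damaged.
comment = "Wrap the word in <b> and </b> to make it bold."
print(strip_angle_brackets(comment))
# Wrap the word in b and /b to make it bold.

# A payload aimed at an attribute value contains no angle brackets,
# so the filter returns it unchanged.
payload = '" onmouseover="alert(1)'
print(strip_angle_brackets(payload) == payload)
# True
```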

HTML encoding

Before we look at its application for defense, let’s take a look at how HTML encoding works. In the previous paragraph, we touched on an interesting problem in HTML. We use < and > to make HTML tags in our web pages. But what if HTML tags are what we want to talk about in the content of our web pages? At first glance, it would seem that we can’t do that because writing about tags would insert tags into our HTML documents and the tags themselves wouldn’t be displayed. Fortunately, HTML’s authors thought of this and provided a mechanism for allowing discussions of HTML itself in HTML.

Most of the time, the content of an HTML document will consist of literal characters, which get rendered into exactly the characters that make up the source. So HTML markup like this:

<div>abcdefg</div>

gets rendered like this:

abcdefg

Each character inside the div gets rendered just as it appears in the source.

Character references

But there is another kind of character in HTML called a character reference. Character references are rendered differently than they appear in source. Character references play two roles in HTML. One role is that they allow you to create content in non-Western languages even if you’re using a Western keyboard. The second role is that they allow you to create content that displays key HTML characters like &, <, >, and " when rendered by a browser. This second role is exactly what we need to defend ourselves from HTML injection and XSS attacks.

HTML has two kinds of character references: named character references and numeric character references. All HTML character references start with an ampersand and end with a semicolon. Named character references have a mnemonic in the middle, such as lt for the less-than sign. Numeric character references have a # followed by a Unicode code point in the middle. The code point can be written in decimal (&#60;) or, with an x after the #, in hex (&#x3C;). Named character references only exist for a set of the most commonly used characters. Numeric character references exist for each Unicode character.
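As a rough illustration of these rules, the sketch below builds all three spellings for a single character. The helper name character_references is hypothetical. It relies on codepoint2name from Python's standard html.entities module, which only covers the older HTML 4 set of names, so a character may be reported as having no name even though a newer name exists for it.

```python
from html.entities import codepoint2name


def character_references(ch: str) -> dict:
    """Return the named, decimal, and hex references for one character."""
    code_point = ord(ch)
    # Look up the mnemonic, e.g. 60 -> "lt"; None when no name is listed.
    name = codepoint2name.get(code_point)
    return {
        "named": f"&{name};" if name else None,
        "decimal": f"&#{code_point};",
        "hex": f"&#x{code_point:X};",
    }


print(character_references("<"))
# {'named': '&lt;', 'decimal': '&#60;', 'hex': '&#x3C;'}
print(character_references("A"))
# {'named': None, 'decimal': '&#65;', 'hex': '&#x41;'}
```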

Any character can be encoded this way. Let’s take a look at four examples. In this table, each row shows a rendered character in the leftmost column followed by three different ways of writing the character in the source of an HTML page.

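| Rendered Character | Named Character Reference | Decimal Numeric Character Reference | Hex Numeric Character Reference |
| --- | --- | --- | --- |
| & | &amp; | &#38; | &#x26; |
| < | &lt; | &#60; | &#x3C; |
| > | &gt; | &#62; | &#x3E; |
| " | &quot; | &#34; | &#x22; |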
