HTML, XML, and Unicode

HTML enables us to mark up the structure of a human-readable web document or web user interface. XML, in contrast, enables us to mark up the structure of all kinds of documents, data files, and messages, whether they’re human-readable or not. XML can also be used as the basis for defining a version of HTML called XHTML.

XML provides a syntax for expressing structured information in the form of an XML document, which consists of nested elements and their attributes. The specific elements and attributes used in an XML document can come from any vocabulary, such as public standards, private standards, or user-defined XML formats. XML is used for specifying document formats, such as XHTML5, the SVG format, or the DocBook format. It is also used for data interchange formats such as the Mathematical Markup Language (MathML) or the Universal Business Language (UBL), as well as for message formats such as the web service message format SOAP.

XML is based on Unicode, a platform-independent character set that includes almost all characters from most of the world’s writing systems, including those used for Hindi, Burmese, and Gaelic. Each character is assigned a unique integer code (its code point) between 0 and 1,114,111.

For example, the Greek letter π has the code 960, so it can be inserted in an XML document as &#960; using the XML character reference syntax. Unicode includes legacy character sets like ASCII and ISO-8859-1 (Latin-1) as subsets.
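
The relationship between a character, its code, and its character reference can be checked with a few lines of Python. This is only a sketch using the standard library; the element name p and the variable names are chosen for illustration.

```python
import xml.etree.ElementTree as ET

# The Unicode code of the Greek letter pi.
code_point = ord("π")

# Build a tiny XML document that contains pi as a decimal character reference.
doc = f"<p>&#{code_point};</p>"
element = ET.fromstring(doc)

# The hexadecimal form of the same reference (960 is 3C0 in hexadecimal).
hex_element = ET.fromstring("<p>&#x3C0;</p>")

print(code_point)        # 960
print(doc)               # <p>&#960;</p>
print(element.text)      # π
print(hex_element.text)  # π

# ASCII and Latin-1 are subsets: their byte values equal the Unicode codes.
print("A".encode("ascii")[0] == ord("A"))    # True
print("é".encode("latin-1")[0] == ord("é"))  # True
```

The parser resolves the reference, so an application reading the document sees the character π, not the reference.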

The default encoding of an XML document is UTF-8, a variable-length encoding that uses a single byte for ASCII characters and two to four bytes for all other characters. Almost all Unicode characters are legal in a well-formed XML document. Illegal characters include the control characters that are assigned codes 0 through 31, with the exception of the tab, line feed, and carriage return. It’s therefore risky to paste text from a non-XML source into an XML document: a stray control character, often the form feed, makes the document ill-formed.
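
Both points, the variable byte length of UTF-8 and the rejection of control characters, can be observed with Python's standard library. The sample characters, the element name note, and the helper name parses are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# UTF-8 byte length of a few sample characters.
samples = ["A", "π", "€", "😀"]
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(byte_lengths)  # {'A': 1, 'π': 2, '€': 3, '😀': 4}


def parses(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


with_tab = "<note>a\tb</note>"          # tab (code 9) is allowed
with_form_feed = "<note>a\x0cb</note>"  # form feed (code 12) is not

tab_accepted = parses(with_tab)
form_feed_accepted = parses(with_form_feed)
print(tab_accepted)        # True
print(form_feed_accepted)  # False
```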

XML namespaces

Generally, namespaces help avoid name conflicts. They allow the reuse of the same local name in different namespace contexts. Many programming languages, for instance Java and PHP, have some form of namespace concept. An XML namespace is identified by a namespace URI, such as the SVG namespace URI http://www.w3.org/2000/svg, which can be associated with a namespace prefix, such as svg. Such a namespace represents a collection of names for both elements and attributes. It allows namespace-qualified names of the form prefix:name, like svg:circle for SVG circle elements.

A default namespace is declared at the start tag of an element in the following way:

<html xmlns="http://www.w3.org/1999/xhtml">

This example shows the start tag of the html root element, where the XHTML namespace is declared as the default namespace.

The following example shows an SVG namespace declaration for an svg element embedded in an HTML document:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
...
</head>
<body>
<figure>
<figcaption>Figure 1: A blue circle</figcaption>
<svg:svg xmlns:svg="http://www.w3.org/2000/svg">
<svg:circle cx="100" cy="100" r="50" fill="blue" />
</svg:svg>
</figure>
</body>
</html>
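
To see how a namespace-aware parser interprets these declarations, the following sketch parses a shortened copy of the example (the elided head is left out) with Python's xml.etree.ElementTree. ElementTree reports each name as {namespace-URI}local-name; the constant and variable names are chosen for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"

doc = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <figure>
      <figcaption>Figure 1: A blue circle</figcaption>
      <svg:svg xmlns:svg="http://www.w3.org/2000/svg">
        <svg:circle cx="100" cy="100" r="50" fill="blue" />
      </svg:svg>
    </figure>
  </body>
</html>"""

root = ET.fromstring(doc)

# Unprefixed element names belong to the default (XHTML) namespace.
print(root.tag)  # {http://www.w3.org/1999/xhtml}html

# Prefixed names are resolved through the prefix declaration.
circle = root.find(f".//{{{SVG_NS}}}circle")
print(circle.tag)          # {http://www.w3.org/2000/svg}circle
print(circle.get("fill"))  # blue

# The prefix itself is arbitrary: only the namespace URI identifies the name.
other = ET.fromstring(f'<s:circle xmlns:s="{SVG_NS}"/>')
print(other.tag == circle.tag)  # True
```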

Correct XML documents

XML defines two syntactic correctness criteria. An XML document must be well-formed, and if it’s based on a grammar or schema, then it must be valid with respect to that grammar. In other words, it must satisfy all rules of the grammar. An XML document is called well-formed if it satisfies the following syntactic conditions:

  1. There must be exactly one root element.

  2. Each element needs a start tag and an end tag. However, empty elements can be closed as <phone/> instead of <phone></phone>.

  3. Tags must not overlap, which means that elements must be properly nested. For instance, we cannot have:

    <author><name>Lee Hong</author></name>
    
  4. Attribute names must be unique within the scope of an element. For instance, the following code is not correct:

    <attachment file="lecture2.html" file="lecture3.html"/>
    

An XML document is called valid against a particular grammar (such as a DTD or an XML Schema) if both of the following conditions are met:

  1. It is well-formed.
  2. It respects the grammar.
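
The well-formedness conditions listed earlier can be tried out with any XML parser, since a conforming parser must reject a document that violates them. The sketch below uses Python's xml.etree.ElementTree; the helper name is_well_formed and the sample labels are assumptions made for illustration. ElementTree is a non-validating parser, so it checks well-formedness only, not validity against a DTD or an XML Schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


examples = {
    "one root, empty element": "<person><phone/></person>",
    "two root elements": "<name>Lee Hong</name><phone/>",
    "missing end tag": "<name>Lee Hong",
    "overlapping tags": "<author><name>Lee Hong</author></name>",
    "duplicate attribute": '<attachment file="lecture2.html" file="lecture3.html"/>',
}

results = {label: is_well_formed(text) for label, text in examples.items()}
for label, ok in results.items():
    print(f"{label}: {ok}")
```

Only the first sample is accepted; each of the others violates one of the four conditions.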

The history of HTML

Tim Berners-Lee developed the first version of HTML in 1990. A few years later, in 1995, Berners-Lee and computer scientist Dan Connolly wrote the HTML 2.0 standard, which captured the common use of HTML elements at that time. In the following years, HTML was used and gradually extended by a growing community of early WWW adopters. This evolution of HTML, which led to a messy set of elements and attributes (called “tag soup”), was guided mainly by browser vendors and their competition with each other.

The development of XHTML in 2000 was an attempt by the World Wide Web Consortium (W3C) to address these issues. However, it neglected to advance HTML’s functionality towards a richer user interface. That became the focus of the Web Hypertext Application Technology Working Group (WHATWG, the WHAT working group), led by Ian Hickson, who is often considered the mastermind and main author of HTML5 and of many of its accompanying JavaScript APIs, which have adapted HTML for mobile applications.

The evolution of HTML

W3C has developed the following important versions of HTML:

  • HTML4 was developed in 1997 as an application of the Standard Generalized Markup Language (SGML).
  • XHTML was developed in the year 2000 as an XML-based cleanup of HTML4.
  • XHTML5 was developed in 2014. It was created in cooperation (and competition) with the WHAT working group and was supported by browser vendors.

HTML was originally designed as a structure description language, not as a presentation description language. However, HTML4 includes many purely presentational elements, like font. XHTML took HTML back to its roots by dropping presentational elements and defining a simple and clear syntax. XHTML was designed to support the following goals:

  • Device independence
  • Accessibility
  • Usability

For our purposes, we’ll adopt the following symbolic equation:

HTML = HTML5 = XHTML5

When we say “HTML” or “HTML5”, we actually mean XHTML5 because the XML syntax is much clearer and much less confusing than the HTML4-style syntax that HTML5 also allows.

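Note: The HTML syntax of HTML5 isn’t case-sensitive for tag names, but the XML syntax is. XHTML5 tags must therefore be written in lowercase, which the HTML syntax accepts as well.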

The following simple example shows the basic code template that can be used for any HTML document:

XHTML5 template example

  • In line 1, the HTML5 document type is declared so that browsers are instructed to use the HTML5 document object model (DOM).
  • The html start tag in line 2 uses the default namespace declaration attribute xmlns. The XHTML namespace URI is declared as the default namespace so that browsers and other tools understand that all non-qualified element names like html, head, body, and others are from the XHTML namespace. Additionally, in the html start tag, we set the default language for the text content of all elements (in this case, to "en" for English) using both the xml:lang attribute and the HTML lang attribute. This attribute duplication is a small price to pay for having a hybrid document that can be processed both by HTML and by XML tools.
  • Finally, in line 4, using an empty meta element with a charset attribute, we set the HTML document’s character encoding to UTF-8. This is also the default for XML documents.
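
A template consistent with the three points above might look like the string in the following Python sketch, which also checks that an XML parser accepts it. The markup is an assumed reconstruction: lines 1, 2, and 4 follow the description, while the title and body content are placeholders. The constant and variable names are likewise chosen for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
XML_NS = "http://www.w3.org/XML/1998/namespace"  # namespace of the xml: prefix

# Assumed template: line 1 is the document type declaration, line 2 the html
# start tag, line 4 the meta element. Title and body are placeholders.
template = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <meta charset="UTF-8"/>
    <title>XHTML5 template example</title>
  </head>
  <body>
    <h1>XHTML5 template example</h1>
  </body>
</html>"""

# Parsing succeeds only if the template is well-formed XML.
root = ET.fromstring(template)
meta = root.find(f"{{{XHTML_NS}}}head/{{{XHTML_NS}}}meta")

print(root.tag)                       # {http://www.w3.org/1999/xhtml}html
print(root.get("lang"))               # en
print(root.get(f"{{{XML_NS}}}lang"))  # en
print(meta.get("charset"))            # UTF-8
```

Because the same text is well-formed XML and uses only lowercase names, it can be processed by XML tools like the parser above as well as by HTML tools.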