Parsing Broken XM

The XML specification mandates that all conforming XML parsers employ “draconian error handling.” That is, they must halt and catch fire as soon as they detect any sort of wellformedness error in the XML document. Wellformedness errors include mismatched start and end tags, undefined entities, illegal Unicode characters, and a number of other esoteric rules. This is in stark contrast to other common formats like HTML — your browser doesn’t stop rendering a web page if you forget to close an HTML tag or escape an ampersand in an attribute value. (It is a common misconception that html has no defined error handling. HTML error handling is actually quite well-defined, but it’s significantly more complicated than “halt and catch fire on first error.”)

Some people (myself included) believe that it was a mistake for the inventors of XML to mandate draconian error handling. Don’t get me wrong; I can certainly see the allure of simplifying the error handling rules. But in practice, the concept of “wellformedness” is trickier than it sounds, especially for XML documents (like Atom feeds) that are published on the web and served over HTTP. Despite the maturity of XML, which standardized on draconian error handling in 1997, surveys continually show a significant fraction of Atom feeds on the web are plagued with wellformedness errors.

So, I have both theoretical and practical reasons to parse XML documents “at any cost,” that is, not to halt and catch fire at the first wellformedness error. If you find yourself wanting to do this too, lxml can help.

Here is a fragment of a broken XML document. I’ve highlighted the wellformedness error.

Get hands-on with 1200+ tech skills courses.