Why you should be using HTML 4.01 instead of XHTML

Original Post: Beware of XHTML

If you’re a web developer, you’ve probably heard about XHTML, the markup language developed in 1999 to implement HTML as an XML format. Most people who use and promote XHTML do so because they think it’s the newest and hottest thing, and they may have heard of some (usually false) benefits here and there. But there is a lot more to it than you may realize, and if you’re using it on your website, even if it validates, you are probably using it incorrectly.

I should make it clear that I hope XHTML has a bright future on the Web. That is precisely why I have written this article. The state of XHTML on the Web today is more broken than the state of HTML, and most people don’t realize because the major browsers aren’t even treating those pages like real XHTML. If you hope for XHTML to succeed on the Web, you should read this article carefully.

Some of the issues discussed in this article are complicated and technical. If you find it difficult to follow, I suggest at least taking a look at the myths of XHTML, examples of latent compatibility issues, and the list of standards-related XHTML sites that break when treated properly.

Some quotes from prominent people/vendors:

Microsoft (Internet Explorer):
“If we tried to support real XHTML in IE 7 we would have ended up using our existing HTML parser (which is focused on compatibility) and hacking in XML constructs. It is highly unlikely we could support XHTML well in this way”
Mozilla (Firefox):
“If you are using the usual HTML features […] serving valid HTML 4.01 as text/html ensures the widest browser and search engine support.”
Apple (Safari):
“On today’s web, the best thing to do is to make your document HTML4 all the way. Full XHTML processing is not an option, so the best choice is to stick consistently with HTML4.”
Håkon Wium Lie (from Opera, W3C):
“I don’t think XHTML is a realistic option for the masses. HTML5 is it.”
Anne van Kesteren (from Opera):
“I’m an advocate of using XHTML only in the correct way, which basically means you have to use HTML. Period.”
Ian Hickson (from Opera, Google, W3C):
“Authors intending their work for public consumption should stick to HTML 4.01”

 

Table of Contents

  1. What is XHTML?
  2. Myths of XHTML
  3. Benefits of XML
  4. Content type is everything
  5. HTML compatibility guidelines
  6. Internet Explorer incompatibility
  7. Content negotiation
  8. Null End Tags (NET)
  9. Firefox and other problems
  10. Conclusion
  11. List of standards-related sites that break as XHTML
  12. List of standards-related sites that stick with HTML
  13. Related sites
  14. See also

 

What is XHTML?

XHTML is a markup language hoped to eventually (in the distant future) replace HTML on the Web. For the most part, an XHTML 1.0 document differs from an HTML 4.01 document only in the lexical and syntactic rules: HTML is written in its own unique subset of SGML, while XHTML is written in a different subset of SGML called XML. SGML subsets are differentiated by the sets of characters that delimit tags and other constructs, whether or not certain types of shorthand markup may be used (such as minimized attributes, omitted start/end tags, etc.), whether or not tag names or character entities are case sensitive, and so on.

The Document Type Definition (DTD, which is referenced by the doctype) then defines which elements, attributes, and character entities exist in the language and where the elements may be in the document. The DTDs of XHTML 1.0 and HTML 4.01 are nearly identical, meaning that, as far as things like elements and attributes go, XHTML 1.0 and HTML 4.01 are basically the same language. The only added benefit of XHTML is that it uses XML’s subset of SGML and shares the benefits XML has over HTML’s subset.

 

Myths of XHTML

There are many false benefits of XHTML promoted on the Web. Let’s clear up some of them at a glance (with details and other pitfalls provided later):

  • XHTML does not promote separation of content and presentation any more than HTML does. XHTML has all of the same elements and attributes (including presentational ones) that HTML has, and it doesn’t offer any additional CSS features. Semantic markup and separation of content and presentation is absolutely possible in HTML and is equally easy.
  • Most XHTML pages on the Web are not parsed as XML by today’s web browsers. The vast majority of XHTML pages on the Web cannot be parsed as XML. Even many valid XHTML pages cannot be parsed as XML. See the Validity and Well-Formedness article for details and examples.
  • HTML is not deprecated and is not being phased out at this time. In fact, the World Wide Web Consortium recently renewed the HTML working group which is working to develop HTML 5.
  • XHTML does not have good browser support. Most browsers simply treat XHTML pages as regular HTML (which presents a number of problems). Some major browsers like Firefox, Opera, and Safari may attempt to handle the page as proper XHTML, but usually only if you include a certain special HTTP header. However, when you do so, Internet Explorer and a number of other user agents will choke on it and won’t display a page at all. Even when handled as XHTML, the supporting browsers have a number of additional bugs.
  • Browsers do not parse valid XHTML dramatically faster than valid HTML, even when they’re parsing XHTML correctly. The browser can drop some of the logic needed for shorthand markup, but it now needs extra logic to confirm that the document is well-formed. XHTML parsed with an XML parser may be somewhat faster to parse than typical HTML, but the difference usually isn’t significant. Either way, download speed is usually the bottleneck when it comes to document parsing, so users won’t notice any speed improvement.
  • XHTML is not extensible if you hope to support Internet Explorer or the number of other user agents that can’t parse XHTML as XML. They will handle the document as HTML and you will have no extensibility benefit.
  • XHTML source does not necessarily look much different from HTML source. If you prefer making sure all of your non-empty elements have close tags, you may use close tags in HTML, too. The only real markup differences between an HTML document and an XHTML document following the legacy compatibility guidelines are the doctype, html element, and the /> tag ends (which are just XML shorthand constructs like so many people claim to dislike about HTML).

 

Benefits of XML

XML has a number of improvements over HTML’s subset of SGML:

  • Although HTML’s subset allowed for a lot of shorthand markup and other flexibility, it proved too difficult to write a correct and fully-featured parser for it. As a result, most user agents, including all of today’s major web browsers, make many technically unsound assumptions about the lexical format of HTML documents and don’t support a number of shorthand features like Null End Tags (<tag/Content/), unclosed start/end tags (<tag<tag>), and empty tags (<>). XML was designed to eliminate these extra features and restrict documents to a tight set of rules that are more straight-forward for user agents to implement. In effect, XML defines the assumptions that user agents are allowed to make, while still resulting in a file that a theoretical fully-featured SGML user agent could parse once pointed to XML’s SGML declaration.

    It should be noted that an XML parser for the most part is not dramatically easier to write than the level of HTML support offered by most HTML parsers. Most of the features that would make HTML more difficult to write a parser for, such as custom SGML declarations, additional marked sections, and most of the shorthand constructs, have negligible use on the Web anyway and generally have poor or absent support in major web browsers. The most significant difference is XML’s lack of support for omitted start and end tags, which in theory could amount to complicated logic in HTML for elements not defined as empty. Even still, most browsers have those rules hard-coded rather than derived from the DTD, so this isn’t a major difference in difficulty either.
  • To minimize the occurrence of nasty surprises when parsing the document, XML user agents are told to not be flexible with error handling: if a user agent comes upon a problem in the XML document, it will simply give up trying to read it. The user will be presented with a simple parse error message instead of the webpage. This eliminates the compatibility issues with incorrectly-written markup and browser-specific error handling methods by requiring documents to be “well-formed”, while giving webpage authors immediate indication of the problem. This does, however, mean that a single minor issue like an unescaped ampersand (&) in a URL would cause the entire page to fail, and so most of today’s public web applications can’t safely be incorporated in a true XHTML page.

    While user agents are supposed to fail on any page that isn’t well-formed (in other words, one that doesn’t follow the generic XML grammar rules), they do not have to fail on a page that is well-formed but invalid. For example, although it is invalid to have a span element as an immediate child of the body element, most XML-supporting web browsers won’t provide indication of the error because the page is still well-formed: the DTD is violated, but not the fundamental rules of XML itself. Some user agents may choose to be “validating” agents and will also fail on validity errors, but they aren’t common.

    Despite popular assumption, even if an XML page is perfectly valid, it still might not be well-formed.
  • Unlike HTML’s subset, which was specifically made for HTML, XML is a common subset used in many different languages. This means that a single simple parser can easily be written to support a number of different languages. It also paved the way for the Namespaces in XML standard which allows multiple documents in different XML formats to be combined in a single XML document, so that you can have, for example, an XHTML page that contains one or more SVG images that use MathML inside them.
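
The strict error handling described in the list above can be sketched with Python's standard library. This is only an illustration, not how any browser is implemented: `xml.etree.ElementTree` stands in for a non-validating XML parser, `html.parser.HTMLParser` stands in for a lenient HTML parser, and the fragment and helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical fragment: the ampersand in the URL is not escaped as &amp;
FRAGMENT = '<p><a href="search?q=xhtml&page=2">next</a></p>'


def is_well_formed(markup: str) -> bool:
    """Return True if a non-validating XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class LinkCollector(HTMLParser):
    """Lenient HTML parsing: collect href values without failing on errors."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href":
                self.hrefs.append(value)


def collect_hrefs(markup: str) -> list:
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.hrefs


if __name__ == "__main__":
    print(is_well_formed(FRAGMENT))                          # XML parser gives up
    print(is_well_formed(FRAGMENT.replace("&", "&amp;")))    # escaped: accepted
    print(collect_hrefs(FRAGMENT))                           # HTML parser carries on
    # Well-formed but invalid (span directly inside body): still accepted,
    # because ElementTree does not validate against a DTD.
    print(is_well_formed("<body><span>text</span></body>"))
```

Because ElementTree never consults a DTD, it behaves like the non-validating agents mentioned above: it rejects markup that is not well-formed but says nothing about validity.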

 

Content type is everything

When your website sends a document to the visitor’s browser, it adds on a special content type header that lets the browser know what kind of document it’s dealing with. For example, a PNG image has the content type image/png and a CSS file has the content type text/css. HTML documents have the content type text/html. Web servers typically send this content type whenever the file extension is .html, and server-side scripting languages like PHP also typically send documents as text/html by default.

XHTML does not have the same content type as HTML. The proper content type for XHTML is application/xhtml+xml. Currently, many web servers don’t have this content type reserved for any file extension, so you would need to modify the server configuration files or use a server-side scripting language to send the header manually. Simply specifying the content type in a meta element will not work over HTTP.
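
As a rough sketch of sending the header manually from server-side code, the following uses Python's standard WSGI interface. The page content, the `app` function, and the `call_app` helper are hypothetical names chosen for illustration; a real site would more likely set the type in its server configuration.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical minimal XHTML page, encoded as bytes for the response body.
XHTML_PAGE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Example</title></head><body><p>Hello</p></body></html>"
).encode("utf-8")


def app(environ, start_response):
    """WSGI application that labels its response as XHTML, not text/html."""
    headers = [
        ("Content-Type", "application/xhtml+xml; charset=utf-8"),
        ("Content-Length", str(len(XHTML_PAGE))),
    ]
    start_response("200 OK", headers)
    return [XHTML_PAGE]


def call_app(application):
    """Invoke a WSGI app in-process and return (status, headers, body)."""
    environ = {}
    setup_testing_defaults(environ)
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(application(environ, start_response))
    return captured["status"], captured["headers"], body


if __name__ == "__main__":
    status, headers, _ = call_app(app)
    print(status, headers["Content-Type"])
```

To try it in a browser, the same `app` can be served locally with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`.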

When a web browser sees the text/html content type, regardless of what the doctype says, it automatically assumes that it’s dealing with plain old HTML. Therefore, rather than using the XML parsing engine, it treats the document like tag soup, expecting HTML content. Because HTML 4.01 and simple XHTML 1.0 are often very similar, the browser can still understand the page fairly well. Most major browsers consider things like the self-closing portion of a tag (as in <br />) as a simple HTML error and strip it out, usually ending up with the HTML equivalent of what the author intended.

However, when the document is treated like HTML, you get none of the benefits XHTML offers. The browser won’t understand other XML formats like MathML and SVG that are included in the document, and it won’t do the automatic validation that XML parsers do. In order for the document to be treated properly, the server would need to send the application/xhtml+xml content type.
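
What an XML parser gains from namespaces can be seen in a small sketch. Here Python's `xml.etree.ElementTree` stands in for a browser's XML parser (an assumption made purely for illustration), and the document strings and the `tag_names` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical XHTML document with an inline SVG image.
DOC = (
    f'<html xmlns="{XHTML_NS}"><body>'
    f'<svg xmlns="{SVG_NS}" width="10" height="10"><circle r="5"/></svg>'
    "</body></html>"
)

# The same kind of document with the namespace declaration omitted.
DOC_WITHOUT_NS = "<html><body><p>Hello</p></body></html>"


def tag_names(markup: str) -> list:
    """Return every element name as the XML parser reports it."""
    return [element.tag for element in ET.fromstring(markup).iter()]


if __name__ == "__main__":
    # Each name carries its namespace, so XHTML and SVG elements are
    # distinguishable even though they share one document.
    for name in tag_names(DOC):
        print(name)
    # Without xmlns, the elements are plain names in no namespace; nothing
    # identifies them as XHTML.
    print(tag_names(DOC_WITHOUT_NS))
```

An HTML parser handling the same bytes as text/html has no such mechanism, which is why the embedded formats go unrecognized there.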

The problems go deeper. Comment markers are sometimes handled differently depending on the content type, and when you enclose the contents of a script or style element with basic SGML-style comments, it will cause your script and style information to be completely ignored when the document is treated like XML. Also, any special markup characters used in the inline contents of a style or script element will be parsed as markup instead of being treated as character data like in HTML. To solve these problems, you must use an elaborate escape sequence described in the article Escaping Style and Script Data, and even then there are situations in which it won’t work.
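
Both script problems can be reproduced with a generic XML parser. The sketch below again uses `xml.etree.ElementTree` as a stand-in for a browser's XML parser, which is an assumption; the script snippets and helper names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Script wrapped in SGML-style comments, a common habit in text/html pages.
HIDDEN = "<script><!--\nalert('hi');\n//--></script>"

# The same script without the comment wrapper.
PLAIN = "<script>alert('hi');</script>"

# Script containing raw markup characters (< and &).
UNESCAPED = "<script>if (a < b && c) { run(); }</script>"


def script_text(markup: str) -> str:
    """Return the character data an XML parser finds inside the element."""
    root = ET.fromstring(markup)
    return (root.text or "").strip()


def parses_as_xml(markup: str) -> bool:
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(repr(script_text(PLAIN)))    # the script source survives
    print(repr(script_text(HIDDEN)))   # the comment is a real comment: nothing left
    print(parses_as_xml(UNESCAPED))    # < and & are treated as markup
```

In the XML view the commented-out script is simply a comment, so no script text remains, and the raw `<` and `&` make the whole document fail to parse.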

Furthermore, the CSS and DOM specifications have special provisions for HTML that don’t apply to XHTML when it’s treated as XML, so your page may look and behave in unexpected ways. The most common problem is a white gap around your page if you have a background on the body, no background on the html element, and any kind of spacing between the elements, such as a margin, padding, or a body height under 100% (browsers typically have some combination of these by default). In scripting, tag names are returned differently and document.write() doesn’t work in XHTML treated as XML. Table structure in the DOM is different between the two parsing modes. These are only a select few of the many differences.

The following are some examples of differing behavior between XHTML treated as HTML and XHTML treated as XML. The anticipated results are based on the way Internet Explorer, Firefox, and Opera treat XHTML served as HTML. Some other browsers are known to behave differently. Also note that Internet Explorer doesn’t recognize the application/xhtml+xml content type (see below for an explanation), so it will not be able to view the examples in the second column.

XHTML served as text/html    XHTML served as application/xhtml+xml
Example 1                    Example 1
Example 2                    Example 2
Example 3                    Example 3
Example 4                    Example 4
Example 5                    Example 5
Example 6                    Example 6
Example 7                    Example 7
Example 8                    Example 8
Example 9                    Example 9
Example 10                   Example 10

 

HTML compatibility guidelines

When the XHTML 1.0 specification was first written, there were provisions that allowed an XHTML document to be sent as text/html as long as certain compatibility guidelines were followed. The idea was to ease migration to the new format without breaking old user agents. However, these provisions are now viewed by many as a mistake. The whole point of XHTML is to be an XML alternative to HTML, yet due to the allowance of XHTML documents to be sent as text/html, most so-called XHTML documents on the Web now would break if they were treated like XML (see the real-world examples below). Aware of the problem, the W3C had these provisions removed in the first revision of the XHTML specification. In XHTML 1.1 and onward, the W3C now clearly says that an XHTML document should not be sent as text/html. XHTML should be sent as application/xhtml+xml or one of the more elaborate XHTML content types.

 

Internet Explorer incompatibility

Internet Explorer does not support XHTML. Like other web browsers, when a document is sent as text/html, it treats the document as if it were a poorly constructed HTML document. However, when the document is sent as application/xhtml+xml, Internet Explorer won’t recognize it as a webpage; instead, it will simply present the user with a download dialog. This issue still exists in Internet Explorer 7.

Although all other major web browsers, including Firefox, Opera, Safari, and Konqueror, support XHTML, the lack of support in Internet Explorer, as well as in major search engines and web applications, makes its use inadvisable.

 

Content negotiation

Content negotiation is the idea of sending different content depending on what the user agent supports. Many sites attempt to send XHTML as application/xhtml+xml to those who support it, and either XHTML as text/html or real HTML to those who don’t.

Two methods are generally used to determine what the user agent supports, both based on the Accept HTTP header. Most often, sites use the incorrect method: they simply look for the string “application/xhtml+xml” in the header value. Some sites use the correct method: they actually parse the header value, supporting wildcards and ordering by q value.

Unfortunately, neither of these methods works reliably.

The first method doesn’t work because not all XHTML-supporting user agents actually have the text “application/xhtml+xml” in the Accept header. Safari and Konqueror are two such browsers. The application/xhtml+xml content type is implied by a wildcard value instead. Meanwhile, not all HTML-supporting user agents have “text/html” in the header. Internet Explorer, for example, doesn’t mention this content type. Like Safari and Konqueror, it implies this support by using a wildcard. Even among those user agents that support XHTML and mention application/xhtml+xml in the header, it may have a lower q value than text/html (or a matching wildcard), which implies that the user agent actually prefers text/html (in other words, its XHTML support may be experimental or broken).
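
The weakness of the substring check is easy to see in a short sketch. The header strings below are illustrative values written for this example, not captured from any particular browser, and `naive_supports_xhtml` is a hypothetical name.

```python
def naive_supports_xhtml(accept_header: str) -> bool:
    """The incorrect method: look for the literal content type string."""
    return "application/xhtml+xml" in accept_header


# Illustrative Accept header values (assumptions, not real browser output).
EXPLICIT = "text/html,application/xhtml+xml,*/*;q=0.8"
WILDCARD_ONLY = "*/*"
LOW_PREFERENCE = "text/html,application/xhtml+xml;q=0.5"

if __name__ == "__main__":
    print(naive_supports_xhtml(EXPLICIT))
    # A wildcard-only header never contains the string, whether or not the
    # user agent behind it can actually handle XHTML.
    print(naive_supports_xhtml(WILDCARD_ONLY))
    # The string is present, but the q value says text/html is preferred.
    print(naive_supports_xhtml(LOW_PREFERENCE))
```

The check answers "no" for every wildcard-only agent and "yes" for an agent that explicitly ranks XHTML below HTML, which are exactly the two failure cases described above.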

The second method (the correct, 100% standards-compliant one) doesn’t work because most major browsers have inaccurate Accept headers:

  • Firefox 2 and below have application/xhtml+xml listed with a higher q value than text/html, even though Mozilla has posted an official recommendation on its site saying that websites should use text/html for these versions if they can, for reasons described below.
  • Internet Explorer doesn’t list either text/html or application/xhtml+xml in its Accept header. Instead, both content types are covered by a single wildcard value (which implies that every content type in existence is supported equally well, which is obviously untrue). So Internet Explorer is saying that it supports both text/html and application/xhtml+xml equally, even though it actually doesn’t support application/xhtml+xml at all. In the case that a user agent claims to support both equally, the site is supposed to use its own preference. A possible workaround is for the site to “prefer” sending text/html or, in a toss-up situation, only send application/xhtml+xml if it’s actually mentioned explicitly in the header. However…
  • Safari and Konqueror, which support XHTML, also give text/html and application/xhtml+xml the same q value (in fact, like Internet Explorer, they also claim to support everything in existence equally well). But they don’t mention application/xhtml+xml explicitly; it’s implied by a wildcard. So if you use the above workaround, Safari and Konqueror will receive text/html even though they really do support application/xhtml+xml.
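
For comparison, here is a simplified sketch of the header-parsing method, with wildcard matching, q values, and a server-side preference for ties. It ignores media type parameters other than q, the function names are hypothetical, and the header strings are illustrative rather than taken from real browsers.

```python
def parse_accept(header: str) -> list:
    """Split an Accept header into (media_range, q) pairs."""
    ranges = []
    for part in header.split(","):
        pieces = [piece.strip() for piece in part.split(";")]
        if not pieces[0]:
            continue
        q = 1.0  # q defaults to 1 when not given
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        ranges.append((pieces[0].lower(), q))
    return ranges


def quality(media_type: str, header: str) -> float:
    """q value of the most specific range matching media_type (0.0 if none)."""
    main_type = media_type.partition("/")[0]
    best_specificity, best_q = -1, 0.0
    for media_range, q in parse_accept(header):
        if media_range == media_type:
            specificity = 2
        elif media_range == main_type + "/*":
            specificity = 1
        elif media_range == "*/*":
            specificity = 0
        else:
            continue
        if specificity > best_specificity:
            best_specificity, best_q = specificity, q
    return best_q


def choose(header: str, server_preference: str = "text/html") -> str:
    """Pick a content type; on a tie, fall back to the server's preference."""
    xhtml = quality("application/xhtml+xml", header)
    html = quality("text/html", header)
    if xhtml > html:
        return "application/xhtml+xml"
    if html > xhtml:
        return "text/html"
    return server_preference


if __name__ == "__main__":
    print(choose("application/xhtml+xml,text/html;q=0.9"))
    print(choose("*/*"))  # tie via wildcard: the header cannot settle it
```

A wildcard-only header yields the same q value for both types, so the outcome depends entirely on `server_preference`. The header alone cannot distinguish an agent that handles XHTML from one that does not, which is the toss-up described in the list above.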

As disappointing as it may be, content negotiation simply isn’t a reliable approach to this problem.

 

Null End Tags (NET)

In XHTML, all elements are required to be closed, either by an end tag or by adding a slash to the start tag to make it self-closing. Since giving empty elements like img or br an end tag would confuse browsers treating the page like HTML, self-closing tags tend to be promoted. However, XML self-closing tags directly conflict with a little-known and poorly supported HTML/SGML feature: Null End Tags.

A Null End Tag is a special shorthand form of a tag that allows you to save a few characters in the document. Instead of writing <title>My page</title>, you could simply write <title/My page/ to accomplish the same thing. Due to the rules of Null End Tags, a single slash in an empty element’s start tag would close the tag right then and there, meaning <br/ is a complete and valid tag in HTML. As a result, if you have <br/> or <br />, a browser supporting Null End Tags would see that as a br element immediately followed by a simple > character. Therefore, an XHTML page treated as HTML could be littered with unwanted > characters.
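
The effect can be imitated with a toy model. The sketch below is not an SGML parser: it only mimics the one rule described above for attribute-less br and hr tags, and the function name and sample strings are hypothetical.

```python
import re

# Toy rule: for an empty element, "<br/" (optionally with whitespace before
# the slash) is already a complete tag under Null End Tag handling.
_NET_EMPTY_TAG = re.compile(r"<(?:br|hr)\s*/", re.IGNORECASE)

# Ordinary tags such as <p>, </p>, or <br>.
_ORDINARY_TAG = re.compile(r"<[^<>]+>")


def text_seen_with_net(markup: str) -> str:
    """Return the text a NET-aware agent would display (toy model only)."""
    without_net_tags = _NET_EMPTY_TAG.sub("", markup)
    return _ORDINARY_TAG.sub("", without_net_tags)


if __name__ == "__main__":
    print(text_seen_with_net("<p>one<br />two</p>"))  # the > is left as text
    print(text_seen_with_net("<p>one<br>two</p>"))    # plain HTML: no stray >
```

In this model each XHTML-style self-closing tag leaves one stray > character in the displayed text, while the plain HTML form leaves none.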

This problem is often overlooked because most popular browsers today lack support for Null End Tags, as well as some other SGML shorthand features. However, there are still some smaller user agents that properly support Null End Tags. One of the more well-known user agents that support them is the W3C validator. If you send it a page that uses XHTML self-closing tags, but force it to parse the page as HTML/SGML like most user agents do for text/html pages, you can see the results in the outline: immediately after each of the self-closing elements, there is an unwanted > character that will be displayed on the page itself.

(It should be noted that the W3C Validator is unusual in that it generally determines the parsing mode from the doctype, rather than from the content type as most other user agents do. Therefore, an HTML doctype was used in the above example just so the validator would attempt to parse the page using the HTML subset of SGML as all major browsers will for text/html pages regardless of the doctype. The Null End Tag rules are actually set in the SGML subset definition, not the DTD, so this example is accurate to what you should expect in a fully compliant SGML user agent even with an XHTML doctype.)

Technically, a restricted and altered form of Null End Tags exists in XML and is frequently used: the self-closing portion of the start tag. While Null End Tags are defined as / … / in HTML’s subset of SGML, they are specially defined as / … > in XML with the added restriction that it must close immediately after it is opened, meaning the element must have no content. This was designed to look similar to a regular start tag for web developers who are unfamiliar with typical Null End Tags. However, in the process it creates inherent incompatibility with HTML’s subset of SGML for all empty elements.

In summary, although this issue doesn’t show in most popular web browsers, a user agent that more fully supports SGML would see unwanted > characters all over XHTML pages that are sent with the text/html content type. If the goal of using XHTML is to help promote standards, then it’s quite counterproductive to cause unnecessary problems for user agents that more correctly comply with the SGML standard.

 

Firefox and other problems

Although Firefox supports the parsing of XHTML documents as XML when sent with the application/xhtml+xml content type, its performance in versions 2.0 and below is actually worse than with HTML. When parsing a page as HTML, Firefox will begin displaying the page while the content is being downloaded. This is called incremental rendering. However, when it’s parsing XML content, Firefox 2.0 and below will wait until the entire page is downloaded and checked for well-formedness before any of the content is displayed. This means that, although in theory XML is supposed to be faster to parse than HTML, in reality these versions of Firefox usually display HTML content to the user much faster than XHTML/XML content. Thankfully, this issue is expected to be resolved in Firefox 3.0.

However, there are also issues in other browsers, such as certain HTML-specific provisions in the CSS and DOM standards being mistakenly applied to XHTML content parsed as XML. For example, if there is a background set on the body element and none on the html element, Opera will apply the background to the html element as it would in HTML. So even when dealing exclusively with XHTML parsed as XML, you still run into a number of the same problems that you do when trying to serve XHTML either way.

All in all, true XHTML support in major user agents is still very weak. Because a key user agent (namely, Internet Explorer) has made no visible effort to support XHTML, other major user agents have continued to see it as a relatively low priority and so these bugs have lingered. HTML is recommended over XHTML by both Mozilla and Apple (for Safari) and is generally better supported than XHTML by all major browsers.

 

Conclusion

XHTML is a very good thing, and I certainly hope to see it gain widespread acceptance in the future. However, it simply isn’t widely supported in its proper form. XHTML is an XML format, and to force a web browser to treat it like HTML is going against the whole purpose of XHTML and also inevitably causes other complications. Assuming you don’t want to dramatically limit access to your information, XHTML can only be used incorrectly, be interpreted as invalid markup by most user agents, cause unwanted results in others, and offer no added benefit over HTML. HTML 4.01 Strict is still what most user agents and search engines are most accustomed to, and there’s absolutely nothing wrong with using it if you don’t need the added benefits of XML. HTML 4.01 is still a W3C Recommendation, and the W3C has even announced plans to further develop HTML alongside XHTML in the future.

 

List of standards-related sites that break as XHTML

The following are just a few of the countless sites that use an XHTML doctype but, as of this writing, completely fail to load or otherwise work improperly when parsed as XML, thus missing the whole point of XHTML. The authors of most of these sites are quite prominent in the web standards community (many are involved in the Web Standards Project, WaSP), yet they have still fallen victim to the pitfalls of current use of XHTML. In fact, I have found that nearly all XHTML websites owned by WaSP members have failures when parsed as XML.

You could consider this a “shame list” of sorts. These are the same people who are supposed to be teaching others how to use web standards properly, yet they have written markup that basically depends on browsers treating it incorrectly. But the main point of this list isn’t to pick on individuals; it’s to reinforce the fact that even so-called experts at web standards have trouble juggling the different ways XHTML will inevitably be handled on the Web. And what benefit does it bring? None of the following sites make use of anything XHTML offers over HTML.

You can test a page’s actual XHTML rendering in Firefox using the Force Content-type extension and setting the new content-type to application/xhtml+xml.

Accessify - WaSP Steering Committee, Accessibility Task Force
Displayed as generic XML, not interpreted as XHTML. The XML namespace was omitted.
all in the <head> - WaSP Steering Committee
Page doesn’t load. Not well-formed. (Note: this page is valid according to the XHTML DTD and XML’s subset of SGML, but XML has additional rules to define well-formed pages which this page breaks, observed in the Textpattern and the Technorati Link Count Widget post. A similar test case is available.)
And all that Malarkey - WaSP Accessibility Task Force
Page doesn’t load. Not well-formed.
CSS Zen Garden - WaSP
Top background doesn’t display. The page relies on HTML-specific background behavior. Numerous designs have errors with a similar cause.
dean.edwards.name/weblog/ - WaSP DOM Scripting Task Force, Microsoft Task Force
For browsers that support behavior binding (including Firefox) for the dynamic syntax highlighting of the code snippets, most of the code boxes fail to load the contents, resulting in many empty boxes where code snippets should be.
dog or higher
Page doesn’t load. Not well-formed.
Elly Thompson’s Weblog
Page doesn’t load. Not well-formed.
g9g.org - WaSP Steering Committee
There is a thick white gap around the page. The page relies on HTML-specific background behavior.
holly marie - WaSP Steering Committee
Page doesn’t load. Not well-formed.
Jeffrey Veen - WaSP emeritus
Page doesn’t load. Not well-formed.
KuraFire - WaSP
Page doesn’t load. Not well-formed.
Meriblog
Background appears white instead of purple. The page relies on HTML-specific background behavior.
mezzoblue - WaSP
Displayed as generic XML, not interpreted as XHTML. The XML namespace was omitted. Also, individual post pages don’t load. Not well-formed.
microformats
Page doesn’t load. Not well-formed.
molly.com - WaSP Group Lead
Flickr script fails to initialize because the script contents are commented out.
Off the Top - WaSP Steering Committee
Page doesn’t load. Not well-formed.
unadorned.org - WaSP Steering Committee
Stylesheet doesn’t load because the import rule is commented out.
WordPress - WaSP
Page doesn’t load. Not well-formed.

 

List of standards-related sites that stick with HTML

The following are some significant sites relevant to web standards that continue to use HTML rather than XHTML.

  • 456 Berea Street
  • Anne van Kesteren
  • Bite Size Standards
  • David Baron’s Homepage
  • Hixie’s Natural Log
  • Jonathan Snook’s Blog
  • meyerweb.com
  • Mozilla
  • Web Devout
  • WebKit

This work is copyright © 2007 David Hammond and is licensed under a Creative Commons Attribution Share-Alike License. It may be copied, modified, and distributed freely as long as it attributes the original author and maintains the original license. See the license for details.