As you become a better web designer or programmer, it is very helpful to know the characteristics of different markup languages. In this article we are going to discuss the steps necessary to determine whether a document is HTML or XML. Both HTML and XML belong to a family of languages derived from the Standard Generalized Markup Language (SGML). You can learn more about SGML from the World Wide Web Consortium (W3C).
Let’s briefly review HTML and XML:
XML – The Extensible Markup Language is a text-based, cross-platform language that enables you to store data (such as addresses in an address book) in a structured manner. An XML document is expected to have correct syntax; in other words, it should be well-formed. A well-formed document has the following:
- Closed tags (<mutualfunds></mutualfunds> or <stocks />)
- Attribute values enclosed in quotes, either single or double (<stock price="73.45">)
- Matching case: XML is case sensitive, so the beginning and ending tags must use the same case
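The three rules above can be checked mechanically by handing a snippet to an XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample snippets are illustrative assumptions, not part of any library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Illustrative snippets, one per rule in the list above.
print(is_well_formed('<stocks><stock price="73.45" /></stocks>'))  # True: closed, quoted, matching case
print(is_well_formed('<stocks><stock price="73.45"></stocks>'))    # False: <stock> is never closed
print(is_well_formed('<stocks><stock price=73.45 /></stocks>'))    # False: attribute value is not quoted
print(is_well_formed('<stocks></Stocks>'))                         # False: tag case does not match
```

A parser only confirms that the syntax is well-formed. It does not tell you whether the data makes sense for your application.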
HTML – The Hypertext Markup Language is a text-based, cross-platform language that is used to author pages for presentation on the World Wide Web. Using HTML one can create static or dynamic content for others to view. HTML is a little more lenient, and browsers will accept some tags that are not nested correctly. HTML is not case sensitive, empty elements (for example, <br>, the line break) do not have to be closed, and some ending tags (for example, </p> for a paragraph) may be left out.
These two languages have different goals for the end users. You would not primarily use XML to create a website for people to visit. And HTML would not be the optimal tool to represent data for various platforms. Therefore, it is very helpful to understand the difference between the two so you can utilize them effectively.
In our first step we are going to take a look at two sample documents. By the end of our steps we will have the tools to determine whether each document is HTML or XML.
Document 1:
<?xml version="1.0" encoding="utf-8"?>
<description>Our company needs an accountant to perform accounting duties.</description>
<description>Our company needs a financial director to perform administrative duties.</description>
Document 2:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<title>E-Tech Company Employment Listings</title>
<div id="content">View a listing below<br>
<li><a href="financialdirector.html">Financial Director</a>
Clue #1: First line
The two documents start out very differently, and the first line gives us a valuable clue as to their origins. Document 1 begins with an XML declaration (<?xml … ?>), which is written with the same syntax as a processing instruction and is specific to XML. Document 2 begins with <!DOCTYPE …>. This is a document type declaration, which tells the browser and the validator which version of HTML the page is written against. Compliant web pages, in other words pages that follow the expected rules of HTML, should specify a document type. You can of course start a web page with just <html>, but specifying a document type is highly recommended.
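The first-line check can be expressed as a short function. This is a rough sketch in Python, not a complete detector: the function name first_line_clue and the shortened sample strings are assumptions made for illustration, and the result is only a hint, since a document may omit both declarations.

```python
def first_line_clue(text: str) -> str:
    """Guess the document type from its first non-blank line (hypothetical helper)."""
    stripped = text.lstrip()
    first_line = stripped.splitlines()[0] if stripped else ""
    if first_line.startswith("<?xml"):
        # The XML declaration must be written in lowercase.
        return "xml"
    if first_line.lower().startswith("<!doctype html"):
        # HTML is not case sensitive, so compare in lowercase.
        return "html"
    return "unknown"


# Shortened versions of the two sample documents.
doc1 = '<?xml version="1.0" encoding="utf-8"?>\n<description>...</description>'
doc2 = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
        '"http://www.w3.org/TR/html4/strict.dtd">\n<title>...</title>')

print(first_line_clue(doc1))  # xml
print(first_line_clue(doc2))  # html
```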
Clue #2: Closed tags and case sensitivity
In our first document all of our tags are closed: every beginning tag has a matching ending tag. XML documents must have closed tags. In document 2, on the other hand, we have <li> tags that are not closed, as well as a <br> tag. We can still do this and have a valid HTML document, but the result would not be well-formed XML. This gives us a much stronger argument that document 2 is not an XML document.
Also, in document 1 every tag is in lowercase and each ending tag matches its beginning tag. As you recall, XML is case sensitive, so beginning and ending tags must use the same case.
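To see which tags in a document are left open, you can walk through it with a lenient HTML parser and keep track of the beginning tags that never receive an ending tag. The sketch below uses Python's standard html.parser module. The class UnclosedTagFinder and the function unclosed_tags are illustrative names, and the snippets are modeled on document 2 with a closing </div> added as an assumption.

```python
from html.parser import HTMLParser


class UnclosedTagFinder(HTMLParser):
    """Track beginning tags that never receive a matching ending tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Remove the most recent matching beginning tag, if there is one.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break


def unclosed_tags(text):
    finder = UnclosedTagFinder()
    finder.feed(text)
    finder.close()
    return finder.open_tags


html_style = ('<div id="content">View a listing below<br>\n'
              '<li><a href="financialdirector.html">Financial Director</a>\n'
              '</div>')
xml_style = ('<div id="content">View a listing below\n'
             '<li><a href="financialdirector.html">Financial Director</a></li>\n'
             '</div>')

print(unclosed_tags(html_style))  # ['br', 'li']
print(unclosed_tags(xml_style))   # []
```

Note that HTMLParser converts tag names to lowercase, so this sketch cannot detect a case mismatch; an XML parser is the right tool for that check.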
Clue #3: Expected tags
Our final clue relates to the rules, or syntax, of HTML. HTML documents are expected to have certain tags in order to present information. The primary example is the <html> tag, which marks the document as HTML for the browser. To give the page a title you must provide the <title> tags as well. XML is quite different in this regard: you can specify any tags you want for your data. You will not find a body tag in an XML document unless you define an element called "body" yourself, and even then it must be closed and its beginning and ending tags must match in case.
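One way to apply this clue is to collect the tag names in a document and see which of the usual HTML structural tags appear. The following Python sketch assumes a small, hand-picked set of tags (html, head, title, body). The names TagCollector and html_structural_tags, the sample markup, and the <jobs> root element are all hypothetical.

```python
from html.parser import HTMLParser

# A hand-picked set of tags that HTML pages are expected to contain.
HTML_STRUCTURAL_TAGS = {"html", "head", "title", "body"}


class TagCollector(HTMLParser):
    """Collect the name of every beginning tag in the document."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)


def html_structural_tags(text):
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return sorted(collector.tags & HTML_STRUCTURAL_TAGS)


page = ('<html><head><title>E-Tech Company Employment Listings</title></head>'
        '<body><div id="content">View a listing below<br></div></body></html>')
data = ('<jobs><description>Our company needs an accountant to perform '
        'accounting duties.</description></jobs>')

print(html_structural_tags(page))  # ['body', 'head', 'html', 'title']
print(html_structural_tags(data))  # []
```

Because an XML author is free to name an element "title" or "body", a match here is a hint rather than proof, and it works best combined with the other two clues.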
Take a look at the example below:
<?xml version="1.0" encoding="UTF-8"?>
<title>E-Tech Company Employment Listings</title>
<div id="content">View a listing below
<li><a href="financialdirector.html">Financial Director</a></li>
This is actually a well-formed XML document: unlike document 2, the <li> tag is closed and the <br> tag is gone. It may appear to be an HTML document, but the XML declaration on the first line gives a valuable clue as to the type of document we have.
In conclusion, we have identified document 1 as XML and document 2 as HTML.
As a final tip, I always recommend validating your HTML and XML code. Your HTML code should be validated by a compliance checker such as the W3C validator. A tool like XML Spy or Validome is valuable for validating your XML code.