XML crib sheet

XML is the 'lingua franca' for e-documents. It provides a standard mark-up language for defining the structure of documents, thereby enabling their transmission, validation and interpretation between applications and between organisations.


What's all the fuss about?
The sudden boom in the web during the 1990s brought home to many vendors and many within IT organisations the value of a document format that could be read by virtually any other application.

The web's HTML document format is text-based and marks out content using a fixed set of tags determined by the World Wide Web Consortium (W3C). As a result, any program that understands those tags can parse an HTML document.

This set of tags, while useful for web pages, is too limited to serve as a basis for all document formats. SGML, a more sophisticated and extensible system than HTML, already exists, but it is too difficult to implement just for a web browser. Full SGML systems solve large, complex problems that justify their expense. Viewing structured documents sent over the web rarely carries such justification.

A halfway house between HTML and SGML was needed that would provide the flexibility and expandability of SGML on the web and to general applications; hence, XML was developed.

XML is a mark-up language for documents containing both content (words, pictures, etc) and some indication of what role that content plays. Content in a section heading, for example, has a different meaning from content in a footnote, a figure caption or a database table. By defining sets of tags appropriate to the application, XML can store almost any piece of content in a form that any XML-compatible application can read.

Virtually every new file or messaging format developed in the last two years has been based on XML. As more and more mainstream applications begin to use it and as the tools to develop and deploy it become ubiquitous, XML is going to be as prevalent as Windows or even the text file in the IT world.

The shape of XML
To anyone familiar with HTML or the coding of a web page, XML looks both familiar and different. After a series of machine- and human-readable statements at the beginning of the file, content follows with different areas each marked up with an opening tag (eg <tag>) and a closing tag (eg </tag>). But while HTML has preset tags, such as <h1> ... </h1> and <p> ... </p>, which define a headline and a paragraph respectively, XML contains arbitrary tags: it would be perfectly possible to have one XML document use <h1> ... </h1> and <p> ... </p> to mark areas of content as headlines or paragraphs, and to have another use <headline> ... </headline> and <paragraph> ... </paragraph>.
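As a rough illustration of that point, the Python sketch below parses two documents that carry the same content under different tag names. The documents and the `outline` helper are invented for this example; only the standard library's `xml.etree.ElementTree` parser is assumed.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents: same content, different (arbitrary) tag names.
DOC_A = "<article><h1>XML crib sheet</h1><p>XML is a mark-up language.</p></article>"
DOC_B = (
    "<article><headline>XML crib sheet</headline>"
    "<paragraph>XML is a mark-up language.</paragraph></article>"
)


def outline(xml_text):
    """Return (tag, text) pairs for each direct child of the root element."""
    root = ET.fromstring(xml_text)
    return [(child.tag, child.text) for child in root]


print(outline(DOC_A))
print(outline(DOC_B))
```

The parser handles both documents in exactly the same way. What it cannot know is that `<h1>` and `<headline>` are meant to play the same role; that knowledge has to come from somewhere else.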

The tags an XML document may use, and how they may be nested, are defined elsewhere, in a Document Type Definition (DTD) or schema file.

Easing the transfer of documents
Using XML as the basis of a document format means that there is a whole range of tools and software that can already parse it, making it unnecessary for the end user to obtain a particular program in order to read a document.

Even the doyen of the proprietary file format, Microsoft, is keen to embrace XML. Office 2003 provides the ability for Microsoft Word and its stablemates to save and read documents in XML-based formats. Microsoft has also published the schemas it has used for these formats (such as 'SpreadsheetML' and 'WordprocessingML'), making it possible for other programs to understand, not just read, files saved in these formats and to save files in those formats as well.

Previously, developers had to reverse engineer Office's many document formats in order to read them, usually imperfectly.

“Perhaps the most important factor relating to standard XML file formats is that of human-readable tags and standard processing techniques,” says Gary Edwards, OpenOffice.org's representative on the OASIS OpenOffice XML Format Technical Committee. “With a proprietary file format, users had to either get special permission from the application vendor, or reverse engineer the binary format, in order to work with the files in ways that met their specific needs. With a standard XML file format, users can mine, re-use and re-purpose information any way they can think of. Plus, the standardisation of the file formats and related XML transformation technologies means that powerful machines can be constructed to service advanced content management and collaboration needs without having to beg the application vendor for permission or future enhancements.”

The many identities of XML
Unlike HTML, XML has no pre-defined tags or way of ordering tags for content. Consequently, any organisation that uses XML has to decide which set of tags ('schema') it will use. If it never intends the document's structure to be understandable to anyone outside the organisation, it can choose a completely arbitrary schema. But if it is to exchange documents with another organisation, that organisation will need to understand the schema underlying the document.

Rather than reinvent the wheel, many organisations are using schemas appropriate for their industries such as ebXML (ebusiness XML), LegalXML and ACORD XML for Life Insurance. There are currently many thousands of pre-defined schemas, so picking one appropriate to the organisation, partners and purpose can be hard. But by using a pre-defined schema and XML, organisations can have the benefit of industry experience in a pre-packaged form, and a document exchange format that almost any system can read without the need for a developer to create a parsing system especially for that format.

Making content fluid
XML documents store both content and structure. But they do not contain information about how to display either. By divorcing the two, XML has become a highly useful publishing tool. Using XML 'stylesheets' for different media and purposes, it is possible to use the same XML document as the content source for a web page, a brochure and a letter, for example. Each stylesheet will specify which elements should be displayed - so if headlines should only appear on the web page and the letter, the brochure's stylesheet would specify that headlines should be ignored. The stylesheet will also specify medium-appropriate formatting so that the same content can appear in different fonts, sizes, colours and spacing depending on where it will be seen.
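The idea can be sketched in a few lines of Python. This is not a real XML stylesheet language: the dictionaries below are a simplified stand-in, and the document, tag names and `render` helper are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document: content and structure, no presentation.
SOURCE = (
    "<item>"
    "<headline>Spring sale</headline>"
    "<body>All stock reduced this week.</body>"
    "</item>"
)

# Stand-in "stylesheets": tag name -> output template, or None to hide the tag.
WEB_STYLE = {"headline": "<h1>{}</h1>", "body": "<p>{}</p>"}
BROCHURE_STYLE = {"headline": None, "body": "{}"}


def render(xml_text, style):
    """Apply a stylesheet-like mapping to each direct child of the root."""
    root = ET.fromstring(xml_text)
    parts = []
    for child in root:
        template = style.get(child.tag)
        if template is None:  # hidden for this medium, or not listed at all
            continue
        parts.append(template.format(child.text))
    return "\n".join(parts)


print(render(SOURCE, WEB_STYLE))
print(render(SOURCE, BROCHURE_STYLE))
```

The source document never changes. Only the mapping passed to `render` decides whether the headline appears and how each piece of content is wrapped.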

Since content can be hidden and displayed using stylesheets, a single XML document can potentially contain multi-lingual content. By using different stylesheets for different languages, it can display the appropriate text at the same point in different catalogues intended for different countries.
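A minimal Python sketch of the same idea follows. It assumes a hypothetical catalogue entry in which translated elements carry a `lang` attribute; the attribute name, the document and the `localise` helper are illustrative choices, not part of any particular schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue entry holding two translations of the product name.
CATALOGUE = (
    "<product>"
    "<name lang='en'>Garden chair</name>"
    "<name lang='fr'>Chaise de jardin</name>"
    "<price>45</price>"
    "</product>"
)


def localise(xml_text, language):
    """Keep elements with no language marker or with the requested language."""
    root = ET.fromstring(xml_text)
    lines = []
    for child in root:
        child_lang = child.get("lang")
        if child_lang is None or child_lang == language:
            lines.append(f"{child.tag}: {child.text}")
    return lines


print(localise(CATALOGUE, "en"))
print(localise(CATALOGUE, "fr"))
```

Language-neutral content such as the price appears in every edition, while each edition picks up only its own translation of the name.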

The future of the web
The web was founded on HTML, a scaled-down version of SGML intended for the simple mark-up of documents for viewing across a network. HTML version 1.0 included few of the formatting capabilities of the latest version, 4.0, and many of the capabilities it did include were concerned with structure rather than with how the document would end up looking in a browser.

For example, rather than specify that a piece of text would be bold or italic, documents would be marked up with tags such as <strong> and <em>, leaving it up to the browser to decide whether to highlight the text in bold, use a different font or increase the font size to indicate the emphasis.

Since the release of HTML version 3.0, its maintainers, the W3C, have been working on ways to move the web from using HTML to XML as the language of web page mark-up. A key first step came with version 4.0 of HTML, which included the ability to use stylesheets to separate the formatting of a web page from its structure.

The next step was the introduction of XHTML, a blend of HTML and XML. XHTML is an attempt to get web authors and browsers used to the strict formatting of XML before the final push to a fully XML world. All tags must now have a matching close tag, even when it makes no sense for there to be one (such as a horizontal rule tag); they must be formatted and nested correctly; and they need to have far more of the machine-readable declarations found in XML.
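The difference in strictness is easy to see by feeding both styles of mark-up to an XML parser. In the Python sketch below, the two fragments and the `is_well_formed` helper are made up for the example, and the check covers well-formedness only, not the declarations XHTML also requires.

```python
import xml.etree.ElementTree as ET

# Loose, HTML-style mark-up: unclosed <p> tags and a bare <hr>.
LOOSE_HTML = "<body><p>First paragraph<hr><p>Second paragraph</body>"

# The same content written to XML's rules: every tag closed and nested properly.
STRICT_XHTML = "<body><p>First paragraph</p><hr /><p>Second paragraph</p></body>"


def is_well_formed(markup):
    """Return True if an XML parser accepts the mark-up, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(LOOSE_HTML))
print(is_well_formed(STRICT_XHTML))
```

A forgiving HTML browser would display both fragments. An XML parser rejects the first outright, which is the discipline XHTML asks authors to adopt.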

The ambition is that rather than convert XML documents into HTML, future browsers will be XML-based, displaying XML documents natively and avoiding any conversion process as XHTML documents will look like regular XML to these browsers.

The web services link
Web services has been one of the big technology rallying points for the last two years. Promising a standard way for distributed applications to exchange information and interrogate each other, web services is a combination of XML and standard web traffic protocols. It uses XML-based data to describe what services and information an application provides and then transfers that information using standard web traffic to other applications on a network or on the Internet. Since the means to describe those services and the messages the services can send are both well-defined XML formats, web services provides a universally understandable, machine- and human-readable way for applications to integrate with each other and pass information between themselves.
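The exchange can be sketched without any network at all. In the Python example below, one function plays the calling application and another plays the service; the message layout, the `convert_currency` service and its fixed rate are all hypothetical, and far simpler than standardised formats such as SOAP for messages and WSDL for service descriptions.

```python
import xml.etree.ElementTree as ET


def build_request(service, **params):
    """Serialise a call to a named service as an XML message (a string)."""
    request = ET.Element("request", {"service": service})
    for name, value in params.items():
        ET.SubElement(request, "param", {"name": name}).text = str(value)
    return ET.tostring(request, encoding="unicode")


def handle_request(xml_text):
    """Parse a request, run the matching service and reply in XML."""
    request = ET.fromstring(xml_text)
    params = {p.get("name"): p.text for p in request.findall("param")}
    response = ET.Element("response", {"service": request.get("service")})
    if request.get("service") == "convert_currency":
        # Hypothetical fixed exchange rate, purely for illustration.
        result = float(params["amount"]) * 1.5
        ET.SubElement(response, "result").text = f"{result:.2f}"
    else:
        ET.SubElement(response, "error").text = "unknown service"
    return ET.tostring(response, encoding="unicode")


message = build_request("convert_currency", amount=10)
reply = handle_request(message)
print(message)
print(reply)
```

In a real deployment the string returned by `build_request` would travel over standard web traffic rather than a function call, but both sides would still need nothing more than an XML parser to read it.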


Rob Buckley – Freelance Journalist and Editor
