Rob Buckley – Freelance Journalist and Editor

XML crib sheet

XML is the 'lingua franca' for e-documents. It provides a standard mark-up language for defining the structure of documents, thereby enabling their transmission, validation and interpretation between applications and between organisations.

Previously, developers had to reverse engineer Office's many document formats in order to read them, usually imperfectly.

“Perhaps the most important factor relating to standard XML file formats is that of human-readable tags and standard processing techniques,” says Gary Edwards, OpenOffice.org's representative on the OASIS OpenOffice XML Format Technical Committee. “With a proprietary file format, users had to either get special permission from the application vendor, or reverse engineer the binary format, in order to work with the files in ways that met their specific needs. With a standard XML file format, users can mine, re-use and re-purpose information any way they can think of. Plus, the standardisation of the file formats and related XML transformation technologies means that powerful machines can be constructed to service advanced content management and collaboration needs without having to beg the application vendor for permission or future enhancements.”

The many identities of XML
Unlike HTML, XML has no pre-defined tags or way of ordering tags for content. Consequently, any organisation that uses XML has to decide which set of tags ('schema') it will use. If it never intends the document's structure to be understandable to anyone outside the organisation, it can choose a completely arbitrary schema. But if it is to exchange documents with another organisation, that organisation will need to understand the schema underlying the document.
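To make the point concrete, here is a minimal sketch in Python of a document written to an invented, in-house schema. The tag and attribute names (invoice, customer, line, sku, quantity, unit_price) are assumptions made up for this illustration and belong to no published schema. A standard parser can read the file, but the code only produces a meaningful total because its author knows what each tag is supposed to mean.

```python
import xml.etree.ElementTree as ET

# A hypothetical in-house vocabulary: these tag and attribute names are
# invented for illustration and come from no published schema.
INVOICE_XML = """
<invoice number="1001">
  <customer>Example Ltd</customer>
  <line sku="A-1" quantity="2" unit_price="9.50"/>
  <line sku="B-7" quantity="1" unit_price="20.00"/>
</invoice>
"""

def invoice_total(xml_text):
    """Sum quantity * unit_price over every <line> element."""
    root = ET.fromstring(xml_text.strip())
    return sum(
        int(line.get("quantity")) * float(line.get("unit_price"))
        for line in root.findall("line")
    )

total = invoice_total(INVOICE_XML)
```

A partner organisation receiving this file could parse it just as easily, but would still need to be told that "line" means an invoice line and that "unit_price" excludes or includes tax, which is exactly the shared understanding a schema is meant to provide.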

Rather than reinvent the wheel, many organisations are using schemas appropriate for their industries such as ebXML (ebusiness XML), LegalXML and ACORD XML for Life Insurance. There are currently many thousands of pre-defined schemas, so picking one appropriate to the organisation, its partners and its purpose can be hard. But by using a pre-defined schema and XML, organisations can have the benefit of industry experience in a pre-packaged form, and a document exchange format that almost any system can read without the need for a developer to create a parsing system especially for that format.

Making content fluid
XML documents store both content and structure. But they do not contain information about how to display either. By divorcing content from presentation, XML has become a highly useful publishing tool. Using XML 'stylesheets' for different media and purposes, it is possible to use the same XML document as the content source for a web page, a brochure and a letter, for example. Each stylesheet will specify which elements should be displayed - so if headlines should only appear on the web page and the letter, the brochure's stylesheet would specify that headlines should be ignored. The stylesheet will also specify medium-appropriate formatting so that the same content can appear in different fonts, sizes, colours and spacing depending on where it will be seen.
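The idea can be sketched in a few lines of Python. This is not a real stylesheet language such as XSLT or CSS: the STYLESHEETS dictionary, the render function and the article, headline and body tags are all hypothetical names invented for the illustration. Each "stylesheet" simply maps an element name to a formatting rule, and any element without a rule is left out of that medium.

```python
import xml.etree.ElementTree as ET

# One content source, with invented tag names.
ARTICLE_XML = """
<article>
  <headline>New product range</headline>
  <body>Our range launches in May.</body>
</article>
"""

# Hypothetical per-medium "stylesheets": each maps an element name to a
# formatting function. An element missing from the mapping is not shown.
STYLESHEETS = {
    "web": {
        "headline": lambda text: f"<h1>{text}</h1>",
        "body": lambda text: f"<p>{text}</p>",
    },
    "brochure": {
        # No "headline" rule, so headlines are left out of the brochure.
        "body": lambda text: text.upper(),
    },
}

def render(xml_text, medium):
    """Apply the chosen medium's rules to each child of the root element."""
    rules = STYLESHEETS[medium]
    root = ET.fromstring(xml_text.strip())
    return [rules[child.tag](child.text) for child in root if child.tag in rules]

web_output = render(ARTICLE_XML, "web")
brochure_output = render(ARTICLE_XML, "brochure")
```

The XML itself never changes. Adding a third medium, such as a letter, means adding a third set of rules rather than a third copy of the content.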

Since content can be hidden and displayed using stylesheets, a single XML document can potentially contain multi-lingual content. By using a different stylesheet for each language, the appropriate text can be displayed at the same point in catalogues intended for different countries.
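A rough Python sketch of the multi-lingual case follows. It uses xml:lang, the attribute the XML specification reserves for marking language; the catalogue, item and name tags and the item_names function are assumptions invented for the example, and the selection is done in plain Python rather than in a real stylesheet.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the reserved xml:lang attribute under the XML
# namespace URI, so it has to be looked up by this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented catalogue vocabulary holding two languages side by side.
CATALOGUE_XML = """
<catalogue>
  <item code="K-200">
    <name xml:lang="en">Kettle</name>
    <name xml:lang="fr">Bouilloire</name>
  </item>
</catalogue>
"""

def item_names(xml_text, language):
    """Return {item code: name} using only names in the given language."""
    root = ET.fromstring(xml_text.strip())
    return {
        item.get("code"): name.text
        for item in root.findall("item")
        for name in item.findall("name")
        if name.get(XML_LANG) == language
    }

french_names = item_names(CATALOGUE_XML, "fr")
english_names = item_names(CATALOGUE_XML, "en")
```

Both catalogues are produced from the same file, so a change to an item code or a price only has to be made once.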

The future of the web
The web was founded on HTML, a simple mark-up language based on SGML and intended for marking up documents for viewing across a network. HTML version 1.0 included few of the formatting capabilities of the latest version, 4.0, and many of its tags were concerned with structure rather than how the document would end up looking in a browser.

For example, rather than specify that a piece of text would be bold or italic, documents would be marked up with tags such as <strong> and <em>, leaving it up to the browser to decide whether to highlight the text in bold, use a different font or increase the font size to indicate the emphasis.

Since the release of HTML version 3.0, its maintainers, the W3C, have been working on ways to move the web from using HTML to XML as the language of web page mark-up. A key first step came with version 4.0 of HTML, which included the ability to use stylesheets to separate the formatting of a web page from its structure.

The next step was the introduction of XHTML, a reformulation of HTML according to the rules of XML. XHTML is an attempt to get web authors and browsers used to the strict formatting of XML before the final push to a fully XML world. Every element must now be closed, even one that has no content (such as a horizontal rule, which is written <hr />); tags must be formed and nested correctly; and documents need far more of the machine-readable declarations found in XML.
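The strictness is easy to demonstrate with any XML parser. The Python sketch below checks only well-formedness, meaning that tags are closed and nested correctly; it does not validate a page against the XHTML definition itself. The is_well_formed function and the two sample fragments are invented for the illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True

# Old-style HTML habits: unclosed <p> elements and a bare <hr>.
html_style = "<div><p>First<hr><p>Second</div>"

# The same content with every element closed and correctly nested.
xhtml_style = "<div><p>First</p><hr /><p>Second</p></div>"

html_ok = is_well_formed(html_style)
xhtml_ok = is_well_formed(xhtml_style)
```

A browser of the period would happily display the first fragment, whereas an XML parser rejects it outright, which is the discipline XHTML is meant to instil.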
