The eXtensible Markup Language (XML)

The eXtensible Markup Language (XML) is the core technology in the current generation of service-oriented architecture mechanisms for building distributed applications, and provides the basis for effectively managing information in large enterprise-wide environments. XML enables electronic data to be tagged and processed in a standard way. It omits the more complex and less-used syntax of its parent language, the Standard Generalized Markup Language (SGML), and in doing so offers benefits such as:

  1. documents that are easier to understand,
  2. standard definitions of document types (XML Schemas, DTDs),
  3. an open architecture, which allows greater ability to exchange information,
  4. better suitability for delivery and interoperability over the Web.

Information passed between programs and computer systems is becoming richer and more complex as new web- and network-based applications are being developed. Such information must be self-describing, so that client software can successfully interpret and utilize external data. XML is an ideal data format for storing structured and semi-structured content.

An XML document contains special instructions, called tags, which usually enclose identifiable parts of a document and identify specific characteristics (attributes) of the data. These tags are used to describe the nature of an object, as opposed to how it should be displayed or printed. Since the information is self-describing, it can be easily located, extracted, manipulated, and exchanged as desired. In addition, since XML is a subset of SGML, it also supports the parent-child relationships utilized in SGML (i.e., a “child” is a sub-element of a “parent” element).

XML can be utilized for describing both the structure of a document and the specific data contained within it. However, XML is not restricted to documents: it can also be utilized to describe the structure and content of objects found in databases. This idea, coupled with advancements in browser technology, parsing, connectivity, application interfaces, databases, programming languages and IDEs, all of which provide support for XML, has created a solid basis for establishing practical strategies that utilize XML to support both small and large information frameworks.

Data Exchange Language

XML is considered a data exchange language. However, it goes far beyond the bounds of the data exchange formats utilized in the past (e.g., comma-separated values (CSV), fixed-width formats, etc.). XML has the inherent capability to enforce standard mappings of data, so that one business can utilize the data obtained from another business. This type of data exchange can be supported for businesses in the same industry or in unrelated industries, as long as an agreed-upon standard (an XML Schema) has been set up to support it. This concept of exchanging business data with other businesses is sometimes referred to as B2B (business-to-business).
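
As a rough illustration of moving from a flat exchange format to a self-describing one, the following Python sketch converts a small CSV export into XML using only the standard library. The column names, the sample rows and the `orders`/`order` tag names are assumptions made up for this example; they do not come from any particular standard.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical CSV export; the columns and values are illustrative only.
CSV_TEXT = """order_id,customer,total
1001,Acme Ltd,250.00
1002,Globex,99.50
"""

def csv_to_xml(csv_text, root_tag="orders", row_tag="order"):
    """Build an XML tree with one element per row and one child per column.

    Assumes every column name is also a valid XML element name.
    """
    root = ET.Element(root_tag)
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = ET.SubElement(root, row_tag)
        for column, value in row.items():
            ET.SubElement(record, column).text = value
    return root

orders = csv_to_xml(CSV_TEXT)
xml_text = ET.tostring(orders, encoding="unicode")
print(xml_text)
```

Unlike the CSV input, each value in the output carries its own name, so a receiving program does not need to know the column order in advance.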

It is often thought that XML is only for the Internet, a so-called extension of HTML. Holding fast to this notion can blindside the IT division(s) within an organization. IT departments are commonly involved in exchanging data from one information system to another, often mapping and exchanging data from spreadsheets or flat files to databases, or the converse. This type of data exchange is often custom mapped for the specific application at hand, and is generally not reusable. However, establishing standards within an organization for data exchange and management can greatly reduce data inconsistencies and the IT effort associated with this type of activity. In addition, data exchange standards provide a means to exchange and share select information with other businesses or industries, and to gain greater visibility alongside your market competitors.

In this information age, the ability to exchange, analyze, and utilize data from a large variety of different organizations can greatly increase our understanding of the industries that propel us through the 21st century and beyond.

Using XML

XML is one of a class of markup languages: notations for describing some form of content. The most widely used (and known) markup language is HTML, used to represent most web content. HTML is quite a restricted language in practice, and it describes content in terms of appearance. For example, most of the tags in HTML describe (or at least strongly hint) what a particular piece of content should look like when rendered on a web page. It should be said that many of the most display-related tags are now deprecated in favor of cascading style sheets. However, HTML is still essentially about appearance. XML, on the other hand, uses tags (in fundamentally the same way as HTML) which do not describe appearance: they simply describe the content (and also the structure) of documents.

Unlike HTML, which has a predefined set of tags, XML has none: you can 'make up your own'. For example, here is a fragment of XML representing Computer Science modules:

    <modules>
        <level num='1'>
            <module name='Principles and Practice of Programming' code='CS_141'/>
            <module name='Languages to Hardware' code='CS_113'/>
        </level>
        <level num='2'>
            <module name='System Specification' code='CS_213'/>
        </level>
        <level num='3'>
            <module name='Internet Computing' code='CS_338'/>
            <module name='High Performance Microprocessors' code='CS_313'/>
        </level>
        <level num='M'>
            <module name='Writing Web and Web Service Applications' code='CS_M68'/>
        </level>
    </modules>

In the example above, we have introduced three tags: modules, level and module. Two of these tags have attributes: level has one (num) and module has two (name and code). In this example, we have used attributes to encode some information; however, we could have just used more tags. For example, a module's name could be written as a nested tag like this:

        <name>Program Design</name>

(The choice between using attributes or nested tags to represent content and structure is often fairly arbitrary. However, many people would argue that attributes should only be used to represent meta-data: that is, data about data. These people would argue against the first version above, because clearly 'name' and 'code' are not meta-data.)
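
To make the two styles concrete, here is a minimal Python sketch that reads the same kind of module data in both forms with the standard library's ElementTree parser. The attribute-style document is a shortened, well-formed version of the module list; the element-style document, including its `<num>` and `<code>` tags, is an assumption made for this illustration rather than part of the original notes.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the module list (attribute style).
ATTRIBUTE_STYLE = """
<modules>
    <level num='1'>
        <module name='Principles and Practice of Programming' code='CS_141'/>
        <module name='Languages to Hardware' code='CS_113'/>
    </level>
    <level num='2'>
        <module name='System Specification' code='CS_213'/>
    </level>
</modules>
"""

# The same kind of data using nested tags; this layout is hypothetical.
ELEMENT_STYLE = """
<modules>
    <level>
        <num>2</num>
        <module>
            <name>System Specification</name>
            <code>CS_213</code>
        </module>
    </level>
</modules>
"""

def modules_from_attributes(xml_text):
    """Return (level, code, name) tuples read from attributes."""
    root = ET.fromstring(xml_text)
    return [(level.get("num"), module.get("code"), module.get("name"))
            for level in root.findall("level")
            for module in level.findall("module")]

def modules_from_elements(xml_text):
    """Return (level, code, name) tuples read from nested child elements."""
    root = ET.fromstring(xml_text)
    return [(level.findtext("num"), module.findtext("code"), module.findtext("name"))
            for level in root.findall("level")
            for module in level.findall("module")]

print(modules_from_attributes(ATTRIBUTE_STYLE))
print(modules_from_elements(ELEMENT_STYLE))
```

Either form carries the same information; the difference shows up only in whether the reading code asks for an attribute (`get`) or for the text of a child element (`findtext`).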

Some basic rules regarding legal XML documents:

  • Start with version statement. Documents should start with an XML declaration stating the version of XML in use, typically <?xml version = "1.0"?>. Version 1.0 is by far the most widely used. The World Wide Web Consortium (W3C) has also published a version 1.1, which sees little use in practice.
  • Case sensitive. Unlike HTML tags, XML tags are case sensitive.
  • Tags must match. Again unlike (some) HTML tags, all opening tags in XML must be matched by a corresponding closing tag. For example, in the first example above, each opening <level> tag was matched by a closing </level> tag.
  • Empty tags. In the event that a tag is empty, it can be abbreviated. For example, <level></level> can be written <level/>. We did this with the module tags in the first example above.
  • Nested tags. All opening and closing tags must be properly nested.
  • Root tag. The entire contents of an XML document must be within a single opening and closing root tag pair. In the first example above, this is modules.
  • Literal data. Sometimes you wish to represent data in an un-interpreted form: for example, these notes are all written in XML, and this and other chapters contain fragments of XML that should appear unchanged in the final form. To do this, surround such data with the delimiters <![CDATA[ and ]]>.
  • Special characters. The characters &, <, >, " and ' are special, and are written within a document as &amp;, &lt;, &gt;, &quot; and &apos;. (Strictly, only the first two must always be written in this form: the others need only be used if the data is otherwise ambiguous.) In general, any character (including the many international characters not on standard keyboards) can be written as a numeric character reference of the form &#n; (decimal) or &#xh; (hexadecimal). Named references other than the five above must first be declared, for example in a DTD.
  • Comments. Comments are written between the following tags: <!-- and -->.
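
Several of these rules can be checked mechanically. The sketch below uses Python's standard ElementTree parser as a well-formedness check and `xml.sax.saxutils.escape` to write special characters safely. The helper name `is_well_formed` and the sample strings are made up for this illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Matched, properly nested tags inside a single root: accepted.
ok = is_well_formed("<modules><level num='1'/></modules>")

# Opening <level> tag is never closed: rejected.
unclosed = is_well_formed("<modules><level num='1'></modules>")

# Tags are case sensitive, so <Level> is not closed by </level>: rejected.
wrong_case = is_well_formed("<modules><Level></level></modules>")

# Two top-level elements, so there is no single root: rejected.
two_roots = is_well_formed("<level/><level/>")

# escape() rewrites &, < and > so the text is safe as element content.
escaped = escape("if a < b && b > c")

# A CDATA section passes its content through un-interpreted.
note = ET.fromstring("<note><![CDATA[<level num='1'/> & more]]></note>")

# A numeric character reference stands for a single character.
word = ET.fromstring("<p>caf&#233;</p>")

print(ok, unclosed, wrong_case, two_roots)
print(escaped)
print(note.text)
print(word.text)
```

A parser that accepts a document only confirms it is well-formed; whether it follows an agreed tag set is a separate question answered by schema validation.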

Some of the advantages of using the XML format include:

  • Support for User-Defined Tags. You might agree on a set of tags within an interested community. This 'community' might be large and international, or industry-based. On the other hand, it might just be you, if you wish to store, transfer or manipulate data in a certain form.
  • Support for Style Documents. XML documents can be parsed and transformed into other formats by using appropriate software and style sheets. One such set of software is Xalan, from the Apache project. The course notes for this (and my other) modules are generated either using Xalan or the tools built into Microsoft's Visual Studio 2005, together with a set of three style sheets for the web pages (HTML) and the slides and notes (both PDF, which also uses Apache's formatting object processor). Stylesheets are written in the Extensible Stylesheet Language (XSL) and are themselves legal XML documents.
  • Schema Support. If you have agreed on a set of tags, you may wish to manipulate documents using stylesheets based on them, or to parse and generate them in programs (Java and C#, for example, have rich sets of XML processing tools, as do some other languages). You may then well wish to validate that your XML documents conform to the rules of the agreed tag set. This can be done using schemas, which are themselves valid XML documents.
  • Support for the Document Object Model. To read, write and process XML from within programming languages, you need a set of tools. A number of the available parsers for XML expose by means of an API the structure of an XML document once it has been parsed. There is a set of standard recommendations from the World Wide Web Consortium (W3C) for managing this, called the Document Object Model (DOM).
  • Support for Namespaces. XML namespaces provide a simple method for qualifying element and attribute names used in Extensible Markup Language documents by associating them with namespaces identified by URI references.
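
The DOM and namespace points above can be sketched with Python's standard library, which ships a small DOM implementation (`xml.dom.minidom`) alongside ElementTree. The document, its tag names and the two `http://example.org/...` namespace URIs are hypothetical and serve only as identifiers for this example.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical document mixing two namespaces; the URIs are placeholders.
DOC = """<?xml version="1.0"?>
<cat:catalogue xmlns:cat="http://example.org/catalogue"
               xmlns:cs="http://example.org/computer-science">
    <cs:module code="CS_338">Internet Computing</cs:module>
    <cat:note>Level 3</cat:note>
</cat:catalogue>
"""

CS_NS = "http://example.org/computer-science"

# DOM view: the parsed document is a tree of nodes reached through an API.
dom = minidom.parseString(DOC)
dom_root = dom.documentElement
dom_modules = dom.getElementsByTagNameNS(CS_NS, "module")
dom_code = dom_modules[0].getAttribute("code")
dom_title = dom_modules[0].firstChild.data

# ElementTree view: names are qualified as {namespace-uri}local-name,
# and a prefix map lets searches use short prefixes.
tree_root = ET.fromstring(DOC)
module = tree_root.find("cs:module", {"cs": CS_NS})

print(dom_root.tagName, dom_code, dom_title)
print(module.tag, module.get("code"))
```

The prefixes `cat` and `cs` are only local shorthand; it is the URI behind each prefix that distinguishes one community's `module` tag from another's.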

XML is a simple, very flexible text format that has been designed to meet the challenges of large-scale electronic publishing, and today is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere.