Information Technology Tips: Introduction to XML

The Extensible Markup Language, or XML, is a portable, human-readable format for exchanging text or data between programs. XML is derived from the parent standard SGML, as is the HTML language used on web pages worldwide. XML, then, is HTML's younger but more capable sibling. And since most developers know at least a bit of HTML, parts of this discussion compare XML with HTML. XML's lesser-known grandparent is IBM's GML (General Markup Language), and one of its cousins is Adobe FrameMaker's Maker Interchange Format (MIF). The following figure depicts the family tree.

XML's ancestry

One way of thinking about XML is that it's HTML cleaned up, consolidated, and with the ability for you to define your own tags. It's HTML with tags that can and should identify the informational content as opposed to the formatting. Another way of perceiving XML is as a general interchange format for such things as business-to-business communications over the Internet or as a human-editabledescription of things as diverse as word-processing files and Java documents. XML is all these things, depending on where you're coming from as a developer and where you want to go today—and tomorrow.

Because of the wide acceptance of XML, it is used as the basis for many other formats, including the Open Office (http://www.openoffice.org/) save file format, the SVG graphics file format, and many more.

From SGML, both HTML and XML inherit the syntax of using angle brackets (< and >) around tags, each pair of which delimits one part of an XML document, called an element . An element may contain content (like a <P> tag in HTML) or may not (like an <hr> in HTML). While HTML documents can begin with either an <html> tag or a <DOCTYPE...> tag (or, informally, with neither), an XML file may begin with an XML declaration. Indeed, it must begin with an XML processing instruction (<? ... ?> ) if the file's character encoding is other than UTF-8 or UTF-16:

<?xml version="1.0"  encoding="iso-8859-1"?>

The question mark is a special character used to identify the XML declaration (it's syntactically similar to the % used in ASP and JSP).

HTML has a number of elements that accept attributes, such as:

<BODY BGCOLOR=white> ... </body>

In XML, attribute values (such as the 1.0 for the version in the processing instruction or the white of BGCOLOR) must be quoted. In other words, quoting is optional in HTML, but required in XML.

The BODY example shown here, while allowed in traditional HTML, would draw complaints from any XML parser. XML is case-sensitive; in XML, BODY, Body, and body represent three different element names. Yes, each XML start tag must have a matching end tag. This is one of a small list of basic constraints detailed in the XML specification. Any XML file that satisfies all of these constraints is said to be well-formed and is accepted by an XML parser. A document that is not well-formed is rejected by an XML parser.

Speaking of XML parsing, a great variety of XML parsers are available. A parser is simply a program or class that reads an XML file, looks at it at least syntactically, and lets you access some or all of the elements. Most of these parsers conform to the Java bindings for one of the two well-known XML APIs, SAX and DOM. SAX, the Simple API for XML, reads the file and calls your code when it encounters certain events, such as start-of-element, end-of-element, start-of-document, and the like. DOM, the Document Object Model, reads the file and constructs an in-memory tree or graph corresponding to the elements and their attributes and contents in the file. This tree can be traversed, searched, modified (even constructed from scratch, using DOM), or written to a file.

But how does the parser know if an XML file contains the correct elements? Well, the simpler, "nonvalidating" parsers don't—their only concern is the well-formedness of the document. Validating parsers check that the XML file conforms to a given Document Type Definition (DTD) or an XML Schema. DTDs are inherited from SGML. Schemas are newer than DTDs and, while more complex, provide such object-based features as inheritance. DTDs are written in a special syntax derived from SGML while XML Schemas are expressed using ordinary XML elements and attributes.

In addition to parsing XML, you can use an XML processor to transform XML into some other format, such as HTML. This is a natural for use in a web servlet: if a given web browser client can support XML, just write the data as-is, but if not, transform the data into HTML. We'll look at two approaches to XML transformation: transformation using a generic XSLT processor and then later some parsing APIs suitable for customized operations on XML.

If you need to control how an XML document is formatted, for screen or print, you can use XSL ( Extensible Style Language). XSL is a more sophisticated variation on the HTML stylesheet concept that allows you to specify formatting for particular elements. XSL has two parts: tree transformation (for which XSLT was designed, though it can also be used independently, as we'll see) and formatting (the non-XSLT part is informally known as XSL-FO or XSL Formatting Objects).

XSL stylesheets can be complex; you are basically specifying a batch formatting language to describe how your textual data is formatted for the printed page. A comprehensive reference implementation is FOP (Formatting Objects Processor), which produces Acrobat PDF output and is available from http://xml.apache.org/.

Prior to JDK 1.4, writing portable XML-based Java programs was difficult because there was no single standard API. JDK 1.4 introduced JAXP, the Java API for XML Processing, which provides standard means for accessing the various components discussed in this chapter. If you are still using JDK 1.3, you may need to acquire additional JAR files and/or change the examples somewhat.

Information Technology Tips

Monday, August 22, 2011

Introduction to XML

XML's ancestry

No comments:

Post a Comment