XML basics {#_xml_basics}

XML is a hierarchical markup language. It uses opening and closing tags to define data. It's used to store and exchange data, and because of its extreme flexibility, it gets used for everything from documentation to graphics.

Here's a sample XML document:

<xml>
  <os>
   <linux>
    <distribution>
      <name>Fedora</name>
      <release>8</release>
      <codename>Werewolf</codename>
    </distribution>

    <distribution>
      <name>Slackware</name>
      <release>12.1</release>
      <mascot>
        <official>Tux</official>
        <unofficial>Bob Dobbs</unofficial>
      </mascot>
    </distribution>
   </linux>
  </os>
</xml>

Reading the sample XML, you might find that there's an intuitive quality to the format. You can probably understand the data in this sample XML document whether you're familiar with the subject matter or not. This is partly because XML is considered verbose. It uses lots of tags, the tags can have long and descriptive names, and the data is ordered in a hierarchical manner that helps explain how its pieces relate. You probably understand from this sample that the Fedora distribution and the Slackware distribution are two different and unrelated instances of Linux, because each one is "contained" inside its own independent **<distribution>** tag.

XML is also extremely flexible. Unlike HTML, there's no predefined list of tags. You are free to create whatever data structure you need to represent.

Components of XML {#_components_of_xml}

Data exists to be read, and when a computer "reads" data, the process is called *parsing*. Using the sample XML data again, here are the terms that most XML parsers consider significant.

**Document**: The **<xml>** tag opens a *document*, and the **</xml>** tag closes it.

**Node**: The **<os>**, **<distribution>**, and **<mascot>** tags are *nodes*. In parsing terminology, a node is a tag that contains other tags.

**Element**: An entity such as **<name>Fedora</name>** or **<official>Tux</official>**, from the first **<** to the last **>**, is an *element*.

**Content**: The data between two element tags is considered *content*. In the first **<name>** element, the string **Fedora** is the content.
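To make these terms concrete, here is a minimal sketch that reads the sample document with Python's standard `xml.etree.ElementTree` module. The choice of Python and the helper name `has_children` are assumptions made for illustration. Note that ElementTree itself calls every tag an "element", so the node/element distinction used here is this article's, not the library's.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<xml>
  <os>
    <linux>
      <distribution>
        <name>Fedora</name>
        <release>8</release>
        <codename>Werewolf</codename>
      </distribution>
      <distribution>
        <name>Slackware</name>
        <release>12.1</release>
        <mascot>
          <official>Tux</official>
          <unofficial>Bob Dobbs</unofficial>
        </mascot>
      </distribution>
    </linux>
  </os>
</xml>"""

# Parse the text; the result is the outermost <xml> tag.
root = ET.fromstring(SAMPLE)

# The first <name> tag in document order.
first_name = root.find(".//name")


def has_children(element):
    """Hypothetical helper: True when a tag contains other tags."""
    return len(element) > 0


print(root.tag)                               # outermost tag name
print(first_name.tag, first_name.text)        # tag name and its content
print(has_children(root.find(".//mascot")))   # <mascot> contains other tags
print(has_children(first_name))               # <name> holds only content
```

The `.text` attribute holds what the article calls content, and `len(element)` counts the tags directly inside an element.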

XML schema {#_xml_schema}

The tags and tag inheritance in an XML document are known as its *schema*.

Some schemas are made up as you go (for example, the sample XML code in this article was purely improvised), while others are strictly defined by a standards group. The Scalable Vector Graphics (SVG) schema is defined by the W3C, while the Docbook schema is defined by Norman Walsh.

A schema enforces consistency. The most basic schemas are usually also the most restrictive. In my example XML code, it wouldn't make sense to place a distribution name within the **<mascot>** node, because the implied schema of the document makes it clear that a mascot must be a "child" element of a distribution.
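As a rough illustration of what "implied schema" means, the sketch below hand-codes the parent and child rules suggested by the sample document and reports any tag that appears where those rules don't allow it. The `ALLOWED_CHILDREN` table and the `violations` function are hypothetical, written only for this example. Real projects would more likely use a formal schema language such as DTD, XML Schema, or RELAX NG.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules inferred from the sample document, not a formal schema.
ALLOWED_CHILDREN = {
    "xml": {"os"},
    "os": {"linux"},
    "linux": {"distribution"},
    "distribution": {"name", "release", "codename", "mascot"},
    "mascot": {"official", "unofficial"},
}


def violations(element):
    """Return 'parent/child' strings for every child tag the rules don't allow."""
    problems = []
    allowed = ALLOWED_CHILDREN.get(element.tag, set())
    for child in element:
        if child.tag not in allowed:
            problems.append(f"{element.tag}/{child.tag}")
        # Keep checking deeper levels of the tree.
        problems.extend(violations(child))
    return problems


good = ET.fromstring(
    "<distribution><name>Slackware</name>"
    "<mascot><official>Tux</official></mascot></distribution>"
)
bad = ET.fromstring(
    "<distribution><mascot><name>Slackware</name></mascot></distribution>"
)

print(violations(good))  # no problems expected
print(violations(bad))   # the misplaced <name> inside <mascot> is reported
```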

Document Object Model (DOM) {#_data_object_model_dom}

Talking about XML would get confusing if you had to constantly describe tags and positions ("the name tag of the second distribution tag in the linux part of the os section"), so parsers use the concept of a Document Object Model (DOM) to represent XML data. The DOM places XML data into a sort of "family tree" structure, starting from the root element (in my sample XML, that's the `os` tag), and including each tag.

image: dom.jpg

This same XML data structure can be expressed as paths, just like files in a Linux system or the location of webpages on the Internet. For instance, the path to the **<mascot>** tag can be represented as `//os/linux/distribution/mascot`. Only the Slackware distribution contains a mascot in the sample, so that path matches a single node.

The path to *both* **<distribution>** tags can be represented as `//os/linux/distribution`. Because there are two distribution nodes, a parser loads both nodes (and the contents of each) into an array that can be queried.
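Here is a short sketch of such path queries using Python's standard `xml.etree.ElementTree`, which supports a limited subset of XPath. One assumption to note: ElementTree searches relative to an element, so the paths start with `.//` rather than `//`.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<xml>
  <os>
    <linux>
      <distribution>
        <name>Fedora</name>
        <release>8</release>
        <codename>Werewolf</codename>
      </distribution>
      <distribution>
        <name>Slackware</name>
        <release>12.1</release>
        <mascot>
          <official>Tux</official>
          <unofficial>Bob Dobbs</unofficial>
        </mascot>
      </distribution>
    </linux>
  </os>
</xml>"""

root = ET.fromstring(SAMPLE)

# Both <distribution> nodes come back as a list that can be queried.
distributions = root.findall(".//os/linux/distribution")
names = [d.findtext("name") for d in distributions]

# Only one distribution contains a <mascot>, so this list has one entry.
mascots = root.findall(".//os/linux/distribution/mascot")

# A predicate narrows the match to the distribution named Slackware.
slackware = root.find(".//distribution[name='Slackware']")

print(names)
print(mascots[0].findtext("official"))
print(slackware.findtext("release"))
```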

Strict XML {#_strict_xml}

XML is also known for being strict. This means that most applications are designed to intentionally fail when they encounter errors in XML. That may sound problematic, but it's one of the things developers appreciate the most about XML, because unpredictable things can happen when applications try to guess how to resolve an error. For example, back before HTML was well defined, most web browsers had to include a "quirks mode" so that when people tried to view poor HTML code, the web browser could load what the author *probably* intended. The results were wildly unpredictable, especially when one browser guessed differently than another.

XML disallows this by intentionally failing when there's an error. This lets the author fix errors until valid XML is produced. Because XML is well-defined, there are validator plugins for many applications, and stand-alone commands like `xmllint` and `xmlstarlet`, to help you locate errors early.
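The fail-on-error behavior is easy to observe. In this minimal sketch, Python's standard `xml.etree.ElementTree` refuses to parse a snippet whose **<name>** tag is never closed. The function name `check_well_formed` is an assumption made for this example, and it checks only that the XML is well-formed, not that it follows any particular schema.

```python
import xml.etree.ElementTree as ET

# The <name> tag is never closed, so this is not well-formed XML.
BROKEN = "<distribution><name>Fedora</distribution>"
FIXED = "<distribution><name>Fedora</name></distribution>"


def check_well_formed(text):
    """Hypothetical helper: return (ok, message) instead of raising."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # The parser stops at the first error rather than guessing a repair.
        return False, str(err)
    return True, ""


ok, message = check_well_formed(BROKEN)
print(ok, message)
print(check_well_formed(FIXED))
```

The error message includes a line and column, which is usually enough to find the problem and fix it.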

Transforming XML {#_transforming_xml}

Because XML is often used as an interchange format, transforming XML into some other format or into some other XML schema is a common task.
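As one small illustration of converting XML into a different format, this sketch turns a single **<distribution>** element into JSON using only Python's standard library. The `to_dict` function is hypothetical and deliberately simple: it ignores attributes, and repeated sibling tags with the same name would overwrite one another.

```python
import json
import xml.etree.ElementTree as ET

SAMPLE = (
    "<distribution><name>Slackware</name><release>12.1</release>"
    "<mascot><official>Tux</official><unofficial>Bob Dobbs</unofficial></mascot>"
    "</distribution>"
)


def to_dict(element):
    """Hypothetical helper: leaf tags become text, other tags become dicts."""
    if len(element) == 0:
        return element.text
    return {child.tag: to_dict(child) for child in element}


record = to_dict(ET.fromstring(SAMPLE))
as_json = json.dumps(record, indent=2)
print(as_json)
```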

Learning XML {#_learning_xml}

Writing XML is a lot like writing HTML. Thanks to the hard work of Jay Nick, there are free and fun XML lessons available online, which teach you how to create graphics with XML.

In general, very few special tools are required to explore XML. Thanks to the close relationship of HTML and XML, you can view XML using a web browser. Open source text editors like QXMLEdit, Netbeans, and Kate make typing and reading XML easy with helpful prompts, autocompletion, syntax verification, and more.

Choose XML {#_choose_xml}

XML may look like a lot of data at first, but it's not that much different from HTML (in fact, HTML has been reimplemented as XML in the form of XHTML). It's a unique benefit of XML that the components forming its structure also happen to be metadata providing information about what it's storing. A well-designed XML schema both contains and describes your data, allowing a user to understand it at a glance and to parse it quickly, and enabling developers to parse it efficiently with convenient programming libraries.