Tags and Text and Elements, Oh My!
It's time to stop calling things just "items" and "text"; we need some names for the pieces that make up an XML document. To get cracking, let's break down the simple <name> document we created in Chapter 1:
<name>
  <first>John</first>
  <middle>Fitzgerald Johansen</middle>
  <last>Doe</last>
</name>
The text starting at a < character and ending at a > character is an XML tag.
The information in our document (our data) is contained within the various tags
that constitute the markup of the document. This makes it easy to distinguish
the information in the document from the markup.
As you can see, the tags are paired together, so that any opening tag also has a closing tag. In XML parlance, these are called start-tags and end-tags. The end-tags are the same as the start-tags, except that they have a "/" right after the opening < character.
In this regard, XML tags work the same as start-tags and end-tags do in HTML. For example, you would create an HTML paragraph like this:
<P>This is a paragraph.</P>
As you can see, there is a <P> start-tag, and a </P>
end-tag, just like we use for XML.
All of the information from the start of a start-tag to the end of an end-tag, and including everything in between, is called an element. So:
- <first> is a start-tag
- </first> is an end-tag
- <first>John</first> is an element
The text between the start-tag and end-tag of an element is called the element
content. The content between our tags will often just be data (as opposed
to other elements). In this case, the element content is referred to as Parsed
Character DATA, which is almost always referred to using its acronym, PCDATA.
Whenever you come across a strange-looking term like PCDATA, it's usually a good bet the term is inherited from SGML. Because XML is a subset of SGML, there are a lot of these inherited terms.
The whole document, starting at <name> and ending at </name>, is also an element, which happens to include other elements. (And, in this case, the element is called the root element, which we'll be talking about later.)
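To make the vocabulary concrete, here is a minimal sketch that reads the <name> document with Python's standard xml.etree.ElementTree module. The choice of Python and the variable names (NAME_XML, parts) are assumptions made for illustration, not something this chapter relies on; any XML parser would expose the same pieces.

```python
import xml.etree.ElementTree as ET

# The <name> document from above, held in a string for convenience.
NAME_XML = """<name>
  <first>John</first>
  <middle>Fitzgerald Johansen</middle>
  <last>Doe</last>
</name>"""

root = ET.fromstring(NAME_XML)

# The root element is the one that contains all the others.
print("root element:", root.tag)

# Each child is itself an element; .text holds the character data
# found between its start-tag and end-tag.
parts = [(child.tag, child.text) for child in root]
for tag, text in parts:
    print(f"<{tag}> contains {text!r}")
```

Here root.tag gives the name used in the start-tag and end-tag, and each child's text is the element content we have been calling PCDATA.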
To put this new-found knowledge into action, let's create an example that contains more information than just a name.
Try It Out - Describing Weirdness
We're going to build an XML document to describe one of the greatest CDs ever produced, Dare to be Stupid, by Weird Al Yankovic. But before we break out Notepad and start typing, we need to know what information we're capturing.
In Chapter 1, we learned that XML is hierarchical in nature; information is structured like a tree, with parent/child relationships. This means that we'll have to arrange our CD information in a tree structure as well.
Since this is a CD, we'll need to capture information like the artist, title, and date released, as well as the genre of music. We'll also need information about each song on the CD, such as the title and length. And, since Weird Al is famous for his parodies, we'll include information about what song (if any) this one is a parody of.
Here's the hierarchy we'll be creating:
Some of these elements, like <artist>, will appear only
once; others, like <song>, will appear multiple times in the document. Also,
some will have PCDATA only, while some will include their information as child
elements instead. For example, the <artist> element will contain PCDATA for
the artist's name, whereas the <song> element won't contain any PCDATA of its
own, but will contain child elements that further break down the information.
With this in mind, we're now ready to start entering XML. If you have Internet Explorer 5 installed on your machine, type the following into Notepad, and save it to your hard drive as cd.xml:
<CD>
  <artist>"Weird Al" Yankovic</artist>
  <title>Dare to be Stupid</title>
  <genre>parody</genre>
  <date-released>1990</date-released>
  <song>
    <title>Like A Surgeon</title>
    <length>
      <minutes>3</minutes>
      <seconds>33</seconds>
    </length>
    <parody>
      <title>Like A Virgin</title>
      <artist>Madonna</artist>
    </parody>
  </song>
  <song>
    <title>Dare to be Stupid</title>
    <length>
      <minutes>3</minutes>
      <seconds>25</seconds>
    </length>
    <parody></parody>
  </song>
</CD>
For the sake of brevity, we'll only enter two of the songs
on the CD, but the idea is there nonetheless.
Now, open the file in IE5. (Navigate to the file in Explorer and double click on it, or open up the browser and type the path in the URL bar.) If you have typed in the tags exactly as shown, the cd.xml file will look something like this:
How It Works
Here we've created a hierarchy of information about a CD, so we've named the root element accordingly.
The <CD> element has children for the artist, title, genre, and date, as well as one child for each song on the disc. The <song> element has children for the title, length, and, since this is Weird Al we're talking about, what song (if any) this is a parody of. Again, for the sake of this example, the <length> element was broken down still further, to have children for minutes and seconds, and the <parody> element broken down to have the title and artist of the parodied song.
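The parent/child structure just described can also be seen by walking the tree in code. The following is only a sketch, assuming Python's standard xml.etree.ElementTree module; the outline helper is a hypothetical name introduced here. It prints one indented line per element and shows the text only for elements that contain PCDATA rather than child elements.

```python
import xml.etree.ElementTree as ET

# The cd.xml document from this Try It Out, held in a string.
CD_XML = """<CD>
  <artist>"Weird Al" Yankovic</artist>
  <title>Dare to be Stupid</title>
  <genre>parody</genre>
  <date-released>1990</date-released>
  <song>
    <title>Like A Surgeon</title>
    <length>
      <minutes>3</minutes>
      <seconds>33</seconds>
    </length>
    <parody>
      <title>Like A Virgin</title>
      <artist>Madonna</artist>
    </parody>
  </song>
  <song>
    <title>Dare to be Stupid</title>
    <length>
      <minutes>3</minutes>
      <seconds>25</seconds>
    </length>
    <parody></parody>
  </song>
</CD>"""


def outline(element, depth=0):
    """Return one indented line per element (hypothetical helper)."""
    text = (element.text or "").strip()
    if len(element) == 0 and text:
        # No child elements: the content is plain character data.
        label = f"{element.tag}: {text}"
    else:
        # Either a parent of other elements, or an empty element.
        label = element.tag
    lines = ["  " * depth + label]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


lines = outline(ET.fromstring(CD_XML))
print("\n".join(lines))
```

Elements such as <song> and <length> appear as bare names with their children indented beneath them, while elements such as <minutes> appear with their text alongside.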
You may have noticed that the IE5 browser changed <parody></parody> into <parody/>. We'll talk about this shorthand syntax a little bit later, but don't worry: it's perfectly legal.
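As a quick check that the two spellings mean the same thing, here is a small sketch, again assuming Python's standard xml.etree.ElementTree module rather than anything IE5 does internally. It parses both forms and compares what the parser reports.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<parody></parody>")
short_form = ET.fromstring("<parody/>")

# Both forms parse to an element with the same name, no text, and no children.
for element in (long_form, short_form):
    print(element.tag, element.text, len(element))

# Writing the long form back out: this particular library also chooses
# the shorthand by default, much as the browser display does.
serialized = ET.tostring(long_form, encoding="unicode")
print(serialized)
```

Which spelling a tool writes out is a matter of its own formatting; the element it describes is the same either way.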
If we were to write a CD Player application, we could make use of this information to create a play-list for our CD. It could read the information under our <song> element to get the name and length of each song to display to the user, display the genre of the CD in the title bar, etc. Basically, it could make use of any information contained in our XML document.
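A rough sketch of that play-list idea follows. It assumes Python's standard xml.etree.ElementTree module, and build_playlist is a hypothetical helper named here for illustration; a real CD player application would of course be organised differently.

```python
import xml.etree.ElementTree as ET

# The cd.xml document from this Try It Out, held in a string.
CD_XML = """<CD>
  <artist>"Weird Al" Yankovic</artist>
  <title>Dare to be Stupid</title>
  <genre>parody</genre>
  <date-released>1990</date-released>
  <song>
    <title>Like A Surgeon</title>
    <length>
      <minutes>3</minutes>
      <seconds>33</seconds>
    </length>
    <parody>
      <title>Like A Virgin</title>
      <artist>Madonna</artist>
    </parody>
  </song>
  <song>
    <title>Dare to be Stupid</title>
    <length>
      <minutes>3</minutes>
      <seconds>25</seconds>
    </length>
    <parody></parody>
  </song>
</CD>"""


def build_playlist(cd):
    """Collect (title, length) pairs from each <song> child of <CD>."""
    playlist = []
    for song in cd.findall("song"):
        # "title" matches only the song's own child, not <parody>/<title>.
        title = song.findtext("title")
        minutes = int(song.findtext("length/minutes"))
        seconds = int(song.findtext("length/seconds"))
        playlist.append((title, f"{minutes}:{seconds:02d}"))
    return playlist


cd = ET.fromstring(CD_XML)

# Something a player might show in its title bar.
print(f'{cd.findtext("artist")} - {cd.findtext("title")} [{cd.findtext("genre")}]')

for title, length in build_playlist(cd):
    print(f"{title} ({length})")
```

Because the song title and the parodied song's title sit at different levels of the hierarchy, the program can ask for one without accidentally picking up the other.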
Rules for Elements
Obviously, if we could just create elements in any old way we wanted, we wouldn't be any further along than our text file examples from the previous chapter. There must be some rules for elements, which are fundamental to the understanding of XML.
XML documents must adhere to these rules to be well-formed.
We'll list them, briefly, before getting down to details:
- Every start-tag must have a matching end-tag
- Tags can't overlap
- XML documents can have only one root element
- Element names must obey XML naming conventions
- XML is case-sensitive
- XML will keep white space in your text
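Before going through the rules one at a time, here is a small sketch of how a parser reacts when they are broken. It assumes Python's standard xml.etree.ElementTree module; is_well_formed and the sample strings are hypothetical, made up for this illustration. The digit-led element name stands in for the naming rule, on the assumption (true of XML names) that a name may not begin with a digit.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text (hypothetical helper)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


samples = {
    "matched tags": "<name><first>John</first></name>",
    "missing end-tag": "<name><first>John</name>",
    "overlapping tags": "<name><first>John</name></first>",
    "two root elements": "<name>John</name><name>Jane</name>",
    "case mismatch": "<name>John</Name>",
    "name starting with a digit": "<1name>John</1name>",
}

results = {label: is_well_formed(xml) for label, xml in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'rejected'}")

# White space inside the text is handed to the program untouched.
spaced = ET.fromstring("<first>  John   Doe </first>")
print(repr(spaced.text))
```

Only the first sample is accepted; each of the others breaks one of the rules listed above, and the parser refuses the whole document rather than guessing at what was meant.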