Beginning XML

Well-formed XML (3)


Another important point to keep in mind is that the tags in XML are case-sensitive. (This is a big difference from HTML, which is case-insensitive.) This means that <first> is different from <FIRST>, which is different from <First>.

This sometimes seems odd to English-speaking users of XML, since English words can easily be converted to upper- or lower-case with no loss of meaning. But in many other languages, the concept of case is either not applicable (what's the uppercase of a character from a script that has no case, such as Chinese or Arabic? Or the lowercase, for that matter?), or not straightforward (what's the uppercase of é? The answer may be different, depending on the context). To put intelligent rules into the XML specification for case-folding would probably have doubled or trebled its size, and would still have benefited only the English-speaking section of the population. Luckily, it doesn't take long to get used to having case-sensitive names.

This is the reason that our previous <P></p> HTML example would not work in XML; since the tags are case-sensitive, an XML parser would not be able to match the </p> end-tag with any start-tags, and neither would it be able to match the <P> start-tag with any end-tags.
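The effect of case-sensitivity on tag matching can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample documents are illustrative assumptions, not part of the original example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The end-tag must match the start-tag exactly, including case.
print(is_well_formed("<p>Some text</p>"))  # start-tag and end-tag agree
print(is_well_formed("<P>Some text</p>"))  # <P> and </p> do not match

# Names that differ only by case are different element names.
doc = ET.fromstring("<name><first>John</first><First>Jane</First></name>")
print([child.tag for child in doc])
```

The first call should report a well-formed document and the second should not, while the last line lists first and First as two separate element names.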

Warning! Because XML is case-sensitive, you could legally create an XML document which has both <first> and <First> elements, which have different meanings. This is a bad idea, and will cause nothing but confusion! You should always try to give your elements distinct names, for your sanity, and for the sanity of those to come after you.

To help combat these kinds of problems, it's a good idea to pick a naming style and stick to it. Some examples of common styles are:

  • <first_name>
  • <firstName>
  • <first-name> (some people don't like this convention, because the "-" character is used for subtraction in so many programming languages, but it is legal)
  • <FirstName>

Which style you choose isn't important; what is important is that you stick to it. A naming convention only helps when it's used consistently. For this book, I'll usually use the <FirstName> convention, because that's what I've grown used to.

White Space in PCDATA

There is a special category of characters, called white space. This includes things like the space character, new lines (what you get when you hit the Enter key), and tabs. White space is used to separate words, as well as to make text more readable.

Those familiar with HTML are probably quite aware of the practice of white space stripping. In HTML, any white space considered insignificant is stripped out of the document when it is processed. For example, take the following HTML:

  <p>This is a paragraph.       It has a whole bunch
    of space.</p>

As far as HTML is concerned, anything more than a single space between the words in a <p> is insignificant. So all of the spaces between the first period and the word It would be stripped, except for one. Also, the line feed after the word bunch and the spaces before of would be stripped down to one space. As a result, the previous HTML would be rendered in a browser as:

  This is a paragraph. It has a whole bunch of space.

In order to get the results as they appear in the HTML above, we'd have to add special HTML markup to the source, like the following:

  <p>This is a paragraph. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
  It has a whole bunch<br>
  &nbsp;&nbsp;of space.</p>

&nbsp; specifies that we should insert a space (nbsp stands for Non-Breaking Space), and the <br> tag specifies that there should be a line break. This would format the output as:

  This is a paragraph.       It has a whole bunch
    of space.

Alternatively, if we wanted to have the text displayed exactly as it is in the source file, we could use the <pre> tag. This specifically tells the HTML parser not to strip the white space, so we could write the following and also get the desired results:

  <pre>This is a paragraph.       It has a whole bunch
    of space.</pre> 

However, in most web browsers, the <pre> tag also has the added effect that the text is rendered in a fixed-width font, like the Courier font we use for code in this book.

White space stripping is very advantageous for a language like HTML, which has become primarily a means for displaying information. It allows the source for an HTML document to be formatted in a readable way for the person writing the HTML, while displaying it formatted in a readable, and possibly quite different, way for the user.
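The stripping that HTML performs on ordinary text can be approximated in a few lines. This is only a rough sketch of the idea, assuming that every run of spaces, tabs, and new lines collapses to one space; it is not how a browser is actually implemented, and the function name collapse_white_space is made up for illustration.

```python
import re


def collapse_white_space(text):
    """Replace each run of spaces, tabs, and new lines with a single space."""
    return re.sub(r"[ \t\r\n]+", " ", text)


source = "This is a paragraph.       It has a whole bunch\n  of space."
print(collapse_white_space(source))
# This is a paragraph. It has a whole bunch of space.
```

The non-breaking space that &nbsp; stands for is a different character from the ordinary space, so a pattern like this one leaves it alone, which is why the &nbsp; workaround survives the stripping.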

In XML, however, no white space stripping takes place for PCDATA. This means that for the following XML element:

  <tag>This is a paragraph.       It has a whole bunch
    of space.</tag>

the PCDATA is:

This is a paragraph.       It has a whole bunch
  of space.

Just like our second HTML example, none of the white space has been stripped out. As far as white space stripping goes, all XML elements are treated just as the HTML <pre> tag. This makes the rules much easier to understand for XML than they are for HTML:

In XML, the white space stays.
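To see this rule in action, the element above can be fed to a parser and its PCDATA inspected. The sketch below uses Python's standard xml.etree.ElementTree module as one example of an XML parser; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

source = "<tag>This is a paragraph.       It has a whole bunch\n  of space.</tag>"
element = ET.fromstring(source)

# repr() makes the spaces and the new line visible.
print(repr(element.text))
# 'This is a paragraph.       It has a whole bunch\n  of space.'
```

Every space and the new line arrive in the application exactly as they were typed.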

Unfortunately, if you view the above XML in IE5 the white space will be stripped out - or will seem to be. This is because IE5 is not actually showing you the XML directly; it uses a technology called XSL to transform the XML to HTML, and it displays the HTML. Then, because IE5 is an HTML browser, it strips out the white space.

End-of-Line White Space

However, there is one form of white space stripping that XML performs on PCDATA, which is the handling of new line characters. The problem is that there are two characters that are used for new lines - the line feed character and the carriage return - and computers running Windows, computers running Unix, and Macintosh computers all use these characters differently.

For example, to get a new line in Windows, an application would use the carriage return and the line feed characters together, whereas on Unix only the line feed would be used, and on older Macintosh systems only the carriage return. This could prove to be very troublesome when creating XML documents, because Unix machines would treat the new lines in a document differently from the Windows boxes, which would treat them differently from the Macintosh boxes, and our XML interoperability would be lost.

For this reason, it was decided that XML parsers would change all new lines to a single line feed character before processing. This means that any XML application will know, no matter which operating system it's running under, that a new line will be represented by a single line feed character. This makes data exchange between multiple computers running different operating systems that much easier, since programmers don't have to deal with the (sometimes annoying) end-of-line logic.
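The normalization described above can be observed by parsing the same text with each style of line ending. This is a small sketch using Python's standard xml.etree.ElementTree module; the helper name normalized_text and the sample documents are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def normalized_text(document):
    """Parse a one-element document and return its PCDATA as the parser reports it."""
    return ET.fromstring(document).text


# The same two lines of text, saved with three different line-ending styles.
variants = {
    "carriage return + line feed": b"<tag>line one\r\nline two</tag>",
    "carriage return only": b"<tag>line one\rline two</tag>",
    "line feed only": b"<tag>line one\nline two</tag>",
}

for name, document in variants.items():
    print(name, "->", repr(normalized_text(document)))
```

All three documents should produce the same PCDATA, with a single line feed between the two lines, so the application never needs to know which operating system wrote the file.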

White Space in Markup

As well as the white space in our data, there could also be white space in an XML document that's not actually part of the document's data. For example:

  <tag>
    <another-tag>This is some XML</another-tag>
  </tag>

While any white space contained within <another-tag>'s PCDATA is part of the data, there is also a new line after <tag>, and some spaces before <another-tag>. These spaces could be there just to make the document easier to read, while not actually being part of its data. This "readability" white space is called extraneous white space.

While an XML parser must pass all white space through to the application, it can also inform the application which white space is not actually part of an element's PCDATA, but is just extraneous white space.

So how does the parser decide whether this is extraneous white space or not? That depends on what kind of data we specify <tag> should contain. If <tag> can only contain other elements (and no PCDATA) then the white space will be considered extraneous. However, if <tag> is allowed to contain PCDATA, then the white space will be considered part of that PCDATA, so it will be retained.

Unfortunately, from this document alone an XML parser would have no way to tell whether <tag> is supposed to contain PCDATA or not, which means that it has to assume none of the white space is extraneous. We'll see how we can get the parser to recognize this as extraneous white space in Chapter 9 when we discuss content models.
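Without any information about what <tag> may contain, a parser simply hands the readability white space to the application along with everything else. The following sketch shows this with Python's standard xml.etree.ElementTree module, assuming a document in which <tag> wraps <another-tag> as described above; the text and tail attributes are how this particular library exposes character data before and after a child element.

```python
import xml.etree.ElementTree as ET

source = "<tag>\n  <another-tag>This is some XML</another-tag>\n</tag>"
tag = ET.fromstring(source)
another_tag = tag[0]

print(repr(tag.text))          # white space between <tag> and <another-tag>
print(repr(another_tag.text))  # the PCDATA of <another-tag>
print(repr(another_tag.tail))  # white space between </another-tag> and </tag>
```

The new line and the indentation show up as character data belonging to <tag>, because nothing in the document tells the parser that they are extraneous.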
