Case-Sensitivity
Another important point to keep in mind is that the tags in XML are case-sensitive.
(This is a big difference from HTML, which is case-insensitive.) This means that
<first> is different from <FIRST>, which is different from <First>.
This sometimes seems odd to English-speaking users of XML, since English words
can easily be converted to upper- or lower-case with no loss of meaning. But
in almost every other language in the world, the concept of case is either not
applicable (for example, what's the uppercase of a Chinese character? Or the lowercase, for
that matter?), or not extremely important (what's the uppercase of é? The answer
may be different, depending on the context). To put intelligent rules into the
XML specification for case-folding would probably have doubled or trebled its
size, and still only benefited the English-speaking section of the population.
Luckily, it doesn't take long to get used to having case-sensitive names.
This is the reason that our previous <P></p> HTML example would not work
in XML; since the tags are case-sensitive, an XML parser would not be able to
match the </p> end-tag with any start-tags, and neither would it be able to
match the <P> start-tag with any end-tags.
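To see this behavior for ourselves, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The helper name is_well_formed and the sample names John and Jane are made up for illustration; they don't come from any XML specification.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Start-tag and end-tag agree in case, so the parser can match them.
matched = is_well_formed("<p>Some text</p>")

# <P> and </p> differ in case, so the parser cannot match them.
mismatched = is_well_formed("<P>Some text</p>")

# Names that differ only in case are treated as two different elements.
doc = ET.fromstring("<name><first>John</first><First>Jane</First></name>")
lower = doc.find("first").text
upper = doc.find("First").text

print(matched, mismatched, lower, upper)
```

The second document is rejected outright, while the third parses happily and hands back two unrelated elements, which is exactly the kind of confusion worth avoiding.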
Warning! Because XML is case-sensitive, you could legally create an XML document
which has both <first> and <First> elements, which have different meanings.
This is a bad idea, and will cause nothing but confusion! You should always try
to give your elements distinct names, for your sanity, and for the sanity of
those to come after you.
To help combat these kinds of problems, it's a good idea to pick a naming style
and stick to it. Some examples of common styles are:
- <first_name>
- <firstName>
- <first-name> (some people don't like this convention, because the "-" character is used for subtraction in so many programming languages, but it is legal)
- <FirstName>
Which style you choose isn't important; what is important is that you stick
to it. A naming convention only helps when it's used consistently. For this book,
I'll usually use the <FirstName> convention, because that's what I've grown
used to.
White Space in PCDATA
There is a special category of characters, called white space. This includes
things like the space character, new lines (what you get when you hit the Enter
key), and tabs. White space is used to separate words, as well as to make text
more readable.
Those familiar with HTML are probably quite aware of the practice of white space
stripping. In HTML, any white space considered insignificant is stripped out
of the document when it is processed. For example, take the following HTML:
<p>This is a paragraph.    It has a whole bunch
   of space.</p>
As far as HTML is concerned, anything more than a single space
between the words in a <p> is insignificant. So all of the spaces between
the first period and the word It would be stripped, except for one. Also, the
line feed after the word bunch and the spaces before of would be stripped down
to one space. As a result, the previous HTML would be rendered in a browser as:
This is a paragraph. It has a whole bunch of space.
In order to get the results as they appear in the HTML above, we'd have to add special HTML markup to the source, like the following:
<p>This is a paragraph.&nbsp;&nbsp;&nbsp; It has a whole bunch<br>&nbsp;&nbsp;&nbsp;of space.</p>
The &nbsp; entity specifies that we should insert a space (nbsp stands
for Non-Breaking Space), and the <br> tag specifies that there should
be a line feed. This would format the output as:
This is a paragraph.    It has a whole bunch
   of space.
Alternatively, if we wanted to have the text displayed exactly as it is in the source file, we could use the <pre> tag. This specifically tells the HTML parser not to strip the white space, so we could write the following and also get the desired results:
<pre>This is a paragraph.    It has a whole bunch
   of space.</pre>
However, in most web browsers, the <pre> tag also has the
added effect that the text is rendered in a fixed-width font, like the Courier
font we use for code in this book.
White space stripping is very advantageous for a language like HTML, which has
become primarily a means for displaying information. It allows the source for
an HTML document to be formatted in a readable way for the person writing the
HTML, while displaying it formatted in a readable, and possibly quite different,
way for the user.
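As a rough illustration of what white space stripping does to the text of a paragraph, here is a small Python sketch. It is only an approximation under a simplifying assumption (every run of spaces, tabs, and new lines becomes one space); real browsers apply more detailed rules, and the function name collapse_white_space is my own.

```python
import re


def collapse_white_space(text: str) -> str:
    """Replace each run of spaces, tabs, and new lines with a single space."""
    return re.sub(r"[ \t\r\n\f]+", " ", text)


# The paragraph text as typed in the source: extra spaces and a line feed.
source = "This is a paragraph.    It has a whole bunch\n   of space."

rendered = collapse_white_space(source)
print(rendered)
```

The readable layout of the source and the layout the user sees are now independent of each other, which is the advantage described above.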
In XML, however, no white space stripping takes place for PCDATA. This means
that for the following XML element:
<tag>This is a paragraph.    It has a whole bunch
   of space.</tag>
the PCDATA is:
This is a paragraph.    It has a whole bunch
   of space.
Just as in our <pre> example, none of the white space has been stripped
out. As far as white space stripping goes, all XML elements are treated just
like the HTML <pre> tag. This makes the rules much easier to understand for
XML than they are for HTML:
In XML, the white space stays.
Unfortunately, if you view the above XML in IE5 the white space will be stripped
out - or will seem to be. This is because IE5 is not actually showing you the
XML directly; it uses a technology called XSL to transform the XML to HTML, and
it displays the HTML. Then, because IE5 is an HTML browser, it strips out the
white space.
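Outside of a browser, we can check that the white space really does stay. The following sketch uses Python's standard xml.etree.ElementTree module; the variable names are mine, and the sample text is simply the paragraph from the example with some extra spaces and a line feed in it.

```python
import xml.etree.ElementTree as ET

# The character data, with several spaces after the period and a line feed.
source_text = "This is a paragraph.    It has a whole bunch\n   of space."

element = ET.fromstring(f"<tag>{source_text}</tag>")

# The parser hands back the PCDATA exactly as it was written.
pcdata = element.text
print(repr(pcdata))
```

Every space and the line feed come back untouched, unlike the HTML case.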
End-of-Line White Space
However, there is one form of white space stripping that XML performs on PCDATA,
which is the handling of new line characters. The problem is that there
are two characters that are used for new lines - the line feed character
and the carriage return - and computers running Windows, computers running
Unix, and Macintosh computers all use these characters differently.
For example, to get a new line in Windows, an application would use a carriage
return followed by a line feed, whereas on Unix only the line feed would be used,
and on older Macintosh systems only the carriage return. This could prove to be
very troublesome when creating XML documents, because Unix machines would treat
the new lines in a document differently than the Windows boxes, which would treat
them differently than the Macintosh boxes, and our XML interoperability would be lost.
For this reason, it was decided that XML parsers would change all new lines to
a single line feed character before processing. This means that any XML application
will know, no matter which operating system it's running under, that a new line
will be represented by a single line feed character. This makes data exchange
between multiple computers running different operating systems that much easier,
since programmers don't have to deal with the (sometimes annoying) end-of-line
logic.
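Here is a short sketch of that normalization at work, again using Python's standard xml.etree.ElementTree module (which relies on the Expat parser). The sample document is made up: it mixes a Windows-style line ending, a lone carriage return, and a Unix-style line feed in one piece of PCDATA.

```python
import xml.etree.ElementTree as ET

# One document containing all three end-of-line styles:
#   \r\n  carriage return + line feed (Windows)
#   \r    carriage return alone (older Macintosh)
#   \n    line feed alone (Unix)
raw = b"<tag>Windows line\r\nold Macintosh line\rUnix line\nend</tag>"

text = ET.fromstring(raw).text

# By the time the application sees the text, only line feeds remain.
print(repr(text))
```

Whatever the document looked like on disk, the application only ever has to look for a single line feed character.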
White Space in Markup
As well as the white space in our data, there could also be white space in an
XML document that's not actually part of the document's data. For example:
<tag>
  <another-tag>This is some XML</another-tag>
</tag>
While any white space contained within <another-tag>'s
PCDATA is part of the data, there is also a new line after <tag>, and some
spaces before <another-tag>. These spaces could be there just to make the
document easier to read, while not actually being part of its data. This "readability"
white space is called extraneous white space.
While an XML parser must pass all white space through to the application, it
can also inform the application which white space is not actually part of an
element's PCDATA, but is just extraneous white space.
So how does the parser decide whether this is extraneous white space or not?
That depends on what kind of data we specify <tag> should contain. If <tag>
can only contain other elements (and no PCDATA) then the white space will be
considered extraneous. However, if <tag> is allowed to contain PCDATA, then
the white space will be considered part of that PCDATA, so it will be retained.
Unfortunately, from this document alone an XML parser would have no way to tell
whether <tag> is supposed to contain PCDATA or not, which means that it has
to assume none of the white space is extraneous. We'll see how we can get the
parser to recognize this as extraneous white space in Chapter 9 when we discuss
content models.
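We can watch a parser pass this white space through. The sketch below uses Python's standard xml.etree.ElementTree module, which knows nothing about what <tag> is allowed to contain, so it keeps everything. In ElementTree, the text before the first child is stored in the parent's text attribute, and the text after a child's end-tag is stored in that child's tail attribute; the variable names are my own.

```python
import xml.etree.ElementTree as ET

document = "<tag>\n  <another-tag>This is some XML</another-tag>\n</tag>"

root = ET.fromstring(document)
child = root.find("another-tag")

# White space between <tag> and <another-tag>: a new line and two spaces.
before_child = root.text

# White space between </another-tag> and </tag>: a new line.
after_child = child.tail

print(repr(before_child), repr(child.text), repr(after_child))
```

With no information about what <tag> may contain, the "readability" white space is handed over just like any other character data, and it is up to the application to ignore it.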
Comments