Beginning XML

Well-formed XML (2)

Every Start-tag Must Have an End-tag

One of the problems with parsing SGML documents is that not every element requires a start-tag and an end-tag. Take the following HTML for example:

  <HTML>
  <BODY>
  <P>Here is some text in an HTML paragraph.
  <BR>
  Here is some more text in the same paragraph.
  <P>And here is some text in another HTML paragraph.</p>
  </BODY>
  </HTML>

Notice that the first <P> tag has no closing </P> tag. This is allowed - and sometimes even encouraged - in HTML, because most web browsers can detect automatically where the end of the paragraph should be. In this case, when the browser comes across the second <P> tag, it knows to end the first paragraph. Then there's the <BR> tag (line break), which by definition has no closing tag.

Also, notice that the second <P> start-tag is matched by a </p> end-tag, in lower case. HTML browsers have to be smart enough to realize that both of these tags delimit the same element, but as we'll see soon, this would cause a problem for an XML parser.

The problem is that this leniency makes HTML parsers much harder to write. Code has to be included to take into account all of these factors, which often makes the parsers much larger, and much harder to debug. What's more, the way that files are parsed is not standardized - different browsers do it differently, leading to incompatibilities.

For now, just remember that in XML the end-tag is required, and has to exactly match the start-tag.
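
To see this rule in action, here is a minimal sketch that hands the HTML above to an XML parser. It assumes Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the two sample strings are made up for this illustration and are not part of the chapter's examples.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# The HTML from above: an unclosed <P>, an unclosed <BR>,
# and a <P> start-tag "closed" by a lower-case </p>.
html_style = (
    "<HTML><BODY>"
    "<P>Here is some text in an HTML paragraph."
    "<BR>"
    "Here is some more text in the same paragraph."
    "<P>And here is some text in another HTML paragraph.</p>"
    "</BODY></HTML>"
)

# The same content with every start-tag closed by an exactly matching end-tag.
xml_style = (
    "<HTML><BODY>"
    "<P>Here is some text in an HTML paragraph."
    "<BR></BR>"
    "Here is some more text in the same paragraph.</P>"
    "<P>And here is some text in another HTML paragraph.</P>"
    "</BODY></HTML>"
)

print(is_well_formed(html_style))     # False
print(is_well_formed(xml_style))      # True
print(is_well_formed("<P>text</p>"))  # False: the case differs
```

A browser would display the first string without complaint, while the XML parser rejects it outright.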

Tags Cannot Overlap

Because XML is strictly hierarchical, you have to be careful to close child elements before you close their parents. (This is called properly nesting your tags.) Let's look at another HTML example to demonstrate this:

  <P>Some <STRONG>formatted <EM>text</STRONG>, 
  but</EM> no grammar no good!</P>

This would produce the following output on a web browser:

Some formatted text, but no grammar no good!

As you can see, the <STRONG> tags cover the text "formatted text", while the <EM> tags cover the text "text, but".

But is <EM> a child of <STRONG>, or is <STRONG> a child of <EM>? Or are they both siblings, and children of <P>? According to our stricter XML rules, the answer is none of the above. The HTML code, as written, can't be arranged as a proper hierarchy, and therefore could not be well-formed XML.

If ever you're in doubt as to whether your XML tags are overlapping, try to rearrange them visually to be hierarchical. If the tree makes sense, then you're okay. Otherwise, you'll have to rework your markup.

For example, we could get the same effect as above by doing the following:

  <P>Some <STRONG>formatted <EM>text</EM></STRONG><EM>,
  but</EM> no grammar no good!</P>

This can be properly arranged in a tree, like this:

  <P>
    Some 
    <STRONG>
      formatted 
      <EM>
        text
      </EM>
    </STRONG>
    <EM>
      , but
    </EM> 
    no grammar no good!
  </P>
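
The same comparison can be made with a parser. The following is a rough sketch, again assuming Python's xml.etree.ElementTree; the variable names and the child_tags helper are invented for this illustration, and the line break inside the original markup has been replaced by a space.

```python
import xml.etree.ElementTree as ET

# Overlapping tags: </STRONG> arrives while <EM> is still open.
overlapping = (
    "<P>Some <STRONG>formatted <EM>text</STRONG>, "
    "but</EM> no grammar no good!</P>"
)

# Properly nested tags: each child is closed before its parent.
nested = (
    "<P>Some <STRONG>formatted <EM>text</EM></STRONG><EM>, "
    "but</EM> no grammar no good!</P>"
)

def child_tags(element):
    """List the tag names of an element's direct children."""
    return [child.tag for child in element]

try:
    ET.fromstring(overlapping)
    overlapping_ok = True
except ET.ParseError:
    overlapping_ok = False

root = ET.fromstring(nested)

print(overlapping_ok)            # False
print(child_tags(root))          # ['STRONG', 'EM']
print(child_tags(root[0]))       # ['EM']
print("".join(root.itertext()))  # Some formatted text, but no grammar no good!
```

The nested version yields a tree in which <P> has two children and <STRONG> has one, and the text content is unchanged.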

An XML Document Can Have Only One Root Element

In our <name> document, the <name> element is called the root element. This is the top-level element in the document, and all the other elements are its children or descendants. An XML document must have one and only one root element: in fact, it must have a root element even if it has no content.

For example, the following XML is not well-formed, because it has two root elements:

  <name>John</name>
  <name>Jane</name>

To make this well-formed, we'd need to add a top-level element, like this:

  <names>
    <name>John</name>
    <name>Jane</name>
  </names>

So while it may seem a bit of an inconvenience, it turns out that it's incredibly easy to follow this rule. If you have a document structure with multiple root-like elements, simply create a higher-level element to contain them.
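
As a small sketch of that fix, the snippet below wraps a fragment with two top-level elements in a single container. It assumes Python's xml.etree.ElementTree, and the wrap_in_root helper is a hypothetical name made up for this example.

```python
import xml.etree.ElementTree as ET

two_roots = "<name>John</name>\n<name>Jane</name>"

def wrap_in_root(fragment, root_name="names"):
    """Hypothetical helper: enclose a fragment in one top-level element."""
    return "<{0}>{1}</{0}>".format(root_name, fragment)

# The original fragment has two top-level elements, so parsing fails.
try:
    ET.fromstring(two_roots)
    two_roots_ok = True
except ET.ParseError:
    two_roots_ok = False

# Wrapped in <names>, the same content parses as one tree.
root = ET.fromstring(wrap_in_root(two_roots))
names = [child.text for child in root]

print(two_roots_ok)  # False
print(root.tag)      # names
print(names)         # ['John', 'Jane']

# A root element is still required when there is no content at all.
empty = ET.fromstring("<names/>")
print(len(empty))    # 0
```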

Element Names

If we're going to be creating elements we're going to have to give them names, and XML is very generous in the names we're allowed to use. For example, there aren't any reserved words to avoid in XML, as there are in most programming languages, so we have a lot of flexibility in this regard.

However, there are some rules that we must follow:

  • Names can start with letters (including non-Latin characters) or the "_" character, but not numbers or other punctuation characters.
  • After the first character, numbers are allowed, as are the characters "-" and ".".
  • Names can't contain spaces.
  • Names shouldn't contain the ":" character. Strictly speaking, this character is allowed, but the XML specification says that it's "reserved". You should avoid using it in your documents, unless you are working with namespaces (which are covered in Chapter 8).
  • Names can't start with the letters "xml", in uppercase, lowercase, or mixed - you can't start a name with "xml", "XML", "XmL", or any other combination, because the XML specification reserves such names for its own use.
  • There can't be a space after the opening "<" character; the name of the element must come immediately after it. However, there can be space before the closing ">" character, if desired.

Here are some examples of valid names:

  <first.name>
  <résumé>

And here are some examples of invalid names:

  <xml-tag>  

which starts with xml,

  <123>      

which starts with a number,

  <fun=xml>  

because the "=" sign is illegal, and:

  <my tag>   

which contains a space.

Remember these rules for element names - they also apply to naming other things in XML.
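
The naming rules above can be sketched as a small checking function. This is a simplified illustration in Python, not the XML specification's exact definition of a name: it relies on Python's own idea of "letter" and "digit", and the function name follows_chapter_rules is invented here. Note also that a real parser may still accept a name beginning with "xml", since such names are reserved rather than malformed.

```python
def follows_chapter_rules(name):
    """Check an element name against the rules listed above (simplified)."""
    if not name:
        return False

    first, rest = name[0], name[1:]

    # The first character must be a letter (in any script) or "_".
    if not (first.isalpha() or first == "_"):
        return False

    # Later characters may also be digits, "-" or ".".
    # Spaces, ":" and "=" fail this test, so they are rejected too.
    for ch in rest:
        if not (ch.isalpha() or ch.isdigit() or ch in "_-."):
            return False

    # Names starting with "xml" in any mix of case are off limits.
    if name.lower().startswith("xml"):
        return False

    return True

for candidate in ["first.name", "résumé", "xml-tag", "123", "fun=xml", "my tag"]:
    print(candidate, follows_chapter_rules(candidate))
# first.name True
# résumé True
# xml-tag False
# 123 False
# fun=xml False
# my tag False
```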
