Using XML Queries and Transformations

Top Level Settings

The top-level settings are elements that can only be used at the top level of an XSLT document, that is, as direct children of the stylesheet element. They specify how the processor should behave on a few points, such as the style of the output, whitespace handling and the indexing of nodes.

output

The output element is a bag of attributes that indicate settings about the style of output that is generated. The main setting is defined in the method attribute. The possible values are xml, html and text.

xml

If the method is set to xml, the output document will be an XML document. What this means depends largely on the other attributes of the output element:

  • The version attribute specifies which version of XML should be used. The default version is 1.0. If the processor does not support the requested version, it should fall back to a version it does support. This number will also appear in the XML declaration if one is generated.
  • The encoding attribute sets the preferred encoding for the destination document. If it is not specified, XSLT processors will use UTF-8 or UTF-16. If an XML declaration is generated, this will contain the encoding string specified.
  • The indent attribute can be set to yes to allow the processor to include additional whitespace in the destination document. This can improve readability. The default setting is no.
  • The attribute cdata-section-elements tells the processor when to use CDATA sections in the destination and when to escape illegal characters by using entity references. The value can hold a whitespace-separated list of element names. Text nodes that have a parent node in this list will be output as CDATA sections. All others will be escaped (characters like < will be replaced by entities like &lt;).
  • omit-xml-declaration can be set to yes to leave out the XML declaration. By default, XSLT will include one, reflecting the settings of encoding and version. Also, if the standalone attribute has any value, this value will show up in the XML declaration.
  • With the doctype-system and doctype-public attributes, the validation rules for the destination document can be set. If you use only doctype-system, the processor will include a <!DOCTYPE fragment just before the first element. The doctype will be the name of the root element. The system identifier (URL of the DTD) is the value of the doctype-system attribute. If you also specify a doctype-public attribute, the output will contain a doctype declaration referring to a public DOCTYPE, with the value of doctype-system as its URL. If only doctype-public is used, it will be ignored.
  • Finally, the media-type attribute can be used to specify a MIME-type for the result. By default this is text/xml, but some XML-based document types may have their own MIME types installed.
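
The attributes above can be combined on a single output element. The following is a minimal sketch rather than a recommended configuration: the element names report, snippet and note and the DTD location report.dtd are hypothetical and chosen only for illustration.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- All values are illustrative; report.dtd and the element names are hypothetical -->
  <xsl:output method="xml"
              version="1.0"
              encoding="ISO-8859-1"
              indent="yes"
              standalone="no"
              doctype-system="report.dtd"
              cdata-section-elements="snippet"
              media-type="text/xml"/>

  <xsl:template match="/">
    <report>
      <!-- snippet is listed in cdata-section-elements, so its text is written as a CDATA section -->
      <snippet>if (a &lt; b) doSomething()</snippet>
      <!-- note is not listed, so the same character is escaped as an entity reference -->
      <note>a &lt; b</note>
    </report>
  </xsl:template>
</xsl:stylesheet>
```

With these settings a conforming processor should emit an XML declaration naming ISO-8859-1, followed by a DOCTYPE for report that points at report.dtd. The exact layout of the result may differ between processors.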

html

If the method attribute on the output element is set to html, the results of some of the other attributes change a bit compared to the xml method.

  • The version attribute now refers to the version of HTML, with a default value of 4.0. The processor will try to make the output conform to the HTML specification.
  • Empty elements in the destination document will be output without a closing tag. Think of HTML elements like BR, HR, IMG, INPUT, LINK, META and PARAM.
  • Textual content of the script and style elements will not be escaped. So if the XSLT document contains this literal fragment:
<script>if (a &gt; b) doSomething()</script>

This will be output as:

<script>if (a > b) doSomething()</script>
  • If any non-ASCII characters are used, the processor should try to use HTML escaping in the output (&euml; instead of &#0235;).
  • If an encoding is specified, the processor will try to add a META element to the HEAD of the document. This will also contain the value for media-type (default is text/html).
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">
...
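
As a sketch of how these html-specific rules come together, the stylesheet below asks for HTML 4.0 output in EUC-JP. The literal content of the template is invented for illustration, and whether a given processor supports the EUC-JP encoding is an assumption.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- EUC-JP is taken from the META example; support for it depends on the processor -->
  <xsl:output method="html" version="4.0" encoding="EUC-JP" indent="yes"/>

  <xsl:template match="/">
    <HTML>
      <!-- A processor is expected to add the Content-Type META element inside HEAD -->
      <HEAD><TITLE>Example</TITLE></HEAD>
      <BODY>
        <!-- Expected to be written as <BR> with no closing tag -->
        <BR/>
        <!-- Script content is expected to be written without escaping -->
        <SCRIPT>if (a &gt; b) doSomething()</SCRIPT>
      </BODY>
    </HTML>
  </xsl:template>
</xsl:stylesheet>
```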

text

If the method attribute is set to text, the output will be restricted to only the string value of every node. The media-type defaults to text/plain, but you can use other MIME types. Think of generating RTF documents from an XML source document. These have no XML markup, so the most appropriate method is text, with media-type set to application/rtf. The encoding attribute can still be used, but the default value is system dependent (on most Windows PCs it will be ISO-8859-1).

Let's have a look at an example. The following stylesheet is used:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" indent="yes"/>
<xsl:template match="/">
  <HTML><BODY>
    <TEST>
      This is literal text with an ëxtended character
      <BR/>
      <TABLE>
        <TR><TD>Cell data</TD>
        <TD>Second cell</TD></TR>
      </TABLE>
    </TEST>
  </BODY></HTML>
</xsl:template>
</xsl:stylesheet>

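We use this stylesheet on an arbitrary, well-formed XML document. Because the only template matches the root and contains nothing but literal content, the result tree is always the same, whatever the source looks like. We will change only the method attribute of the output element and look at the result each time. First the result for the xml method: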

<?xml version="1.0" encoding="utf-8"?>
<HTML>
<BODY>
<TEST>This is literal text with an ëxtended character
    <BR/>
<TABLE>
<TR>
<TD>Cell data</TD>
<TD>Second cell</TD>
</TR>
</TABLE>
</TEST>
</BODY>
</HTML>

Note that every element starts on a new line. This is the result of the indent="yes" attribute. If this had not been specified, all content would be concatenated on one line. This XSLT processor has defaulted its output to encoding UTF-8. UTF-8 supports the extended character ë, so this is not escaped.

Setting the method to html would generate:

<HTML>
<BODY>
<TEST>This is literal text with an &euml;xtended character
    <BR>
<TABLE>
<TR>
<TD>Cell data</TD><TD>Second cell</TD>
</TR>
</TABLE>
</TEST>
</BODY>
</HTML>

Note that the XML declaration has disappeared and that the processor has chosen a slightly different formatting around the TD elements. The processor has been instructed to indent the resulting document, but in html mode it may add whitespace only in places where this cannot influence the appearance of the document in a browser. Also, the ë character is escaped using the preferred HTML entity &euml; rather than a numeric character reference.

Using the text method, the result would be:

This is literal text with an ëxtended character
    Cell dataSecond cell

Only the string values of the nodes have been printed. No encoding was specified, so the processor falls back on its default encoding for the text method, which in this case can represent the special character. Note that no whitespace appears between the values of the two TD elements. We will see more on whitespace in the next sections.

strip-space and preserve-space

What exactly happens to the whitespace in a document and in the XSLT document itself? This is one of the subjects that often puzzles XML developers. Spaces, tabs and linefeeds seem to emerge and disappear at random. And then there are the XSLT elements to influence them: strip-space, preserve-space and the indent attribute on the output element. Let's take a closer look.

During a transformation, there are basically two moments when whitespace can appear or vanish:

  • When parsing the source and stylesheet documents and constructing a tree
  • When serializing the generated result tree to the destination document

Before any processing occurs, the XSLT processor loads the source and stylesheet into memory and starts to strip unnecessary whitespace. It removes every text node that meets all of the following conditions:

  • Consist entirely of whitespace characters
  • Have no ancestor node with the xml:space attribute set to preserve
  • Are not children of a whitespace-preserving element

For the stylesheet, the only whitespace-preserving parent element is xsl:text. For the source document, the set of whitespace-preserving elements can be controlled using the strip-space and preserve-space elements from the stylesheet. By default, all elements in the source document preserve whitespace. With the elements attribute of strip-space, you can specify which elements should not preserve whitespace. Adding elements back to the set of elements that have their whitespace preserved is done with preserve-space. The elements attribute accepts a whitespace-separated list of name tests (an element name, a prefix followed by :*, or * for all elements), not arbitrary XPath expressions. If an element in the source matches more than one name test, the conflict is resolved following the rules for conflicts between matching templates.

So if a stylesheet contained these whitespace elements:

<xsl:strip-space elements="*"/>
<xsl:preserve-space elements="PRE CODE"/>

the processor would strip all whitespace-only text nodes in the source document, except for those that are children of a PRE element or a CODE element.

After stripping space from the source and stylesheet documents, the processing occurs. The generated tree of nodes is then persisted to a string or file. By default, no new whitespace is added to the result document, except if the output element has its indent attribute set to yes.
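
The two stripping rules can be seen side by side in a small sketch. It assumes a source document that contains TD elements like the ones in the earlier example; the choice to separate the cell values with a single space is only for illustration.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Source side: drop whitespace-only text nodes from every element -->
  <xsl:strip-space elements="*"/>

  <!-- TD is an assumed element name in the source document -->
  <xsl:template match="TD">
    <xsl:value-of select="."/>
    <!-- Stylesheet side: a whitespace-only text node survives only inside xsl:text -->
    <xsl:text> </xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Without the xsl:text element, the space after value-of would be a whitespace-only text node in the stylesheet and would be stripped, so the cell values would run together.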

attribute-set

On the document level, it is possible to define certain groups of attributes that you need to include in many elements together. By grouping them, the XSLT document can be smaller and easier to maintain:

<xsl:template match="chapter/heading">
  <font xsl:use-attribute-sets="title-style">
    <xsl:apply-templates/>
  </font>
</xsl:template>
<xsl:attribute-set name="title-style">
  <xsl:attribute name="size">3</xsl:attribute>
  <xsl:attribute name="face">Arial</xsl:attribute>
</xsl:attribute-set>

Here the attribute-set element defines a group of two attributes that are often used together. In the template for chapter headings, the attribute set is applied to a literal element through the xsl:use-attribute-sets attribute. The same attribute, written without the prefix as use-attribute-sets, can also be used on element, copy and attribute-set elements. Be careful that an attribute set never uses itself (directly or indirectly), as this would generate an error.
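
A sketch of one attribute set building on another, applied through xsl:element instead of a literal element. The name base-style is hypothetical; the other names are taken from the example above.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- base-style is a hypothetical set that holds the shared attribute -->
  <xsl:attribute-set name="base-style">
    <xsl:attribute name="face">Arial</xsl:attribute>
  </xsl:attribute-set>

  <!-- title-style includes base-style and adds an attribute of its own -->
  <xsl:attribute-set name="title-style" use-attribute-sets="base-style">
    <xsl:attribute name="size">3</xsl:attribute>
  </xsl:attribute-set>

  <xsl:template match="chapter/heading">
    <!-- On xsl:element the attribute carries no xsl: prefix -->
    <xsl:element name="font" use-attribute-sets="title-style">
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
```

The generated font element should carry both the face and the size attribute. If base-style in turn used title-style, the sets would refer to each other in a circle, which is the error case mentioned above.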

namespace-alias

The namespace-alias element is used in very special cases, especially when transforming a source document to an XSLT document. In this case, you want the destination document to hold the XSLT namespace and lots of literal XSLT elements, but you don't want these to interfere with the transformation process. See the problem? You are shooting yourself in the foot there.

Using namespace-alias, you can use another namespace in the stylesheet, but have the declaration for that namespace show up in the destination document with another URI:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:axsl="http://www.w3.org/1999/XSL/TransformAlias">
<xsl:namespace-alias stylesheet-prefix="axsl" result-prefix="xsl"/>
<xsl:template match="/">
  <axsl:stylesheet>
    <xsl:apply-templates/>
  </axsl:stylesheet>
</xsl:template>
...
</xsl:stylesheet>

The literal XSLT output elements are not declared in their real namespace; within this stylesheet they live in a placeholder namespace. In the destination document, the processor replaces that placeholder with the real XSLT namespace URI. In this output the prefix is unchanged, but it now refers to another URI:

<?xml version="1.0" encoding="utf-8"?>
<axsl:stylesheet xmlns:axsl="http://www.w3.org/1999/XSL/Transform">
...
</axsl:stylesheet>
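
A slightly fuller sketch shows why the alias is useful: the stylesheet below generates one template per field element in the source. The field element and its name attribute are hypothetical, and the prefix that appears in the generated document may vary between processors.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:axsl="http://www.w3.org/1999/XSL/TransformAlias">
  <xsl:namespace-alias stylesheet-prefix="axsl" result-prefix="xsl"/>

  <xsl:template match="/">
    <!-- axsl elements are literal output; xsl elements are executed now -->
    <axsl:stylesheet version="1.0">
      <!-- field and its name attribute are hypothetical source markup -->
      <xsl:for-each select="//field">
        <axsl:template match="{@name}">
          <axsl:value-of select="."/>
        </axsl:template>
      </xsl:for-each>
    </axsl:stylesheet>
  </xsl:template>
</xsl:stylesheet>
```

The for-each instruction runs during this transformation, while the axsl:template and axsl:value-of elements are copied to the result, where they end up in the real XSLT namespace.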

key

The key element is a very special one. It will take a little time to discover its full potential. It is more or less analogous to creating an index on a table in a relational database. It allows you to access a set of nodes in a document directly with the key() function, using an identifier of that node that you specify. Let's describe an example. We could, using the key element, define that the key person-by-name gives us access to PERSON elements by passing the value of their name attribute. If the key is set up correctly, we would use key('person-by-name', 'Teun') to get a result set of PERSON elements that have their name attribute set to 'Teun'.

To set this key, you would have used the element like this:

<xsl:key name="person-by-name" match="PERSON" use="@name"/>

Try to see what each of the attributes name, match and use specifies. The name attribute is simple: it just serves to refer to a specific key, of which there may be many. The match attribute holds a pattern that nodes must match to be indexed by this key; it uses the same pattern syntax as the match attribute of a template. It is not a problem if the same node is indexed by multiple keys. For each matching node, the XPath expression in the use attribute is evaluated with that node as the context. The string value of the result is the value under which the node is indexed. Multiple nodes can produce the same value. When the key function is called with this value, it returns a node set holding all nodes that produced it. The result of the use expression can also be a node set. In this case, each of its nodes is converted to a string, and each of these strings can be used to retrieve the indexed node.

Don't worry if you can't see the point of this yet. We will work through an example. Suppose we have this XML document:

<?xml version="1.0"?>
<FAMILY>
  <TRADITIONAL_NAMES>
    <NAME>Peter</NAME>
    <NAME>Mary</NAME>
  </TRADITIONAL_NAMES>
  <PERSON name="Peter">
    <CHILDREN>
      <PERSON name="Peter"/>
      <PERSON name="Archie"/>
    </CHILDREN>
  </PERSON>
</FAMILY>

We are transforming the XML source with an XSLT document that starts like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet  version="1.0"  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
<xsl:key name="all-names" match="PERSON" use="@name"/>
<xsl:key name="parents-names" match="PERSON[CHILDREN/PERSON]" use="@name"/>
...

If we now use the key() function, our results will be:

Expression used in key()                             Result

key('all-names', 'Peter')                            Both PERSON elements with name="Peter"
key('parents-names', 'Peter')                        Only the Peter that has children
key('all-names', /FAMILY/TRADITIONAL_NAMES/NAME)     Both Peters, because Peter is one of the traditional names in the family
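
To check these results yourself, the keys can be placed in a complete stylesheet that simply counts the nodes each call returns. This is a sketch under the assumption that it is applied to the FAMILY document shown above; printing one count per line is a choice made only for illustration.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:key name="all-names" match="PERSON" use="@name"/>
  <xsl:key name="parents-names" match="PERSON[CHILDREN/PERSON]" use="@name"/>

  <xsl:template match="/">
    <!-- Every PERSON named Peter -->
    <xsl:value-of select="count(key('all-names', 'Peter'))"/>
    <xsl:text>&#10;</xsl:text>
    <!-- Only PERSON elements named Peter that have child PERSON elements -->
    <xsl:value-of select="count(key('parents-names', 'Peter'))"/>
    <xsl:text>&#10;</xsl:text>
    <!-- A node set as second argument: each NAME is looked up in turn -->
    <xsl:value-of select="count(key('all-names', /FAMILY/TRADITIONAL_NAMES/NAME))"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Against the FAMILY document the three counts should be 2, 1 and 2. The last call looks up both Peter and Mary, but no PERSON is named Mary, so only the two Peters are returned.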

Now what are the cases where using a key is a good idea? Think of situations where XML elements often refer to each other using some sort of ID, but without using the validation rules for IDs (because these are sometimes too rigid). The key construct can:

  • Keep your code more readable.
  • Improve performance, depending on the implementation. The XSLT processor can keep a hash-table structure in memory of all key references in the source document. If these references are often used, performance gains can be substantial.
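
A sketch of the cross-reference case described above. The source vocabulary is entirely hypothetical: item elements carry an id attribute and a title child, and ref elements point at an item through a target attribute.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- item, ref, @id, @target and title are assumed names, not part of any standard -->
  <xsl:key name="item-by-id" match="item" use="@id"/>

  <xsl:template match="ref">
    <!-- Look up the referenced item through the key and print its title -->
    <xsl:value-of select="key('item-by-id', @target)/title"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Suppress all other text so that only resolved references are printed -->
  <xsl:template match="text()"/>
</xsl:stylesheet>
```

The same lookup could be written as //item[@id = current()/@target], but that expression has to search the whole document for every reference, whereas a processor is free to answer the key() call from an index.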
