Inside Open XML

Markup Languages and Relationships

This article was originally published on DNJ Online

By convention, parts are arranged within specific folders inside the package. Most Word 2007 packages store the file document.xml inside a /word folder, for example, and embedded images inside the sub-folder /word/media. However, as our ‘Hello world’ document demonstrates, this needn’t be the case as here document.xml is to be found in the root of the package. Indeed the only folder that must be present is the /_rels folder that holds the top-level relationship part ‘.rels’.

So applications cannot rely on the internal file structure of the package to discover the location of the various parts. Instead they should use the relationship parts for this purpose, starting with /_rels/.rels. In our ‘Hello world’ example this defines just one relationship, identified as ‘rId1’, which tells us that document.xml is in the root directory and conforms to the ‘officeDocument’ schema.
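As a rough illustration of that discovery step, the Python sketch below builds a tiny in-memory package resembling the 'Hello world' example and locates the main document part by reading /_rels/.rels rather than guessing at folder names. The helper names (read_relationships, find_main_part, build_hello_world) are made up for this example, and the package is deliberately incomplete: a real package also needs a [Content_Types].xml part.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
OFFICE_DOC = ("http://schemas.openxmlformats.org/officeDocument/2006/"
              "relationships/officeDocument")


def read_relationships(package, rels_name):
    """Return the relationships stored in one relationship part."""
    root = ET.fromstring(package.read(rels_name))
    return [
        {
            "id": rel.get("Id"),
            "type": rel.get("Type"),
            "target": rel.get("Target"),
            # TargetMode defaults to Internal when the attribute is absent.
            "mode": rel.get("TargetMode", "Internal"),
        }
        for rel in root.findall(f"{{{REL_NS}}}Relationship")
    ]


def find_main_part(package):
    """Follow the officeDocument relationship from /_rels/.rels."""
    for rel in read_relationships(package, "_rels/.rels"):
        if rel["type"] == OFFICE_DOC:
            # Top-level targets are resolved against the package root.
            path = posixpath.normpath(posixpath.join("/", rel["target"]))
            return path.lstrip("/")  # zip entry names have no leading slash
    raise KeyError("no officeDocument relationship found")


def build_hello_world():
    """Hypothetical stand-in for the article's 'Hello world' package."""
    rels = (
        f'<Relationships xmlns="{REL_NS}">'
        f'<Relationship Id="rId1" Type="{OFFICE_DOC}" Target="document.xml"/>'
        "</Relationships>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("_rels/.rels", rels)
        zf.writestr("document.xml", "<document/>")
    buffer.seek(0)
    return buffer


with zipfile.ZipFile(build_hello_world()) as package:
    main_part = find_main_part(package)
```

The same function returns word/document.xml for a package that follows the usual Word 2007 layout, because the location comes from the relationship and not from the folder structure.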

For ‘Hello world’ that is the end of the story: we have found the one and only part that defines the content of the package, and we know that the content is stored in WordprocessingML. In more complex packages, such as that shown in Figure 1, we can see that the /word folder also contains a /_rels folder which contains another relationship part, namely document.xml.rels. This contains further relationships that target the remaining parts of the package. The relationship identified as ‘rId6’, for example, targets the embedded image media/image1.jpeg. Relative targets like this are resolved against the folder of the source part, so the image is found at /word/media/image1.jpeg.
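The naming pattern described here (a _rels folder next to the source part, holding a file named after that part with .rels appended) can be sketched in a few lines of Python. The function names are invented for this illustration, the package root is written as "/", and the customXml target in the usage lines is an assumed example rather than something taken from Figure 1.

```python
import posixpath


def rels_part_for(part_name):
    """Name of the relationship part that belongs to a source part.

    Pass "/" to get the package-level relationship part.
    """
    folder, filename = posixpath.split(part_name)
    return posixpath.join(folder, "_rels", filename + ".rels")


def resolve_target(source_part, target):
    """Resolve an internal Target relative to the source part's folder."""
    base = posixpath.dirname(source_part)
    # An absolute target (leading slash) overrides the base folder.
    return posixpath.normpath(posixpath.join(base, target))


package_rels = rels_part_for("/")
document_rels = rels_part_for("/word/document.xml")
image_part = resolve_target("/word/document.xml", "media/image1.jpeg")
sibling_part = resolve_target("/word/document.xml", "../customXml/item1.xml")
```

A consumer can walk an entire package this way: read a relationship part, resolve each internal target, then look for the relationship part belonging to each target in turn.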

This may seem a complicated way of doing things, but it does offer a number of advantages. For a start, it means that consumer applications can discover parts and the relationships between parts without having to interpret application-level schemas. It also means that relationships can be established and modified without having to touch the actual content of the document.

For example, the relationship schema also supports a TargetMode attribute, which by default is set to ‘Internal’ to indicate that the target is internal to the package. However it can also be set to ‘External’ where the target is not part of the package. The following, for instance, defines relationship rId9 which targets a particular page on the DNJ Online Web site:

<Relationship Id="rId9"
   Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
   Target="http://dnjonline.com/article.aspx?ID=jul06_atlas"
   TargetMode="External" />

The content of the document, stored within document.xml, might include the following line:

<w:hyperlink r:id="rId9" w:history="1">

Without going into the complexities of WordprocessingML, this specifies a hyperlink whose target is defined by the relationship rId9. In the document it would appear as a link to the specified article. However the target for this link could be changed without making any changes to document.xml, simply by changing the relevant Target attribute within the relationship part.
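To make this concrete, here is a minimal Python sketch that rewrites the Target of rId9 in a relationship part held as a string. The retarget function and the replacement URL (example.com) are hypothetical, and the sketch does not read or write an actual package.

```python
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
# Keep the relationships namespace as the default one when serializing.
ET.register_namespace("", REL_NS)

rels_xml = """<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId9"
    Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink"
    Target="http://dnjonline.com/article.aspx?ID=jul06_atlas"
    TargetMode="External"/>
</Relationships>"""


def retarget(rels_text, rel_id, new_target):
    """Return the relationship part with one Target attribute replaced."""
    root = ET.fromstring(rels_text)
    for rel in root.findall(f"{{{REL_NS}}}Relationship"):
        if rel.get("Id") == rel_id:
            rel.set("Target", new_target)
            return ET.tostring(root, encoding="unicode")
    raise KeyError(rel_id)


# Hypothetical new destination for the link.
updated = retarget(rels_xml, "rId9", "http://example.com/new-article")
```

Because document.xml refers to the link only by its Id, the hyperlink element shown above remains valid after the change.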

To take another example, the following SpreadsheetML defines a workbook containing a single worksheet:

<workbook>
   <sheets>
      <sheet name="Sheet1" sheetId="1" r:id="rId1"/>
   </sheets>
</workbook>

Which worksheet it contains is determined by the relationship rId1. Again, this can be changed by editing the appropriate relationship, without having to go into the SpreadsheetML itself.
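The lookup from sheet name to worksheet part might be sketched as follows in Python. The workbook fragment is the one above with its namespace declarations added; the relationship part and its Target (worksheets/sheet1.xml) are assumptions for the example, as is the sheet_targets helper.

```python
import xml.etree.ElementTree as ET

MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

workbook_xml = f"""<workbook xmlns="{MAIN}" xmlns:r="{R}">
  <sheets>
    <sheet name="Sheet1" sheetId="1" r:id="rId1"/>
  </sheets>
</workbook>"""

# Assumed content of the workbook's relationship part.
rels_xml = f"""<Relationships xmlns="{REL_NS}">
  <Relationship Id="rId1" Type="{R}/worksheet" Target="worksheets/sheet1.xml"/>
</Relationships>"""


def sheet_targets(workbook_text, rels_text):
    """Map each sheet name to the Target of the relationship it references."""
    targets = {rel.get("Id"): rel.get("Target")
               for rel in ET.fromstring(rels_text)}
    sheets = ET.fromstring(workbook_text).iter(f"{{{MAIN}}}sheet")
    # The r:id attribute lives in the officeDocument relationships namespace.
    return {s.get("name"): targets[s.get(f"{{{R}}}id")] for s in sheets}


mapping = sheet_targets(workbook_xml, rels_xml)
```

Pointing Sheet1 at a different worksheet part only requires a different Target in rels_xml; workbook_xml stays as it is.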

Furthermore, as we shall see in our next article, the Packaging API that comes with .NET includes methods that make the process of creating, walking through and editing relationship parts fairly straightforward.

Markup languages

The Open XML specification defines three primary markup languages, one for each of the document editors in Office 2007: WordprocessingML, SpreadsheetML and PresentationML. XML formats for Word and Excel documents were already available in Office 2003, whereas PresentationML is new with Office 2007. Each is based on XML, as you can see from the small snippet in the document.xml part of our ‘Hello world’ document.

The structure of these languages is fairly straightforward, although their definition does take up the bulk of the 6,000 pages of the full Open XML specification. This is because they are intended to support the full feature set of all versions of Word, Excel and PowerPoint going back to Office 2000, enabling documents created with any of these applications to be faithfully captured.

In the case of WordprocessingML, the root is the ‘document’ element, which contains a ‘body’ element. This in turn can contain a number of ‘p’ (paragraph) elements, each of which can contain multiple ‘r’ (run) elements. A run is a group of characters that have identical properties and so require no additional markup. Runs in turn contain ‘t’ (text) elements, and there are also ‘rPr’ (run property) and ‘pPr’ (paragraph property) elements for defining the format of runs and paragraphs, and so forth. Footnotes, headers and footers are stored in separate parts within the package.
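The document/body/p/r/t hierarchy can be seen in a short Python sketch that pulls the plain text out of each paragraph. The sample markup and the paragraphs helper are made up for illustration, and only runs that sit directly inside a paragraph are read, so text nested inside other elements would be missed.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written sample: two paragraphs, the first split into two runs
# because the second run carries a bold run property.
document_xml = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:r><w:t xml:space="preserve">Hello </w:t></w:r>
      <w:r><w:rPr><w:b/></w:rPr><w:t>world</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t>Second paragraph</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def paragraphs(xml_text):
    """Return the text of each paragraph, joining the text of its runs."""
    body = ET.fromstring(xml_text).find("w:body", NS)
    return [
        "".join(t.text or "" for t in p.iterfind("w:r/w:t", NS))
        for p in body.iterfind("w:p", NS)
    ]


texts = paragraphs(document_xml)
```

The formatting held in rPr is simply ignored here, which shows how cleanly the text content is separated from its properties.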

SpreadsheetML starts with the root ‘workbook’ element, which contains ‘sheet’ elements pointing to the various sheets in the document. Each sheet is stored in a separate part whose root ‘worksheet’ element leads down through the ‘sheetData’ element, the ‘row’ element and the ‘c’ (cell) element to the ‘v’ (value) and ‘f’ (formula) elements. One point to note is that the sheetData element only contains information about cells that are not empty. Other aspects of the specification deal with charting and formatting.
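A small Python sketch of walking that worksheet/sheetData/row/c hierarchy follows. The sample worksheet is invented, and read_cells is a hypothetical helper that returns the raw strings stored in the v and f elements without interpreting cell types.

```python
import xml.etree.ElementTree as ET

MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS = {"m": MAIN}

# Hand-written sample: row 2 and cells such as B1 are empty, so they
# simply do not appear in sheetData.
sheet_xml = f"""<worksheet xmlns="{MAIN}">
  <sheetData>
    <row r="1">
      <c r="A1"><v>10</v></c>
      <c r="C1"><v>32</v></c>
    </row>
    <row r="3">
      <c r="B3"><f>A1+C1</f><v>42</v></c>
    </row>
  </sheetData>
</worksheet>"""


def read_cells(xml_text):
    """Map each cell reference to its stored value and formula (or None)."""
    cells = {}
    for c in ET.fromstring(xml_text).iterfind("m:sheetData/m:row/m:c", NS):
        cells[c.get("r")] = {
            "value": c.findtext("m:v", namespaces=NS),
            "formula": c.findtext("m:f", namespaces=NS),
        }
    return cells


cells = read_cells(sheet_xml)
```

Because empty cells are absent, a consumer has to rely on the r attribute of each cell for its position rather than counting elements.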

PresentationML defines the root element ‘presentation’ which contains pointers to ‘sldMaster’, ‘notesMaster’, ‘handoutMaster’ and ‘sld’ (slide) parts. These in turn support elements describing the various graphics, text, tables and charts used on the slides themselves. In addition there is DrawingML, a comprehensive markup language for defining vector graphics. There is also support for VML (Vector Markup Language) although this is included solely for backward compatibility. DrawingML is to be preferred wherever possible.

One seemingly trivial but important aspect of these markup languages is the brevity of the element names – often just a single character. This is deliberate as it helps to cut down on the size of Open XML documents. Other space-saving features include the sharedStrings part used by SpreadsheetML to ensure that repeated strings of text are only stored once.
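The shared-strings idea can be illustrated with a short Python sketch. It assumes the usual arrangement in which a cell marked t="s" stores an index into the shared strings part instead of the text itself; the sample data and helper names are invented for the example.

```python
import xml.etree.ElementTree as ET

MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS = {"m": MAIN}

# The string "Red" is stored once and referenced from two cells.
shared_xml = f"""<sst xmlns="{MAIN}">
  <si><t>Red</t></si>
  <si><t>Green</t></si>
</sst>"""

sheet_xml = f"""<worksheet xmlns="{MAIN}">
  <sheetData>
    <row r="1">
      <c r="A1" t="s"><v>0</v></c>
      <c r="B1"><v>7</v></c>
    </row>
    <row r="2">
      <c r="A2" t="s"><v>0</v></c>
    </row>
  </sheetData>
</worksheet>"""


def load_shared_strings(xml_text):
    """Return the shared strings in index order."""
    return [
        "".join(t.text or "" for t in si.iter(f"{{{MAIN}}}t"))
        for si in ET.fromstring(xml_text).iterfind("m:si", NS)
    ]


def cell_text(cell, strings):
    """Resolve a cell to text, following the shared-string index if needed."""
    raw = cell.findtext("m:v", namespaces=NS)
    if cell.get("t") == "s":
        return strings[int(raw)]
    return raw


strings = load_shared_strings(shared_xml)
values = {
    c.get("r"): cell_text(c, strings)
    for c in ET.fromstring(sheet_xml).iterfind("m:sheetData/m:row/m:c", NS)
}
```

In a sheet with thousands of repeated labels, each cell carries only a short index, which is where the space saving comes from.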

XML Digital Signatures are supported through the Digital Signature Origin part which is referenced through /_rels/.rels and provides a starting point for discovering the signatures contained within a package. Each signature is held in a separate part which has the root element ‘Signature’.
