Inside Open XML

Introduction

This article was originally published on DNJ Online

The 2007 Microsoft Office system brings many changes, but the most significant, as far as organisations of any size will be concerned, is one that few end-users will ever see. This is the new Office Open XML file format used by Word, Excel and PowerPoint 2007 to store documents. For Office 2007, Microsoft has adopted a native file format that, while able to support all the features of Office documents back to Office 2000, is also an open standard capable of being read, manipulated and extended by third parties.

The Office Open XML standard was approved by Ecma International (previously the European Computer Manufacturers Association) in December 2006 and represents the collaborative effort of some 20 members including vendors such as Apple, Intel and Novell, and organisations such as BP, Barclays Capital, The British Library and the US Library of Congress. It has now been submitted to the ISO/IEC Joint Technical Committee for ratification. The full specification runs to over 6,000 pages but a useful overview can also be found at www.ecma-international.org/publications/standards/Ecma-376.htm.

Open Packaging Conventions

Central to Open XML are the Open Packaging Conventions (OPC). These are defined in Part 2 of the specification but are also used by Microsoft’s XML Paper Specification (XPS), originally codenamed ‘Metro’. OPC describes a method for storing a set of files within a single compressed file known as a ‘package’.
This is done using ZIP technology, as originally created by Phil Katz of PKWARE. Indeed, if you take a .docx file saved from Word 2007 and change the file extension to .zip, you can see what is inside using either the WinZip utility or the ZIP support that is now built into Windows itself. Shown here are the contents of a fairly simple document containing a couple of paragraphs and one embedded image.

[Figure: Open XML Zip file]

As you can see, the package contains a number of parts grouped into various folders. Most of these contain XML data structured in accordance with published schemas that describe various aspects of the document, although a part can also be a binary data stream such as the embedded picture image1.jpeg shown in our example. One useful side-benefit of the format is that original-resolution versions of embedded images are stored as distinct parts with their original file extension within the package (usually within a ‘media’ sub-folder), from where they can easily be extracted.
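
Because a package is an ordinary ZIP archive, the same inspection can be done in code rather than by renaming the file. The following is a minimal sketch using Python's standard zipfile module. The helper names list_media_parts and build_demo_package are hypothetical, and the demo package is a stand-in built in memory with placeholder contents, not a real Word document.

```python
import io
import zipfile


def list_media_parts(package: zipfile.ZipFile) -> list:
    """Return the names of parts stored inside any 'media' folder."""
    return [
        name
        for name in package.namelist()
        if "media" in name.split("/")[:-1]  # check folder components only
    ]


def build_demo_package() -> bytes:
    """Build a tiny stand-in package in memory (placeholder contents)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", "<document/>")
        # Placeholder bytes, not a valid JPEG image.
        package.writestr("word/media/image1.jpeg", b"\xff\xd8 placeholder")
    return buffer.getvalue()


demo_bytes = build_demo_package()
with zipfile.ZipFile(io.BytesIO(demo_bytes)) as package:
    all_parts = package.namelist()
    media_parts = list_media_parts(package)
    # Reading a part returns its bytes exactly as stored in the package.
    image_bytes = package.read(media_parts[0])

print(all_parts)
print(media_parts)
```

To try this against a real document, pass a file path to zipfile.ZipFile instead of the in-memory buffer, for example zipfile.ZipFile("report.docx"), where the file name is only an illustration.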

Other parts contain metadata that define content type and the relationships between the parts. The content types used by every part within the package are defined in the file [Content_Types].xml, which must be present in the root of every package. The only other reserved location is /_rels/.rels which specifies the relationship between the package itself and the top-level parts.
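
To illustrate how a consuming application might use /_rels/.rels, here is a rough Python sketch that looks up the main document part by its relationship type. The function name find_main_part is hypothetical, the sample target word/document.xml is an assumption chosen for illustration, and the path handling is deliberately simplified (it treats every target as relative to the package root).

```python
import xml.etree.ElementTree as ET
from typing import Optional

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
OFFICE_DOCUMENT = (
    "http://schemas.openxmlformats.org/officeDocument/2006/"
    "relationships/officeDocument"
)

# Illustrative package-level relationships part (assumed target).
SAMPLE_RELS = f"""\
<Relationships xmlns="{RELS_NS}">
   <Relationship Id="rId1" Type="{OFFICE_DOCUMENT}"
          Target="word/document.xml" />
</Relationships>
"""


def find_main_part(rels_xml: str) -> Optional[str]:
    """Return the part name targeted by the officeDocument relationship."""
    root = ET.fromstring(rels_xml)
    for rel in root.findall(f"{{{RELS_NS}}}Relationship"):
        if rel.get("Type") == OFFICE_DOCUMENT:
            # Simplification: resolve the target against the package root.
            return "/" + rel.get("Target", "").lstrip("/")
    return None


print(find_main_part(SAMPLE_RELS))
```

The point of the indirection is that a consumer does not need to know in advance where the main part lives; it only needs to know the relationship type to look for.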

Hello world

To understand this better, let’s look at what is required to create a basic ‘Hello world’ document. At the very least the package must contain a content-type part, a package-relationship part and a document part. As you can see, these are all standard XML documents with defined namespaces. In the content-type part and the package relationship part these reference schemas that are specific to OPC. For the document part the namespace references the WordprocessingML schema, introduced with Office 2003 and now part of the Open XML specification:

The content-type part /[Content_Types].xml:

<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
   <Default Extension="rels"
       ContentType="application/vnd.openxmlformats-package.relationships+xml" />
   <Default Extension="xml"
       ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml" />
</Types>

The package relationship part /_rels/.rels:

<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
   <Relationship Id="rId1"
          Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
          Target="document.xml" />
</Relationships>

The document part /document.xml:

<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
   <w:body>
      <w:p>
         <w:r>
            <w:t>Hello world</w:t>
         </w:r>
      </w:p>
   </w:body>
</w:document>

Example taken from the ‘Explanatory Report on Office Open XML Standard (ECMA-376)’.
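
The three parts above can be assembled into a package with any ZIP library. The sketch below is not part of the Explanatory Report; it is one possible way to do this with Python's standard zipfile module, writing the parts to an in-memory archive and then reading the text back as a check. The function name build_hello_world is hypothetical, and the result was not tested here against any particular version of Word.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CONTENT_TYPES = """\
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
   <Default Extension="rels"
       ContentType="application/vnd.openxmlformats-package.relationships+xml" />
   <Default Extension="xml"
       ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml" />
</Types>
"""

PACKAGE_RELS = """\
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
   <Relationship Id="rId1"
          Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
          Target="document.xml" />
</Relationships>
"""

DOCUMENT = """\
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
   <w:body>
      <w:p><w:r><w:t>Hello world</w:t></w:r></w:p>
   </w:body>
</w:document>
"""


def build_hello_world() -> bytes:
    """Write the three required parts into a ZIP archive held in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", PACKAGE_RELS)
        package.writestr("document.xml", DOCUMENT)
    return buffer.getvalue()


package_bytes = build_hello_world()

# Read the package back and pull the text out of every w:t element.
with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
    part_names = package.namelist()
    body = ET.fromstring(package.read("document.xml"))

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
text = "".join(t.text or "" for t in body.iter(W + "t"))
print(part_names, text)
```

Saving package_bytes to a file with a .docx extension (the file name is up to you) gives a package you can try opening in a consuming application.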

The purpose of the content-type part is to tell consuming applications such as Word 2007 how to interpret the various types of content that the package contains. In this case the part defines just two content types. The first is for files with the ‘rels’ extension which define relationships; the second is for files with the ‘xml’ extension which in this case are to be interpreted as WordprocessingML documents.

In more complex packages the Default element for the ‘xml’ extension may simply give the ContentType as ‘application/xml’. This would be followed by a number of more specific Override elements, each identifying a part through its PartName attribute and specifying, for example, that the file /word/styles.xml should be interpreted using the WordprocessingML style specification.
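
The lookup a consumer performs against the content-type part can be sketched as follows: an Override that names the part wins, otherwise the Default for the file extension applies. This is an illustrative Python sketch rather than any application's actual implementation. The function name content_type_for and the sample part names passed to it are hypothetical, and matching is done case-insensitively as a simplifying assumption.

```python
import xml.etree.ElementTree as ET
from typing import Optional

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"

# Illustrative content-type part for a more complex package.
SAMPLE_TYPES = f"""\
<Types xmlns="{CT_NS}">
   <Default Extension="rels"
       ContentType="application/vnd.openxmlformats-package.relationships+xml" />
   <Default Extension="xml" ContentType="application/xml" />
   <Override PartName="/word/document.xml"
       ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml" />
   <Override PartName="/word/styles.xml"
       ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml" />
</Types>
"""


def content_type_for(part_name: str, types_xml: str) -> Optional[str]:
    """Resolve a part's content type: Override first, then Default."""
    root = ET.fromstring(types_xml)
    # 1. An Override naming this exact part takes priority.
    for override in root.findall(f"{{{CT_NS}}}Override"):
        if override.get("PartName", "").lower() == part_name.lower():
            return override.get("ContentType")
    # 2. Otherwise fall back to the Default for the file extension.
    extension = part_name.rsplit(".", 1)[-1].lower() if "." in part_name else ""
    for default in root.findall(f"{{{CT_NS}}}Default"):
        if extension and default.get("Extension", "").lower() == extension:
            return default.get("ContentType")
    return None


for name in ("/word/styles.xml", "/customXml/item1.xml", "/_rels/.rels"):
    print(name, "->", content_type_for(name, SAMPLE_TYPES))
```

A part whose extension has no Default and no Override, such as an image in a package that declares neither, resolves to None in this sketch.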
