Opening the package in OpenXML

Introduction

This article was originally published on DNJ Online

The idea of making XML the default document format in Microsoft Office 2007 is to enable universal access. Since XML is mainly plain text, applications can read, write and create the new formats without resorting to COM automation. Even Web applications written in PHP and running on Linux can manipulate Office Open XML.

Unfortunately the task is not as easy as it sounds. Anyone expecting a tidy single file in the style of XHTML will be disappointed. Open XML documents are compressed using ZIP, and although such archives are easy to handle, this adds to the programmer’s task. Furthermore, the archive contains not one but many files, even for a simple text-only document, so figuring out what to edit is a challenge.

Microsoft has provided an API to make the job easier, though this is fairly low-level. It would be great to have some high-level classes to simplify common tasks, but for the moment there is only a set of code snippets you can download. These illustrate the problem. The snippet for inserting a string into a cell in an Excel worksheet, for example, is over 100 lines of code. A set of wrapper classes is essential, and until Microsoft comes up with something better, developers will have to roll their own.

The API for working with Open XML is called System.IO.Packaging and is part of .NET Framework 3.0. Getting started involves downloading the .NET 3.0 runtime and SDK. A good place to begin is http://www.netfx3.com which has links to numerous .NET 3.0 resources.

The Packaging API is not specific to Open XML. It is also used for Microsoft’s XML Paper Specification (XPS) documents and may be used for other purposes in future. It provides a generic way to manipulate documents packaged into a single archive conforming to the Open Packaging Conventions (OPC), which is a Microsoft-sponsored Ecma standard.

Many developers will already be familiar with the COM automation API exposed by Office applications that lets you create and modify Office documents from Visual Basic and other languages. The advantage of the Packaging API is that you no longer need COM or the presence of the Office applications themselves. If you need to work with Office documents from a .NET application then the Packaging API is ideal.

On the other hand, the old COM API does a better job of hiding the complexity of these documents. The Packaging API itself is rather simple; the complexity lies in the XML specifications for the Office documents, which famously account for thousands of pages of documentation. Using the Packaging API will typically require more code than working with automation, and carries a greater risk of ending up with a corrupt document.

Another tricky issue is that applications like Word keep a document locked while they have it open, which means the Packaging API is no use on that file in the meantime. You have to close the document, process it and then reopen it.
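One way to cope with a locked file is to attempt the open and treat a failure as "try again later". The following is a minimal sketch rather than a recommended pattern: the class and method names are hypothetical, and it assumes that a file held open by another application surfaces as an IOException when the package is opened for read-write access.

```csharp
using System;
using System.IO;
using System.IO.Packaging;

static class PackageLockCheck
{
    // Hypothetical helper: returns an open read-write Package, or null when
    // the file cannot be opened because of an I/O error such as a sharing
    // violation (for example, the document is still open in Word).
    public static Package TryOpenForEdit(string path)
    {
        try
        {
            return Package.Open(path, FileMode.Open, FileAccess.ReadWrite);
        }
        catch (FileNotFoundException)
        {
            // A missing file is a different problem, so let it propagate.
            throw;
        }
        catch (IOException)
        {
            // Assumption: any other IOException is treated as "file is busy".
            return null;
        }
    }
}
```

A caller would check for null, ask the user to close the document, and retry. The returned Package must be closed or disposed when the work is done.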

The essentials

The OPC is based on the concept of parts and relationships, as explained in our previous article. It is therefore no surprise to find that the three key classes in System.IO.Packaging are Package, PackagePart and PackageRelationship. Code that works with the Packaging API generally starts by creating or opening the package to instantiate a Package object, working with PackageParts and PackageRelationships, and then closing the package.
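As a rough illustration of that open, work, close pattern, the sketch below creates a small package containing one part and one package-level relationship, then reopens it and lists what it finds. The part name, content type and relationship type are made up for the example and are not taken from any Open XML format.

```csharp
using System;
using System.IO;
using System.IO.Packaging;

static class PackageTour
{
    // Hypothetical names, chosen for illustration only.
    const string PartPath = "/content/hello.xml";
    const string RelType = "http://example.com/relationships/hello";

    public static void Create(string path)
    {
        // Disposing the Package closes it and flushes the archive to disk.
        using (Package package = Package.Open(path, FileMode.Create))
        {
            Uri partUri = PackUriHelper.CreatePartUri(new Uri(PartPath, UriKind.Relative));
            PackagePart part = package.CreatePart(partUri, "application/xml");

            using (StreamWriter writer = new StreamWriter(
                part.GetStream(FileMode.Create, FileAccess.Write)))
            {
                writer.Write("<hello/>");
            }

            // A package-level relationship pointing at the new part.
            package.CreateRelationship(partUri, TargetMode.Internal, RelType);
        }
    }

    public static void Dump(string path)
    {
        using (Package package = Package.Open(path, FileMode.Open, FileAccess.Read))
        {
            foreach (PackagePart part in package.GetParts())
                Console.WriteLine("part: {0} ({1})", part.Uri, part.ContentType);

            foreach (PackageRelationship rel in package.GetRelationships())
                Console.WriteLine("rel:  {0} -> {1}", rel.RelationshipType, rel.TargetUri);
        }
    }
}
```

Calling PackageTour.Create followed by PackageTour.Dump on the same path prints the parts and relationships stored in the archive. Running Dump against a real .docx file is a quick way to see how many parts even a simple document contains.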

One aspect of the Packaging API is worth mentioning early on. All Open XML documents have a top-level file called [Content_Types].xml which lists the MIME content types used in the document. You do not need to interact with this file directly as it is maintained automatically. A related fact is that the content type of a part is read-only. If you need to change the content type of a part then you must delete it and create a new part of the required type.
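The delete-and-recreate step might look something like the sketch below. The helper name is hypothetical, the package is assumed to be open for read-write access, and the sketch makes no attempt to recreate any relationships that the old part owned.

```csharp
using System;
using System.IO;
using System.IO.Packaging;

static class ContentTypeHelper
{
    // Hypothetical helper: replaces a part with a new part at the same URI
    // that has the same bytes but a different content type.
    public static PackagePart ChangeContentType(
        Package package, Uri partUri, string newContentType)
    {
        PackagePart oldPart = package.GetPart(partUri);
        CompressionOption compression = oldPart.CompressionOption;

        // Copy the existing content into memory before deleting the part.
        byte[] data;
        using (Stream source = oldPart.GetStream(FileMode.Open, FileAccess.Read))
        using (MemoryStream buffer = new MemoryStream())
        {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = source.Read(chunk, 0, chunk.Length)) > 0)
                buffer.Write(chunk, 0, read);
            data = buffer.ToArray();
        }

        package.DeletePart(partUri);

        PackagePart newPart = package.CreatePart(partUri, newContentType, compression);
        using (Stream target = newPart.GetStream(FileMode.Create, FileAccess.Write))
        {
            target.Write(data, 0, data.Length);
        }
        return newPart;
    }
}
```

Because [Content_Types].xml is maintained automatically, nothing in this code touches it directly.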

The Packaging API does not duplicate the .NET XML libraries. You read and write the content of parts through streams in the normal way. For example, if you wanted to amend an XML document stored in a package, you would retrieve the XML through PackagePart.GetStream(), amend it using the System.Xml API or even just as plain text, and then write it back, again through PackagePart.GetStream().
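The read, amend and write-back cycle can be sketched as follows. The helper name and its parameters are hypothetical, and XmlDocument is just one of several ways to edit the XML. The sketch assumes the part already exists and contains well-formed XML.

```csharp
using System;
using System.IO;
using System.IO.Packaging;
using System.Xml;

static class PartXmlEditor
{
    // Hypothetical helper: sets an attribute on the root element of an
    // XML part and saves the result back into the package.
    public static void SetRootAttribute(
        string packagePath, Uri partUri, string name, string value)
    {
        using (Package package = Package.Open(
            packagePath, FileMode.Open, FileAccess.ReadWrite))
        {
            PackagePart part = package.GetPart(partUri);

            // Read the existing XML through the part's stream.
            XmlDocument doc = new XmlDocument();
            using (Stream input = part.GetStream(FileMode.Open, FileAccess.Read))
            {
                doc.Load(input);
            }

            doc.DocumentElement.SetAttribute(name, value);

            // FileMode.Create is used so that the new content replaces the old
            // rather than being written over the start of a longer stream.
            using (Stream output = part.GetStream(FileMode.Create, FileAccess.Write))
            {
                doc.Save(output);
            }
        }
    }
}
```

Closing the read stream before opening the write stream keeps the two operations separate, which seems the safest ordering when both refer to the same part.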

To work with the Packaging API in a Visual Studio project you must add a project reference to the WindowsBase assembly. If WindowsBase is not listed then you need to install .NET Framework 3.0. In your code you will probably want to add the following namespaces to the C# using directives or Visual Basic Imports statements:

System.IO
System.IO.Packaging
System.Xml
