With Office 2007, Microsoft decided to change the default application formats from the old, proprietary, closed formats (DOC, XLS, PPT) to new, open and standardized XML formats (DOCX, XLSX and PPTX). The new formats share some similarities with the old Office XML formats (WordML, SpreadsheetML) and some with the competing OpenOffice.org OpenDocument formats, but there are many differences. Since the new formats will be the default in Office 2007 and Microsoft Office is the predominant office suite, these formats are destined to be popular, and you will probably have to deal with them sooner or later.
This article explains the basics of the Open XML file format, and specifically the XLSX format, the new format for Excel 2007. It presents a demo application that writes tabular data to XLSX files and reads it back. The application is written in C# using Visual Studio 2005. The XLSX files it creates can be opened in Excel 2007 Beta (we used build 12.0.3820.1003).
Microsoft Open XML format
Every Open XML file is essentially a ZIP archive containing many other files. Office-specific data is stored in multiple XML files inside that archive. This is in direct contrast with the old WordML and SpreadsheetML formats, which were single, uncompressed XML files. Although more complex, the new approach offers a few benefits:
- You don’t need to process the entire file in order to extract specific data.
- Images and multimedia are now stored in their native formats, not as text streams.
- Files are smaller as a result of compression and native multimedia storage.
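Because the package is an ordinary ZIP archive, its contents can be listed with any ZIP library. The following is a minimal sketch, assuming a runtime that provides System.IO.Compression.ZipArchive (.NET Framework 4.5 or later, which is newer than the Visual Studio 2005 toolchain used for the demo application). The names PackageLister and ListEntries are made up for illustration.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

public static class PackageLister
{
    // Returns the full name of every ZIP entry in the given Open XML package.
    // Nothing is parsed; the entries are only enumerated.
    public static List<string> ListEntries(Stream package)
    {
        var names = new List<string>();
        using (var zip = new ZipArchive(package, ZipArchiveMode.Read, leaveOpen: true))
        {
            foreach (ZipArchiveEntry entry in zip.Entries)
                names.Add(entry.FullName);
        }
        return names;
    }
}
```

Passing a stream opened with File.OpenRead on an XLSX file would typically return names such as “[Content_Types].xml”, “_rels/.rels” and “xl/workbook.xml”, although the exact set depends on the file.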
Picture 1: Parts and relationships inside an XLSX file.
To cut a long story short, in order to read the data from an Open XML file you need to:
1) Open the package as a ZIP archive – any standard ZIP library will do.
2) Find the parts that contain the data you want to read – you can either navigate through the relationship graph (more complex) or presume that certain parts have a fixed name and path (simpler, but Microsoft can change those in the future).
3) Read the parts you are interested in – using a standard XML library (if they are XML) or some other method (if they are images, sounds or some other type).
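The three reading steps can be sketched as follows. This sketch takes the relationship route in step 2 rather than hard-coding the part path. It assumes System.IO.Compression and LINQ to XML are available (.NET Framework 4.5 or later), and it uses the namespace and relationship-type URIs published in the ECMA-376 standard, which may differ from those written by the beta build mentioned above. PackageReader and its members are hypothetical names.

```csharp
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class PackageReader
{
    // URIs as published in ECMA-376 (assumption: a beta build may use others).
    static readonly XNamespace Rel =
        "http://schemas.openxmlformats.org/package/2006/relationships";
    const string OfficeDocumentType =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

    public static XDocument ReadMainPart(Stream package)
    {
        // Step 1: open the package as a ZIP archive.
        using (var zip = new ZipArchive(package, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Step 2: the root relationships part points at the main document part.
            XDocument rels = LoadXml(zip, "_rels/.rels");
            string target = rels.Root.Elements(Rel + "Relationship")
                .Where(r => (string)r.Attribute("Type") == OfficeDocumentType)
                .Select(r => (string)r.Attribute("Target"))
                .FirstOrDefault();
            if (target == null)
                throw new InvalidDataException("No officeDocument relationship found.");

            // Step 3: root-level targets are relative to the package root,
            // and ZIP entry names carry no leading slash.
            return LoadXml(zip, target.TrimStart('/'));
        }
    }

    static XDocument LoadXml(ZipArchive zip, string name)
    {
        ZipArchiveEntry entry = zip.GetEntry(name);
        if (entry == null)
            throw new FileNotFoundException("Part not found in package: " + name);
        using (Stream s = entry.Open())
            return XDocument.Load(s);
    }
}
```

For an XLSX file the returned document would normally be the workbook part. Following the relationship costs a few extra lines, but the code keeps working if the main part is ever stored under a different path. The sketch simplifies target resolution: it handles only targets relative to the package root and matches entry names case-sensitively.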
On the other hand, if you want to create a new Open XML file, you need to:
1) Create or obtain all necessary parts – by using a standard XML library (if they are XML), by copying them, or by some other method.
2) Create all relationships – create the “.rels” files.
3) Create content types – create the “[Content_Types].xml” file.
4) Package everything into a ZIP file with the appropriate extension (DOCX, XLSX or PPTX) – any standard ZIP library will do.
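The four creation steps can be sketched as below for a workbook with one sheet and one numeric cell. This is an illustration under several assumptions, not the demo application's code: it uses System.IO.Compression (.NET Framework 4.5 or later), it builds the XML by string concatenation for brevity where a real application would use an XML library, and it uses the namespaces and content types from the ECMA-376 standard, which may not match the beta build mentioned above. PackageWriter and WriteMinimalWorkbook are hypothetical names.

```csharp
using System.Globalization;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class PackageWriter
{
    // URIs and content types as published in ECMA-376 (assumption, see lead-in).
    const string PkgRel = "http://schemas.openxmlformats.org/package/2006/relationships";
    const string DocRel = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";
    const string Main = "http://schemas.openxmlformats.org/spreadsheetml/2006/main";
    const string Ct = "application/vnd.openxmlformats-officedocument.spreadsheetml.";

    // Writes a one-sheet workbook whose cell A1 holds the given number.
    public static void WriteMinimalWorkbook(Stream output, double value)
    {
        // Step 4: everything is packaged into one ZIP archive.
        using (var zip = new ZipArchive(output, ZipArchiveMode.Create, leaveOpen: true))
        {
            // Step 3: content types for the parts in the package.
            Add(zip, "[Content_Types].xml",
                "<Types xmlns=\"http://schemas.openxmlformats.org/package/2006/content-types\">" +
                "<Default Extension=\"rels\" ContentType=\"application/vnd.openxmlformats-package.relationships+xml\"/>" +
                "<Override PartName=\"/xl/workbook.xml\" ContentType=\"" + Ct + "sheet.main+xml\"/>" +
                "<Override PartName=\"/xl/worksheets/sheet1.xml\" ContentType=\"" + Ct + "worksheet+xml\"/>" +
                "</Types>");
            // Step 2: package-to-workbook and workbook-to-sheet relationships.
            Add(zip, "_rels/.rels",
                "<Relationships xmlns=\"" + PkgRel + "\">" +
                "<Relationship Id=\"rId1\" Type=\"" + DocRel + "/officeDocument\" Target=\"xl/workbook.xml\"/>" +
                "</Relationships>");
            Add(zip, "xl/_rels/workbook.xml.rels",
                "<Relationships xmlns=\"" + PkgRel + "\">" +
                "<Relationship Id=\"rId1\" Type=\"" + DocRel + "/worksheet\" Target=\"worksheets/sheet1.xml\"/>" +
                "</Relationships>");
            // Step 1: the parts themselves.
            Add(zip, "xl/workbook.xml",
                "<workbook xmlns=\"" + Main + "\" xmlns:r=\"" + DocRel + "\">" +
                "<sheets><sheet name=\"Sheet1\" sheetId=\"1\" r:id=\"rId1\"/></sheets></workbook>");
            Add(zip, "xl/worksheets/sheet1.xml",
                "<worksheet xmlns=\"" + Main + "\"><sheetData><row r=\"1\"><c r=\"A1\"><v>" +
                value.ToString("R", CultureInfo.InvariantCulture) +
                "</v></c></row></sheetData></worksheet>");
        }
    }

    // Adds one UTF-8 encoded XML entry (without a byte order mark) to the archive.
    static void Add(ZipArchive zip, string name, string xml)
    {
        using (var w = new StreamWriter(zip.CreateEntry(name).Open(), new UTF8Encoding(false)))
            w.Write("<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>" + xml);
    }
}
```

Calling WriteMinimalWorkbook with a stream from File.Create and a file name ending in .xlsx should produce a file that spreadsheet applications following the final standard accept. This has not been verified against the beta build.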
The whole story about packages, parts, content types and relationships is the same for all Open XML documents (regardless of their originating application), and Microsoft refers to it as the Open Packaging Conventions.
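Since these conventions do not depend on the document type, one piece of code can inspect any package. The sketch below uses the System.IO.Packaging API, which ships in WindowsBase.dll with .NET Framework 3.0 and later and as a separate System.IO.Packaging package on newer .NET. The class name OpcInspector is hypothetical, and the article's demo application does not necessarily use this API.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Packaging;

public static class OpcInspector
{
    // Describes the parts and the package-level relationships of any
    // Open XML package (DOCX, XLSX or PPTX) without format-specific code.
    public static List<string> Describe(Stream stream)
    {
        var lines = new List<string>();
        using (Package package = Package.Open(stream, FileMode.Open, FileAccess.Read))
        {
            // Each part exposes its URI and the content type declared for it.
            foreach (PackagePart part in package.GetParts())
                lines.Add("part " + part.Uri + " : " + part.ContentType);

            // Package-level relationships are the entry points into the part graph.
            foreach (PackageRelationship rel in package.GetRelationships())
                lines.Add("rel " + rel.RelationshipType + " -> " + rel.TargetUri);
        }
        return lines;
    }
}
```

The API resolves content types and relationship files internally, so the caller never has to parse “[Content_Types].xml” or the “.rels” files by hand.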