Using Office 2007 Documents as a Data Store

Published 21 Aug 2009 9:44 AM

Don't know what's up, but recently I've gotten a bunch of questions about treating Office 2007 documents as data--that is, about programmatically manipulating the contents of Word, Excel, or PowerPoint 2007 documents. Most people rely solely on automation (that is, the technique we learned back in 1997: open the host application, use it to open the document, and then use the host application's object model to do the work). This is slow. Very, very slow.

If you want better performance, you need to work directly with the Open XML File Formats introduced in Office 2007. Because Office 2007 documents (Word, Excel, and PowerPoint, at least) are stored as a completely transparent set of XML parts inside a ZIP file, you can manipulate the XML contents of the file directly, rather than opening Word (or Excel or PowerPoint) to do the work.
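To make the "ZIP file full of XML parts" idea concrete, here's a minimal sketch in Python that reads paragraph text out of a Word-style package using nothing but the standard library. It assumes the main document part lives at word/document.xml, and the build_sample_package helper is a hypothetical stand-in that builds a tiny in-memory package so the example runs on its own. A real .docx contains more parts than this.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML main namespace used by Word 2007 document parts.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_sample_package(text):
    """Hypothetical helper: build a tiny in-memory ZIP holding only a main
    document part. A real .docx has additional parts (content types,
    relationships, styles, and so on)."""
    body = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body><w:p><w:r>'
        f'<w:t>{escape(text)}</w:t>'
        '</w:r></w:p></w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", body)
    return buf.getvalue()


def read_paragraphs(package_bytes):
    """Open the package as a ZIP, parse the main document part, and return
    the text of each paragraph (w:p) by joining its text runs (w:t)."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        runs = [t.text or "" for t in p.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs


if __name__ == "__main__":
    package = build_sample_package("Hello from a ZIP package")
    print(read_paragraphs(package))
```

For a file on disk, you would hand the file's bytes (or the path, via zipfile.ZipFile directly) to the same logic. Word never gets launched at any point.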

This is, of course, far more difficult in practice than it is in theory (isn't everything?)--to make it simpler, Microsoft has been preparing the Open XML SDK 2.0, which provides higher-level wrappers around the code you would otherwise need to write to crack open and manipulate the XML content within Office documents. You can find information about the SDK here:

http://www.microsoft.com/downloads/details.aspx?FamilyID=c6e744e5-36e9-45f5-8d8c-331df206e0d0&DisplayLang=en

In addition, the Office team commissioned a set of 50 code snippets demonstrating the use of the Open XML SDK 2.0 (and I had the privilege of creating these snippets--it was loads of fun). They should be available soon, and I'll post a link once you can download them. This technology isn't for the faint-hearted, that's for sure. Even simple operations require some heavy lifting, but the results are much, much faster than using automation (at runtime, at least!).
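To give a feel for the heavy lifting, here's a rough sketch of a "simple" operation, replacing some text in a document, done by hand in Python. This is not the Open XML SDK and not one of the snippets mentioned above. It's just the raw ZIP-and-XML approach, and the replace_text name and the word/document.xml part path are assumptions made for illustration. Note that you can't edit a part in place: you copy every entry into a new archive and swap in the modified XML.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace; the part path below is an assumption.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
PART = "word/document.xml"


def replace_text(package_bytes, old, new):
    """Return a new package in which `old` is replaced by `new` inside
    every text run (w:t) of the main document part."""
    # Keep the conventional "w" prefix when the XML is written back out.
    ET.register_namespace("w", W_NS)
    src = zipfile.ZipFile(io.BytesIO(package_bytes))
    out = io.BytesIO()
    with src, zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == PART:
                root = ET.fromstring(data)
                for t in root.iter(f"{{{W_NS}}}t"):
                    if t.text:
                        t.text = t.text.replace(old, new)
                data = ET.tostring(root, encoding="UTF-8",
                                   xml_declaration=True)
            # Every other part is copied through unchanged.
            dst.writestr(item, data)
    return out.getvalue()
```

Even this sketch cuts corners. Word often splits a phrase across several runs, so a plain string replace inside each w:t can miss text that looks contiguous on screen. Real documents also declare many more namespaces than the one registered here, and ElementTree may rename prefixes it doesn't know about when it writes the XML back out, so treat this as a starting point rather than something to run against production files.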

If you need to manipulate Office 2007 content, and you don't mind getting your hands a little dirty, check out the SDK.

by KenG