Friday, August 24, 2012

My favorite All-In-One Code Framework sample of the week - Extract embedded files from Office documents

Microsoft Developer Network - Samples - Extract embedded files from Office documents (CSOfficeDocumentFileExtractor)

Introduction

Office documents can have other files embedded in them. However, the object model does not expose a way to extract the embedded files from the document. This sample demonstrates how to extract the embedded files from an Office 2007 format file.

The files embedded inside an Office 2007 format file are located under the /<application>/embeddings/ folder. Using the System.IO.Packaging classes, we can extract the files.
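As a rough illustration of the idea (not the sample's own code, which is C# and uses System.IO.Packaging), here is a minimal Python sketch. It relies on the fact that an Office 2007 format file is a ZIP package, so the standard zipfile module can list and read the parts under the embeddings folder. The function names list_embedded_parts and read_embedded_part are hypothetical, made up for this example.

```python
import zipfile


def list_embedded_parts(package):
    """Return the names of parts stored under an '<application>/embeddings/' folder.

    `package` is a path or a binary file-like object for an Office 2007
    format file, which is read here as a plain ZIP archive.
    """
    with zipfile.ZipFile(package) as archive:
        return [
            name for name in archive.namelist()
            # Skip directory entries; keep anything inside an embeddings folder.
            if "/embeddings/" in name.lower() and not name.endswith("/")
        ]


def read_embedded_part(package, part_name):
    """Return the raw bytes of one embedded part."""
    with zipfile.ZipFile(package) as archive:
        return archive.read(part_name)
```

Parts that are themselves Office 2007 files can be written to disk as-is; the oleObjectX.bin parts need the extra unwrapping described next.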

If the embedded file is an Office 2007 format file like a Word document or an Excel workbook, it will be stored as such. However, other files will be stored in Structured Storage format with the name oleObjectX.bin. Most of these files will be stored in Ole10Native format. The Ole10Native file format has the following structure:

  • First 4 Bytes – Unknown
  • Next 2 Bytes – Usually 2 (02 00)
  • The name of the embedded file starts at the 7th byte.
  • The original full path of the embedded file starts after that. Scan the path until the null character.
  • The next 4 bytes are unknown.
  • The next 4 bytes represent the length of the temporary file path used before the file was inserted into the document. This value is in little-endian format and needs to be converted.
  • The temporary file path starts after that. We can either skip it using the length just retrieved or scan the path until the null character.
  • The next 4 bytes represent the size of the embedded file in little-endian format. This value also needs to be converted.
  • The actual file contents start here. Read as many bytes as the size retrieved in the previous step.
  • The next 4 bytes give the length of the temporary location of the file in Unicode.
  • Temporary location of the file in Unicode starts from here.
  • Finally the source file path in Unicode starts.
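The layout above can be walked with a small parser. The following Python sketch is only an illustration under stated assumptions, not the sample's implementation: it assumes the file name is null-terminated like the path that follows it, it decodes the non-Unicode strings as Latin-1, and it ignores the trailing Unicode fields. The names Ole10Native, _read_cstring and parse_ole10native are hypothetical.

```python
import struct
from typing import NamedTuple


class Ole10Native(NamedTuple):
    file_name: str
    source_path: str
    temp_path: str
    data: bytes


def _read_cstring(buf, pos):
    """Read a null-terminated string at pos; return (text, position after the null)."""
    end = buf.index(b"\x00", pos)
    # Assumption: Latin-1 is used only so that every byte value decodes.
    return buf[pos:end].decode("latin-1"), end + 1


def parse_ole10native(buf):
    """Parse the bytes of an Ole10Native stream following the layout listed above."""
    pos = 6  # skip the 4 unknown bytes and the 2-byte field (usually 02 00)
    file_name, pos = _read_cstring(buf, pos)    # assumed null-terminated
    source_path, pos = _read_cstring(buf, pos)  # original full path
    pos += 4                                    # 4 unknown bytes
    (temp_len,) = struct.unpack_from("<I", buf, pos)  # little-endian length
    pos += 4
    temp_path = buf[pos:pos + temp_len].split(b"\x00", 1)[0].decode("latin-1")
    pos += temp_len                             # skip the path using its length
    (size,) = struct.unpack_from("<I", buf, pos)      # little-endian file size
    pos += 4
    data = buf[pos:pos + size]
    if len(data) != size:
        raise ValueError("stream is shorter than the declared file size")
    return Ole10Native(file_name, source_path, temp_path, data)
```

The input here is the content of the Ole10Native stream itself, so the oleObjectX.bin storage has to be opened first to get at that stream. Using struct with the "<I" format takes care of the little-endian conversion mentioned in the list.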

The sample uses Structured Storage APIs. Most of the interfaces, classes and enumerations are defined in the System.Runtime.InteropServices.ComTypes namespace. A few, listed below, are declared in the Ole10Native.cs file.

  1. IEnumSTATSTG – Interface
  2. STATFLAG – Enumeration
  3. IStorage – Interface
  4. STGM – Enumeration
  5. StgIsStorageFile – Method (Ole32.dll)
  6. StgOpenStorage – Method (Ole32.dll)
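For readers without access to Ole32.dll, a very rough stand-in for the StgIsStorageFile check is to look at the first 8 bytes of the file, since Structured Storage (compound) files begin with a fixed signature. The Python sketch below does only that header comparison; it is not the Windows API and does no further validation. The function name looks_like_structured_storage is hypothetical.

```python
# Signature found at the start of every Structured Storage (compound) file.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_structured_storage(data):
    """Return True if the bytes start with the compound file signature."""
    return data[:8] == CFB_SIGNATURE
```

A check like this is handy for deciding whether an extracted part can be saved directly or needs to be opened as a storage first.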

..."

This is fricken hard to find on the Net. I've written embedded file extractors using the Structured Storage APIs and there's just not much info on how to do it. While the more approachable OpenXML format makes this a little easier, the real beast is the Ole10Native format files. This sample looks like a great resource for helping with that (and since it's not my work IP, I can mention it... ;)
