Checking for Microsoft Word DocX/DocM Revisions/Track Changes without using Word... (via OpenXML SDK, LINQ to XML or XML DOM)
"I’ve written a short article at OpenXMLDeveloper.org that shows how to detect tracked revisions using XmlDocument. Previously, I wrote an article on detecting tracked revisions using LINQ to XML or the strongly-typed object model of the Open XML SDK 2.0. However, some developers do not have the option of using LINQ, and instead must use one of a variety of XML DOM Document implementations. ..."
"Determining whether an Open XML WordprocessingML document contains tracked revisions is important. You can significantly simplify your code to process Open XML WordprocessingML if you know that the document does not contain tracked revisions. This article describes how to determine whether a document contains tracked revisions.
Processing tracked changes (also known as tracked revisions) is an important task that you should fully understand when you write Open XML applications. If you accept all tracked revisions first, your job of processing or transforming the WordprocessingML is made significantly easier.
Accepting Revisions by Using PowerTools for Open XML
To review the semantics of the elements and attributes of WordprocessingML that hold tracked changes information in detail, see Accepting Revisions in Open XML Word-Processing Documents. In addition, you can download the code sample, RevisionAccepter.zip from the following project on CodePlex, CodePlex.com/PowerTools. To download, go to the Downloads tab, and then click RevisionAccepter.zip.
Determining Existence of Tracked Changes
There are other scenarios where you want to process documents that you know do not contain tracked changes, and because of certain business requirements, you do not want to automatically accept tracked changes. For example, perhaps you have a SharePoint document library that contains no documents that contain tracked changes. Before users add the document to that document library, you want them to consciously and intentionally address and accept all tracked revisions. Accepting revisions as part of checking the document into the document library circumvents this manual process, where you want each person to examine their documents and resolve any issues.
As an alternative, instead of accepting revisions with the RevisionAccepter class, you can validate that the document contains no tracked revisions, and refuse to let the document be checked into the document library without tracked changes being accepted.
The code is not complex. It defines an array of revision tracking element names, and if any of these elements occur in any of the parts that can contain tracked revisions, then the document contains tracked revisions. We can use a LINQ query to determine if any of the revision tracking elements exist in the markup. This article presents four versions of the code to determine whether a document contains tracked revisions.
- Using C# and LINQ to XML.
- Using C# and the Open XML SDK strongly-typed object model.
- Using Visual Basic and LINQ to XML.
- Using Visual Basic and the Open XML SDK strongly-typed object model."
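To make the idea in the excerpt concrete, here is a minimal sketch in Python (not one of the four versions from the article). It keeps a set of revision-tracking element names and reports whether any of them occurs in the parts that can hold revisions. Two assumptions are worth flagging: the element list is only an illustrative subset of the full set described in the article, and the parts are found by their conventional names inside the package rather than by following package relationships. The function name `has_tracked_revisions` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Illustrative subset of the revision-tracking element names (NOT the full list).
REVISION_ELEMENTS = {
    f"{{{W_NS}}}{name}"
    for name in (
        "ins", "del", "moveFrom", "moveTo",
        "rPrChange", "pPrChange", "sectPrChange",
        "tblPrChange", "tcPrChange", "trPrChange", "tblGridChange",
        "cellIns", "cellDel", "cellMerge", "numberingChange",
    )
}

# Assumption: candidate parts are matched by conventional part names.
PART_PREFIXES = (
    "word/document", "word/header", "word/footer",
    "word/footnotes", "word/endnotes",
)


def has_tracked_revisions(docx):
    """Return True if any candidate part contains a revision-tracking element.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as package:
        for name in package.namelist():
            if not (name.startswith(PART_PREFIXES) and name.endswith(".xml")):
                continue
            root = ET.fromstring(package.read(name))
            # Rough equivalent of a LINQ "Any" query over the descendants.
            if any(el.tag in REVISION_ELEMENTS for el in root.iter()):
                return True
    return False
```

A check-in gate along the lines of the SharePoint scenario could call `has_tracked_revisions("contract.docx")` (the file name is just an example) and refuse the document when the result is `True`.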
"Tracked revisions are one of the more involved features of Open XML WordprocessingML. There are 28 elements associated with tracked revisions, each with their own semantics. In some cases, such as with content controls and deleted paragraph marks, the semantics for tracked revisions are (of necessity) very involved.
Some time ago, I wrote an article, Accepting Revisions in Open XML Word-Processing Documents, which details the exact semantics for each of the elements that comprise revision tracking.
... However, many developers do not have the option of using LINQ to process XML, and instead must use one of a variety of implementations of XML DOM, such as System.Xml.XmlDocument in the .NET Framework, or an implementation of XML DOM for PHP. This post presents a bit of XmlDocument code to detect tracked revisions. The important parts are those that show which Open XML parts to process, and the XPath expression to detect tracked revision markup. Because the semantics of XPath and XML DOM Document are carefully defined, it is pretty easy to translate this code to another language and implementation of XML DOM Document."
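As a rough illustration of how that translation might look, here is a sketch using Python's built-in DOM (`xml.dom.minidom`). It is not the XmlDocument code from the post. Because minidom has no XPath engine, the sketch swaps the XPath query for a per-name `getElementsByTagNameNS` lookup; the helper `revision_xpath` only builds an expression of the general shape an XPath-capable DOM could evaluate. The element names are an illustrative subset, the part-name pattern assumes conventional part names, and all function names are hypothetical.

```python
import re
import zipfile
from xml.dom import minidom

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Illustrative subset of revision element local names (NOT the full list).
REVISION_LOCAL_NAMES = [
    "ins", "del", "moveFrom", "moveTo",
    "rPrChange", "pPrChange", "sectPrChange",
]

# Assumption: main document, headers, footers, footnotes and endnotes
# use their conventional part names inside the package.
PART_PATTERN = re.compile(
    r"word/(document|header\d*|footer\d*|footnotes|endnotes)\.xml"
)


def revision_xpath(names=REVISION_LOCAL_NAMES):
    """Build a union expression for DOMs that support XPath (not run here)."""
    return " | ".join(f"//w:{name}" for name in names)


def part_has_revisions(xml_bytes):
    """Return True if one part's XML contains any listed revision element."""
    dom = minidom.parseString(xml_bytes)
    try:
        return any(
            len(dom.getElementsByTagNameNS(W_NS, name)) > 0
            for name in REVISION_LOCAL_NAMES
        )
    finally:
        dom.unlink()  # release the DOM tree promptly


def docx_has_revisions(docx):
    """Check every candidate part in the package (path or file-like object)."""
    with zipfile.ZipFile(docx) as package:
        return any(
            part_has_revisions(package.read(name))
            for name in package.namelist()
            if PART_PATTERN.fullmatch(name)
        )
```

The two pieces that carry over to any other DOM implementation are the same ones the excerpt calls out: the list of parts to open and the set of element names to look for.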
Being in the biz that I'm in, Revisions/Track Changes in Word docs are a big deal to me. I can't tell you the number of times I've seen documents where unaccepted revisions revealed something that would have been best left unrevealed. From jokes to contract negotiations, Track Changes/Revisions can be a big deal. In many cases this "hidden" metadata (though in newer versions of Word, this hidden data is by default much more in your face... which is a good thing) should be scrubbed, removed, or at least acknowledged prior to the release of any Word document.
With the old binary version of Word documents (DOCs), the common means of doing this was automating Word. Yes, by the cringing and grimacing, I can see how much you liked doing that, especially in an automated, batch, or server scenario. In a word, ouch.
The beauty of the Open XML DocX/DocM format is that it's much easier to spelunk the documents now without using Word. DocX is just a Zip file with XML (and stuff). Sure it's super duper easy to walk a Word doc via its raw DocX XML, but have you ever looked at the binary Doc specification? In that respect, it's only about... um... about 1,000 times easier.
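If you want to see the "Zip file with XML" part for yourself, a few lines of Python are enough. This is only a sketch; the function name `list_xml_parts` is made up for the example, and it simply lists the XML and relationship parts inside the package without touching Word.

```python
import zipfile


def list_xml_parts(docx):
    """Return the sorted names of the XML and .rels parts in a DocX/DocM package.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as package:
        return sorted(
            name for name in package.namelist()
            if name.endswith((".xml", ".rels"))
        )
```

Calling `list_xml_parts("report.docx")` (any DocX you have handy) returns the part names, and each one can then be read with `zipfile.ZipFile.read` and handed to the XML parser of your choice.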
Eric's posts show how you can inspect Word DocX/DocM (DocM is a DocX with macros) for Revisions/Track Changes using either the Open XML SDK or just POX (plain old XML) techniques.
Opening OpenXML, the Open XML Package Editor Power Tool for Visual Studio 2010
Open XML 2.0 Code Snippets for VS2010 (and VS2008 too)
Open XML Format SDK 2.0 Code Snippets for Visual Studio 2008 – 52 C#/VB Code Snippets to help ease your Open XML coding
Open XML File Format Code Snippets for Visual Studio 2005 (Office 2007 NOT required)
Microsoft Office File Formats and Microsoft Office Protocols Documentation Refreshed
Microsoft Office File Formats and Protocols documentation updated for Office 2010 (Think “Now with added ‘X’ flavor… DocX, PptX, XlsX, etc”)
MS-PST file format specification released. Yep, the full and complete specification for Outlook PST’s is now just a download away.
Microsoft Office (DOC, XLS, PPT) Binary File Format Specifications Released – We’re talking the full technical specification… (The [MS-DOC].pdf alone is 553 pages of very dense specification information)
DOC, XLS and PPT Binary File Format Specifications Released (plus WMF, Windows Compound File [aka OLE 2.0 Structured Storage] and Ink Serialized Format Specifications and Translator to XML news)