Back in 2006, when Microsoft released Office 2007, a new version of its Office suite, the company introduced a range of new file formats based on Open XML that would eventually replace the proprietary Microsoft document containers.
The new Open XML file formats were designed to simplify the exchange of data between the different Office applications. They were meant to replace the old, proprietary file formats that Microsoft Office applications had used since the early days. Based on open standards, these XML file formats enable the rapid creation of documents from different data sources and speed up document assembly, data mining, and content reuse.
Interestingly, Microsoft submitted the Open XML formats for ISO/IEC standardization. The formats were later published as the ISO/IEC 29500 Office Open XML File Formats standard.
The Difference Between New and Old Document Formats
The old document formats (*.doc for Microsoft Word, *.xls for Microsoft Excel, *.ppt for PowerPoint) used before Office 2007 were Microsoft proprietary OLE containers that had a strict structure, packing text, formatting, and any embedded objects into a single file. In contrast, the new Open XML format is based on a number of individual files packed into a ZIP container.
Yes, you read it right: the new Microsoft Office documents are collections of files (XML, images and OLE objects) packed into a ZIP archive and given a new extension (*.docx for Word 2007-2013 documents, *.xlsx for Excel spreadsheets, *.pptx for PowerPoint slides, etc.).
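To make the "collection of files in a ZIP archive" idea concrete, here is a minimal Python sketch that lists the parts inside such a container. It builds a tiny stand-in archive in memory that only mimics the layout of a *.docx file (it is not a valid Word document); the helper names list_parts and build_sample_container are made up for this illustration. With a real document you would pass the file's bytes instead.

```python
import io
import zipfile


def list_parts(data: bytes) -> list:
    """Return the names of all files stored inside an Open XML container."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


def build_sample_container() -> bytes:
    """Build a tiny ZIP that mimics a .docx layout (illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("_rels/.rels", "<Relationships/>")
        archive.writestr("word/document.xml", "<document>Hello</document>")
        archive.writestr("word/media/image1.png", b"\x89PNG placeholder")
    return buffer.getvalue()


sample = build_sample_container()
parts = list_parts(sample)

# The container is an ordinary ZIP file and starts with the "PK" signature.
is_zip = zipfile.is_zipfile(io.BytesIO(sample))

for name in parts:
    print(name)
```

Renaming a real *.docx file to *.zip and opening it in any archive manager shows the same kind of listing.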
What does that mean from the data recovery point of view? The new modular data storage allows Office applications to open files even if one or more components within the ZIP container (e.g. images, OLE objects, charts or tables) are damaged.
Think of it for a moment. A typical Word document has some formatted text and a bunch of embedded objects (charts, pictures, etc.). The embedded objects will normally take much more disk space than the text alone, while being much easier to replace should they become corrupted. The text, on the other hand, is difficult to replace (it has to be retyped) but takes very little space inside the ZIP archive.
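Because every part is stored and checksummed separately, damage can be located part by part. The sketch below illustrates the idea in Python under simple assumptions: the ZIP directory itself is intact, and the damage is simulated by overwriting a few bytes of the picture in a stand-in archive. The check_parts helper is hypothetical and does not show how Office itself handles damaged files.

```python
import io
import zipfile
import zlib


def check_parts(data: bytes) -> dict:
    """Map each part name to True if it decompresses and passes its CRC check."""
    status = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            try:
                archive.read(name)  # reading the whole part verifies its CRC-32
                status[name] = True
            except (zipfile.BadZipFile, zlib.error, EOFError):
                status[name] = False
    return status


# Stand-in container: the parts are stored uncompressed so that the
# simulated damage below is easy to follow.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
    archive.writestr("word/document.xml", "<document>Important text</document>")
    archive.writestr("word/media/image1.png", b"IMAGE-DATA-0123456789")

# Simulate partial overwriting: change five bytes inside the picture only.
damaged = buffer.getvalue().replace(b"IMAGE-DATA", b"XXXXX-DATA", 1)

report = check_parts(damaged)
for name, ok in report.items():
    print(name, "OK" if ok else "DAMAGED")
```

In this example the text part still reads cleanly while the picture is flagged as damaged, which is exactly the situation where the text remains recoverable.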
Microsoft lists the following benefits of the new format:
- Small and compact files. Documents in the Office 2007-2016 format are ZIP-compressed, taking much less space on the disk compared to the older document format.
- Damaged files can be more easily recovered. Yes, recoverability has finally become a virtue. Modular data storage naturally enables an easy path for repairing damaged documents, and even allows opening files when one or more components are damaged or missing.
- Increased safety. In the old format, embedded code such as Visual Basic for Applications (VBA) scripts was kept within the main file, making it non-trivial to detect or block. The new format stores all executable code in a separate section within the file, so that macros can be easily identified and blocked if needed.
- Better transparency and security. Personally identifiable information such as user names, comments, tracked changes etc. is stored in separate files within the container, and can be easily removed if needed.
- Open format. The Open XML format is now a standard that can be used by third-party developers without the need for reverse engineering. This is extremely important when it comes to data recovery.
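The safety and transparency points above can be illustrated with a short Python sketch that looks for macro and metadata parts by name. The part names used here (vbaProject.bin, docProps/core.xml, docProps/app.xml, word/comments.xml) are the conventional defaults; strictly speaking, the package's relationship files decide where each part lives, so treat a name-based check as a simplifying assumption. The find_flagged_parts helper and the sample archive are made up for this example.

```python
import io
import zipfile

# Conventional names of parts that carry document properties and comments.
METADATA_PARTS = {
    "docProps/core.xml",
    "docProps/app.xml",
    "word/comments.xml",
}


def find_flagged_parts(data: bytes) -> dict:
    """Report parts that usually hold macros or personal metadata."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = archive.namelist()
    return {
        "macros": [n for n in names if n.lower().endswith("vbaproject.bin")],
        "metadata": [n for n in names if n in METADATA_PARTS],
    }


# Stand-in for a macro-enabled document (illustration only, not a valid file).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document/>")
    archive.writestr("word/vbaProject.bin", b"placeholder for a VBA project")
    archive.writestr("docProps/core.xml", "<coreProperties/>")
    archive.writestr("docProps/app.xml", "<Properties/>")

found = find_flagged_parts(buffer.getvalue())
print(found)
```

Note that simply deleting such parts from a real document would leave dangling references in the package, so actual cleanup is best left to the Office application or a dedicated tool.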
Office Documents: Recover or Repair?
There is still some confusion between recovering office documents and repairing them. When we talk about recovery, we mean scanning the disk (or analyzing the file system) to identify files that are missing or have been deleted. One can recover documents from formatted, repartitioned or corrupted storage devices.
Unfortunately, recovery is almost never 100% successful. One or more documents could have parts that have been overwritten with new data. If this is the case, the document will not open in the Office application, and will have to be repaired (or rebuilt) to conform to its format. Let’s see how recovering and repairing Office 2007, 2010, 2013 and 2016 documents works.
Recovering Office 2007/2010/2013/2016 Documents
As a data recovery tool sees it, the new Open XML document format is essentially a collection of XML, pictures and metadata files stored inside of a ZIP archive. Pretty much every data recovery tool can deal with ZIP archives, so there is generally no problem in identifying the beginning of a file, calculating its length and finally saving the file. It’s just a standard ZIP file. The ZIP format is well-known and perfectly documented; there are no nasty surprises here.
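As a rough illustration of "identifying the beginning of a file and calculating its length", the Python sketch below carves a ZIP archive out of a raw byte buffer. It looks for the local file header signature (PK\x03\x04), then for an End of Central Directory record (PK\x05\x06) whose recorded directory size and offset are consistent with the assumed start. This is a simplified sketch, not how any particular recovery tool works: it assumes a single, contiguous, non-ZIP64 archive, and the carve_zip name and the fake "disk" buffer are invented for the example.

```python
import io
import struct
import zipfile

LOCAL_HEADER = b"PK\x03\x04"  # signature at the start of the first entry
EOCD = b"PK\x05\x06"          # End of Central Directory record signature
EOCD_SIZE = 22                # fixed part of the record, without the comment


def carve_zip(raw: bytes):
    """Return (start, length) of the first consistent ZIP archive, or None."""
    start = raw.find(LOCAL_HEADER)
    if start < 0:
        return None
    pos = raw.find(EOCD, start)
    while pos >= 0:
        if pos + EOCD_SIZE <= len(raw):
            # Central directory size, its offset, and the comment length.
            cd_size, cd_offset, comment_len = struct.unpack_from("<IIH", raw, pos + 12)
            end = pos + EOCD_SIZE + comment_len
            # The directory must end exactly where this record begins.
            if start + cd_offset + cd_size == pos and end <= len(raw):
                return start, end - start
        pos = raw.find(EOCD, pos + 1)
    return None


# Stand-in archive surrounded by unrelated bytes, imitating a raw disk area.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document>Hello</document>")
archive_bytes = buffer.getvalue()
disk = b"\x00" * 512 + archive_bytes + b"\xff" * 300

carved = carve_zip(disk)
start, length = carved
with zipfile.ZipFile(io.BytesIO(disk[start:start + length])) as recovered:
    recovered_names = recovered.namelist()
print(carved, recovered_names)
```

The consistency check on the directory offset is what keeps a stray PK\x05\x06 byte sequence inside compressed data from being mistaken for the end of the file.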