In a recent article, “Content Aware Recovery and Data Carving Explained”, I wrote about some of the more advanced algorithms used in contemporary data recovery tools. While these methods go by different names – “file carving”, “signature search” or “content-aware recovery” – they share the same underlying principle. The differences between them are minor, yet they do affect the end result. In today’s post, we’re going to peek into the future.
The Limitations of Today’s Data Recovery Algorithms
If you follow this blog, you may already know that fragmentation is your worst enemy when it comes to recovering information. Indeed, while you can normally recover contiguous files with one of the simplest content-aware algorithms, you’re out of luck if that file’s content is scattered around the disk. Today’s content-aware algorithms are based on signature search, meaning they can only detect the beginning of a file in a format they are aware of. Whether or not the actual content belongs to the file remains out of their scope.
Let’s see, for example, how a typical signature-search based algorithm deals with a large MKV file. First, it bumps into the file’s header. The tool calculates the length of the MKV file and saves that many data blocks into a newly created file with the same name. Case closed, let’s move on to the next file. Some of the more intelligent algorithms will read the file system first, and only extract data blocks that don’t belong to any other file (and even this much effort is rare among today’s data recovery tools). But whatever they do, they simply cannot cope with disk fragmentation.
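To make the process concrete, here is a minimal sketch of signature-search carving. The EBML magic bytes are real (MKV files do start with them), but the fixed carve length is a deliberate simplification: a real tool would parse the container structure to compute the file’s actual size.

```python
# Minimal sketch of signature-search carving over a raw disk image.
# Assumption: we carve a fixed number of bytes after each signature;
# real tools parse the container format to determine the true length.
EBML_MAGIC = b"\x1a\x45\xdf\xa3"  # magic bytes that open an MKV/EBML file

def carve_by_signature(image: bytes, magic: bytes, length: int) -> list[bytes]:
    """Return every run of `length` bytes that begins with `magic`."""
    carved = []
    pos = image.find(magic)
    while pos != -1:
        carved.append(image[pos:pos + length])
        pos = image.find(magic, pos + 1)
    return carved
```

Note that the carver blindly copies whatever bytes follow the signature – which is exactly why fragmentation defeats it.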
A completely different, far more context-dependent approach would be required to actually carve that video file. If we had unlimited resources and cared about only that one file, we could have the algorithm read the file’s header, then attach the next data block and check whether the resulting file is still a valid video with all the correct frames. If it is, add the next block and validate again. If it becomes invalid AND we haven’t yet reached the required number of blocks, read the entire disk and try appending each data block to the end of the video file, checking the video for validity every time. This would be painfully slow, taking hours to recover a single file, but at the end we would have the best possible reconstruction of that file. We could do that. But only if we had unlimited resources and unlimited time. Today, such comprehensive carving methods are used only in intelligence and digital forensics work (and even there, only in very few cases).
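The brute-force strategy just described can be sketched as a greedy loop. Everything here is illustrative: `is_valid` stands in for a hypothetical format validator (for real video this would be a full decoder pass, which is what makes the approach so expensive), and the test uses numbered toy fragments rather than real media data.

```python
def exhaustive_carve(header: bytes, pool: list[bytes], is_valid, target_len: int) -> bytes:
    """Greedily rebuild a file: keep appending whichever remaining block
    keeps the file valid, until the target length is reached or no
    candidate block fits. `is_valid` is a hypothetical format checker."""
    data = header
    remaining = list(pool)
    while len(data) < target_len:
        for i, blk in enumerate(remaining):
            if is_valid(data + blk):      # re-validate after every trial append
                data += remaining.pop(i)  # keep the block, remove it from the pool
                break
        else:
            break  # no remaining block keeps the file valid; stop here
    return data
```

With a toy validator that accepts only the byte sequence 0, 1, 2, …, the loop reassembles fragments scattered out of order – at the cost of re-validating the file for every candidate block.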
What is your “next best bet” compared to true data carving? Quite possibly, many of today’s content-aware algorithms offer similar real-life performance by taking a few other things into account.
The File System
First and foremost, let’s not forget about the file system. Total data loss scenarios with empty (or erased) file systems are relatively rare. Indeed, wiping the entire file system can be a difficult and lengthy process. If, for example, you repartition the hard drive, the original file system will NOT be emptied. Instead, it will remain on the disk. If you format an NTFS volume, the file system (the $MFT file) will be deleted just like any other file would be. However, it will remain recoverable – just like any other file would! As a result, recovering the original file system (by using content-aware search, or carving) is the highly recommended first step available in pretty much every high-end data recovery tool on the market. With the file system at its disposal, even a damaged or corrupted one, a data recovery tool can restore most files without even using the signature search.
If you have the file system available, content-aware analysis becomes much easier. First, the tool separates existing files and folders that are referenced by the file system from empty space. Second, the signature search algorithm treats all unreferenced data blocks as contiguous space, reconstructing deleted files from blocks that aren’t taken by any other file. While the above paragraph is, of course, something of an oversimplification (let’s not forget about sparse files, for example), treating unallocated disk sectors as one contiguous area instead of mixing them together with existing files goes a long way toward recovering your data.
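Under the simplifying assumptions above, the file-system-aware pass is easy to sketch: take the set of block numbers the (recovered) file system claims, and hand everything else to the signature-search stage as one contiguous region. The function name and block-list representation are illustrative, not taken from any particular tool.

```python
def unallocated_region(blocks: list[bytes], allocated: set[int]) -> bytes:
    """Join every block the file system does not reference, so the
    carver can treat the free space as one contiguous area.
    `allocated` holds the block numbers claimed by existing files."""
    return b"".join(blk for i, blk in enumerate(blocks) if i not in allocated)
```

The carver then runs its signature search over the returned region only, instead of over the whole disk.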
Data Carving Today: Text Files
Did you know that you might be using true data carving even today? While maintaining context throughout the entire data recovery process and performing a global search for individual fragments is still not feasible for all but the most demanding applications and labor-intensive jobs, true data carving is already available for at least one type of data: text.
Text files don’t have headers. In fact, the structure of text files differs very little from the structure of random binary data, with one exception. Text files use limited character sets. What does that mean in the context of data recovery? This means that the tool can look for disk clusters that only contain characters falling within the range 0-9, A-Z, whitespace and a handful of special characters. If a cluster containing this limited character set is detected, it is treated as the beginning of a text file. Subsequent clusters are analyzed and, if their content also falls within the limited character range, they are appended to the text file. The process continues until the algorithm encounters a cluster that no longer falls into the “text” category, after which the text file is saved, the context is reset, and the tool starts looking for other types of data.
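The cluster-by-cluster text scan just described can be sketched like this. The character set follows the article’s simplified ASCII range (plus lowercase letters and punctuation), so – as discussed further on – it would miss accented and non-Latin text.

```python
import string

# Simplified "text" byte set: letters, digits, punctuation, whitespace.
# Assumption: a real tool would support configurable character sets.
TEXT_BYTES = frozenset((string.ascii_letters + string.digits +
                        string.punctuation + " \t\r\n").encode())

def is_text_cluster(cluster: bytes) -> bool:
    """A cluster counts as text if every byte falls in the text set."""
    return bool(cluster) and all(b in TEXT_BYTES for b in cluster)

def carve_text(clusters: list[bytes]) -> list[bytes]:
    """Append consecutive text clusters to the current file; a non-text
    cluster closes the file and resets the context."""
    files, current = [], []
    for cluster in clusters:
        if is_text_cluster(cluster):
            current.append(cluster)
        elif current:
            files.append(b"".join(current))
            current = []
    if current:
        files.append(b"".join(current))
    return files
```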
As you can see, this approach is as close to actual data carving as it gets in today’s tools. It guarantees that all bits and pieces of text-based data will be extracted and saved into individual text files. Even if your hard drive is heavily fragmented, you’ll still get all the bits and pieces, and may be able to manually merge them back into your original text files.
Now, of course, things aren’t that rosy in real life. The clean 0-9, A-Z range is only typical of English and several other languages, while other languages use accents and non-Latin characters that could cause an improperly written tool to ignore text in one of those languages. There are also fixed-width two-byte encodings such as UTF-16 and variable-length encodings such as UTF-8 that require a different approach altogether. Even in this simplest of cases, different data recovery tools employ many tricks and workarounds such as detecting the system’s regional settings (which may or may not correspond to those of the hard drive being recovered!), requiring the user to manually specify the character set (or character sets) to look for, or only looking for encodings popular in a certain geographical region (e.g. Latin, West European, East European and Cyrillic character sets).
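Variable-length encodings can at least be sanity-checked cheaply: UTF-8 has strict byte-sequence rules, so a cluster can be tested simply by attempting to decode it. This is a rough heuristic sketch, not how any particular tool does it – note the boundary problem flagged in the comment.

```python
def looks_like_utf8(cluster: bytes) -> bool:
    """Rough heuristic: valid UTF-8 byte sequences decode cleanly.
    Caveat: a multi-byte character cut at a cluster boundary will make
    genuine UTF-8 text fail this check, so a production tool would need
    to carry decoder state across clusters."""
    try:
        cluster.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```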
Some more advanced tools use neural networks to detect the actual encoding of a given set of characters. Ever used Google Translate or Bing Translator? Paste some foreign text into the translation window, and if you enter more than a few words you’ll see the tool recognize the input language automatically. While language detection like that is probably overkill for a data recovery tool, actual character sets can be detected quite reliably with relatively simple algorithms.
A similar technique can be used to discover XML and HTML files and some RTF documents.