If you are shopping for a data recovery tool, you have probably seen manufacturers mention things like “file carving”, “signature search” or “content-aware recovery”. What are these, is there any difference between these technologies, and do they really help recover more data? Read this article to find out!
- Discovering Deleted Files with Signature Search
- Problems of Content-Aware Data Recovery
- Sparse Files
- No File System?
Discovering Deleted Files with Signature Search
If you follow our blog, you have probably already read the article “Why and How Deleted Files Can Be Undeleted”. There we talk at length about how Windows and other operating systems delete files, what happens to their content and what can be done to recover the data, and we briefly mention content-aware recovery as one of the possible methods. This more detailed review looks at how content-aware recovery actually works. After all, today’s content-aware recovery algorithms are much more complex than they used to be, and do a lot more than the simple match-identify-extract routine our overview article might have suggested.
All content-aware recovery tools, without exception, are based on signature search. This is the same technique anti-virus tools use to identify known viruses hiding in files. Generally speaking, “signature search”, “content-aware recovery” and “file carving” mean the same thing. However, particular implementations of this technology may differ significantly among products.
Unlike anti-virus tools, though, signature search recovery algorithms scan unallocated (empty) disk space instead of existing files. (Under certain circumstances they’ll perform a full disk scan; more on that later.) For now, let’s say we are looking for a file that has only just been deleted.
When looking for deleted files, the signature search algorithm will first analyze the file system to discover which blocks on the disk are empty (not occupied with existing data). Note that this is only possible if the file system is still available. If you’ve formatted or repartitioned the disk, the original file system may not be available, and signature search will need to analyze all disk sectors without exception.
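The first step described above, working out which blocks are free, can be sketched as follows. This is a minimal illustration that assumes a simple allocation bitmap with one bit per cluster (least significant bit first, 1 meaning "in use"); the function name `free_runs` and the demo volume are hypothetical, and real file systems need their own parsers.

```python
def free_runs(bitmap: bytes, total_clusters: int):
    """Yield (first_cluster, length) for each run of unallocated clusters.

    Assumes one bit per cluster, least significant bit first, 1 = in use.
    """
    run_start = None
    for cluster in range(total_clusters):
        in_use = (bitmap[cluster // 8] >> (cluster % 8)) & 1
        if not in_use and run_start is None:
            run_start = cluster                      # a free run begins here
        elif in_use and run_start is not None:
            yield run_start, cluster - run_start     # the free run just ended
            run_start = None
    if run_start is not None:
        yield run_start, total_clusters - run_start  # free run reaches the end


# Hypothetical 16-cluster volume: clusters 0-2 and 8-11 are in use.
demo_bitmap = bytes([0b00000111, 0b00001111])
runs = list(free_runs(demo_bitmap, 16))
print(runs)
```

Only the cluster ranges returned here would be handed to the signature scanner. If no usable bitmap exists, the scanner has to fall back to reading every sector.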
So, the algorithm reads a data block off the disk that is not claimed by any file. What happens next? The algorithm scans that block to find out whether it can be identified as the beginning of a file in a known format. This is done by matching the data against the list of signatures stored in the tool’s built-in database (hence the name “signature search”). If a known signature is detected, the tool… happily reports that it found a file? Not so fast. Before that, the algorithm will perform at least one secondary check, and sometimes several. For example, if the tool discovers the signature %PDF-, it won’t immediately report that it has found a PDF file. After all, this very article contains the string %PDF- more than once, and it is not a PDF file by any means! So the algorithm analyzes the rest of the data block to see whether it matches the structure of that file format. This could mean, for example, that certain data must be stored at a certain offset: a checksum, the file’s length, a date/time stamp, or something else. Some file formats require several such checks, and even then detection may remain unreliable.
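The match-then-verify idea can be sketched as below. The signature table and the two secondary checks are illustrative assumptions, not the database of any particular product; a real tool would know far more formats and run stricter structural checks.

```python
import re

# Hypothetical signature table: format name -> magic bytes expected at offset 0.
SIGNATURES = {
    "pdf": b"%PDF-",
    "png": b"\x89PNG\r\n\x1a\n",
    "zip": b"PK\x03\x04",
}

def looks_like_pdf(block: bytes) -> bool:
    # Secondary check: the magic must be followed by a version such as "1.7".
    return re.match(rb"%PDF-\d\.\d", block) is not None

def looks_like_png(block: bytes) -> bool:
    # Secondary check: the first chunk after the 8-byte magic is a 13-byte IHDR.
    return block[8:16] == b"\x00\x00\x00\rIHDR"

SECONDARY_CHECKS = {"pdf": looks_like_pdf, "png": looks_like_png}

def identify(block: bytes):
    """Return the format name if the block starts a known file, else None."""
    for name, magic in SIGNATURES.items():
        if block.startswith(magic):
            check = SECONDARY_CHECKS.get(name)
            # Formats without a secondary check are accepted on magic alone.
            if check is None or check(block):
                return name
    return None

print(identify(b"%PDF-1.7\n1 0 obj"))             # real-looking PDF header
print(identify(b"%PDF- appears in this article")) # magic only, fails the check
```

The second call shows why the extra check matters: a text block that merely starts with the magic bytes is rejected because no version number follows.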
So let’s say the data block definitely belongs to a known file (say, a PDF document). What happens next?
The next step is discovering the length of the file. For many formats, the length can be calculated by further analyzing information available in the file’s header. Once we know the length, we can easily calculate where the file begins and ends, that is, the first and the last sectors belonging to that file… or can we?
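As a sketch of the length calculation, take the BMP format, whose header stores the total file size as a little-endian 32-bit integer at offset 2. The 512-byte sector size, the function name `bmp_extent` and the sample values are assumptions for illustration, and the result is only meaningful if the file occupies contiguous sectors.

```python
import struct

SECTOR_SIZE = 512  # assumed sector size

def bmp_extent(header: bytes, start_sector: int):
    """Return (file_length, first_sector, last_sector) for a BMP header.

    Valid only if the file is stored in one contiguous run of sectors.
    """
    if header[:2] != b"BM":
        raise ValueError("not a BMP header")
    # Bytes 2..5 of a BMP file hold the total file size, little-endian.
    (length,) = struct.unpack_from("<I", header, 2)
    if length == 0:
        raise ValueError("header reports zero length")
    sectors = -(-length // SECTOR_SIZE)  # ceiling division
    return length, start_sector, start_sector + sectors - 1

# Hypothetical header found at sector 2048, claiming a 1300-byte file.
header = b"BM" + struct.pack("<I", 1300) + bytes(8)
extent = bmp_extent(header, 2048)
print(extent)  # (1300, 2048, 2050)
```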
Problems of Content-Aware Data Recovery
Signature search recovery is only straightforward if you are recovering a single, non-fragmented file from a contiguous chunk of free space. In real life this rarely happens. More often than not, you’ll have gaps of free space scattered around the disk. These gaps lie between the files. Their size and location depend on many factors such as the file system used, the version of Windows, the amount of free disk space, and whether or not you’ve recently defragmented that volume. As a result, recovering a deleted file with content-aware analysis becomes something of a lottery. If a file is small, or if it is stored in a contiguous chunk of free space, that file can be recovered in its entirety. If, however, the file was fragmented in the first place, different signature search algorithms behave quite differently.
Most file carving algorithms will scan the file system first, then only read unallocated (empty) disk space. This is one of the better methods for recovering fragmented files, as it ignores parts of the disk occupied by existing data, concentrating instead on empty areas that are likely to contain information that used to be part of deleted files.
Other algorithms will ignore the file system and scan the complete disk area. This seemingly “dumb” approach may not deliver the best results if the file being recovered was fragmented in the first place. However, it may produce better results on file systems such as NTFS (as well as ext2/ext3/ext4 used in Linux) that use so-called sparse files. Sparse files attempt to use file system space more efficiently when the system creates a new large file. Instead of allocating the required disk space (and actually writing zeroes to the disk, which *is* slow when you’re creating e.g. a file to contain a long movie), sparse files allow the system to write only brief information (metadata) representing the empty blocks instead of the actual “empty” space. As a result, blocks of data are written to disk only when they contain real (non-empty) data.
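The effect is easy to observe with a short experiment. The sketch below seeks far past the start of a new file and writes a single byte. Whether the skipped range is really left unallocated depends on the operating system and file system, so treat the allocated figure as indicative only; the file name is hypothetical, and `st_blocks` is not reported on Windows.

```python
import os
import tempfile

def make_sparse_file(path: str, logical_size: int) -> None:
    """Create a file of logical_size bytes with one real byte at the very end."""
    with open(path, "wb") as f:
        f.seek(logical_size - 1)  # skip ahead without writing anything
        f.write(b"\x01")          # only this byte is real data

def sizes(path: str):
    """Return (logical_size, allocated_bytes), or None where not reported."""
    st = os.stat(path)
    blocks = getattr(st, "st_blocks", None)  # POSIX only, 512-byte units
    return st.st_size, None if blocks is None else blocks * 512

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "movie.bin")  # hypothetical file name
    make_sparse_file(path, 16 * 1024 * 1024)
    logical, allocated = sizes(path)
    print("logical size:", logical, "allocated:", allocated)
```

On a file system that supports holes, the allocated size is typically a few kilobytes even though the logical size is 16 MiB.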
What does that mean in the context of content-aware recovery? It simply means that ignoring allocated (yet non-empty) disk space would be a mistake. Ideally, a carving tool should be able to recognize sparse files. In reality, few if any tools do. As a result, attempting to carve both allocated and unallocated sectors may lead to better results than carving unallocated space only.
The other problem is fragmentation. In real-life data recovery scenarios, fragmentation can be your worst enemy. As we already discussed in the very comprehensive article “Recovering Fragmented Files”, there is no single, all-in-one solution to this problem. All data recovery algorithms treat fragmentation differently, attempting to recover as much usable information as possible. If you are interested in this issue, please see that article to read more.
No File System?
What should a data recovery tool do if there is no file system on the volume? While you can always use signature search to perform content-aware recovery, this may not be the optimal choice. Instead, some of the more advanced tools such as Partition Recovery will attempt to locate and recover the original file system first. Why is this possible?
There could of course be different scenarios. For example, if an NTFS disk is formatted (or if the disk is repartitioned), the original file system records may be preserved, because they are stored on the disk as a file. Yes, the records describing every file on the volume are kept in nothing more than a file named $MFT (the Master File Table). So if a data recovery tool can carve (and recover) that one file first, recovering other files from that partition becomes so much easier. Not every tool on the market can do that, and even among those that can, this functionality is usually reserved for flagship tools or the most expensive editions. If you need a tool that can do that, try Partition Recovery.
In the real world, content-aware recovery works pretty well. However, if you absolutely need that one file and your favorite data recovery tool does not help, you can try a different tool that uses a different approach to see if that one can access your file.
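Carving the $MFT relies on the same signature idea: NTFS file records conventionally begin with the ASCII signature FILE. The sketch below scans a raw image for sector-aligned blocks that start with it. The 1024-byte record size is a common value assumed here rather than read from the boot sector, the function name is hypothetical, and a real tool would validate each record much more thoroughly before trusting it.

```python
RECORD_SIZE = 1024  # common MFT record size; assumed, not read from the boot sector
SECTOR_SIZE = 512   # assumed sector size

def find_mft_record_offsets(image: bytes):
    """Return offsets of sector-aligned blocks that start with b'FILE'."""
    offsets = []
    for off in range(0, len(image) - RECORD_SIZE + 1, SECTOR_SIZE):
        if image[off:off + 4] == b"FILE":
            offsets.append(off)
    return offsets

# Hypothetical image: 2 KiB of zeroes, two fake records, 1 KiB of zeroes.
fake_record = b"FILE" + bytes(RECORD_SIZE - 4)
image = bytes(2048) + fake_record * 2 + bytes(1024)
offsets = find_mft_record_offsets(image)
print(offsets)  # [2048, 3072]
```

A run of consecutive hits spaced one record apart is a strong hint that a surviving Master File Table has been found.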