This post continues our series of articles about the internal mechanisms of today’s data recovery tools.
In “How Data Recovery Works”, we looked at how file recovery tools can recover deleted files by using the file system. But what if the file was deleted a long time ago, and its file system record no longer exists? Or what if the disk was formatted or repartitioned and the file system is empty or missing? Finally, what if the file system was overwritten by another one (for example, a Linux file system created when you experimented with an alternative OS such as Ubuntu)? In these cases, traditional file recovery tools will fail to recover anything.
In order to recover information in such situations, manufacturers of data recovery tools invented a set of so-called ‘carving’ algorithms based on signature search. As opposed to traditional file recovery methods relying on the file system, data carving works by reading the complete surface of the hard drive (or scanning the entire content of flash-based media). While scanning the disk, data carving algorithms look for characteristic signatures (hence the name “signature search”) identifying known file formats. This is very similar to how anti-virus tools work, scanning files and looking for patterns of code to identify viruses.
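The scanning step described above can be sketched in a few lines of Python. This is only an illustration, not how any particular product works: the SIGNATURES table and the find_signatures function are hypothetical names, the table lists just three well-known magic numbers, and a real tool would read the disk in chunks instead of holding everything in memory.

```python
# Hypothetical signature table with a few well-known magic numbers.
SIGNATURES = {
    "jpg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "zip": b"PK\x03\x04",
}


def find_signatures(data, signatures=SIGNATURES):
    """Return a sorted list of (offset, kind) for every signature hit in data."""
    hits = []
    for kind, magic in signatures.items():
        pos = data.find(magic)
        while pos != -1:
            hits.append((pos, kind))
            # Continue searching just past the current hit.
            pos = data.find(magic, pos + 1)
    return sorted(hits)


# A tiny stand-in for raw disk content: padding, a JPEG start, junk, a ZIP start.
raw = b"\x00" * 10 + b"\xff\xd8\xff\xe0" + b"junk" + b"PK\x03\x04" + b"\x00" * 4
hits = find_signatures(raw)
```

Each hit is only a candidate at this point. Whether it really marks the start of a file still has to be confirmed by looking at the data that follows the signature.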
For example, ZIP files normally start with “PK” followed by binary data in a pre-defined format. By analyzing that binary data, a carving algorithm can tell if that “PK” signifies the beginning of a ZIP file (if all the numbers line up), or if it’s just a “PK” that was typed in a document such as this one.
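As a rough sketch of what “the numbers line up” could mean, the Python snippet below unpacks the 30-byte local file header that follows a ZIP entry signature and applies a few plausibility checks. The function name looks_like_zip_entry, the set of accepted compression methods and the specific checks are assumptions chosen for illustration; real carving tools are likely to apply stricter and more numerous tests.

```python
import struct

# Fixed part of a ZIP local file header (30 bytes, little-endian):
# signature, version, flags, method, mod time, mod date,
# CRC-32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")

# Assumption: accept only a handful of common compression methods
# (0 = stored, 8 = deflate, 9 = deflate64, 12 = bzip2, 14 = LZMA).
KNOWN_METHODS = {0, 8, 9, 12, 14}


def looks_like_zip_entry(data, offset):
    """Heuristically decide whether a ZIP local file header starts at offset."""
    if offset + LOCAL_HEADER.size > len(data):
        return False
    (sig, _version, _flags, method, _mtime, _mdate,
     _crc, _csize, _usize, name_len, extra_len) = LOCAL_HEADER.unpack_from(data, offset)
    if sig != b"PK\x03\x04":
        return False
    if method not in KNOWN_METHODS:
        return False
    # The file name must exist and, together with the extra field, fit in the data.
    end_of_names = offset + LOCAL_HEADER.size + name_len + extra_len
    if name_len == 0 or end_of_names > len(data):
        return False
    return True


# A hand-built header for a 5-byte deflated entry named "a.txt".
real = LOCAL_HEADER.pack(b"PK\x03\x04", 20, 0, 8, 0, 0, 0, 5, 5, 5, 0) + b"a.txt" + b"hello"
# Plain text that merely contains the letters "PK".
text = b"PK that was typed in a document such as this one"
```

Here looks_like_zip_entry(real, 0) accepts the hand-built header, while the plain text is rejected because the bytes after “PK” do not form a valid signature.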
If the signature is confirmed to belong to an actual file, the algorithm goes on to analyze the file header. For many formats, the header allows the data recovery program to calculate the original length of the file. Knowing the starting address of the file on the disk and the length of that file, the tool can determine exactly which sectors hold that file’s data, read them and reassemble the original file.
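To make the length calculation concrete, here is a minimal Python sketch that uses the BMP format, chosen only because its header stores the total file size directly in the four bytes after the “BM” signature. The 512-byte sector size and the helper names bmp_length, sectors_for and carve are assumptions made for this example; other formats need their own, often more involved, length logic.

```python
import struct

SECTOR_SIZE = 512  # assumed sector size for this sketch


def bmp_length(data, offset):
    """Return the file size stored in a BMP header, or None if it is implausible."""
    if data[offset:offset + 2] != b"BM" or offset + 6 > len(data):
        return None
    # Bytes 2..5 of a BMP file hold its total size as a little-endian integer.
    (size,) = struct.unpack_from("<I", data, offset + 2)
    if size < 14 or offset + size > len(data):
        return None
    return size


def sectors_for(offset, length, sector_size=SECTOR_SIZE):
    """Return the range of sector numbers touched by a contiguous file."""
    first = offset // sector_size
    last = (offset + length - 1) // sector_size
    return range(first, last + 1)


def carve(data, offset, length):
    """Copy a contiguous run of bytes out of the raw image."""
    return data[offset:offset + length]


# Fake disk: two empty sectors, a 600-byte "BMP", then unrelated trailing bytes.
disk = bytes(1024) + b"BM" + struct.pack("<I", 600) + bytes(594) + bytes(100)
length = bmp_length(disk, 1024)
used = list(sectors_for(1024, length))
recovered = carve(disk, 1024, length)
```

In this example the file starts at byte 1024 and is 600 bytes long, so it occupies sectors 2 and 3 of the fake disk. Note that the sketch assumes the file is stored in one contiguous run.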
File carving: issues and problems
Do you see the problems with this approach? There are at least two. First, without the file system there is no way to discover the original name of the file, so recovered files are saved under generated names such as “image0001.jpg” or “document012.doc”. The second problem has to do with disk fragmentation. If a file is not stored in one contiguous chunk (which is common for larger files), file carving alone will be unable to recover the complete file.
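The fragmentation problem is easy to reproduce with a toy example. In the Python sketch below, the disk layout and the naive_carve helper are made up for illustration: a 20-byte file is stored in two fragments with another file’s data in between, and a carver that assumes contiguous storage reads the right number of bytes but the wrong content.

```python
def naive_carve(disk, offset, length):
    """Carve a file assuming its bytes are stored contiguously."""
    return disk[offset:offset + length]


# A 20-byte file: a 4-byte "header" followed by two 8-byte halves.
original = b"HEAD" + b"A" * 8 + b"B" * 8
# Eight bytes belonging to some other file.
other = b"x" * 8

# Fragmented layout: first 12 bytes of the file, the other file, then the rest.
disk = original[:12] + other + original[12:]

carved = naive_carve(disk, 0, len(original))
```

The carved result has the expected length and a valid beginning, but its last eight bytes come from the other file, so the recovered file is corrupted.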
To solve these problems, developers combine information obtained from the file system with data discovered by file carving algorithms. This approach offers the best of both worlds: reliable recovery, complete with file names, regardless of the level of disk fragmentation.