Content-aware search uses a signature search algorithm to identify and locate files of certain types. In general, a persistent file signature is used to detect that a file exists, and header analysis is then performed to determine the length of the file.
However, there are some exceptions to this rule. In this article, we’ll have a look at two extremes: a binary file format with a highly persistent structure, and a text format with no structure at all.
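The signature-first step described above can be sketched as a simple scan over a raw buffer. This is only an illustration: the signature table, the `scan_signatures` name and the choice of formats are assumptions made for the example, not the method of any particular recovery tool.

```python
# Hypothetical signature table: a few well-known leading byte sequences.
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
    "pdf": b"%PDF-",
}


def scan_signatures(data):
    """Return a sorted list of (offset, format_name) for every signature hit."""
    hits = []
    for name, sig in SIGNATURES.items():
        pos = data.find(sig)
        while pos != -1:
            hits.append((pos, name))
            pos = data.find(sig, pos + 1)
    return sorted(hits)


if __name__ == "__main__":
    sample = b"\x00" * 10 + b"%PDF-1.4" + b"\x00" * 5 + b"\xff\xd8\xff\xe0"
    print(scan_signatures(sample))
```

Each hit is only a candidate starting point; the length of the file still has to come from header analysis.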
Detecting JPEG Images
JPEG files are easy to identify and easy to analyze. The format is well documented, so parsing a file header is generally not a problem. Let’s look, for example, at a typical JPEG file.
JPEG files have a characteristic signature and a highly structured format. Every JPEG file begins with the hexadecimal value FFD8 (the start-of-image marker) and ends with FFD9 (the end-of-image marker). The same pair of markers can appear several more times inside the file, where it delimits embedded thumbnail previews of various sizes.
For example, the Canon EOS 5D creates JPEG files with the following structure:
FFD8 – the beginning of the file
FFD8 – first thumbnail preview
FFD9 – end of first preview
FFD8 – second thumbnail preview
FFD9 – end of second preview
FFD9 – end of file
As you can see, simply detecting fixed signatures is not enough: a tool that stops at the first FFD9 would cut the file off at the end of the first thumbnail preview. The program must analyze the file header and follow the actual file structure. If the information stored in the file header does not match the content that follows, the recovered file may come out corrupted. Corrupted images can be recovered with a specialized tool such as RS File Repair.
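One way to follow the file structure rather than trusting the first FFD9 is to walk the JPEG marker segments using their length fields. The sketch below assumes the common layout in which thumbnails are stored inside application segments, so skipping a segment by its declared length also skips any FFD8/FFD9 pairs embedded in it. The function names are hypothetical, and the code is a simplified illustration rather than a complete JPEG parser.

```python
import struct

SOI = b"\xff\xd8"  # start of image
EOI = b"\xff\xd9"  # end of image


def _skip_scan(buf, pos):
    """Skip entropy-coded data; return the offset of the next real marker."""
    while True:
        i = buf.find(b"\xff", pos)
        if i < 0 or i + 1 >= len(buf):
            return None
        nxt = buf[i + 1]
        if nxt == 0x00 or 0xD0 <= nxt <= 0xD7:
            pos = i + 2  # stuffed byte or restart marker: still scan data
        elif nxt == 0xFF:
            pos = i + 1  # fill byte
        else:
            return i


def jpeg_length(buf, start=0):
    """Return the length of the JPEG starting at `start`, or None if invalid."""
    if buf[start:start + 2] != SOI:
        return None
    pos = start + 2
    while pos + 2 <= len(buf):
        if buf[pos] != 0xFF:
            return None
        marker = buf[pos + 1]
        if marker == 0xFF:  # fill byte before a marker
            pos += 1
            continue
        if marker == 0xD9:  # top-level end of image
            return pos + 2 - start
        if marker == 0x01 or 0xD0 <= marker <= 0xD7:  # markers without length
            pos += 2
            continue
        if pos + 4 > len(buf):
            return None
        seglen = struct.unpack(">H", buf[pos + 2:pos + 4])[0]
        if seglen < 2:
            return None
        pos += 2 + seglen  # skips embedded thumbnails as well
        if marker == 0xDA:  # start of scan: compressed data follows
            pos = _skip_scan(buf, pos)
            if pos is None:
                return None
    return None
```

On a buffer where a thumbnail is embedded in an application segment, a plain search for FFD9 stops at the end of the thumbnail, while `jpeg_length` continues to the top-level end marker. Returning None for inconsistent length fields matches the case where the header does not agree with the content that follows.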
Detecting Text Files
Text files sit at the opposite end of the spectrum. Having no persistent structure at all, they are the most difficult to locate, but among the easiest to recover. Even fragmented text files can be recovered (if identified successfully) and combined into a single file if needed. There are no file headers or system structures to worry about.
Because text-based documents have no formal file headers, a data recovery tool analyzes the actual data blocks instead, trying to determine whether they belong to what appears to be a text file. The decision is made by analyzing the character set of the data. If a data block consists mostly of printable characters from a known character set (e.g. Western European, Unicode or Arabic), the block is considered to belong to a text file. The end of such a file is normally detected when a certain number of non-text symbols (binary data) appear.
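The block classification described above can be illustrated with a short sketch. It is restricted to plain ASCII for simplicity; the 95% threshold, the run length of 8 and all function names are assumptions chosen for the example, and a real tool would also need to handle other character sets.

```python
# Printable ASCII plus tab, line feed and carriage return.
PRINTABLE = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def text_ratio(block):
    """Fraction of bytes in `block` that are printable ASCII."""
    if not block:
        return 0.0
    return sum(b in PRINTABLE for b in block) / len(block)


def looks_like_text(block, threshold=0.95):
    """Treat a block as text when most of its bytes are printable."""
    return text_ratio(block) >= threshold


def find_text_end(data, start=0, max_binary_run=8):
    """Return the offset where a run of non-text bytes ends the text file."""
    run = 0
    for i in range(start, len(data)):
        if data[i] in PRINTABLE:
            run = 0
        else:
            run += 1
            if run >= max_binary_run:
                return i - run + 1  # first byte of the binary run
    return len(data)
```

Requiring a run of several non-text bytes, rather than stopping at the first one, keeps a single stray byte from truncating the recovered file.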
Detecting XML and HTML Documents
XML and HTML documents are structured text files. They normally begin with certain tags and end with other tags. While there is no exact binary signature to look for, XML and HTML documents can be detected by looking for a characteristic opening tag (e.g. <?xml) and the corresponding closing tag. The lookup must be case-insensitive, as tags can be written in upper case, lower case or even mixed case. The existence of opening and closing tags allows reliable detection of the beginning and end of such documents.
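A case-insensitive tag lookup of this kind might look like the sketch below. It covers HTML only, because the closing tag of an XML document depends on the name of its root element. The specific tags matched and the `find_html_document` name are assumptions made for the example.

```python
import re

# Assumed opening and closing tags for an HTML document.
OPEN_RE = re.compile(rb"<!doctype\s+html|<html[\s>]", re.IGNORECASE)
CLOSE_RE = re.compile(rb"</html\s*>", re.IGNORECASE)


def find_html_document(data):
    """Return (start, end) offsets of the first HTML document, or None."""
    opening = OPEN_RE.search(data)
    if opening is None:
        return None
    closing = CLOSE_RE.search(data, opening.end())
    if closing is None:
        return None
    return opening.start(), closing.end()


if __name__ == "__main__":
    raw = b"\x00\x01junk<HtMl><body>hi</body></HTML>\xff\xfe"
    span = find_html_document(raw)
    if span is not None:
        print(raw[span[0]:span[1]])
```

Matching on bytes rather than decoded text lets the search run directly over raw disk blocks that also contain binary data.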