Before you can do anything with an unstructured file, you first need to identify its file type. Our products start with the following steps:
- Calculate hashes (MD5, SHA-1, etc.) for the entire file, and match them to known good and/or known bad file hash databases. (optional; very slow)
- Search the first 32 bytes for a known signature (magic ID). (very fast)
- Search further into the file for a known signature. (slow for some file types, like PDF)
- Match file name extension. (optional; very fast; very low accuracy)
- Read some key parts of the file, in order to find secondary signatures. (very high accuracy; slow for some file types)
That list is a bit simplified, since we go to great lengths to accommodate some complicated and difficult file types, but it gives you an idea of what we do to identify each unstructured file. Some types of files were designed with no signatures, and sometimes without even a file header or file footer.
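As a rough illustration of the fast signature step in the list above, here is a minimal Python sketch that checks the first 32 bytes of a file against a small table of magic IDs. The table holds only a handful of widely documented signatures as a stand-in for a real database, and the names `SIGNATURES`, `identify_header` and `identify_file` are hypothetical; none of this is the File Investigator API.

```python
# Illustrative table only: (offset, magic bytes, label).
# A real signature database would be far larger.
SIGNATURES = [
    (0, b"\x89PNG\r\n\x1a\n", "PNG image"),
    (0, b"%PDF-", "PDF document"),
    (0, b"PK\x03\x04", "ZIP archive"),
    (0, b"GIF87a", "GIF image"),
    (0, b"GIF89a", "GIF image"),
]

HEADER_WINDOW = 32  # only the first 32 bytes are examined


def identify_header(head):
    """Return a type label if a known magic ID appears in the header window."""
    window = head[:HEADER_WINDOW]
    for offset, magic, label in SIGNATURES:
        if window[offset:offset + len(magic)] == magic:
            return label
    return None  # no signature found; slower stages would run next


def identify_file(path):
    """Read just the header window from disk and look for a signature."""
    with open(path, "rb") as f:
        return identify_header(f.read(HEADER_WINDOW))
```

Because only 32 bytes are read per file, this stage costs almost nothing, which is why it makes sense to run it before any of the slower stages.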
To handle these difficult file types, we created the Byte Value Distribution (BVD) signature. This signature looks much like a histogram, but uses our own special sauce to generate a fuzzy file signature. The intent was to obtain a consistent signature that could identify each type of file across all instances of that file type that may occur on a user’s computer. We started by reading every byte in a file and calculating our BVD signature, much as you would a hash code, but that takes a long time on large (> 100KB) files. We added caching, but that only helped a little. We finally settled on reading just the first 256KB of each file, and only using these signatures on file types that have no other signature type for identification.
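To make the histogram idea concrete, here is a minimal Python sketch that builds a plain, normalized byte-value histogram from the first 256KB of a file. It is not the actual BVD calculation, which includes proprietary steps not described here; the names `MAX_BYTES`, `byte_histogram` and `file_histogram` are hypothetical.

```python
MAX_BYTES = 256 * 1024  # sample cap: only the first 256KB is read


def byte_histogram(data, limit=MAX_BYTES):
    """Return 256 fractions, one per byte value, for the first `limit` bytes."""
    sample = data[:limit]
    counts = [0] * 256
    for value in sample:  # iterating over bytes yields ints 0..255
        counts[value] += 1
    total = len(sample)
    if total == 0:
        return [0.0] * 256
    # Normalizing by sample length makes files of different sizes comparable.
    return [c / total for c in counts]


def file_histogram(path):
    """Read at most MAX_BYTES from disk, so cost is bounded for huge files."""
    with open(path, "rb") as f:
        return byte_histogram(f.read(MAX_BYTES))
```

Capping the read means a multi-gigabyte file costs no more to profile than a 256KB one.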
These first Byte Value Distribution signatures consisted of the high and low values observed for each column in a chart of byte values 0 – 255. If a file’s signature fell within these highs and lows across all 256 columns, it was a match and the file type was reported back to the user. Since this method was still slow, we only used it when all other identification methods (other than file name extension match) failed.
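The high/low matching described above can be sketched roughly as follows. This assumes the pattern for a file type is simply the per-column minimum and maximum seen across known sample files, which is our reading of the description rather than the exact product logic; `histogram`, `build_envelope` and `matches` are hypothetical names.

```python
def histogram(data):
    """Normalized count of each byte value 0..255."""
    counts = [0] * 256
    for value in data:
        counts[value] += 1
    total = len(data) or 1  # avoid division by zero on empty input
    return [c / total for c in counts]


def build_envelope(samples):
    """Per-column low and high values observed across known sample files."""
    hists = [histogram(s) for s in samples]
    low = [min(column) for column in zip(*hists)]
    high = [max(column) for column in zip(*hists)]
    return low, high


def matches(data, envelope):
    """True only if every one of the 256 columns falls inside its range."""
    low, high = envelope
    return all(lo <= v <= hi for lo, v, hi in zip(low, histogram(data), high))
```

For example, an envelope built from `b"aaab"` and `b"abbb"` accepts `b"aabb"` but rejects `b"abcd"`, because the columns for `c` and `d` were always zero in the samples.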
We later needed a BVD-style method to identify file fragments, as part of developing our upcoming Disk/File Carving technology. In file carving, you typically encounter file fragments around the size of a small disk sector (4KB). That doesn’t provide enough bytes of data for the 256-column BVD to identify most file types effectively. So we made a change and squeezed the BVD (256 character codes wide) into BVD32 (32 columns). This reduced pattern width effectively identifies file types from samples as small as 4KB. We then collected BVD32 patterns for the first 4KB of over a thousand file types.
Last quarter, we added this BVD32 file identification stage to our File Expander product, in an effort to expand its ability to work with more file types (along with its new Object Edge Detection technology). This quarter, we replaced the old BVD256 patterns with the new BVD32 file header patterns in our File Investigator line of products (which includes the File Investigator API, FI Tools and the upcoming Dark Data Detective), with great success. This has not only greatly accelerated the file identification stage, but also provided the ability to catch some new variations of file formats. In the past, these variations would cause File Investigator to fail to identify a file with medium or high accuracy. They are introduced by programmers adding new features to a product or failing to follow the documented methods for writing data files to disk.
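One simple way to picture the squeeze from 256 columns to 32 is to group every 8 adjacent byte values into a single column. The sketch below does exactly that for a 4KB fragment. The grouping rule is an assumption made for illustration, since the actual BVD32 mapping is not described here, and `FRAGMENT_SIZE` and `bvd32_sketch` are hypothetical names.

```python
FRAGMENT_SIZE = 4096  # one 4KB fragment
COLUMNS = 32


def bvd32_sketch(fragment):
    """Fold a byte histogram into 32 columns (assumed: 8 byte values each)."""
    sample = fragment[:FRAGMENT_SIZE]
    bins = [0] * COLUMNS
    for value in sample:
        bins[value >> 3] += 1  # values 0..7 -> column 0, 8..15 -> column 1, ...
    total = len(sample) or 1  # avoid division by zero on empty input
    return [c / total for c in bins]
```

The arithmetic shows why a narrower pattern suits small samples: 4,096 bytes spread over 256 columns average only 16 bytes per column, while the same bytes over 32 columns average 128 per column, so each column is far less sensitive to chance variation.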
In the future, we plan to create an additional BVD32 pattern database for 4KB sections of files beyond the first 4KB. This will provide the ability to quickly confirm that a file is valid and doesn’t contain any Trojan objects, Trojan files or corruption. We also plan to create a third BVD32 database that contains patterns for file objects. This will give us the ability to confirm the validity of each object inside a file, as well as to reassemble file fragments that have been recovered through file carving.
In Dark Data Detective, the Business version uses File Investigator. The Professional Investigator version includes everything in the Business version, plus File Expander and other upgraded features. The Advanced Researcher version includes everything in the Professional Investigator version, plus file carving and other upgraded features. Watch our website for releases of these upcoming products this year.