I’m providing a survey of File Carving/Data Carving approaches here, to set a base of understanding before I introduce our new approach next month.  We have been developing digital forensics products that identify computer files since 1990.  These products are used by Electronic Discovery product and service providers, Police Departments, Government Contractors and Spook Agencies in multiple countries.  In our decades of research & development, we have created methods for breaking down thousands of file types into their individual building blocks/objects.  While the clusters and sectors used to store files are useful as reference points, they are woefully ineffective at detecting file and object boundaries.  Our methods operate on objects, not set block sizes. Next month, we will provide access to an Alpha version of our new approach, which supports 4,000+ file types and is incorporated into our Dark Data Detective platform.  While the following descriptions are very basic, I am providing references at the end for deeper study of this topic.  My previous post on Data Carving is here. Dark Data Detective’s approach is a superset of these approaches (minus Semantic Carving and Repackaging Carving at this point), plus additional techniques of our own design.

Data Carving Approaches

Block-Based Carving is the process of evaluating consecutive blocks of data and collecting them together into a file.  These blocks are typically clusters and/or sectors read from a mass storage device.  Most approaches use these pre-set block sizes as the individual puzzle pieces to be matched together. This approach is not cognizant of the individual objects in a file that don’t happen to land on a sector boundary, and it runs the risk of over- or under-collecting data into the resulting file.
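
To make the mechanics concrete, here is a minimal Python sketch of block-based collection. It assumes a raw image read in fixed 512-byte sectors and treats all-zero blocks as empty, which is a simplification; real tools use cluster maps and far better block classifiers.

```python
SECTOR_SIZE = 512  # typical sector size; clusters are usually a multiple of this

def read_blocks(image_path, block_size=SECTOR_SIZE):
    """Yield (offset, block) pairs for every fixed-size block in a raw image."""
    with open(image_path, "rb") as f:
        offset = 0
        while True:
            block = f.read(block_size)
            if not block:
                break
            yield offset, block
            offset += block_size

def carve_runs(image_path):
    """Collect consecutive non-empty blocks into candidate runs. Any object
    that straddles a sector boundary is invisible at this granularity."""
    runs, current, start = [], bytearray(), None
    for offset, block in read_blocks(image_path):
        if block.strip(b"\x00"):          # block holds some non-zero data
            if start is None:
                start = offset
            current += block
        elif start is not None:           # run ended on an empty block
            runs.append((start, bytes(current)))
            current, start = bytearray(), None
    if start is not None:
        runs.append((start, bytes(current)))
    return runs
```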

Statistical Carving includes algorithms like Hash-Based Carving and Decision-Theoretic Carving, which use methods such as hashing, statistical analysis and entropy measurement to classify each block and match similar blocks together. This is a good approach for recovering file types that the developer is not familiar with.
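
As an illustration, the Shannon entropy measurement used to classify a block is only a few lines of Python. The bucketing thresholds below are illustrative guesses, not tuned values from any of the cited papers.

```python
import math
from collections import Counter

def shannon_entropy(block: bytes) -> float:
    """Bits per byte: 0.0 for a constant block, 8.0 for uniformly random bytes."""
    if not block:
        return 0.0
    n = len(block)
    return -sum((c / n) * math.log2(c / n) for c in Counter(block).values())

def classify_block(block: bytes) -> str:
    """Crude entropy bucketing; real tools combine many statistics per block."""
    h = shannon_entropy(block)
    if h < 1.0:
        return "sparse/empty"
    if h > 7.5:
        return "compressed/encrypted"   # JPEG bodies, ZIP payloads, ciphertext
    return "text/structured"
```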

Header/Footer Carving identifies file headers by their magic IDs/signatures, then searches for a matching footer signature to find the end of the file.  It aggressively collects every block in between the header and footer to build a file.  This approach is a bit sloppy and does not account for any fragmentation, but it is useful when combined with Statistical Carving and Validation Carving.
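
A minimal sketch, using the well-known JPEG signatures (FF D8 FF header, FF D9 footer) as the example type:

```python
def carve_header_footer(data: bytes,
                        header: bytes = b"\xFF\xD8\xFF",
                        footer: bytes = b"\xFF\xD9"):
    """Greedily pair each header with the next footer and keep everything in
    between, fragments and all; JPEG signatures assumed as the example."""
    carved, pos = [], 0
    while True:
        start = data.find(header, pos)
        if start == -1:
            break
        end = data.find(footer, start + len(header))
        if end == -1:
            break
        carved.append(data[start:end + len(footer)])
        pos = end + len(footer)
    return carved
```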

Header/Maximum Carving identifies file headers with magic IDs/signatures, then collects all data up to a set maximum length. This approach is very sloppy and does not account for any fragmentation, but it is useful when combined with Statistical Carving, Validation Carving and Fragment Recovery Carving, and for file types with no footer signature and a consistently known size. The risk is that a carved file steals the contents, after its footer, of other files that could otherwise have been carved as well.
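
A sketch of the same scan with a maximum-length cutoff. The signature and the 64 KB cap in the usage comment are placeholder assumptions, not values from any specification.

```python
def carve_header_maximum(data: bytes, header: bytes, max_length: int):
    """Grab max_length bytes after each header hit; deliberately sloppy."""
    carved, pos = [], 0
    while (start := data.find(header, pos)) != -1:
        carved.append(data[start:start + max_length])
        pos = start + max_length  # jumping ahead can swallow overlapping files
    return carved

# Hypothetical usage, assuming a type with a known fixed size:
#   files = carve_header_maximum(image_bytes, b"BM", 65536)
```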

Header/Embedded Length Carving identifies file headers with magic IDs/signatures, then reads the file’s size from a known offset in the file being carved. This approach is a bit sloppy and does not account for any fragmentation, but it is useful when combined with Statistical Carving, Validation Carving and Fragment Recovery Carving, and for file types with no footer signature.
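
A sketch using the BMP format, which genuinely stores its total file size as a 4-byte little-endian integer at offset 2 in the file header:

```python
import struct

def carve_bmp_by_embedded_length(data: bytes):
    """Find 'BM' signatures and carve exactly the number of bytes the
    embedded length field claims, with a basic sanity check."""
    carved, pos = [], 0
    while (start := data.find(b"BM", pos)) != -1:
        if start + 6 > len(data):
            break
        (size,) = struct.unpack_from("<I", data, start + 2)
        if 14 < size <= len(data) - start:   # BMP file header alone is 14 bytes
            carved.append(data[start:start + size])
            pos = start + size
        else:
            pos = start + 2                  # false positive; keep scanning
    return carved
```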

File Structure Based Carving identifies each component of the carved file using extensive knowledge of the file type’s structure.  I assume this approach would usually include a header signature match, though some situations may not require one: for example, a raw data file that uses no file header or footer objects or signatures. This has been called Deep Carving (by Metz and Mora), and it would appear to be one of the most accurate approaches, provided the file structure semantic schema is flexible enough to account for variations introduced by complex format specifications and by novice software developers who implement their file creation code in a sloppy manner.  This approach seems to have the lowest risk, but with high development difficulty and complexity.
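
As a small-scale example of structure walking, a PNG file can be carved byte-exactly by following its chunk layout (4-byte big-endian length, 4-byte type, payload, 4-byte CRC) until the IEND chunk. This sketch is far simpler than a real deep carver, but it shows the principle: the structure itself, not a footer scan, determines the end of the file.

```python
import struct
import zlib

PNG_SIG = b"\x89PNG\r\n\x1a\n"

def carve_png_by_structure(data: bytes, start: int):
    """Walk PNG chunks from a signature hit; return the exact file, or None."""
    if not data.startswith(PNG_SIG, start):
        return None
    pos = start + len(PNG_SIG)
    while pos + 8 <= len(data):
        length, ctype = struct.unpack_from(">I4s", data, pos)
        end = pos + 8 + length + 4           # length + type + payload + CRC
        if end > len(data):
            return None                      # truncated: a fragment, not a file
        (crc,) = struct.unpack_from(">I", data, pos + 8 + length)
        if zlib.crc32(data[pos + 4:pos + 8 + length]) != crc:
            return None                      # structure broken; bail out
        pos = end
        if ctype == b"IEND":
            return data[start:pos]           # byte-exact end of file
    return None
```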

Semantic Carving keeps track of the language used in the carved file, and marks any blocks that contain a different language as fragments that do not belong to the file currently being carved. This approach seems to improve the quality of the carving process, but with increased development difficulty and complexity unless it is implemented only as a Unicode filter applied to obvious Unicode fields.
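
A toy sketch in the spirit of the filter variant mentioned above: flag blocks whose printable-text profile jumps away from the file’s running profile. The threshold and smoothing factor are arbitrary illustrative choices, and real semantic carving is considerably more sophisticated than this.

```python
def printable_ratio(block: bytes) -> float:
    """Fraction of bytes that are printable ASCII or common whitespace."""
    ok = sum(1 for b in block if 0x20 <= b < 0x7F or b in (0x09, 0x0A, 0x0D))
    return ok / len(block) if block else 0.0

def flag_foreign_blocks(blocks, threshold=0.35):
    """Mark blocks whose text profile deviates sharply from the host file's."""
    flags, running = [], None
    for block in blocks:
        r = printable_ratio(block)
        if running is None:
            running = r
        flags.append(abs(r - running) > threshold)   # likely another file's fragment
        if not flags[-1]:
            running = 0.9 * running + 0.1 * r        # slow-moving file profile
    return flags
```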

Validation Carving confirms the carved file by comparing its contents to known specifications and/or loading the file with an application known to support the file type.  This could be very slow, and partially a manual process, if an external application or interpreter is used and a human needs to observe the result. This approach seems to greatly reduce the risk of a false positive carving, but with the highest development difficulty and complexity if a full interpreter must be developed for every supported file type.
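
A sketch of the application-loading variant, using Pillow’s verify() as the trusted decoder if that third-party library happens to be installed; any known-good decoder for the file type would serve the same role.

```python
import io

def validate_with_pillow(candidate: bytes) -> bool:
    """Ask a known-good image decoder to vet a carved candidate."""
    try:
        from PIL import Image                # third-party: pip install pillow
        with Image.open(io.BytesIO(candidate)) as img:
            img.verify()                     # raises on structural corruption
        return True
    except Exception:
        return False                         # unreadable or not an image at all
```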

Fragment Recovery Carving identifies data blocks that have been inserted into the middle of the file being carved, which is typical of a disk file system filling fragmented free space when writing a new, or growing, file. This approach also includes support for files that have been split into two or more parts and must be recombined in the correct order.  While fragmentation is commonly observed in normal use of non-auto-defragmenting data storage devices, I have never observed a data storage device place a file’s fragments anywhere but at increasing cluster/sector offsets.  Theoretically, out-of-order fragments could occur if an existing stored file grew in size after the end of the data storage area was reached, and a different file had been deleted after the growing file was first stored, leaving new free blocks at offsets prior to the last block used by the growing file.  But that would be a rare case.  The complexity of implementing this approach may be high, but in my opinion it is key to carving files successfully in a non-sloppy manner. Although, I’m not sure that out-of-order fragments need to be supported, other than for academic tests.
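
A sketch of bifragment gap carving (reference 8), covering the simple in-order, two-fragment case described above: try every block-aligned gap between a header and its footer, excise it, and hand the candidate to a validator such as the one sketched under Validation Carving. Note the quadratic number of validation calls, which is why this approach is slow.

```python
def bifragment_gap_carve(data: bytes, header_at: int, footer_end: int,
                         validate, block: int = 512):
    """Try every (gap start, gap length) between header and footer, excising
    one run of foreign blocks and validating the reassembled candidate."""
    n_blocks = (footer_end - header_at) // block
    for gap_start in range(1, n_blocks):
        for gap_len in range(1, n_blocks - gap_start):
            first = data[header_at:header_at + gap_start * block]
            second = data[header_at + (gap_start + gap_len) * block:footer_end]
            candidate = first + second
            if validate(candidate):          # e.g. validate_with_pillow above
                return candidate
    return None                              # no single-gap reassembly validated
```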

Repackaging Carving detects objects and fields missing from a carved file, and adds those parts back into the file.  This fixes the file for normal use and viewing with off-the-shelf applications. This approach can require the highest development difficulty and complexity unless it is designed for only the simplest file types.
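
A deliberately simple sketch of the idea: a carved PNG that lost its terminating chunk can be repackaged by appending the canonical 12-byte IEND chunk, which is constant across all PNG files. Real repackaging must reconstruct far more than a fixed trailer, so this only illustrates the easiest possible case.

```python
PNG_SIG = b"\x89PNG\r\n\x1a\n"
PNG_IEND = b"\x00\x00\x00\x00IEND\xaeB`\x82"   # zero length, type, fixed CRC

def repackage_truncated_png(carved: bytes) -> bytes:
    """Append the canonical IEND chunk so off-the-shelf viewers will open the
    file; decoders may still report missing image data inside it."""
    if carved.startswith(PNG_SIG) and not carved.endswith(PNG_IEND):
        return carved + PNG_IEND
    return carved
```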

All other data carving approaches appear to be combinations of two or more of the approaches listed above.

References

  1. “File carving”, https://en.wikipedia.org/wiki/File_carving
  2. “File carving”, https://forensics.wiki/file_carving/
  3. “Data Carving Concepts”, https://www.giac.org/paper/gcfa/1161/data-carving-concepts/110685 (typical image)
  4. “File Carving in Windows”, https://digitalinvestigator.blogspot.com/2022/09/file-carving-in-windows.html (gap image)
  5. “File Carving”, https://www.infosecinstitute.com/resources/digital-forensics/file-carving/
  6. Simson Garfinkel, “Carving Contiguous and Fragmented Files with Fast Object Validation”, in Proceedings of the 2007 Digital Forensics Research Workshop (DFRWS), Pittsburgh, PA, August 2007, https://www.sciencedirect.com/science/article/pii/S1742287607000369
  7. “Decision-theoretic file carving”, http://www.sciencedirect.com/science/article/pii/S1742287617301329?via%3Dihub
  8. “Bifragment gap carving”, https://www.sciencedirect.com/science/article/pii/S1742287607000369
  9. “Hash-based carving: Searching media for complete files and file fragments with sector hashing and hashdb”, https://www.sciencedirect.com/science/article/pii/S1742287615000468
  10. “Detecting File Fragmentation Point Using Sequential Hypothesis Testing”, https://dfrws.org/sites/default/files/session-files/2008_USA_paper-detecting_file_fragmentation_point_using_sequential_hypothesis_testing.pdf
