Unstructured Data sounds like it refers to files that store disorganized data or simple text, but I see it as more of a classification term than a description of a file’s lack of internal data structures. Some examples, provided in an article by Treehouse Technology Group, are “photos, video and audio files, social media content, satellite imagery, presentations, PDFs, open-ended survey responses, websites, data from IoT devices, mobile data, weather data, and conversation transcripts.” They go on to explain that “Unstructured data is usually stored in a data lake. This is a storage repository where a large amount of raw data is stored in its native format.” For those new to this topic, native format describes a document or file that is stored in its original file format (ex: MS Word, MS PowerPoint, etc.) as opposed to being converted to a new file type (ex: Adobe Acrobat PDF, TIFF, Text, etc.) to make it conform to a standard file format for easier processing and classification.
For most people, this must sound very complicated so far. Let’s see if we can clarify a little. If we have Unstructured Data, then there must be Structured Data too, right? Here is a comparison provided in an article by Lawtomated: “Structured data resides in relational databases: a database structured to recognise relations between stored items of data. Databases of this type are typically managed via a relational database management system (“RDBMS”). This is usually what people think of when they think of a database, i.e. a table of rows and columns containing related information. … Unstructured data is everything else. Unstructured data has an internal structure (i.e. bits and bytes), but is not structured via pre-defined data models or schema, i.e. not organized and labelled to identify meaningful relationships between data.”
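To make that comparison concrete, here is a minimal Python sketch of my own (it is not taken from the Lawtomated article). The table, column names and sample text are all hypothetical. It stores the same facts twice: once in a relational table with a pre-defined schema, and once as the raw bytes of a free-text message.

```python
import sqlite3

# Structured: a pre-defined schema labels every value, so a query can ask
# for a field by name. The table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
)
conn.execute(
    "INSERT INTO invoices (customer, total) VALUES (?, ?)", ("Acme", 125.50)
)
structured_total = conn.execute(
    "SELECT total FROM invoices WHERE customer = ?", ("Acme",)
).fetchone()[0]

# Unstructured: the same facts as free text. The bytes do have an internal
# structure (UTF-8 characters), but nothing labels which part is the
# customer or the total. A program must interpret the content to find out.
unstructured = (
    "Hi, this is Acme. Please find our invoice for $125.50 attached."
).encode("utf-8")

print(structured_total)             # the schema tells us this is the total
print(unstructured.find(b"total"))  # no "total" label exists in the bytes
```

The database can answer "what is the total?" directly because the schema labels the value. The text file holds the same number, but a reader or an analysis tool has to work out what it means.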
Now that we have a better understanding of the difference between Structured and Unstructured Data, what do we do with the Unstructured Data? An article by TechTarget explains that “unstructured data isn’t suited to the transaction processing applications that often handle structured data. Instead, it’s primarily used for BI [Business Intelligence] and analytics. … Unstructured data analytics … aids regulatory compliance efforts, particularly in helping organizations understand what corporate documents and records contain. In the past, unstructured data was often locked away in siloed document management systems, individual manufacturing devices and the like — making it what’s known as dark data, unavailable for analysis. But things changed with the development of big data platforms, primarily Hadoop clusters, NoSQL databases, Azure and the Amazon Simple Storage Service (S3). They provide the required infrastructure for processing, storing and managing large volumes of unstructured data without the need for a common data model and a single database schema.” Most of our customers, however, are in the Electronic Discovery, Law Enforcement and National Security industries.
At Forensic Innovations, our focus is on Unstructured Data. I personally trust individual files to store my data more than large databases, which can become corrupted and lose it. We embrace the challenge of identifying and dissecting unstructured data in our search for dark data. While other companies focus on assimilating unstructured data for searching and big data analytics, we tackle the remaining files that otherwise drop out of their systems into the exception bin. That’s like a trash can where unrecognized files go unnoticed. If someone wants to hide evidence, that’s the best place to hide it: right where it will escape the search.
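For readers who want to see how a file ends up in an exception bin, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of our products or of any particular vendor’s pipeline. The tiny signature table, the function names `identify` and `triage`, and the sample files are all hypothetical. The technique shown is simple leading-byte ("magic number") matching, which is only one of many ways to identify a file.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical, deliberately tiny signature table. Real identification
# tools recognize far more formats and look well beyond the first bytes.
SIGNATURES = {
    b"%PDF-": "PDF document",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
    b"PK\x03\x04": "ZIP container (also used by DOCX, XLSX, PPTX)",
}


def identify(data: bytes) -> Optional[str]:
    """Return a format name if the leading bytes match a known signature."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


def triage(files: Dict[str, bytes]) -> Tuple[Dict[str, str], List[str]]:
    """Split files into recognized formats and an exception bin."""
    recognized: Dict[str, str] = {}
    exception_bin: List[str] = []
    for filename, data in files.items():
        fmt = identify(data)
        if fmt is None:
            exception_bin.append(filename)  # unrecognized: easy to overlook
        else:
            recognized[filename] = fmt
    return recognized, exception_bin


# Made-up sample content; only the leading bytes matter for this sketch.
sample_files = {
    "report.pdf": b"%PDF-1.7\n% sample body",
    "holiday.jpg": b"PK\x03\x04 archive renamed to look like a photo",
    "notes.dat": b"\x00\x13\x37 some unknown proprietary layout",
}

recognized, exception_bin = triage(sample_files)
print(recognized)
print(exception_bin)
```

Two things stand out in this sketch. The file named `holiday.jpg` is identified by its content as a ZIP container, not by its extension, and `notes.dat` matches nothing in the table, so it lands in the exception bin. A system that stops there never looks inside that file.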