Structured vs Unstructured Data

May 13, 2018

Structured data is familiar to most people, as it's what most user-facing business systems capture. You've got columns and rows, and each value is stored against a field name and a record identifier of some type, with a clear data type.

Some structured data in Excel - field based, and can be loaded with data types into a relational database

Unstructured data is also prevalent: it is any data which does not conform to a common schema and may not have an identifiable internal structure. It might be written in natural language, might not be formatted in columns and rows, or may have dynamic hierarchies or otherwise be difficult to interpret automatically via scripting.

Unstructured data in a PDF file - the raw contents are stored in a largely binary format which isn't human readable or easily searchable without dedicated tools

I've seen comparisons on the web suggesting that if structured data is Excel, unstructured data is PowerPoint or Word - as in, the contents are not formatted for analysis or easy searching.

So, structured data could be:
- Spreadsheets, with clear column organisation and categorisation of data
- Relational databases
- CSV or other delimited text files
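As a rough sketch of why this kind of data is easy to work with, the Python below loads a small delimited text sample into a relational table with declared data types and then aggregates it in one query. The sample data, the `orders` table and the function names are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample data: a header row followed by consistently shaped records.
CSV_TEXT = """order_id,customer,amount,order_date
1001,Alice,25.50,2018-05-01
1002,Bob,13.00,2018-05-02
1003,Alice,7.25,2018-05-03
"""


def load_orders(csv_text):
    """Load delimited text into an in-memory SQLite table with declared column types."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    # Every row has the same fields, so each value can be converted to its type directly.
    rows = [
        (int(r["order_id"]), r["customer"], float(r["amount"]), r["order_date"])
        for r in reader
    ]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def total_by_customer(conn):
    """Sum the order amounts per customer with a single SQL query."""
    cur = conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
    )
    return dict(cur.fetchall())


if __name__ == "__main__":
    print(total_by_customer(load_orders(CSV_TEXT)))
```

Because every row shares the same fields and types, no interpretation step is needed between loading the data and analysing it.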

Unstructured data could be:
- Spreadsheets, without clear organisation - for example many cells filled with values which do not form a table
- Pages on a website (including posts on social media)
- Proprietary binary files
- Emails
- PDF files
- Media (e.g. videos, audio, images)
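To illustrate why sources like these are harder to script against, here is a minimal sketch that pulls order numbers out of a made-up email using a regular expression. The email text, the pattern and the `find_order_ids` function are hypothetical, and the pattern assumes the sender always writes the word "order" followed by a number.

```python
import re

# Hypothetical email body: the facts are there, but only in natural language.
EMAIL_TEXT = """Hi team,

Could you please refund order 1002? The customer paid 13.00 on 2 May
and says the parcel never arrived. Order 1003 is fine, no action needed.

Thanks,
Sam
"""

# Assumes order numbers always appear as the word "order" followed by digits.
ORDER_PATTERN = re.compile(r"\border\s+(\d+)", re.IGNORECASE)


def find_order_ids(text):
    """Return every number that directly follows the word 'order', in order of appearance."""
    return [int(match) for match in ORDER_PATTERN.findall(text)]


if __name__ == "__main__":
    print(find_order_ids(EMAIL_TEXT))
```

The pattern finds both order numbers, but it cannot tell which one needs a refund. That meaning lives in the surrounding sentences, and a differently worded email would need a different pattern.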

Collecting and categorising this unstructured data requires data mining tools. Although some document types (e.g. Word documents) do have an inherent schema, normal tools cannot interpret the contents held within that schema to provide insights. It's a similar question to "What is Big Data?", to which the answer could generally be anything that cannot be processed by the software on a commodity computer (e.g. your work laptop).

If we take the example of a Word document, the file itself follows an XML schema which is structured - but the way the user writes the data in the document is not. We could refer to this as semi-structured. This sort of data is also prevalent in web technologies, with examples such as JSON and NoSQL databases.
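A small sketch of what semi-structured looks like in practice, using made-up JSON records: each record is valid JSON with named fields, but the records do not all share the same fields. The data and the helper names `field_names` and `to_rows` are hypothetical.

```python
import json

# Hypothetical records: well-formed JSON, but each one carries different fields.
JSON_TEXT = """
[
  {"id": 1, "name": "Alice", "email": "alice@example.com"},
  {"id": 2, "name": "Bob", "phones": ["555-0100", "555-0101"]},
  {"id": 3, "name": "Carol", "address": {"city": "Leeds"}}
]
"""


def field_names(records):
    """Collect every top-level key seen across the records, sorted alphabetically."""
    names = set()
    for record in records:
        names.update(record.keys())
    return sorted(names)


def to_rows(records, columns):
    """Project each record onto a fixed column list; missing fields become None."""
    return [tuple(record.get(col) for col in columns) for record in records]


if __name__ == "__main__":
    records = json.loads(JSON_TEXT)
    print(field_names(records))
    print(to_rows(records, ["id", "name", "email"]))
```

The parser handles the format without any trouble, but turning the records into a table means choosing the columns yourself and accepting gaps where a record has no value for one of them.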

Some further detail:
- https://en.wikipedia.org/wiki/Unstructured_data
- https://en.wikipedia.org/wiki/Semi-structured_data
- https://www.datamation.com/big-data/structured-vs-unstructured-data.html