Structured data is familiar to most people, as it’s what most user-facing business systems capture. You’ve got columns and rows: each value is stored under a field name, against an identifying value for its record, and has a clear data type.
Unstructured data is just as prevalent: it is any data which does not conform to a common schema and may not have an identifiable internal structure. It might be written in natural language, might not be formatted in columns and rows, may have dynamic hierarchies, or may otherwise be difficult to interpret automatically via scripting.
I’ve seen comparisons on the web suggesting that if structured data is Excel, unstructured data is PowerPoint or Word: the contents are not formatted for analysis or easy searching.
So, structured data could be:
– Spreadsheets, with clear column organisation and categorisation of data
– Relational databases
– CSV or other delimited text files
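To make the "columns and rows" idea concrete, here is a minimal sketch in Python. The sample data and the field names (order_id, customer, amount) are invented for illustration. Because every row follows the same schema, a few lines of standard-library code can answer a question about the data.

```python
import csv
import io

# Hypothetical sample data, invented for this illustration.
CSV_TEXT = """order_id,customer,amount
1001,Alice,25.50
1002,Bob,14.00
1003,Alice,9.25
"""


def total_by_customer(csv_text):
    """Sum the 'amount' column for each value in the 'customer' column."""
    totals = {}
    # DictReader uses the header row as field names, so each row
    # arrives as a mapping of field name to value.
    for row in csv.DictReader(io.StringIO(csv_text)):
        customer = row["customer"]
        # CSV values are read as text, so the numeric type is applied here.
        totals[customer] = totals.get(customer, 0.0) + float(row["amount"])
    return totals


totals = total_by_customer(CSV_TEXT)
print(totals)
```

The script never has to guess where the amount is; the schema says so. That predictability is what the structured examples above have in common.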
Unstructured data could be:
– Spreadsheets, without clear organisation – for example many cells filled with values which do not form a table
– Pages on a website (including posts on social media)
– Proprietary binary files
– PDF files
– Media (e.g. videos, audio, images)
Collecting and categorising this unstructured data requires data mining tools. Although some document types (e.g. Word documents) do have inherent schemas, the content held within that schema cannot be interpreted by general-purpose tools to provide insights. It’s a similar question to “What is Big Data?”, to which the answer could generally be anything that cannot be processed by the software on a commodity computer (e.g. your work laptop).
If we take the example of a Word document, the file itself has a structured XML schema, but the way the user writes the content of the document is not structured. We could refer to this as semi-structured. This sort of data is also prevalent in web technologies, with examples such as JSON and NoSQL databases.
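As a rough illustration of semi-structured data, the Python sketch below parses a small JSON document. The records and field names are invented for the example. The JSON syntax itself is strict, so any parser can read it, but the records do not share a common set of fields, so a script has to discover what each record contains before it can analyse it.

```python
import json

# Hypothetical records, invented for this illustration. Each one is
# valid JSON, but no two records share exactly the same fields.
JSON_TEXT = """
[
  {"id": 1, "name": "Alice", "email": "alice@example.com"},
  {"id": 2, "name": "Bob", "tags": ["vip", "newsletter"]},
  {"id": 3, "name": "Carol", "address": {"city": "Leeds", "postcode": null}}
]
"""


def field_coverage(json_text):
    """Count how many records contain each top-level field."""
    records = json.loads(json_text)
    coverage = {}
    for record in records:
        for key in record:
            coverage[key] = coverage.get(key, 0) + 1
    return coverage


coverage = field_coverage(JSON_TEXT)
print(coverage)
```

In this sample only id and name appear in every record. The remaining fields are optional or nested, which is why this kind of data sits between the two categories.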