What I did when I had to work with unstructured data
This article was published as a part of the Data Science Blogathon.
I am sure those of you working with data in any capacity have faced the problem of working with bad data, inconsistent data, unclean data, call it what you may. For those working further from data: as per a report by Forbes, 60% of any data professional’s time is spent cleaning and organizing unstructured data. Yes, that is a lot of time invested, but in my opinion it is the foundation for deriving results.
Here I curate 7 instances from my personal experience of working with unstructured data for almost 3 years now. I hope many of you reading can relate to at least a few.
1. Absence of Unique Identifier
To join two or more tables, there normally exists a common column, variable, parameter, etc. that can be used to join them. Here I was faced with the absence of a unique identifier, or, we may say, the absence of a primary and foreign key. That meant trouble, specifically while reporting on parameters across tables.
To get ahead of the problem, we created our own key by concatenating values from certain columns, and built a table with the columns of interest on top of it. Sounds chaotic? Yes. Got us results? Also yes.
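A minimal pandas sketch of that composite-key idea; the tables, column names, and values here are hypothetical stand-ins, since the real data isn't shown:

```python
import pandas as pd

# Two hypothetical tables with no shared ID, but where state + district
# together identify a row (an assumption for illustration).
population = pd.DataFrame({
    "state": ["Kerala", "Bihar"],
    "district": ["Ernakulam", "Patna"],
    "population": [3_282_388, 5_838_465],
})
literacy = pd.DataFrame({
    "state": ["Kerala", "Bihar"],
    "district": ["Ernakulam", "Patna"],
    "literacy_rate": [95.9, 70.7],
})

# Build a surrogate key by concatenating the identifying columns.
for df in (population, literacy):
    df["key"] = df["state"].str.lower() + "|" + df["district"].str.lower()

# The surrogate key now acts as the join column the tables never had.
merged = population.merge(literacy[["key", "literacy_rate"]], on="key")
```

Normalizing case before concatenating matters: without it, `Kerala|Ernakulam` and `kerala|ernakulam` would be two different keys.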
2. Different nomenclature
While working with unstructured geographical data, I came across different spellings of a single geographical jurisdiction. For example, Chhattisgarh (a state in India) was spelled as Chattisgarh, Chhatisgarh, and Chhattisgarh. That, again, meant trouble.
To tackle the problem, we created an interim mapping layer that allocated a code to each string variant; the main data then fetched the correct spelling from the master table via these codes.
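A sketch of such an interim mapping layer in pandas; the variants come from the example above, while the code value and table shapes are assumptions:

```python
import pandas as pd

# Interim mapping layer: every observed variant maps to a code, and the
# master table resolves the code to the canonical spelling.
# The code "CG" and the table layout are illustrative, not production values.
variant_to_code = {
    "Chhattisgarh": "CG",
    "Chattisgarh": "CG",
    "Chhatisgarh": "CG",
}
master = pd.DataFrame({"code": ["CG"], "state": ["Chhattisgarh"]})

raw = pd.DataFrame({"state": ["Chattisgarh", "Chhatisgarh", "Chhattisgarh"]})
raw["code"] = raw["state"].map(variant_to_code)

# Fetch the correct spelling from the master table via the code;
# the right-hand "state" column (canonical) keeps the bare name.
clean = raw.merge(master, on="code", suffixes=("_raw", ""))
```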
Although, in my opinion, building systems that restrict the user’s freedom to spell the geography, and instead provide a pre-populated drop-down list, would wipe out the problem altogether.
3. Collating data from different file formats
I faced this problem for almost half of my total tenure of working with data. I had to work with data split across different tables, each in a different format; for example, one was an SQL table and the other came from an xlsx file.
There wasn’t much to do here but to be certain of not missing out on any part of the whole, i.e. collating the sub-divided data.
We can’t control everything; sometimes it is about working around it.
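A hedged sketch of collating the two formats with pandas: an in-memory SQLite table stands in for the real database, and a plain DataFrame stands in for `pd.read_excel("part2.xlsx")`, since the actual files aren't available:

```python
import sqlite3

import pandas as pd

# One part of the data lives in a SQL table (simulated here in-memory).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 120.0), ("South", 95.5)],
)
conn.commit()

from_sql = pd.read_sql("SELECT region, amount FROM sales", conn)

# The other part would come from pd.read_excel("part2.xlsx");
# a literal DataFrame stands in to keep this runnable.
from_xlsx = pd.DataFrame({"region": ["East", "West"],
                          "amount": [110.0, 87.25]})

# Collate the sub-divided data into a single table, with column names
# and dtypes aligned so nothing is silently dropped.
combined = pd.concat([from_sql, from_xlsx], ignore_index=True)
```

Aligning column names and dtypes before the `concat` is the part that guards against missing "parts of the whole".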
4. Linguistic barrier
This was one interesting instance, where the linguistic and dialect diversity of the Indian subcontinent had a direct impact on data consistency. For example, Thiruvananthapuram, a city in the state of Kerala also spelled Tiruvanantapuram, made my days more anxious than usual. Working with geographical jurisdiction data at the District level, and further down to the Town level for urban areas and the Village level for rural areas, posed a big challenge, especially when reconciling the data.
The problem was dealt with in more than one way, a fuzzy-logic spelling checker being the stepping stone. We maintained a scoring system to reconcile the data against proposed correct spellings, which solved close to 60% of the problem; the remaining 40% was dealt with manually.
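The article doesn't name the fuzzy matcher used, so here is one possible scoring approach using Python's standard-library `difflib`; the canonical list and the 0.8 cutoff are illustrative assumptions:

```python
from difflib import SequenceMatcher, get_close_matches

# Canonical spellings act as the reference list (illustrative subset).
CANONICAL = ["Thiruvananthapuram", "Kochi", "Kozhikode"]

def reconcile(name: str, cutoff: float = 0.8):
    """Return (best match, similarity score), or (None, score) if the
    best score falls below the cutoff and needs manual review."""
    # cutoff=0.0 so we always get the single best candidate back.
    best = get_close_matches(name, CANONICAL, n=1, cutoff=0.0)[0]
    score = SequenceMatcher(None, name.lower(), best.lower()).ratio()
    return (best, score) if score >= cutoff else (None, score)
```

Anything the scorer rejects falls into the manual-review bucket, mirroring the 60/40 split described above.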
5. The most common, missing data
In my very recent experience, I faced the missing data problem for various data types and sizes.
We were using a key-value pair system to get away from problems described earlier, like the different nomenclature and/or the linguistic barrier. It was when we found out that at least 35% of the values were missing that I had to come up with a pattern recognition algorithm to impute the missing values while maintaining the sanctity of the very mechanism, i.e. keeping every key unique.
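The pattern recognition algorithm itself isn't described, so this sketch only illustrates the uniqueness constraint: imputing missing keys with generated values that cannot collide with the real ones. The key format and prefix are invented for illustration:

```python
from itertools import count

import pandas as pd

# Hypothetical records where ~40% of keys are missing.
records = pd.DataFrame({
    "key": ["KL-TVM-001", "KL-EKM-002", None, "BR-PAT-004", None],
    "value": [10, 20, 30, 40, 50],
})

existing = set(records["key"].dropna())
ids = count(1)

def next_key() -> str:
    # Synthetic keys get a distinct prefix so they can never collide
    # with real keys; the set check keeps imputed keys unique too.
    for i in ids:
        candidate = f"IMPUTED-{i:03d}"
        if candidate not in existing:
            existing.add(candidate)
            return candidate

mask = records["key"].isna()
records.loc[mask, "key"] = [next_key() for _ in range(mask.sum())]
```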
For other problems of the same nature, we referred back to the Data Engineering team to rebuild the data pipelines, more than once, using various sources of data to arrive at a Single Source of Truth.
6. Not so flawless data architecture
In my opinion, if the data collection, flow, and usage are not standardized, it leads to delays in getting relevant, useful data. That prompted me to write about two instances that occurred because of this very problem.
First, imagine getting data points meant to be collected in column ‘x’ from column ‘y’. Making architectural changes would have been a mammoth activity, so instead we built optimized data models that updated automatically every time a new batch of data came in.
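A hypothetical sketch of that remapping step in pandas; the column names and the fix table are invented for illustration, not taken from the actual data models:

```python
import pandas as pd

# Hypothetical mix-up: values meant for "district" arrived in "remarks".
# Rather than change the architecture, remap columns in the model layer
# every time a new batch lands.
COLUMN_FIXES = {"remarks": "district"}  # wrong column -> intended column

def build_model(batch: pd.DataFrame) -> pd.DataFrame:
    """Apply the known column remappings to an incoming batch."""
    return batch.rename(columns=COLUMN_FIXES)

batch = pd.DataFrame({"remarks": ["Patna"], "amount": [100]})
model = build_model(batch)
```

Running `build_model` on each incoming batch is the "automated update" part: the fix lives in one place instead of in the source architecture.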
Second, not receiving any data for one column, with the sword of delivery hanging over our heads. That meant a workaround to meet the deadline: we did the math and derived the value of the missing column from the columns we had. The values we later received for that column matched what we had derived.
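A toy example of deriving a missing column from the ones received; the ledger columns and the arithmetic are hypothetical, standing in for whatever relation held in the actual data:

```python
import pandas as pd

# Hypothetical case: "closing_balance" never arrived, but it is fully
# determined by the columns we did receive, so we derived it.
ledger = pd.DataFrame({
    "opening_balance": [100.0, 250.0],
    "credits": [50.0, 0.0],
    "debits": [30.0, 100.0],
})
ledger["closing_balance"] = (
    ledger["opening_balance"] + ledger["credits"] - ledger["debits"]
)
```

This only works when the missing column is a deterministic function of the received ones, which is exactly what made the workaround safe to ship.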
7. Different date formats
Honestly, I cringed as I wrote this header.
Referring back to point 3: since the data came in different file formats, it was no surprise that the date formats for a single variable differed as well.
There wasn’t much we could do but cast and/or convert the column while collating the data. However, in visualization tools the problem did not seem as grave; making the formats compatible was a matter of a few clicks.
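One way to sketch the cast/convert step with Python's standard library; the list of formats is illustrative, not the actual set encountered:

```python
from datetime import datetime

# Date formats observed across the source files (illustrative list).
# Order matters for ambiguous strings like "01-02-2021".
KNOWN_FORMATS = ["%d-%m-%Y", "%Y/%m/%d", "%m/%d/%Y"]

def parse_date(raw: str) -> datetime:
    """Try each known format in turn; fail loudly on anything new."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {raw!r}")
```

Raising on unknown formats, rather than guessing, surfaces any new variant the next batch introduces instead of silently mis-parsing it.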
In conclusion, I believe cleaning and organizing unstructured data is essential to delivering quality results. I hope these excerpts provided relatable or insightful instances of a few among the many problems that exist in the real world.