What I did when I had to work with unstructured data?

Palak Last Updated : 14 Dec, 2020

4 min read

This article was published as a part of the Data Science Blogathon.

Introduction

I am sure those of you working with data in any capacity has been faced with the problem of working with bad data, inconsistent data, unclean data, call it as you may. For those working afar from data, as per a report by Forbes, 60% of any data professional’s time is spent in cleaning and organizing unstructured data. Yes, that’s a lot of time invested but in my opinion, is the foundation to derive results.

Here I curate 7 instances from my personal experience of working with unstructured data for almost 3 years now. I hope many of you reading can relate with at least a few.

1. Absence of Unique Identifier

To join two or more tables, normally there exists a common column, variable, parameter, etc that can be used to join the tables. Here I was faced with the problem of the absence of a unique identifier or we may say the absence of a primary and foreign key. Which meant trouble, more specifically while reporting on parameters across tables.

To get ahead of the problem, we created our key by concatenating values from certain columns and thus, building a table with columns of interest on that. Sounds chaotic? yes; Got us results? also yes.

2. Different nomenclature

While working with unstructured geographical data, I came across different spellings of a single geographical jurisdiction. For example, Chhattisgarh (a state in India) was spelled as Chattisgarh, Chhatisgarh, and Chhattisgarh. Which again meant trouble.

To tackle the problem, we created an interim layer of mapping, allocating codes to the string value and thus fetching the correct spelling from the master table via codes on the main data to get going.

Although in my opinion building systems that restrict user’s choice to spell the geography and instead of providing them with a pre-populated drop-down list would mean wiping the problem altogether.

3.Collating data from different file formats

unstructured data - collating — Image from Johnstone High School

I have been faced with this problem for almost half of my total tenure of working with data. I had to work with data bifurcated into different tables, all with different formats. For example, one was an SQL file and the other came from an xlsx file.

There wasn’t much to do here, but to be certain of not missing out on all parts of a whole, i.e. collating the sub-divided data.

We can’t control everything, sometimes it is about working around.

4. Linguistic barrier

This was one interesting instance, where linguistic and dialect diversity of the Indian subcontinent had a direct impact on data consistency. For example, Thiruvananthapuram, a city in the state of Kerala also spelled as Tiruvanantapuram made my days more anxious than usual. While working on geographical jurisdiction data of District and further Town level for urban areas and Village level for rural areas, posed a big challenge especially to reconcile the data.

The problem was dealt with in more than one way, fuzzy logic spelling checker being the stepping stone which we sought out to maintain a scoring system to reconcile the data on the proposed correct spellings, solving close to 60% of the problem. While the rest 40% was dealt with manually.

5. The most common, missing data

In my very recent experience, I faced the missing data problem for various data types and sizes.

We were using a key-value pair system to get away from problems defined earlier like the different nomenclature and/or linguistic barrier, etc. It was when we found out that at least 35% of values were missing, that I had to come up with a pattern recognition algorithm to impute the missing values while maintaining the sanctity of the very mechanism, i.e. keeping it unique.

While in other problems of the same nature we referred back to the Data Engineering team to rebuilt data pipelines more than once using various sources of data to come up with a Single Source of Truth.

6. Not so flawless data architecture

Unstructured Data - Not so Flawless architecture — Image from The Data Modelling Hub

In my opinion, if the data collection, flow, and usage are not standardized it leads to delays in getting relevant, useful data. This has got me to write about two instances that occurred because of the same problem.

First, Imagine getting data points meant to be collected in column ‘x’ from column ‘y’. Now it was a mammoth activity to make architectural changes, so instead, we made optimized data models that were automated to update every time a new batch of data came in.

Second, Not receiving any data for one column with the sword of delivery hanging on heads. This meant a work-around to meet the deadline, we sought out to do the Math and used the Derivative logic to get the value for the missing column. Although, we later received the value for the column which was what we derived.

7. Different date formats

Honestly, as I was writing the header I cringed

Referencing point 3 here, since the data came in different file formats it was no surprise that the column formats for one variable were different as well.

There wasn’t much we could do but to cast and/or convert while collating the data. However, while using visualization tools the problem did not seem as grave and was only a matter of few clicks to make it relevant and compatible to use.

End Notes

In conclusion, I believe cleaning and organizing the unstructured data is essential to deliver quality results. I hope these excerpts provided relatable or insightful instances of a few amongst the many problems that exist in the real world.

blogathon

Palak

Deep Learning Intermediate Unstructured Data

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

What I did when I had to work with unstructured data?

Introduction

1. Absence of Unique Identifier

2. Different nomenclature

3.Collating data from different file formats

4. Linguistic barrier

5. The most common, missing data

6. Not so flawless data architecture

7. Different date formats

End Notes

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Congratulations, You Did It!

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#