Combating Data Inconsistencies with SQL

Abhishek Singh 28 Jun, 2022 • 7 min read

This article was published as a part of the Data Science Blogathon.

SQL

Photo by Sebastian Herrmann on Unsplash

Introduction

As you know from my previous posts, I’ve started a mailing list to promote my blog. In the last post in this series, we designed a way to answer the question of what’s causing folks to subscribe to my blog by matching an opt-in with a source. If you haven’t checked that out already, you can find that here. This time we take a second look at our subscriber history data and question its validity. For what we’ll be doing today you’ll need to step into Sherlock Homes’ shoes, so time to put on your detective caps and solve a mystery!

Problem

So like any good analyst, I was validating the subscriber_feed_history table we built together here which gives me the history of who subscribed to my mailing list. That’s when I noticed some inconsistencies with the table, especially when compared to the actual truth. If you remember the source of this table is the mailing_list feed which is a complete list of users that are opted-in to my mailing list on a particular day.

Well turns out this subscriber_feed is not as reliable as I thought. When I use the subscriber_feed_history table to see how many users are opted-in on a particular day and match that with the email_sent_history (a table derived from logs of email sends), they don’t align 100%. This is odd as you’d assume only users that are opted-in on a particular day will be sent an email on that day but turns out that’s not the case. Welcome to the world of data!

Data

So here are the two tables we’re working with today:

  • subscriber_feed_history : Table with history of users’ mailing_list_status(In/Out) by day
date user_id status
2/2 456 O
1/31 123 O
1/30 777 I
1/28 123 I
1/27 456 I
1/26 789 I
1/25 606 I

 

  • email_sent_history: Table with history of users that were sent an email by the day
date user_id delivery_status
2/1 777 Y
2/1 123 Y
2/1 789 Y
2/1 808 Y

A small caveat here is that because I don’t send emails daily, the email_sent_history table won’t have all the dates but just the ones when an email was sent.

STOP! Before reading ahead here’s where I want you to take some time and think about the problem yourself before reading my approach.

My Approach

Now let’s think about how we can approach this problem in steps. First, we need to determine all the users who were subscribed on the day an email was sent since an email is not being sent daily. Once we have this we can compare this with the email sent list for the day to determine two things: (1) Emails that were sent an email but don’t look to be opted in as of that day, (2) Emails that were opted-in on that day but weren’t sent an email. Finally, once we have this figured out, the final step would be to alter the subscriber_feed_history table to accurately account for these inconsistencies.

Now to get the emails that were opted-in on a particular day, we can’t leverage subscriber_feed_history directly, we need an expiry date to check the duration a user was opted in for. For this purpose, we built ourselves the subscriber_feed_history_view which adds an expiry date to the table. See here for how to do this. For the above users, the subscriber_feed_history_view looks something like this:

start_date end_date user_id mailing_list_status
1/30 12/31/9999 777 I
1/28 1/31 123 I
1/27 2/2 456 I
1/26 12/31/9999 789 I
1/25 12/31/9999 606 I

Now that we have this figured out, the next step is to compare the email_sent_history with the subscriber_feed_history_view to see what’s going on. The way we do that is by doing a FULL JOIN between the users that were opted-in and were sent an email to get both sides of the coin. Remember you’re JOINing to a user’s mailing_list_status at the time the emails were sent.

Here’s how you do that in SQL:

-- email sent history
WITH email_sent_history AS (
SELECT date AS sent_date, user_id
FROM `email_sent_history`
WHERE delivery_status = "Y"
),
-- users that were subscribed & the duration
opt_in_subscribers AS (
SELECT start_date, end_date, user_id
FROM `subscriber_feed_history_view`
WHERE mailing_list_status = "I"
)
-- FULL JOIN: to not miss info from any side,
-- email sent & status joined to get users who 
-- were subscribed at the time an email was sent 
SELECT sent_date, 
opt_in_subscribers.user_id AS opt_in_users,
email_sent_history.user_id AS email_sent_users
FROM opt_in_subscribers 
FULL OUTER JOIN email_sent_history
ON opt_in_subscribers.user_id = email_sent_history.user_id
AND sent_date BETWEEN start_date AND end_date

And here’s what this gets us:

sent_date opt_in_users email_sent_users scenario
2/1 777 777
2/1 null 123 Out & Sent
2/1 456 null In & Not sent
2/1 789 789
2/1 null 808 Never In & Sent
2/1 606 null Never Out & Not sent

From here we see users that have a Null value in the opt_in_users field were sent an email but not opted-in and vice-versa for Null in email_sent_users. Now that we’ve figured out a way to get to the inconsistencies, the next step is to think about what do we do with them? And how do we fix the subscriber_feed_history?

Let’s start with thinking about users which were sent an email but aren’t opted-in. Well the logical thing is to opt them in, right? But when? Like on what date? There are two scenarios here: (1) Invalid opt-out: user=123 has an unsubscribe event recorded on 1/31 before the email sent on 2/1, and (2) Missed opt-in: user=808 which from the looks of it seems to have never opted-in but was sent an email. Now for the first case, you may consider the opt-out captured as invalid and delete it but you’ve to be sure here. Or implement a fix that works in both cases which is to opt-in these emails on the day they were sent an email to fix our subscriber_feed_history. This would entail adding an opt-in record for these users to the subscriber_feed_history and here’s how you do that:

INSERT INTO `subscriber_feed_history`
WITH email_sent_history AS (
SELECT date AS sent_date, user_id
FROM `email_sent_history`
WHERE delivery_status = "Y"
),
opt_in_subscribers AS (
SELECT start_date, end_date, user_id
FROM `subscriber_feed_history_view`
WHERE mailing_list_status = "I"
),
-- LEFT JOIN with WHERE to only get users that were 
-- sent an email but not subscribed
sent_but_not_in AS (
SELECT sent_date, user_id
FROM email_sent_history 
LEFT JOIN opt_in_subscribers
ON opt_in_subscribers.user_id = email_sent_history.user_id
AND sent_date BETWEEN start_date AND end_date
WHERE opt_in_subscribers.user_id IS NULL
)
-- INSERT an Opt-In record in history table for these
SELECT sent_date AS date, user_id, "I" AS status
FROM sent_but_not_in

On the other hand, users that appear to be opted-in but weren’t sent an email, must’ve opted out, right? (We make an assumption here which we’ll revisit in the future). Let’s think about the two scenarios here: (1) Invalid opt-out: user_id=456 seems to have unsubscribed on 2/2 but wasn’t a part of the 2/1 send. (2) Missed opt-out: user_id=606 never seems to have unsubscribed but wasn’t a part of the 2/1 email. The real question is when did they unsubscribe? Given the data we have, I’d say our best estimate would be the last email sent date. For example, if user_id=606 was a part of the emails before 2/1, let’s say 1/25, it must’ve unsubscribed between these two dates, and for the lack of better knowledge we can assume it did on 1/26, a day after the last send. So to apply this fix to subscriber_feed_history we add an opt-out event with a new date as follows:

INSERT INTO `subscriber_feed_history`
WITH email_sent_history AS (
SELECT date AS sent_date, user_id
FROM `email_sent_history`
WHERE delivery_status = "Y"
),
opt_in_subscribers AS (
SELECT start_date, end_date, user_id
FROM `subscriber_feed_history_view`
WHERE mailing_list_status = "I"
),
-- LEFT JOIN with WHERE to only get users that were 
-- subscribed but not sent an email
in_but_not_sent AS (
SELECT user_id
FROM opt_in_subscribers 
LEFT JOIN email_sent_history
ON opt_in_subscribers.user_id = email_sent_history.user_id
AND sent_date BETWEEN start_date AND end_date
WHERE email_sent_history.user_id IS NULL
),
-- every user's latest email sent date
users_last_sent_date AS (
SELECT user_id, MAX(date) AS last_sent_date, 
FROM `email_sent_history`
WHERE delivery_status = "Y"
GROUP BY user_id
)
-- INSERT an Opt-Out record in history for these
SELECT last_sent_date AS date, user_id, "O" AS status
FROM in_but_not_sent JOIN users_last_sent_date
USING(user_id)

Conclusion

So here’s what we accomplished today, we started by identifying some inconsistencies in our email subscriber history table that we’d built. Where the list of users who are subscribed to my mailing list didn’t align with the list of users that were sent a promotional email. To combat this, we start by comparing the two data sources(subscriber & email sent history) to identify users that were subscribed but not sent an email and vice versa. Then we design an approach for each of these cases to fix our original table by accounting for these inconsistencies. And voila by tweaking our subscriber_feed_history it’s now aligned with the real truth. Below are some key takeaways.

  • Always take a second look at the source of your data & question its validity
  • Whenever possible validate your data against other data that goes with it
  • A FULL OUTER join in SQL is a great way to get both sides of the data
  • Before jumping on a solution, think about all the scenarios in your data & design accordingly

Food for Thought

  • Why would the source data not be reliable? What do you do in such cases?
  • This fix works because we’re going back in time to fix historical data, what do we do if this is an ongoing issue?
  • How would you figure out the issue with the opted-out but was sent an email case?
  • For users that appear to be opted-in but weren’t sent an email, what may be an explanation?
  • For users that appear to be opted-in but weren’t sent an email but have a corresponding opt-out later(user_id=456), what would the history look like with our approach?
SQL

                                                                             Photo by LinkedIn Sales Solutions(Left) & Ben White(Right) on Unsplash

Let me know how you’d approach it. If you found this helpful share it. If you’re into this, you can find me on Twitter @abhishek27297 where I talk data.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Abhishek Singh 28 Jun 2022

Frequently Asked Questions

Lorem ipsum dolor sit amet, consectetur adipiscing elit,

Responses From Readers

Clear