The Art of Query Building: Data Problems to SQL Queries

Abhishek Singh 30 Apr, 2023
8 min read

Introduction

SQL is an important tool for anyone working with data. It will always come in handy if you need to pull relevant information or perform advanced analytics. But have you ever encountered a data problem and had no idea how to write the SQL to solve it? If you have been through it, don’t you think having a step-by-step approach for transforming a problem statement into a well-structured SQL query will be nice? If your answer is yes, then look no further. In this blog, we’ll explore using SQL keywords to guide your thought process and streamline your query building, helping you easily tackle even the most complex data problems.

SQL | SQL queries | SQL keyword
Learning Objectives 

  • Understand how data flows through a SQL query and use this to solve data problems.
  • Transform data problems into SQL queries using a keyword-based approach.
  • Dos and Don’ts when it comes to SQL keywords.
  • Finally, we’ll go through an example of using the underlying approach.

This article was published as a part of the Data Science Blogathon.

Table of Contents

TABLE: Where Is My Data?

First, I like to start by considering all the tables I need in the query. You can do this by considering all the fields that will be needed to get the desired result, and then we can find them. An important thing to note is that multiple tables may have the same field. For example, user data can be present in multiple tables with different levels of aggregations. Hence, knowing what grain you want to pull in the results is essential. When building the query, I want you to pick one table, go through the steps, and return to the table. Also, if any array fields are needed in the table, now is a good time to unpack them.

FROM table_name LEFT JOIN UNNEST(table_array) AS array

WHERE: What I Don’t Want?

Now that you know where your data is coming from, it’s time to know what information you need and, more importantly, what you don’t need from the table. So if the table has a partition or if the query demands filtering a certain type of record, now is the time to use it. Also, I need you to look at all fields in a table and think about all possible ways to filter your data here. You should really push yourself to add more filters.

To put it simply, the lesser data your query sees, the better it performs and avoids mistakes. Further, we often skip obvious filters as they seem too trivial; for example, if you’ve filtered on the partition date, it might still have multiple dates, so look for other date fields and add the filter.

WHERE partition_field = "date_value" 
 AND col1 = "xyz"
 AND col2 IS NOT NULL
 ...

GROUP BY: What’s the Grain?

Before you SELECT anything, I’d recommend doing a GROUP BY.  This is because having this first will often constrain what you select in your query. You can no longer do a `SELECT *`, which rarely makes sense. This will also leave out duplicate records before anything, and trust me; you don’t want duplicates flowing through your query as it’s difficult to determine their origin later. This also forces you to perform aggregations.

You often don’t need a field but only the aggregated value. Having this out of the way is helpful so that the rest of the query sees lesser data. So I’d recommend having a GROUP BY in your query for every table; even if it’s not explicitly needed, it’s an excellent way to avoid duplicates and only pulls in relevant data for the query.

SELECT col1, col2
FROM table_name
GROUP BY col1, col2

SELECT: What Do I Actually Want?

After doing all the work above, you can now think about what fields you’ll actually pull from the specific table. If you have followed the above steps, the scope of the fields has already been reduced to the fields that are needed for the specific results.

A `SELECT *` slows down your query and may lead to incorrect results, as you may end up with extra records. The only time you should do it is when you’re trying to do a preview of all the fields in a table. On the contrary, selecting fewer fields first and then adding them later when needed is also feasible.

CASE: Conditions

A case statement is SQL’s way of making IF-ELSE statements. These enable you to capture complex logic and show SQL’s real ability. In addition to using CASE statements for traditional applications, you should also use them to alter fields before selection. For example, if you’re not concerned about a field’s specific value but only want a discrete value like Y/N, this is the time to convert the field using CASE statements.

One thing to note here is always to have an ELSE condition that tells SQL what to do if none of your conditions are met. We’re often confident that we’ve covered all the scenarios in our CASE statement, but data always surprises us. Hence it’s better to have an ELSE condition to avoid unknown behavior.  Personally, I like to add `ELSE NULL` so I can see that data didn’t fall into any of my expected scenarios.

CASE WHEN col = "value" THEN "Y" ELSE "N" END AS new_col

Aggregations (Level 1): The Math

In this article, we’ll be talking about aggregations twice. At first, you should only worry about aggregations at a single table level. These are usually math-based, like sum, average, max, and min, or count-based. One thing to note for counts is that in 99% of the cases, you’d want to do a `COUNT(DISTINCT field_name)` instead of a regular `COUNT(field_name)` as the latter gives you a record count with duplicates in the specific field. A useful strategy is combining aggregations and CASE statements to capture complex logic in an easy manner. For example, building a purchase_indicator using the total transaction amount as below.

CASE WHEN SUM(amount) > 0 THEN "Y" ELSE "N" END AS agg_purchase

ALIAS: Nicknames

This may seem trivial, but this step is important for readability and writing correct queries. This is because many times, you’ll be deep down in your query looking for a field and not know what it is called. Hence it’s essential to make these worthwhile. Also, rather than using aliases for aggregated or derived fields, it’s helpful to use them for renaming fields with long or funky names in the table. In this way, even though you cannot do anything to the actual table, you can still call it something easy to work with in your own query.

Now if the query you’re building only uses a single table, this is where you stop. However, in most cases, there’ll be more than one table, so you can read further.

CTE: Building Blocks

CTEs or Common Table Expressions can build a temporary table inside your query without creating a real table. These are most useful for compartmentalizing your SQL query. This helps you to think clearly as every element becomes a separate table that can be combined.

At this point, you should put together all the above steps and wrap it in a CTE as done below. These also help in making changes to the query; for example, if you’re trying to edit certain conditions on a table, you can directly go to the associated CTE and make the change, enabling your change to cascade to the rest of your query.

WITH table_cte AS (
 SELECT col1, array.col2 AS col2_alias,
  CASE WHEN SUM(col3) > 0 THEN "Y" ELSE "N" END AS col3_alias
 FROM table_name 
  LEFT JOIN UNNEST(table_array) AS array  
 WHERE col4 = "value"
 GROUP BY col1, array.col2
)

Now go back to TABLEs and repeat the steps above for any other tables in your query.

JOINs: Watch Out

Finally, we come to JOINs that may be required in your query. However, if you’ve followed these steps, you won’t be joining tables but will join CTEs of individual tables. A note of caution, JOINs can be tricky as they can be the number one cause of issues in SQL queries. These are malignant because if there’s an issue in your JOIN, your query won’t error out, but the results will be incorrect, making these hard to spot. This is advisable to spend extra time here to ensure you’re doing what’s expected. The extra step of validating the joint fields to ensure they align in values and types is also beneficial.

SELECT col1, col2
FROM cte1 AS c1 JOIN cte2 AS c2
ON c1.col1 = c2.col1
GROUP BY col1, col2

Aggregations (Level 2): More Math

Now is the time to combine the metrics in the final result by aggregating the JOIN results. Because these will make our final results, it’s useful to throw in things like final aliases and FORMAT that make sure the numbers are readable with the appropriate commas.

SELECT FORMAT("%'d", SUM(amount)) AS total_amount

ORDER BY: Make it Pretty

Ordering the results should always be saved for the last, as this can’t go in any CTE or subquery. The only time this can be avoided is when your query will be a production query where results are used and not read by someone. Otherwise, adding an `ORDER BY` is helpful, even if not explicitly required, as it will make reading the results much more accessible. Also, you can use fields here and CASE statements to allow for custom ordering of results.

ORDER BY col1, CASE WHEN col2>col3 THEN col2 ELSE col3 END

LIMIT: Make it Digestible

Finally, if the plan with the query is to export or use the results to drive another calculation, you can skip this. However, in other cases, having the LIMIT clause is a must, which will only return a certain number of records, making it easier for you and your SQL engine. If you forget this and your query is about to return a million rows, your query will fail even without errors.

LIMIT 100
"SQL | SQL queries | SQL keyword

Putting It All Together

So let’s use our newly gained skills as an example. If you need more examples of queries with data and stories, head to my blog here.

The problem: We have an e-commerce store, and the marketing team wants a report of users who’ve not made a purchase in the last month. The state should break this down the user is in and the last interaction they had on the website.

WITH user_demographics AS (
 SELECT user_id, address.state AS state
 FROM demographics 
  LEFT JOIN UNNEST(address) AS address  
 WHERE country = "USA"
 GROUP BY user_id, address.state
),

user_purchases AS (
 SELECT user_id, 
  CASE WHEN SUM(amount) > 0 THEN "Y" ELSE "N" END AS agg_purchase
 WHERE date > DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH)
 FROM transactions
 GROUP BY user_id
),

user_clickstream AS (
 SELECT * EXCEPT(rnk) 
 FROM (
   SELECT user_id, event,
    RANK() OVER(PARITION BY user_id, event ORDER BY date DESC) AS rnk
   FROM clickstream
  ) t
 WHERE t.rnk = 1 
),

user_no_pruchases AS (
 SELECT a.*
 FROM user_demographics a 
  LEFT JOIN user_purchases b
 ON a.user_id = b.user_id
 WHERE (b.user_id IS NULL OR agg_purchase = "N")
),

user_no_purchase_events AS (
 SELECT user_id, state, event
 FROM user_no_pruchases JOIN user_clickstream
 USING(user_id)
 GROUP BY user_id, state, event
)

SELECT state, event, COUNT(DISTINCT user_id) AS user_count
FROM user_no_purchase_events
GROUP BY state, event
ORDER BY state, event
LIMIT 100

Conclusion

Here’s what we learned today:

  • We started by visiting the importance of SQL and building queries to solve business problems.
  • Then we delved into a step-by-step approach that leverages SQL keywords to transform data problems into queries.
  • In this, we highlighted common mistakes that go along with SQL keywords, for example, not having an `ELSE NULL` in a CASE statement.
  • We also reviewed best practices when writing SQL queries, including `GROUP BY`, to prevent duplicates.
  • Finally, we discussed an approach to query building using CTEs to compartmentalize your query.

Following these steps, you can transform any business problem into a SQL query that yields desired results.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Abhishek Singh 30 Apr, 2023

Frequently Asked Questions

Lorem ipsum dolor sit amet, consectetur adipiscing elit,

Responses From Readers