SQL Query: Coding Question Asked by Microsoft and Facebook

Last Updated: 22 Sep, 2022

This article was published as a part of the Data Science Blogathon.

Introduction

SQL proficiency is crucial in data science. In this article, we'll work through two SQL questions that product companies use to screen applicants for data scientist roles. Both questions come from the StrataScratch website.

StrataScratch is an excellent tool for anyone wishing to get started in data science and improve their SQL and Python skills. The platform offers coding questions as well as non-coding topics related to data science, such as statistics and probability. I strongly advise you to create an account on the StrataScratch website and practice the questions as you read the article. To solve these problems, I will use a PostgreSQL database.

If you know SQL well, you stand a better chance of clearing data science interviews and of handling day-to-day tasks efficiently. This article focuses on the approach to solving each problem, so that after reading it you will have a clearer idea of how to break a question down and build up its solution. Let's move on to the questions.

Part 1: Premium vs. Freemium Asked by Microsoft

Find the total number of downloads for paying and non-paying users by date. Include only records where non-paying customers have more downloads than paying customers. The output should be sorted by earliest date first and contain three columns: date, non-paying downloads, and paying downloads.

Interview Question Date: November 2020, Company: Microsoft, Difficulty Level: Medium, Interview Question ID: 10300, Tables: ms_user_dimension (fields: user_id (int), acc_id (int)), ms_acc_dimension (fields: acc_id (int), paying_customer (varchar)), ms_download_facts (fields: date (datetime), user_id (int), downloads (int))

Preview of table ms_user_dimension:

Preview of table ms_acc_dimension:

Preview of table ms_download_facts:

Approach:

Three tables are provided here. To solve the problem, we must determine the number of daily downloads made by paying and non-paying customers. The problem can be divided into three sections. In the first section, we join all of the tables. In the second, we calculate the number of paid and non-paid downloads for each date. Finally, we show only the records with more non-paid downloads than paid downloads.

Step 1: Join Tables

To begin, we must combine the records from the three tables. The field user_id is common to ms_user_dimension and ms_download_facts, and acc_id is common to ms_user_dimension and ms_acc_dimension. We can therefore use inner joins to connect all three tables and select the columns needed to shape the output: date, downloads, and paying_customer. The code for step 1 is given below.

select date, downloads, paying_customer
from ms_user_dimension
inner join ms_acc_dimension
on ms_user_dimension.acc_id = ms_acc_dimension.acc_id
inner join ms_download_facts
on ms_user_dimension.user_id = ms_download_facts.user_id;
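
As a rough illustration of the join, the sketch below runs the same query shape against a few made-up rows. It assumes Python's built-in sqlite3 module as a stand-in for PostgreSQL. The sample values and the table aliases u, a, and d are assumptions for illustration and are not part of the original question.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table ms_user_dimension (user_id int, acc_id int);
    create table ms_acc_dimension (acc_id int, paying_customer varchar);
    create table ms_download_facts (date text, user_id int, downloads int);

    -- Hypothetical sample rows; the real StrataScratch data differs.
    insert into ms_user_dimension values (1, 10), (2, 20);
    insert into ms_acc_dimension values (10, 'yes'), (20, 'no');
    insert into ms_download_facts values
        ('2020-08-15', 1, 5),
        ('2020-08-15', 2, 7),
        ('2020-08-16', 2, 3);
""")

# Each download row picks up the paying_customer flag of the user's account.
joined_rows = conn.execute("""
    select d.date, d.downloads, a.paying_customer
    from ms_user_dimension u
    inner join ms_acc_dimension a on u.acc_id = a.acc_id
    inner join ms_download_facts d on u.user_id = d.user_id
    order by d.date, d.downloads
""").fetchall()

for row in joined_rows:
    print(row)
```

Every row of ms_download_facts appears once in the result because, in this sample, each user maps to exactly one account.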

Step 2: Calculate Paid and Non-paid Downloads for Each Date

After combining all records, we need to display three columns: the date, the total number of paid downloads, and the total number of non-paid downloads. Because we want the sum of downloads for each date, we group by date. We also need a case expression to report the downloads of paying and non-paying customers in separate fields. A case expression in SQL works much like an if-else condition: inside each sum, it keeps a row's downloads only when paying_customer has the value we are interested in. The code for step 2 is given below.

select date,
sum(case when paying_customer = 'yes' then downloads end) as paid_downloads,
sum(case when paying_customer = 'no' then downloads end) as non_paid_downloads
from ms_user_dimension  inner join ms_acc_dimension
on ms_user_dimension.acc_id = ms_acc_dimension.acc_id
inner join ms_download_facts
on  ms_user_dimension.user_id = ms_download_facts.user_id
group by date;
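
One detail worth knowing about this pattern: a case expression without an else branch yields NULL for non-matching rows, so a date with no paying downloads gets a NULL total rather than 0. The sketch below shows the difference on made-up data. It assumes Python's sqlite3 module in place of PostgreSQL, and the flattened table daily and the column names paid_null and paid_zero are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical flattened table standing in for the joined result.
    create table daily (date text, paying_customer varchar, downloads int);
    -- 2020-08-16 has no paying downloads at all.
    insert into daily values
        ('2020-08-15', 'yes', 5),
        ('2020-08-15', 'no', 7),
        ('2020-08-16', 'no', 3);
""")

totals = conn.execute("""
    select date,
           -- no else branch: non-matching rows contribute NULL
           sum(case when paying_customer = 'yes' then downloads end) as paid_null,
           -- else 0: non-matching rows contribute 0
           sum(case when paying_customer = 'yes' then downloads else 0 end) as paid_zero
    from daily
    group by date
    order by date
""").fetchall()

for row in totals:
    print(row)
```

Whether NULL or 0 is preferable depends on how later steps should treat dates that have only one kind of customer.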

Step 3: Filter the Records and Format the Result as Specified

We must display only those records for which the number of non-paid downloads exceeds the number of paid downloads. Both totals are calculated with an aggregate function, so we cannot filter on them in the where clause, which is evaluated before grouping. Instead, we add the filter to the having clause, which applies to aggregated values. Finally, we sort the result so that the earliest date comes first and list the columns in the requested order. Here is the final code for the query:

select date, 
sum(case when paying_customer = 'no' then downloads end) as non_paid_downloads,
sum(case when paying_customer = 'yes' then downloads end) as paid_downloads
from ms_user_dimension  inner join ms_acc_dimension
on ms_user_dimension.acc_id = ms_acc_dimension.acc_id
inner join ms_download_facts 
on  ms_user_dimension.user_id = ms_download_facts.user_id
group by date
having sum(case when paying_customer = 'no' 
then downloads end) > 
sum(case when paying_customer = 'yes' then downloads end)
order by date;
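
The having clause above repeats both sum expressions. An alternative way to write the same filter, sketched here under assumptions, is to compute the totals once in a common table expression and then filter with an ordinary where clause. The example uses Python's sqlite3 module in place of PostgreSQL. The sample rows, the CTE name daily, and the aliases u, a, and d are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table ms_user_dimension (user_id int, acc_id int);
    create table ms_acc_dimension (acc_id int, paying_customer varchar);
    create table ms_download_facts (date text, user_id int, downloads int);

    -- Hypothetical sample rows: user 1 pays, user 2 does not.
    insert into ms_user_dimension values (1, 10), (2, 20);
    insert into ms_acc_dimension values (10, 'yes'), (20, 'no');
    insert into ms_download_facts values
        ('2020-08-15', 1, 5), ('2020-08-15', 2, 7),
        ('2020-08-16', 1, 9), ('2020-08-16', 2, 3),
        ('2020-08-17', 1, 2), ('2020-08-17', 2, 4);
""")

result = conn.execute("""
    with daily as (
        select d.date,
               sum(case when a.paying_customer = 'no' then d.downloads end) as non_paid_downloads,
               sum(case when a.paying_customer = 'yes' then d.downloads end) as paid_downloads
        from ms_user_dimension u
        inner join ms_acc_dimension a on u.acc_id = a.acc_id
        inner join ms_download_facts d on u.user_id = d.user_id
        group by d.date
    )
    -- The totals are plain columns of the CTE, so where can filter on them.
    select date, non_paid_downloads, paid_downloads
    from daily
    where non_paid_downloads > paid_downloads
    order by date
""").fetchall()

for row in result:
    print(row)
```

On this sample, 2020-08-16 is dropped because its paid total (9) exceeds its non-paid total (3).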

In the preceding query, we saw how to divide the problem into sections and work on each section individually. To answer the question, we used inner joins, a case expression, group by, and the having clause. Let's work on the next query.

Part 2: Highest Energy Consumption Asked by Facebook

This question was asked by Facebook/Meta in one of its interviews. The question name is Highest Energy Consumption. The details of the question are given below:

Highest Energy Consumption

Find the date with the highest total energy consumption from the Meta/Facebook data centers. Output the date along with the total energy consumption across all data centers.

Interview Question Date: March 2020, Company: Meta/Facebook, Difficulty Level: Medium, Interview Question ID: 10064, Tables: fb_eu_energy (fields: date (datetime), consumption (int)), fb_asia_energy (fields: date (datetime), consumption (int)), fb_na_energy (fields: date (datetime), consumption (int))

Preview of table fb_eu_energy:

Preview of table fb_asia_energy:

Preview of table fb_na_energy:

Approach: The problem will be divided into three sections. We will combine the records from the tables in the first section. The total energy consumed each day will be determined in the second part. Finally, we must determine the date on which the most energy was consumed and return the result.

Step 1: Join Tables

Because the data is spread across three tables, we must first combine the records from all of them. We cannot use union for this, because union removes duplicate rows and the tables contain identical records. For instance, the record (2020-01-01, 400) appears in both fb_eu_energy and fb_na_energy, so union would keep only one copy and the total for that date would come out too low. We therefore use union all, which keeps every row, including duplicates.

SELECT date, consumption 
FROM  fb_eu_energy 
union all 
SELECT date, consumption FROM fb_asia_energy 
union all 
SELECT date, consumption FROM fb_na_energy;
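
To make the difference concrete, the sketch below totals the consumption for 2020-01-01 twice, once with union and once with union all. It assumes Python's sqlite3 module in place of PostgreSQL and uses only two of the three tables. The (2020-01-01, 400) record comes from the example above, while the other rows and the helper total_for are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table fb_eu_energy (date text, consumption int);
    create table fb_na_energy (date text, consumption int);
    -- (2020-01-01, 400) appears in both tables; other rows are made up.
    insert into fb_eu_energy values ('2020-01-01', 400), ('2020-01-02', 350);
    insert into fb_na_energy values ('2020-01-01', 400), ('2020-01-02', 125);
""")


def total_for(set_operator):
    """Sum the 2020-01-01 consumption after combining the two tables
    with the given set operator ('union' or 'union all')."""
    sql = f"""
        select sum(consumption)
        from (
            select date, consumption from fb_eu_energy
            {set_operator}
            select date, consumption from fb_na_energy
        ) e
        where date = '2020-01-01'
    """
    return conn.execute(sql).fetchone()[0]


union_total = total_for("union")          # duplicate row collapsed
union_all_total = total_for("union all")  # both rows kept
print(union_total, union_all_total)
```

With union, the two identical rows collapse into one and the total is 400. With union all, both rows are counted and the total is 800.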

Step 2: Calculate the Total Amount of Energy Consumed for Each Day
After combining all records, we calculate the total energy consumption for each day by grouping on date and summing the consumption column.

select date, sum(consumption) as total_consumption
from (
          SELECT date, consumption FROM fb_eu_energy 
          union all 
          SELECT date, consumption FROM fb_asia_energy 
          union all 
          SELECT date, consumption FROM fb_na_energy
        )E
group by date;

Step 3: Filter the Records and Format the Result as Specified

Now we must shape the query result as the question requires: the date with the highest energy consumption across all data centers, along with that total. To arrange the rows in descending order of total consumption, we use the order by clause on the summed consumption. The first row then holds the date with the highest energy consumption, and limit 1 restricts the output to that row.

select date, sum(consumption) as total_consumption
from (
          SELECT date, consumption FROM fb_eu_energy
          union all
          SELECT date, consumption FROM fb_asia_energy
          union all
          SELECT date, consumption FROM fb_na_energy
        )E
group by date
order by sum(consumption) desc
limit 1;
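
Note that limit 1 returns a single row even when several dates share the highest total. If tied dates should all be reported, one possible variant compares each daily total against the maximum. The sketch below is an assumption-laden illustration, not the article's query: it uses Python's sqlite3 module in place of PostgreSQL, made-up rows chosen so that two dates tie, and a hypothetical CTE named daily.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table fb_eu_energy (date text, consumption int);
    create table fb_asia_energy (date text, consumption int);
    create table fb_na_energy (date text, consumption int);
    -- Hypothetical rows: 2020-01-01 and 2020-01-02 both total 1000.
    insert into fb_eu_energy values
        ('2020-01-01', 400), ('2020-01-02', 500), ('2020-01-03', 100);
    insert into fb_asia_energy values
        ('2020-01-01', 200), ('2020-01-02', 300), ('2020-01-03', 100);
    insert into fb_na_energy values
        ('2020-01-01', 400), ('2020-01-02', 200), ('2020-01-03', 100);
""")

# Daily totals across all three tables, shared by both queries below.
daily_cte = """
    with daily as (
        select date, sum(consumption) as total_consumption
        from (
            select date, consumption from fb_eu_energy
            union all
            select date, consumption from fb_asia_energy
            union all
            select date, consumption from fb_na_energy
        ) e
        group by date
    )
"""

# Tie-aware: keep every date whose total equals the maximum total.
top_dates = conn.execute(daily_cte + """
    select date, total_consumption
    from daily
    where total_consumption = (select max(total_consumption) from daily)
    order by date
""").fetchall()

# limit 1: exactly one row, even though two dates tie.
limit_one = conn.execute(daily_cte + """
    select date, total_consumption
    from daily
    order by total_consumption desc
    limit 1
""").fetchall()

print(top_dates)
print(len(limit_one))
```

Which behavior is wanted depends on how the question's "the date with the highest total" is read when totals tie.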

Conclusion

In this article, we looked at two SQL questions and how to approach them. Along the way we used inner joins, union all, group by, the having clause, the case expression, order by, and limit. When attempting to solve any complex problem, keep the following points in mind:

Divide the problem into smaller problems. After reading the problem twice, decide which columns you need to calculate the desired result.
Next, determine the functions you'll need to solve each sub-problem, and then connect the solutions of the sub-problems.
Using this method, you can work through most problems of this kind. Exposure to the strategies behind different questions from product companies will also improve your ability to come up with new approaches.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
