Nilabh Nishchhal — July 13, 2021
Beginner Data Engineering Libraries Python SQL
This article was published as a part of the Data Science Blogathon

Introduction

Pandas have come a long way on their own, and are considered second to none when it comes to data handling. Still, there are many SQL power users who consider SQL queries nothing less than sacred, and swear by them.

For such users and also for those who chase efficiency in coding (I do agree that SQL Queries are more efficient for some operations!), there is some good news. You can use the, as it is, to do data manipulation inside the python environment. That too in Jupyter Notebooks. Not only that, you can query pandas DataFrame directly using only SQL queries or syntax. If it sounds much like a fantasy, tighten your seat belts and join me in this adventure to marry SQL with Pandas. And did I say, You do not need to install or connect any SQL servers 😎

The saviour is python’s library, pandasql.

As the libraries’ documentation mentions:

pandasql allows you to query pandas DataFrames using SQL syntax. It works similarly to sqldf in R. pandasql seeks to provide a more familiar way of manipulating and cleaning data for people new to Python or pandas.

Installation

You need to install the Python’s Library, pandasql first. It’s very simple to install. Use any of the below two methods, both use PIP installation.

  • Open the terminal and run
    pip install -U pandasql
  • Open your Jupyter Notebook and in any cell run
    !pip install -U pandasql

Basics

There is this one function that is used the most from this library. Its the main function sqldf. sqldf takes two parameters.

  • A SQL query in string format
  • A set of session/environment variables (globals() or locals())

It becomes tedious to specify globals() or locals(), hence whenever you import the library, run the following helper function along with. This will make things simple going forward.

from pandasql import sqldf 
mysql = lambda q: sqldf(q, globals())

Syntax

There are many variants of SQL in use, and their syntaxes vary a little. Here in pandasql uses the SQLite syntax. Most of the standard SQL language SQLite understands. However, it adds few features of its own while at the same time it does omit some features. Click Here to read the document that attempts to describe what parts of the SQL language SQLite do and do not support.

pandasql automatically detects any pandas DataFrame. You can call them or query them by their name in the same way you would have done with a SQL table.

We are going to use any one of these two basic code samples.

from pandasql import sqldf
mysql = lambda q: sqldf(q, globals()) mysql("SQL Query")

or

from pandasql import sqldf
mysql = lambda q: sqldf(q, globals()) query = ''' SQL Query ''' mysql(query)

Import libraries and Data

For this article, we are going to use the data from the pandasql library itself. Let us import the dependencies and the data.

In [1]:
import pandas as pd
from pandasql import sqldf
from pandasql import load_meat, load_births          # Importing Data
​# Bring data in Python environment as pandas DataFrame
​meat = load_meat()
births = load_births()

Let us have a look at the Data.

In [2]:
meat.head()
Out[2]:
date beef veal pork lamb_and_mutton broilers other_chicken turkey
0 1944-01-01 751.0 85.0 1280.0 89.0 NaN NaN NaN
1 1944-02-01 713.0 77.0 1169.0 72.0 NaN NaN NaN
2 1944-03-01 741.0 90.0 1128.0 75.0 NaN NaN NaN
3 1944-04-01 650.0 89.0 978.0 66.0 NaN NaN NaN
4 1944-05-01 681.0 106.0 1029.0 78.0 NaN NaN NaN

In [3]:

births.head()

Out[3]:
date births
0 1975-01-01 265775
1 1975-02-01 241045
2 1975-03-01 268849
3 1975-04-01 247455
4 1975-05-01 254545

Read Data using SQL Query

We will read the first 5 rows of data, for the meat and births data frames using SQL. The result shall be similar to what we get from using .head()

In [4]:
# specify globals() or locals() using the following helper function
​mysql = lambda q: sqldf(q, globals())
mysql("SELECT * FROM meat LIMIT 5;")
Out[4]:
date beef veal pork lamb_and_mutton broilers other_chicken turkey
0 1944-01-01 00:00:00.000000 751.0 85.0 1280.0 89.0 None None None
1 1944-02-01 00:00:00.000000 713.0 77.0 1169.0 72.0 None None None
2 1944-03-01 00:00:00.000000 741.0 90.0 1128.0 75.0 None None None
3 1944-04-01 00:00:00.000000 650.0 89.0 978.0 66.0 None None None
4 1944-05-01 00:00:00.000000 681.0 106.0 1029.0 78.0 None None None

In [5]:

mysql("SELECT * FROM births LIMIT 5;")
Out[5]:
date births
0 1975-01-01 00:00:00.000000 265775
1 1975-02-01 00:00:00.000000 241045
2 1975-03-01 00:00:00.000000 268849
3 1975-04-01 00:00:00.000000 247455
4 1975-05-01 00:00:00.000000 254545

Join (or merge) DataFrames using SQL Queries

Joining tables is one of the most common tasks being performed by SQL. Understandably so, as the relational databases have data segregated in separate tables. Hence, SQL users are pretty used to using join() tables in SQL. We can use the power of SQL JOIN here with pandas DataFrame.

In [6]:
query = '''
SELECT m.date, m.beef, m.veal, m.pork, b.births
FROM meat AS m
INNER JOIN
births AS b
ON m.date = b.date;
'''
​
mysql(query)
Out[6]:
date beef veal pork births
0 1975-01-01 00:00:00.000000 2106.0 59.0 1114.0 265775
1 1975-02-01 00:00:00.000000 1845.0 50.0 954.0 241045
2 1975-03-01 00:00:00.000000 1891.0 57.0 976.0 268849
3 1975-04-01 00:00:00.000000 1895.0 60.0 1100.0 247455
4 1975-05-01 00:00:00.000000 1849.0 59.0 934.0 254545
402 2012-07-01 00:00:00.000000 2200.8 9.5 1721.8 368450
403 2012-08-01 00:00:00.000000 2367.5 10.1 1997.9 359554
404 2012-09-01 00:00:00.000000 2016.0 8.8 1911.0 361922
405 2012-10-01 00:00:00.000000 2343.7 10.3 2210.4 347625
406 2012-11-01 00:00:00.000000 2206.6 10.1 2078.7 320195

407 rows × 5 columns

GROUP BY using SQL

The data of meat production is month-wise. We want to see the beef production per year. For that we need to groupby() and aggregate. We can do this using the SQL GROUP BY function.

In [7]:
query = '''SELECT
            strftime('%Y', date) as year
           , SUM(beef) as beef_total
           FROM
              meat
           GROUP BY
              year
              LIMIT 5;
    '''
​
mysql(query)
Out[7]:
year beef_total
0 1944 8801.0
1 1945 9936.0
2 1946 9010.0
3 1947 10096.0
4 1948 8766.0

 

In the above code, we used SQL query to limit the number of rows for the grouped and aggregated table to 5 rows. But the output and the input, both are not SQL tables. They are pandas DataFrames. And this gives us the liberty to use Pandas functions and methods on the same.

Let us do the same operation, and this time the output shall be the first 10 rows. But the SQL query will give a full table and we will use pandas head() function to get the final output truncated to 10 rows.

In [8]:
query = '''SELECT
            strftime('%Y', date) as year
           , SUM(beef) as beef_total
           FROM
              meat
           GROUP BY
              year;
    '''
​
mysql(query).head(10)
Out[8]:
year beef_total
0 1944 8801.0
1 1945 9936.0
2 1946 9010.0
3 1947 10096.0
4 1948 8766.0
5 1949 9142.0
6 1950 9248.0
7 1951 8549.0
8 1952 9337.0
9 1953 12055.0

UNION ALL to club multiple variables in SQL

We have beef, pork, and veal as meat types, in separate columns. Here we want all the production values in one column and the identifier in another column. We can use UNION ALL function from SQL to achieve this easily.

In [9]:
#executing union all statements
query = """
        SELECT
            date
            , 'beef' AS meat_type
            , beef AS value
        FROM meat
        UNION ALL
        SELECT
            date
            , 'veal' AS meat_type
            , veal AS value
        FROM meat
        UNION ALL
        SELECT
            date
            , 'pork' AS meat_type
            , pork AS value
        FROM meat
        UNION ALL
        SELECT
            date
            , 'lamb_and_mutton' AS meat_type
            , lamb_and_mutton AS value
        FROM meat
        ORDER BY 1
    """
​
mysql(query).head(10)
Out[9]:
date meat_type value
0 1944-01-01 00:00:00.000000 beef 751.0
1 1944-01-01 00:00:00.000000 veal 85.0
2 1944-01-01 00:00:00.000000 pork 1280.0
3 1944-01-01 00:00:00.000000 lamb_and_mutton 89.0
4 1944-02-01 00:00:00.000000 beef 713.0
5 1944-02-01 00:00:00.000000 veal 77.0
6 1944-02-01 00:00:00.000000 pork 1169.0
7 1944-02-01 00:00:00.000000 lamb_and_mutton 72.0
8 1944-03-01 00:00:00.000000 beef 741.0
9 1944-03-01 00:00:00.000000 veal 90.0

Nested Queries of SQL

In SQL, writing queries within another query is commonplace. The same kind of nesting of queries is possible here as well. We will create one table (or say DataFrame) and without assigning it any variable (or name), we will use that to create another table.

In [10]:
# use queries within queries
query = """
    SELECT
        m1.date
        , m1.beef
    FROM
        meat m1
    WHERE m1.date IN
        (SELECT
            date
        FROM meat
        WHERE
            beef >= broilers
        ORDER BY date)
"""
​
mysql(query)
Out[10]:
date beef
0 1960-01-01 00:00:00.000000 1196.0
1 1960-02-01 00:00:00.000000 1089.0
2 1960-03-01 00:00:00.000000 1201.0
3 1960-04-01 00:00:00.000000 1066.0
4 1960-05-01 00:00:00.000000 1202.0
416 1995-08-01 00:00:00.000000 2316.0
417 1995-09-01 00:00:00.000000 2220.0
418 1995-11-01 00:00:00.000000 2098.0
419 1996-05-01 00:00:00.000000 2302.0
420 1996-06-01 00:00:00.000000 2186.0

421 rows × 2 columns

Conclusion:

In this article, we saw that how easily we can use SQL queries to operate upon the DataFrames. This gives us a unique opportunity. This weapon can be a potent one in any Data Scientist’s arsenal, who knows SQL and Python, both.

They both are powerful languages and have their respective strengths and weaknesses. Using the method shown in this article, or in other words, using the pandasql library and sqldf function, we can use the best and most efficient method to manipulate data, well within the python environment and even Jupyter Notebook. This is music to my ears. I hope you enjoyed the song too 🤓.

In this article, you saw how to use SQL queries inside python. But if you want to connect the two most powerful workhorses of the Data Science world, SQL and Python. This is not the end, but only the first step towards getting the “Best of Both Worlds”.

End Notes

Now you can start using Python to work upon your data which rests in SQL Databases. In able to connect to your SQL databases, go thru my article How to Access & Use SQL Database with pyodbc in Python. Once you brought it as DataFrame, then all the operations are usual Pandas operations or SQL queries being operated on Pandas DataFrame as you saw in this article.

Apart from the function of SQL shown in this article, many other popular SQL functions are easily implementable in Python. Read 15 Pandas functions to replicate basic SQL Queries in Python for learning how to do that.

The implied learning in this article was, that you can use Python to do things that you thought were only possible using SQL. There may or may not be straight forward solution to things, but if you are inclined to find it, there are enough resources at your disposal to find a way out. You can look at the mix and match the learning from my book, PYTHON MADE EASY – Step by Step Guide to Programming and Data Analysis using Python for Beginners and Intermediate Level.

About the Author: I am Nilabh Nishchhal. I like making seemingly difficult topics easy and write about them. Check out more at https://www.authornilabh.com/. My attempt to make Python easy and Accessible to all is Python Made Easy.

Cover Photo Credit: Photo by Norbert Hentges on Unsplash

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

About the Author

Our Top Authors

  • Analytics Vidhya
  • Guest Blog
  • Tavish Srivastava
  • Aishwarya Singh
  • Aniruddha Bhandari
  • Abhishek Sharma
  • Aarshay Jain

Download Analytics Vidhya App for the Latest blog/Article

Leave a Reply Your email address will not be published. Required fields are marked *