Nilabh Nishchhal — June 10, 2021
Beginner Libraries Pandas Programming Project Python SQL Technique
This article was published as a part of the Data Science Blogathon

Introduction

Everyone dealing with data in any capacity has to be conversant in both SQL and Python. Python has Pandas which makes data import, manipulation, and working with data, in general, easy and fun. On the other hand, SQL is the guardian angel of Databases across the globe. It has retained its rightful grip on this field for decades.

By virtue of this monopolistic hold, the data being stored by an organization, especially the Relational Databases needs the use of SQL to access the database, as well as to create tables, and some operations on the tables as well. Most of these operations can be done in Python using Pandas as well. Through experience, I have noticed that for some operations, SQL is more efficient (and hence easy to use) and for others, Pandas has an upper hand (and hence more fun to use).

The purpose of this article is to introduce you to “Best of Both Worlds”. You shall know how to do operations in both of these interchangeably. This will be of much use to those who have experience working with SQL but new to Python. Just one more thing: This is my first attempt to marry SQL and Python. Watch out this space for more such articles and leave your demand for specific topics in comments for me to write about them.

So let us begin with our journey without any further ado.

pyodbc library in python
Source: Photo by billow926 on Unsplash, duly edited by the author

Installing and Importing the library pyodbc

We are going to use the library named pyodbc to connect python to SQL. This will give us the ability to use the dynamic nature of Python to build and run queries like SQL. These two languages together are a formidable force in our hands. These together can take your code to the pinnacle of automation and efficiency.

Install pyodbc using pip or visit their webpage.

pip install pyodbc

and then import the library in your Jupyter notebook

import pyodbc

pyodbc is going to be the bridge between SQL and Python. This makes access easy to ODBC (Open Database Connectivity) databases. ODBC was developed by SQL Access Group in the early ’90s as an API (Application Programming Interface) to access databases. These DBMS (Database management Systems) are compliant with ODBC.

  • MySQL
  • MS Access
  • IBM Db2
  • Oracle
  • MS SQL Server

I am presently working on MS SQL Server, and that’s what I will be using for this article as well. However, the same codes can be used for any other ODBC compliant database. Only the connection step will vary a little.

 

Connection to the SQL Server

We need to establish the connection with the server first, and we will use pyodbc.connect function for the same. This function needs a connection string as a parameter. The connection string can be defined and declared separately. Let’s have a look at the sample connection String.

There can be two types of connection Strings. One when the connection is trusted one, and another where you need to enter your User_id and Password. You would know which one you are using, from your SQL Server Management Studio.

  • For Trusted Connection:
connection_string = ("Driver={SQL Server Native Client 11.0};"
            "Server=Your_Server_Name;"
            "Database=My_Database_Name;"
            "Trusted_Connection=yes;")
  • For Non-Trusted Connection:
connection_string = ("Driver={SQL Server Native Client 11.0};"
            "Server=Your_Server_Name;"
            "Database=My_Database_Name;"
            "UID=Your_User_ID;"
            "PWD=Your_Password;")

You need the following to access:

  • Server
  • Database
  • User ID
  • Password

Let me help you locate your Server Name and Database

Get Server Name

You can find the server name using two ways. One is to have a look at your SQL Server Management login window.

pyodbc server name

 

The other way is to run the following query in SQL.

SELECT @@SERVERNAME

 

Get Database

You need to give the name of the database in which your desired table is stored. You can locate the database name in the object Explorer menu under the Databases section. This is located on the left-hand side of your SQL server.

pyodbc get database

 

 

In our case, the Database Name is My_Database_Name.

Get UID

You can find the User ID in your SQL Server Management login window. The Login_ID_Here is the user name.

pyodbc Get UID

Once you have written the connection_string, you can initialize the connection by calling the pyodbc.connect function as below.

connection = pyodbc.connect(connection_string)

Note: In case you have ever used SQL database connection with any other software or environment (like SAS or R), you can copy the values of the above parameters from the connection string used there as well.

In [1]:
# Lets summarise the codes till now
import pyodbc
connection_string = ("Driver={SQL Server Native Client 11.0};"
            "Server=Your_Server_Name;"
            "Database=My_Database_Name;"
            "UID=Your_User_ID;"
            "PWD=Your_Password;")
connection = pyodbc.connect(connection_string)

Running the Query in SQL from Python

Now when you have established the connection of the SQL server with Python, you can run the SQL queries from your python environment (Jupyter notebook or any other IDE).

To do so, you need to define the cursor. Let’s do that here.

Let us run a simple query to select the first 10 rows of a table in the database defined in the connection_string, table name as State_Population.

In [2]:
# Initialise the Cursor
cursor = connection.cursor()
# Executing a SQL Query
cursor.execute('SELECT TOP(10) * FROM State_Population')

This executes the query, but you will not see any output in python, as the query is executed in SQL. However, you can print the results which will be the same as the ones returned inside the SQL server.

In [3]:
for row in cursor:
    print(row)

Out[3]:

(AL,  under18,  2012,   1117489.0)
(AL,    total,  2012,   4817528.0)
(AL,  under18,  2010,   1130966.0)
(AL,    total,  2010,   4785570.0)
(AL,  under18,  2011,   1125763.0)
(AL,    total,  2011,   4801627.0)
(AL,    total,  2009,   4757938.0)
(AL,  under18,  2009,   1134192.0)
(AL,  under18,  2013,   1111481.0)
(AL,    total,  2013,   4833722.0)

This way you can see the result in Python, but that is not very useful for further processing. It would be useful if we can import the table as a Pandas DataFrame in the Python environment. Let us do that now.

 with pandas

 

Bringing SQL table in Python

Pandas bring a data structure in Python, which is similar to a SQL (or for that matter, any other) table. That’s Pandas DataFrame for you. So it’s prudent to import the data from SQL to Python in form of Pandas DataFrame. Pandas have import functions that read SQL data types.

These are the three functions pandas provide for reading from SQL.

  • pandas.read_sql_table()
  • pandas.read_sql_query()
  • pandas.read_sql()

The read_sql_table function takes a table name as a parameter, the read_sql_query function takes SQL query as a parameter. The third one, read_sql is a wrapper function around the above two. It takes either, a table name or a query, and imports the data from SQL to Python in form of a DataFrame.

Also, notice how we give the SQL query in form of a string, and also tell the function the name of the connection.

In [4]:
import pandas as pd
# Using the same query as above to get the output in dataframe
# We are importing top 10 rows and all the columns of State_Population Table
data = pd.read_sql('SELECT TOP(10) * FROM State_Population', connection)
In [5]:
data
Out[5]:
state/region ages year population
0 AL under18 2012 1117489.0
1 AL total 2012 4817528.0
2 AL under18 2010 1130966.0
3 AL total 2010 4785570.0
4 AL under18 2011 1125763.0
5 AL total 2011 4801627.0
6 AL total 2009 4757938.0
7 AL under18 2009 1134192.0
8 AL under18 2013 1111481.0
9 AL total 2013 4833722.0

 

Notice the output above, it’s the same as you would expect from any local data file (say .csv), imported in Python as Pandas DataFrame.

We can write the query outside the function, and call it by the variable name as well. Let us import another table named state_areas which has the name of the states, and their area in Square Miles. But instead of calling the first 10 rows, we will like to see all the states whose area is more than 100,000 Square miles.

In [6]:
# write the query and assign it to variable
query = 'SELECT * FROM STATE_AREAS WHERE [area (sq. mi)] > 100000'
# use the variable name in place of query string
area = pd.read_sql(query, connection)
In [7]:
area
Out[7]:
state area (sq. mi)
0 Alaska 656425
1 Arizona 114006
2 California 163707
3 Colorado 104100
4 Montana 147046
5 Nevada 110567
6 New Mexico 121593
7 Texas 268601

Conclusion:

In this article, you saw how to connect the two most powerful workhorses of the Data Science world, SQL and Python. This is not the end, but only the first step towards getting the “Best of Both Worlds”.

Now you can start using Python to work upon your data which rests in SQL Databases. Once you brought it as DataFrame, then all the operations are usual Pandas operations. Many of these operations were not possible in SQL.

I have found marrying SQL to Python immensely useful. This has opened the doors I didn’t know even existed.

The implied learning in this article was, that you can use Python to do things that you thought were only possible using SQL. There may or may not be straight forward solution to things, but if you are inclined to find it, there are enough resources at your disposal to find a way out. You can look at the mix and match the learning from my book, PYTHON MADE EASY – Step by Step Guide to Programming and Data Analysis using Python for Beginners and Intermediate Level.

About the Author: I am Nilabh Nishchhal. I like making seemingly difficult topics easy and write about them. Check out more at https://www.authornilabh.com/. My attempt to make Python easy and Accessible to all is “Python Made Easy”.

Cover Image Courtesy – Photo by mateusz-butkiewicz on Unsplash

About the Author

Our Top Authors

  • Analytics Vidhya
  • Guest Blog
  • Tavish Srivastava
  • Aishwarya Singh
  • Aniruddha Bhandari
  • Abhishek Sharma
  • Aarshay Jain

Download Analytics Vidhya App for the Latest blog/Article

Leave a Reply Your email address will not be published. Required fields are marked *