How to Build a Data Warehouse Using PostgreSQL in Python?

Amruta Last Updated : 23 Jun, 2021

5 min read

This article was published as a part of the Data Science Blogathon

Introduction

Data warehouse generalizes and mingles data in multidimensional space. The construction or structure of a data warehouse involves Data Cleaning, Data Integration, and Data Transformation, and it can be viewed as an “important preprocessing step for data mining”.

Furthermore, data warehouses provide online analytical processing (called OLAP) tools for the interactive analysis of multidimensional data of diverse granularities, which facilitates effective data generalization and data mining. There are many other data mining functions, such as association,
classification, prediction, and clustering, that can be integrated with OLAP operations to enhance interactive mining of knowledge.

That’s why the data warehouse has become an escalated important platform for data analysis and OLAP. The data warehouse will provide a constructive platform for data mining. Therefore, data warehousing and OLAP form an essential step in the knowledge discovery process(KDD). This is the overview that is essential for understanding the overall data mining and knowledge discovery process.

Data Warehouse Using PostgreSQL — Image source: Google.com

Now let’s understand the basic concept of the data warehouse.

Basic Concept of Data Warehouse:

Data warehousing comes up with architectures and tools for business professionals to systematically organize, understand, and use these data to make strategic decisions. Data warehouse systems are beneficial tools in today’s competitive and fast-evolving world. From the last couple of years, many firms, industries have spent a lot of millions of dollars in building enterprise-wide data warehouses.

“Then, what exactly is a data warehouse?” Generally speaking, a data warehouse refers to a data repository that is maintained separately from an organization’s operational databases. Data warehouses allow the integration of a variety of application systems. The four keywords—subject-oriented, integrated, time-variant, and nonvolatile – distinguishes data warehouses from other data repository systems, such as
relational database systems(RDBMS), transaction processing systems, and other file systems.

There are three keys to implement a data warehouse:

– Server

– Dashboard

– Indexing

Let’s discuss all these points in detail:-

1) Server:

Postgre SQL

“Postgre SQL” is an open-source relational database system(RDMS). Even though it is a structured database management system(DBMS), it also stores non-structured data. Most important, the graphical user interface of Postgre SQL makes it very easy to deliver and manage databases on the mount.

Before we proceed, you need to download and install Postgres using the link PostgreSQL.

After the installation is complete, you can log in to the server by running the application which will open one portal on your browser of pgadmin.

There is a default database labeled as Postgre, however, you can create your own database by right-clicking on the “Databases” menu and then select “Create” to create a new database.

2) Implementation of Python

Now see we have created our server and database, you need to first install the package named “sqlalchemy” which will be used to connect to the database using Python. You can also download and install this package by using the following command into Anaconda prompt as-

pip install sqlalchemy

Let’s install and then import other necessary libraries into Python script as follows:

from sqlalchemy import create_engine
import psycopg2
import pandas as pd
import streamlit as st

Now, we need to establish a connection between our “records_db” database and create a new table where we can store our records. Also, we need to create another connection to the “datasets_db” database where we can store our datasets.

p_engine = create_engine("postgresql://<username>:<password>@localhost:5432/records_db")
p_engine_dataset = create_engine("postgresql://<username>:<password>@localhost:5432/datasets_db")
p_engine.execute("CREATE TABLE IF NOT EXISTS records (name text PRIMARY KEY, details text[])")

As we know the naming convention of postegre, names of tables must start with underscores(_) or letters (“a,b,c” & not numbers), it must not contain dashes(-) and less than 64 characters long. Consider our “records” table, we will create a “name” field with a “text” data type declared as a PRIMARY KEY and a details field as text[](array) which is postgres’s notation for a single-dimensional array. Furthermore, if you want to store your database credentials in a secure way then save them in a configuration_file and then invoke them as parameters in your code as per your requirement.

Hence, we are going to create the following five functions that is to read, write, update, list our data to/from our database. Let’s see:-

def write_record(name,details,p_engine):
    p_engine.execute("INSERT INTO records (name,details) VALUES ('%s','%s')" % (name,details))

def read_record(field,name,p_engine):
    result = p_engine.execute("SELECT %s FROM records WHERE name = '%s'" % (field,name))
    return result.first()[0]
    
def update_record(field,name,new_value,p_engine):
    p_engine.execute("UPDATE records SET %s = '%s' WHERE name = '%s'" % (field,new_value,name))

def write_dataset(name,dataset,p_engine):
    dataset.to_sql('%s' % (name),p_engine,index=False,if_exists='replace',chunksize=1000)

def read_dataset(name,p_engine):
    try:
        dataset = pd.read_sql_table(name,p_engine)
    except:
        dataset = pd.DataFrame([])
    return dataset

def list_datasets(p_engine):
    datasets = p_engine.execute("SELECT table_name FROM information_schema.tables WHERE table_schema = 'public' ORDER BY table_name;")
    return datasets.fetchall()

3) Dashboard:

Streamlit

“Streamlit” is a pure Python web framework that allows us to develop and deploy user interfaces(UI) and applications in real-time. Here we are using streamlit to render the dashboard for interfacing with the database.

In the code shown below, we are using different text inputs to insert the values in our records, arrays, and names for our datasets. Next, we are using Streamlit’s functions to interactively visualize our dataset in the form of a chart and as a data frame too.

st.title('Dashboard')
column_1, column_2 = st.beta_columns(2)

with column_1:
    st.header('Save records')
    name = st.text_input('Please enter name')
    details = st.text_input('Please enter your details (separated by comma ",")')
    details = ('{%s}' % (details))
    if st.button('Save record to database'):
        write_record(name,details,p_engine)
        st.info('Name: **%s** and details: **%s** saved to database' % (name,details[1:-1]))

    st.header('Update records')
    field = st.selectbox('Please select field to update',('name','details'))
    name_key = st.text_input('Please enter name of record that to be updated')    
    if field == 'name':
        updated_name = st.text_input('Please enter your updated name')
        if st.button('Update records'):
            update_record(field,name_key,updated_name,p_engine)
            st.info('Updated name to **%s** in record **%s**' % (updated_name,name_key))                
    elif field == 'details':
        updated_details = st.text_input('Please enter updated details (separated by comma)')
        updated_details = ('{%s}' % (updated_details))  
        if st.button('Update records'):
            update_record(field,name_key,updated_details,p_engine)
            st.info('Updated details to  **%s** in record **%s**' % (updated_details[1:-1],name_key))
            
    st.header('Read records')
    record_to_read = st.text_input('Please enter name of record to read')
    if st.button('Search'):
        read_name = read_record('name',record_to_read,p_engine)
        read_details = read_record('details',record_to_read,p_engine)
        st.info('Record name is **%s**, record details is **%s**' % (read_name,str(read_details)[1:-1]))

with column_2:
    st.header('Save datasets')
    dataset = st.file_uploader('Please upload dataset')
    if dataset is not None:
        dataset = pd.read_csv(dataset)
        dataset_name = st.text_input('Please enter name for dataset')
        if st.button('Save dataset to database'):
            write_dataset('%s' % (dataset_name),dataset,p_engine_dataset)
            st.info('**%s** saved to database' % (dataset_name))

    try:
        read_title = st.empty()
        dataset_to_read = st.selectbox('Please select dataset to read',([x[0] for x in list_datasets(p_engine_dataset)]))
        read_title.header('Read datasets')
        if st.button('Read dataset'):
            df = read_dataset(dataset_to_read,p_engine_dataset)
            st.subheader('Chart')
            st.line_chart(df['value'])
            st.subheader('Dataframe')
            st.write(df)    
    except:
        pass

You can run your dashboard on a local browser from your machine, by typing the following commands in anaconda prompt. So, first, you need to change your root directory to where your source code has saved.

cd C:\Users\your directory path...

Now we will run the following code to run our app…

streamlit run file_name.py

Final Result

Finally, we have a dashboard that can be used to write, read, tokenize, update, upload, and visualize our data in a real-time manner. We can see the beauty of our data warehouse that it can be enlarged to user/host as much data as you may need within the same structure.

Conclusion

I hope you liked my article. Do share with your friends, colleagues. Thank You!

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Amruta

I am Software Engineer, data enthusiast , passionate about data and its potential to drive insights, solve problems and also seeking to learn more about machine learning, artificial intelligence fields.

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.6

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

Reading list

How to Build a Data Warehouse Using PostgreSQL in Python?

Introduction

Basic Concept of Data Warehouse:

Final Result

Conclusion

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Become an Author

Flagship Programs

Free Courses

Popular Categories

Generative AI Tools and Techniques

Popular GenAI Models

AI Development Frameworks

Data Science Tools and Techniques

Reading list

Intoduction to Python

Variables and data types

OOPs Concepts

Conditional statement

Looping Constructs

Data Structures

String Manipulation

Functions

Modules, Packages and Standard Libraries

Python Libraries for Data Science

Reading Data Files in Python

Preprocessing, Subsetting and Modifying Pandas Dataframes

Sorting and Aggregating Data in Pandas

Visualizing Patterns and Trends in Data

Programming

How to Build a Data Warehouse Using PostgreSQL in Python?

Introduction

Basic Concept of Data Warehouse:

Final Result

Conclusion

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Become an Author

Flagship Programs

Free Courses

Popular Categories

Generative AI Tools and Techniques

Popular GenAI Models

AI Development Frameworks

Data Science Tools and Techniques