Building Natural Language to SQL Applications using LlamaIndex

Avikumar Talaviya 11 Apr, 2024 • 9 min read

Introduction

With the rise of large language models and rapid advances in AI, we are witnessing new developments and opportunities in the way we work and interact with the digital world around us. In this blog, we delve into the fundamentals of LlamaIndex, a framework that helps you build applications using LLMs.

This article covers setting up a project environment, connecting to a database, and developing NL2SQL applications using LlamaIndex modules. It provides a comprehensive roadmap for the integration and explores real-world use cases that combine large language models and database management systems.

Learning Objectives

  • Understand the components and concepts of LlamaIndex, a technology for building applications with Large Language Models.
  • Learn the step-by-step workflow of developing Natural Language to SQL applications using LlamaIndex modules.
  • Gain knowledge on setting up environments, connecting to databases like MySQL, and creating SQL table mappings for NL2SQL applications.
  • Explore how to integrate OpenAI’s GPT models for encoding input queries and generating SQL queries from natural language prompts.
  • Discover real-world applications of NL2SQL in industries such as Finance & Banking, Healthcare, and E-commerce for automated reporting, customer insights, clinical data analysis, and more.

This article was published as a part of the Data Science Blogathon.

How to Get Started with LlamaIndex?

In this section, we will learn about the components and concepts used in the LlamaIndex library to build LLM-powered applications for RAG and NL2SQL use cases.


LlamaIndex provides many components that can be used in LLM applications. Large language models need to interact with user queries, vector stores, databases, and external data sources.

LlamaIndex-based LLM applications have three main stages:

  • Loading Stage: At this stage, data is loaded from external API outputs, PDFs, or databases. Once loaded, the data is split into chunks for indexing and storage. These chunks, called nodes, are the atomic unit of data in LlamaIndex.
  • Indexing Stage: Once the data is loaded, it is vectorized and each vector is stored in a vector store or vector index. An embedding model generates a numerical representation, called an embedding, for each chunk. LlamaIndex has many components that help embed chunks into vector representations.
  • Querying Stage: At this stage, retrievers extract the most relevant vectors from the vector store based on the context of the query. Similarity search is one of the most important concepts in LLM-based applications and is key to retrieving data efficiently and accurately. A minimal sketch of these three stages appears below.
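
Below is a minimal sketch of the three stages using the core LlamaIndex API. It assumes a local “data/” folder of documents and an OPENAI_API_KEY in the environment; the folder name and example question are placeholders.

# Minimal sketch of the three LlamaIndex stages (assumes a "data/" folder and OPENAI_API_KEY)
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Loading stage: read files and convert them into Document objects (split into nodes internally)
documents = SimpleDirectoryReader("data").load_data()

# Indexing stage: embed the nodes and store the vectors in an in-memory vector index
index = VectorStoreIndex.from_documents(documents)

# Querying stage: retrieve the most similar nodes and let the LLM synthesize an answer
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the key points of these documents.")
print(response)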

Using these stages, we can develop RAG- or NL2SQL-based applications, and LlamaIndex functions and tools help build such agentic applications. In the next section, we will walk through a natural language to SQL application step by step.

Architecture of NL2SQL Application

In this section, we will look at the architecture of a natural language to SQL application to understand how the large language model, the database, and the user query interact to convert a natural language question into a SQL query and return an answer to the user.

Architecture of NL2SQL Application

The diagram above shows how the user query interacts with the database and the GPT model to generate output from the database tables. Let's look at the step-by-step workflow of the application.

Step-by-Step Workflow of the Application

Step1: User Uploads CSV

The process begins with the user uploading a CSV file containing tabular data. The application loads the CSV into a MySQL database using “CREATE TABLE…” and “INSERT INTO…” statements.

Step2: Upload to MySQL Database

The MySQL database processes and stores the uploaded CSV data. This database serves as the data source for the user's queries.
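
As a quick illustration of steps 1 and 2, here is a minimal sketch that loads a CSV with pandas and writes it into MySQL; the file name, table name, and connection credentials are placeholders.

# Sketch of steps 1-2: read a CSV and push it into a MySQL table (names and credentials are placeholders)
import pandas as pd
from sqlalchemy import create_engine

df = pd.read_csv("employees.csv")
engine = create_engine("mysql+pymysql://user:password@localhost/demo_db")
df.to_sql("employees", con=engine, if_exists="replace", index=False)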

Step3: User Uploads Examples

After uploading the CSV, the user has the option to upload example queries. These queries enhance natural language understanding and improve the robustness of prompt generation.

Step4: Upload Examples

Stored examples from the user are later used to generate prompts for the NL2SQL query engine.

Step5: Select Table

The application prompts the user to select a table from the uploaded CSV data. This table will be the target of the subsequent queries. The user can also select multiple tables, depending on the features of the application.

Step6: Ask Query

The user formulates a query on the selected table using natural language.

Step7: NL2SQL Query Engine

The NL2SQL query engine takes the user's query and information about the selected table to create a prompt. The prompt includes the table's “CREATE TABLE…” schema along with a few sample rows from the table.

Step8: Generate Prompt

The system submits the prompt to the GPT models, which are natural language processing models capable of generating SQL queries from the provided prompt. For a more detailed prompt, we can use few-shot prompting to steer the GPT models toward better SQL queries, as sketched below.
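
For illustration, here is a hypothetical few-shot prompt template; the schema, sample rows, and example question-SQL pair are placeholders rather than the exact prompt LlamaIndex builds internally.

# Illustrative few-shot NL2SQL prompt; the schema, rows, and example are hypothetical placeholders
FEW_SHOT_PROMPT = """You are an expert that translates questions into MySQL queries.

Table schema:
CREATE TABLE Orders (order_id INT, order_date DATE, amount FLOAT);

Sample rows:
(1, '2024-03-02', 120.50)
(2, '2024-03-05', 89.99)

Example:
Question: How many orders were placed in March?
SQL: SELECT COUNT(*) FROM Orders WHERE MONTH(order_date) = 3;

Question: {user_question}
SQL:"""

prompt = FEW_SHOT_PROMPT.format(user_question="What is the total order amount in March?")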

Step9: GPT Models Generate SQL

The GPT models generate an SQL query based on the prompt. The SQL query retrieves data from the selected table.

Step10: SQL Database

The application submits the generated SQL query to the MySQL database for execution. The database processes the query and returns the result set.

Step11: Execute Query

The SQL database executes the query and obtains the result set (DB Result).

Step12: Generate Descriptive Answer

The system submits the obtained DB result back to the GPT-3.5/4 models. These models generate a descriptive answer based on the results of the executed SQL query.
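
A rough sketch of this step with the OpenAI Python client (assuming openai>=1.0; the question and DB result shown are placeholders) might look like this:

# Sketch of step 12: turn a raw DB result into a descriptive answer (openai>=1.0 client assumed)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
db_result = "[(42,)]"  # placeholder result returned by the SQL database
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": (
            "Question: How many people have previous work experience?\n"
            f"SQL result: {db_result}\n"
            "Answer the question in one plain-English sentence."
        ),
    }],
)
print(completion.choices[0].message.content)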

Step13: GPT Models Return Answer

The workflow completes when the system returns the descriptive answer to the user. The user receives an understandable and informative response based on their original natural language query.

Now that we have learned the workflow of NL2SQL applications, let's look at how to build one with hands-on code examples.

Setting up Environment and Connecting to the Database

In this section, we will set up the environment by installing the libraries required for our project, and then connect to a MySQL database for application development.

Installing Libraries

!pip install pymysql
!pip install llama_index
!pip install --force-reinstall 'sqlalchemy<2.0.0'
!pip install llama-index-callbacks-aim
!pip install openai
!pip install sqlalchemy

After installing the required libraries, we connect the database to our environment using ‘pymysql’ and ‘sqlalchemy’.

Database Connection

Before connecting to the database using sqlalchemy and pymysql, let's look at how to add tables to the database for our NL2SQL application. We will use demo examples for this blog; you can add any tables available to you into the database.

# import the necessary libraries to add data into the SQL database
from sqlalchemy import (create_engine, MetaData, Table, Column, String, Integer)
import pandas as pd
import re
import uuid
import pymysql

# db credentials 
db_user = "your_username"
db_password = "your_password"
db_host = "your_host"
db_name = "your_database_name"

# Construct the connection string
connection_string = f"mysql+pymysql://{db_user}:{db_password}@{db_host}/{db_name}"

engine = create_engine(connection_string)
metadata_obj = MetaData()

# Function to create a sanitized column name
def sanitize_column_name(col_name):
    # Remove special characters and replace spaces with underscores
    return re.sub(r"\W+", "_", col_name)

# Function to create a table from a DataFrame using SQLAlchemy
def create_table_from_dataframe(
    df: pd.DataFrame, table_name: str, engine, metadata_obj
):
    # Sanitize column names
    sanitized_columns = {col: sanitize_column_name(col) for col in df.columns}
    df = df.rename(columns=sanitized_columns)

    # Dynamically create columns based on DataFrame columns and data types
    columns = [
        Column(col, String if dtype == "object" else Integer)
        for col, dtype in zip(df.columns, df.dtypes)
    ]

    # Create a table with the defined columns
    table = Table(table_name, metadata_obj, *columns)

    # Create the table in the database
    metadata_obj.create_all(engine)

    # Insert data from DataFrame into the table
    with engine.connect() as conn:
        for _, row in df.iterrows():
            insert_stmt = table.insert().values(**row.to_dict())
            conn.execute(insert_stmt)
        conn.commit()

# loop over your DataFrames (dfs is a list of pandas DataFrames) to add tables to the SQL database
for df in dfs:
    table_name = f"table_{uuid.uuid4().hex}"  # unique name per table, e.g. table_36370395...
    create_table_from_dataframe(df, table_name, engine, metadata_obj)

The code above illustrates a function that adds a pandas DataFrame to a SQL database such as MySQL or PostgreSQL. The function takes a DataFrame, a table name, a database engine, and a metadata object; it sanitizes the column names, creates the table with the relevant column names and data types, and inserts the rows. Once the table is added to the database, we can fetch it with regular SQL queries for our application development.

# import logging modules to configure logging parameters
import logging
import sys
import pymysql

logging.basicConfig(stream=sys.stdout, level=logging.INFO, force=True)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from IPython.display import Markdown, display


# import the text construct from sqlalchemy (create_engine was imported earlier)
from sqlalchemy import text


# Create an engine instance
engine = create_engine(connection_string)

# Test the connection using raw SQL
with engine.connect() as connection:
    result = connection.execute(text("show tables"))
    for row in result:
        print(row)

---------------------------------[Output]-------------------------------------
('Customers',)
('OrderDetails',)
('Orders',)
('info',)
('table_36370395d9314572970b325bee42817d',)
('table_59af3c5a5388482193f88093fc2eaa36',)

In the code above, the first step involves connecting to the MySQL database using the available database credentials. After building the connection string, the next step is to create an engine that connects to the database. Once the connection is established, table details can be retrieved for querying purposes.
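
As a quick sanity check, we can preview a few rows from one of the listed tables with pandas (a sketch; the “Customers” table name comes from the “show tables” output above):

# Preview a few rows from one of the tables shown above
import pandas as pd

with engine.connect() as connection:
    preview = pd.read_sql(text("SELECT * FROM Customers LIMIT 5"), connection)
print(preview)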

LlamaIndex SQL Database Instance

from llama_index.core import SQLDatabase

# select tables to connect for application
tables = ['table_36370395d9314572970b325bee42817d', 'table_59af3c5a5388482193f88093fc2eaa36']
sql_database = SQLDatabase(engine, include_tables=tables, sample_rows_in_table_info=2)
sql_database

After connecting to the database, we utilize the ‘SQLDatabase’ class from the LlamaIndex core library to retrieve the table schemas of the selected tables. With the SQL database set up, we can proceed with NL2SQL application development.

Developing the NL2SQL Application Using LlamaIndex Modules

We begin by setting up the OpenAI API key and importing the OpenAI library to call GPT models for natural language to SQL conversion. OpenAI provides a wide range of LLMs for many tasks.

import os
import openai

openai.api_key = "YOUR_OPENAI_KEY"

The code above configures the OpenAI API key for the application environment. Now let's set up a token-counting handler that tracks how many tokens our input queries and the model's responses consume.

import tiktoken
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode
)

# callback manager to track token usage for input queries
callback_manager = CallbackManager([token_counter])

The code above sets up a callback mechanism using a CallbackManager instance. It registers a TokenCountingHandler, which uses a tokenizer from the tiktoken module tailored to the “gpt-3.5-turbo” model, so that prompt and completion token counts are tracked as the application runs.
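
After queries have run, the handler exposes running token totals. Here is a minimal sketch; note that the callback manager must be attached to the global Settings (done in the next snippet) for the counts to be recorded.

# Inspect token usage collected by the handler after some queries have run
print("prompt tokens:", token_counter.prompt_llm_token_count)
print("completion tokens:", token_counter.completion_llm_token_count)
print("total LLM tokens:", token_counter.total_llm_token_count)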

Setting up the OpenAI LLM Settings

from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# LLM instance using the OpenAI class, plus an OpenAI embedding model
Settings.llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")  # "gpt-3.5-turbo" is not an embedding model
Settings.callback_manager = callback_manager  # attach the token-counting callbacks from above

The code above initializes the components needed to use OpenAI's GPT-3.5 model with specific settings. It creates an OpenAI LLM instance and an embedding model, and registers them on the global Settings object together with the callback manager we initialized in the previous code.

Creating SQL Table Node Mapping

#creating SQL table node mapping 
from llama_index.core import VectorStoreIndex
from llama_index.core.objects import ObjectIndex, SQLTableNodeMapping, SQLTableSchema
import pandas as pd

# list all the tables from the database and create a table schema object for each (used in the prompt to the LLM)
tables = list(sql_database._all_tables)
table_node_mapping = SQLTableNodeMapping(sql_database)
table_schema_objs = []
for table in tables:
    table_schema_objs.append(SQLTableSchema(table_name=table))

The code above retrieves the table names from the SQL database, creates a mapping between the SQL tables and LlamaIndex's internal node representation, and generates a table schema object for each table. These schema objects are then indexed by the object index provided by the llama_index library, as shown in the next step.
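
Optionally, each SQLTableSchema can carry a short natural-language description of its table via the context_str argument; the text is added to the prompt and helps the retriever pick the right table. The description below is a hypothetical example.

# Optional: attach a table description (hypothetical text) to guide table retrieval
table_schema_objs = [
    SQLTableSchema(
        table_name=table,
        context_str="Survey responses about employees' work experience, one row per respondent.",
    )
    for table in tables
]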

Initializing the SQL Table Retriever Query Engine

from llama_index.core.indices.struct_store.sql_query import SQLTableRetrieverQueryEngine

# create an object index to store the table schema vectors
obj_index = ObjectIndex.from_objects(
    table_schema_objs,
    table_node_mapping,
    VectorStoreIndex
)

# initializing query engine 
query_engine = SQLTableRetrieverQueryEngine(
    sql_database, obj_index.as_retriever(similarity_top_k=3), 
)

# Ask query to query engine and store it in a response variable
response = query_engine.query("How many people have previous work experience?")

The code above sets up an object index, initializes a query engine over the SQL tables, and then executes a natural language query: the engine retrieves the most relevant table schemas, asks the LLM to generate a SQL query, runs it against the database, and synthesizes a descriptive answer. The response variable stores that answer, which can then be returned to the user.
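
To see what came back, we can print the synthesized answer and, where the engine exposes it, the generated SQL from the response metadata (a sketch; metadata keys can vary across llama-index versions):

# Print the natural language answer and, if available, the SQL the LLM generated
print(str(response))
print(response.metadata.get("sql_query"))  # generated SQL, when exposed by the engine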

Use Cases Across the Industries

NL2SQL applications find utility across various industries like finance, banking, healthcare, and e-commerce. Analyzing user data and deriving insights to enable data-driven decision making is at the core of every industry today. One of the biggest challenges in enabling data-driven insights across teams is that it is time consuming and requires specialized skills. With NL2SQL applications, any business stakeholder can query the database in natural language and get a relevant answer from large language models. Let's explore how these applications are used in a few industries.

  • Finance and Banking:
    • Automated Reporting: Using a natural language to SQL application, we can extract key insights from the database and prepare automated reports, saving time and cost for the business.
    • Customer Insights: NL2SQL applications make it easier for business stakeholders to surface customer insights for tactical and managerial decision making.
  • Healthcare:
    • Clinical Data Analysis: Clinics and hospitals possess large databases of patient health data. NL2SQL applications can help clinicians analyze this data quickly and improve the speed of diagnosis.

Conclusion

We have learned about LlamaIndex and its workings, including loading, indexing, and query engines. We then covered the NL2SQL workflow in depth, with an explanation of each step needed to develop the application from scratch. We learned how to integrate large language models with tools such as LlamaIndex so that users can interact with a database using natural language queries, without needing any technical skills. Finally, we implemented an NL2SQL application with hands-on code examples and explored its applications across various use cases. Let's look at a few key takeaways from this blog.

Key Takeaways

  • Explored the three vital stages: Loading, Indexing, and Querying, essential for building LLM-based applications.
  • Gained insights into how to seamlessly integrate OpenAI’s GPT models for efficient natural language processing and SQL query generation.
  • Learned the practical aspects of setting up environments and connecting to databases like MySQL.
  • Installed and utilized essential libraries such as llama_index, pymysql, and sqlalchemy for application development.

Frequently Asked Questions

Q1: How does LlamaIndex help to develop Text to SQL applications?

A: LlamaIndex has inbuilt Text-to-SQL components, such as the SQL table retriever query engine, to retrieve data from a database and generate a response for the user.

Q2: What are the 3 main stages of LlamaIndex library?

A: LlamaIndex operates on three fundamental stages: Loading, Indexing, and Querying, enabling the development of LLM-based applications for a wide range of uses.

Q3: How can NL2SQL applications help businesses make faster decisions?

A: Natural language to SQL applications reduce the time and effort needed to generate actionable insights by letting users query in natural language rather than writing complex SQL queries.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
