The launch of ChatGPT marked an unprecedented moment in the history of AI. With their incredible capabilities, ChatGPT and many other generative AI tools have the potential to change dramatically the way we work. Writing SQL is one task already changing in data science following the AI revolution.  We will provide an illustrative example of using natural language to connect and interact with an SQL database. You will be using Python’s open-source package Vanna. The link to the Notebook is here. Master the art of crafting intricate SQL queries with Generative AI. Learn how to streamline database interactions using natural language prompts in this insightful guide.
In this article, you will learn:
This article was published as a part of the Data Science Blogathon.
SQL is one of the most popular and widely used programming languages. Most modern companies have adopted SQL architecture to store and analyze enterprise data. However, not everyone in the company is capable of harnessing that data. They may lack the technical skills or be unfamiliar with the structure and schema of the database.
Whatever the reason, this is often a bottleneck in data-driven projects, for to answer business questions, everyone depends on the availability of the very few people who know how to use the SQL database. Wouldn’t it be great if everyone in the company, irrespective of their SQL expertise, could harness that data every time, everywhere, all at once?
That could be soon possible with the help of generative AI. Developers and researchers are already testing different approaches to train Large Language Models (LLMs)— the foundation technology of most generative AI tools — for SQL purposes. For example, LangChain, the popular framework for developing LLM-based applications, can now connect and interact with SQL databases based on natural language prompts.
However, these tools are still in a nascent stage. They often return inaccurate results or experience so-called LLM hallucinations, especially when working with large and complex databases. Also, they may not be intuitive enough for non-technical users. Hence, there is still a wide margin of improvement.
Vanna is an AI agent designed to democratize the use of SQL. Starting from a pre-trained model based on a combination of third-party LLMs from OpenAI and Google, you can fine-tune a custom model specific to your database.
Once the model is ready, you have to ask business questions in natural language, and the model will translate them into SQL queries. You will also want to run the queries against the target database. Just ask the model, and it will return the query and a pandas DataFrame with the results, a plotly chart, and a list of follow-up questions.
To create the custom model, Vanna has to be trained with contextually relevant information, including SQL examples, database documentation, and database schemas — i.e., data definition language (DDL). The accuracy of your model will ultimately depend on the quality and quantity of your training data. The good news is that the model is designed to keep learning as you use it. Since the generated SQL queries will be automatically added to the training data, the model will learn from its previous mistakes and gradually improve.
The whole process is illustrated in the following image:
Check out this article to learn more about the technicalities of LLMs and other kinds of neural networks.
Now that you know the theory, let’s get into the practice.
As with any Python package, you first need to install Vanna. The package is available in PyPI and should be installed in seconds.
Once you have Vanna on your computer, import it into your working environment using the alias vn :
# Install vanna, if necessary
%pip install vanna
# import packages
import pandas as pd
import vanna as vn
To use Vanna, you must create a login and get an API key. This is a straightforward process. Run the function vn.get_api_key() with your email and a code will be sent to your inbox. Just enter the code, then run vn.set_api_key() and you’re ready to use Vanna.
# Create login and get API key
api_key = vn.get_api_key('[email protected]')
vn.set_api_key(api_key)
With Vanna, you can create as many custom models as you want. Say you are a member of the marketing department of your company. Your team normally works with the company Snowflake data warehouse and a department-specific PostgreSQL database. You could then create two different models, each trained on the specific characteristics of the databases and with different access permissions.
To create a model, use the function vn.create_model(model, db_type), providing a name and the database type. Vanna can be used with any database that supports connection via Python, including SQLite, PostgreSQL, Snowflake, BigQuery, and Amazon Athena.
Imagine you want to create two models for the two databases your team works with:
# Create models
vn.create_model(model="data_warehose", db_type="Snowflake")
vn.create_model(model="marketing_db", db_type="Postgres")
Once created, you can access them using the vn.get_model() function. The function will return a list of the available models.
['data_warehose',
'marketing_db',
'my-dataset2',
'demo-tpc-h',
'tpc',
'chinook']
You may have noticed that there are more models than the ones you just created. That’s because Vanna comes with a set of pre-trained models that can be used for testing purposes.
We will play around with the “chinook” model for the rest of the tutorial. It is trained on the Chinook, a fictional SQLite database containing information about a music store. For the sake of clarity, below you can find the tables and relationships that comprise the database:
To select that model, run:
# Set model
vn.set_model('chinook')
This function will set the model to use for the Vanna API. It will allow the agent to send your prompts to the underlying LLM, leveraging its capabilities with the training data to translate your questions in natural language into SQL queries.
However, if you want the agent to run its generated SQL queries against the database, you will need to connect with it. Depending on the type of database, you will need a different connect function. Since we are using a SQLite database, we will use the vn.connect_to_sqlite(url) function with the url where the database is hosted:
# Connect to database
url= """https://github.com/lerocha/chinook-database/raw/master
/ChinookDatabase/DataSources/Chinook_Sqlite.sqlite"""
vn.connect_to_sqlite(url=url)
As mentioned, the Chinook model is already pre-trained with contextually relevant information. One of the coolest things of Vanna is that you always have full control over the training process. At any time, you can check what data is in the model. This is done with the vn.get_training_data() function, which will return a pandas DataFrame with the training data:
# Check training data
training_data = vn.get_training_data()
training_data
The model has been trained with a mix of questions with its corresponding SQL query, DDL, and database documentation. If you want to add more training data, you could do this manually with the vn.train() function. Depending on the parameters you use, the function can gather different types of training data:
For example, let’s include the question “Which are the five top stores by sales?” and its associated SQL query:
# Add question-query pair
vn.train(question="Which are the five top stores by sales?",
sql="""SELECT BILLINGCITY, SUM(TOTAL)
FROM INVOICE
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5;""" )
Training the model manually can be daunting and time-consuming. There is also the possibility of training the model automatically by telling the Vanna agent to crawl your database to fetch metadata. Unfortunately, this functionality is still in an experimental phase, and it’s only available for Snowflake databases, so I didn’t have the chance to try it.
Now that your model is ready, let’s get into the funniest part: asking questions.
To ask a question, you have to use the vn.ask(question) function. Let’s start with an easy one:
vn.ask(question='What are the top 5 jazz artists by sales?')
Vanna will try by default to return the four elements already mentioned: the SQL query, a Pandas DataFrame with the results, a plotly-made chart, and a list with follow-up questions. When we run this line, the results seem accurate:
SELECT a.name, sum(il.quantity) as total_sales
FROM artist a
INNER JOIN album al
ON a.artistid = al.artistid
INNER JOIN track t
ON al.albumid = t.albumid
INNER JOIN invoiceline il
ON t.trackid = il.trackid
INNER JOIN genre g
ON t.genreid = g.genreid
WHERE g.name = 'Jazz'
GROUP BY a.nameORDER
BY total_sales DESC
LIMIT 5;
Suppose you want to save the results instead of having them printed. In that case, you can set the print_results parameters to False and unpack the results in different variables that you can later download in a desired format using regular techniques, such as the pandas .to_csv() method for the DataFrame and the plotly .write_image() method for the visualization:
sql, df, fig, followup_questions = vn.ask(question='What are the top 5 jazz artists by sales?',
print_results=False)
#Save dataframe and image
df.to_csv('top_jazz_artists.csv', index=False)
fig.write_image('top_jazz_artists.png')
The function has another parameter called auto_train set to True by default. That means that the question will be automatically added to the training dataset. We can confirm that using the following syntax:
training_data = vn.get_training_data()
training_data['question'].str.contains('What are the top 5 jazz artists by sales?').any()
Despite the impressive capabilities of the vn.ask(question) function, I wonder how it will perform in the real world, probably bigger and more complex databases. Also, no matter how powerful the underlying LLM is, the training process seems to be the key to high accuracy. How much training data do we need? What representation must it have? Can you speed up the training process to develop a practical and operational model?
On the other hand, Vanna is a brand new project, and many things could be improved. For example, the plotly visualizations don’t seem very compelling, and there seem to be no tools to customize them. Also, the documentation could be clarified and enriched with illustrative examples.
Additionally, I have noticed some technical problems that shouldn’t be difficult to fix. For example, when you only want to know a data point, the function breaks when trying to build the graph — which makes sense because, in those scenarios, a visualization is pointless. But the problem is that you don’t see the follow-up questions, and, more importantly, you cannot unpack the tuple.
For example, see what happens when you want to know the oldest employee.
vn.ask(question='Who is the oldest employee')
Vanna is one of the many tools that are trying to leverage the power of LLMs to make SQL accessible to everyone, no matter their technical fluency. The results are promising, but there is still a long way to develop AI agents capable of answering every business with accurate SQL queries. As we have seen in this tutorial, while powerful LLMs play an essential role in the equation, the secret still lies in the training data. Given the ubiquity of SQL in companies worldwide, automating the tasks of writing queries can be a game-changer. Thus, it’s worth watching how AI-powered SQL tools like Vanna evolve in the future.
A. Next-generation AI tools like ChatGPT are helping data practitioners and programmers in a wide range of scenarios, from improving code performance and automating basic tasks to fixing errors and interpreting results.
A. When only a few people in a company know SQL and the structure of the company database, everyone depends on the availability of these very few people to answer their business questions.
A. Powerful AI tools powered by LLMs could help data practitioners extract insights from data by enabling interaction with SQL databases using natural language instead of SQL language.
A. Vanna, powered by LLMs, is a Python AI SQL Agent that enables natural language communication with SQL Databases.
A. While the power of the LLMs underpinning these tools is relevant, the quantity and quality of training data is the most significant variable to increase accuracy.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
Lorem ipsum dolor sit amet, consectetur adipiscing elit,