Sukanya Bag — Updated On September 15th, 2023


Large Language Models have come a long way in Document Q&A and information retrieval. These models know a lot about the world, but sometimes, they struggle to know when they don’t know something. This leads them to make things up to fill the gaps, which isn’t great.

Langchain and Deep Lake

However, a method called Retrieval Augmented Generation (RAG) seems promising. With RAG, you can query an LLM against your private knowledge base: the model is given extra information retrieved from your own data sources, which keeps its answers better grounded and reduces mistakes when its built-in knowledge falls short.

RAG works by augmenting prompts with proprietary data, extending the knowledge of these large language models while reducing the occurrence of hallucinations.

Learning Objectives

1. Understand the RAG approach and its benefits

2. Recognize the challenges in Document QnA

3. Differentiate between Simple Generation and Retrieval Augmented Generation

4. Implement RAG in practice on an industry use case like Doc-QnA

By the end of this learning article, you should have a solid understanding of Retrieval Augmented Generation (RAG) and its application in enhancing the performance of LLMs in Document Question Answering and Information Retrieval.

This article was published as a part of the Data Science Blogathon.

Getting Started

Regarding Document Question Answering, the ideal solution is to give the model the specific information it needs right when asked a question. However, deciding what information is relevant can be tricky and depends on what the large language model is expected to do. This is where the concept of RAG becomes important.

Let us see how a RAG pipeline works:

RAG Pipeline | Langchain and Deep Lake

Retrieval Augmented Generation

RAG, a cutting-edge generative AI architecture, uses semantic similarity to autonomously identify the information pertinent to a query. Here’s a concise breakdown of how RAG functions:

  1. Vector Database: In a RAG system, your documents are stored within a specialized Vector DB. Each document undergoes indexing based on a semantic vector generated by an embedding model. This approach enables rapid retrieval of documents closely related to a given query vector. Each document is assigned a numerical representation (the vector), signifying its semantic meaning.
  2. Query Vector Generation: When a query is submitted, the same embedding model produces a semantic vector that represents the query.
  3. Vector-Based Retrieval: Subsequently, the model utilizes vector search to identify documents within the DB that exhibit vectors closely aligned with the query’s vector. This step is crucial in pinpointing the most relevant documents.
  4. Response Generation: After retrieving the pertinent documents, the model employs them with the query to generate a response. This strategy empowers the model to access external data precisely when required, augmenting its internal knowledge.
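To make the four steps above concrete, here is a minimal sketch in plain Python. It is not how Deep Lake or OpenAI embeddings work internally: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and `ToyVectorStore` is a hypothetical in-memory class I made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory vector store: keeps (vector, document) pairs."""

    def __init__(self):
        self.items = []

    def add(self, doc):
        # Step 1: index each document by its vector
        self.items.append((embed(doc), doc))

    def search(self, query, k=1):
        # Step 2: embed the query with the same embedding function
        query_vec = embed(query)
        # Step 3: rank stored documents by similarity to the query vector
        ranked = sorted(
            self.items, key=lambda item: cosine(query_vec, item[0]), reverse=True
        )
        return [doc for _, doc in ranked[:k]]


store = ToyVectorStore()
store.add("The invoice for March was paid on the fifth of April.")
store.add("Professor McGonagall teaches Transfiguration at Hogwarts.")
store.add("Deep Lake stores embeddings alongside the raw text.")

# Step 4 would pass these retrieved documents to the LLM together with the query
top = store.search("Who teaches Transfiguration?", k=1)
print(top[0])
```

A real embedding model captures meaning rather than shared words, but the flow is the same: embed once at indexing time, embed the query at question time, and compare vectors.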

The Illustration

The illustration below sums up all the steps discussed above:

Illustration on simple generation and retriever augmented generation | Langchain and Deep Lake

From the drawing above, there are 2 important things to note:

  1. In simple generation, we never learn the source of the information.
  2. Simple generation can produce wrong information when the model is outdated, that is, when the query concerns something after its knowledge cutoff.

With the RAG approach, our LLM’s prompt will be the instruction given by us, the retrieved context, and the user’s query. Now, we have the evidence of the information retrieved. 
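As a rough sketch of that prompt structure, the three parts can be stitched together like this. The function name `build_rag_prompt`, the numbering of the chunks, and the sample XYZ records are all made up for illustration; they are not taken from LangChain.

```python
def build_rag_prompt(instruction, retrieved_chunks, question):
    """Combine the instruction, the retrieved context, and the user's query."""
    # Number each chunk so an answer can be traced back to its evidence
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return f"{instruction}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


prompt = build_rag_prompt(
    instruction="Answer using only the context. Say you don't know if the answer is missing.",
    retrieved_chunks=[
        "XYZ reported revenue of 10 units in Q1.",  # made-up sample record
        "XYZ opened a new office in Q2.",  # made-up sample record
    ],
    question="What revenue did XYZ report in Q1?",
)
print(prompt)
```

Because the retrieved chunks sit in the prompt, you can show them next to the answer as the supporting evidence.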

So, instead of retraining the model every time the information changes, you can add the updated information to your vector store or data store. The next time a user asks a similar question whose answer has changed (for example, about the finance records of an XYZ firm), the pipeline retrieves the new information. You are all set.

Hope this refreshes your mind on how RAG works. Now, let’s get to the point. Yes, the code.

I know you did not come here for the small talk. 👻

Let’s Skip to the Good Part!

1: Making the VSCode Project Structure

Open VSCode or your preferred code editor and create a project directory as follows (carefully follow the folder structure) –

VSCode project structure | Langchain and Deep Lake

Remember to create a virtual environment with Python ≥ 3.9 and install the dependencies in the requirements.txt file. (Don’t worry, I will share the GitHub link for the resources.)

2: Creating a Class for Retrieval and Embedding Operations

In the file, paste the code below and save it.

from retriever.retrieval import Retriever

# Create a Controller class to manage document embedding and retrieval
class Controller:
    def __init__(self):
        self.retriever = None
        self.query = ""

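    def embed_document(self, file):
        # Embed a document if 'file' is provided
        if file is not None:
            self.retriever = Retriever()
            # Create and add embeddings for the provided document file.
            # Assumption: 'file' is the temp-file object Gradio passes in,
            # so its path on disk is file.name
            self.retriever.create_and_add_embeddings(file.name)

    def retrieve(self, query):
        # Retrieve text based on the user's query
        if self.retriever is None:
            # No document has been embedded yet
            return "Please upload a PDF first."
        texts = self.retriever.retrieve_text(query)
        return texts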

This is a helper class for creating an object of our Retriever. It implements two functions –

embed_document: generates the embeddings of the document

retrieve: retrieves text when the user asks a query

Down the lane, we will get deeper into the create_and_add_embeddings and retrieve_text helper functions in our Retriever!

3: Coding our Retrieval pipeline!

In the file, paste the code below and save it.

3.1: Import the necessary libraries and modules

import os
from langchain import PromptTemplate
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores.deeplake import DeepLake
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import PyMuPDFLoader
from langchain.chat_models.openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.memory import ConversationBufferWindowMemory

from .utils import save

import config as cfg

3.2: Initialize the Retriever Class

# Define the Retriever class
class Retriever:
    def __init__(self):
        self.text_retriever = None
        self.text_deeplake_schema = None
        self.embeddings = None
        self.memory = ConversationBufferWindowMemory(k=2, return_messages=True)
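
A window memory with k=2 keeps only the two most recent exchanges of the conversation. The sketch below mimics that behaviour with a plain deque; `WindowMemory` is a hypothetical class for illustration, not LangChain’s implementation.

```python
from collections import deque


class WindowMemory:
    """Hypothetical stand-in: remembers only the last k (user, AI) exchanges."""

    def __init__(self, k=2):
        # A deque with maxlen drops the oldest exchange automatically
        self.turns = deque(maxlen=k)

    def save(self, user_msg, ai_msg):
        self.turns.append((user_msg, ai_msg))

    def load(self):
        return list(self.turns)


memory = WindowMemory(k=2)
memory.save("Who is McGonagall?", "A professor at Hogwarts.")
memory.save("What does she teach?", "Transfiguration.")
memory.save("Where does she live?", "The chapter does not say.")
print(memory.load())
```

After the third exchange, the first one has dropped out of the window, so the history stays short however long the chat gets.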

3.3: Let’s write the code for creating and adding the document embeddings to Deep Lake

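    # Method of the Retriever class
    def create_and_add_embeddings(self, file):
        # Create a directory named "data" if it doesn't exist
        os.makedirs("data", exist_ok=True)

        # Initialize embeddings using OpenAIEmbeddings.
        # The original arguments are missing here; with none given, the
        # OPENAI_API_KEY environment variable is used.
        self.embeddings = OpenAIEmbeddings()

        # Load documents from the provided file using PyMuPDFLoader
        loader = PyMuPDFLoader(file)
        documents = loader.load()

        # Split text into chunks using CharacterTextSplitter.
        # chunk_size and chunk_overlap are assumed values; the originals are missing.
        text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
        docs = text_splitter.split_documents(documents)

        # Create a DeepLake schema for text documents.
        # Assumed arguments; older LangChain releases name the embedding
        # argument embedding_function instead of embedding.
        self.text_deeplake_schema = DeepLake(
            dataset_path=cfg.TEXT_VECTORSTORE_PATH,
            embedding=self.embeddings,
            overwrite=True,
        )

        # Add the split documents to the DeepLake schema
        self.text_deeplake_schema.add_documents(docs)

        # Create a text retriever from the DeepLake schema with search type "similarity"
        self.text_retriever = self.text_deeplake_schema.as_retriever(
            search_type="similarity"
        )

        # Configure search parameters for the text retriever
        self.text_retriever.search_kwargs["distance_metric"] = "cos"
        self.text_retriever.search_kwargs["fetch_k"] = 15
        self.text_retriever.search_kwargs["maximal_marginal_relevance"] = True
        self.text_retriever.search_kwargs["k"] = 3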
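
The search parameters above ask for maximal marginal relevance (MMR): fetch the 15 chunks closest to the query, then keep 3 that are relevant without repeating each other. Here is a minimal sketch of the textbook MMR procedure in plain Python. It is not Deep Lake’s implementation, and the `lambda_mult` weight and the 2-D vectors are my own assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec, doc_vecs, k=3, fetch_k=15, lambda_mult=0.5):
    """Return the indices of k documents chosen by maximal marginal relevance."""
    # Step 1: keep the fetch_k candidates most similar to the query
    by_relevance = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_relevance[:fetch_k]
    selected = []

    # Step 2: greedily pick the candidate that is relevant to the query
    # but least similar to what has already been selected
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 1.0]
docs = [
    [1.0, 0.8],    # 0: most relevant
    [1.0, 0.79],   # 1: near-duplicate of document 0
    [0.75, 1.0],   # 2: relevant, but covers a different angle
    [-1.0, 1.0],   # 3: unrelated
]
picked = mmr(query, docs, k=2, fetch_k=3)
print(picked)
```

Plain similarity search would return documents 0 and 1, which say almost the same thing. With MMR, the second pick is document 2, so the context handed to the LLM covers more ground.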

3.4: Now, let’s code the function that will retrieve text!

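    # Method of the Retriever class
    def retrieve_text(self, query):
        # Create a DeepLake schema for text documents in read-only mode
        # (arguments other than read_only are assumed; the originals are missing)
        self.text_deeplake_schema = DeepLake(
            dataset_path=cfg.TEXT_VECTORSTORE_PATH,
            read_only=True,
            embedding=self.embeddings,
        )

        # Define a prompt template for giving instruction to the model.
        # The {context} line is reconstructed: the PromptTemplate below expects it.
        prompt_template = """You are an advanced AI capable of analyzing text from 
        documents and providing detailed answers to user queries. Your goal is to 
        offer comprehensive responses to eliminate the need for users to revisit 
        the document. If you lack the answer, please acknowledge it rather than 
        making up information.
        Context: {context}
        Question: {question}"""

        # Create a PromptTemplate with the "context" and "question"
        PROMPT = PromptTemplate(
            template=prompt_template, input_variables=["context", "question"]
        )

        # Define chain type
        chain_type_kwargs = {"prompt": PROMPT}

        # Initialize the ChatOpenAI model (temperature=0 is an assumed setting)
        model = ChatOpenAI(temperature=0)

        # Create a RetrievalQA instance of the model.
        # Assumed arguments: the "stuff" chain type, the retriever built in
        # create_and_add_embeddings, and the window memory from __init__.
        qa = RetrievalQA.from_chain_type(
            llm=model,
            chain_type="stuff",
            retriever=self.text_retriever,
            chain_type_kwargs=chain_type_kwargs,
            memory=self.memory,
        )

        # Query the model with the user's question
        response = qa({"query": query})

        # Return response from llm
        return response["result"]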

4: Utility function to query our pipeline and extract the result

Paste the below code in your file :

from langchain.callbacks import get_openai_callback


def save(query, qa):
    # Use the get_openai_callback function to track OpenAI usage for this call
    with get_openai_callback() as cb:
        # Query the qa object with the user's question
        response = qa({"query": query}, return_only_outputs=True)
        # Return the answer from the llm's response
        return response["result"]

5: A config file for storing your keys... nothing fancy!

Paste the below code in your file :

import os
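# Build the path with os.path.join so it works on any OS
# (the backslash in "data\deeplake_text_vectorstore" is Windows-only)
TEXT_VECTORSTORE_PATH = os.path.join("data", "deeplake_text_vectorstore")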
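
The key entries themselves are not shown above. One common way to handle them, sketched here as an assumption rather than the exact config used in this project, is to read the key from an environment variable instead of hard-coding it. `load_api_key` is a hypothetical helper; `OPENAI_API_KEY` is the variable name OpenAI clients conventionally read.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Hypothetical helper: read an API key from the environment, or fail loudly."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before starting the app."
        )
    return key


# Usage inside a config file (uncomment once the variable is set):
# OPENAI_API_KEY = load_api_key()
```

This keeps the secret out of the repository, and a missing key fails at startup with a clear message instead of in the middle of a request.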

Finally, we can code our Gradio app for the demo!!

6: The Gradio app!

Paste the following code in your file :

# Import necessary libraries
import os
from controller import Controller
import gradio as gr

# Disable tokenizers parallelism for better performance
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Custom CSS for the app. The original stylesheet is not shown here,
# so an empty string is used as a placeholder.
CSS = ""

# Initialize the Controller class
controller = Controller()

# Define a function to process the uploaded PDF file
def process_pdf(file):
    if file is not None:
        controller.embed_document(file)
    # The original return value is missing; as an assumption, enable the
    # question box once a file has been processed
    return gr.update(interactive=True)

# Define a function to respond to user messages
def respond(message, history):
    # The chat history is None after "Clear", so fall back to an empty list
    history = history or []
    botmessage = controller.retrieve(message)
    history.append((message, botmessage))
    return "", history

# Define a function to clear the conversation history
def clear_everything():
    return (None, None, None)

# Create a Gradio interface
with gr.Blocks(css=CSS, title="") as demo:
    # Display headings and descriptions
    gr.Markdown("# AskPDF ", elem_id="app-title")
    gr.Markdown("## Upload a PDF and Ask Questions!", elem_id="select-a-file")
    gr.Markdown("Drop an interesting PDF and ask questions about it!")
    # Create the upload section
    with gr.Row():
        with gr.Column(scale=3):
            upload = gr.File(label="Upload PDF", type="file")
            with gr.Row():
                clear_button = gr.Button("Clear", variant="secondary")

    # Create the chatbot interface
    with gr.Column(scale=6):
        chatbot = gr.Chatbot()
        with gr.Row().style(equal_height=True):
            with gr.Column(scale=8):
                question = gr.Textbox(
                    placeholder="e.g. What is the document about?",
                )
            with gr.Column(scale=1, min_width=60):
                submit_button = gr.Button(
                    "Ask me 🤖", variant="primary", elem_id="submit-button"
                )

    # Define buttons
    # Assumed wiring: embed the PDF as soon as a file is uploaded
    upload.change(process_pdf, inputs=[upload], outputs=[question])
    question.submit(respond, [question, chatbot], [question, chatbot])
    submit_button.click(respond, [question, chatbot], [question, chatbot])
    clear_button.click(
        clear_everything,
        inputs=None,
        outputs=[upload, question, chatbot],
    )

# Launch the Gradio interface
if __name__ == "__main__":
    demo.launch(enable_queue=False, share=False)
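
If you want to check the chat handler without launching Gradio or calling OpenAI, you can exercise the same contract with a stand-in controller. `StubController` is a hypothetical placeholder I made up for this sketch; the handler returns an empty string (which clears the textbox) plus the updated list of (question, answer) pairs.

```python
class StubController:
    """Hypothetical stand-in for the real Controller: no embeddings, no LLM calls."""

    def retrieve(self, query):
        return f"(answer to: {query})"


controller = StubController()


def respond(message, history):
    # Treat a cleared chat (None) as an empty history
    history = history or []
    bot_message = controller.retrieve(message)
    history.append((message, bot_message))
    # "" resets the textbox; history is what the chatbot component displays
    return "", history


box, history = respond("What is the document about?", None)
box, history = respond("Who is the main character?", history)
print(history)
```

Each call adds one (question, answer) pair, which is the shape the chatbot component expects here.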

Grab your🧋, cause now it is time to see how our pipeline works!

To launch the Gradio app, open a new terminal instance and enter the following command:


Note: Ensure the virtual environment is activated, and you are in the current project directory.

Gradio will start a new instance of your application in the localhost server as follows:

Gradio | Langchain and Deep Lake

All you need to do is CTRL + click on the localhost URL (last line), and your app will open in your browser.


Our Gradio App is here!

Gradio App | Langchain and Deep Lake

Let’s drop an interesting PDF! I will use Harry Potter’s Chapter 1 pdf from this Kaggle repository containing Harry Potter books in .pdf format for chapters 1 to 7.

Lumos! May the light be with you🪄


Now, as soon as you upload, the text box to ask a query will be activated as follows:


Let’s get to the most awaited part now — Quizzing!


Wow! 😲

I love how accurate the answers are!

Also, look at how Langchain’s memory maintains the chain state, incorporating context from past runs.


It remembers that “she” here is our beloved Professor McGonagall! ❤️‍🔥

A Short Demo of How the App Works!

RAG’s practical and responsible approach can be extremely useful to data scientists across various research areas to build accurate and responsible AI products.

1. In healthcare diagnosis, RAG can assist doctors and scientists in diagnosing complex medical conditions by integrating patient records, medical literature, research papers, and journals into the knowledge base, so that up-to-date information can be retrieved when making critical decisions and conducting healthcare research.

2. In customer support, companies can use RAG-powered conversational chatbots to resolve customer inquiries and complaints. Drawing on a private database of product information, manuals, FAQs, and purchase orders, these chatbots provide accurate responses and improve the customer experience!

3. In fintech, analysts can incorporate real-time financial data, market news, and historical stock prices into their knowledge base, and a RAG framework can then respond quickly to queries about market trends, company financials, investments, and revenues, aiding sound and responsible decision-making.

4. In the ed-tech market, e-learning platforms can deploy RAG-based chatbots to help students resolve their queries by providing suggestions, comprehensive answers, and solutions based on a vast repository of textbooks, research articles, and educational resources. This enables students to deepen their understanding of subjects without requiring extensive manual research.

The scope is unlimited!


In this article, we explored the mechanics of RAG with Langchain and Deep Lake, where semantic similarity plays a pivotal role in pinpointing relevant information. With vector databases, query vector generation, and vector-based retrieval, these models access external data precisely when needed.

The result? More precise, contextually appropriate responses enriched with proprietary data. Hope you liked it and learned something on your way! Feel free to download the complete code from my GitHub repo, to try it out.

Key Takeaways

  1. Introduction to RAG: Retrieval Augmented Generation (RAG) is a promising technique in Large Language Models (LLMs) that enhances their knowledge by adding extra information from their own data sources, making them smarter and reducing errors when they lack information.
  2. Challenges in Document QnA: Large Language Models have made significant progress in Document Question and Answering (QnA) but can sometimes struggle to discern when they lack information, leading to errors.
  3. RAG Pipeline: The RAG pipeline employs semantic similarity to identify relevant query information. It involves a Vector Database, Query Vector Generation, Vector-Based Retrieval, and Response Generation, ultimately providing more precise and contextually appropriate responses.
  4. Benefits of RAG: RAG allows models to provide evidence for the information they retrieve, reducing the need for frequent retraining in rapidly changing information scenarios.
  5. Practical Implementation: The article provides a practical guide to implementing the RAG pipeline, including setting up the project structure, creating a retrieval and embedding class, coding the retrieval pipeline, and building a Gradio app for real-time interactions.

Frequently Asked Questions

Q1: What is Retrieval Augmented Generation (RAG)?

A1: Retrieval Augmented Generation (RAG) is a cutting-edge technique used in Large Language Models (LLMs) that enhances their knowledge and reduces errors in document question-answering. It involves retrieving relevant information from data sources to provide context for generating accurate responses.

Q2: Why is RAG important for LLMs?

A2: RAG is important for LLMs because it helps them improve their performance by adding extra information from their data sources. This additional context makes LLMs smarter and reduces their mistakes when they lack sufficient information.

Q3: How does the RAG pipeline work?

A3: The RAG pipeline involves several steps:
Vector Database: Store documents in a specialized Vector Database, and each document is indexed based on a semantic vector generated by an embedding model.
Query Vector Generation: When you submit a query, the same embedding model generates a semantic vector representing the query.
Vector-Based Retrieval: The model uses vector search to identify documents in the database with vectors closely aligned with the query’s vector, pinpointing the most relevant documents.
Response Generation: After retrieving pertinent documents, the model combines them with the query to generate a response, accessing external data as needed. This process enhances the model’s internal knowledge.

Q4: What are the benefits of using the RAG approach?

A4: The RAG approach offers several benefits, including:
More Precise Responses: RAG enables LLMs to deliver more precise and contextually appropriate responses by incorporating proprietary data from vector-search-enabled databases.
Reduced Errors: By providing evidence for retrieved information, RAG reduces errors and eliminates the need for frequent retraining in rapidly changing information scenarios.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
