How to Build a PDF Chatbot Without Langchain?

Sunil Kumar Dash 12 Sep, 2023

Introduction

Since the release of ChatGPT, the pace of progress in the AI space has shown no sign of slowing down; new tools and technologies are being developed every day. Sure, it’s a great thing for businesses and the AI space in general, but as a programmer, do you need to learn all of them to build something? Well, the answer is no. A more pragmatic approach is to learn the things you actually need. There are a lot of tools and technologies that promise to make things easier, and to some extent they do. But at times we do not need them at all. Using large frameworks for simple use cases can leave your code a bloated mess. So, in this article, we will build a CLI PDF chatbot without Langchain and see why we do not always need AI frameworks.

Learning Objectives

Why you do not always need AI frameworks like Langchain and Llama Index
Understand when you need frameworks
Learn about Vector Databases and Indexing
Build a CLI Q&A chatbot from scratch in Python

This article was published as a part of the Data Science Blogathon.

Introduction
Can you do without Langchain?
When do you Need Langchain?
Building a QA Chatbot
What are Vector Databases and indexes?
Build Project Environment
Utility Functions for Chatbot CLI
Chatbot CLI
Python Argparse
Building the CLI
Real-world Use Cases
Conclusion
Frequently Asked Questions

Can you do without Langchain?

Over recent months, frameworks such as Langchain and Llama Index have surged in popularity, primarily because they make it convenient for developers to build LLM apps. But for a lot of use cases, these frameworks can be overkill. It’s like bringing a bazooka to a gunfight.

They ship with things that may not be required in your project. Python is already infamous for being bloated. On top of that, adding dependencies that you hardly need will only make your environment messier. One such use case is document querying. If your project does not involve an AI agent or other such complicated components, you can ditch Langchain and build the workflow from scratch, reducing unnecessary bloat. Besides, frameworks like Langchain and Llama Index are under rapid development, so an upstream refactor might break your build.

When do you Need Langchain?

If you have a higher-order need, such as building an agent to automate complicated software, or a project that would require long engineering hours to build from scratch, it makes sense to use prebuilt solutions. Never reinvent the wheel unless you need a better wheel. There are countless other examples where using ready-made solutions with minor tweaks makes absolute sense.

Building a QA Chatbot

One of the most sought-after use cases of LLMs has been Document question and answering. And after OpenAI made their ChatGPT endpoints public, it has become much easier to build an interactive conversational bot with any text data sources. In this article, we will build an LLM Q&A CLI app from scratch. So, how do we approach the problem? Before building it let’s understand what we need to do.

A typical workflow will involve

Processing the provided PDF file to extract texts.
We also need to be careful about the context window of the LLM. So, we need to make chunks of those texts.
To query relevant chunks of text, we need to get embeddings of those text chunks. For this, we need an embedding model. For this project, we will use the MiniLM-L6-V2 model from Hugging Face, but you can go with any embedding model you wish, such as those from OpenAI, Cohere, or Google PaLM.
For storing and retrieving embeddings, we will use a Vector database such as Chroma. There are many different Vector Databases you can opt for such as Qdrant, Weaviate, Milvus, and many more.
When a user sends a query, it will get converted to embeddings by the same model, and the chunks with similar meaning to the query will be fetched.
The fetched chunks will be concatenated with the query at the end and will be fed to the LLM via an API.
The fetched answer from the model will be returned to the user.

All these things will require a user-facing interface. For this article, we will build a simple Command Line Interface with Python Argparse.

Here is a workflow diagram of our CLI chatbot:

[Figure: workflow diagram of the CLI PDF chatbot]

Before going into the coding part, let’s understand a thing or two about vector Databases and Indexes.

What are Vector Databases and indexes?

As the name suggests, vector databases store vectors or embeddings. So, why do we need vector databases? Building any AI application requires embeddings of real-world data, as machine learning models cannot directly process raw data such as text, images, or audio. When you are dealing with a large amount of data that will be used repeatedly, it needs to be stored somewhere. So, why can’t we use a traditional database for this? Well, you can use traditional databases for your search needs, but vector databases offer a significant advantage: they can perform vector similarity search in addition to lexical search.

In our case, whenever a user sends a query, the vector DB will perform a vector similarity search over all the embeddings and fetch the K nearest neighbors. The search mechanism is superfast as it employs an algorithm called HNSW.
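
To make "fetching the K nearest neighbors" concrete, here is a minimal sketch of an exact, brute-force search in plain Python. It is not how Chroma works internally; the function names, the toy vectors, and the choice of cosine similarity as the metric are all assumptions made for illustration. An approximate index aims to return nearly the same neighbors without scanning every vector.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: List[float], vectors: List[List[float]], k: int) -> List[Tuple[int, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings"; real embeddings have hundreds of dimensions.
toy_embeddings = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.9, 0.1],
    [-1.0, 0.0],
]

nearest = top_k([1.0, 0.0], toy_embeddings, k=2)
print([index for index, _ in nearest])  # [0, 2]
```

The cost of this approach grows linearly with the number of stored vectors, which is why large collections rely on an index instead.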

HNSW stands for Hierarchical Navigable Small World. It is a graph-based algorithm and indexing method for Approximate Nearest Neighbor (ANN) search. ANN search finds the k items that are approximately the most similar to a given item, trading a small amount of accuracy for a large gain in speed.

HNSW works by building a layered graph of the data points. The nodes in the graph represent the data points, and the edges connect points that are close to each other. The upper layers contain fewer points and allow long jumps, while the bottom layer contains every point. The graph is traversed from the top layer downward to find the k most similar items to the given item.
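
The traversal idea can be sketched on a single layer. The snippet below is a deliberately simplified greedy walk over a hand-built neighbor graph, not an implementation of HNSW: it has no hierarchy and no candidate list, and the point names, coordinates, and edges are invented for the example.

```python
import math
from typing import Dict, List


def greedy_search(
    points: Dict[str, List[float]],
    graph: Dict[str, List[str]],
    entry: str,
    query: List[float],
) -> str:
    """Walk the graph, always moving to the neighbor closest to the query.

    Stops when no neighbor is closer than the current node.
    """
    current = entry
    while True:
        best = min(graph[current], key=lambda name: math.dist(points[name], query))
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current
        current = best


# Hypothetical data: five points and their neighbor lists.
points = {
    "a": [0.0, 0.0],
    "b": [1.0, 0.0],
    "c": [2.0, 0.0],
    "d": [3.0, 0.0],
    "e": [3.0, 1.0],
}
graph = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}

print(greedy_search(points, graph, entry="a", query=[3.0, 0.9]))  # e
```

Only the neighbors of the visited nodes are compared with the query, which is the reason a graph index can avoid scanning the whole collection.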

The HNSW algorithm is fast, reliable, and scalable. Many vector databases use HNSW as their default index.

Now, we are all set to delve into the code.

Build Project Environment

As with any Python project, start with creating a virtual environment. This keeps the development environment nice and tidy. Refer to this article for choosing the right Python environment for your project.

The project file structure is simple: we will have two Python files, one for defining the CLI and the other for processing, storing, and querying data. Also, create a .env file to store your OpenAI API key.

This is the requirements.txt file; install the packages in it before getting started.

# requirements.txt
openai<1.0  # the code below uses openai.ChatCompletion, which was removed in openai 1.0
chromadb
PyPDF2
python-dotenv

Now, import the necessary classes and functions.

import os
import openai
import PyPDF2
import re
from chromadb import Client, Settings
from chromadb.utils import embedding_functions
from PyPDF2 import PdfReader
from typing import List, Dict
from dotenv import load_dotenv

Load the OpenAI API key from the .env file.

load_dotenv()
key = os.environ.get('OPENAI_API_KEY')
openai.api_key = key
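
Because os.environ.get returns None when a variable is missing, a forgotten .env entry would only surface later, at the first API call. A small guard can fail earlier with a clearer message. This is a sketch; require_env is a hypothetical helper and DEMO_API_KEY is a stand-in variable used only for the demonstration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or environment")
    return value


# Stand-in value for the demo only; never hard-code a real key.
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))  # sk-demo
```

In the script above, the same check would be applied to OPENAI_API_KEY after load_dotenv() has run.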

Utility Functions for Chatbot CLI

To store text embeddings and their metadata, we will create a collection with ChromaDB.

ef = embedding_functions.ONNXMiniLM_L6_V2()
client = Client(settings = Settings(persist_directory="./", is_persistent=True))
collection_ = client.get_or_create_collection(name="test", embedding_function=ef)

As an embedding model, we are using MiniLM-L6-V2 with ONNX runtime. It is small yet capable and on top of that open-sourced.

Next, we will define a function to verify if a provided file path belongs to a valid PDF file.

def verify_pdf_path(file_path):
    try:
        # Attempt to open the PDF file in binary read mode
        with open(file_path, "rb") as pdf_file:
            # Create a PDF reader object using PyPDF2
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            
            # Check if the PDF has at least one page
            if len(pdf_reader.pages) > 0:
                # If it has pages, the PDF is not empty, so do nothing (pass)
                pass
            else:
                # If it has no pages, raise an exception indicating that the PDF is empty
                raise ValueError("PDF file is empty")
    except PyPDF2.errors.PdfReadError:
        # Handle the case where the PDF cannot be read (e.g., it's corrupted or not a valid PDF)
        raise PyPDF2.errors.PdfReadError("Invalid PDF file")
    except FileNotFoundError:
        # Handle the case where the specified file doesn't exist
        raise FileNotFoundError("File not found, check file address again")
    except Exception as e:
        # Handle other unexpected exceptions and display the error message
        raise Exception(f"Error: {e}")

One of the major parts of a PDF Q&A app is to get text chunks. So, we need to define a function that gets us the required chunks of text.

def get_text_chunks(text: str, word_limit: int) -> List[str]:
    """
    Divide a text into chunks with a specified word limit 
    while ensuring each chunk contains complete sentences.
    
    Parameters:
        text (str): The entire text to be divided into chunks.
        word_limit (int): The desired word limit for each chunk.
    
    Returns:
        List[str]: A list containing the chunks of text with 
        the specified word limit and complete sentences.
    """
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    chunks = []
    current_chunk = []

    for sentence in sentences:
        words = sentence.split()
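        # Compare word counts (not character counts) against the limit
        if len(current_chunk) + len(words) <= word_limit:
            current_chunk.extend(words)
        else:
            # Flush the current chunk, skipping an empty one
            if current_chunk:
                chunks.append(" ".join(current_chunk))
            current_chunk = words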

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

We have defined a basic algorithm for getting chunks. The idea is to let users choose the maximum number of words in a single text chunk. Every text chunk ends with a complete sentence, so a single sentence longer than the limit becomes its own chunk and exceeds the limit. This is a simple algorithm. You may create something of your own.
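
As one example of an alternative, here is a sketch of a fixed-size word window with overlap, so that text near a chunk boundary appears in both neighboring chunks. This is not the method used in this article; chunk_with_overlap and its parameters are hypothetical, and it ignores sentence boundaries entirely.

```python
from typing import List


def chunk_with_overlap(text: str, size: int, overlap: int) -> List[str]:
    """Split text into chunks of `size` words, repeating `overlap` words between chunks."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")

    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once the window has reached the end of the text
        if start + size >= len(words):
            break
    return chunks


sample = "one two three four five six seven"
print(chunk_with_overlap(sample, size=4, overlap=2))
# ['one two three four', 'three four five six', 'five six seven']
```

Overlap costs extra storage and embedding time, so it is a trade-off rather than a strict improvement.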

Create a Dictionary

Now, we need a function to load texts from PDFs and create a dictionary to keep track of text chunks belonging to a single page.

def load_pdf(file: str, word: int) -> Dict[int, List[str]]:
    # Create a PdfReader object from the specified PDF file
    reader = PdfReader(file)
    
    # Initialize an empty dictionary to store the extracted text chunks
    documents = {}
    
    # Iterate through each page in the PDF
    for page_no in range(len(reader.pages)):
        # Get the current page
        page = reader.pages[page_no]
        
        # Extract text from the current page
        texts = page.extract_text()
        
        # Use the get_text_chunks function to split the extracted text into chunks of 'word' length
        text_chunks = get_text_chunks(texts, word)
        
        # Store the text chunks in the documents dictionary with the page number as the key
        documents[page_no] = text_chunks
    
    # Return the dictionary containing page numbers as keys and text chunks as values
    return documents

ChromaDB Collection

Now, we need to store the data in a ChromaDB collection.

def add_text_to_collection(file: str, word: int = 200) -> str:
    # Load the PDF file and extract text chunks
    docs = load_pdf(file, word)
    
    # Initialize empty lists to store data
    docs_strings = []  # List to store text chunks
    ids = []  # List to store unique IDs
    metadatas = []  # List to store metadata for each text chunk
    id = 0  # Initialize ID
    
    # Iterate through each page and text chunk in the loaded PDF
    for page_no in docs.keys():
        for doc in docs[page_no]:
            # Append the text chunk to the docs_strings list
            docs_strings.append(doc)
            
            # Append metadata for the text chunk, including the page number
            metadatas.append({'page_no': page_no})
            
            # Append a unique ID for the text chunk
            ids.append(id)
            
            # Increment the ID
            id += 1

    # Add the collected data to a collection
    collection_.add(
        ids=[str(id) for id in ids],  # Convert IDs to strings
        documents=docs_strings,  # Text chunks
        metadatas=metadatas,  # Metadata
    )
    
    # Return a success message
    return "PDF embeddings successfully added to collection"

In Chromadb, the metadata field stores additional information regarding the documents. In this case, the page number of a text chunk is its metadata. After extracting metadata from each text chunk, we can store them in the collection we created earlier. This is required only when the user provides a valid file path to a PDF file.
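
The collection expects three parallel lists: ids, documents, and metadatas. In the function above the ids restart at 0 on every call, so a second PDF added to the same collection would reuse the ids of the first. The sketch below shows one way to build the lists with ids that also encode a source name and page; flatten_documents and the prefix parameter are hypothetical and not part of the article's code.

```python
from typing import Dict, List, Tuple


def flatten_documents(
    docs: Dict[int, List[str]], prefix: str
) -> Tuple[List[str], List[str], List[dict]]:
    """Turn {page_no: [chunks]} into parallel ids, texts, and metadata lists."""
    ids, texts, metadatas = [], [], []
    for page_no, chunks in docs.items():
        for chunk_no, chunk in enumerate(chunks):
            # The prefix keeps ids from different files apart
            ids.append(f"{prefix}-p{page_no}-c{chunk_no}")
            texts.append(chunk)
            metadatas.append({"page_no": page_no})
    return ids, texts, metadatas


docs = {0: ["first chunk", "second chunk"], 1: ["third chunk"]}
ids, texts, metadatas = flatten_documents(docs, prefix="report")
print(ids)  # ['report-p0-c0', 'report-p0-c1', 'report-p1-c0']
```

The three returned lists have the same length and order, which is what lets each id line up with its text and page number.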

We will now define a function that processes user queries to fetch data from the database.

def query_collection(texts: str, n: int) -> List[str]:
    result = collection_.query(
                  query_texts = texts,
                  n_results = n,
                 )
    documents = result["documents"][0]
    metadatas = result["metadatas"][0]
    resulting_strings = []
    for page_no, text_chunk in zip(metadatas, documents):
        resulting_strings.append(f"Page {page_no['page_no']}: {text_chunk}")
    return resulting_strings

The above function uses the query method to retrieve the “n” most relevant text chunks from the database. We then create a formatted string for each chunk that starts with its page number.
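
The [0] indexing is needed because the query result holds one inner list per query text. The sketch below uses a hand-written dictionary that imitates this nested shape; the ids, texts, and page numbers are invented, and format_results is a hypothetical stand-alone version of the formatting step.

```python
from typing import List

# Invented sample that mimics the nested shape of a query result:
# one inner list per query text, here a single query.
sample_result = {
    "ids": [["3", "7"]],
    "documents": [["Chunk about HNSW.", "Chunk about embeddings."]],
    "metadatas": [[{"page_no": 2}, {"page_no": 5}]],
}


def format_results(result: dict) -> List[str]:
    """Prefix each retrieved chunk with the page number stored in its metadata."""
    documents = result["documents"][0]
    metadatas = result["metadatas"][0]
    return [f"Page {meta['page_no']}: {doc}" for meta, doc in zip(metadatas, documents)]


print(format_results(sample_result))
# ['Page 2: Chunk about HNSW.', 'Page 5: Chunk about embeddings.']
```

Note that the page numbers come from the loop index in load_pdf, so they are zero-based.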

Now, the only major thing remaining is to feed the LLM with information.

def get_response(queried_texts: str) -> str:
    global messages
    messages = [
                {"role": "system", "content": "You are a helpful assistant.\
                 And will always answer the question asked in 'ques:' and \
                 will quote the page number while answering any questions,\
                 It is always at the start of the prompt in the format 'page n'."},
                {"role": "user", "content": ''.join(queried_texts)}
          ]

    response = openai.ChatCompletion.create(
                            model = "gpt-3.5-turbo",
                            messages = messages,
                            temperature=0.2,               
                     )
    response_msg = response.choices[0].message.content
    messages = messages + [{"role":'assistant', 'content': response_msg}]
    return response_msg

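The global variable messages stores the messages of the current exchange. Because it is reassigned on every call, earlier questions and answers are not carried over between calls. We have defined a system message that asks the LLM to quote the page number its answer comes from.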
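
If you want follow-up questions to see earlier turns, the message list has to be extended rather than rebuilt. The sketch below shows only the bookkeeping, with no API call; start_conversation and record_turn are hypothetical helpers, and the assistant replies are hard-coded placeholders standing in for model responses.

```python
from typing import Dict, List

Message = Dict[str, str]


def start_conversation(system_prompt: str) -> List[Message]:
    """Create a history that begins with the system message."""
    return [{"role": "system", "content": system_prompt}]


def record_turn(history: List[Message], user_content: str, assistant_content: str) -> None:
    """Append one user message and the reply it received."""
    history.append({"role": "user", "content": user_content})
    history.append({"role": "assistant", "content": assistant_content})


history = start_conversation("You are a helpful assistant.")
record_turn(history, "Page 0: Cats purr.\nques: What do cats do?", "Cats purr (page 0).")
record_turn(history, "Page 3: Dogs bark.\nques: What do dogs do?", "Dogs bark (page 3).")

print([message["role"] for message in history])
# ['system', 'user', 'assistant', 'user', 'assistant']
```

A growing history also consumes the model's context window, so long conversations would need trimming.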

Lastly, the final utility function combines the retrieved text chunks with the user query, feeds the result into the get_response() function, and returns the resulting answer string.

def get_answer(query: str, n: int):
    queried_texts = query_collection(texts = query, n = n)
    # Join every retrieved chunk (not only the first) and append the question
    queried_string = "\n".join(queried_texts) + f"\nques: {query}"
    answer = get_response(queried_texts = queried_string,)
    return answer
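
The following sketch shows how several retrieved chunks could be combined with the question into a single prompt string. build_prompt is a hypothetical helper and the chunks are invented; the "ques:" marker matches the one named in the system message.

```python
from typing import List


def build_prompt(chunks: List[str], query: str) -> str:
    """Place the retrieved chunks first and the question last."""
    context = "\n".join(chunks)
    return f"{context}\nques: {query}"


prompt = build_prompt(
    ["Page 0: Cats purr.", "Page 3: Dogs bark."],
    "What do cats do?",
)
print(prompt)
# Page 0: Cats purr.
# Page 3: Dogs bark.
# ques: What do cats do?
```

Each chunk keeps its "Page n:" prefix, which gives the model something to quote when it cites a page.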

We are done with our utility functions. Let’s move on to building CLI.

Chatbot CLI

To use the chatbot on-demand, we need an interface. This could be a web app, a mobile app, or a CLI. In this article, we will build a CLI for our chatbot. If you want to build a nice-looking demo web app, you can use tools like Gradio or Streamlit. Check out this article on building a chatbot for PDF.

Build a ChatGPT for PDFs with Langchain

To build a CLI, we will need the Argparse library. Argparse is a potent library that lets you create CLIs in Python. It has a simple and easy syntax to create commands, sub-commands, and flags. So, before delving into it, here is a small primer on Argparse.

Python Argparse

The argparse module has been part of the Python standard library since Python 2.7 and 3.2, providing a quick and convenient way to build CLI applications without relying on third-party installations. It lets us parse command-line arguments, create sub-commands, and more, making it a reliable tool for building CLIs.

Here’s a small example of argparse in action:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-f", "--filename", required=True, help="The name of the file to read.")
parser.add_argument("-n", "--number", help="The number of lines to print.", type=int)
parser.add_argument("-s", "--sort", help="Sort the lines in the file.", action="store_true")

args = parser.parse_args()

with open(args.filename) as f:
    lines = f.readlines()

if args.sort:
    lines.sort()

# Print only the first --number lines when the flag is given (None keeps them all)
for line in lines[:args.number]:
    # Lines already end with a newline, so suppress print's own
    print(line, end="")

The add_argument method lets us define arguments and flags along with some validation. We can set the type of an argument, the action to take when a flag is provided, and a help string that explains what the argument is for. Running the script with -h or --help displays all the flags and their descriptions.
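
To see how type, default, and action behave without touching the terminal, parse_args can be given an explicit list of arguments. The parser below is a throwaway demo whose flags and help strings are made up for this illustration.

```python
import argparse

parser = argparse.ArgumentParser(description="Demo parser")
parser.add_argument("-n", "--number", type=int, default=1, help="How many items to fetch")
parser.add_argument("-s", "--sort", action="store_true", help="Sort the output")

# Passing a list parses it instead of sys.argv
args = parser.parse_args(["-n", "5", "--sort"])
print(args.number, args.sort)  # 5 True

# With no arguments, the defaults apply
defaults = parser.parse_args([])
print(defaults.number, defaults.sort)  # 1 False
```

The string "5" is converted to an integer by type=int, and store_true turns the presence of the flag into a boolean.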

On a similar note, we will define flags for the chatbot CLI.

Building the CLI

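Import argparse and the necessary utility functions. Note that clear_coll, which clears the collection, is imported here but is not among the utility functions listed above.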

import argparse
from app import (
    add_text_to_collection, 
    get_answer, 
    verify_pdf_path, 
    clear_coll
  )

Define Argument parser and add arguments.

def main():
    # Create a command-line argument parser with a description
    parser = argparse.ArgumentParser(description="PDF Processing CLI Tool")
    
    # Define command-line arguments
    parser.add_argument("-f", "--file", help="Path to the input PDF file")
    
    parser.add_argument(
        "-c", "--count",
        default=200, 
        type=int, 
        help="Optional integer value for the number of words in a single chunk"
    )
    
    parser.add_argument(
        "-q", "--question", 
        type=str,
        help="Ask a question"
    )
    
    parser.add_argument(
        "-cl", "--clear", 
        type=bool, 
        help="Clear existing collection data"
    )
    
    parser.add_argument(
        "-n", "--number", 
        type=int, 
        default=1, 
        help="Number of results to be fetched from the collection"
    )

    # Parse the command-line arguments
    args = parser.parse_args()

We have defined a few flags, such as --file, --count, and --question.

--file: The string file path of a PDF.
--count: An optional integer that defines the number of words in a text chunk (default 200).
--question: Takes a user query as a parameter.
--number: Number of similar chunks to be fetched (default 1).
--clear: Clears the current Chromadb collection.

Now, we process the arguments:

    if args.file is not None:
        verify_pdf_path(args.file)
        confirmation = add_text_to_collection(file=args.file, word=args.count)
        print(confirmation)

    if args.question is not None:
        # Fall back to a single result if --number is 0 or missing
        n = args.number if args.number else 1
        answer = get_answer(args.question, n=n)
        print("Answer:", answer)

    if args.clear:
        clear_coll()
        print("Current collection cleared successfully")

Putting everything together.

import argparse
from app import (
    add_text_to_collection, 
    get_answer, 
    verify_pdf_path, 
    clear_coll
)

def main():
    # Create a command-line argument parser with a description
    parser = argparse.ArgumentParser(description="PDF Processing CLI Tool")
    
    # Define command-line arguments
    parser.add_argument("-f", "--file", help="Path to the input PDF file")
    
    parser.add_argument(
        "-c", "--count",
        default=200, 
        type=int, 
        help="Optional integer value for the number of words in a single chunk"
    )
    
    parser.add_argument(
        "-q", "--question", 
        type=str,
        help="Ask a question"
    )
    
    parser.add_argument(
        "-cl", "--clear", 
        type=bool, 
        help="Clear existing collection data"
    )
    
    parser.add_argument(
        "-n", "--number", 
        type=int, 
        default=1, 
        help="Number of results to be fetched from the collection"
    )

    # Parse the command-line arguments
    args = parser.parse_args()
    
    # Check if the '--file' argument is provided
    if args.file is not None:
        # Verify the PDF file path and add its text to the collection
        verify_pdf_path(args.file)
        confirmation = add_text_to_collection(file=args.file, word=args.count)
        print(confirmation)

    # Check if the '--question' argument is provided
    if args.question is not None:
        n = args.number if args.number else 1  # Set 'n' to the specified number or default to 1
        answer = get_answer(args.question, n=n)
        print("Answer:", answer)

    # Check if the '--clear' argument is provided
    if args.clear:
        clear_coll()
        print("Current collection cleared successfully")

if __name__ == "__main__":
    main()

Now open your terminal and run the command below.

python cli.py -f "path/to/file.pdf" -c 1000 -n 1 -q "query"

To delete the collection, type

python cli.py -cl True
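
One caveat with type=bool: argparse calls bool() on the raw string, and any non-empty string is truthy, so -cl False would clear the collection as well. The sketch below demonstrates this on a throwaway parser and contrasts it with action="store_true"; the --reset flag is hypothetical and exists only for the comparison.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-cl", "--clear", type=bool)      # as defined in the CLI above
parser.add_argument("--reset", action="store_true")   # hypothetical alternative

# bool("False") is True because the string is not empty
print(parser.parse_args(["-cl", "False"]).clear)  # True
print(parser.parse_args([]).clear)                # None

# store_true takes no value: present means True, absent means False
print(parser.parse_args(["--reset"]).reset)       # True
print(parser.parse_args([]).reset)                # False
```

With a store_true flag the command would take no value after the flag.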

If the provided file path does not exist, the CLI raises a FileNotFoundError; if the file exists but is not a valid PDF, verify_pdf_path is written to raise a PdfReadError.

[Figure: File not found error]

The GitHub Repository: https://github.com/sunilkumardash9/pdf-cli-chatbot

Real-world Use Cases

A chatbot running as a CLI tool can be used in many real-world applications, such as

Academic Research: Researchers often deal with numerous research papers and articles in PDF format. A CLI chatbot could help them extract relevant information, create bibliographies, and organize their references efficiently.

Language Translation: Language professionals can use the chatbot to extract text from PDFs, translate it, and then generate translated documents, all from the command line.

Educational Institutions: Teachers and educators can extract content from educational resources to create customized learning materials or to prepare course content. Students can extract useful information from large PDFs from the chatbot CLI.

Open Source Project Management: CLI chatbots can help open-source software projects manage documentation, extract code snippets, and generate release notes from PDF manuals.

Conclusion

So, this was all about building a PDF Q&A chatbot with a command-line interface, without using frameworks such as Langchain and Llama Index. Here is a quick summary of the things we covered.

Langchain and other AI frameworks can be a great way to get started with AI development. However, it’s important to remember that they are not a silver bullet. They can make your code more complex and can cause bloat, so use them only when you need them.
The use of frameworks makes sense when the complexity of projects requires longer engineering hours if done from scratch.
A document Q&A workflow can be designed from first principles, without a framework like Langchain.

Frequently Asked Questions

Q1. What is a PDF chatbot?

A. A PDF chatbot is an interactive bot designed to retrieve information from PDFs.

Q2. What is Langchain used for?

A. LangChain is an open-source framework that simplifies the creation of applications using large language models. It can be used for a variety of tasks, including chatbots, document analysis, code analysis, question answering, and generative tasks.

Q3. Is a chatbot an AI tool?

A. Yes, chatbots are AI tools. They use artificial intelligence (AI) and natural language processing (NLP) to simulate human conversation. Chatbots can be used to provide customer service, answer questions, and even generate creative content.

Q4. What are chatbots for PDFs used for?

A. Chatbots for PDFs are tools that allow you to interact with PDF files using natural language. You can ask questions about the PDF, and the chatbot will try to answer them. You can also ask a PDF chatbot to summarize the PDF or to extract specific information from it.

Q5. Can I chat with a PDF?

A. Yes, with the advent of capable Large Language Models and vector stores, it is possible to chat with PDFs.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
