Build Powerful Chat Assistant for PDFs and Articles Without OpenAI Key

Samy Ghebache 27 Sep, 2023

Introduction

The world of Natural Language Processing is expanding tremendously, especially with the birth of large language models, which have revolutionized this field and made it accessible to everyone. In this article, we will explore and implement some NLP techniques to create a powerful chat assistant that can respond to your questions based on a given article (or PDF) using open-source libraries, all without requiring an OpenAI API key.

This article was published as a part of the Data Science Blogathon.

Workflow

The workflow of the application is as follows: the user provides a PDF file or a URL to an article and asks a question, and the application attempts to answer it based on the provided source.

We will extract the content using the PyPDF2 library (in the case of a PDF file) or BeautifulSoup (in the case of an article URL). Then, we will split it into chunks using the CharacterTextSplitter from the langchain library.

For each chunk, we compute its embedding vector using the all-MiniLM-L6-v2 model, which maps sentences and paragraphs to a 384-dimensional dense vector space (an embedding is simply a way to represent a word or sentence as a vector). The same model is applied to the user's question.

The vectors are given as input to the semantic search function provided by sentence_transformers, a Python framework for state-of-the-art sentence, text, and image embeddings.

This function returns the text chunks most likely to contain the answer, and the Question Answering model generates the final answer from the output of semantic_search together with the user's question.

Note

All the mentioned models are accessible via API, using only HTTP requests.
The code is written in Python.
FAQ-QN is a keyword that indicates you should take a look at the FAQ section, specifically at question number N, for more details.

Implementation

In this section, I will focus only on the implementation, while the details will be provided in the FAQ section.

Dependencies

We start by installing the dependencies (the contents of requirements.txt are listed below) and then importing them.

pip install -r requirements.txt

numpy
torch
sentence-transformers
requests
langchain
beautifulsoup4
PyPDF2

import torch
import numpy as np
import requests
from bs4 import BeautifulSoup
from langchain.text_splitter import CharacterTextSplitter
from PyPDF2 import PdfReader
from sentence_transformers import util

torch : very useful when dealing with tensors (PyTorch library).
requests : to send HTTP requests.

Content Extraction

In case of a PDF

try:
    pdf = PdfReader(path_pdf_file)
    result = ''
    # Concatenate the text of every page
    for page in pdf.pages:
        result += page.extract_text()
except Exception:
    print("PDF file doesn't exist or could not be read")
    raise SystemExit(1)

In the case of an article, we collect the text inside HTML tags such as h1, h2, p, and li. (These tags work well for websites like Medium but may differ on others.)

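try:
    response = requests.get(URL_LINK)
    soup = BeautifulSoup(response.text, 'html.parser')
    elements = soup.find_all(['h1', 'p', 'li', 'h2'])
except Exception:
    print('Bad URL link')
    raise SystemExit(1)

# Join with newlines so that elements stay separated and the
# "\n" separator used by the text splitter has something to split on
result = '\n'.join(element.text for element in elements)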

Split into Chunks

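Each chunk will contain up to roughly 1000 characters (length_function=len counts characters, not tokens), with 200 characters of overlap between consecutive chunks so that context is not lost at chunk boundaries. (FAQ-Q2)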

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)

chunks = text_splitter.split_text(result)
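
To make chunk_size and chunk_overlap concrete, here is a minimal pure-Python sketch of overlapping chunking. It is a simplification and not LangChain's algorithm: CharacterTextSplitter first splits on the separator and then merges the pieces, whereas the hypothetical sliding_chunks helper below just slides a fixed-width window over the characters.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Illustrative only: fixed-width character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

# Tiny sizes so the overlap is easy to see
demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)
```

Each window repeats the last chunk_overlap characters of the previous one, which is what keeps a sentence that straddles a boundary visible in at least one chunk.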

Word Embedding

You can download the all-MiniLM-L6-v2 model from Hugging Face, or you can simply access it through HTTP requests, since it is available as an API. (FAQ-Q1)

Note: To access the Hugging Face APIs, you have to sign up (it's free) to obtain your token.

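hf_token = 'Put your Hugging Face access token here'

# Build the URL from adjacent string literals: a triple-quoted
# string would embed newline characters in the URL
api_url = ("https://api-inference.huggingface.co/pipeline/feature-extraction/"
           "sentence-transformers/all-MiniLM-L6-v2")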

headers = {"Authorization": f"Bearer {hf_token}"}
    
def query(texts):
  response = requests.post(api_url, headers=headers, json={"inputs": texts, "options":{"wait_for_model":True}})
  return response.json()

user_question = 'Put your question here'

question = query([user_question])
            
query_embeddings = torch.FloatTensor(question)
    
output=query(chunks)
    
output=torch.from_numpy(np.array(output)).to(torch.float)

The query function returns one 384-dimensional dense vector per input text, and the conversion to torch.float tensors is necessary for the semantic_search function.
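
When a request fails, the Inference API typically responds with a JSON object carrying an error message instead of a list of vectors, and building a tensor from that fails in a confusing way. The sketch below shows one way to check the payload first. The helper name embeddings_or_raise and the assumed error shape ({"error": "..."}) are illustrative assumptions; the expected dimension of 384 comes from the model description above.

```python
def embeddings_or_raise(response_json, expected_dim=384):
    """Return the list of vectors, or raise if the payload is not embeddings."""
    if isinstance(response_json, dict):
        # Assumed error shape: {"error": "..."}
        message = response_json.get("error", response_json)
        raise RuntimeError(f"Embedding request failed: {message}")
    for vector in response_json:
        if len(vector) != expected_dim:
            raise ValueError(
                f"Expected {expected_dim} dimensions, got {len(vector)}"
            )
    return response_json

# Usage with made-up payloads standing in for query(...) results
ok = embeddings_or_raise([[0.0] * 384, [0.1] * 384])
print(len(ok), len(ok[0]))
```

In the pipeline above, the results of query([user_question]) and query(chunks) could be passed through such a check before they are converted to tensors.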

Semantic Search

The final list will contain the 2 text chunks most likely to include the answer (I set top_k=2 to increase the probability of getting the right answer from the QA model). (FAQ-Q3)

result=util.semantic_search(query_embeddings, output,top_k=2)

final=[chunks[result[0][i]['corpus_id']] for i in range(len(result[0]))]
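
To see what semantic_search does conceptually, here is a minimal pure-Python sketch: score every chunk vector against the question vector with cosine similarity and keep the top_k best. It mirrors the shape of one result list (dictionaries with corpus_id and score), but top_k_cosine is a hypothetical helper run on toy 2-dimensional vectors, not the library's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_cosine(query_vec, corpus_vecs, top_k=2):
    """Rank corpus vectors by similarity to the query, best first."""
    hits = [
        {"corpus_id": i, "score": cosine(query_vec, vec)}
        for i, vec in enumerate(corpus_vecs)
    ]
    hits.sort(key=lambda hit: hit["score"], reverse=True)
    return hits[:top_k]

# Toy data: made-up vectors, not real embeddings
toy_chunks = ["about cats", "about dogs", "about cars"]
toy_vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
hits = top_k_cosine([1.0, 0.1], toy_vectors, top_k=2)
best = [toy_chunks[hit["corpus_id"]] for hit in hits]
print(best)
```

The last line plays the same role as the list comprehension above: it maps each corpus_id back to its text chunk.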

Question Answer Model

Since you have the context (text chunks) and the question, you can use any model you want (take a quick look at the Hugging Face QA models to get an idea). I chose the AI21 Studio Question Answer model; you can sign up for free to get an access token.

AI21_api_key = 'AI21studio api key'
url = "https://api.ai21.com/studio/v1/answer"
    
payload = {
                "context":' '.join(final),
                "question":user_question
          }
  
headers = {
                "accept": "application/json",
                "content-type": "application/json",
                "Authorization": f"Bearer {AI21_api_key}"
          }
    
response = requests.post(url, json=payload, headers=headers)
    
data = response.json()

if data['answerInContext']:
    print(data['answer'])
else:
    print('The answer is not found in the document ⚠️, please reformulate your question.')

The model lets you verify whether the answer is in the context or not. (With large language models, you may run into the problem where the LLM answers a question that is not related to the provided context.)
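
The check above indexes the JSON response directly, so it raises a KeyError if the request was rejected (for example, because of an invalid key) and the expected fields are missing. One way to make this sturdier is to separate the parsing from the printing. The pick_answer helper below is hypothetical; it relies only on the two fields used above, answerInContext and answer.

```python
def pick_answer(body):
    """Return the answer if the model found it in the context, else None."""
    if not isinstance(body, dict):
        return None
    # .get() avoids a KeyError when the response is an error payload
    if body.get("answerInContext") and body.get("answer"):
        return body["answer"]
    return None

# Usage with made-up response bodies
print(pick_answer({"answerInContext": True, "answer": "384"}))
print(pick_answer({"answerInContext": False, "answer": None}))
print(pick_answer({"detail": "Unauthorized"}))
```

A caller can then print the fallback message whenever pick_answer returns None, whatever the reason.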

Conclusion

You can extend this project to various source inputs (PowerPoint files, YouTube videos/audios, slides, audiobooks) at a relatively low cost, so feel free to adapt it to your use cases. Additionally, you can create a simple UI for this application and host it with Streamlit, as I did (the GitHub repo can be found here; don't forget to hit the star button).

In this article, we built a powerful chat assistant for your PDF files/articles.

We used web scraping techniques to extract the text from the source.
Text was split into multiple chunks.
We calculated the word embedding vector for each chunk and for the user question.
We applied the semantic search function to detect the most relevant text chunks.
The final answer was provided by the Question Answer model.

Thank you for your time and attention. For further assistance:

LinkedIn : SAMY GHEBACHE

Email : [email protected].

Frequently Asked Questions

Q1. What about the word embedding model, all-MiniLM-L6-v2, and its training process?

A. This model is the result of fine-tuning the nreimers/MiniLM-L6-H384-uncased model on a dataset of 1 billion sentence pairs. The base model was trained with a self-supervised technique in which the model is given a phrase with a missing word and tries to predict it. The embedding vectors are computed from the learned weights of this model, and its hidden size of 384 (not the number of layers, which is 6) gives the dimension of the vectors in our case.

Q2. Why do we split the text into chunks?

A. We could pass the entire extracted text to the Question Answer model directly, without performing the semantic search step, but that would be very costly if you are using the OpenAI API (or any paid API): you pay for each token, so it gets expensive quickly. Question Answer models are also limited in the number of input tokens they accept, so you can't handle a PDF paper with many pages or even a long article. In addition, model performance is not the same on a chunk of 1,000 tokens as on a whole text of 10,000 tokens or more.

Q3. Can you explain the work of semantic search?

A. The idea behind this function is to project your sentences and paragraphs into an N-dimensional vector space. In our case, the word embedding model transforms each chunk into V_chunk and the user question into V_question, where Dimension(V_chunk) = Dimension(V_question) = N (N = 384). Then, we compute Similarity(V_chunk, V_question) for each chunk and keep the chunks with the highest similarity values. The SentenceTransformers framework uses cosine similarity for this.
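
For reference, the standard definition of cosine similarity, written with the V_chunk and V_question notation used above (the component index i is added here for illustration):

```latex
\mathrm{Similarity}(V_{chunk}, V_{question})
  = \frac{V_{chunk} \cdot V_{question}}{\lVert V_{chunk} \rVert \, \lVert V_{question} \rVert}
  = \frac{\sum_{i=1}^{N} V_{chunk,i} \, V_{question,i}}
         {\sqrt{\sum_{i=1}^{N} V_{chunk,i}^{2}} \; \sqrt{\sum_{i=1}^{N} V_{question,i}^{2}}}
```

The value lies between -1 and 1, and a higher value means the chunk points in a direction closer to that of the question in the embedding space.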

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.