Whether you’re a student extracting insights from research papers or a data analyst querying datasets, you are likely inundated with information stored in many file formats: research papers in PDF, reports in DOCX, plain text documents (TXT), and structured data in CSV files. Accessing and extracting information from these sources efficiently is a growing need. That’s where the Multi-File Chatbot comes in: a versatile tool that answers questions about PDFs, DOCX files, TXT documents, and CSV datasets, and can process multiple files at once.
Prepare for an exciting journey as we plunge into the intricacies of the code and functionalities that bring the Multi-File Chatbot to life. Get ready to unlock the full potential of your data with the power of Generative AI at your fingertips!
Before we dive into the details, let’s outline the key learning objectives of this article:
This article was published as a part of the Data Science Blogathon.
In today’s digital age, the volume of information stored in various file formats has grown exponentially. The ability to efficiently access and extract valuable insights from these diverse sources has become increasingly vital. This need has given rise to a Multi-File Chatbot, a specialized tool designed to address these information retrieval challenges. File Chatbots, powered by advanced Generative AI, are the future of information retrieval.
A File Chatbot is an innovative software application powered by Artificial Intelligence (AI) and Natural Language Processing (NLP) technologies. It is tailored to analyze and extract information from a wide range of file formats, including but not limited to PDFs, DOCX documents, plain text files (TXT), and structured data in CSV files. Unlike traditional chatbots that primarily interact with users through text conversations, a File Chatbot focuses on understanding and responding to questions based on the content stored within these files.
The utility of a Multi-File Chatbot extends across various domains and industries. Here are some key use cases that highlight its significance:
– Research Paper Analysis: Students and researchers can use a File Chatbot to extract critical information and insights from extensive research papers stored in PDF format. It can provide summaries, answer specific questions, and aid in literature review processes.
– Textbook Assistance: Educational institutions can deploy File Chatbots to assist students by answering questions related to textbook content, thereby enhancing the learning experience.
The workflow of a Multi-File Chatbot involves several key steps, from user interaction to file processing and answering questions.
Using virtual environments is a good practice to isolate project-specific dependencies and avoid conflicts with system-wide packages. Here’s how to set up a Python environment:
Create a Virtual Environment:
python -m venv env_name
Activate the Virtual Environment:
On Windows:
.\env_name\Scripts\activate
On macOS/Linux:
source env_name/bin/activate
Install Project Dependencies:
Note: Choose either Hugging Face or OpenAI for your language-related tasks.
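One way to make that choice explicit is to check which API key is available before building the models. The sketch below is only an illustration: `choose_backend` is a hypothetical helper that is not part of the original app, and preferring OpenAI when both keys are present is an arbitrary assumption. The variable names `OPENAI_API_KEY` and `HUGGINGFACEHUB_API_TOKEN` are the ones the OpenAI and Hugging Face Hub integrations conventionally read, and they can live in the `.env` file that `load_dotenv()` loads.

```python
import os


def choose_backend(env=None):
    """Return "openai" or "huggingface" depending on which API key is set.

    choose_backend is a hypothetical helper, not part of the original app.
    """
    env = os.environ if env is None else env
    # Assumption: prefer OpenAI when both keys are present.
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("HUGGINGFACEHUB_API_TOKEN"):
        return "huggingface"
    raise RuntimeError(
        "Set OPENAI_API_KEY or HUGGINGFACEHUB_API_TOKEN (for example in a .env file)."
    )
```

The returned string could then decide whether the commented-out OpenAI lines or the Hugging Face lines in the functions below are used.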
import streamlit as st
from docx import Document
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings, HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from htmlTemplates import css, bot_template, user_template
from langchain.llms import HuggingFaceHub
import os
from dotenv import load_dotenv
import tempfile
from transformers import pipeline
import pandas as pd
import io
The chatbot processes the following file types:
PDF Files
# Extract text from a PDF file
def get_pdf_text(pdf_file):
    text = ""
    pdf_reader = PdfReader(pdf_file)
    for page in pdf_reader.pages:
        # extract_text() can come back empty for image-only pages
        text += page.extract_text() or ""
    return text
DOCX Files
# Extract text from a DOCX file
def get_word_text(docx_file):
    document = Document(docx_file)
    text = "\n".join([paragraph.text for paragraph in document.paragraphs])
    return text
TXT Files
# Extract text from a TXT file
def read_text_file(txt_file):
    text = txt_file.getvalue().decode('utf-8')
    return text
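Plain text files are not always UTF-8, and a strict `decode('utf-8')` raises `UnicodeDecodeError` on anything else. A minimal sketch of a more forgiving decoder follows; `decode_text_bytes` is a hypothetical helper, and the Latin-1 fallback is an assumption about which legacy encoding your files are likely to use.

```python
def decode_text_bytes(raw: bytes) -> str:
    """Decode raw bytes as UTF-8, falling back to Latin-1 if that fails."""
    try:
        # "utf-8-sig" also strips a leading byte-order mark if one is present.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never raises,
        # though text in other encodings may come out garbled.
        return raw.decode("latin-1")
```

Inside `read_text_file`, this could be called as `decode_text_bytes(txt_file.getvalue())`.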
CSV Files
In addition to PDFs and DOCX files, our chatbot can work with CSV files. We use the Hugging Face Transformers library to answer questions based on tabular data. Here’s how we handle CSV files and user questions:
def handle_csv_file(csv_file, user_question):
    # Read the CSV file
    csv_text = csv_file.read().decode("utf-8")

    # Create a DataFrame from the CSV text
    df = pd.read_csv(io.StringIO(csv_text))
    df = df.astype(str)

    # Initialize a Hugging Face table-question-answering pipeline
    qa_pipeline = pipeline("table-question-answering", model="google/tapas-large-finetuned-wtq")

    # Use the pipeline to answer the question
    response = qa_pipeline(table=df, query=user_question)

    # Display the answer
    st.write(response['answer'])
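TAPAS models accept only a limited number of input tokens, so a large CSV may be truncated or rejected by the pipeline. One possible workaround, sketched here with the standard library only, is to cap the number of rows before asking the question. `csv_to_table` and the default of 50 rows are assumptions for illustration, not part of the original app. The resulting dictionary of string columns can be turned into a DataFrame with `pd.DataFrame(table)`, and the pipeline’s `table` argument is, to our knowledge, documented as accepting a dictionary as well.

```python
import csv
import io


def csv_to_table(csv_text, max_rows=50):
    """Turn CSV text into a {column: [cell strings]} dict, keeping at most max_rows rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    table = {name: [] for name in columns}
    for row_index, row in enumerate(reader):
        if row_index >= max_rows:
            break
        for name in columns:
            # Missing cells become empty strings so that every cell is a string.
            table[name].append(str(row.get(name) or ""))
    return table
```

Keeping only the first rows is the bluntest option; filtering rows by relevance to the question would preserve more useful data but needs extra logic.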
The text extracted from the different files is combined and split into manageable chunks of roughly 1,000 characters, with a 200-character overlap so that context spanning a chunk boundary is not lost entirely. These chunks form the knowledge base the chatbot searches when answering questions.
# Combine text from different files
def combine_text(text_list):
    return "\n".join(text_list)

# Split text into chunks
def get_text_chunks(text):
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    return chunks
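To see what `chunk_size` and `chunk_overlap` mean in practice, here is a simplified sliding-window splitter in plain Python. It is not how `CharacterTextSplitter` works internally: the LangChain splitter first splits on the separator and then merges pieces, so its chunk boundaries fall on newlines. `sliding_chunks` is a hypothetical name used only for this illustration.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-width windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

With `chunk_size=4` and `chunk_overlap=2`, the string `"abcdefghij"` becomes `["abcd", "cdef", "efgh", "ghij"]`: each chunk repeats the last two characters of the previous one.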
Creating vector store
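Each chunk is converted into an embedding with a Hugging Face Instructor model (hkunlp/instructor-xl) and indexed in a FAISS vector store through LangChain, so the chatbot can look up the chunks most similar to a question.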
def get_vectorstore(text_chunks):
    # embeddings = OpenAIEmbeddings()
    embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")
    vectorstore = FAISS.from_texts(texts=text_chunks, embedding=embeddings)
    return vectorstore
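The idea behind a vector store can be shown with a toy example: turn every chunk into a vector, turn the question into a vector, and return the chunks whose vectors are closest. The sketch below uses word counts as a stand-in for a real embedding model and cosine similarity as the closeness measure. Both are simplifying assumptions; FAISS works on dense vectors with optimized indexes and its own distance metric. The names `embed`, `cosine`, and `top_k` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query, best match first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vec, embed(chunk)), reverse=True)
    return ranked[:k]
```

This is roughly the job `vectorstore.as_retriever()` performs for the conversation chain: given a question, hand back the most relevant chunks.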
To enable our chatbot to provide meaningful responses, we need a conversational AI model. In this project, we use a model from Hugging Face’s model hub. Here’s how we set up the conversational AI model:
def get_conversation_chain(vectorstore):
    # llm = ChatOpenAI()
    llm = HuggingFaceHub(
        repo_id="google/flan-t5-xxl",
        model_kwargs={"temperature": 0.5,
                      "max_length": 512})

    memory = ConversationBufferMemory(
        memory_key='chat_history',
        return_messages=True)
    conversation_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        memory=memory
    )
    return conversation_chain
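Conceptually, a conversational retrieval chain does three things on every turn: it rewrites a follow-up question into a standalone one using the chat history, retrieves relevant chunks for that question, and asks the language model to answer from them. The sketch below imitates that flow with toy stand-in functions so it runs without any model. It is a simplification for illustration, not LangChain’s implementation, and every name in it is hypothetical.

```python
def ask(question, history, condense, retrieve, generate):
    """One conversational turn: condense, retrieve, generate, then remember."""
    # With no history yet, the question is already standalone.
    standalone = condense(question, history) if history else question
    context = retrieve(standalone)
    answer = generate(standalone, context)
    history.append((question, answer))
    return answer


# Toy stand-ins so the flow can be run without any model.
def toy_condense(question, history):
    last_question, _ = history[-1]
    return f"{question} (follow-up to: {last_question})"


def toy_retrieve(question):
    return ["chunk about " + question]


def toy_generate(question, context):
    return f"Answer to '{question}' using {len(context)} chunk(s)"
```

Calling `ask` twice with the same `history` list shows why memory matters: the second question is expanded with the first before retrieval, which is how a follow-up like "Who made it?" can still find the right chunks.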
Users can ask questions related to the documents they’ve uploaded. The chatbot uses its knowledge base and NLP models to provide relevant answers in real-time. Here’s how we handle user input:
def handle_userinput(user_question):
    if st.session_state.conversation is not None:
        response = st.session_state.conversation({'question': user_question})
        st.session_state.chat_history = response['chat_history']

        for i, message in enumerate(st.session_state.chat_history):
            if i % 2 == 0:
                st.write(user_template.replace(
                    "{{MSG}}", message.content), unsafe_allow_html=True)
            else:
                st.write(bot_template.replace(
                    "{{MSG}}", message.content), unsafe_allow_html=True)
    else:
        # Handle the case when conversation is not initialized
        st.write("Please upload and process your documents first.")
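Because the messages are written with `unsafe_allow_html=True`, any `<` or `&` in a question or answer is interpreted as HTML. A small precaution is to escape the message text before substituting it into the template. The sketch below shows the idea in isolation; the two templates are hypothetical stand-ins for `user_template` and `bot_template`, whose real contents live in `htmlTemplates` and are not shown in this article.

```python
import html

# Hypothetical stand-ins for user_template and bot_template from htmlTemplates.
USER_TEMPLATE = '<div class="chat-message user">{{MSG}}</div>'
BOT_TEMPLATE = '<div class="chat-message bot">{{MSG}}</div>'


def render_history(messages):
    """Return one HTML snippet per message, alternating user and bot templates."""
    rendered = []
    for index, content in enumerate(messages):
        # Even positions are user messages, odd positions are bot replies.
        template = USER_TEMPLATE if index % 2 == 0 else BOT_TEMPLATE
        # Escape the message so characters like < and & show up as text.
        rendered.append(template.replace("{{MSG}}", html.escape(content)))
    return rendered
```

In the app, the same effect could be had by wrapping `message.content` in `html.escape(...)` inside `handle_userinput`.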
We’ve deployed the chatbot using Streamlit, a fantastic Python library for creating web applications with minimal effort. Users can upload their documents and ask questions. The chatbot will generate responses based on the content of the documents. Here’s how we set up the Streamlit app:
def main():
    load_dotenv()
    st.set_page_config(
        page_title="File Chatbot",
        page_icon=":books:",
        layout="wide"
    )
    st.write(css, unsafe_allow_html=True)

    if "conversation" not in st.session_state:
        st.session_state.conversation = None
    if "chat_history" not in st.session_state:
        st.session_state.chat_history = None

    st.header("Chat with your multiple files:")
    user_question = st.text_input("Ask a question about your documents:")

    # Initialize variables to hold uploaded files
    csv_file = None
    other_files = []

    with st.sidebar:
        st.subheader("Your documents")
        files = st.file_uploader(
            "Upload your files here and click on 'Process'", accept_multiple_files=True)

        for file in files:
            if file.name.lower().endswith('.csv'):
                csv_file = file  # Store the CSV file
            else:
                other_files.append(file)  # Store other file types

        # Initialize empty lists for each file type
        pdf_texts = []
        word_texts = []
        txt_texts = []

        if st.button("Process"):
            with st.spinner("Processing"):
                for file in other_files:
                    if file.name.lower().endswith('.pdf'):
                        pdf_texts.append(get_pdf_text(file))
                    elif file.name.lower().endswith('.docx'):
                        word_texts.append(get_word_text(file))
                    elif file.name.lower().endswith('.txt'):
                        txt_texts.append(read_text_file(file))

                # Combine text from different file types
                combined_text = combine_text(pdf_texts + word_texts + txt_texts)

                # Split the combined text into chunks
                text_chunks = get_text_chunks(combined_text)

                # Create vector store and conversation chain if non-CSV documents are uploaded
                if len(other_files) > 0:
                    vectorstore = get_vectorstore(text_chunks)
                    st.session_state.conversation = get_conversation_chain(vectorstore)
                else:
                    vectorstore = None  # No need for vectorstore with CSV file

    # Handle user input for CSV file separately
    if csv_file is not None and user_question:
        handle_csv_file(csv_file, user_question)

    # Handle user input for text-based files
    if user_question:
        handle_userinput(user_question)


if __name__ == '__main__':
    main()
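Note that `main()` treats every non-CSV upload as a text document, so a file with an unsupported extension is skipped during extraction but still counts toward `len(other_files) > 0`, which may lead to building a vector store from no text at all. A possible refinement, sketched here under the assumption that only the four formats covered in this article are supported, is to sort file names into explicit buckets first. `route_files` and `TEXT_EXTENSIONS` are hypothetical names.

```python
from pathlib import Path

# Extensions the text pipeline in this article knows how to read.
TEXT_EXTENSIONS = {".pdf", ".docx", ".txt"}


def route_files(filenames):
    """Group file names into 'csv', 'text', and 'unsupported' buckets by extension."""
    buckets = {"csv": [], "text": [], "unsupported": []}
    for name in filenames:
        suffix = Path(name).suffix.lower()
        if suffix == ".csv":
            buckets["csv"].append(name)
        elif suffix in TEXT_EXTENSIONS:
            buckets["text"].append(name)
        else:
            buckets["unsupported"].append(name)
    return buckets
```

The app could then warn about anything in the `unsupported` bucket and build the vector store only when the `text` bucket is non-empty.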
As we embark on our Multi-File Chatbot project, it’s crucial to consider scalability and potential avenues for future enhancements. The future holds exciting possibilities with advancements in Generative AI and NLP technologies. Here are key aspects to keep in mind as you plan for the growth and evolution of your chatbot:
In this blog post, we’ve explored the development of a Multi-File Chatbot using Streamlit and Natural Language Processing (NLP) techniques. This project showcases how to extract text from various types of documents, process user questions, and provide relevant answers using a conversational AI model. With this chatbot, users can effortlessly interact with their documents and gain valuable insights. You can further enhance this project by integrating more document types and improving the conversational AI model. Building such applications empowers users to make better use of their data and simplifies information retrieval from diverse sources. Start building your own Multi-File Chatbot and unlock the potential of your documents today!
A. The accuracy of the chatbot’s responses may vary based on factors such as the quality of the training data and the complexity of the user’s queries. Continuous improvement and fine-tuning of the chatbot’s models can enhance accuracy over time.
A. The blog mentions the use of pre-trained models from Hugging Face’s model hub and OpenAI for certain NLP tasks. Depending on your project’s requirements, you can explore existing pre-trained models or train custom models.
A. Many Multi-File Chatbots are designed to maintain context during conversations. They can remember and understand the context of ongoing interactions, allowing for more natural and coherent responses to follow-up questions or queries related to previous discussions.
A. While Multi-File Chatbots are versatile, their ability to handle specific file formats may depend on the availability of libraries and tools for text extraction and processing. In this blog, we work with PDF, TXT, DOCX, and CSV files. Support for other file formats can be added based on user needs.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.