In the ever-evolving landscape of machine learning and artificial intelligence, the development of language model applications, particularly Retrieval Augmented Generation (RAG) systems, is becoming increasingly sophisticated. However, the real challenge surfaces not during the initial creation but in the ongoing maintenance and enhancement of these applications. This is where RAGAS—an evaluation library dedicated to providing metrics for RAG pipelines—comes into play. This article will explore the RAGAS library and teach you how to use it to evaluate RAG pipelines.
The inception of RAGAS is rooted in the vision of driving the continuous improvement of Large Language Models (LLMs) and RAG applications through the adoption of Metrics-Driven Development (MDD). MDD is not merely a buzzword but a strategic approach to product development that leverages quantifiable data to guide decision-making.
By consistently tracking key metrics over time, developers and researchers can gain profound insights into the performance of their applications, thereby steering their projects toward excellence. RAGAS aims to enshrine this data-centric methodology as the open-source standard for LLM and RAG applications, ensuring that evaluation and monitoring become integral parts of the development lifecycle.
Evaluation metrics are an important part of RAG because they enable the systematic assessment of LLM applications. They foster an environment where experiments can be conducted with a high degree of reliability and reproducibility. In doing so, they provide a framework for objectively measuring the efficacy of the various components within a RAG pipeline.
Furthermore, the aspect of monitoring offers a treasure trove of actionable insights gleaned from production data, empowering developers to refine and elevate the quality of their LLM applications continuously. Thus, RAGAS stands as a beacon for those committed to excellence in the development and sustenance of RAG systems, championing the cause of MDD to navigate the complex waters of AI application enhancement with precision and insight.
In this section, we will demonstrate how the RAGAS evaluation library works by applying it to an existing RAG pipeline. We will not be building a RAG pipeline from scratch, so having an existing pipeline ready to generate responses for queries is a prerequisite. We will be using the COQA-QUAC Dataset from Kaggle. This dataset contains various questions, contexts, and their responses, which will serve as the data for the RAG pipeline. We will manually generate responses for a few queries and use reference/ground-truth responses to compute RAGAS scores.
RAGAS offers component-level evaluation scores such as Faithfulness, Answer Relevancy, Context Relevancy, and Context Recall.
Additionally, RAGAS offers two metrics for evaluating the end-to-end performance of a RAG pipeline.
In this article, we will only focus on evaluating the RAG pipeline using Faithfulness, Answer Relevancy, Context Relevancy, and Context Recall metrics. The only requirement here is that the input for evaluation must be a dictionary containing the query, response, and source documents. Now that we have discussed the objectives and requirements, let’s jump straight into using RAGAS.
First, let’s install all the necessary packages for RAGAS to work. Below is the list of all the necessary packages with their specific versions for installation:
langchain==0.1.13
openai
ragas==0.0.22
NOTE: Avoid using the latest version of RAGAS, as it no longer includes the Langchain integration used in this article. Now that we have our environment set up, let's start using RAGAS to evaluate generated responses.
First, we will generate a response using the RAG pipeline. The output from the RAG pipeline must be a dictionary with ‘query’, ‘result’, and ‘source_documents’ keys. We can achieve this by setting the return_source_documents parameter to True in the RetrievalQA chain from Langchain, as in the sketch below.
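For reference, here is a minimal sketch of such a chain. The CSV loader, FAISS vector store, embedding model, and chat model below are illustrative assumptions rather than the article's exact setup, and FAISS additionally requires the faiss-cpu package.

# pip install langchain==0.1.13 openai ragas==0.0.22 faiss-cpu
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Load the QA data and build a simple vector index over it.
docs = CSVLoader(file_path="data/dummy-rag.csv").load()
vectorstore = FAISS.from_documents(docs, OpenAIEmbeddings())

# return_source_documents=True makes the chain return the 'source_documents'
# key alongside 'query' and 'result', which RAGAS needs for evaluation.
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    retriever=vectorstore.as_retriever(),
    return_source_documents=True,
)

response = qa_chain({"query": "Where are Malayalis found in India?"})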
This is the format that the RAGAS evaluator chains accept. Below is an example of what the response variable should look like:
{'query': 'Where are Malayalis found in India?',
 'result': "Malayalis are found in various ...",
 'source_documents': [
     Document(
         page_content=': 0\nquestion: Where is Malayali located?',
         metadata={'source': 'data/dummy-rag.csv', 'row': 0}
     ),
     ...
 ]
}
Notice that the source documents are a list of Document objects containing the source references. This dictionary itself will be passed to the RAGAS evaluator chains to calculate each score. We will generate responses for 2-3 queries, each as a Python dictionary in the above-mentioned format, and store them in a list called responses, which will be used later.
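Assuming the qa_chain sketched earlier, collecting these responses could look like the following (the queries here are just placeholders):

# Placeholder queries; in practice these would come from the COQA-QUAC data.
queries = [
    "Where are Malayalis found in India?",
    "What language do Malayalis speak?",
]

# Each call returns a dict with 'query', 'result', and 'source_documents'.
responses = [qa_chain({"query": q}) for q in queries]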
Next, we will create evaluation chains using RAGAS Evaluator. We will use the faithfulness, answer relevancy, context relevancy, and context recall chains. First, we need to import a few necessary packages from RAGAS.
from ragas.langchain.evalchain import RagasEvaluatorChain
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
)
We use RagasEvaluatorChain to create the evaluation metrics. It wraps a RAGAS metric in a Langchain-compatible chain, which we can then call on a response dictionary to generate an evaluation score.
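For example, a single metric can be wrapped and used on its own; this small illustrative snippet assumes the responses list built earlier:

# Wrap the faithfulness metric in a Langchain-compatible evaluator chain.
faithfulness_chain = RagasEvaluatorChain(metric=faithfulness)

# Calling the chain on a response dict returns a '<metric_name>_score' key.
print(faithfulness_chain(responses[0])["faithfulness_score"])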
Next, we will create 4 different metrics using RagasEvaluatorChain.
eval_chains = {
    m.name: RagasEvaluatorChain(metric=m)
    for m in [faithfulness, answer_relevancy, context_relevancy, context_recall]
}
This code creates a dictionary with 4 different evaluator chains: faithfulness, answer relevancy, context relevancy, and context recall.
Now we will loop over the generated response dictionaries and evaluate them. Assuming the responses are present in a list called ‘responses’, we will loop over it and take each response dictionary containing the following keys: query, result, and source_documents.
for response in responses:
    for name, eval_chain in eval_chains.items():
        score_name = f"{name}_score"
        print(f"{score_name}: {eval_chain(response)[score_name]}")
The above code snippet loops over each response dictionary, while the inner loop applies each evaluation metric to generate its score. Below is an example output for the above code:
faithfulness_score: 1.0
answer_relevancy_score: 0.7461039226035786
context_relevancy_score: 0.0
context_recall_score: 1.0
Above are the scores for a single query response. However, we can automate the process to generate scores for more query responses. Below is the overall code for all the steps:
from ragas.langchain.evalchain import RagasEvaluatorChain
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
)

eval_chains = {
    m.name: RagasEvaluatorChain(metric=m)
    for m in [faithfulness, answer_relevancy, context_relevancy, context_recall]
}

for response in responses:
    for name, eval_chain in eval_chains.items():
        score_name = f"{name}_score"
        print(f"{score_name}: {eval_chain(response)[score_name]}")
RAGAS emerges as a pivotal tool in language model applications, particularly within the scope of RAG systems. By integrating MDD into the core of RAG pipelines, RAGAS provides a structured methodology to evaluate and enhance the performance of such systems. The comprehensive set of evaluation metrics includes Faithfulness, Answer Relevancy, Context Recall, and Context Relevancy. These facilitate a thorough analysis of the responses generated by the RAG pipeline, ensuring their alignment with the context and ground truth.
The practical demonstration of RAGAS on a pre-existing RAG pipeline using the COQA-QUAC Dataset illustrates the library’s capacity to offer quantifiable insights and actionable feedback for developers. The process involves setting up the environment, generating responses, and employing RAGAS evaluator chains to compute the various scores. This hands-on example underscores the accessibility and utility of RAGAS in the continuous refinement of LLM applications, thereby bolstering their reliability and efficiency. RAGAS stands as an open-source standard and an essential tool for developers and researchers committed to delivering reliable and responsible AI and ML applications.