A Beginner’s Guide to Evaluating RAG Pipelines Using RAGAS

Subhadeep Mandal 07 May, 2024 • 7 min read

Introduction

In the ever-evolving landscape of machine learning and artificial intelligence, the development of language model applications, particularly Retrieval Augmented Generation (RAG) systems, is becoming increasingly sophisticated. However, the real challenge surfaces not during the initial creation but in the ongoing maintenance and enhancement of these applications. This is where RAGAS—an evaluation library dedicated to providing metrics for RAG pipelines—comes into play. This article will explore the RAGAS library and teach you how to use it to evaluate RAG pipelines.


Learning Objectives

  • Understand the inception and evolution of the RAGAS evaluation library.
  • Gain knowledge of RAG evaluation scores.
  • Learn to evaluate RAG systems using the RAGAS evaluation library.

This article was published as a part of the Data Science Blogathon.

What is RAGAS?

The inception of RAGAS is rooted in the vision of enabling the continuous improvement of Large Language Models (LLMs) and RAG applications through the adoption of Metrics-Driven Development (MDD). MDD is not merely a buzzword but a strategic approach to product development that leverages quantifiable data to guide decision-making.

By consistently tracking key metrics over time, developers and researchers can gain profound insights into the performance of their applications, thereby steering their projects toward excellence. RAGAS aims to enshrine this data-centric methodology as the open-source standard for LLM and RAG applications, ensuring that evaluation and monitoring become integral parts of the development lifecycle.

Evaluation metrics are an important part of RAG because they enable the systematic assessment of LLM applications. They foster an environment where experiments can be conducted with a high degree of reliability and reproducibility. In doing so, they provide a framework for objectively measuring the efficacy of the various components within a RAG pipeline.

Furthermore, the aspect of monitoring offers a treasure trove of actionable insights gleaned from production data, empowering developers to refine and elevate the quality of their LLM applications continuously. Thus, RAGAS stands as a beacon for those committed to excellence in the development and sustenance of RAG systems, championing the cause of MDD to navigate the complex waters of AI application enhancement with precision and insight.

Implementing RAGAS and Generating Evaluation Scores

In this section, we will demonstrate how the RAGAS evaluation library works by implementing it on an existing RAG pipeline. We will not be building a RAG pipeline from scratch, so having one ready to generate responses for queries is a prerequisite. We will be using the COQA-QUAC Dataset from Kaggle. This dataset contains various questions, contexts, and their corresponding responses, which will serve as the data for the RAG pipeline. We will manually generate responses for a few queries and use the reference/ground-truth responses to compute RAGAS scores.

RAGAS Evaluation Scores

RAGAS offers the following evaluation scores:

  • Faithfulness: This measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context. The score is scaled to the (0, 1) range, and higher is better (see the illustrative formulas after this list).
  • Answer Relevancy: This metric assesses how pertinent the generated answer is to the given prompt. Lower scores are assigned to answers that are incomplete or contain redundant information, while higher scores indicate better relevancy. It is computed using the question, the context, and the answer.
  • Context Recall: Context recall measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context, and the values range between 0 and 1, with higher values indicating better performance.
  • Context Precision: Context Precision evaluates whether all of the ground-truth-relevant items present in the contexts are ranked appropriately high. Ideally, all the relevant chunks should appear at the top ranks. This metric is computed using the question, ground_truth, and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.
  • Context Relevancy: This metric gauges the relevancy of the retrieved context, calculated based on both the question and contexts. The values fall within the range of (0, 1), with higher values indicating better relevancy.
  • Context Entity Recall: This metric measures the recall of the retrieved context based on the number of entities present in both the ground_truths and the contexts, relative to the number of entities present in the ground_truths alone.
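To make two of these definitions concrete, the RAGAS documentation describes faithfulness and context entity recall roughly as the following ratios (the extraction of claims and entities is handled internally by the library via LLM prompts):

Faithfulness = (number of claims in the answer that can be inferred from the retrieved context) / (total number of claims in the answer)

Context Entity Recall = (number of entities common to the contexts and the ground truth) / (number of entities in the ground truth)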

End-to-End Evaluation Metrics

Additionally, RAGAS offers two metrics for evaluating the end-to-end performance of a RAG pipeline.

  • Answer Semantic Similarity: This metric assesses the semantic resemblance between the generated answer and the ground truth. It is computed from the ground truth and the answer, with values falling within the range of 0 to 1.
  • Answer Correctness: This metric gauges the accuracy of the generated answer when compared to the ground truth. It relies on the ground truth and the answer, with scores ranging from 0 to 1.

In this article, we will only focus on evaluating the RAG pipeline using Faithfulness, Answer Relevancy, Context Relevancy, and Context Recall metrics. The only requirement here is that the input for evaluation must be a dictionary containing the query, response, and source documents. Now that we have discussed the objectives and requirements, let’s jump straight into using RAGAS.

Hands-on RAG Evaluation Using RAGAS

First, let’s install the packages required for RAGAS to work. Below is the list of necessary packages with the specific versions used in this walkthrough:

langchain==0.1.13
openai
ragas==0.0.22

NOTE: Avoid using the latest version of RAGAS, as it no longer includes the Langchain evaluator-chain integration used in this article. Now that we have our environment set up, let’s start using RAGAS to evaluate generated responses.


Step 1: Generate RAG Pipeline Output

First, we will generate a response using the RAG pipeline. The output from the RAG pipeline must be a dictionary with ‘query’, ‘result’, and ‘source_documents’ keys. We can achieve this simply by setting the return_source_documents parameter to True in the RetrievalQA chain from Langchain, as sketched below.

"

This is the format that the RAGAS evaluator accepts. Below is an example of what the response variable should look like:

{
    'query': 'Where are Malayalis found in India?',
    'result': "Malayalis are found in various ...",
    'source_documents': [
        Document(
            page_content=': 0\nquestion: Where is Malayali located?',
            metadata={'source': 'data/dummy-rag.csv', 'row': 0}
        ),
        ...
    ]
}

Notice that the source documents are a list of Document objects containing the source references. This dictionary itself will be passed to the RAGAS evaluator to calculate each score. We will generate responses for 2-3 queries in the above format and store them in a responses list, which will be used later.
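Gathering these responses could look something like the following sketch; the queries here are illustrative, and qa_chain is the RetrievalQA chain described above:

# Illustrative queries; replace with your own questions from the dataset.
queries = [
    "Where are Malayalis found in India?",
    "Where is Malayali located?",
]

# Each call returns a dictionary with 'query', 'result', and 'source_documents'.
responses = [qa_chain({"query": q}) for q in queries]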

Step 2: Create Evaluation Chains

Next, we will create evaluation chains using RAGAS Evaluator. We will use the faithfulness, answer relevancy, context relevancy, and context recall chains. First, we need to import a few necessary packages from RAGAS.

from ragas.langchain.evalchain import RagasEvaluatorChain
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
)

We use RagasEvaluatorChain to build the evaluation chains. It takes a metric, initializes it, and returns a chain that we can call to generate evaluation scores.
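For instance, a single metric can be wrapped on its own like this (the variable name is arbitrary):

faithfulness_chain = RagasEvaluatorChain(metric=faithfulness)
# Calling faithfulness_chain(response) returns a dict that includes 'faithfulness_score'.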

Step 3: Create Evaluation Metrics

Next, we will create 4 different evaluator chains using RagasEvaluatorChain, one per metric.

eval_chains = {
    m.name: RagasEvaluatorChain(metric=m)
    for m in [faithfulness, answer_relevancy, context_relevancy, context_recall]
}

This code creates a dictionary with 4 different evaluator chains: faithfulness, answer relevancy, context relevancy, and context recall.

Step 4: Evaluate the RAG Pipeline

Now we will loop over the generated response dictionaries and evaluate them. Assuming the responses are present in a list called ‘responses’, we will loop over it and take each response dictionary containing the following key-value pairs: query, result, and source_documents.

for response in responses:
  for name, eval_chain in eval_chains.items():
    score_name = f"{name}_score"
    print(f"{score_name}: {eval_chain(response)[score_name]}")

The above code snippet loops over each response dictionary, while the inner loop iterates over each evaluation chain to generate its score. Below is an example output for the above code:

faithfulness_score: 1.0
answer_relevancy_score: 0.7461039226035786
context_relevancy_score: 0.0
context_recall_score: 1.0

Above are the scores for a single query response; the same loops handle any number of query responses. Below is the combined code for the evaluation steps:

from ragas.langchain.evalchain import RagasEvaluatorChain
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
)

eval_chains = {
    m.name: RagasEvaluatorChain(metric=m)
    for m in [faithfulness, answer_relevancy, context_relevancy, context_recall]
}

for response in responses:
  for name, eval_chain in eval_chains.items():
    score_name = f"{name}_score"
    print(f"{score_name}: {eval_chain(response)[score_name]}")
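If you would rather collect the scores in one place than print them, one option is to gather them into a DataFrame (a sketch, assuming the responses list from Step 1 and that pandas is available in your environment):

import pandas as pd

all_scores = []
for response in responses:
    row = {"query": response["query"]}
    for name, eval_chain in eval_chains.items():
        score_name = f"{name}_score"
        row[score_name] = eval_chain(response)[score_name]
    all_scores.append(row)

# One row per query, one column per metric score.
scores_df = pd.DataFrame(all_scores)
print(scores_df)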

Conclusion

RAGAS emerges as a pivotal tool in language model applications, particularly within the scope of RAG systems. By integrating MDD into the core of RAG pipelines, RAGAS provides a structured methodology to evaluate and enhance the performance of such systems. The comprehensive set of evaluation metrics includes Faithfulness, Answer Relevancy, Context Recall, and Context Relevancy. These facilitate a thorough analysis of the responses generated by the RAG pipeline, ensuring their alignment with the context and ground truth.

The practical demonstration of RAGAS on a pre-existing RAG pipeline using the COQA-QUAC Dataset illustrates the library’s capacity to offer quantifiable insights and actionable feedback for developers. The process involves setting up the environment, generating responses, and employing RAGAS evaluator chains to compute the various scores. This hands-on example underscores the accessibility and utility of RAGAS in the continuous refinement of LLM applications, thereby bolstering their reliability and efficiency. RAGAS stands as an open-source standard and an essential tool for developers and researchers to ensure the delivery of responsible AI and ML applications.

Key Takeaways

  • The RAGAS evaluation library anchors the principles of MDD within the workflow of LLMs and RAG system development.
  • The process of evaluating generated responses using RAGAS involves generating responses in the required dictionary format and creating and utilizing evaluator chains for computing scores.
  • By leveraging RAGAS, developers and researchers can gain objective insights into the performance of their RAG applications. This allows them to develop precise and informed enhancements.

The media shown in this Blogathon article are not owned by Analytics Vidhya and are used at the Author’s discretion.
