Retrieval-Augmented Generation, or RAG, marks an important step forward for natural language processing. It helps large language models (LLMs) perform better by letting them check information outside their training data before creating a response. This means LLMs can work well with specific company knowledge or new information without costly retraining. Rerankers for RAG play a crucial role in refining retrieved information, ensuring the most relevant context is provided. RAG blends information retrieval with text generation, resulting in accurate, relevant answers that sound natural.
The first step in RAG involves finding documents related to a user’s query. Systems often use methods like keyword search or vector similarity. These methods are good starting points, but they can return many documents that aren’t all equally useful. The embedding models used might not grasp the fine details needed to pick the most relevant information.
Vector search, which looks for similar meanings, can struggle with short queries or specialized terms. Also, LLMs have limits on how much context they can handle well. Feeding them too many documents, even slightly relevant ones, can confuse the model and lower the quality of the final answer. This initial “noisy” retrieval can dilute the LLM’s focus. We need a way to refine this first batch of information.
The image above depicts the retrieval and generation steps of RAG: the user asks a question, the system searches the vector store for results related to that question, and the retrieved content is passed to the LLM along with the question so the LLM can produce a structured answer.
This is where rerankers become essential. Reranking improves the precision of search results. Rerankers use smart algorithms to look at the initially retrieved documents and reorder them based on how well they match the user’s specific question and intent.
In RAG, rerankers act as a quality filter. They examine the first set of results and prioritize the documents that offer the best information for the query. The goal is to lift the most relevant pieces to the top. Think of a reranker as a specialist that double-checks the initial search, using a deeper understanding of language to find the best fit between the documents and the question.
This image illustrates a two-stage search process. Reranking is the second stage, where an initial set of search results, based on semantic or keyword matching, is refined to significantly improve the relevance and ordering of the final results, delivering a more accurate and useful outcome for the user’s query.
Rerankers boost the accuracy of the context given to the LLM. They analyze the meaning and relationship between the user’s question and each retrieved document, going beyond simple keyword matching. This deeper understanding helps identify the most useful information.
By focusing the LLM on a smaller, better set of documents, rerankers lead to more precise answers. The LLM gets high-quality context, allowing it to form more informed and direct responses. Rerankers calculate a score showing how semantically close a document is to a query, allowing for a better final ordering. They can find relevant information even without exact keyword matches.
This focus on quality context helps reduce LLM “hallucinations”—instances where the model generates incorrect but plausible information. Grounding the LLM in documents verified by a reranker makes the final output more trustworthy.
The standard RAG process involves retrieval then generation. An enhanced RAG pipeline adds a reranking step in the middle.
This two-stage method lets the initial retrieval cast a wide net (recall), while the reranker focuses on picking the best items from that net (precision). This division improves the overall process and gives the LLM the best possible input.
A query is used to search a vector database, retrieving the top 25 most relevant documents. These documents are then passed to a “Reranker” module. The reranker refines the results, selecting the top 3 most relevant documents for the final output.
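To make this division of labor concrete, here is a minimal sketch of the retrieve-then-rerank flow using the sentence-transformers library (an illustrative choice, not part of the pipelines shown later). The model names and the tiny in-memory corpus are placeholders; a real system would query a vector database in stage one.

# Sketch: stage 1 casts a wide net with a bi-encoder, stage 2 reranks with a cross-encoder.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

corpus = [
    "My plan to fight inflation will lower your costs and lower the deficit.",
    "The Justice Department required body cameras for its officers.",
    "Let's provide investments and tax credits to double America's clean energy production.",
    "Last year there weren't enough semiconductors to make all the cars people wanted to buy.",
]
query = "What is the plan for the economy?"

# Stage 1 (recall): embed the corpus and pull back a generous candidate set
embedder = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = embedder.encode(corpus, convert_to_tensor=True)
query_emb = embedder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=4)[0]
candidates = [corpus[h["corpus_id"]] for h in hits]

# Stage 2 (precision): a cross-encoder scores each (query, document) pair jointly; keep the best 3
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])
top_docs = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)[:3]]
print(top_docs)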
Let us look into the top reranking models in 2025.
Several reranking models are popular choices for RAG pipelines:
| Reranker | Model Type | Source | Strength | Weakness | Best For |
| --- | --- | --- | --- | --- | --- |
| Cohere | Cross-encoder (API) | Private | High accuracy, multilingual, ease of use, speed (Nimble) | Cost (API fees), closed-source | General RAG, enterprise, multilingual, ease of use |
| bge-reranker | Cross-encoder | Open-source | High accuracy, open-source, runs on moderate hardware | Requires self-hosting | General RAG, open-source preference, budget-conscious |
| Voyage | Cross-encoder (API) | Private | Top-tier relevance/accuracy | Cost (API fees), potentially higher latency (top model) | Maximum-accuracy needs (finance, legal), relevance-critical apps |
| Jina | Cross-encoder / ColBERT variant | Mixed | Balanced performance, cost-effective, long docs (Jina-ColBERT) | May not reach peak accuracy | General RAG, long documents, balanced cost/performance |
| FlashRank | Lightweight cross-encoder | Open-source | Very fast, low resource use, easy integration | Accuracy lower than large models | Speed-critical apps, resource-constrained environments |
| ColBERT | Multi-vector (late interaction) | Open-source | Efficient at scale (large collections), fast retrieval | Indexing is compute/storage intensive | Very large document sets, efficiency at scale |
| MixedBread (mxbai-rerank-v2) | Cross-encoder | Open-source | SOTA performance (claimed), fast inference, multilingual, long context, versatile | Requires self-hosting, relatively new | High-performance RAG, multilingual, long docs/code/JSON, open-source preference |
Cohere Rerank uses a sophisticated neural network, likely based on the transformer architecture, acting as a cross-encoder. It processes the query and document together to precisely judge relevance. It is a proprietary model accessed via an API.
First install the Cohere library.
%pip install --upgrade --quiet cohere
Set up the Cohere LLM, the CohereRerank compressor, and the ContextualCompressionRetriever.
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_community.llms import Cohere
from langchain.chains import RetrievalQA
# Assumes `retriever` (e.g. a FAISS retriever over your documents) has already been created
llm = Cohere(temperature=0)
compressor = CohereRerank(model="rerank-english-v3.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
chain = RetrievalQA.from_chain_type(
    llm=llm, retriever=compression_retriever
)
chain.invoke({"query": "What did the president say about Ketanji Brown Jackson"})
Output:
{'query': 'What did the president say about Ketanji Brown Jackson', 'result': " The president speaks highly of Ketanji Brown Jackson, stating that she is one of the nation's top legal minds, and will continue the legacy of excellence of Justice Breyer. The president also mentions that he worked with her family and that she comes from a family of public school educators and police officers. Since her nomination, she has received support from various groups, including the Fraternal Order of Police and judges from both major political parties. \n\nWould you like me to extract another sentence from the provided text? "}
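The reranking step can also be called directly through Cohere's Python SDK, outside of LangChain. Below is a short sketch assuming the current Cohere Python client; the documents are made up for illustration.

import os
import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])
docs = [
    "The president nominated Judge Ketanji Brown Jackson to the Supreme Court.",
    "The Justice Department required body cameras for its officers.",
    "Inflation is driven in part by supply chain disruptions.",
]
# Ask the reranker to score the documents against the query and keep the top 2
response = co.rerank(
    model="rerank-english-v3.0",
    query="What did the president say about Ketanji Brown Jackson",
    documents=docs,
    top_n=2,
)
for r in response.results:
    print(r.index, round(r.relevance_score, 3), docs[r.index])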
The bge-reranker models come from the Beijing Academy of Artificial Intelligence (BAAI) and are open-source (Apache 2.0 license). They are transformer-based cross-encoders designed specifically for reranking tasks, and they are available in different sizes, such as Base and Large.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
# Assumes `retriever` is already set up and `pretty_print_docs` is a small helper that prints each document
model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
compressor = CrossEncoderReranker(model=model, top_n=3)
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor, base_retriever=retriever
)
compressed_docs = compression_retriever.invoke("What is the plan for the economy?")
pretty_print_docs(compressed_docs)
Output:
Document 1:
More infrastructure and innovation in America.
More goods moving faster and cheaper in America.
More jobs where you can earn a good living in America.
And instead of relying on foreign supply chains, let’s make it in America.
Economists call it “increasing the productive capacity of our economy.”
I call it building a better America.
My plan to fight inflation will lower your costs and lower the deficit.
----------------------------------------------------------------------------------------------------
Document 2:
Second – cut energy costs for families an average of $500 a year by combatting
climate change.
Let’s provide investments and tax credits to weatherize your homes and businesses to
be energy efficient and you get a tax credit; double America’s clean energy
production in solar, wind, and so much more; lower the price of electric vehicles,
saving you another $80 a month because you’ll never have to pay at the gas pump
again.
----------------------------------------------------------------------------------------------------
Document 3:
Look at cars.
Last year, there weren’t enough semiconductors to make all the cars that people
wanted to buy.
And guess what, prices of automobiles went up.
So—we have a choice.
One way to fight inflation is to drive down wages and make Americans poorer.
I have a better plan to fight inflation.
Lower your costs, not your wages.
Make more cars and semiconductors in America.
More infrastructure and innovation in America.
More goods moving faster and cheaper in America.
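Outside of LangChain, the same open-source model can also be scored directly with the sentence-transformers CrossEncoder class. This is a quick sketch with made-up query-passage pairs; higher scores indicate higher relevance.

from sentence_transformers import CrossEncoder

# Load the open-source bge reranker directly and score (query, passage) pairs
model = CrossEncoder("BAAI/bge-reranker-base")
pairs = [
    ("What is the plan for the economy?",
     "My plan to fight inflation will lower your costs and lower the deficit."),
    ("What is the plan for the economy?",
     "The Justice Department required body cameras for its officers."),
]
print(model.predict(pairs))  # the first pair should score noticeably higher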
Voyage AI provides proprietary neural network models (voyage-rerank-2, voyage-rerank-2-lite) accessed via API. These are likely advanced cross-encoders finely tuned for maximum relevance scoring.
First, install the Voyage AI libraries.
%pip install --upgrade --quiet voyageai
%pip install --upgrade --quiet langchain-voyageai
Set up the retriever, the VoyageAIRerank compressor, and the ContextualCompressionRetriever.
import os

from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain.retrievers import ContextualCompressionRetriever
from langchain_openai import OpenAI
from langchain_voyageai import VoyageAIRerank
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_voyageai import VoyageAIEmbeddings
documents = TextLoader("../../how_to/state_of_the_union.txt").load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
texts = text_splitter.split_documents(documents)
retriever = FAISS.from_documents(
texts, VoyageAIEmbeddings(model="voyage-law-2")
).as_retriever(search_kwargs={"k": 20})
llm = OpenAI(temperature=0)
compressor = VoyageAIRerank(
model="rerank-lite-1", voyageai_api_key=os.environ["VOYAGE_API_KEY"], top_k=3
)
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor, base_retriever=retriever
)
compressed_docs = compression_retriever.invoke(
"What did the president say about Ketanji Jackson Brown"
)
pretty_print_docs(compressed_docs)
Output:
Document 1:
One of the most serious constitutional responsibilities a President has is
nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji
Brown Jackson. One of our nation’s top legal minds, who will continue Justice
Breyer’s legacy of excellence.
----------------------------------------------------------------------------------------------------
Document 2:
So let’s not abandon our streets. Or choose between safety and equal justice.
Let’s come together to protect our communities, restore trust, and hold law
enforcement accountable.
That’s why the Justice Department required body cameras, banned chokeholds, and
restricted no-knock warrants for its officers.
----------------------------------------------------------------------------------------------------
Document 3:
I spoke with their families and told them that we are forever in debt for their
sacrifice, and we will carry on their mission to restore the trust and safety every
community deserves.
I’ve worked on these issues a long time.
I know what works: Investing in crime prevention and community police officers
who’ll walk the beat, who’ll know the neighborhood, and who can restore trust and
safety.
So let’s not abandon our streets. Or choose between safety and equal justice.
Jina AI offers reranking solutions, including neural models like Jina Reranker v2 and Jina-ColBERT. Jina Reranker v2 is likely a cross-encoder style model, while Jina-ColBERT implements the ColBERT architecture (explained next) using Jina’s base models.
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import JinaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
documents = TextLoader(
"../../how_to/state_of_the_union.txt",
).load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
texts = text_splitter.split_documents(documents)
embedding = JinaEmbeddings(model_name="jina-embeddings-v2-base-en")
retriever = FAISS.from_documents(texts, embedding).as_retriever(search_kwargs={"k": 20})
query = "What did the president say about Ketanji Brown Jackson"
docs = retriever.get_relevant_documents(query)
Now perform the reranking with Jina.
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.document_compressors import JinaRerank
compressor = JinaRerank()
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor, base_retriever=retriever
)
compressed_docs = compression_retriever.get_relevant_documents(
"What did the president say about Ketanji Jackson Brown"
)
pretty_print_docs(compressed_docs)
Output:
Document 1:
So let’s not abandon our streets. Or choose between safety and equal justice.
Let’s come together to protect our communities, restore trust, and hold law
enforcement accountable.
That’s why the Justice Department required body cameras, banned chokeholds, and
restricted no-knock warrants for its officers.
----------------------------------------------------------------------------------------------------
Document 2:
I spoke with their families and told them that we are forever in debt for their
sacrifice, and we will carry on their mission to restore the trust and safety every
community deserves.
I’ve worked on these issues a long time.
I know what works: Investing in crime prevention and community police officers
who’ll walk the beat, who’ll know the neighborhood, and who can restore trust and
safety.
So let’s not abandon our streets. Or choose between safety and equal justice.
ColBERT (Contextualized Late Interaction over BERT) is a multi-vector model. Instead of representing a document with one vector, it creates multiple contextualized vectors (often one per token). It uses a “late interaction” mechanism where query vectors are compared against the many document vectors after encoding. This allows document vectors to be pre-calculated and indexed.
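To see what “late interaction” means numerically, here is a toy sketch of ColBERT-style MaxSim scoring. Random vectors stand in for the per-token BERT embeddings that ColBERT actually produces.

import numpy as np

rng = np.random.default_rng(0)
query_vecs = rng.normal(size=(5, 128))   # one vector per query token
doc_vecs = rng.normal(size=(40, 128))    # one vector per document token (precomputed at index time)

# Normalize so dot products become cosine similarities
query_vecs /= np.linalg.norm(query_vecs, axis=1, keepdims=True)
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

# Late interaction: every query token picks its best-matching document token,
# and those per-token maxima are summed into a single relevance score
sim = query_vecs @ doc_vecs.T            # shape (query_tokens, doc_tokens)
score = sim.max(axis=1).sum()
print(f"MaxSim relevance score: {score:.3f}")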
Install the RAGatouille library to use the ColBERT reranker.
pip install -U ragatouille
Now set up the ColBERT reranker.
from ragatouille import RAGPretrainedModel
from langchain.retrievers import ContextualCompressionRetriever
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
compression_retriever = ContextualCompressionRetriever(
base_compressor=RAG.as_langchain_document_compressor(), base_retriever=retriever
)
compressed_docs = compression_retriever.invoke(
"What animation studio did Miyazaki found"
)
print(compressed_docs[0])
Output:
Document(page_content='In June 1985, Miyazaki, Takahata, Tokuma and Suzuki founded
the animation production company Studio Ghibli, with funding from Tokuma Shoten.
Studio Ghibli\'s first film, Laputa: Castle in the Sky (1986), employed the same
production crew of Nausicaä. Miyazaki\'s designs for the film\'s setting were
inspired by Greek architecture and "European urbanistic templates". Some of the
architecture in the film was also inspired by a Welsh mining town; Miyazaki
witnessed the mining strike upon his first', metadata={'relevance_score':
26.5194149017334})
FlashRank is designed as a very lightweight and fast reranking library, typically leveraging smaller, optimized transformer models (often distilled or pruned versions of larger models). It aims to provide significant relevance improvements over simple similarity search with minimal computational overhead. It functions like a cross-encoder but uses techniques to accelerate the process. It’s usually available as an open-source Python library.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(temperature=0)
compressor = FlashrankRerank()
compression_retriever = ContextualCompressionRetriever(
base_compressor=compressor, base_retriever=retriever
)
compressed_docs = compression_retriever.invoke(
"What did the president say about Ketanji Jackson Brown"
)
print([doc.metadata["id"] for doc in compressed_docs])
pretty_print_docs(compressed_docs)
This code snippet uses FlashrankRerank within a ContextualCompressionRetriever to improve the relevance of retrieved documents. It reranks the documents returned by the base retriever (the `retriever` object defined earlier) according to their relevance to the query “What did the president say about Ketanji Jackson Brown”, then prints the document IDs followed by the compressed, reranked documents.
Output:
[0, 5, 3]
Document 1:
One of the most serious constitutional responsibilities a President has is
nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji
Brown Jackson. One of our nation’s top legal minds, who will continue Justice
Breyer’s legacy of excellence.
----------------------------------------------------------------------------------------------------
Document 2:
He met the Ukrainian people.
From President Zelenskyy to every Ukrainian, their fearlessness, their courage,
their determination, inspires the world.
Groups of citizens blocking tanks with their bodies. Everyone from students to
retirees teachers turned soldiers defending their homeland.
In this struggle as President Zelenskyy said in his speech to the European
Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United
States is here tonight.
----------------------------------------------------------------------------------------------------
Document 3:
And tonight, I’m announcing that the Justice Department will name a chief prosecutor
for pandemic fraud.
By the end of this year, the deficit will be down to less than half what it was
before I took office.
The only president ever to cut the deficit by more than one trillion dollars in a
single year.
Lowering your costs also means demanding more competition.
I’m a capitalist, but capitalism without competition isn’t capitalism
It’s exploitation—and it drives up prices.
The output shows that the retrieved chunks are reranked based on their relevance to the query.
Provided by Mixedbread AI, this family includes mxbai-rerank-base-v2 (0.5B parameters) and mxbai-rerank-large-v2 (1.5B parameters). They are open-source (Apache 2.0 license) cross-encoders based on the Qwen-2.5 architecture. A key differentiator is their training process, which incorporates a three-stage reinforcement learning (RL) approach (GRPO, Contrastive Learning, Preference Learning) on top of initial training.
!pip install mxbai_rerank
from mxbai_rerank import MxbaiRerankV2
# Load the model; here we use the base-sized variant
model = MxbaiRerankV2("mixedbread-ai/mxbai-rerank-base-v2")
# Example query and documents
query = "Who wrote To Kill a Mockingbird?"
documents = ["To Kill a Mockingbird is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.",
"The novel Moby-Dick was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.",
"Harper Lee, an American novelist widely known for her novel To Kill a Mockingbird, was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.",
"Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.",
"The Harry Potter series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.",
"The Great Gatsby, a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan."
]
# Calculate the scores
results = model.rank(query, documents)
print(results)
Output:
[RankResult(index=0, score=9.847987174987793, document='To Kill a Mockingbird is a
novel by Harper Lee published in 1960. It was immediately successful, winning the
Pulitzer Prize, and has become a classic of modern American literature.'),
RankResult(index=2, score=8.258672714233398, document='Harper Lee, an American
novelist widely known for her novel To Kill a Mockingbird, was born in 1926 in
Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.'),
RankResult(index=3, score=3.579845428466797, document='Jane Austen was an English
novelist known primarily for her six major novels, which interpret, critique and
comment upon the British landed gentry at the end of the 18th century.'),
RankResult(index=4, score=2.716982841491699, document='The Harry Potter series,
which consists of seven fantasy novels written by British author J.K. Rowling, is
among the most popular and critically acclaimed books of the modern era.'),
RankResult(index=1, score=2.233165740966797, document='The novel Moby-Dick was
written by Herman Melville and first published in 1851. It is considered a
masterpiece of American literature and deals with complex themes of obsession,
revenge, and the conflict between good and evil.'),
RankResult(index=5, score=1.8150043487548828, document='The Great Gatsby, a novel
written by American author F. Scott Fitzgerald, was published in 1925. The story is
set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit
of Daisy Buchanan.')]
Evaluating rerankers is important. Standard retrieval metrics such as NDCG, MRR (mean reciprocal rank), and hit rate at k help measure how effectively a reranker pushes relevant documents to the top of the list.
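As a simple illustration, two of these metrics, MRR and hit rate at k, can be computed in a few lines; the ranked document IDs and relevance labels below are hypothetical.

def mrr(ranked_ids, relevant_ids):
    # Reciprocal rank of the first relevant document (0 if none is found)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def hit_rate_at_k(ranked_ids, relevant_ids, k=3):
    # 1 if any relevant document appears in the top k, else 0
    return 1.0 if any(d in relevant_ids for d in ranked_ids[:k]) else 0.0

ranked = ["doc7", "doc2", "doc9", "doc4"]   # reranker output, best first
relevant = {"doc2"}                         # ground-truth relevant documents
print(mrr(ranked, relevant))                # 0.5 (first relevant doc at rank 2)
print(hit_rate_at_k(ranked, relevant))      # 1.0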
Selecting the best reranker involves balancing several factors: relevance accuracy, latency, cost, and hosting requirements (managed API versus self-hosted open-source model). There are trade-offs: the most accurate cross-encoders tend to be slower and more expensive, while lightweight models are fast and cheap but may miss subtle relevance signals. To choose wisely, benchmark candidate rerankers on your own documents and queries rather than relying only on published results. The best reranker fits your specific performance, efficiency, and cost requirements.
Rerankers for RAG are vital for getting the most out of RAG systems. They refine the information given to LLMs, leading to better, more trustworthy answers. With various models available, from highly precise cross-encoders to efficient bi-encoders and specialized options like ColBERT, developers have choices. Selecting the right one requires understanding the trade-offs between accuracy, speed, scalability, and cost. As RAG evolves, especially towards handling diverse data types, rerankers for RAG will continue to play a crucial role in building smarter, more reliable AI applications. Careful evaluation and selection remain key to success.
Q1. What is Retrieval-Augmented Generation (RAG)?
A. RAG is a technique that improves large language models (LLMs) by allowing them to retrieve external information before generating responses. This makes them more accurate, adaptable, and able to incorporate new knowledge without retraining.
Q2. Why do initial retrieval results need refinement?
A. Initial retrieval methods like keyword search or vector similarity can return many documents, but not all are highly relevant. This can lead to noisy inputs that reduce LLM performance. Refining these results is necessary to improve answer quality.
Q3. What role do rerankers play in a RAG pipeline?
A. Rerankers reorder retrieved documents based on their relevance to the query. They act as a quality filter, ensuring the most relevant information is prioritized before being passed to the LLM for answer generation.
Q4. What are the strengths of Cohere Rerank?
A. Cohere Rerank provides high accuracy, multilingual support, and API-based integration. Its “Nimble” variant is optimized for faster responses, making it ideal for real-time applications.
Q5. Why choose bge-reranker?
A. bge-reranker is open-source and can be self-hosted, reducing costs while maintaining high accuracy. It is suitable for teams that prefer full control over their models.