Last updated January 14, 2026
Reranking is a refinement step that reorders the documents retrieved by a RAG system according to their actual relevance to the user query. Unlike initial retrieval, which relies on fast approximations such as cosine similarity over embeddings, reranking applies more sophisticated models that evaluate the match between a question and a candidate passage in fine detail.
Initial retrieval sacrifices precision for speed and scale
Modern RAG systems rely on two-stage retrieval. The first stage, often called “dense retrieval”, encodes documents and queries into a common vector space, then retrieves the k nearest documents by cosine similarity. This approach filters millions of documents in milliseconds thanks to ANN (Approximate Nearest Neighbor) indexes such as HNSW or IVF.
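To make the first stage concrete, here is a minimal sketch of top-k retrieval by cosine similarity over precomputed document vectors. It does an exact brute-force scan, which is what an ANN index such as HNSW approximates at scale. The vectors, document ids, and function names are illustrative assumptions, not output from a real embedding model.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dense_top_k(
    query_vec: List[float],
    doc_vecs: Dict[str, List[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Exact nearest-neighbor scan; an ANN index approximates this result."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hand-written toy vectors standing in for precomputed document embeddings.
doc_vecs = {
    "pfizer_storage": [0.9, 0.8, 0.1],
    "generic_storage": [0.7, 0.9, 0.2],
    "travel_guide": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.7, 0.0]
top2 = dense_top_k(query_vec, doc_vecs, k=2)
print([doc_id for doc_id, _ in top2])
```

The document vectors never depend on the query, which is why they can be computed once at indexing time.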
The problem lies in the fundamental tradeoff of dense retrieval. Bi-encoder embeddings encode query and document independently before comparing their representations. This independence enables pre-computing document embeddings but prevents any fine interaction between query tokens and document tokens. A document may contain exactly the sought information but in a phrasing the embedding does not perfectly capture.
Consider a concrete example. If a user asks “What are the temperature constraints for Pfizer vaccine storage?”, dense retrieval will retrieve documents mentioning vaccines, storage, and temperature. But it may also surface documents about other vaccines, generic medication storage, or temperature constraints in other contexts. The initial top-10 probably contains the answer, but not necessarily in first position.
Results on retrieval benchmarks suggest that bi-encoder models perform best on recall@k when k is relatively high: the relevant document is usually somewhere in the candidate list. Precision@1, by contrast, often remains insufficient for critical applications. Reranking targets exactly this gap between recall and precision.
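The two metrics can be computed as follows. This is a small sketch that assumes each ranked list contains unique document ids; the ids and the before/after rankings are made up for illustration.

```python
from typing import List, Set


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def precision_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Share of the top k positions occupied by relevant documents."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


relevant = {"d1", "d9"}
retrieved = ["d7", "d2", "d9", "d1", "d4"]   # order from the retriever
reranked = ["d1", "d9", "d7", "d2", "d4"]    # same documents, reordered

print(recall_at_k(retrieved, relevant, k=5))     # 1.0
print(precision_at_k(retrieved, relevant, k=1))  # 0.0
print(precision_at_k(reranked, relevant, k=1))   # 1.0
```

In this toy case the retriever already has perfect recall@5, and only the ordering changes between the two lists.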
Reranking reorders candidates through fine relevance evaluation
Reranking operates on the subset of documents retrieved by the initial stage. Instead of processing millions of documents, the reranker evaluates only dozens to hundreds of candidates. This drastic volume reduction allows applying more expensive but more accurate models.
The typical architecture of a RAG pipeline with reranking proceeds in three steps. The initial retriever fetches a large top-k, typically 50 to 100 documents. The reranker evaluates each (query, document) pair and assigns a relevance score. The system keeps the top-n documents after reranking, with n much smaller than k (often 3 to 10). These final documents feed the LLM context for generation.
from typing import List, Tuple

from sentence_transformers import CrossEncoder


class RAGRerankingPipeline:
    """
    RAG pipeline with reranking step between retrieval and generation.
    Reranking improves precision without modifying the vector index.
    """

    def __init__(
        self,
        retriever,
        reranker_model: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
        top_k_retrieval: int = 50,
        top_n_final: int = 5,
    ):
        # `retriever` is any object exposing search(query, top_k) and
        # returning a list of dicts with a "content" key
        self.retriever = retriever
        self.reranker = CrossEncoder(reranker_model)
        self.top_k = top_k_retrieval
        self.top_n = top_n_final

    def retrieve_and_rerank(self, query: str) -> List[Tuple[str, float]]:
        """
        Retrieves documents then reorders them by relevance.
        Returns top_n documents with their reranking scores.
        """
        # Step 1: Large initial retrieval
        candidates = self.retriever.search(query, top_k=self.top_k)

        # Step 2: Prepare pairs for reranking
        pairs = [(query, doc["content"]) for doc in candidates]

        # Step 3: Scoring by cross-encoder
        scores = self.reranker.predict(pairs)

        # Step 4: Reordering and selection
        scored_candidates = list(zip(candidates, scores))
        scored_candidates.sort(key=lambda x: x[1], reverse=True)
        results = [
            (doc["content"], score)
            for doc, score in scored_candidates[:self.top_n]
        ]
        return results
The advantage of reranking lies in its decoupling from the retrieval system. Precision can be improved without rebuilding the vector index, changing the embedding model, or modifying existing chunks. The reranker slots in as an additional layer that filters and reorders what the retriever has already found.
Cross-encoders jointly evaluate query and document
Cross-encoders represent the most direct approach to reranking. Unlike bi-encoders that separately encode query and document, the cross-encoder concatenates both texts and processes them together in a single transformer model. Each query token can “see” each document token via attention mechanisms.
This complete interaction between query and document enables much finer relevance evaluation. The model can detect subtle semantic matches, negations, conditions, nuances that independent embeddings miss. The tradeoff is computational cost: each query-document pair requires a complete transformer forward pass.
Cross-encoder models trained on MS MARCO have dominated reranking benchmarks for several years. MS MARCO (Microsoft MAchine Reading COmprehension) contains millions of query-passage pairs with relevance annotations. Models like ms-marco-MiniLM-L-6-v2 offer a good tradeoff between performance and latency. Larger models like bge-reranker-v2-m3 or Cohere Rerank models achieve higher scores at the cost of increased latency.
from typing import List

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class CrossEncoderReranker:
    """
    Cross-encoder for RAG document reranking.
    Evaluates each (query, document) pair jointly.
    """

    def __init__(self, model_name: str = "BAAI/bge-reranker-v2-m3"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model.eval()
        # GPU detection if available
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

    def score_pairs(
        self,
        query: str,
        documents: List[str],
        batch_size: int = 16,
    ) -> List[float]:
        """
        Computes relevance scores for each document.
        Processes in batches to optimize GPU utilization.
        """
        scores: List[float] = []
        for i in range(0, len(documents), batch_size):
            batch_docs = documents[i:i + batch_size]

            # Tokenize (query, document) pairs; long pairs are truncated
            inputs = self.tokenizer(
                [query] * len(batch_docs),
                batch_docs,
                padding=True,
                truncation=True,
                max_length=512,
                return_tensors="pt",
            ).to(self.device)

            with torch.no_grad():
                outputs = self.model(**inputs)

            # One logit per pair is the relevance score; view(-1) always
            # yields a flat list, even for a batch of size 1
            batch_scores = outputs.logits.view(-1).float().cpu().tolist()
            scores.extend(batch_scores)
        return scores
Cross-encoder latency depends linearly on the number of documents to evaluate. With 50 documents and a MiniLM model on GPU, the reranking step typically takes 50-100ms. On CPU, this time can rise to several hundred milliseconds or even seconds. This latency remains acceptable for most use cases but becomes problematic if the initial top-k exceeds a hundred documents.
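Since these figures depend heavily on hardware and model, it is worth measuring them on your own setup. The sketch below times any scorer with a `(query, documents) -> scores` signature at several candidate counts. `median_latency_ms` and `overlap_scorer` are hypothetical helpers, and the word-overlap scorer is only a stand-in so the example runs without a model.

```python
import time
from statistics import median
from typing import Callable, List

ScoreFn = Callable[[str, List[str]], List[float]]


def median_latency_ms(
    score_fn: ScoreFn,
    query: str,
    documents: List[str],
    repeats: int = 5,
) -> float:
    """Median wall-clock time of score_fn(query, documents), in milliseconds."""
    if repeats < 1:
        raise ValueError("repeats must be >= 1")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        score_fn(query, documents)
        timings.append((time.perf_counter() - start) * 1000.0)
    return median(timings)


def overlap_scorer(query: str, documents: List[str]) -> List[float]:
    """Stand-in scorer (word overlap); swap in a real reranker to measure it."""
    query_words = set(query.lower().split())
    return [float(len(query_words & set(d.lower().split()))) for d in documents]


docs = ["vaccine storage temperature guidance"] * 100
for n in (10, 50, 100):
    ms = median_latency_ms(overlap_scorer, "vaccine storage", docs[:n])
    print(f"{n:>3} docs: {ms:.3f} ms")
```

With a real model, run one untimed warm-up call first, since the first forward pass typically includes one-off initialization costs.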
ColBERT introduces late interaction for better speed-quality tradeoff
ColBERT (Contextualized Late Interaction over BERT), presented by Khattab and Zaharia in their paper arXiv:2004.12832 (2020), proposes an intermediate architecture between bi-encoders and cross-encoders. The central idea: encode query and documents separately like bi-encoders, but compute similarity via token-to-token interaction rather than simple global cosine similarity.
In ColBERT, each query and document token receives a contextualized embedding from BERT. At scoring time, the system computes maximum similarity between each query token and all document tokens. The final score is the sum of these maximum similarities. This “late interaction” captures fine matches without the cost of a joint forward pass.
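Written as a formula, the late-interaction score described above looks like this. The symbols are notation introduced here for illustration: $E_{q_i}$ is the embedding of the i-th query token, $E_{d_j}$ the embedding of the j-th document token, and both are assumed to be L2-normalized so that the dot product is a cosine similarity.

```latex
S(q, d) \;=\; \sum_{i=1}^{|q|} \; \max_{1 \le j \le |d|} \; E_{q_i} \cdot E_{d_j}
```

Each query token contributes only its best-matching document token, so one well-matched phrase in a long passage still counts fully.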
ColBERT’s advantage is the possibility of pre-computing token embeddings for each document. Only query token embeddings require computation at inference time. Token-to-token interaction remains expensive but far less than a complete cross-encoder. ColBERTv2 and subsequent work optimized storage and computation to make this approach practical at scale.
from typing import List, Tuple

import torch
from transformers import AutoModel, AutoTokenizer


class ColBERTReranker:
    """
    Reranker based on ColBERT-style late interaction.
    Compromise between cross-encoder precision and bi-encoder speed.

    Simplified sketch: AutoModel loads only the BERT encoder, so
    ColBERT-specific components (linear projection, [Q]/[D] marker tokens)
    are not applied and scores approximate the official implementation.
    """

    def __init__(self, model_name: str = "colbert-ir/colbertv2.0"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.model.eval()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

    def encode_text(self, text: str, max_length: int = 256) -> torch.Tensor:
        """Encodes text into token-level embeddings."""
        # No padding: a single sequence needs none, and [PAD] embeddings
        # would otherwise take part in the MaxSim computation
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            max_length=max_length,
            truncation=True,
        ).to(self.device)

        with torch.no_grad():
            outputs = self.model(**inputs)

        # Normalize embeddings so dot products are cosine similarities
        embeddings = torch.nn.functional.normalize(
            outputs.last_hidden_state,
            p=2,
            dim=-1,
        )
        return embeddings.squeeze(0)  # [seq_len, hidden_dim]

    def score_late_interaction(
        self,
        query_embeddings: torch.Tensor,
        doc_embeddings: torch.Tensor,
    ) -> float:
        """
        Computes ColBERT score via late interaction.
        For each query token, finds max similarity with document tokens.
        """
        # Similarity between all tokens [query_len, doc_len]
        similarities = torch.matmul(query_embeddings, doc_embeddings.T)
        # MaxSim: for each query token, take max over document tokens
        max_sims = similarities.max(dim=1).values
        # Final score = sum of MaxSims
        return max_sims.sum().item()

    def rerank_documents(
        self,
        query: str,
        documents: List[str],
    ) -> List[Tuple[str, float]]:
        """Reranks a list of documents for a query."""
        # Encode query once
        query_emb = self.encode_text(query)

        results = []
        for doc in documents:
            # In production, document embeddings would be pre-computed
            doc_emb = self.encode_text(doc)
            score = self.score_late_interaction(query_emb, doc_emb)
            results.append((doc, score))

        results.sort(key=lambda x: x[1], reverse=True)
        return results
Published benchmarks show that ColBERT achieves performance close to cross-encoders on reranking tasks while being significantly faster. The latency difference becomes particularly marked when document embeddings are pre-computed, enabling near real-time reranking even on large sets.
LLM reranking exploits the reasoning capabilities of large models
A more recent approach uses an LLM directly for reranking. The principle is simple: ask the language model to order documents by relevance, optionally making its reasoning explicit. When the model ranks the whole candidate list in one pass, the method is called “listwise reranking”; it is closely related to the “LLM-as-a-judge” pattern. It can achieve high performance on complex queries where semantic reasoning matters more than lexical matching.
The RankGPT paper and similar work explored different prompting strategies for reranking. The most direct approach presents documents to the LLM and asks it to order them from most to least relevant. Variants use pairwise comparisons (“Is document A more relevant than document B for this question?”) or absolute scores (“Rate relevance from 1 to 10”).
import re
from typing import List, Tuple

from openai import OpenAI


class LLMReranker:
    """
    Reranking via LLM with explicit prompting.
    Suited to cases where complex semantic reasoning is needed.
    """

    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model

    def rerank_listwise(
        self,
        query: str,
        documents: List[str],
        top_n: int = 5,
    ) -> List[Tuple[str, int]]:
        """
        Asks the LLM to order documents by relevance.
        Returns top_n documents with their rank.
        """
        # Format documents with identifiers
        formatted_docs = []
        for i, doc in enumerate(documents):
            # Truncate to keep the prompt within the context window
            truncated = doc[:500] + "..." if len(doc) > 500 else doc
            formatted_docs.append(f"[Doc {i+1}]: {truncated}")
        docs_block = "\n".join(formatted_docs)

        prompt = (
            "Given the following question and candidate documents,\n"
            "order the documents from most relevant to least relevant "
            "for answering the question.\n\n"
            f"Question: {query}\n\n"
            "Documents:\n"
            f"{docs_block}\n\n"
            "Return only the document numbers in decreasing relevance order,\n"
            "separated by commas. For example: 3, 1, 5, 2, 4\n\n"
            "Ranking:"
        )

        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )

        # Parse response to extract order
        order_text = (response.choices[0].message.content or "").strip()
        ordered_indices: List[int] = []
        for n in re.findall(r"\d+", order_text):
            idx = int(n) - 1
            # Keep only valid document numbers, and each one only once
            if 0 <= idx < len(documents) and idx not in ordered_indices:
                ordered_indices.append(idx)

        # Build results
        return [
            (documents[idx], rank + 1)
            for rank, idx in enumerate(ordered_indices[:top_n])
        ]

    def rerank_pointwise(
        self,
        query: str,
        documents: List[str],
    ) -> List[Tuple[str, float]]:
        """
        Evaluates each document individually with a relevance score.
        One call per document: scores are independent, token cost is higher.
        """
        results = []
        for doc in documents:
            prompt = (
                "Evaluate the relevance of the following document "
                "for answering the question.\n\n"
                f"Question: {query}\n\n"
                f"Document: {doc[:800]}\n\n"
                "Give a relevance score from 0 to 10, where:\n"
                "- 0-2: Not relevant\n"
                "- 3-4: Marginally relevant\n"
                "- 5-6: Partially relevant\n"
                "- 7-8: Highly relevant\n"
                "- 9-10: Perfectly relevant\n\n"
                "Reply only with the numeric score."
            )

            response = self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0,
            )

            # Take the first number in the reply; fall back to 0.0 if none
            text = (response.choices[0].message.content or "").strip()
            match = re.search(r"\d+(?:\.\d+)?", text)
            score = float(match.group()) if match else 0.0
            results.append((doc, score))

        results.sort(key=lambda x: x[1], reverse=True)
        return results
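The pairwise variant mentioned earlier can be sketched as a sort driven by a comparison function. The code below is an illustration under stated assumptions: `rerank_pairwise` and `keyword_judge` are hypothetical names, and `keyword_judge` is a deterministic stand-in for what would be an LLM call asking “Is document A more relevant than document B for this question?”.

```python
from functools import cmp_to_key
from typing import Callable, List

# A judge takes (query, doc_a, doc_b) and returns True if doc_a is preferred.
PreferFn = Callable[[str, str, str], bool]


def rerank_pairwise(query: str, documents: List[str], prefers_first: PreferFn) -> List[str]:
    """Orders documents using only pairwise preference judgments."""

    def compare(doc_a: str, doc_b: str) -> int:
        if doc_a == doc_b:
            return 0
        # Negative result places doc_a before doc_b
        return -1 if prefers_first(query, doc_a, doc_b) else 1

    return sorted(documents, key=cmp_to_key(compare))


def keyword_judge(query: str, doc_a: str, doc_b: str) -> bool:
    """Stand-in for an LLM judgment: prefers the document sharing more query words."""
    query_words = set(query.lower().split())
    overlap_a = len(query_words & set(doc_a.lower().split()))
    overlap_b = len(query_words & set(doc_b.lower().split()))
    return overlap_a > overlap_b


docs = ["cats sleep", "vaccine storage temperature", "vaccine news"]
print(rerank_pairwise("vaccine storage temperature", docs, keyword_judge))
```

A sort needs on the order of n log n comparisons, each of which would be one LLM call. Real LLM preferences are also not guaranteed to be transitive, so the resulting order is best-effort.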
LLM reranking presents distinct advantages and disadvantages. On the advantage side, the LLM can handle complex queries requiring multi-step reasoning, understands nuances and implicit context, and can be guided by domain-specific instructions. On the disadvantage side, token and latency costs far exceed other approaches, results can vary with prompting, and the model can introduce its own biases.
Reranker choice depends on application context and operational constraints
Each reranking approach addresses different needs. The following table summarizes the main characteristics of the three reranker families.
| Criterion | Cross-Encoder | ColBERT | LLM Reranking |
|---|---|---|---|
| Relative precision | High | High | Variable with prompting |
| Latency (50 docs) | 50-200ms GPU | 20-50ms GPU | 2-10s |
| Inference cost | Moderate | Low | High |
| Customization | Fine-tuning | Fine-tuning | Prompting |
| Complex understanding | Good | Good | Excellent |
| On-premise deployment | Easy | Easy | Possible but expensive |
For low-latency, high-volume applications, ColBERT or a lightweight cross-encoder (MiniLM) are natural choices. Real-time systems like chatbots or search engines benefit from their speed without sacrificing too much quality.
For applications where precision trumps latency, larger cross-encoders (bge-reranker-large, Cohere Rerank) offer the best performance on standard benchmarks. Internal B2B use cases, document assistants, or compliance systems fall into this category.
LLM reranking suits scenarios where queries are complex and rare. Contract analysis, legal research, or technical questions may justify the extra cost if answer quality demands it. The prohibitive cost for volume makes this approach unsuited to consumer applications.
Practical implementation combines multiple strategies based on needs
A mature production system may combine multiple reranking approaches. A common strategy uses a fast reranker (ColBERT or MiniLM) for initial filtering, then applies a more precise reranker to the shortlist that survives (typically the top 10 to 20 candidates) if the query warrants it.
from typing import List

from sentence_transformers import CrossEncoder


class MultiStageRerankingPipeline:
    """
    Cascading reranking pipeline to optimize precision and latency.
    Uses a fast reranker then a precise reranker on best candidates.
    """

    def __init__(
        self,
        retriever,
        fast_reranker: CrossEncoder,
        precise_reranker: CrossEncoder,
        initial_top_k: int = 100,
        intermediate_top_k: int = 20,
        final_top_n: int = 5,
    ):
        self.retriever = retriever
        self.fast_reranker = fast_reranker
        self.precise_reranker = precise_reranker
        self.initial_top_k = initial_top_k
        self.intermediate_top_k = intermediate_top_k
        self.final_top_n = final_top_n

    def search(self, query: str) -> List[dict]:
        """Executes the complete retrieval and reranking pipeline."""
        # Step 1: Large initial retrieval
        candidates = self.retriever.search(query, top_k=self.initial_top_k)

        # Step 2: First fast reranking
        pairs = [(query, doc["content"]) for doc in candidates]
        fast_scores = self.fast_reranker.predict(pairs)
        filtered_candidates = sorted(
            zip(candidates, fast_scores),
            key=lambda x: x[1],
            reverse=True,
        )[:self.intermediate_top_k]

        # Step 3: Precise reranking on best candidates
        filtered_docs = [c[0] for c in filtered_candidates]
        final_pairs = [(query, doc["content"]) for doc in filtered_docs]
        precise_scores = self.precise_reranker.predict(final_pairs)
        final_results = sorted(
            zip(filtered_docs, precise_scores),
            key=lambda x: x[1],
            reverse=True,
        )[:self.final_top_n]

        # Attach the final score to a copy of each document dict
        return [
            {**doc, "reranking_score": score}
            for doc, score in final_results
        ]
Production monitoring must track several metrics to evaluate reranking effectiveness. Comparing positions before/after reranking (position lift) indicates whether the reranker adds value. Reranking time per query detects performance regressions. User feedback (clicks, reformulations) validates perceived improvement.
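Position lift can be computed from logged rankings. The sketch below assumes a hypothetical log schema in which each event stores the document ids before and after reranking plus the id the user clicked; the field names are illustrative, not a standard.

```python
from typing import Dict, List, Optional


def position_lift(before: List[str], after: List[str], doc_id: str) -> Optional[int]:
    """Positions gained by doc_id after reranking (positive = moved up)."""
    if doc_id not in before or doc_id not in after:
        return None
    return before.index(doc_id) - after.index(doc_id)


def mean_lift(events: List[Dict]) -> float:
    """Average lift of clicked documents over all events where it is defined."""
    lifts = [position_lift(e["before"], e["after"], e["clicked_id"]) for e in events]
    known = [lift for lift in lifts if lift is not None]
    return sum(known) / len(known) if known else 0.0


events = [
    # Clicked document moved from position 3 to position 1: lift of +2
    {"before": ["a", "b", "c", "d"], "after": ["c", "a", "b", "d"], "clicked_id": "c"},
    # Clicked document moved from position 2 to position 3: lift of -1
    {"before": ["x", "y", "z"], "after": ["x", "z", "y"], "clicked_id": "y"},
]
print(mean_lift(events))  # 0.5
```

A mean lift close to zero on clicked documents suggests the reranker is mostly reproducing the retriever's order.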
Reranking limitations should not be ignored
Reranking is not a silver bullet. Its first limitation is fundamental: the reranker can only reorder what the retriever found. If the relevant document is not in the initial top-k, no amount of reranking will surface it. An excellent reranker cannot compensate for a bad retriever.
The second limitation concerns long document semantics. Rerankers process passages of a few hundred tokens. If relevant information is diluted in a multi-page document, the retrieved passage may not contain it, and the reranker will evaluate irrelevant content. Upstream chunking remains crucial.
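One common mitigation is to split a long document into overlapping windows, score each window, and keep the best one. The sketch below assumes whitespace tokens as a rough proxy for model tokens and accepts any scorer with a `(query, passages) -> scores` signature, such as a method shaped like `score_pairs`. `split_into_windows`, `best_passage`, and `overlap_scorer` are hypothetical names, and the overlap scorer is only a stand-in.

```python
from typing import Callable, List, Tuple

ScoreFn = Callable[[str, List[str]], List[float]]


def split_into_windows(tokens: List[str], window: int, stride: int) -> List[List[str]]:
    """Overlapping windows of `window` tokens, advancing by `stride` tokens."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(tokens) <= window:
        return [tokens]
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += stride
    return windows


def best_passage(
    query: str,
    document: str,
    score_fn: ScoreFn,
    window: int = 200,
    stride: int = 100,
) -> Tuple[str, float]:
    """Returns the highest-scoring window of the document and its score."""
    tokens = document.split()
    passages = [" ".join(w) for w in split_into_windows(tokens, window, stride)]
    scores = score_fn(query, passages)
    best = max(range(len(passages)), key=lambda i: scores[i])
    return passages[best], scores[best]


def overlap_scorer(query: str, passages: List[str]) -> List[float]:
    """Stand-in scorer: number of query words present in each passage."""
    query_words = set(query.split())
    return [float(len(query_words & set(p.split()))) for p in passages]


long_doc = " ".join(["filler"] * 30 + ["store", "at", "minus", "seventy"] + ["filler"] * 30)
passage, score = best_passage("store minus seventy", long_doc, overlap_scorer, window=10, stride=5)
print(score, "|", passage)
```

Keeping the stride at or below the window size ensures no part of the document is skipped.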
The third limitation is cost. In production, every millisecond counts. A reranker adds an inference step that consumes compute and latency. For very high-volume systems (millions of queries/day), this cost can become prohibitive. The tradeoff between precision and operational cost must be explicit.
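A back-of-the-envelope calculation makes this tradeoff explicit. The sketch below assumes queries are reranked one at a time on a GPU, with no batching across queries. The 75 ms figure is an illustrative value within the 50-100ms range quoted earlier, and the traffic numbers are made up.

```python
import math


def daily_gpu_hours(queries_per_day: int, rerank_ms_per_query: float) -> float:
    """GPU time consumed per day if queries are reranked sequentially."""
    return queries_per_day * rerank_ms_per_query / 1000.0 / 3600.0


def min_gpus(
    queries_per_day: int,
    rerank_ms_per_query: float,
    peak_to_average: float = 1.0,
) -> int:
    """Lower bound on GPUs needed, scaled by the peak-to-average traffic ratio."""
    hours = daily_gpu_hours(queries_per_day, rerank_ms_per_query)
    return max(1, math.ceil(hours / 24.0 * peak_to_average))


hours = daily_gpu_hours(5_000_000, 75.0)
print(f"{hours:.1f} GPU-hours per day")                 # 104.2 GPU-hours per day
print(min_gpus(5_000_000, 75.0))                        # 5
print(min_gpus(5_000_000, 75.0, peak_to_average=3.0))   # 14
```

Even at perfectly flat traffic, five million queries per day at 75 ms each would keep at least five GPUs busy around the clock.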
Finally, rerankers trained on MS MARCO or similar datasets may generalize poorly to specific domains. A reranker performing well on generic web queries may underperform on technical, medical, or legal vocabulary. Fine-tuning on domain-specific data then becomes necessary.
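Before investing in fine-tuning, a small labeled set of domain queries can show whether the gap exists. The sketch below computes mean reciprocal rank (MRR) over reranked lists; the document ids and relevance labels are made-up placeholders for such a set.

```python
from typing import List, Set


def mean_reciprocal_rank(rankings: List[List[str]], relevant: List[Set[str]]) -> float:
    """Average of 1/rank of the first relevant document for each query."""
    if len(rankings) != len(relevant):
        raise ValueError("rankings and relevant must have the same length")
    if not rankings:
        return 0.0
    total = 0.0
    for ranked_ids, relevant_ids in zip(rankings, relevant):
        for position, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / position
                break  # only the first relevant hit counts
    return total / len(rankings)


# Reranked output for two hypothetical domain queries and their labels.
domain_rankings = [["a", "b", "c"], ["x", "y", "z"]]
domain_relevant = [{"b"}, {"x"}]
print(mean_reciprocal_rank(domain_rankings, domain_relevant))  # 0.75
```

Comparing this number between a generic query set and a domain query set, using the same reranker, gives a first indication of how well it transfers.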
What next?
Reranking is an accessible optimization that can substantially improve existing RAG pipelines. Integrating a cross-encoder takes a few lines of code and does not affect the retrieval infrastructure. For teams seeing imprecise RAG answers despite correct retrieval, reranking is often the first lever to pull.
Racine AI integrates adaptive reranking strategies in its document processing solutions. Our pipelines combine vector retrieval, multi-stage reranking, and VLM validation to ensure extracted information precisely matches business needs. Reranker selection, configuration, and integration in the overall architecture are part of our document intelligence expertise.