Reranking Transforms Raw RAG Results into Precise Answers

Racine AI January 14, 2026

Last updated January 14, 2026

Reranking is a refinement step that reorders the documents retrieved by a RAG system according to their actual relevance to the user query. Unlike initial retrieval, which relies on fast approximations such as cosine similarity between embeddings, reranking applies more expensive models that evaluate the match between a question and each candidate passage in detail.

Initial retrieval sacrifices precision for speed and scale

Modern RAG systems rely on two-stage retrieval. The first stage, often called “dense retrieval”, encodes documents and queries into a shared vector space, then retrieves the k nearest documents by cosine similarity. This approach filters millions of documents in milliseconds thanks to approximate nearest neighbor (ANN) indexes such as HNSW or IVF.

The problem lies in the fundamental tradeoff of dense retrieval. Bi-encoder embeddings encode query and document independently before comparing their representations. This independence enables pre-computing document embeddings but prevents any fine interaction between query tokens and document tokens. A document may contain exactly the sought information but in a phrasing the embedding does not perfectly capture.
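
To make this independence concrete, here is a minimal sketch of bi-encoder scoring. The three-dimensional vectors are made-up stand-ins for real embeddings, and the names `cosine`, `dense_top_k`, `DOC_VECS`, and `QUERY_VEC` are illustrative. In a real system the vectors would come from an embedding model and the search would go through an ANN index rather than a full scan.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dense_top_k(
    query_vec: List[float],
    doc_vecs: Dict[str, List[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Scores every document against the query and keeps the k best."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy document vectors (made up), computed offline before any query arrives
DOC_VECS = {
    "pfizer_storage": [0.9, 0.8, 0.1],
    "generic_drug_storage": [0.8, 0.9, 0.2],
    "vaccine_side_effects": [0.2, 0.1, 0.9],
}
# Toy query vector (made up), computed at query time without seeing any document
QUERY_VEC = [0.8, 0.95, 0.15]

top_2 = dense_top_k(QUERY_VEC, DOC_VECS, k=2)
print([doc_id for doc_id, _ in top_2])
```

With these toy values the generic storage document scores slightly above the Pfizer one: each vector is a single summary of its text, and nothing in the comparison looks at individual query or document tokens.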

Consider a concrete example. If a user asks “What are the temperature constraints for Pfizer vaccine storage?”, dense retrieval will retrieve documents mentioning vaccines, storage, and temperature. But it may also surface documents about other vaccines, generic medication storage, or temperature constraints in other contexts. The initial top-10 probably contains the answer, but not necessarily in first position.

On retrieval benchmarks, bi-encoder models typically reach high recall@k only at relatively large k: the relevant document is usually somewhere among the candidates, but not at the top. As a result, precision@1 often remains insufficient for critical applications. Reranking exists precisely to close this gap between recall and precision.
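
The two metrics can be sketched in a few lines. The document identifiers and the single relevant document below are invented for illustration; `recall_at_k` and `precision_at_k` are helper names chosen here, not part of any library.

```python
from typing import List, Set


def recall_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


def precision_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k positions occupied by relevant documents."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / k


# Hypothetical ranking from the retriever: the answer is present but in 4th place
retrieved = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d4"}

print(recall_at_k(retrieved, relevant, k=5))     # 1.0
print(precision_at_k(retrieved, relevant, k=1))  # 0.0
```

A recall@5 of 1.0 with a precision@1 of 0.0 is exactly the situation a reranker is meant to fix: the answer was retrieved, it is just not first.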

Reranking reorders candidates through fine relevance evaluation

Reranking operates on the subset of documents retrieved by the initial stage. Instead of processing millions of documents, the reranker evaluates only dozens to hundreds of candidates. This drastic volume reduction allows applying more expensive but more accurate models.

The typical architecture of a RAG pipeline with reranking proceeds in three steps. The initial retriever fetches a large top-k, typically 50 to 100 documents. The reranker evaluates each (query, document) pair and assigns a relevance score. The system keeps the top-n documents after reranking, with n much smaller than k (often 3 to 10). These final documents feed the LLM context for generation.

from sentence_transformers import CrossEncoder
from typing import List, Tuple

class RAGRerankingPipeline:
    """
    RAG pipeline with reranking step between retrieval and generation.
    Reranking improves precision without modifying the vector index.
    """

    def __init__(
        self,
        retriever,
        reranker_model: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
        top_k_retrieval: int = 50,
        top_n_final: int = 5
    ):
        self.retriever = retriever
        self.reranker = CrossEncoder(reranker_model)
        self.top_k = top_k_retrieval
        self.top_n = top_n_final

    def retrieve_and_rerank(self, query: str) -> List[Tuple[str, float]]:
        """
        Retrieves documents then reorders them by relevance.
        Returns top_n documents with their reranking scores.
        """
        # Step 1: Large initial retrieval
        candidates = self.retriever.search(query, top_k=self.top_k)

        # Step 2: Prepare pairs for reranking
        pairs = [(query, doc["content"]) for doc in candidates]

        # Step 3: Scoring by cross-encoder
        scores = self.reranker.predict(pairs)

        # Step 4: Reordering and selection
        scored_candidates = list(zip(candidates, scores))
        scored_candidates.sort(key=lambda x: x[1], reverse=True)

        results = [
            (doc["content"], score)
            for doc, score in scored_candidates[:self.top_n]
        ]

        return results

The advantage of reranking lies in its decoupling from the retrieval system. Precision can be improved without rebuilding the vector index, changing the embedding model, or modifying existing chunks. The reranker slots in as an additional layer that filters and reorders what the retriever already found.
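
This decoupling shows in the pipeline's only requirement on the retriever: a `search(query, top_k)` method returning dictionaries with a `"content"` key. A minimal stand-in that satisfies this contract is sketched below. `KeywordOverlapRetriever` is a hypothetical class written for this example, and word overlap is only a placeholder for a real vector search.

```python
from typing import Any, Dict, List


class KeywordOverlapRetriever:
    """Hypothetical in-memory retriever used in place of a vector store client."""

    def __init__(self, documents: List[str]):
        self.documents = documents

    def search(self, query: str, top_k: int = 50) -> List[Dict[str, Any]]:
        """Returns up to top_k documents as dicts with a "content" key."""
        query_terms = set(query.lower().split())
        scored = []
        for doc_id, text in enumerate(self.documents):
            # Placeholder score: number of query words present in the document
            overlap = len(query_terms & set(text.lower().split()))
            scored.append({"id": doc_id, "content": text, "retrieval_score": overlap})
        scored.sort(key=lambda doc: doc["retrieval_score"], reverse=True)
        return scored[:top_k]


retriever = KeywordOverlapRetriever([
    "vaccine storage temperature limits",
    "office opening hours",
    "vaccine side effects",
])
candidates = retriever.search("vaccine storage temperature", top_k=2)
print([doc["content"] for doc in candidates])
```

Any object with the same method and return shape can be passed as `retriever` to `RAGRerankingPipeline`, which is what allows the retrieval backend to be swapped without touching the reranking step.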

Cross-encoders jointly evaluate query and document

Cross-encoders represent the most direct approach to reranking. Unlike bi-encoders that separately encode query and document, the cross-encoder concatenates both texts and processes them together in a single transformer model. Each query token can “see” each document token via attention mechanisms.

This complete interaction between query and document enables much finer relevance evaluation. The model can detect subtle semantic matches, negations, conditions, nuances that independent embeddings miss. The tradeoff is computational cost: each query-document pair requires a complete transformer forward pass.

Cross-encoder models trained on MS MARCO have dominated reranking benchmarks for several years. MS MARCO (Microsoft MAchine Reading COmprehension) is a large-scale dataset built from real Bing search queries, with a collection of about 8.8 million passages and hundreds of thousands of queries annotated with relevant passages. Models like ms-marco-MiniLM-L-6-v2 offer a good tradeoff between performance and latency. Larger models like bge-reranker-v2-m3, or hosted services such as Cohere Rerank, achieve higher scores at the cost of increased latency.

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from typing import List
import torch

class CrossEncoderReranker:
    """
    Cross-encoder for RAG document reranking.
    Evaluates each (query, document) pair jointly.
    """

    def __init__(self, model_name: str = "BAAI/bge-reranker-v2-m3"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model.eval()

        # GPU detection if available
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

    def score_pairs(
        self,
        query: str,
        documents: List[str],
        batch_size: int = 16
    ) -> List[float]:
        """
        Computes relevance scores for each document.
        Processes in batches to optimize GPU utilization.
        """
        scores = []

        for i in range(0, len(documents), batch_size):
            batch_docs = documents[i:i + batch_size]

            # Tokenize pairs
            inputs = self.tokenizer(
                [query] * len(batch_docs),
                batch_docs,
                padding=True,
                truncation=True,
                max_length=512,
                return_tensors="pt"
            ).to(self.device)

            with torch.no_grad():
                outputs = self.model(**inputs)
                # Logits represent relevance score
                batch_scores = outputs.logits.squeeze(-1).cpu().tolist()

            if isinstance(batch_scores, float):
                batch_scores = [batch_scores]
            scores.extend(batch_scores)

        return scores

Cross-encoder latency grows roughly linearly with the number of documents to evaluate. With 50 documents and a MiniLM model on GPU, the reranking step typically takes 50-100 ms. On CPU, this can rise to several hundred milliseconds or even seconds. This latency remains acceptable for most use cases but becomes problematic once the initial top-k exceeds a hundred documents.
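
These figures vary with hardware, model, and passage length, so they are worth measuring on your own setup. The sketch below times any scoring function with the same shape as `score_pairs(query, documents)`. `measure_rerank_latency_ms` and `dummy_scorer` are names invented for this example, and the dummy scorer only stands in for a real model call.

```python
import statistics
import time
from typing import Callable, List, Sequence


def measure_rerank_latency_ms(
    score_fn: Callable[[str, Sequence[str]], List[float]],
    query: str,
    documents: Sequence[str],
    repeats: int = 5,
) -> float:
    """Median wall-clock time of score_fn over `repeats` runs, in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        score_fn(query, documents)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def dummy_scorer(query: str, documents: Sequence[str]) -> List[float]:
    """Placeholder for a real reranker: counts shared words."""
    terms = set(query.split())
    return [float(len(terms & set(doc.split()))) for doc in documents]


docs = ["vaccine storage temperature"] * 50
latency_ms = measure_rerank_latency_ms(dummy_scorer, "vaccine storage", docs)
print(f"median latency for {len(docs)} documents: {latency_ms:.3f} ms")
```

The median is used so that a slow first run (model warm-up, cache misses) does not skew the result. When timing GPU code, make sure the scoring function returns only after results are back on the CPU, as `score_pairs` does, otherwise asynchronous execution can make the measurement look faster than it is.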

ColBERT introduces late interaction for better speed-quality tradeoff

ColBERT (Contextualized Late Interaction over BERT), presented by Khattab and Zaharia in their paper arXiv:2004.12832 (2020), proposes an intermediate architecture between bi-encoders and cross-encoders. The central idea: encode query and documents separately like bi-encoders, but compute similarity via token-to-token interaction rather than simple global cosine similarity.

In ColBERT, each query and document token receives a contextualized embedding from BERT. At scoring time, the system computes maximum similarity between each query token and all document tokens. The final score is the sum of these maximum similarities. This “late interaction” captures fine matches without the cost of a joint forward pass.
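
Written out, with $E_{q_i}$ the normalized embedding of query token $i$ and $E_{d_j}$ that of document token $j$ (notation chosen here to match the description above), the score is:

```latex
S(q, d) = \sum_{i=1}^{|q|} \max_{1 \le j \le |d|} E_{q_i} \cdot E_{d_j}
```

For a two-token query whose best matches in the document have similarities 0.9 and 0.4, the score is 0.9 + 0.4 = 1.3.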

ColBERT’s advantage is that token embeddings can be pre-computed for each document. Only the query token embeddings need to be computed at inference time. Token-to-token interaction still has a cost, but far less than a full cross-encoder pass. ColBERTv2 and subsequent work optimized storage and computation to make this approach practical at scale.

import torch
from transformers import AutoModel, AutoTokenizer
from typing import List, Tuple

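class ColBERTReranker:
    """
    Reranker based on ColBERT architecture (late interaction).
    Compromise between cross-encoder precision and bi-encoder speed.
    Simplified sketch: AutoModel loads the BERT backbone only, without the
    linear projection used by the reference ColBERT implementation, so
    scores approximate but do not reproduce official ColBERT scores.
    """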

    def __init__(self, model_name: str = "colbert-ir/colbertv2.0"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.model.eval()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

    def encode_text(self, text: str, max_length: int = 256) -> torch.Tensor:
        """Encodes text into token-level embeddings."""
        # No padding: a single text needs none, and [PAD] token embeddings
        # would otherwise leak into the MaxSim computation
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            max_length=max_length,
            truncation=True
        ).to(self.device)

        with torch.no_grad():
            outputs = self.model(**inputs)
            # Normalize embeddings for cosine similarity
            embeddings = torch.nn.functional.normalize(
                outputs.last_hidden_state,
                p=2,
                dim=-1
            )

        return embeddings.squeeze(0)  # [seq_len, hidden_dim]

    def score_late_interaction(
        self,
        query_embeddings: torch.Tensor,
        doc_embeddings: torch.Tensor
    ) -> float:
        """
        Computes ColBERT score via late interaction.
        For each query token, finds max similarity with document tokens.
        """
        # Similarity between all tokens [query_len, doc_len]
        similarities = torch.matmul(query_embeddings, doc_embeddings.T)

        # MaxSim: for each query token, take max over document tokens
        max_sims = similarities.max(dim=1).values

        # Final score = sum of MaxSims
        score = max_sims.sum().item()

        return score

    def rerank_documents(
        self,
        query: str,
        documents: List[str]
    ) -> List[Tuple[str, float]]:
        """Reranks a list of documents for a query."""
        # Encode query once
        query_emb = self.encode_text(query)

        results = []
        for doc in documents:
            doc_emb = self.encode_text(doc)
            score = self.score_late_interaction(query_emb, doc_emb)
            results.append((doc, score))

        results.sort(key=lambda x: x[1], reverse=True)
        return results

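Published benchmarks show that ColBERT achieves performance close to cross-encoders on reranking tasks while being significantly faster. The latency difference becomes particularly marked when document embeddings are pre-computed, enabling near real-time reranking even on large sets. Note that the simplified reranker above encodes each document at query time, so it does not benefit from this saving; a production setup would store token embeddings at indexing time.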
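
The effect of pre-computation can be sketched without any model. In the example below, `PrecomputedLateInteractionIndex` is a hypothetical class and `toy_encode` is a one-hot placeholder for a real token encoder; the point is only to show which encoding work happens at indexing time and which at query time.

```python
from typing import Callable, Dict, List, Tuple

TokenEmbeddings = List[List[float]]


def maxsim(query_emb: TokenEmbeddings, doc_emb: TokenEmbeddings) -> float:
    """Sum over query tokens of the best dot product with any document token."""
    if not doc_emb:
        return 0.0
    return float(sum(
        max(sum(q * d for q, d in zip(q_vec, d_vec)) for d_vec in doc_emb)
        for q_vec in query_emb
    ))


class PrecomputedLateInteractionIndex:
    """Stores document token embeddings once, reuses them for every query."""

    def __init__(self, encode: Callable[[str], TokenEmbeddings]):
        self.encode = encode
        self.doc_embeddings: Dict[str, TokenEmbeddings] = {}

    def add(self, doc_id: str, text: str) -> None:
        # Indexing time: the expensive document encoding happens here
        self.doc_embeddings[doc_id] = self.encode(text)

    def rerank(self, query: str, candidate_ids: List[str]) -> List[Tuple[str, float]]:
        # Query time: only the query is encoded
        query_emb = self.encode(query)
        scored = [
            (doc_id, maxsim(query_emb, self.doc_embeddings[doc_id]))
            for doc_id in candidate_ids
        ]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored


# Toy encoder (assumption): one-hot vector per known word, zeros otherwise
VOCAB = ["vaccine", "storage", "temperature", "pfizer", "hours"]
encode_calls: List[str] = []


def toy_encode(text: str) -> TokenEmbeddings:
    encode_calls.append(text)  # track how often the encoder runs
    return [[1.0 if word == term else 0.0 for term in VOCAB]
            for word in text.lower().split()]


index = PrecomputedLateInteractionIndex(toy_encode)
index.add("a", "pfizer vaccine storage temperature")
index.add("b", "office hours")
ranking = index.rerank("vaccine storage", ["a", "b"])
print(ranking, len(encode_calls))
```

After two documents and one query the encoder has run three times, and each further query adds a single call regardless of how many candidates are reranked.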

LLM reranking exploits the reasoning capabilities of large models

A more recent approach uses an LLM directly for reranking. The principle is simple: ask the language model to order documents by relevance, optionally making its reasoning explicit. When the model ranks the whole candidate list in one prompt, the method is usually called “listwise reranking”; it is also loosely described as “LLM-as-a-judge”. It can achieve high performance on complex queries where semantic reasoning matters more than lexical matching.

The RankGPT paper and similar work explored different prompting strategies for reranking. The most direct approach presents documents to the LLM and asks it to order them from most to least relevant. Variants use pairwise comparisons (“Is document A more relevant than document B for this question?”) or absolute scores (“Rate relevance from 1 to 10”).
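
The pairwise variant can be sketched as a sort driven by a comparison function. In the example below, `rerank_pairwise` is a helper written for this article and `overlap_judge` is a deliberately trivial placeholder: in a real system it would send the pairwise question to an LLM and parse a yes/no answer.

```python
from functools import cmp_to_key
from typing import Callable, List


def rerank_pairwise(
    query: str,
    documents: List[str],
    prefers_first: Callable[[str, str, str], bool],
) -> List[str]:
    """Sorts documents using a judge that compares two documents at a time."""

    def compare(doc_a: str, doc_b: str) -> int:
        if doc_a == doc_b:
            return 0
        # Negative means doc_a should come before doc_b
        return -1 if prefers_first(query, doc_a, doc_b) else 1

    return sorted(documents, key=cmp_to_key(compare))


def overlap_judge(query: str, doc_a: str, doc_b: str) -> bool:
    """Placeholder for an LLM call: prefers the document sharing more query words."""
    terms = set(query.lower().split())

    def shared(doc: str) -> int:
        return len(terms & set(doc.lower().split()))

    return shared(doc_a) >= shared(doc_b)


docs = [
    "office opening hours",
    "vaccine storage temperature limits",
    "vaccine side effects",
]
ordered = rerank_pairwise("vaccine storage temperature", docs, overlap_judge)
print(ordered)
```

A sort needs on the order of n log n comparisons, so with an LLM judge each comparison is one API call. LLM preferences are also not guaranteed to be transitive, in which case the resulting order is only approximate.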

from openai import OpenAI
from typing import List, Tuple
import re

class LLMReranker:
    """
    Reranking via LLM with explicit prompting.
    Suited to cases where complex semantic reasoning is needed.
    """

    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model

    def rerank_listwise(
        self,
        query: str,
        documents: List[str],
        top_n: int = 5
    ) -> List[Tuple[str, int]]:
        """
        Asks the LLM to order documents by relevance.
        Returns top_n documents with their rank.
        """
        # Format documents with identifiers
        formatted_docs = []
        for i, doc in enumerate(documents):
            # Truncate for context
            truncated = doc[:500] + "..." if len(doc) > 500 else doc
            formatted_docs.append(f"[Doc {i+1}]: {truncated}")

        prompt = f"""Given the following question and candidate documents,
order the documents from most relevant to least relevant for answering the question.

Question: {query}

Documents:
{chr(10).join(formatted_docs)}

Return only the document numbers in decreasing relevance order,
separated by commas. For example: 3, 1, 5, 2, 4

Ranking:"""

        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )

        # Parse the response: keep valid, non-repeated document numbers in order
        order_text = (response.choices[0].message.content or "").strip()
        ordered_indices = []
        for n in re.findall(r'\d+', order_text):
            idx = int(n) - 1
            if 0 <= idx < len(documents) and idx not in ordered_indices:
                ordered_indices.append(idx)

        # Build results: rank reflects position among the valid entries
        results = [
            (documents[idx], rank + 1)
            for rank, idx in enumerate(ordered_indices[:top_n])
        ]

        return results

    def rerank_pointwise(
        self,
        query: str,
        documents: List[str]
    ) -> List[Tuple[str, float]]:
        """
        Evaluates each document individually with a relevance score.
        Scores do not depend on document order, but this costs one LLM call per document.
        """
        results = []

        for doc in documents:
            prompt = f"""Evaluate the relevance of the following document for answering the question.

Question: {query}

Document: {doc[:800]}

Give a relevance score from 0 to 10, where:
- 0-2: Not relevant
- 3-4: Marginally relevant
- 5-6: Partially relevant
- 7-8: Highly relevant
- 9-10: Perfectly relevant

Reply only with the numeric score."""

            response = self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0
            )

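            # Extract the first number in the reply; the model may add text such as "8/10"
            reply = response.choices[0].message.content or ""
            match = re.search(r'\d+(?:\.\d+)?', reply)
            score = min(float(match.group()), 10.0) if match else 0.0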

            results.append((doc, score))

        results.sort(key=lambda x: x[1], reverse=True)
        return results

LLM reranking presents distinct advantages and disadvantages. On the advantage side, the LLM can handle complex queries requiring multi-step reasoning, understands nuances and implicit context, and can be guided by domain-specific instructions. On the disadvantage side, token and latency costs far exceed other approaches, results can vary with prompting, and the model can introduce its own biases.

Reranker choice depends on application context and operational constraints

Each reranking approach addresses different needs. The following table summarizes the main characteristics of the three reranker families.

| Criterion | Cross-encoder | ColBERT | LLM reranking |
|---|---|---|---|
| Relative precision | High | High | Variable with prompting |
| Latency (50 docs) | 50-200 ms (GPU) | 20-50 ms (GPU) | 2-10 s |
| Inference cost | Moderate | Low | High |
| Customization | Fine-tuning | Fine-tuning | Prompting |
| Complex understanding | Good | Good | Excellent |
| On-premises deployment | Easy | Easy | Possible but expensive |

For low-latency, high-volume applications, ColBERT or a lightweight cross-encoder (MiniLM) are natural choices. Real-time systems like chatbots or search engines benefit from their speed without sacrificing too much quality.

For applications where precision trumps latency, larger cross-encoders (bge-reranker-large, Cohere Rerank) offer the best performance on standard benchmarks. Internal B2B use cases, document assistants, or compliance systems fall into this category.

LLM reranking suits scenarios where queries are complex and infrequent. Contract analysis, legal research, or technical questions may justify the extra cost if answer quality demands it. At high volume the cost becomes prohibitive, which makes this approach unsuited to consumer applications.

Practical implementation combines multiple strategies based on needs

A mature production system may combine multiple reranking approaches. A common strategy uses a fast reranker (ColBERT or MiniLM) for initial filtering, then applies a more precise reranker to the resulting top-20 if the query warrants it.

from sentence_transformers import CrossEncoder
from typing import List

class MultiStageRerankingPipeline:
    """
    Cascading reranking pipeline to optimize precision and latency.
    Uses a fast reranker then a precise reranker on best candidates.
    """

    def __init__(
        self,
        retriever,
        fast_reranker: CrossEncoder,
        precise_reranker: CrossEncoder,
        initial_top_k: int = 100,
        intermediate_top_k: int = 20,
        final_top_n: int = 5
    ):
        self.retriever = retriever
        self.fast_reranker = fast_reranker
        self.precise_reranker = precise_reranker
        self.initial_top_k = initial_top_k
        self.intermediate_top_k = intermediate_top_k
        self.final_top_n = final_top_n

    def search(self, query: str) -> List[dict]:
        """Executes the complete retrieval and reranking pipeline."""
        # Step 1: Large initial retrieval
        candidates = self.retriever.search(query, top_k=self.initial_top_k)

        # Step 2: First fast reranking
        pairs = [(query, doc["content"]) for doc in candidates]
        fast_scores = self.fast_reranker.predict(pairs)

        filtered_candidates = sorted(
            zip(candidates, fast_scores),
            key=lambda x: x[1],
            reverse=True
        )[:self.intermediate_top_k]

        # Step 3: Precise reranking on best candidates
        filtered_docs = [c[0] for c in filtered_candidates]
        final_pairs = [(query, doc["content"]) for doc in filtered_docs]
        precise_scores = self.precise_reranker.predict(final_pairs)

        final_results = sorted(
            zip(filtered_docs, precise_scores),
            key=lambda x: x[1],
            reverse=True
        )[:self.final_top_n]

        return [
            {**doc, "reranking_score": score}
            for doc, score in final_results
        ]
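
The pipeline above always runs both stages. Whether a query "warrants" the precise stage can be decided in several ways. One hypothetical heuristic, sketched below, skips the second stage when the fast reranker already separates its top candidate clearly from the runner-up. The function name `needs_precise_rerank` and the default `margin` are assumptions made for this example; the margin depends on the score scale of the fast model and would need tuning on real queries.

```python
from typing import List


def needs_precise_rerank(fast_scores: List[float], margin: float = 1.0) -> bool:
    """True when the two best fast-reranker scores are closer than `margin`."""
    if len(fast_scores) < 2:
        return False
    best, runner_up = sorted(fast_scores, reverse=True)[:2]
    return (best - runner_up) < margin


# Clear winner: the fast stage is probably enough
print(needs_precise_rerank([5.2, 1.1, 0.3]))  # False
# Near tie at the top: worth spending the extra latency
print(needs_precise_rerank([2.0, 1.8, 0.1]))  # True
```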

Production monitoring must track several metrics to evaluate reranking effectiveness. Comparing positions before/after reranking (position lift) indicates whether the reranker adds value. Reranking time per query detects performance regressions. User feedback (clicks, reformulations) validates perceived improvement.
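
Position lift can be computed from logged rankings as soon as a relevant document is known for the query, for instance from an annotation or a click. The sketch below assumes such a signal exists; `position_lift` and the document identifiers are illustrative.

```python
from typing import List, Optional


def position_lift(before: List[str], after: List[str], relevant_id: str) -> Optional[int]:
    """
    Number of positions gained by the relevant document after reranking.
    Positive means it moved up; None means it is missing from one ranking.
    """
    if relevant_id not in before or relevant_id not in after:
        return None
    return before.index(relevant_id) - after.index(relevant_id)


retriever_order = ["d7", "d2", "d4", "d9"]
reranked_order = ["d4", "d7", "d2", "d9"]

print(position_lift(retriever_order, reranked_order, "d4"))  # 2
```

Averaged over many queries, a lift close to zero suggests the reranker is not adding value over the retriever's order, while a negative average points to a regression.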

Reranking limitations should not be ignored

Reranking is not a silver bullet. Its first limitation is fundamental: the reranker can only reorder what the retriever found. If the relevant document is not in the initial top-k, no amount of reranking will surface it. An excellent reranker cannot compensate for a poor retriever.

The second limitation concerns long document semantics. Rerankers process passages of a few hundred tokens. If relevant information is diluted in a multi-page document, the retrieved passage may not contain it, and the reranker will evaluate irrelevant content. Upstream chunking remains crucial.
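
As an illustration of what upstream chunking can look like, here is a minimal sketch of fixed-size chunking with overlap. It counts whitespace-separated words as a rough proxy for tokens, which is an assumption; `chunk_words` and its default sizes are illustrative, and real pipelines often split on model tokens or document structure instead.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Splits text into chunks of chunk_size words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0 or not (0 <= overlap < chunk_size):
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last words are already covered
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=2))
```

The overlap reduces the risk that a sentence carrying the answer is cut in half at a chunk boundary, at the price of more passages to index and rerank.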

The third limitation is cost. In production, every millisecond counts. A reranker adds an inference step that consumes compute and latency. For very high-volume systems (millions of queries/day), this cost can become prohibitive. The tradeoff between precision and operational cost must be explicit.
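
A back-of-the-envelope calculation makes the tradeoff explicit. The sketch below assumes 2 million queries per day, 50 candidates per query, and 1.5 ms of GPU time per (query, document) pair, a figure derived from the 50-100 ms for 50 documents quoted earlier. All three numbers are illustrative, and the estimate ignores batching gains and idle time.

```python
def daily_rerank_gpu_hours(
    queries_per_day: int,
    docs_per_query: int,
    ms_per_pair: float,
) -> float:
    """GPU-hours per day spent scoring (query, document) pairs."""
    total_ms = queries_per_day * docs_per_query * ms_per_pair
    return total_ms / 1000.0 / 3600.0


# Assumed workload: 2M queries/day, 50 candidates each, 1.5 ms per pair
hours = daily_rerank_gpu_hours(2_000_000, 50, 1.5)
print(f"{hours:.1f} GPU-hours per day")  # 41.7 GPU-hours per day
```

Under these assumptions reranking alone keeps close to two GPUs busy around the clock. Halving the number of candidates halves the bill, which is why the choice of top-k is as much a cost decision as a quality decision.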

Finally, rerankers trained on MS MARCO or similar datasets may generalize poorly to specific domains. A reranker performing well on generic web queries may underperform on technical, medical, or legal vocabulary. Fine-tuning on domain-specific data then becomes necessary.

What next?

Reranking is an accessible optimization that can substantially improve existing RAG pipelines. Integrating a cross-encoder takes a few lines of code and does not affect the retrieval infrastructure. For teams observing imprecise RAG answers despite correct retrieval, reranking is often the first lever to pull.

Racine AI integrates adaptive reranking strategies in its document processing solutions. Our pipelines combine vector retrieval, multi-stage reranking, and VLM validation to ensure extracted information precisely matches business needs. Reranker selection, configuration, and integration in the overall architecture are part of our document intelligence expertise.

Common questions

What is the difference between a bi-encoder and a cross-encoder for retrieval?

Bi-encoders encode query and document independently then compare their vectors via cosine similarity. Cross-encoders concatenate both texts and process them jointly through a transformer, enabling fine token-level interaction. Cross-encoders are more accurate but slower since each pair requires a full forward pass.

Does reranking always improve RAG results?

Reranking can only reorder what the initial retriever found. If the relevant document is not in the initial top-k, no amount of reranking will surface it. The retriever quality sets the ceiling for reranking effectiveness.

What latency does a cross-encoder reranker add?

With a MiniLM cross-encoder on GPU and 50 documents, expect 50-100ms. On CPU, this rises to several hundred milliseconds. Larger models like bge-reranker-v2-m3 are more accurate but slower.

When should I use LLM-based reranking instead of cross-encoders?

LLM reranking suits scenarios with complex queries requiring multi-step reasoning, where the extra cost and latency (2-10 seconds) are justified by answer quality. It is not suited to high-volume consumer applications due to cost.

How does ColBERT compare to cross-encoders for reranking?

ColBERT achieves performance close to cross-encoders while being significantly faster, especially when document embeddings are pre-computed. The late interaction mechanism captures fine token-level matches without the cost of a full joint forward pass.

Can I combine multiple reranking strategies?

Yes. A common approach uses a fast reranker (ColBERT or MiniLM) for initial filtering from 100 to 20 documents, then a precise reranker on the top-20 for final selection. This cascading approach optimizes both latency and precision.

Do rerankers trained on MS MARCO generalize to domain-specific content?

General rerankers may underperform on specialized domains like medical, legal, or technical content. Fine-tuning on domain-specific annotated data is recommended for optimal performance in these cases.
