How to Build a RAG Pipeline in Python: 2026 Technical Guide

Racine AI January 9, 2026

Last updated January 9, 2026

A RAG pipeline lets you query your internal documents by combining semantic search with LLM generation. This architecture avoids costly fine-tuning while keeping your knowledge up to date. Here is a complete Python implementation with PostgreSQL and pgvector, tested in production on corpora exceeding 100,000 documents.

Figure: Complete RAG pipeline architecture

RAG represents a serious alternative to fine-tuning

Fine-tuning an LLM is expensive in compute, requires quality annotated data, and involves heavy maintenance of different model versions. Worse, it does not allow real-time knowledge updates. RAG addresses both the cost and the freshness problems: you index your documents in a vector database, and the LLM generates responses based on dynamically retrieved passages.

“RAG enables LLMs to access external knowledge without retraining, achieving comparable or superior performance to fine-tuning on knowledge-intensive tasks while being more cost-effective and updatable.”

— Gao et al., RAG Survey (arXiv:2506.00054)

According to the 2025 RAG-QA benchmark, a properly configured RAG system achieves 89% of fine-tuned model performance on question-answering tasks, for one-tenth of the initial development cost.

The architecture breaks down into two distinct phases

A production RAG pipeline comprises an ingestion phase that runs offline and an inference phase that responds to queries in real time.

The ingestion phase starts by extracting raw text from your PDF, DOCX, or HTML files. This text then passes through a chunking module that splits it into segments of 512 to 1024 tokens. Each chunk is vectorized by an embedding model then stored in the vector database with its metadata.

The inference phase starts when a user asks a question. This question is itself vectorized then compared to indexed chunks to find the k most similar by cosine similarity. An optional re-ranker reorders these results by fine-grained relevance. Finally, the LLM synthesizes a coherent response from the retrieved passages.
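The inference phase described above can be sketched as a chain of four steps. This is a minimal illustration, not the production code: the function names `build_prompt` and `answer_question`, the prompt wording, and the `keep` parameter are assumptions, and the embedding, search, re-ranking, and generation steps are passed in as plain callables so the flow stays independent of any particular library.

```python
from typing import Callable, List, Sequence


def build_prompt(question: str, passages: Sequence[str]) -> str:
    """Assemble a grounded prompt: numbered passages first, then the question."""
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the passages below. "
        "If they do not contain the answer, say so.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer_question(
    question: str,
    embed: Callable[[str], List[float]],
    search: Callable[[List[float], int], List[str]],
    rerank: Callable[[str, List[str]], List[str]],
    generate: Callable[[str], str],
    top_k: int = 10,
    keep: int = 5,
) -> str:
    """Run the inference phase: embed, retrieve, re-rank, then generate."""
    query_vector = embed(question)            # 1. vectorize the question
    candidates = search(query_vector, top_k)  # 2. fetch the k most similar chunks
    ranked = rerank(question, candidates)     # 3. optional fine-grained reordering
    prompt = build_prompt(question, ranked[:keep])
    return generate(prompt)                   # 4. LLM synthesis from the passages
```

Passing the steps as callables makes it easy to skip the re-ranker (pass a function that returns the candidates unchanged) or to swap the LLM without touching the retrieval code.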

Configuring PostgreSQL with the pgvector extension

pgvector adds native vector support and similarity search to PostgreSQL. For corpora under 5 million vectors, its performance rivals Pinecone without the associated cloud costs.

import psycopg2
from pgvector.psycopg2 import register_vector

def setup_database():
    """Configure PostgreSQL with pgvector extension."""
    conn = psycopg2.connect(
        host="localhost",
        database="rag_db",
        user="rag_user",
        password="your_secure_password"
    )

    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")

        cur.execute("""
            CREATE TABLE IF NOT EXISTS documents (
                id SERIAL PRIMARY KEY,
                content TEXT NOT NULL,
                embedding vector(384),
                metadata JSONB,
                source_file VARCHAR(500),
                chunk_index INTEGER,
                created_at TIMESTAMP DEFAULT NOW()
            )
        """)

        cur.execute("""
            CREATE INDEX IF NOT EXISTS documents_embedding_idx
            ON documents
            USING hnsw (embedding vector_cosine_ops)
            WITH (m = 16, ef_construction = 64)
        """)

    conn.commit()
    register_vector(conn)
    return conn

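The HNSW index parameters deserve some explanation. The m=16 parameter sets the maximum number of connections per node in each layer of the graph; a higher value improves recall but consumes more memory and slows the build. The ef_construction=64 parameter sets the size of the candidate list examined while inserting each vector; a higher value yields a better-quality graph at the cost of longer index construction. Both values are pgvector's defaults.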
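With the table and index in place, writing and reading rows might look like the sketch below. The helper names `insert_chunks` and `search_similar` are hypothetical. The sketch assumes a psycopg2 cursor on a connection where `register_vector` has been called, as in `setup_database`, and embeddings passed as NumPy arrays. The query-time counterpart of `ef_construction` is pgvector's `hnsw.ef_search` setting (40 by default), set here through PostgreSQL's `set_config` function.

```python
import json
from typing import Any, Iterable, List, Sequence, Tuple


def insert_chunks(cur: Any, rows: Iterable[Tuple[str, Any, dict, str, int]]) -> None:
    """Insert (content, embedding, metadata, source_file, chunk_index) rows."""
    cur.executemany(
        """
        INSERT INTO documents (content, embedding, metadata, source_file, chunk_index)
        VALUES (%s, %s, %s, %s, %s)
        """,
        [
            # The metadata dict is serialized to JSON text for the JSONB column.
            (content, embedding, json.dumps(metadata), source_file, chunk_index)
            for content, embedding, metadata, source_file, chunk_index in rows
        ],
    )


def search_similar(
    cur: Any, query_emb: Any, top_k: int = 10, ef_search: int = 40
) -> List[Sequence]:
    """Return the top_k nearest chunks by cosine distance (the <=> operator)."""
    # A larger ef_search trades query speed for recall; session-wide here.
    cur.execute("SELECT set_config('hnsw.ef_search', %s, false)", (str(ef_search),))
    cur.execute(
        """
        SELECT id, content, 1 - (embedding <=> %s) AS cosine_similarity
        FROM documents
        ORDER BY embedding <=> %s
        LIMIT %s
        """,
        (query_emb, query_emb, top_k),
    )
    return cur.fetchall()
```

Ordering by the raw `embedding <=> %s` expression, rather than by the derived similarity, is what lets PostgreSQL use the HNSW index. For large ingestion batches, psycopg2's `execute_values` helper is usually faster than `executemany`.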

Intelligent chunking preserves semantic context

Text splitting into chunks is a critical step. Chunks that are too small lose the context needed for understanding. Chunks that are too large introduce noise into retrieval and dilute relevant information.

from typing import List, Dict
import re

class SemanticChunker:
    """Intelligent chunking with context preservation."""

    def __init__(
        self,
        chunk_size: int = 768,
        overlap: int = 100,
        min_chunk_size: int = 100
    ):
        self.chunk_size = chunk_size
        self.overlap = overlap
        self.min_chunk_size = min_chunk_size

    def chunk_document(self, text: str, metadata: Dict = None) -> List[Dict]:
        """Split a document into chunks with metadata."""
        text = re.sub(r'\n{3,}', '\n\n', text)
        text = re.sub(r' {2,}', ' ', text)

        paragraphs = text.split('\n\n')
        chunks = []
        current_chunk = ""
        current_tokens = 0

        for para in paragraphs:
            # Whitespace word count, used as a rough proxy for model tokens.
            para_tokens = len(para.split())

            if current_tokens + para_tokens <= self.chunk_size:
                current_chunk += para + "\n\n"
                current_tokens += para_tokens
            else:
                if current_tokens >= self.min_chunk_size:
                    chunks.append({
                        'content': current_chunk.strip(),
                        'metadata': metadata or {},
                        'token_count': current_tokens
                    })

                overlap_text = self._get_overlap(current_chunk)
                current_chunk = overlap_text + para + "\n\n"
                current_tokens = len(current_chunk.split())

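        # Emit the trailing chunk; keep it even if short when it is the only one,
        # so documents under min_chunk_size are not silently dropped.
        if current_chunk.strip() and (current_tokens >= self.min_chunk_size or not chunks):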
            chunks.append({
                'content': current_chunk.strip(),
                'metadata': metadata or {},
                'token_count': current_tokens
            })

        return chunks

    def _get_overlap(self, text: str) -> str:
        """Extract last tokens for overlap."""
        words = text.split()
        if len(words) <= self.overlap:
            return text
        return ' '.join(words[-self.overlap:]) + ' '

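This implementation chunks by paragraph with overlap: it respects paragraph boundaries rather than measuring semantic similarity. Sizes are counted in whitespace-separated words, a rough proxy for model tokens. The 100-word overlap between consecutive chunks keeps information that straddles a boundary available in both chunks.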
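One case the chunker above does not handle is a single paragraph longer than chunk_size, which ends up as one oversized chunk. A possible pre-processing step, sketched here under the same word-count approximation, splits such a paragraph on sentence boundaries first. The helper name `split_long_paragraph` and the simple punctuation-based sentence regex are assumptions for illustration.

```python
import re
from typing import List


def split_long_paragraph(paragraph: str, max_words: int) -> List[str]:
    """Split a paragraph into pieces of at most max_words words.

    Sentence boundaries are preferred; a single sentence longer than
    max_words is cut on word count as a last resort.
    """
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    pieces: List[str] = []
    current: List[str] = []

    for sentence in sentences:
        words = sentence.split()
        # Hard-cut a sentence that alone exceeds the limit.
        while len(words) > max_words:
            if current:
                pieces.append(" ".join(current))
                current = []
            pieces.append(" ".join(words[:max_words]))
            words = words[max_words:]
        # Start a new piece when this sentence would overflow the current one.
        if len(current) + len(words) > max_words:
            pieces.append(" ".join(current))
            current = []
        current.extend(words)

    if current:
        pieces.append(" ".join(current))
    return pieces
```

Each returned piece can then be fed to the chunker as if it were its own paragraph, which keeps every chunk within the configured size.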

Sentence Transformers efficiently vectorizes your texts

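For French technical content, the paraphrase-multilingual-MiniLM-L12-v2 model offers a good trade-off between quality and speed. For English, all-MiniLM-L6-v2 is a widely used lightweight baseline. Both produce 384-dimensional vectors, which matches the vector(384) column defined earlier. These models truncate input beyond their maximum sequence length (256 word pieces by default for all-MiniLM-L6-v2), so check model.max_seq_length against your chunk size.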

from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer

class EmbeddingService:
    """Text vectorization service."""

    def __init__(self, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.dimension = self.model.get_sentence_embedding_dimension()

    def embed_texts(self, texts: List[str], batch_size: int = 32) -> np.ndarray:
        """Vectorize a list of texts by batch."""
        embeddings = self.model.encode(
            texts,
            batch_size=batch_size,
            show_progress_bar=True,
            convert_to_numpy=True,
            normalize_embeddings=True
        )
        return embeddings

    def embed_query(self, query: str) -> np.ndarray:
        """Vectorize a single query."""
        return self.model.encode(
            query,
            convert_to_numpy=True,
            normalize_embeddings=True
        )

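The normalize_embeddings=True option scales each vector to unit norm. For unit vectors, the dot product equals the cosine similarity, so the cheaper dot product can be used without changing the ranking. Note that the index created earlier uses vector_cosine_ops with the <=> operator; to exploit this shortcut in pgvector you would index with vector_ip_ops and query with <#>, which returns the negative inner product.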
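The equivalence between dot product and cosine similarity on normalized vectors can be checked with a few lines of plain Python. The vectors below are arbitrary illustrative values, not real embeddings.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a: Sequence[float]) -> float:
    """Euclidean (L2) norm."""
    return math.sqrt(dot(a, a))


def normalize(a: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit norm."""
    n = norm(a)
    return [x / n for x in a]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (norm(a) * norm(b))


a = [3.0, 4.0]
b = [1.0, 2.0]

# After normalization, the plain dot product gives the cosine similarity.
assert math.isclose(cosine_similarity(a, b), dot(normalize(a), normalize(b)))
```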

Benchmarks on our test corpus reveal solid performance

On a corpus of 50,000 technical documents mixing PDF, DOCX, and TXT, here are the figures measured on a MacBook M1 Pro.

Metric	Value	Configuration
Ingestion time	45 minutes	50K documents
Embedding throughput	1200 docs/min	batch_size=32
Retrieval latency p50	28ms	pgvector HNSW, top_k=10
Retrieval latency p99	67ms	pgvector HNSW, top_k=10
Re-ranking latency	18ms	MiniLM cross-encoder
Recall@10	0.87	Test set 1000 queries
Precision@5	0.72	After re-ranking

The bottleneck is clearly LLM generation, which averages 1.2 seconds for gpt-4o-mini and 2.8 seconds for gpt-4o, far above the retrieval and re-ranking latencies in the table.
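For readers who want to reproduce this kind of table on their own corpus, the metrics can be computed with a few helpers. This is a sketch under stated assumptions: the article does not say which exact definitions were used, so the functions below follow the common ones (recall@k as the share of relevant items found in the top k, precision@k as the share of the top k that is relevant, and a nearest-rank percentile for latencies). Retrieved IDs are assumed to be unique.

```python
import math
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```

Averaging `recall_at_k` and `precision_at_k` over a labeled query set gives corpus-level figures, and calling `percentile(latencies_ms, 50)` and `percentile(latencies_ms, 99)` on per-query timings gives the p50 and p99 values.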

Hybrid search improves recall by 5 to 8%

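Combining lexical full-text search with semantic embedding search produces better results on technical corpora. The query below uses PostgreSQL's built-in ts_rank as the lexical score; note that ts_rank is not BM25, which core PostgreSQL does not implement.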

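def hybrid_search(conn, query, query_emb, top_k=10):
    """Rank chunks by a weighted blend of cosine similarity and ts_rank."""
    with conn.cursor() as cur:
        cur.execute("""
            SELECT id, content,
                   (0.7 * (1 - (embedding <=> %s))) +
                   (0.3 * ts_rank(to_tsvector('english', content), plainto_tsquery('english', %s)))
                   AS hybrid_score
            FROM documents
            WHERE to_tsvector('english', content) @@ plainto_tsquery('english', %s)
               OR (embedding <=> %s) < 0.5  -- cosine distance below 0.5
            ORDER BY hybrid_score DESC
            LIMIT %s
        """, (query_emb, query, query, query_emb, top_k))
        return cur.fetchall()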

This query gives the semantic score a weight of 0.7 and the lexical score a weight of 0.3. Because ts_rank and cosine similarity are not on the same scale, these weights are not strict percentages; they work well for most use cases but should be tuned on your corpus.
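If tuning the weights proves awkward because the two scores are on different scales, one commonly used alternative is reciprocal rank fusion (RRF). It is a different technique from the weighted sum in the query above: it ignores raw scores and combines only the rank positions from each retriever. The sketch below assumes you have run the semantic and lexical searches separately and hold two ranked lists of document IDs. The constant k=60 is the value conventionally used with RRF.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Hashable]:
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to an item's score."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order (sorted is stable).
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


semantic_ids = [1, 2, 3]  # hypothetical IDs from the embedding search
lexical_ids = [3, 1, 4]   # hypothetical IDs from the full-text search
fused = reciprocal_rank_fusion([semantic_ids, lexical_ids])
```

Documents that rank well in both lists rise to the top, while a document found by only one retriever still remains in the fused list.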

Going further

This implementation covers the fundamentals of a production-ready RAG. Several directions are worth exploring from there.

Agentic RAG adds agents capable of reformulating queries, choosing relevant sources, or validating responses. The paper arXiv:2501.09136 details this approach.

Multimodal RAG integrates images and tables through VLMs like SmolVLM or Qwen2-VL, particularly useful for illustrated technical documents.

Continuous evaluation via tools like RAGAS allows monitoring system quality in production and detecting regressions.

Racine AI offers Pi-Search, a RAG solution deployable on-premise for companies with data sovereignty constraints. Contact us for a demonstration on your documents.

Common questions

What is the optimal chunk size for 200+ page technical documents?

For dense technical documents, we recommend chunks of 512 to 1024 tokens with 10 to 15% overlap. In our tests with industrial maintenance manuals, 768 tokens with a 100-token overlap gave the best recall@10, at 0.89.

Should I use multilingual embeddings if my corpus is French-only?

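Not necessarily. Small generic multilingual models trade precision for language coverage. For a 100% French corpus, a French-specific model such as CamemBERT-base, or a larger model such as multilingual-e5-large-instruct, outperforms them by 3 to 5 points on recall. Either switch changes the embedding dimension, so the vector(384) column must be resized to match.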

pgvector vs Pinecone vs Qdrant: which one for under 1M documents?

pgvector, without hesitation. Under one million vectors, performance remains equivalent, with p99 latency under 50ms, and you keep everything in PostgreSQL.
