Last updated January 9, 2026
A RAG pipeline lets you query your internal documents by combining semantic search with LLM generation. This architecture avoids costly fine-tuning while keeping your knowledge up to date. Here is a complete Python implementation with PostgreSQL and pgvector, tested in production on corpora exceeding 100,000 documents.
RAG represents a serious alternative to fine-tuning
Fine-tuning an LLM is expensive in compute, requires high-quality annotated data, and means maintaining several model versions over time. Worse, it cannot absorb new knowledge in real time. RAG addresses both problems, cost and freshness: you index your documents in a vector database, and the LLM generates responses grounded in passages retrieved at query time.
“RAG enables LLMs to access external knowledge without retraining, achieving comparable or superior performance to fine-tuning on knowledge-intensive tasks while being more cost-effective and updatable.”
— Gao et al., RAG Survey (arXiv:2506.00054)
According to the 2025 RAG-QA benchmark, a properly configured RAG system achieves 89% of fine-tuned model performance on question-answering tasks, at one-tenth of the initial development cost.
The architecture breaks down into two distinct phases
A production RAG pipeline comprises an ingestion phase that runs offline and an inference phase that responds to queries in real time.
The ingestion phase starts by extracting raw text from your PDF, DOCX, or HTML files. This text then passes through a chunking module that splits it into segments of 512 to 1024 tokens. Each chunk is vectorized by an embedding model and then stored in the vector database along with its metadata.
The inference phase starts when a user asks a question. The question is vectorized with the same embedding model and compared to the indexed chunks to find the k most similar by cosine similarity. An optional re-ranker then reorders these results by fine-grained relevance. Finally, the LLM synthesizes a coherent response from the retrieved passages.
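The two phases can be sketched end to end in a few lines of standard-library Python. This is only an illustration under loud assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, the index is a plain in-memory list rather than a vector database, and the function names (`ingest`, `retrieve`, `build_prompt`) are hypothetical. The LLM call itself is left out.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in for an embedding model: a unit-normalized bag-of-words vector."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two unit-normalized sparse vectors (a dot product)."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def ingest(chunks: List[str]) -> List[Tuple[str, Vector]]:
    """Ingestion phase: vectorize every chunk once, offline."""
    return [(chunk, toy_embed(chunk)) for chunk in chunks]


def retrieve(index: List[Tuple[str, Vector]], question: str, k: int = 2) -> List[str]:
    """Inference phase: vectorize the question and keep the k closest chunks."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, passages: List[str]) -> str:
    """Assemble the context an LLM would receive (the LLM call is omitted)."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only the passages below.\n{context}\n\nQuestion: {question}"


index = ingest([
    "pgvector adds vector similarity search to PostgreSQL.",
    "Chunk overlap keeps context across chunk boundaries.",
    "HNSW is a graph index for approximate nearest neighbor search.",
])
question = "How does pgvector search PostgreSQL?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
```

In a real deployment each of these pieces is replaced by a production component, but the data flow between the two phases stays the same.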
Configuring PostgreSQL with the pgvector extension
pgvector adds native vector support and similarity search to PostgreSQL. For corpora under 5 million vectors, its performance rivals Pinecone without the associated cloud costs.
```python
import psycopg2
from pgvector.psycopg2 import register_vector


def setup_database():
    """Configure PostgreSQL with pgvector extension."""
    conn = psycopg2.connect(
        host="localhost",
        database="rag_db",
        user="rag_user",
        password="your_secure_password"
    )

    with conn.cursor() as cur:
        # Enable pgvector before creating any vector column.
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")

        # One row per chunk; 384 dimensions matches the MiniLM embedding models.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS documents (
                id SERIAL PRIMARY KEY,
                content TEXT NOT NULL,
                embedding vector(384),
                metadata JSONB,
                source_file VARCHAR(500),
                chunk_index INTEGER,
                created_at TIMESTAMP DEFAULT NOW()
            )
        """)

        # HNSW index for approximate nearest-neighbor search by cosine distance.
        cur.execute("""
            CREATE INDEX IF NOT EXISTS documents_embedding_idx
            ON documents
            USING hnsw (embedding vector_cosine_ops)
            WITH (m = 16, ef_construction = 64)
        """)

    conn.commit()

    # Register the vector type adapter once the extension exists.
    register_vector(conn)
    return conn
```
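The HNSW index parameters deserve some explanation. The m=16 parameter sets the maximum number of connections per node in each layer of the graph. A higher value improves recall but consumes more memory and lengthens index builds. The ef_construction=64 parameter sets the size of the candidate list explored while the graph is built: a higher value produces a better-quality index at the cost of build time. Both values are pgvector's defaults.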
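The index is built with `m` and `ef_construction`, but recall can also be traded against latency at query time through pgvector's `hnsw.ef_search` setting (40 by default). The sketch below shows one possible retrieval helper against the `documents` table; the function name `search_similar` and its defaults are assumptions, not part of the original implementation. It expects a psycopg2 cursor on a connection where `register_vector` was called, and a NumPy array as `query_embedding`.

```python
from typing import Any, List, Tuple


def search_similar(cur: Any, query_embedding: Any,
                   top_k: int = 10, ef_search: int = 40) -> List[Tuple]:
    """Return the top_k chunks closest to query_embedding by cosine distance.

    Hypothetical helper: `cur` is assumed to be a psycopg2 cursor on a
    connection where register_vector() was called, and `query_embedding`
    a NumPy array with the same dimension as the `embedding` column.
    """
    # Larger ef_search explores more graph candidates: better recall, slower query.
    # The value is coerced to int before being formatted into the statement.
    cur.execute(f"SET hnsw.ef_search = {int(ef_search)}")

    # <=> is pgvector's cosine distance, so similarity is 1 - distance.
    # Ordering by the raw distance expression lets the HNSW index be used.
    cur.execute(
        """
        SELECT id, content, 1 - (embedding <=> %s) AS similarity
        FROM documents
        ORDER BY embedding <=> %s
        LIMIT %s
        """,
        (query_embedding, query_embedding, top_k),
    )
    return cur.fetchall()
```

A reasonable starting point is to keep the default and raise `ef_search` only if measured recall is too low for your queries.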
Intelligent chunking preserves semantic context
Text splitting into chunks is a critical step. Chunks that are too small lose the context needed for understanding. Chunks that are too large introduce noise into retrieval and dilute relevant information.
```python
import re
from typing import Dict, List, Optional


class SemanticChunker:
    """Intelligent chunking with context preservation."""

    def __init__(
        self,
        chunk_size: int = 768,
        overlap: int = 100,
        min_chunk_size: int = 100
    ):
        # Sizes are counted in whitespace-separated words, a rough proxy for tokens.
        self.chunk_size = chunk_size
        self.overlap = overlap
        self.min_chunk_size = min_chunk_size

    def chunk_document(self, text: str, metadata: Optional[Dict] = None) -> List[Dict]:
        """Split a document into chunks with metadata."""
        # Collapse runs of blank lines and repeated spaces.
        text = re.sub(r'\n{3,}', '\n\n', text)
        text = re.sub(r' {2,}', ' ', text)

        paragraphs = text.split('\n\n')
        chunks = []
        current_chunk = ""
        current_tokens = 0

        for para in paragraphs:
            para_tokens = len(para.split())

            if current_tokens + para_tokens <= self.chunk_size:
                # The paragraph still fits: keep accumulating.
                current_chunk += para + "\n\n"
                current_tokens += para_tokens
            else:
                # Budget exceeded: emit the current chunk if it is large enough.
                if current_tokens >= self.min_chunk_size:
                    chunks.append({
                        'content': current_chunk.strip(),
                        'metadata': metadata or {},
                        'token_count': current_tokens
                    })

                # Start the next chunk with the tail of the previous one.
                overlap_text = self._get_overlap(current_chunk)
                current_chunk = overlap_text + para + "\n\n"
                current_tokens = len(current_chunk.split())

        # Emit the trailing chunk; anything shorter than min_chunk_size is dropped.
        if current_tokens >= self.min_chunk_size:
            chunks.append({
                'content': current_chunk.strip(),
                'metadata': metadata or {},
                'token_count': current_tokens
            })

        return chunks

    def _get_overlap(self, text: str) -> str:
        """Extract last tokens for overlap."""
        words = text.split()
        if len(words) <= self.overlap:
            return text
        return ' '.join(words[-self.overlap:]) + ' '
```
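This implementation groups whole paragraphs into chunks and adds overlap between them. Sizes are counted in whitespace-separated words, which only approximates the token count of a real tokenizer. The 100-word overlap between consecutive chunks means that information sitting at the boundary between two chunks appears in both, so it is not lost at retrieval time.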
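The effect of overlap is easiest to see on a stripped-down variant. The sketch below is not the paragraph-based chunker itself but a simplified fixed-size word window; the function name `window_chunks` and the tiny sizes are chosen purely for illustration.

```python
from typing import List


def window_chunks(words: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Slide a window of chunk_size words, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once the window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = window_chunks(words, chunk_size=4, overlap=2)
# Each chunk starts with the last two words of the previous one.
```

With these values the first chunk is `w0..w3` and the second `w2..w5`, so a sentence that straddles the boundary is still fully contained in at least one chunk.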
Sentence Transformers efficiently vectorizes your texts
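For French technical content, the paraphrase-multilingual-MiniLM-L12-v2 model offers an excellent trade-off between quality and speed. For English, all-MiniLM-L6-v2 remains a standard reference. Both models produce 384-dimensional vectors, which is why the embedding column is declared as vector(384).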
```python
from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer


class EmbeddingService:
    """Text vectorization service."""

    def __init__(self, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        # Must match the dimension of the vector column (384 for MiniLM models).
        self.dimension = self.model.get_sentence_embedding_dimension()

    def embed_texts(self, texts: List[str], batch_size: int = 32) -> np.ndarray:
        """Vectorize a list of texts by batch."""
        embeddings = self.model.encode(
            texts,
            batch_size=batch_size,
            show_progress_bar=True,
            convert_to_numpy=True,
            normalize_embeddings=True
        )
        return embeddings

    def embed_query(self, query: str) -> np.ndarray:
        """Vectorize a single query."""
        return self.model.encode(
            query,
            convert_to_numpy=True,
            normalize_embeddings=True
        )
```
The normalize_embeddings=True option scales every vector to unit norm. For unit vectors, the dot product is equal to the cosine similarity, so the cheaper dot product can be used for comparisons without changing the ranking.
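The equivalence follows from the definition of cosine similarity, which divides the dot product by the two norms; when both norms are 1, the division disappears. A minimal check in plain Python, with hypothetical helper names and two arbitrary 2D vectors:

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
cos_ab = cosine_similarity(a, b)            # 24 / (5 * 5)
dot_unit = dot(normalize(a), normalize(b))  # same value, no division at query time
```

Note that the schema in this article indexes with vector_cosine_ops and queries with `<=>`. To actually benefit from the cheaper comparison in pgvector, you would index with vector_ip_ops and query with `<#>`, which returns the negative inner product.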
Benchmarks on our test corpus reveal solid performance
On a corpus of 50,000 technical documents mixing PDF, DOCX, and TXT, we measured the following performance on a MacBook M1 Pro.
| Metric | Value | Configuration |
|---|---|---|
| Ingestion time | 45 minutes | 50K documents |
| Embedding throughput | 1200 docs/min | batch_size=32 |
| Retrieval latency p50 | 28ms | pgvector HNSW, top_k=10 |
| Retrieval latency p99 | 67ms | same |
| Re-ranking latency | 18ms | MiniLM cross-encoder |
| Recall@10 | 0.87 | Test set 1000 queries |
| Precision@5 | 0.72 | After re-ranking |
The bottleneck clearly lies in LLM generation at 1.2 seconds on average for gpt-4o-mini and 2.8 seconds for gpt-4o.
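For readers who want to reproduce this kind of table on their own corpus, the metrics can be computed as follows. The source does not state exactly how its figures were derived, so this sketch uses the usual definitions (hits over relevant documents for recall, hits over k for precision, nearest-rank percentiles for latency); the function names and the sample data are made up.

```python
import math
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[int], relevant: Set[int], k: int) -> float:
    """Fraction of the relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def precision_at_k(retrieved: Sequence[int], relevant: Set[int], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=50 for p50 and pct=99 for p99."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up example: one query, five retrieved ids, three relevant ids.
retrieved = [7, 3, 9, 1, 4]
relevant = {3, 4, 8}
recall = recall_at_k(retrieved, relevant, k=5)
precision = precision_at_k(retrieved, relevant, k=5)

# Made-up latencies in milliseconds.
latencies_ms = [20.0, 25.0, 28.0, 30.0, 67.0]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

In practice these per-query values are averaged over the whole test set, and the latencies are collected from enough queries for the p99 to be meaningful.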
Hybrid search improves recall by 5 to 8%
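Combining lexical search with semantic embedding search produces better results on technical corpora. The query below uses PostgreSQL's built-in full-text ranking (ts_rank) for the lexical part. Note that ts_rank is not a BM25 implementation, so it should be read as a lightweight stand-in for BM25 rather than an equivalent.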
```python
# Assumes `cur` is an open cursor, `query` the raw question string,
# `query_emb` its embedding, and `top_k` the number of rows to return.
cur.execute("""
    SELECT id, content,
        (0.7 * (1 - (embedding <=> %s))) +
        (0.3 * ts_rank(to_tsvector('english', content), plainto_tsquery('english', %s)))
        AS hybrid_score
    FROM documents
    WHERE to_tsvector('english', content) @@ plainto_tsquery('english', %s)
       OR (embedding <=> %s) < 0.5
    ORDER BY hybrid_score DESC
    LIMIT %s
""", (query_emb, query, query, query_emb, top_k))
```
This query combines a semantic score weighted at 70% with a lexical score weighted at 30%. These proportions work well for most use cases but should be tuned on your own corpus, all the more so because ts_rank values are not on the same scale as cosine similarity.
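To see why the weights matter, the same formula can be replayed in Python on two candidates. This is only a sketch: the function names are hypothetical and the distance and rank values are made up to show a case where the ordering flips.

```python
from typing import Dict, List, Tuple


def hybrid_score(cosine_distance: float, lexical_rank: float,
                 semantic_weight: float = 0.7) -> float:
    """Weighted sum of semantic similarity (1 - distance) and lexical rank."""
    if not 0.0 <= semantic_weight <= 1.0:
        raise ValueError("semantic_weight must be between 0 and 1")
    semantic = 1.0 - cosine_distance
    return semantic_weight * semantic + (1.0 - semantic_weight) * lexical_rank


def rank_candidates(candidates: Dict[str, Tuple[float, float]],
                    semantic_weight: float) -> List[str]:
    """Order candidate names by descending hybrid score."""
    return sorted(
        candidates,
        key=lambda name: hybrid_score(*candidates[name], semantic_weight),
        reverse=True,
    )


# name -> (cosine distance, lexical rank); values are illustrative only.
candidates = {
    "A": (0.2, 0.0),  # semantically close, no keyword match
    "B": (0.4, 0.9),  # semantically further, strong keyword match
}
order_70 = rank_candidates(candidates, semantic_weight=0.7)
order_90 = rank_candidates(candidates, semantic_weight=0.9)
```

At a 70/30 split the keyword-heavy candidate B comes first, while at 90/10 the semantically closer candidate A wins. Sweeping the weight against a labeled query set is a simple way to pick a value for a given corpus.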
Going further
This implementation covers the fundamentals of a production-ready RAG. Several directions are worth exploring from here.
Agentic RAG adds agents capable of reformulating queries, choosing relevant sources, or validating responses. The paper arXiv:2501.09136 details this approach.
Multimodal RAG integrates images and tables through VLMs like SmolVLM or Qwen2-VL, which is particularly useful for illustrated technical documents.
Continuous evaluation with tools like RAGAS lets you monitor system quality in production and detect regressions.
Racine AI offers Pi-Search, a RAG solution deployable on-premise for companies with data sovereignty constraints. Contact us for a demonstration on your documents.