Last updated January 15, 2026
Customer support teams handle growing volumes of requests with increasingly fast response expectations. An AI assistant based on RAG (Retrieval-Augmented Generation) architecture enables instant responses to frequent questions while escalating complex cases to human agents.
RAG combines semantic search and text generation
The RAG architecture proposed by Lewis et al. in their foundational paper (arXiv:2005.11401) combines two components. A retrieval system searches for relevant passages in a document base. A language model generates a synthetic response from these passages.
This approach solves the fundamental problem of LLMs: their knowledge is frozen at training date and they can hallucinate false information. By grounding generation in verified documents, RAG produces factually correct and up-to-date responses.
The comprehensive survey by Gao et al. (arXiv:2312.10997) distinguishes three generations of RAG architectures. Naive RAG simply chains retrieval and generation. Advanced RAG adds query pre-processing and response post-processing steps. Modular RAG decomposes the pipeline into reconfigurable blocks.
For a customer support assistant, Advanced RAG offers the best trade-off between complexity and performance.
Technical architecture breaks down into four layers
The ingestion layer transforms documents into vectors
The ingestion pipeline processes product documentation, existing FAQs, resolved ticket histories. Each document goes through several steps.
Parsing extracts text from different formats (PDF, HTML, Markdown, Word). For technical PDFs, a VLM preserves table and diagram structure.
Chunking splits documents into optimally-sized segments. Retrieval research shows 512 to 1024 token chunks work well for customer support. Overlap between chunks (50-100 tokens) preserves context at boundaries.
Embedding converts each chunk to a dense vector. The MTEB benchmark ranks multilingual models. For French, multilingual-e5 models or text-embedding-3-large achieve good scores on retrieval tasks.
Indexing stores vectors in a vector database. pgvector on PostgreSQL offers excellent trade-off between performance and operational simplicity for medium-sized corpora.
The retrieval layer finds relevant passages
When a user asks a question, the system encodes it with the same embedding model as documents. A similarity search returns the k nearest chunks.
The Dense Passage Retrieval paper (arXiv:2004.04906) by Karpukhin et al. shows dense search outperforms BM25 by 9 to 19% on question answering tasks. This improvement comes from embeddings’ ability to capture semantic rather than lexical similarity.
A cross-encoder reranker can refine results. It rescores candidates by evaluating them jointly with the query, capturing finer relevance signals than cosine similarity.
The generation layer produces the response
Retrieved chunks feed the LLM prompt. The system prompt defines assistant behavior: tone, response format, security guidelines.
You are a customer support assistant. Only answer from provided documents.
If the answer is not in documents, indicate you don't know and offer to escalate to a human agent.
Cite your sources when providing factual information.
The LLM generates a natural response by synthesizing chunk information. Model choice depends on deployment constraints. Cloud APIs (Claude, GPT-4) offer best performance. Open source models deployed with vLLM allow full data control.
The integration layer connects channels
The assistant exposes a REST API consumed by different frontends. Web chat widget, mobile app, email integration use the same endpoints.
from fastapi import FastAPI
from pydantic import BaseModel
app = FastAPI()
class Query(BaseModel):
question: str
conversation_id: str | None = None
user_context: dict | None = None
@app.post("/api/chat")
async def chat(query: Query):
# Retrieve relevant chunks
chunks = await retrieve_relevant_chunks(query.question)
# Generate response
response = await generate_response(
question=query.question,
context=chunks,
history=get_conversation_history(query.conversation_id)
)
return {
"answer": response.text,
"sources": [c.metadata for c in chunks],
"confidence": response.confidence_score
}
Classic pitfalls to avoid in implementation
Naive chunking degrades performance
Splitting documents at fixed size without considering structure breaks meaning. A technical procedure cut mid-step becomes incomprehensible. Semantic chunking that respects natural boundaries (titles, paragraphs) improves retrieval quality.
Retrieval without filtering brings noise
Returning k nearest chunks without relevance threshold sometimes includes irrelevant results. A minimum similarity score filters weak candidates. The system must know how to say “I don’t know” rather than generate response from off-topic chunks.
Lack of conversational context frustrates users
A chatbot without memory forces users to repeat context with each message. Conversation history management understands implicit references (“and for that?”, “how do I do it?”).
Responses too long lose users
An LLM can generate text walls when a short answer suffices. Length constraints in prompt and post-processing synthesis produce concise and actionable responses.
Quality metrics guide continuous improvement
Retrieval metrics evaluate source relevance
Recall@k measures percentage of queries for which the correct document appears in top k results. MRR (Mean Reciprocal Rank) weights by position of first relevant result.
These metrics are evaluated on a test dataset with reference question-document pairs. Manual annotation of a real query sample feeds this dataset.
Generation metrics evaluate response quality
Faithfulness verifies the response is faithful to sources. A response containing information absent from retrieved chunks indicates hallucination.
Completeness verifies the response covers all question aspects. A partial response leaves users unsatisfied.
Coherence verifies the response is grammatically correct and fluent.
Product metrics measure business impact
Resolution rate without escalation measures percentage of conversations resolved by chatbot alone. A high rate indicates good knowledge base coverage.
User satisfaction score (CSAT) collected via feedback widget gives user sentiment.
Average resolution time compared to human support quantifies efficiency gain.
On-premise deployment guarantees data confidentiality
For sensitive data, on-premise deployment avoids any transit through external cloud services. Architecture adapts with open source components.
vLLM deploys open source language models with performance close to commercial APIs. 4-bit quantization allows running 7B parameter models on consumer GPUs.
pgvector on PostgreSQL ensures vector search without external service. HNSW index offers acceptable performance up to several million chunks.
Fine-tuning the embedding model on domain data improves retrieval precision. A few thousand annotated question-passage pairs suffice for significant gains.
Current limitations of RAG architecture
Response quality depends on documentation quality
A RAG cannot answer better than its knowledge base. Obsolete, incomplete or poorly structured documents produce poor quality responses. Investment in documentation remains prerequisite.
Complex questions requiring multi-step reasoning remain difficult
A question whose answer requires crossing multiple documents poses problems for classic retrieval. Multi-hop or agentic RAG architectures emerge to address these cases.
Perceived latency impacts user experience
Token-by-token generation introduces perceptible latency. Response streaming mitigates this perception by displaying text progressively.
What’s next?
RAG assistant implementation for customer support follows incremental progression. Start with restricted documentation scope, measure quality, iterate on chunking and prompt engineering before expanding coverage.
For high volumes or sensitive data requiring on-premise deployment, Pi-Edge offers a turnkey solution integrating the complete RAG pipeline.