Back to blog
Guides

AI Sovereignty: How to Deploy LLMs Without Sending Your Data to the Cloud

Racine AI

Last updated January 9, 2026

On-premise deployment of language models gives you full control over your sensitive data. Open source models like Llama 4, Mistral 3 and Qwen 2.5 deliver competitive performance compared to cloud APIs, and can be deployed on your own infrastructure without sharing data with third parties.

Cloud APIs Raise Significant Sovereignty Concerns

Using APIs from providers like OpenAI, Anthropic or Google means your prompts and documents pass through their servers. Even when these providers contractually commit to not using your data for training (in their enterprise API tiers), several risks remain.

The legal risk concerns data localization. OpenAI’s servers are located in the United States, subject to the Cloud Act which allows US authorities to access data held by American companies, including data stored outside US territory. For sensitive data governed by GDPR or trade secret protections, this exposure can be problematic.

“The CLOUD Act allows U.S. law enforcement to compel U.S.-based technology companies to provide requested data stored on servers regardless of whether the data are stored in the U.S. or on foreign soil.”

— Congressional Research Service, “The CLOUD Act”, 2018

The operational risk concerns dependency. An API outage, a pricing change, or a modification to terms of service can impact your operations overnight. The November 2023 OpenAI incident demonstrated the vulnerability of systems that rely 100% on an external API.

The strategic risk concerns intellectual property. The prompts you send can reveal your workflows, your business processes, and your client data. Even without reuse for training, this information leaves your perimeter of control.

The Open Source Model Landscape Has Changed Dramatically

The situation has evolved considerably since 2023. Open source models now rival proprietary APIs on many tasks. Here are some important milestones.

Llama 4 (Meta, 2025-2026) offers models ranging from 8B to 405B parameters under a permissive commercial license. The Llama 4 Scout model with a 10 million token context window opens unprecedented possibilities for document processing (Meta AI Blog, 2026).

Mistral 3 Large (Mistral AI, 2025) is a MoE model with 675B total parameters developed by a French company. Mistral models benefit from geographic and regulatory proximity for European enterprises (Mistral AI Documentation).

Qwen 2.5 (Alibaba, 2024-2025) delivers high-performing models up to 72B parameters, particularly well-suited for multilingual use including French.

DeepSeek R1 (DeepSeek, 2025) demonstrated GPT-4 level performance on mathematical reasoning with open source models.

All of these models can be downloaded and run on your own servers, with no data sent externally.

Infrastructure Requirements Depend on Model Size

Sizing the infrastructure is often the first obstacle. Modern LLMs are demanding in terms of GPU memory.

ModelParametersMinimum VRAM (FP16)VRAM with Quantization (INT4)
Llama 4 8B8B16 GB6 GB
Mistral 3 7B7B14 GB5 GB
Qwen 2.5 14B14B28 GB10 GB
Llama 4 70B70B140 GB40 GB
Mistral 3 Large675B (MoE)~300 GB active~100 GB

Estimates based on the standard formula: VRAM = Parameters x 2 bytes (FP16) or x 0.5 bytes (INT4). Actual requirements vary by implementation.

For a 7-8B model with 4-bit quantization, a single RTX 4090 (24 GB VRAM) is sufficient. For 70B+ models, you need multiple professional-grade cards (A100, H100) or multi-GPU configurations.

CPU inference is possible via llama.cpp but with response times 10 to 50 times slower than GPU. Suitable for development or low volumes, but not for interactive production workloads.

Serving Frameworks Simplify Deployment

Several frameworks streamline on-premise LLM deployment. Each has its strengths.

vLLM (UC Berkeley) optimizes throughput through PagedAttention, which efficiently manages the KV cache memory. Ideal for high-volume scenarios with many concurrent requests.

# Deployment example with vLLM
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-8B-Instruct",
    tensor_parallel_size=1,  # Number of GPUs
    dtype="half",  # FP16
    quantization="awq",  # Quantization to reduce VRAM
    max_model_len=8192
)

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=1024
)

def generate(prompt: str) -> str:
    outputs = llm.generate([prompt], sampling_params)
    return outputs[0].outputs[0].text

Ollama simplifies installation for teams that do not want to manage technical details. A single command downloads and launches a model. Convenient for prototyping, but less configurable for production.

# Installation and launch with Ollama
ollama pull llama4:8b
ollama run llama4:8b "Summarize this technical document..."

Text Generation Inference (Hugging Face) offers a good balance between simplicity and performance, with an OpenAI-compatible API that eases migration from cloud APIs.

# Docker Compose for TGI
version: '3.8'
services:
  tgi:
    image: ghcr.io/huggingface/text-generation-inference:latest
    ports:
      - "8080:80"
    volumes:
      - ./models:/data
    environment:
      - MODEL_ID=mistralai/Mistral-3-7B-Instruct
      - QUANTIZE=bitsandbytes-nf4
      - MAX_INPUT_LENGTH=4096
      - MAX_TOTAL_TOKENS=8192
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

Quantization Reduces Hardware Requirements Without Sacrificing Too Much Quality

Quantization converts model weights from FP16 (16-bit) to INT8 or INT4 (8 or 4-bit). This compression reduces memory requirements by a factor of 2 to 4, at the cost of a slight performance degradation.

According to MMLU benchmarks published by the llama.cpp maintainers, Q4_K_M quantization (4-bit) generally retains over 95% of the original model’s performance on comprehension tasks. Degradation becomes more noticeable on creative generation or complex reasoning tasks.

# Loading a quantized model with transformers + bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-4-8B-Instruct")

def generate_response(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

For mission-critical use cases, test the quantized version against your specific prompts before validating the deployment.

RAG Architecture Enables You to Leverage Internal Documents

A standalone LLM only knows what it was trained on. To leverage your internal documents, RAG (Retrieval-Augmented Generation) architecture retrieves relevant passages from a vector database and injects them into the LLM context.

This architecture can be implemented entirely on-premise using open source components.

from sentence_transformers import SentenceTransformer
import psycopg2
from pgvector.psycopg2 import register_vector

class OnPremiseRAG:
    """Fully on-premise RAG pipeline."""

    def __init__(self, llm_client, db_connection):
        self.llm = llm_client
        self.conn = db_connection
        self.embedder = SentenceTransformer(
            "sentence-transformers/all-MiniLM-L6-v2"
        )
        register_vector(self.conn)

    def search_documents(self, query: str, top_k: int = 5) -> list:
        """Search for relevant documents in pgvector."""
        query_embedding = self.embedder.encode(query).tolist()

        with self.conn.cursor() as cur:
            cur.execute("""
                SELECT content, source, 1 - (embedding <=> %s) as similarity
                FROM documents
                ORDER BY embedding <=> %s
                LIMIT %s
            """, (query_embedding, query_embedding, top_k))

            return [
                {"content": row[0], "source": row[1], "similarity": row[2]}
                for row in cur.fetchall()
            ]

    def generate_answer(self, query: str) -> dict:
        """Generate an answer based on internal documents."""
        contexts = self.search_documents(query)

        context_str = "\n\n---\n\n".join([
            f"[Source: {c['source']}]\n{c['content']}"
            for c in contexts
        ])

        prompt = f"""Context from internal documents:
{context_str}

Question: {query}

Answer based solely on the provided context. If the information is not available, state this explicitly."""

        answer = self.llm.generate(prompt)

        return {
            "answer": answer,
            "sources": [c["source"] for c in contexts]
        }

All components remain within your infrastructure: the embedding model, the PostgreSQL database with pgvector, and the LLM. No data leaves your environment.

Securing the Deployment Requires Multiple Layers

An on-premise deployment does not automatically guarantee security. Several measures are essential.

Network isolation places LLM services in a dedicated VLAN with no direct Internet access. Requests pass through an authenticated reverse proxy.

# Nginx configuration for LLM proxy
server {
    listen 443 ssl;
    server_name llm-internal.company.local;

    ssl_certificate /etc/ssl/certs/llm.crt;
    ssl_certificate_key /etc/ssl/private/llm.key;

    # Client certificate authentication
    ssl_client_certificate /etc/ssl/ca/company-ca.crt;
    ssl_verify_client on;

    location /v1/ {
        proxy_pass http://llm-backend:8080/;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-User $ssl_client_s_dn_cn;

        # Rate limiting
        limit_req zone=llm_limit burst=10 nodelay;
    }
}

Request logging enables auditing and detection of abnormal usage. Be careful not to log sensitive content itself, but rather metadata (who, when, request size).

import logging
from datetime import datetime
import hashlib

class AuditLogger:
    """Audit logger for LLM requests."""

    def __init__(self, log_path: str):
        self.logger = logging.getLogger("llm_audit")
        handler = logging.FileHandler(log_path)
        handler.setFormatter(logging.Formatter(
            '%(asctime)s - %(message)s'
        ))
        self.logger.addHandler(handler)
        self.logger.setLevel(logging.INFO)

    def log_request(self, user_id: str, prompt_length: int,
                    response_length: int, latency_ms: float):
        """Log a request without exposing content."""

        # Hash the prompt for correlation without exposure
        prompt_hash = hashlib.sha256(
            f"{user_id}{datetime.now().isoformat()}".encode()
        ).hexdigest()[:16]

        self.logger.info(
            f"user={user_id} prompt_len={prompt_length} "
            f"response_len={response_length} latency_ms={latency_ms:.0f} "
            f"request_id={prompt_hash}"
        )

Access management integrates with your existing IAM (Active Directory, LDAP, SSO) to control who can use which models with which documents.

Production Monitoring Detects Problems Before Users Do

An LLM service in production requires monitoring specific to LLMs beyond standard system metrics.

P50/P95/P99 latency measures actual user experience. A high P95 indicates occasional slowdowns affecting a subset of users.

Throughput (tokens per second) measures system capacity. A drop may indicate a resource issue or excessive concurrency.

Error rate includes timeouts, empty responses, and generation errors. A spike requires immediate investigation.

GPU/VRAM utilization allows you to anticipate saturation and plan scaling.

from prometheus_client import Counter, Histogram, Gauge
import time

# Prometheus metrics for LLM
llm_request_duration = Histogram(
    'llm_request_duration_seconds',
    'LLM request duration',
    buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 30.0]
)

llm_tokens_generated = Counter(
    'llm_tokens_generated_total',
    'Total number of tokens generated'
)

llm_errors = Counter(
    'llm_errors_total',
    'Total LLM errors',
    ['error_type']
)

llm_gpu_memory_usage = Gauge(
    'llm_gpu_memory_bytes',
    'GPU memory usage',
    ['gpu_id']
)

class MonitoredLLM:
    """LLM wrapper with Prometheus metrics."""

    def __init__(self, llm_client):
        self.llm = llm_client

    def generate(self, prompt: str) -> str:
        start_time = time.time()

        try:
            response = self.llm.generate(prompt)
            duration = time.time() - start_time

            llm_request_duration.observe(duration)
            llm_tokens_generated.inc(len(response.split()))

            return response

        except TimeoutError:
            llm_errors.labels(error_type='timeout').inc()
            raise
        except Exception as e:
            llm_errors.labels(error_type='other').inc()
            raise

These metrics feed Grafana dashboards and alerts that notify the team before users start complaining.

On-Premise Costs Compare to Cloud Costs Over the Long Term

The economic analysis depends heavily on usage volume and time horizon.

Cloud costs (e.g., OpenAI GPT-4o) are billed per usage: approximately 5permillioninputtokensand5 per million input tokens and 15 per million output tokens (January 2026 pricing, subject to change). Predictable and requiring no upfront investment, but scaling linearly with usage.

On-premise costs include the initial investment (servers, GPUs) plus operational costs (electricity, maintenance, personnel). The marginal cost per request becomes virtually zero once the infrastructure is amortized.

The break-even point depends on volume. For an organization processing a few thousand requests per day, cloud remains more cost-effective. For millions of monthly requests, on-premise becomes profitable.

def cost_comparison(
    monthly_requests: int,
    avg_tokens_per_request: int,
    cloud_cost_per_million_input: float = 5.0,
    cloud_cost_per_million_output: float = 15.0,
    onprem_monthly_cost: float = 5000.0  # Server + electricity + maintenance
) -> dict:
    """Compare cloud vs on-premise costs."""

    monthly_tokens = monthly_requests * avg_tokens_per_request
    monthly_tokens_millions = monthly_tokens / 1_000_000

    # Assumption: 50% input, 50% output
    cloud_cost = monthly_tokens_millions * (
        cloud_cost_per_million_input * 0.5 +
        cloud_cost_per_million_output * 0.5
    )

    return {
        "monthly_requests": monthly_requests,
        "monthly_tokens": monthly_tokens,
        "cloud_monthly_cost": cloud_cost,
        "onprem_monthly_cost": onprem_monthly_cost,
        "breakeven_at_requests": int(
            onprem_monthly_cost / (cloud_cost / monthly_requests)
        ) if monthly_requests > 0 else 0,
        "recommendation": "cloud" if cloud_cost < onprem_monthly_cost else "on-premise"
    }

Beyond pure cost, sovereignty carries a strategic value that is difficult to quantify but very real for certain organizations.

Regulatory Constraints Drive Adoption of On-Premise

Several sectors face regulatory obligations that make cloud deployments problematic.

The financial sector (banks, insurance companies) is subject to outsourcing regulations (EBA Guidelines, DORA) that mandate control over critical data.

The healthcare sector handles health data governed by GDPR with strengthened requirements. Hosting with a US cloud provider raises legal questions.

The defense sector and sensitive government agencies often have classification requirements that prohibit any transit through non-sovereign infrastructure.

Companies with sensitive intellectual property (R&D, patents, formulations) may need to guarantee that their trade secrets never leave their perimeter.

For these cases, on-premise is not an option but a necessity.

The Deployment Roadmap Follows Progressive Phases

A successful deployment proceeds in phases to manage risk.

Phase 1: Proof of Concept on a non-critical use case. Validate that the chosen model meets your quality requirements. Evaluate real-world performance on your hardware.

Phase 2: Pilot with a small group of users. Collect feedback on user experience. Identify problematic prompts and edge cases.

Phase 3: Progressive production rollout with controlled scaling. Intensive monitoring during the first weeks. Rollback procedures defined.

Phase 4: Industrialization with deployment automation, auto-scaling, and full integration with existing systems.

from dataclasses import dataclass
from enum import Enum
from typing import List

class DeploymentPhase(Enum):
    POC = "poc"
    PILOT = "pilot"
    PRODUCTION = "production"
    INDUSTRIALIZED = "industrialized"

@dataclass
class DeploymentChecklist:
    phase: DeploymentPhase
    items: List[str]
    completed: List[bool]

    def progress(self) -> float:
        if not self.items:
            return 1.0
        return sum(self.completed) / len(self.items)

POC_CHECKLIST = DeploymentChecklist(
    phase=DeploymentPhase.POC,
    items=[
        "Model selected and tested locally",
        "Minimum infrastructure provisioned",
        "Quality benchmark on target use case",
        "Performance benchmark (latency, throughput)",
        "Cost estimate validated",
        "Approval to proceed to pilot"
    ],
    completed=[False] * 6
)

PILOT_CHECKLIST = DeploymentChecklist(
    phase=DeploymentPhase.PILOT,
    items=[
        "Pilot group identified (5-20 users)",
        "User training completed",
        "Feedback mechanism in place",
        "Operational monitoring active",
        "Level 1 support available",
        "Success criteria defined and measured",
        "Approval to proceed to production"
    ],
    completed=[False] * 7
)

Each phase validates the assumptions of the previous one and reduces risk for the next.

Common Mistakes to Avoid During Deployment

Several pitfalls await teams embarking on on-premise deployment.

Under-sizing the initial infrastructure. A sluggish model discourages users. Plan for a 30-50% margin above your load estimates.

Neglecting prompt engineering. An open source model with well-crafted prompts can outperform a larger model with naive prompts. Invest in this skill.

Ignoring updates. Models evolve rapidly. A model from 6 months ago may already be significantly outdated. Plan for upgrades.

Forgetting internal documentation. How do users access the service? What are its limitations? Who should they contact if something goes wrong? Document everything for your users.

Failing to plan for support. Users will have questions, encounter issues, and make requests. Who responds? With what response time?


Racine AI helps enterprises deploy sovereign AI solutions. Our Pi-Search offering combines on-premise LLMs with optimized RAG to leverage your internal documents without compromising confidentiality. Contact us for an assessment of your sovereignty requirements.

Technical newsletter

1 article per month on document AI. No spam.

8 - 2 =

Common questions

Are open source models truly comparable to proprietary APIs?

The landscape has shifted significantly since 2023. Llama 4, Mistral 3 and Qwen 2.5 rival proprietary APIs on many tasks. The gap is narrowing on standard tasks (Q&A, summarization, extraction). Proprietary APIs still hold the edge on edge cases and complex reasoning.

What is the minimum infrastructure needed to deploy an LLM on-premise?

For a 7-8B model with 4-bit quantization, a single RTX 4090 (24 GB VRAM) is sufficient. For 70B+ models, you need multiple professional-grade cards (A100, H100). CPU inference is possible via llama.cpp but with response times 10 to 50 times slower.

Does quantization significantly degrade model performance?

According to MMLU benchmarks published by the llama.cpp maintainers, Q4_K_M quantization (4-bit) generally retains over 95% of the original model's performance on comprehension tasks. Degradation becomes more noticeable on creative generation or complex reasoning tasks.

How do you secure an on-premise LLM deployment?

Multiple layers are required: network isolation (dedicated VLAN, no direct Internet access), authentication via reverse proxy with certificates, request logging (metadata only, not sensitive content), and integration with existing IAM to control access.

Which serving framework should you choose for production?

vLLM optimizes throughput for high-volume scenarios. Text Generation Inference (HuggingFace) offers a good balance of simplicity and performance with an OpenAI-compatible API. Ollama simplifies installation for prototyping but is less configurable.

Is on-premise deployment more cost-effective than cloud in the long run?

The break-even point depends on volume. For a few thousand requests per day, cloud remains more cost-effective. For millions of monthly requests, on-premise becomes profitable. Beyond pure cost, sovereignty carries a strategic value that is difficult to quantify.

Which regulations drive organizations toward on-premise deployment?

The financial sector (EBA Guidelines, DORA), healthcare (strengthened GDPR requirements), defense and sensitive government agencies all have requirements that make cloud problematic. The US Cloud Act also raises concerns for data hosted by US providers.

How do you integrate an on-premise LLM with a RAG pipeline over internal documents?

All components can run on-premise: embedding model (Sentence Transformers), vector database (PostgreSQL + pgvector), and LLM. The RAG architecture retrieves relevant passages and injects them into the LLM context. No data leaves your infrastructure.

What metrics should you monitor for an LLM in production?

Four essential metrics: P50/P95/P99 latency (user experience), throughput in tokens per second (capacity), error rate (quality of service), and GPU/VRAM utilization (anticipating saturation). A Grafana dashboard with these metrics enables early problem detection.

How should you plan the deployment roadmap for a sovereign LLM?

Proceed in phases: PoC on a non-critical use case to validate the model and infrastructure, pilot with a small group to collect feedback, progressive production rollout with intensive monitoring, then industrialization with automation. Each phase validates the assumptions of the previous one.

Let's discuss

Your Project.

AI Documents, legacy automation, field inspection. We deploy solutions that go to production.

Tell us about your project and get a response within 48h.

Contact us