Back to blog
Guides

Streaming LLM Responses Delivers Real-Time User Experiences

Racine AI

Last updated January 14, 2026

Streaming LLM responses transforms user experience from waiting seconds for complete answers to seeing text appear token-by-token in real time. This technique eliminates perceived latency by displaying output as the model generates it, making even slow inference feel responsive and interactive.

Token-by-token generation creates natural delays that streaming can hide

Large language models generate output one token at a time through autoregressive decoding. Each token prediction depends on all previous tokens, creating a sequential process that cannot be parallelized. For a response of 500 tokens at typical inference speeds, users might wait several seconds before seeing any output.

The OpenAI documentation notes that streaming “allows the client to start processing the response before the complete response is available” (OpenAI API Reference, 2025). This architectural choice reflects a fundamental insight: users perceive an interface as faster when they see progressive results, even if total completion time remains unchanged.

Anthropic’s Claude API and Google’s Gemini API both support streaming through similar mechanisms. The stream=true parameter activates server-sent events that deliver each token as it becomes available. This consistency across providers suggests streaming has become a standard expectation for LLM interfaces.

Without streaming, a chatbot that takes 8 seconds to generate a response shows a loading spinner for the entire duration. With streaming, the first token appears after 200-400 milliseconds, and users read along as the rest arrives. The psychological difference is substantial.

Server-Sent Events provide the simplest streaming implementation

SSE offers a lightweight protocol for server-to-client streaming over HTTP. Unlike WebSockets, SSE uses standard HTTP connections that work through proxies and load balancers without special configuration. The simplicity makes SSE the default choice for most LLM streaming scenarios.

The EventSource API in browsers handles SSE connections automatically, including reconnection on network interruption. Server implementations send events with a specific format: lines prefixed with data: followed by the payload, terminated by double newlines.

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from openai import OpenAI
import json

app = FastAPI()
client = OpenAI()

async def stream_llm_response(prompt: str):
    """
    Generator that yields SSE-formatted chunks from OpenAI streaming API.
    Each chunk contains a single token or partial token.
    """
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )

    for chunk in stream:
        if chunk.choices[0].delta.content:
            token = chunk.choices[0].delta.content
            # Format as SSE event
            data = json.dumps({"token": token})
            yield f"data: {data}\n\n"

    # Signal completion
    yield "data: [DONE]\n\n"

@app.post("/chat/stream")
async def chat_stream(request: Request):
    """
    Endpoint that returns a streaming response using SSE format.
    The Content-Type header tells the client to expect an event stream.
    """
    body = await request.json()
    prompt = body.get("prompt", "")

    return StreamingResponse(
        stream_llm_response(prompt),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no"  # Disable nginx buffering
        }
    )

The X-Accel-Buffering: no header deserves attention. Nginx and other reverse proxies buffer responses by default, which defeats streaming by accumulating chunks before forwarding them. Disabling this buffering ensures tokens reach the client immediately.

Client-side consumption of the SSE stream uses the browser’s built-in EventSource or the Fetch API with readable streams for more control:

// Using EventSource (simpler, automatic reconnection)
const eventSource = new EventSource('/chat/stream', {
  method: 'POST',
  // Note: EventSource doesn't support POST body natively
  // Use fetch with ReadableStream instead for POST requests
});

eventSource.onmessage = (event) => {
  if (event.data === '[DONE]') {
    eventSource.close();
    return;
  }

  const data = JSON.parse(event.data);
  appendToken(data.token);
};

// Using Fetch API with ReadableStream (more flexible)
async function streamChat(prompt) {
  const response = await fetch('/chat/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt })
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();

  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();

    if (done) break;

    buffer += decoder.decode(value, { stream: true });

    // Parse SSE events from buffer
    const lines = buffer.split('\n\n');
    buffer = lines.pop(); // Keep incomplete event in buffer

    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const data = line.slice(6);
        if (data === '[DONE]') continue;

        const parsed = JSON.parse(data);
        appendToken(parsed.token);
      }
    }
  }
}

function appendToken(token) {
  const output = document.getElementById('chat-output');
  output.textContent += token;
}

WebSockets enable bidirectional streaming for complex interactions

While SSE handles most LLM streaming needs, WebSockets offer advantages for applications requiring bidirectional communication. Voice interfaces that stream audio in both directions, collaborative editing with real-time updates, or applications that need to cancel generation mid-stream all benefit from WebSocket connections.

WebSockets maintain a persistent connection after an initial HTTP upgrade handshake. This eliminates the overhead of establishing new connections for each interaction, reducing latency for follow-up messages in conversational applications.

from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from openai import OpenAI
import json
import asyncio

app = FastAPI()
client = OpenAI()

class ConnectionManager:
    """
    Manages active WebSocket connections.
    Handles broadcasting and cleanup on disconnect.
    """

    def __init__(self):
        self.active_connections: list[WebSocket] = []

    async def connect(self, websocket: WebSocket):
        await websocket.accept()
        self.active_connections.append(websocket)

    def disconnect(self, websocket: WebSocket):
        self.active_connections.remove(websocket)

manager = ConnectionManager()

@app.websocket("/ws/chat")
async def websocket_chat(websocket: WebSocket):
    """
    WebSocket endpoint for bidirectional chat streaming.
    Supports message cancellation and real-time interactions.
    """
    await manager.connect(websocket)

    try:
        while True:
            # Receive message from client
            data = await websocket.receive_json()

            if data.get("type") == "cancel":
                # Client wants to cancel current generation
                continue

            prompt = data.get("prompt", "")

            # Stream response back
            stream = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
                stream=True
            )

            for chunk in stream:
                if chunk.choices[0].delta.content:
                    token = chunk.choices[0].delta.content
                    await websocket.send_json({
                        "type": "token",
                        "content": token
                    })

            await websocket.send_json({"type": "done"})

    except WebSocketDisconnect:
        manager.disconnect(websocket)

The WebSocket approach adds complexity but enables features impossible with SSE. Cancellation requires the client to send a message while the server is streaming, something SSE cannot support since it only allows server-to-client communication.

class ChatWebSocket {
  constructor(url) {
    this.ws = new WebSocket(url);
    this.onToken = null;
    this.onComplete = null;
    this.buffer = '';

    this.ws.onmessage = (event) => {
      const data = JSON.parse(event.data);

      switch (data.type) {
        case 'token':
          this.buffer += data.content;
          if (this.onToken) this.onToken(data.content, this.buffer);
          break;
        case 'done':
          if (this.onComplete) this.onComplete(this.buffer);
          this.buffer = '';
          break;
      }
    };
  }

  send(prompt) {
    this.buffer = '';
    this.ws.send(JSON.stringify({ type: 'message', prompt }));
  }

  cancel() {
    this.ws.send(JSON.stringify({ type: 'cancel' }));
  }
}

// Usage
const chat = new ChatWebSocket('ws://localhost:8000/ws/chat');

chat.onToken = (token, fullText) => {
  document.getElementById('output').textContent = fullText;
};

chat.send('Explain quantum computing');

// User clicks cancel button
document.getElementById('cancel-btn').onclick = () => chat.cancel();

Chunked transfer encoding works at the HTTP protocol level

For environments where SSE or WebSocket support is limited, HTTP chunked transfer encoding provides an alternative streaming mechanism. The server sends the response body in chunks without specifying the total content length upfront, allowing clients to process data as it arrives.

This approach works with any HTTP client that supports streaming responses. The tradeoff is less structure than SSE events, requiring custom parsing logic to handle partial data and message boundaries.

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

async def generate_chunks(prompt: str):
    """
    Generator yielding raw text chunks without SSE formatting.
    Simpler but requires client-side buffering logic.
    """
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )

    for chunk in stream:
        if content := chunk.choices[0].delta.content:
            yield content

@app.post("/chat/chunked")
async def chat_chunked(prompt: str):
    """
    Returns a chunked response with Transfer-Encoding: chunked.
    Each chunk contains one or more tokens as plain text.
    """
    return StreamingResponse(
        generate_chunks(prompt),
        media_type="text/plain"
    )

React components need careful state management for streaming updates

Frontend frameworks require specific patterns to render streaming content efficiently. Naive approaches that update state on every token cause excessive re-renders, degrading performance and creating visual jitter. Batching updates and using refs for the underlying text buffer solve these issues.

import { useState, useRef, useEffect, useCallback } from 'react';

function useStreamingResponse(endpoint) {
  const [displayText, setDisplayText] = useState('');
  const [isStreaming, setIsStreaming] = useState(false);
  const bufferRef = useRef('');
  const rafRef = useRef(null);

  const updateDisplay = useCallback(() => {
    setDisplayText(bufferRef.current);
    rafRef.current = null;
  }, []);

  const appendToken = useCallback((token) => {
    bufferRef.current += token;

    // Batch updates using requestAnimationFrame
    if (!rafRef.current) {
      rafRef.current = requestAnimationFrame(updateDisplay);
    }
  }, [updateDisplay]);

  const streamMessage = useCallback(async (prompt) => {
    bufferRef.current = '';
    setDisplayText('');
    setIsStreaming(true);

    try {
      const response = await fetch(endpoint, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ prompt })
      });

      const reader = response.body.getReader();
      const decoder = new TextDecoder();

      let sseBuffer = '';

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        sseBuffer += decoder.decode(value, { stream: true });

        const events = sseBuffer.split('\n\n');
        sseBuffer = events.pop();

        for (const event of events) {
          if (event.startsWith('data: ')) {
            const data = event.slice(6);
            if (data === '[DONE]') continue;

            const parsed = JSON.parse(data);
            appendToken(parsed.token);
          }
        }
      }
    } finally {
      setIsStreaming(false);
      // Final update to ensure all buffered content is displayed
      setDisplayText(bufferRef.current);
    }
  }, [endpoint, appendToken]);

  useEffect(() => {
    return () => {
      if (rafRef.current) {
        cancelAnimationFrame(rafRef.current);
      }
    };
  }, []);

  return { displayText, isStreaming, streamMessage };
}

function ChatComponent() {
  const { displayText, isStreaming, streamMessage } = useStreamingResponse('/chat/stream');
  const [input, setInput] = useState('');

  const handleSubmit = (e) => {
    e.preventDefault();
    streamMessage(input);
    setInput('');
  };

  return (
    <div className="chat-container">
      <div className="messages">
        <div className="assistant-message">
          {displayText}
          {isStreaming && <span className="cursor">|</span>}
        </div>
      </div>

      <form onSubmit={handleSubmit}>
        <input
          value={input}
          onChange={(e) => setInput(e.target.value)}
          disabled={isStreaming}
          placeholder="Type your message..."
        />
        <button type="submit" disabled={isStreaming}>
          Send
        </button>
      </form>
    </div>
  );
}

The requestAnimationFrame batching limits DOM updates to the display refresh rate, typically 60 times per second. This prevents the performance degradation that occurs when updating state for every individual token, which can happen hundreds of times per second with fast inference.

Token buffering strategies affect readability and performance

Different buffering strategies create different user experiences. No buffering displays each token immediately, creating a typewriter effect but potentially choppy rendering. Word buffering accumulates tokens until a word boundary, then flushes the complete word. Sentence buffering waits for punctuation, producing smoother but more delayed output.

import re
from typing import AsyncIterator

class TokenBuffer:
    """
    Buffers tokens and yields them according to different strategies.
    Balances perceived latency against rendering smoothness.
    """

    def __init__(self, strategy: str = "word"):
        self.buffer = ""
        self.strategy = strategy

    def add(self, token: str) -> str | None:
        """
        Add a token to the buffer.
        Returns content to flush if a boundary is reached.
        """
        self.buffer += token

        if self.strategy == "none":
            # No buffering - immediate flush
            result = self.buffer
            self.buffer = ""
            return result

        elif self.strategy == "word":
            # Flush on whitespace
            if re.search(r'\s$', self.buffer):
                result = self.buffer
                self.buffer = ""
                return result

        elif self.strategy == "sentence":
            # Flush on sentence-ending punctuation
            if re.search(r'[.!?]\s*$', self.buffer):
                result = self.buffer
                self.buffer = ""
                return result

        return None

    def flush(self) -> str:
        """Return any remaining buffered content."""
        result = self.buffer
        self.buffer = ""
        return result

async def buffered_stream(
    token_stream: AsyncIterator[str],
    strategy: str = "word"
) -> AsyncIterator[str]:
    """
    Apply buffering strategy to a token stream.
    Yields content chunks according to the chosen strategy.
    """
    buffer = TokenBuffer(strategy)

    async for token in token_stream:
        if chunk := buffer.add(token):
            yield chunk

    # Flush remaining content
    if final := buffer.flush():
        yield final

The choice of buffering strategy depends on the use case. Chat interfaces typically use no buffering or word buffering for immediacy. Document generation might use sentence buffering for smoother reading. Code generation often benefits from line buffering that flushes on newlines.

Error handling requires graceful degradation during streaming

Network interruptions, rate limits, and server errors can occur mid-stream. Robust implementations detect these failures and provide meaningful feedback without losing already-received content.

from fastapi import HTTPException
from openai import OpenAI, APIError, RateLimitError
import asyncio

async def resilient_stream(prompt: str, max_retries: int = 3):
    """
    Stream with retry logic and error handling.
    Preserves partial output on failure and attempts recovery.
    """
    client = OpenAI()
    accumulated = ""
    retry_count = 0

    while retry_count < max_retries:
        try:
            stream = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
                stream=True
            )

            for chunk in stream:
                if content := chunk.choices[0].delta.content:
                    accumulated += content
                    yield {"type": "token", "content": content}

            yield {"type": "done", "total": accumulated}
            return

        except RateLimitError as e:
            retry_count += 1
            wait_time = 2 ** retry_count  # Exponential backoff

            yield {
                "type": "error",
                "recoverable": True,
                "message": f"Rate limited. Retrying in {wait_time}s...",
                "partial": accumulated
            }

            await asyncio.sleep(wait_time)

        except APIError as e:
            yield {
                "type": "error",
                "recoverable": False,
                "message": str(e),
                "partial": accumulated
            }
            return

    yield {
        "type": "error",
        "recoverable": False,
        "message": "Max retries exceeded",
        "partial": accumulated
    }

Client-side error handling should preserve the partial response and offer retry options:

async function streamWithRecovery(prompt, outputElement) {
  let partialContent = '';

  try {
    const response = await fetch('/chat/stream', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ prompt })
    });

    const reader = response.body.getReader();
    const decoder = new TextDecoder();

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      const text = decoder.decode(value, { stream: true });
      const events = parseSSE(text);

      for (const event of events) {
        switch (event.type) {
          case 'token':
            partialContent += event.content;
            outputElement.textContent = partialContent;
            break;

          case 'error':
            if (event.recoverable) {
              showRetryMessage(event.message);
            } else {
              showError(event.message, partialContent);
            }
            break;

          case 'done':
            showComplete();
            break;
        }
      }
    }
  } catch (networkError) {
    // Connection lost mid-stream
    showError(
      'Connection lost. Your partial response has been preserved.',
      partialContent
    );
    offerRetry(prompt, partialContent);
  }
}

Backpressure handling prevents memory exhaustion on slow clients

When clients cannot consume tokens as fast as the server produces them, unhandled backpressure leads to memory growth. The server buffers unsent data, potentially exhausting memory under load. Proper flow control respects the client’s consumption rate.

from fastapi import FastAPI
from starlette.responses import StreamingResponse
from asyncio import Queue, wait_for, TimeoutError
import asyncio

class BackpressureAwareStream:
    """
    Streaming response that respects client backpressure.
    Implements a bounded queue with timeout for slow consumers.
    """

    def __init__(self, max_buffer: int = 100, timeout: float = 30.0):
        self.queue: Queue = Queue(maxsize=max_buffer)
        self.timeout = timeout
        self.done = False

    async def put(self, item: str):
        """
        Add item to queue, blocking if buffer is full.
        Raises TimeoutError if client is too slow.
        """
        try:
            await wait_for(
                self.queue.put(item),
                timeout=self.timeout
            )
        except TimeoutError:
            self.done = True
            raise

    async def complete(self):
        """Signal stream completion."""
        await self.queue.put(None)
        self.done = True

    async def __aiter__(self):
        """Iterate over queued items."""
        while True:
            item = await self.queue.get()
            if item is None:
                break
            yield item

async def produce_tokens(stream: BackpressureAwareStream, prompt: str):
    """
    Producer coroutine that generates tokens and queues them.
    Respects backpressure from the stream buffer.
    """
    client = OpenAI()

    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            stream=True
        )

        for chunk in response:
            if content := chunk.choices[0].delta.content:
                await stream.put(f"data: {content}\n\n")

        await stream.complete()

    except TimeoutError:
        # Client too slow, abort generation
        pass

@app.post("/chat/backpressure")
async def chat_with_backpressure(prompt: str):
    """
    Endpoint with backpressure-aware streaming.
    Protects server memory from slow clients.
    """
    stream = BackpressureAwareStream(max_buffer=50, timeout=30.0)

    # Start producer in background
    asyncio.create_task(produce_tokens(stream, prompt))

    return StreamingResponse(
        stream,
        media_type="text/event-stream"
    )

Load balancing streaming connections requires sticky sessions

HTTP load balancers distribute requests across backend servers. For streaming, this creates a challenge: the initial request and subsequent chunks must reach the same server. Without session affinity, chunks from different servers interleave, corrupting the response.

WebSocket connections naturally maintain server affinity since the connection persists. SSE over HTTP requires explicit configuration. Most load balancers support cookie-based or IP-based session affinity that keeps streaming requests on their originating server.

For Kubernetes deployments, the session affinity annotation ensures pods receive all chunks from connections they initiate:

apiVersion: v1
kind: Service
metadata:
  name: llm-streaming-service
spec:
  selector:
    app: llm-server
  ports:
    - port: 80
      targetPort: 8000
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600  # 1 hour session stickiness

Monitoring streaming endpoints differs from standard request metrics

Traditional request duration metrics lose meaning for streaming endpoints. A 30-second stream is not a slow request but a successful long-running response. Monitoring should track time to first token, tokens per second throughput, and completion rate rather than total duration.

import time
from prometheus_client import Histogram, Counter, Gauge

# Metrics definitions
time_to_first_token = Histogram(
    'llm_time_to_first_token_seconds',
    'Time from request to first token',
    buckets=[0.1, 0.2, 0.5, 1.0, 2.0, 5.0]
)

tokens_per_second = Histogram(
    'llm_tokens_per_second',
    'Token generation throughput',
    buckets=[10, 25, 50, 100, 200]
)

active_streams = Gauge(
    'llm_active_streams',
    'Currently active streaming connections'
)

stream_completion = Counter(
    'llm_stream_completions_total',
    'Total stream completions',
    ['status']  # success, error, cancelled
)

class MeteredStream:
    """
    Wrapper that instruments a token stream with Prometheus metrics.
    """

    def __init__(self, token_generator):
        self.generator = token_generator
        self.start_time = None
        self.first_token_time = None
        self.token_count = 0

    async def __aiter__(self):
        self.start_time = time.time()
        active_streams.inc()

        try:
            async for token in self.generator:
                if self.first_token_time is None:
                    self.first_token_time = time.time()
                    ttft = self.first_token_time - self.start_time
                    time_to_first_token.observe(ttft)

                self.token_count += 1
                yield token

            # Calculate throughput
            duration = time.time() - self.start_time
            if duration > 0:
                tps = self.token_count / duration
                tokens_per_second.observe(tps)

            stream_completion.labels(status='success').inc()

        except Exception:
            stream_completion.labels(status='error').inc()
            raise

        finally:
            active_streams.dec()

Moving forward with streaming implementations

Start simple with SSE for unidirectional streaming before adding WebSocket complexity. Most LLM applications only need server-to-client streaming, and SSE handles this case with minimal code. Reserve WebSockets for applications that genuinely require bidirectional communication or mid-stream cancellation.

Test streaming behavior under realistic conditions. Slow clients, network interruptions, and high concurrency expose issues that unit tests miss. Load testing with tools like k6 or Locust can simulate streaming connections and measure time to first token under load.

Consider the full path from LLM provider to end user. Each hop introduces potential buffering: the LLM API, your backend server, reverse proxies, CDNs, and the client browser. Audit each layer to ensure streaming actually streams end-to-end. A single buffering component negates the benefits of token-by-token delivery.

Technical newsletter

1 article per month on document AI. No spam.

7 + 6 =

Common questions

What is the difference between SSE and WebSocket for LLM streaming?

SSE only allows server-to-client communication over standard HTTP, which suffices for most chatbots. WebSocket offers bidirectional communication enabling features like mid-stream cancellation, but adds complexity. Literature recommends starting with SSE and moving to WebSocket only if bidirectionality is required.

How to prevent nginx from buffering streaming responses?

Add the X-Accel-Buffering: no header in the HTTP response. This header disables nginx buffering for that specific request. Without this header, nginx accumulates chunks before forwarding them, which completely negates the streaming benefit.

Why use requestAnimationFrame for React streaming updates?

Tokens potentially arrive hundreds of times per second. Updating React state on each token causes excessive re-renders and degrades performance. Batching via requestAnimationFrame limits DOM updates to screen refresh rate, typically 60 times per second.

How to handle cancellation of an ongoing LLM generation?

Cancellation requires bidirectional communication that SSE does not allow. With WebSocket, the client can send a cancel message while the server streams. Server-side, implement a cancellation flag check in the generation loop.

Which token buffering strategy to choose?

For interactive chatbots, no buffering or word buffering offers best responsiveness. For document generation, sentence buffering produces smoother reading. For code, line buffering that flushes on newlines is often optimal.

How to monitor streaming endpoints in production?

Traditional request duration metrics do not fit. Instead track time to first token (TTFT), tokens per second throughput, concurrent active streams count, and completion rate. These streaming-specific metrics reveal real performance issues.

Why are sticky sessions necessary for SSE load balancing?

With SSE over HTTP, the load balancer could route chunks from the same response to different backend servers. This corrupts the response. Session affinity (by cookie or IP) ensures all chunks from a connection reach the same server.

How to preserve partial response on network error?

Client-side, accumulate received text in a local variable independent of connection state. On error, display partial content with an error message. Offer a retry option that can resume or restart depending on use case.

Is backpressure a real problem in production?

Yes, especially under load. If the client consumes tokens slower than the server produces them, the server buffers unsent data. With many slow connections, server memory can exhaust. A bounded queue with timeout protects against this scenario.

What latency to aim for time to first token?

For a responsive user experience, aim for under 500 milliseconds to first token. Beyond one second, the interface feels slow despite streaming. This metric depends mainly on network latency and LLM initialization time, not your streaming implementation.

Let's discuss

Your Project.

AI Documents, legacy automation, field inspection. We deploy solutions that go to production.

Tell us about your project and get a response within 48h.

Contact us