Last updated January 14, 2026
Streaming LLM responses transforms user experience from waiting seconds for complete answers to seeing text appear token-by-token in real time. This technique eliminates perceived latency by displaying output as the model generates it, making even slow inference feel responsive and interactive.
Token-by-token generation creates natural delays that streaming can hide
Large language models generate output one token at a time through autoregressive decoding. Each token prediction depends on all previous tokens, creating a sequential process that cannot be parallelized. For a response of 500 tokens at typical inference speeds, users might wait several seconds before seeing any output.
The OpenAI documentation notes that streaming “allows the client to start processing the response before the complete response is available” (OpenAI API Reference, 2025). This architectural choice reflects a fundamental insight: users perceive an interface as faster when they see progressive results, even if total completion time remains unchanged.
Anthropic’s Claude API and Google’s Gemini API both support streaming through similar mechanisms. The stream=true parameter activates server-sent events that deliver each token as it becomes available. This consistency across providers suggests streaming has become a standard expectation for LLM interfaces.
Without streaming, a chatbot that takes 8 seconds to generate a response shows a loading spinner for the entire duration. With streaming, the first token appears after 200-400 milliseconds, and users read along as the rest arrives. The psychological difference is substantial.
Server-Sent Events provide the simplest streaming implementation
SSE offers a lightweight protocol for server-to-client streaming over HTTP. Unlike WebSockets, SSE uses standard HTTP connections that work through proxies and load balancers without special configuration. The simplicity makes SSE the default choice for most LLM streaming scenarios.
The EventSource API in browsers handles SSE connections automatically, including reconnection on network interruption. Server implementations send events with a specific format: lines prefixed with data: followed by the payload, terminated by double newlines.
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from openai import OpenAI
import json
app = FastAPI()
client = OpenAI()
async def stream_llm_response(prompt: str):
"""
Generator that yields SSE-formatted chunks from OpenAI streaming API.
Each chunk contains a single token or partial token.
"""
stream = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}],
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
token = chunk.choices[0].delta.content
# Format as SSE event
data = json.dumps({"token": token})
yield f"data: {data}\n\n"
# Signal completion
yield "data: [DONE]\n\n"
@app.post("/chat/stream")
async def chat_stream(request: Request):
"""
Endpoint that returns a streaming response using SSE format.
The Content-Type header tells the client to expect an event stream.
"""
body = await request.json()
prompt = body.get("prompt", "")
return StreamingResponse(
stream_llm_response(prompt),
media_type="text/event-stream",
headers={
"Cache-Control": "no-cache",
"Connection": "keep-alive",
"X-Accel-Buffering": "no" # Disable nginx buffering
}
)
The X-Accel-Buffering: no header deserves attention. Nginx and other reverse proxies buffer responses by default, which defeats streaming by accumulating chunks before forwarding them. Disabling this buffering ensures tokens reach the client immediately.
Client-side consumption of the SSE stream uses the browser’s built-in EventSource or the Fetch API with readable streams for more control:
// Using EventSource (simpler, automatic reconnection)
const eventSource = new EventSource('/chat/stream', {
method: 'POST',
// Note: EventSource doesn't support POST body natively
// Use fetch with ReadableStream instead for POST requests
});
eventSource.onmessage = (event) => {
if (event.data === '[DONE]') {
eventSource.close();
return;
}
const data = JSON.parse(event.data);
appendToken(data.token);
};
// Using Fetch API with ReadableStream (more flexible)
async function streamChat(prompt) {
const response = await fetch('/chat/stream', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ prompt })
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
while (true) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
// Parse SSE events from buffer
const lines = buffer.split('\n\n');
buffer = lines.pop(); // Keep incomplete event in buffer
for (const line of lines) {
if (line.startsWith('data: ')) {
const data = line.slice(6);
if (data === '[DONE]') continue;
const parsed = JSON.parse(data);
appendToken(parsed.token);
}
}
}
}
function appendToken(token) {
const output = document.getElementById('chat-output');
output.textContent += token;
}
WebSockets enable bidirectional streaming for complex interactions
While SSE handles most LLM streaming needs, WebSockets offer advantages for applications requiring bidirectional communication. Voice interfaces that stream audio in both directions, collaborative editing with real-time updates, or applications that need to cancel generation mid-stream all benefit from WebSocket connections.
WebSockets maintain a persistent connection after an initial HTTP upgrade handshake. This eliminates the overhead of establishing new connections for each interaction, reducing latency for follow-up messages in conversational applications.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from openai import OpenAI
import json
import asyncio
app = FastAPI()
client = OpenAI()
class ConnectionManager:
"""
Manages active WebSocket connections.
Handles broadcasting and cleanup on disconnect.
"""
def __init__(self):
self.active_connections: list[WebSocket] = []
async def connect(self, websocket: WebSocket):
await websocket.accept()
self.active_connections.append(websocket)
def disconnect(self, websocket: WebSocket):
self.active_connections.remove(websocket)
manager = ConnectionManager()
@app.websocket("/ws/chat")
async def websocket_chat(websocket: WebSocket):
"""
WebSocket endpoint for bidirectional chat streaming.
Supports message cancellation and real-time interactions.
"""
await manager.connect(websocket)
try:
while True:
# Receive message from client
data = await websocket.receive_json()
if data.get("type") == "cancel":
# Client wants to cancel current generation
continue
prompt = data.get("prompt", "")
# Stream response back
stream = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}],
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
token = chunk.choices[0].delta.content
await websocket.send_json({
"type": "token",
"content": token
})
await websocket.send_json({"type": "done"})
except WebSocketDisconnect:
manager.disconnect(websocket)
The WebSocket approach adds complexity but enables features impossible with SSE. Cancellation requires the client to send a message while the server is streaming, something SSE cannot support since it only allows server-to-client communication.
class ChatWebSocket {
constructor(url) {
this.ws = new WebSocket(url);
this.onToken = null;
this.onComplete = null;
this.buffer = '';
this.ws.onmessage = (event) => {
const data = JSON.parse(event.data);
switch (data.type) {
case 'token':
this.buffer += data.content;
if (this.onToken) this.onToken(data.content, this.buffer);
break;
case 'done':
if (this.onComplete) this.onComplete(this.buffer);
this.buffer = '';
break;
}
};
}
send(prompt) {
this.buffer = '';
this.ws.send(JSON.stringify({ type: 'message', prompt }));
}
cancel() {
this.ws.send(JSON.stringify({ type: 'cancel' }));
}
}
// Usage
const chat = new ChatWebSocket('ws://localhost:8000/ws/chat');
chat.onToken = (token, fullText) => {
document.getElementById('output').textContent = fullText;
};
chat.send('Explain quantum computing');
// User clicks cancel button
document.getElementById('cancel-btn').onclick = () => chat.cancel();
Chunked transfer encoding works at the HTTP protocol level
For environments where SSE or WebSocket support is limited, HTTP chunked transfer encoding provides an alternative streaming mechanism. The server sends the response body in chunks without specifying the total content length upfront, allowing clients to process data as it arrives.
This approach works with any HTTP client that supports streaming responses. The tradeoff is less structure than SSE events, requiring custom parsing logic to handle partial data and message boundaries.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI
app = FastAPI()
client = OpenAI()
async def generate_chunks(prompt: str):
"""
Generator yielding raw text chunks without SSE formatting.
Simpler but requires client-side buffering logic.
"""
stream = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}],
stream=True
)
for chunk in stream:
if content := chunk.choices[0].delta.content:
yield content
@app.post("/chat/chunked")
async def chat_chunked(prompt: str):
"""
Returns a chunked response with Transfer-Encoding: chunked.
Each chunk contains one or more tokens as plain text.
"""
return StreamingResponse(
generate_chunks(prompt),
media_type="text/plain"
)
React components need careful state management for streaming updates
Frontend frameworks require specific patterns to render streaming content efficiently. Naive approaches that update state on every token cause excessive re-renders, degrading performance and creating visual jitter. Batching updates and using refs for the underlying text buffer solve these issues.
import { useState, useRef, useEffect, useCallback } from 'react';
function useStreamingResponse(endpoint) {
const [displayText, setDisplayText] = useState('');
const [isStreaming, setIsStreaming] = useState(false);
const bufferRef = useRef('');
const rafRef = useRef(null);
const updateDisplay = useCallback(() => {
setDisplayText(bufferRef.current);
rafRef.current = null;
}, []);
const appendToken = useCallback((token) => {
bufferRef.current += token;
// Batch updates using requestAnimationFrame
if (!rafRef.current) {
rafRef.current = requestAnimationFrame(updateDisplay);
}
}, [updateDisplay]);
const streamMessage = useCallback(async (prompt) => {
bufferRef.current = '';
setDisplayText('');
setIsStreaming(true);
try {
const response = await fetch(endpoint, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ prompt })
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
let sseBuffer = '';
while (true) {
const { done, value } = await reader.read();
if (done) break;
sseBuffer += decoder.decode(value, { stream: true });
const events = sseBuffer.split('\n\n');
sseBuffer = events.pop();
for (const event of events) {
if (event.startsWith('data: ')) {
const data = event.slice(6);
if (data === '[DONE]') continue;
const parsed = JSON.parse(data);
appendToken(parsed.token);
}
}
}
} finally {
setIsStreaming(false);
// Final update to ensure all buffered content is displayed
setDisplayText(bufferRef.current);
}
}, [endpoint, appendToken]);
useEffect(() => {
return () => {
if (rafRef.current) {
cancelAnimationFrame(rafRef.current);
}
};
}, []);
return { displayText, isStreaming, streamMessage };
}
function ChatComponent() {
const { displayText, isStreaming, streamMessage } = useStreamingResponse('/chat/stream');
const [input, setInput] = useState('');
const handleSubmit = (e) => {
e.preventDefault();
streamMessage(input);
setInput('');
};
return (
<div className="chat-container">
<div className="messages">
<div className="assistant-message">
{displayText}
{isStreaming && <span className="cursor">|</span>}
</div>
</div>
<form onSubmit={handleSubmit}>
<input
value={input}
onChange={(e) => setInput(e.target.value)}
disabled={isStreaming}
placeholder="Type your message..."
/>
<button type="submit" disabled={isStreaming}>
Send
</button>
</form>
</div>
);
}
The requestAnimationFrame batching limits DOM updates to the display refresh rate, typically 60 times per second. This prevents the performance degradation that occurs when updating state for every individual token, which can happen hundreds of times per second with fast inference.
Token buffering strategies affect readability and performance
Different buffering strategies create different user experiences. No buffering displays each token immediately, creating a typewriter effect but potentially choppy rendering. Word buffering accumulates tokens until a word boundary, then flushes the complete word. Sentence buffering waits for punctuation, producing smoother but more delayed output.
import re
from typing import AsyncIterator
class TokenBuffer:
"""
Buffers tokens and yields them according to different strategies.
Balances perceived latency against rendering smoothness.
"""
def __init__(self, strategy: str = "word"):
self.buffer = ""
self.strategy = strategy
def add(self, token: str) -> str | None:
"""
Add a token to the buffer.
Returns content to flush if a boundary is reached.
"""
self.buffer += token
if self.strategy == "none":
# No buffering - immediate flush
result = self.buffer
self.buffer = ""
return result
elif self.strategy == "word":
# Flush on whitespace
if re.search(r'\s$', self.buffer):
result = self.buffer
self.buffer = ""
return result
elif self.strategy == "sentence":
# Flush on sentence-ending punctuation
if re.search(r'[.!?]\s*$', self.buffer):
result = self.buffer
self.buffer = ""
return result
return None
def flush(self) -> str:
"""Return any remaining buffered content."""
result = self.buffer
self.buffer = ""
return result
async def buffered_stream(
token_stream: AsyncIterator[str],
strategy: str = "word"
) -> AsyncIterator[str]:
"""
Apply buffering strategy to a token stream.
Yields content chunks according to the chosen strategy.
"""
buffer = TokenBuffer(strategy)
async for token in token_stream:
if chunk := buffer.add(token):
yield chunk
# Flush remaining content
if final := buffer.flush():
yield final
The choice of buffering strategy depends on the use case. Chat interfaces typically use no buffering or word buffering for immediacy. Document generation might use sentence buffering for smoother reading. Code generation often benefits from line buffering that flushes on newlines.
Error handling requires graceful degradation during streaming
Network interruptions, rate limits, and server errors can occur mid-stream. Robust implementations detect these failures and provide meaningful feedback without losing already-received content.
from fastapi import HTTPException
from openai import OpenAI, APIError, RateLimitError
import asyncio
async def resilient_stream(prompt: str, max_retries: int = 3):
"""
Stream with retry logic and error handling.
Preserves partial output on failure and attempts recovery.
"""
client = OpenAI()
accumulated = ""
retry_count = 0
while retry_count < max_retries:
try:
stream = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}],
stream=True
)
for chunk in stream:
if content := chunk.choices[0].delta.content:
accumulated += content
yield {"type": "token", "content": content}
yield {"type": "done", "total": accumulated}
return
except RateLimitError as e:
retry_count += 1
wait_time = 2 ** retry_count # Exponential backoff
yield {
"type": "error",
"recoverable": True,
"message": f"Rate limited. Retrying in {wait_time}s...",
"partial": accumulated
}
await asyncio.sleep(wait_time)
except APIError as e:
yield {
"type": "error",
"recoverable": False,
"message": str(e),
"partial": accumulated
}
return
yield {
"type": "error",
"recoverable": False,
"message": "Max retries exceeded",
"partial": accumulated
}
Client-side error handling should preserve the partial response and offer retry options:
async function streamWithRecovery(prompt, outputElement) {
let partialContent = '';
try {
const response = await fetch('/chat/stream', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ prompt })
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
const { done, value } = await reader.read();
if (done) break;
const text = decoder.decode(value, { stream: true });
const events = parseSSE(text);
for (const event of events) {
switch (event.type) {
case 'token':
partialContent += event.content;
outputElement.textContent = partialContent;
break;
case 'error':
if (event.recoverable) {
showRetryMessage(event.message);
} else {
showError(event.message, partialContent);
}
break;
case 'done':
showComplete();
break;
}
}
}
} catch (networkError) {
// Connection lost mid-stream
showError(
'Connection lost. Your partial response has been preserved.',
partialContent
);
offerRetry(prompt, partialContent);
}
}
Backpressure handling prevents memory exhaustion on slow clients
When clients cannot consume tokens as fast as the server produces them, unhandled backpressure leads to memory growth. The server buffers unsent data, potentially exhausting memory under load. Proper flow control respects the client’s consumption rate.
from fastapi import FastAPI
from starlette.responses import StreamingResponse
from asyncio import Queue, wait_for, TimeoutError
import asyncio
class BackpressureAwareStream:
"""
Streaming response that respects client backpressure.
Implements a bounded queue with timeout for slow consumers.
"""
def __init__(self, max_buffer: int = 100, timeout: float = 30.0):
self.queue: Queue = Queue(maxsize=max_buffer)
self.timeout = timeout
self.done = False
async def put(self, item: str):
"""
Add item to queue, blocking if buffer is full.
Raises TimeoutError if client is too slow.
"""
try:
await wait_for(
self.queue.put(item),
timeout=self.timeout
)
except TimeoutError:
self.done = True
raise
async def complete(self):
"""Signal stream completion."""
await self.queue.put(None)
self.done = True
async def __aiter__(self):
"""Iterate over queued items."""
while True:
item = await self.queue.get()
if item is None:
break
yield item
async def produce_tokens(stream: BackpressureAwareStream, prompt: str):
"""
Producer coroutine that generates tokens and queues them.
Respects backpressure from the stream buffer.
"""
client = OpenAI()
try:
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}],
stream=True
)
for chunk in response:
if content := chunk.choices[0].delta.content:
await stream.put(f"data: {content}\n\n")
await stream.complete()
except TimeoutError:
# Client too slow, abort generation
pass
@app.post("/chat/backpressure")
async def chat_with_backpressure(prompt: str):
"""
Endpoint with backpressure-aware streaming.
Protects server memory from slow clients.
"""
stream = BackpressureAwareStream(max_buffer=50, timeout=30.0)
# Start producer in background
asyncio.create_task(produce_tokens(stream, prompt))
return StreamingResponse(
stream,
media_type="text/event-stream"
)
Load balancing streaming connections requires sticky sessions
HTTP load balancers distribute requests across backend servers. For streaming, this creates a challenge: the initial request and subsequent chunks must reach the same server. Without session affinity, chunks from different servers interleave, corrupting the response.
WebSocket connections naturally maintain server affinity since the connection persists. SSE over HTTP requires explicit configuration. Most load balancers support cookie-based or IP-based session affinity that keeps streaming requests on their originating server.
For Kubernetes deployments, the session affinity annotation ensures pods receive all chunks from connections they initiate:
apiVersion: v1
kind: Service
metadata:
name: llm-streaming-service
spec:
selector:
app: llm-server
ports:
- port: 80
targetPort: 8000
sessionAffinity: ClientIP
sessionAffinityConfig:
clientIP:
timeoutSeconds: 3600 # 1 hour session stickiness
Monitoring streaming endpoints differs from standard request metrics
Traditional request duration metrics lose meaning for streaming endpoints. A 30-second stream is not a slow request but a successful long-running response. Monitoring should track time to first token, tokens per second throughput, and completion rate rather than total duration.
import time
from prometheus_client import Histogram, Counter, Gauge
# Metrics definitions
time_to_first_token = Histogram(
'llm_time_to_first_token_seconds',
'Time from request to first token',
buckets=[0.1, 0.2, 0.5, 1.0, 2.0, 5.0]
)
tokens_per_second = Histogram(
'llm_tokens_per_second',
'Token generation throughput',
buckets=[10, 25, 50, 100, 200]
)
active_streams = Gauge(
'llm_active_streams',
'Currently active streaming connections'
)
stream_completion = Counter(
'llm_stream_completions_total',
'Total stream completions',
['status'] # success, error, cancelled
)
class MeteredStream:
"""
Wrapper that instruments a token stream with Prometheus metrics.
"""
def __init__(self, token_generator):
self.generator = token_generator
self.start_time = None
self.first_token_time = None
self.token_count = 0
async def __aiter__(self):
self.start_time = time.time()
active_streams.inc()
try:
async for token in self.generator:
if self.first_token_time is None:
self.first_token_time = time.time()
ttft = self.first_token_time - self.start_time
time_to_first_token.observe(ttft)
self.token_count += 1
yield token
# Calculate throughput
duration = time.time() - self.start_time
if duration > 0:
tps = self.token_count / duration
tokens_per_second.observe(tps)
stream_completion.labels(status='success').inc()
except Exception:
stream_completion.labels(status='error').inc()
raise
finally:
active_streams.dec()
Moving forward with streaming implementations
Start simple with SSE for unidirectional streaming before adding WebSocket complexity. Most LLM applications only need server-to-client streaming, and SSE handles this case with minimal code. Reserve WebSockets for applications that genuinely require bidirectional communication or mid-stream cancellation.
Test streaming behavior under realistic conditions. Slow clients, network interruptions, and high concurrency expose issues that unit tests miss. Load testing with tools like k6 or Locust can simulate streaming connections and measure time to first token under load.
Consider the full path from LLM provider to end user. Each hop introduces potential buffering: the LLM API, your backend server, reverse proxies, CDNs, and the client browser. Audit each layer to ensure streaming actually streams end-to-end. A single buffering component negates the benefits of token-by-token delivery.