Real-time token streaming via Server-Sent Events, replacing the blocking ainvoke() pattern with progressive response delivery
Phase 15 replaces the blocking graph.ainvoke() call in the chat endpoint with Server-Sent Events (SSE) streaming using graph.astream(). Previously, the chat endpoint waited 5-15 seconds for the full LangGraph pipeline to complete before returning any response. During this time, users saw only a loading indicator with no feedback about what was happening. This phase introduces real-time token-by-token streaming — the same interaction pattern used by ChatGPT, Claude, and other modern AI chat interfaces.
Business Value: Perceived latency drops from 5-15 seconds to the time to first token. Users are expected to see the first token within ~200ms of submitting their message (the success criteria set 500ms as the upper bound), transforming the experience from "waiting for a wall of text" to "watching the tutor think and respond in real time."
The core decision is to use the browser's native fetch() API with ReadableStream to consume a text/event-stream response from a new streaming endpoint. This was chosen over three alternatives:
HTMX has a built-in SSE extension (hx-ext="sse"), but by default it replaces the target element's content on each event rather than appending tokens. For chat streaming, we need to append each token to a growing response bubble. The HTMX SSE model is designed for discrete updates (e.g., notifications, status changes), not for progressive text accumulation. Workarounds involving innerHTML += on each event would fight against HTMX's declarative model and introduce brittleness.
WebSockets provide full-duplex communication, but chat streaming is fundamentally unidirectional — the server streams tokens to the client. WebSockets add connection management complexity (heartbeats, reconnection, multiplexing), require a different server infrastructure path (upgrade handshake, persistent connections), and are overkill for a request-response pattern where the client sends a message and receives a streamed reply.
The native EventSource API is purpose-built for SSE, but it only supports GET requests. Chat messages must be sent via POST with form data (message text, language, level, conversation context). There is no clean way to encode this as query parameters, especially as conversation history grows. Using EventSource would require either a two-step flow (POST the message first, then GET the stream) or encoding large payloads in URLs, both of which add unnecessary complexity.
The key technical insight enabling this feature: LangGraph's astream(stream_mode=["messages", "updates"]) intercepts LLM callbacks automatically, even when graph nodes use ainvoke() internally. This means no changes were needed to the LLM configuration, node code, or prompt setup. The existing respond_node calls llm.ainvoke(messages) as before; LangGraph's streaming infrastructure hooks into the model's callback system to emit tokens as they arrive. One caveat from the LangGraph documentation: on Python versions earlier than 3.11, async nodes must pass the RunnableConfig through to the model call explicitly for token streaming to work.
Two stream modes are used simultaneously:
- messages: Emits individual LLM tokens as AIMessageChunk objects during node execution. These are the tokens streamed to the client in real time.
- updates: Emits the final state update when each node completes. These are used to detect when post-LLM nodes (scaffold, grammar, pronunciation) finish and to extract their results from the graph state.
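As a rough illustration of how the two modes could be told apart on the server, the sketch below routes one (mode, chunk) pair into SSE-ready (event, payload) pairs. It relies on LangGraph's documented shapes: in messages mode each chunk is a (message_chunk, metadata) tuple, and in updates mode each chunk is a dict keyed by node name. The NODE_EVENTS mapping, the token_node default of "respond", and the fake chunks are assumptions made for the example, not names confirmed by this codebase.

```python
from types import SimpleNamespace

# Hypothetical mapping from graph node names to SSE event types.
NODE_EVENTS = {
    "scaffold": "scaffolding",
    "grammar": "grammar",
    "pronunciation": "pronunciation",
}


def route_stream_item(mode, chunk, token_node="respond"):
    """Turn one (mode, chunk) pair from astream() into (event, payload) pairs."""
    if mode == "messages":
        # messages mode yields (message_chunk, metadata) tuples.
        message_chunk, metadata = chunk
        text = getattr(message_chunk, "content", "")
        # Forward only non-empty text produced by the response node, so tokens
        # from any other LLM-backed node do not leak into the chat bubble.
        if isinstance(text, str) and text and metadata.get("langgraph_node") == token_node:
            return [("token", {"content": text})]
        return []
    if mode == "updates":
        # updates mode yields {node_name: state_update} once a node finishes.
        return [
            (NODE_EVENTS[node], update)
            for node, update in chunk.items()
            if node in NODE_EVENTS and update
        ]
    return []


# Stand-in for graph.astream() output, using fake chunks instead of AIMessageChunk.
fake_stream = [
    ("messages", (SimpleNamespace(content="Hola"), {"langgraph_node": "respond"})),
    ("messages", (SimpleNamespace(content=" amigo"), {"langgraph_node": "respond"})),
    ("updates", {"respond": {"messages": ["..."]}}),
    ("updates", {"scaffold": {"hint": "amigo = friend"}}),
]
events = [event for mode, chunk in fake_stream for event in route_stream_item(mode, chunk)]
```

Filtering on the langgraph_node metadata key is a design choice worth considering if any of the post-LLM nodes also call a model, since messages mode reports tokens from every LLM call in the graph.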
The server emits a structured sequence of SSE event types:
event: token
data: {"content": "Hola"}
event: token
data: {"content": " amigo"}
event: token
data: {"content": "!"}
event: response_complete
data: {"message_id": "msg_123"}
event: scaffolding
data: {"html": "<div class=\"scaffold\">...</div>"}
event: grammar
data: {"html": "<div class=\"grammar-feedback\">...</div>"}
event: pronunciation
data: {"html": "<div class=\"pronunciation-tips\">...</div>"}
event: done
data: {}
Protocol design rationale:
- token events carry individual text chunks for progressive display
- response_complete signals the LLM has finished generating, allowing the client to finalize the response bubble
- scaffolding, grammar, and pronunciation events carry server-rendered HTML fragments produced by the existing Jinja2 partials. This reuses the same templates from the non-streaming path, avoiding template duplication.
- done is the terminal event, signaling the client to clean up (re-enable input, hide loading state)
- Events after response_complete are optional: they only fire if the corresponding graph nodes produce output (e.g., scaffolding only appears for A0-A1 levels)
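The wire format above is simple enough to produce with one small helper. The following is a minimal sketch, assuming a hypothetical format_sse function (not part of the existing codebase) that serializes an event name and a JSON payload into one SSE frame.

```python
import json


def format_sse(event: str, data: dict) -> str:
    """Serialize one SSE frame: an event line, a data line, and a blank line."""
    # json.dumps escapes newlines inside strings, so the payload always fits
    # on a single data: line even when it carries multi-line HTML.
    payload = json.dumps(data, ensure_ascii=False)
    return f"event: {event}\ndata: {payload}\n\n"


frames = [
    format_sse("token", {"content": "Hola"}),
    format_sse("response_complete", {"message_id": "msg_123"}),
    format_sse("done", {}),
]
print("".join(frames), end="")
```

Keeping the payload on one data: line means the client never has to reassemble multi-line data fields for events this server emits, although a robust parser should still tolerate them.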
Post-response enrichments (scaffolding, grammar feedback, pronunciation tips) are sent as pre-rendered HTML fragments rather than raw JSON data. This reuses the existing Jinja2 partials (scaffold.html, grammar_feedback.html, pronunciation_tips.html) and keeps the client-side JavaScript simple — it just inserts HTML into the DOM. This follows the HTMX philosophy of server-rendered HTML, even though the streaming transport itself is JavaScript-driven.
The original POST /chat endpoint is preserved alongside the new POST /chat/stream endpoint:
- POST /chat: Original blocking endpoint using graph.ainvoke(). Returns a complete HTML fragment via HTMX. Serves as a fallback for clients without JavaScript or for testing.
- POST /chat/stream: New streaming endpoint using graph.astream(). Returns text/event-stream consumed by stream.js.
Both endpoints accept the same form data and produce the same final result. The streaming endpoint yields tokens progressively, while the original returns everything at once.
A new JavaScript module handles the client-side streaming logic:
async function streamChat(formData) {
  const response = await fetch("/chat/stream", {
    method: "POST",
    body: formData,
  });
  // Fail fast on HTTP errors or a missing body instead of reading an empty stream
  if (!response.ok || !response.body) {
    throw new Error(`Stream request failed with status ${response.status}`);
  }
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  // Create empty response bubble in chat container
  const bubble = createResponseBubble();
  // Read SSE stream
  let buffer = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries
    buffer += decoder.decode(value, { stream: true });
    // Parse SSE events from buffer
    // Dispatch to handlers: appendToken, insertHTML, finalize
  }
}

Responsibilities:
- Intercept the chat form submission (prevent default HTMX behavior for the streaming path)
- Create an empty AI response bubble in the chat container
- Open a fetch() stream to /chat/stream
- Parse the SSE text protocol from the ReadableStream
- Append tokens to the response bubble as they arrive
- Insert server-rendered HTML fragments for scaffolding, grammar, and pronunciation
- Re-enable the input form and hide loading state on done
- Handle errors and connection failures gracefully
1. User submits message
2. JavaScript shows user message optimistically (existing behavior)
3. JavaScript creates empty AI response bubble with cursor animation
4. fetch() POST to /chat/stream with form data
5. Server begins LangGraph astream()
6. Tokens arrive → append to bubble text
7. response_complete → finalize bubble, stop cursor animation
8. scaffolding/grammar/pronunciation → insert HTML below bubble
9. done → re-enable input, scroll to bottom
import json

from fastapi import Form, Request
from fastapi.responses import StreamingResponse

# router (the APIRouter) and graph (the compiled LangGraph graph) are defined
# elsewhere in this module.


@router.post("/chat/stream")
async def stream_chat(
    request: Request,
    message: str = Form(...),
    level: str = Form("A1"),
    language: str = Form("es"),
) -> StreamingResponse:
    """Stream chat response as Server-Sent Events."""
    # input_state is built from the form fields; its construction is not shown here.

    async def event_generator():
        async for stream_mode, chunk in graph.astream(
            input_state,
            stream_mode=["messages", "updates"],
        ):
            if stream_mode == "messages":
                # In "messages" mode each chunk is a (message_chunk, metadata) tuple
                message_chunk, _metadata = chunk
                # LLM token: emit as SSE token event
                if getattr(message_chunk, "content", None):
                    payload = json.dumps({"content": message_chunk.content})
                    yield f"event: token\ndata: {payload}\n\n"
            elif stream_mode == "updates":
                # Node completed: check for post-response data
                # Render HTML partials and emit as typed events
                ...
        yield "event: done\ndata: {}\n\n"

    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no",
        },
    )

Key headers:
- Cache-Control: no-cache prevents browsers and intermediaries from caching the stream
- X-Accel-Buffering: no disables Nginx response buffering if deployed behind a reverse proxy
No new dependencies. The implementation uses:
- langgraph (existing): astream() with stream_mode parameter
- fastapi (existing): StreamingResponse for SSE
- Browser fetch() + ReadableStream APIs (native, no polyfill needed for modern browsers)
- src/api/routes/chat.py: New /chat/stream endpoint alongside existing /chat
- src/static/js/stream.js: New client-side streaming module
- src/templates/chat.html: Updated to load stream.js and wire up form submission
- Existing Jinja2 partials: Reused for server-rendering HTML fragments in SSE events
- Real-time UX: First token visible within ~200ms, matching user expectations set by ChatGPT and Claude
- No LLM changes needed: astream(stream_mode=["messages", "updates"]) intercepts callbacks automatically, so existing node code using ainvoke() works without modification
- Server-rendered HTML reuse: Post-response enrichments (scaffolding, grammar, pronunciation) use the same Jinja2 partials as the non-streaming path
- Graceful degradation: The original /chat endpoint remains functional as a fallback
- Progressive enhancement: The streaming experience layers on top of the existing HTMX architecture without replacing it
- Additional JavaScript complexity: stream.js introduces imperative DOM manipulation alongside the declarative HTMX patterns used elsewhere in the app. This is a necessary trade-off, because HTMX's SSE extension is not designed for token-by-token appending.
- Two chat endpoints to maintain: /chat and /chat/stream accept the same inputs but have different response mechanisms. Changes to chat logic (e.g., new form fields, authentication) must be applied to both endpoints.
- SSE parsing complexity: The client must parse the SSE text protocol manually from the ReadableStream, handling buffering, multi-line data fields, and partial reads. This is straightforward but adds code that would be unnecessary with EventSource (if it supported POST).
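The buffering logic that the last point describes can be sketched compactly. The version below is written in Python, where it could also serve as a helper for collecting events in test_chat_stream.py; stream.js would need equivalent logic in JavaScript. The SSEParser class is hypothetical, handles only event: and data: fields, and assumes the server uses \n line endings and JSON payloads, as the protocol in this document does.

```python
import json
from dataclasses import dataclass


@dataclass
class SSEParser:
    """Incremental parser for the subset of SSE used by /chat/stream."""

    buffer: str = ""

    def feed(self, text: str) -> list:
        """Add decoded text and return every (event, payload) completed so far."""
        self.buffer += text
        events = []
        # A blank line terminates a frame; whatever follows the last blank
        # line is incomplete and stays buffered until the next read.
        while "\n\n" in self.buffer:
            frame, self.buffer = self.buffer.split("\n\n", 1)
            event_type = "message"  # SSE default when no event: line is present
            data_lines = []
            for line in frame.split("\n"):
                if line.startswith("event:"):
                    event_type = line[len("event:"):].strip()
                elif line.startswith("data:"):
                    value = line[len("data:"):]
                    # The format allows one optional space after the colon.
                    data_lines.append(value[1:] if value.startswith(" ") else value)
            if data_lines:
                # Multiple data: lines are joined with newlines before decoding.
                events.append((event_type, json.loads("\n".join(data_lines))))
        return events


# Simulate a network read that splits a frame in the middle of its payload.
parser = SSEParser()
first = parser.feed('event: token\ndata: {"con')
second = parser.feed('tent": "Hola"}\n\nevent: done\ndata: {}\n\n')
```

Here the first call returns no events because the frame is incomplete, and the second call returns both the token and the done event once the terminating blank lines arrive.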
- test_chat_stream.py: Test the streaming endpoint
  - Verify SSE event format (correct event: and data: lines)
  - Verify token events contain content
  - Verify response_complete fires after last token
  - Verify done is always the final event
  - Verify post-response events (scaffolding, grammar, pronunciation) contain valid HTML
  - Verify A2-B1 levels skip scaffolding event
- Full stream consumption: submit message, collect all events, verify final state matches non-streaming endpoint output
- Error handling: verify stream closes cleanly on graph errors
- Verify token-by-token rendering in browser at all CEFR levels
- Verify scaffolding/grammar/pronunciation appear after response completes
- Verify input re-enables after done event
- Verify cursor animation during streaming
- Test on mobile (iOS Safari, Android Chrome) for streaming support
- Tokens stream to the browser as the LLM generates them
- Post-response enrichments appear after the response completes
- The existing /chat endpoint continues to work unchanged
- All CEFR levels and languages work with streaming
- First token visible within 500ms of form submission
- No perceptible delay between consecutive tokens
- Total response time equal to or better than blocking endpoint
- SSE event protocol followed correctly (token → response_complete → enrichments → done)
- Stream closes cleanly on completion and on error
- No memory leaks from unclosed streams or readers
- All existing tests pass unchanged
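The event ordering named in the criteria above (token → response_complete → enrichments → done) can be checked mechanically once a test has collected the event types from a stream. The following is a sketch under the assumption of a hypothetical check_event_order helper; it covers the success path only and does not model how errors are reported mid-stream.

```python
ENRICHMENTS = {"scaffolding", "grammar", "pronunciation"}


def check_event_order(event_types: list) -> list:
    """Return protocol violations; an empty list means the order is valid."""
    problems = []
    if not event_types or event_types[-1] != "done":
        problems.append("stream must end with 'done'")
    if event_types.count("done") > 1:
        problems.append("'done' must appear only once")
    if event_types.count("response_complete") != 1:
        problems.append("'response_complete' must appear exactly once")
        return problems  # the remaining checks need a single boundary
    boundary = event_types.index("response_complete")
    # Everything before the boundary must be a token event.
    if any(t != "token" for t in event_types[:boundary]):
        problems.append("only 'token' events may precede 'response_complete'")
    # Everything after it, apart from the terminal done, must be an enrichment.
    tail = [t for t in event_types[boundary + 1:] if t != "done"]
    unexpected = [t for t in tail if t not in ENRICHMENTS]
    if unexpected:
        problems.append(f"unexpected events after 'response_complete': {unexpected}")
    return problems


valid = check_event_order(["token", "token", "response_complete", "scaffolding", "done"])
invalid = check_event_order(["token", "response_complete", "token", "done"])
```

Returning a list of problems rather than a boolean makes test failures easier to read, since the assertion message can show every violation at once.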