[Architecture] Blocking Workflow and Lack of Streaming Architecture Severely Impacts Time-to-First-Token (TTFT) #706

@QuocAnhh

Description

After integrating Parlant into a production-like environment and benchmarking it against standard
LangChain/LangGraph implementations, we have identified that the current Parlant architecture imposes a significant "Architectural Tax" on latency. The current workflow appears to be designed as a synchronous, blocking pipeline that prioritizes internal state consistency over user experience (latency).

Architectural Analysis (The Inefficiency)
The current event loop seems to follow a strictly serial execution path that prevents any form of
"perceived speed" optimization:
1. Strict "Think-Then-Speak" Pattern:
The framework appears to enforce a rigid sequence:
Ingestion -> Context/Guideline Retrieval -> Planning -> Tool Execution -> Final Response Generation.
* The Issue: The API does not expose the Final Response Generation stream until the entire block is
complete. In modern Agentic UX, we need "Think-While-Speaking" or at least direct access to the
LLM's token stream as soon as the "Planning" phase determines a response is needed.

2. Opaque "Black Box" Latency:
Between the user sending a message and the message event being emitted, the server performs heavy
lifting (Glossary matching, Guideline checking) that is opaque to the client. For simple queries (e.g.,
"Hello"), this overhead is disproportionately high compared to a direct LLM call (~3s vs ~500ms).

3. Polling vs. Push Protocol:
Reliance on HTTP long polling (/events?wait_for_data=60) is inefficient for conversational AI. Each batch of events requires a fresh request, which adds network round-trip time (RTT) and connection management overhead compared to a persistent Server-Sent Events (SSE) or WebSocket stream, which is the industry standard for LLM applications.
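
To make the serial-path argument concrete, here is a minimal Python sketch of how TTFT differs between a pipeline that releases the message only after generation completes and one that streams tokens as soon as generation starts. The stage names follow the sequence listed above. The class, function names, and all durations are hypothetical placeholders chosen for illustration, not measurements of Parlant or part of its API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineTimings:
    """Hypothetical per-stage durations in milliseconds (illustrative only)."""

    ingestion_ms: int
    retrieval_ms: int        # context/guideline retrieval
    planning_ms: int
    tool_execution_ms: int
    first_token_ms: int      # LLM latency until the first token is produced
    per_token_ms: int        # time for each subsequent token
    response_tokens: int

    def pre_generation_ms(self) -> int:
        # Everything that must finish before the LLM starts writing the reply.
        return (
            self.ingestion_ms
            + self.retrieval_ms
            + self.planning_ms
            + self.tool_execution_ms
        )

    def generation_ms(self) -> int:
        # First token plus the remaining tokens of the reply.
        return self.first_token_ms + self.per_token_ms * (self.response_tokens - 1)


def ttft_blocking(t: PipelineTimings) -> int:
    """Client sees nothing until the full message object is complete."""
    return t.pre_generation_ms() + t.generation_ms()


def ttft_streaming(t: PipelineTimings) -> int:
    """Client sees the first token as soon as the LLM emits it."""
    return t.pre_generation_ms() + t.first_token_ms


# Made-up numbers, purely to show the shape of the difference.
example = PipelineTimings(
    ingestion_ms=50,
    retrieval_ms=400,
    planning_ms=800,
    tool_execution_ms=0,
    first_token_ms=300,
    per_token_ms=20,
    response_tokens=51,
)

print(ttft_blocking(example))   # 2550
print(ttft_streaming(example))  # 1550
```

In this simplified model, streaming removes only the generation wait from TTFT. The pre-generation stages still count in full, so the retrieval and planning overhead would need to shrink or overlap with generation to get below a strict latency budget.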

Impact

  • High Latency (TTFT): Time-To-First-Token is consistently in the 3-5 second range or higher, regardless of query complexity.
  • Poor UX: Users receive no feedback while the agent performs internal "Guideline Checks" or "Context
    Updates."
  • Unusable for Real-Time: The architecture is currently unsuitable for Voice-to-Voice or real-time chat
    applications where sub-1000ms latency is required.

Suggested Improvements

  1. Implement Server-Sent Events (SSE): Move away from Long Polling. Provide an endpoint that streams LLM
    tokens (delta events) directly to the client as they are generated, bypassing the requirement to wait
    for the full message object.
  2. Asynchronous Pipeline: Decouple the "Guideline/Safety" checks from the initial token generation where
    possible, or optimize the pipeline to allow "Optimistic Responses."
  3. Expose Intermediate Thought States: Native support for streaming the "Planning" or "Reasoning" tokens
    to the client so the UI can render the agent's "Thought Process" (similar to OpenAI's reasoning
    tokens) to mitigate perceived latency.
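
As a rough illustration of the first suggestion, the sketch below frames token deltas as Server-Sent Events using only the Python standard library. The event names (`delta`, `done`), the payload shape, and the helper functions are assumptions made for this example and do not correspond to any existing Parlant endpoint. The token source is a stand-in for a real LLM stream.

```python
import asyncio
import json
from typing import AsyncIterator, Iterable, List


def format_sse(event: str, data: dict) -> str:
    """Encode one SSE frame: an event name, a data line, and a blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def fake_llm_tokens(tokens: Iterable[str]) -> AsyncIterator[str]:
    """Stand-in for an LLM token stream (hypothetical, no network calls)."""
    for token in tokens:
        await asyncio.sleep(0)  # yield control, as a real stream would
        yield token


async def stream_deltas(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Emit each token as its own frame instead of waiting for the full message."""
    async for token in tokens:
        yield format_sse("delta", {"text": token})
    # Signal completion so the client can close the stream.
    yield format_sse("done", {})


async def collect(tokens: Iterable[str]) -> List[str]:
    return [frame async for frame in stream_deltas(fake_llm_tokens(tokens))]


frames = asyncio.run(collect(["Hel", "lo"]))
for frame in frames:
    print(frame, end="")
```

A web framework would write each yielded frame to a response with the `text/event-stream` content type. The client can then render the first `delta` immediately rather than waiting for the final message event.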

Conclusion
While the Guideline and Glossary features are powerful, the current execution model acts as a "Slow Proxy" that negates the benefits of fast LLMs (like Groq or GPT-4o-mini). A refactor towards a streaming-first architecture is critical for production adoption.
