[Architecture] Blocking Workflow and Lack of Streaming Architecture Severely Impacts Time-to-First-Token (TTFT)

After integrating Parlant into a production-like environment and benchmarking it against standard LangChain/LangGraph implementations, we found that the current Parlant architecture imposes a significant "architectural tax" on latency. The workflow appears to be designed as a synchronous, blocking pipeline that prioritizes internal state consistency over responsiveness.

  **Architectural Analysis (The Inefficiency)**
  The current event loop seems to follow a strictly serial execution path that prevents any form of
  "perceived speed" optimization:
   **1. Strict "Think-Then-Speak" Pattern:**
      The framework appears to enforce a rigid sequence:
      Ingestion -> Context/Guideline Retrieval -> Planning -> Tool Execution -> Final Response Generation.
       * The Issue: The API does not expose the output of Final Response Generation until the entire
         message is complete. Modern agentic UX needs "Think-While-Speaking", or at least direct access
         to the LLM's token stream as soon as the "Planning" phase determines that a response is needed.

   **2. Opaque "Black Box" Latency:**
      Between the user sending a message and the message event being emitted, the server does heavy
      work (Glossary matching, Guideline checking) that is invisible to the client. For simple queries
      (e.g., "Hello"), this overhead is disproportionately high: roughly 3 s through Parlant versus
      roughly 500 ms for a direct LLM call.

   **3. Polling vs. Push Protocol:**
      Reliance on HTTP Long Polling (/events?wait_for_data=60) is inefficient for conversational AI. It
      introduces unnecessary network Round-Trip Time (RTT) and connection management overhead compared
      to a persistent Server-Sent Events (SSE) or WebSocket stream, which are the de facto standard for
      LLM applications.
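
To make the push alternative in point 3 concrete, here is a minimal sketch of what SSE framing for token deltas could look like on the wire. The event names (`delta`, `done`) and the payload fields (`index`, `text`) are assumptions made for illustration; they are not part of Parlant's current API. Only the `event:` / `data:` framing itself comes from the SSE format.

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple


def sse_frame(event: str, data: Dict) -> str:
    """Encode one SSE frame: an event name, one JSON data line, then a blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_reply(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one hypothetical 'delta' frame per token, then a final 'done' frame."""
    for index, token in enumerate(tokens):
        yield sse_frame("delta", {"index": index, "text": token})
    yield sse_frame("done", {})


def parse_sse(raw: str) -> List[Tuple[str, Dict]]:
    """Minimal client-side parser for the frames above (not a complete SSE parser)."""
    events = []
    for frame in raw.split("\n\n"):
        if not frame.strip():
            continue
        # Each line is "field: value"; split only on the first separator.
        fields = dict(line.split(": ", 1) for line in frame.split("\n"))
        events.append((fields["event"], json.loads(fields["data"])))
    return events
```

With framing like this, the client can render each `delta` as it arrives over a single long-lived connection instead of issuing a new long-poll request per batch of events.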

  **Impact**
   * High Latency (TTFT): Time-To-First-Token is consistently 3-5 seconds or more, regardless of query complexity.
   * Poor UX: Users receive no feedback while the agent performs internal "Guideline Checks" or "Context
     Updates."
   * Unusable for Real-Time: The architecture is currently unsuitable for Voice-to-Voice or real-time chat
     applications where sub-1000ms latency is required.
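
For anyone who wants to reproduce the TTFT gap, the following is a rough sketch of the kind of harness that can be used. Both pipelines here are simulated with `asyncio.sleep`; the delay constants and the sample reply are placeholders rather than measurements, and neither function calls Parlant. The point is only to show how the time to the first visible chunk differs between a reply emitted as one completed message and a reply emitted token by token.

```python
import asyncio
import time
from typing import AsyncIterator, Tuple

PREPROCESS_S = 0.05  # placeholder for retrieval/guideline matching, not a measured value
TOKEN_S = 0.02       # placeholder for per-token generation time, not a measured value
REPLY = ["Hello", ",", " how", " can", " I", " help", "?"]


async def blocking_pipeline() -> AsyncIterator[str]:
    """Simulated blocking flow: the client sees nothing until the full message exists."""
    await asyncio.sleep(PREPROCESS_S)
    parts = []
    for token in REPLY:
        await asyncio.sleep(TOKEN_S)
        parts.append(token)
    yield "".join(parts)


async def streaming_pipeline() -> AsyncIterator[str]:
    """Simulated streaming flow: each token is emitted as soon as it is generated."""
    await asyncio.sleep(PREPROCESS_S)
    for token in REPLY:
        await asyncio.sleep(TOKEN_S)
        yield token


async def measure(stream: AsyncIterator[str]) -> Tuple[float, float, str]:
    """Return (time to first chunk, total time, full text) for any async stream."""
    start = time.perf_counter()
    first = None
    chunks = []
    async for chunk in stream:
        if first is None:
            first = time.perf_counter() - start
        chunks.append(chunk)
    total = time.perf_counter() - start
    return (first if first is not None else total), total, "".join(chunks)


if __name__ == "__main__":
    for name, factory in [("blocking", blocking_pipeline), ("streaming", streaming_pipeline)]:
        ttft, total, _ = asyncio.run(measure(factory()))
        print(f"{name}: ttft={ttft:.3f}s total={total:.3f}s")
```

The same `measure` helper can wrap a real client stream, as long as it is exposed as an async iterator of text chunks.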

  **Suggested Improvements**
   1. Implement Server-Sent Events (SSE): Move away from Long Polling. Provide an endpoint that streams LLM
      tokens (delta events) directly to the client as they are generated, bypassing the requirement to wait
      for the full message object.
   2. Asynchronous Pipeline: Decouple the "Guideline/Safety" checks from the initial token generation where
      possible, or optimize the pipeline to allow "Optimistic Responses."
   3. Expose Intermediate Thought States: Native support for streaming the "Planning" or "Reasoning" tokens
      to the client so the UI can render the agent's "Thought Process" (similar to OpenAI's reasoning
      tokens) to mitigate perceived latency.
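
As one possible reading of suggestion 2, the sketch below overlaps a guideline check with token generation instead of running them back to back. `guideline_check` and `generate` are hypothetical stand-ins, not Parlant functions, and the delays are arbitrary. This variant is conservative: tokens are held back until the check passes, so nothing unchecked reaches the client. The saving comes from generating during the check, not from emitting before it.

```python
import asyncio
from typing import AsyncIterator, List


async def guideline_check(message: str) -> bool:
    """Hypothetical stand-in for a guideline/safety check."""
    await asyncio.sleep(0.03)
    return "forbidden" not in message


async def generate(message: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for an LLM token stream."""
    for token in ["Sure", ",", " here", " you", " go", "."]:
        await asyncio.sleep(0.01)
        yield token


async def optimistic_reply(message: str) -> AsyncIterator[str]:
    """Start generating at once; release tokens only after the check has passed."""
    check = asyncio.create_task(guideline_check(message))
    held: List[str] = []
    try:
        async for token in generate(message):
            if not check.done():
                held.append(token)  # check still running: buffer instead of emitting
                continue
            if not check.result():
                return              # check failed: drop the draft, emit nothing
            for early in held:      # check passed: flush the backlog in order
                yield early
            held.clear()
            yield token
        # Generation finished first (or together with the check): wait, then flush.
        if await check:
            for early in held:
                yield early
    finally:
        check.cancel()  # no-op if the check already completed


async def collect(message: str) -> str:
    return "".join([token async for token in optimistic_reply(message)])
```

For example, `asyncio.run(collect("hi"))` returns the full reply, while a message that fails the check yields an empty string and leaves the fallback behavior to the caller.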

  **Conclusion**
  While the Guideline and Glossary features are powerful, the current execution model acts as a "Slow Proxy" that negates the benefits of fast inference providers and models (such as Groq or GPT-4o-mini). A refactor towards a streaming-first architecture is critical for production adoption.

[Architecture] Blocking Workflow and Lack of Streaming Architecture Severely Impacts Time-to-First-Token (TTFT) #706
