
Agent Pipeline Profiling and Latency Optimization #611

@itomek

Description


Problem

The agent pipeline has multiple stages between receiving a user prompt and returning a response: library initialization, RAG lookup, prompt construction, LLM inference, tool execution, and response formatting. There is currently no instrumentation to identify which of these stages are bottlenecks, so optimization has been ad hoc (e.g., the lazy-loading fix) rather than data-driven.

Proposed Changes

1. Pipeline Stage Profiling

Add timing instrumentation at each stage of the agent pipeline:

  • Session initialization (library loading, model warm-up)
  • Document context retrieval (RAG pipeline)
  • System prompt + context assembly
  • LLM inference (first token, total generation)
  • Tool selection and execution (per tool)
  • Response formatting and streaming

Each stage should emit a structured timing record with:

{
  "stage": "rag_retrieval",
  "start_ms": 1234,
  "end_ms": 5678,
  "duration_ms": 4444,
  "metadata": {"documents_searched": 3, "chunks_retrieved": 12}
}
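The record above could be modeled with a small dataclass so that `duration_ms` is always derived from the start/end timestamps rather than stored independently. This is a minimal sketch; the `StageTiming` name and `to_record()` helper are illustrative, not part of the existing codebase:

```python
from dataclasses import dataclass, field, asdict


@dataclass
class StageTiming:
    """One structured timing record for a single pipeline stage."""
    stage: str
    start_ms: int
    end_ms: int
    metadata: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> int:
        # Derived, so it can never disagree with start/end.
        return self.end_ms - self.start_ms

    def to_record(self) -> dict:
        rec = asdict(self)
        rec["duration_ms"] = self.duration_ms
        return rec


record = StageTiming(
    "rag_retrieval", 1234, 5678,
    {"documents_searched": 3, "chunks_retrieved": 12},
).to_record()
# record matches the JSON shape shown above, with duration_ms == 4444
```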

2. Profile Trace API

  • Expose pipeline traces via a new API endpoint: GET /api/system/profile-trace?session_id=X
  • Returns the full pipeline trace for a given message/session
  • Can be consumed by the frontend for a developer/debug view or by external tools
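A consumer of the endpoint (frontend debug view or an external tool) only needs the list of stage records to rank bottlenecks. The response shape below is an assumption based on the record format in section 1, not a finalized API contract:

```python
# Hypothetical response body from
# GET /api/system/profile-trace?session_id=X
trace = {
    "session_id": "abc123",
    "stages": [
        {"stage": "session_init", "duration_ms": 120},
        {"stage": "rag_retrieval", "duration_ms": 4444},
        {"stage": "llm_inference", "duration_ms": 2100},
    ],
}

# Find the single slowest stage in this trace.
slowest = max(trace["stages"], key=lambda s: s["duration_ms"])
```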

3. MCP Integration for Automated Profiling

  • Expose profiling data via MCP so external agents can:
    • Query current pipeline performance
    • Identify the top bottleneck stages
    • Compare traces across different configurations (model size, RAG settings, etc.)
  • This enables automated performance testing loops: send prompt → measure trace → report bottlenecks
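The bottleneck-ranking step of that loop can be a pure function over collected traces, which could then be registered as an MCP tool. This is a sketch; `top_bottlenecks` and the trace shape are assumptions, and the MCP registration itself is omitted:

```python
from collections import defaultdict
from statistics import mean


def top_bottlenecks(traces: list[dict], n: int = 3) -> list[tuple[str, float]]:
    """Rank pipeline stages by mean duration (ms) across many traces.

    Hypothetical helper: exposing this via MCP would let an external
    agent query "what are the top N bottleneck stages right now?"
    """
    by_stage: dict[str, list[int]] = defaultdict(list)
    for trace in traces:
        for rec in trace["stages"]:
            by_stage[rec["stage"]].append(rec["duration_ms"])
    ranked = sorted(
        ((stage, mean(durations)) for stage, durations in by_stage.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:n]


traces = [
    {"stages": [{"stage": "rag_retrieval", "duration_ms": 4000},
                {"stage": "llm_inference", "duration_ms": 2000}]},
    {"stages": [{"stage": "rag_retrieval", "duration_ms": 5000},
                {"stage": "llm_inference", "duration_ms": 2200}]},
]
worst = top_bottlenecks(traces, n=1)[0]
# worst identifies rag_retrieval as the top bottleneck
```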

4. Frontend Profile View (Optional)

  • Developer-mode toggle in settings to show pipeline trace per message
  • Waterfall visualization showing each stage's duration
  • Highlight stages exceeding configurable thresholds

Technical Approach

Use Python's time.perf_counter_ns() for high-resolution timing. Wrap each pipeline stage in a context manager that records start/end. Store traces in memory (ring buffer, last N traces) — no database dependency.
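The approach above can be sketched as a context manager feeding a `deque`-backed ring buffer. Names (`timed_stage`, `TRACES`) are illustrative; note that `perf_counter_ns()` values are relative to an arbitrary epoch, so `start_ms`/`end_ms` are only meaningful within one process:

```python
import time
from collections import deque
from contextlib import contextmanager

# Ring buffer: keeps only the last 100 traces, no database dependency.
TRACES: deque = deque(maxlen=100)


@contextmanager
def timed_stage(trace: list, stage: str, **metadata):
    """Wrap one pipeline stage, appending a timing record to `trace`."""
    start_ns = time.perf_counter_ns()
    try:
        yield
    finally:
        # Record the end time even if the stage raised.
        end_ns = time.perf_counter_ns()
        trace.append({
            "stage": stage,
            "start_ms": start_ns // 1_000_000,
            "end_ms": end_ns // 1_000_000,
            "duration_ms": (end_ns - start_ns) // 1_000_000,
            "metadata": metadata,
        })


# Usage: one trace per message, pushed into the ring buffer when complete.
trace: list = []
with timed_stage(trace, "rag_retrieval", documents_searched=3):
    time.sleep(0.01)  # stand-in for the real RAG lookup
TRACES.append(trace)
```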

The lazy loading fix (boot everything at app start) was the first optimization found via manual profiling. This issue systematizes that approach so future bottlenecks are found automatically.

Files Likely Affected

  • src/gaia/agents/ — agent pipeline stages (ChatAgent, RAG mixin, tool execution)
  • src/gaia_ui/backend/routers/ — new profile-trace API endpoint
  • src/gaia/agents/ — MCP tool for exposing profiling data

Acceptance Criteria

  • Every pipeline stage is instrumented with timing data
  • Profile trace API endpoint returns structured timing data
  • Traces include metadata (document count, model name, tool count)
  • Ring buffer stores last 100 traces without persistence overhead
  • MCP tool exposes profiling data for external consumption

Metadata

Labels

  • domain:agent-core — Framework, tools, registry, memory, skills, orchestration
  • enhancement — New feature or request
  • p2 — low priority
  • track:consumer-app — Hermes-competitor consumer product: mobile-first, voice + messaging + memory + skills
