Problem
The agent pipeline has multiple stages between receiving a user prompt and returning a response: library initialization, RAG lookup, prompt construction, LLM inference, tool execution, and response formatting. There is no instrumentation to identify which stages are bottlenecks. Optimization has been ad hoc (e.g., fixing lazy loading) rather than data-driven.
Proposed Changes
1. Pipeline Stage Profiling
Add timing instrumentation at each stage of the agent pipeline:
- Session initialization (library loading, model warm-up)
- Document context retrieval (RAG pipeline)
- System prompt + context assembly
- LLM inference (first token, total generation)
- Tool selection and execution (per tool)
- Response formatting and streaming
Each stage should emit a structured timing record with:
{
  "stage": "rag_retrieval",
  "start_ms": 1234,
  "end_ms": 5678,
  "duration_ms": 4444,
  "metadata": {"documents_searched": 3, "chunks_retrieved": 12}
}
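As a minimal Python sketch, the record could be modeled as a typed structure; the StageTiming name is illustrative, not an existing type in the codebase:

from typing import Any, TypedDict

class StageTiming(TypedDict):
    """One timing record emitted by a single pipeline stage."""
    stage: str                # e.g. "rag_retrieval"
    start_ms: int             # stage start, ms offset from a monotonic origin
    end_ms: int               # stage end, ms offset from the same origin
    duration_ms: int          # end_ms - start_ms
    metadata: dict[str, Any]  # stage-specific counters, e.g. chunks_retrieved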
2. Profile Trace API
- Expose pipeline traces via a new API endpoint:
GET /api/system/profile-trace?session_id=X
- Returns the full pipeline trace for a given message/session
- Can be consumed by the frontend for a developer/debug view or by external tools
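A minimal FastAPI router sketch for this endpoint, assuming traces live in an in-memory store keyed by session; the _TRACES store and its shape are hypothetical:

from fastapi import APIRouter, HTTPException

router = APIRouter(prefix="/api/system")

# Hypothetical in-memory store: session_id -> list of stage timing records.
_TRACES: dict[str, list[dict]] = {}

@router.get("/profile-trace")
def get_profile_trace(session_id: str) -> dict:
    """Return the full pipeline trace for one session."""
    trace = _TRACES.get(session_id)
    if trace is None:
        raise HTTPException(status_code=404, detail="No trace for this session")
    return {"session_id": session_id, "stages": trace}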
3. MCP Integration for Automated Profiling
- Expose profiling data via MCP so external agents can:
  - Query current pipeline performance
  - Identify the top bottleneck stages
  - Compare traces across different configurations (model size, RAG settings, etc.)
- This enables automated performance testing loops: send prompt → measure trace → report bottlenecks
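A sketch of one such tool using FastMCP from the official MCP Python SDK; the server name, tool name, and trace store are assumptions for illustration:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("gaia-profiler")

# Hypothetical in-memory store, shared with the profile-trace endpoint.
_TRACES: dict[str, list[dict]] = {}

@mcp.tool()
def top_bottlenecks(session_id: str, limit: int = 3) -> list[dict]:
    """Return the slowest pipeline stages for a session, longest first."""
    trace = _TRACES.get(session_id, [])
    return sorted(trace, key=lambda r: r["duration_ms"], reverse=True)[:limit]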
4. Frontend Profile View (Optional)
- Developer-mode toggle in settings to show pipeline trace per message
- Waterfall visualization showing each stage's duration
- Highlight stages exceeding configurable thresholds
Technical Approach
Use Python's time.perf_counter_ns() for high-resolution timing. Wrap each pipeline stage in a context manager that records start/end. Store traces in memory (ring buffer, last N traces) — no database dependency.
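A minimal sketch of that context manager and ring buffer; note that perf_counter_ns() is monotonic, so start_ms/end_ms are offsets from an arbitrary origin rather than wall-clock times (names below are illustrative):

import time
from collections import deque
from contextlib import contextmanager

MAX_TRACES = 256  # ring buffer: keep only the last N traces, no database

_trace_buffer: deque[dict] = deque(maxlen=MAX_TRACES)

@contextmanager
def timed_stage(stage: str, trace: list[dict], **metadata):
    """Time one pipeline stage and append a structured record to the trace."""
    start_ns = time.perf_counter_ns()
    try:
        yield
    finally:
        end_ns = time.perf_counter_ns()
        trace.append({
            "stage": stage,
            "start_ms": start_ns // 1_000_000,
            "end_ms": end_ns // 1_000_000,
            "duration_ms": (end_ns - start_ns) // 1_000_000,
            "metadata": metadata,
        })

Usage at a hypothetical call site:

trace: list[dict] = []
with timed_stage("rag_retrieval", trace, documents_searched=3, chunks_retrieved=12):
    ...  # run the RAG lookup here
_trace_buffer.append({"session_id": "X", "stages": trace})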
The lazy loading fix (boot everything at app start) was the first optimization found via manual profiling. This issue systematizes that approach so future bottlenecks are found automatically.
Files Likely Affected
- src/gaia/agents/ — agent pipeline stages (ChatAgent, RAG mixin, tool execution)
- src/gaia_ui/backend/routers/ — new profile-trace API endpoint
- src/gaia/agents/ — MCP tool for exposing profiling data
Acceptance Criteria
- Each pipeline stage emits a structured timing record in the format above
- GET /api/system/profile-trace?session_id=X returns the full pipeline trace for a session
- Profiling data is queryable via MCP: current performance, top bottleneck stages, and comparison of traces across configurations
- Traces are held in a bounded in-memory ring buffer with no database dependency