
Agent Pipeline Profiling and Latency Optimization #611

@itomek

Description


Problem

The agent pipeline has multiple stages between receiving a user prompt and returning a response: library initialization, RAG lookup, prompt construction, LLM inference, tool execution, and response formatting. There is currently no instrumentation to identify which of these stages are bottlenecks, so optimization has been ad hoc (e.g., the lazy-loading fix) rather than data-driven.

Proposed Changes

1. Pipeline Stage Profiling

Add timing instrumentation at each stage of the agent pipeline:

  • Session initialization (library loading, model warm-up)
  • Document context retrieval (RAG pipeline)
  • System prompt + context assembly
  • LLM inference (first token, total generation)
  • Tool selection and execution (per tool)
  • Response formatting and streaming

Each stage should emit a structured timing record with:

{
  "stage": "rag_retrieval",
  "start_ms": 1234,
  "end_ms": 5678,
  "duration_ms": 4444,
  "metadata": {"documents_searched": 3, "chunks_retrieved": 12}
}
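The record above could be modeled with a small dataclass so that `duration_ms` is always derived from the start/end timestamps rather than stored independently. This is a minimal sketch; the `StageTiming` name and `to_record()` helper are illustrative, not part of the existing codebase:

```python
from dataclasses import dataclass, field, asdict


@dataclass
class StageTiming:
    """One structured timing record for a single pipeline stage."""
    stage: str
    start_ms: int
    end_ms: int
    metadata: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> int:
        # Derived, so it can never disagree with start/end.
        return self.end_ms - self.start_ms

    def to_record(self) -> dict:
        rec = asdict(self)
        rec["duration_ms"] = self.duration_ms
        return rec


record = StageTiming(
    "rag_retrieval", 1234, 5678,
    {"documents_searched": 3, "chunks_retrieved": 12},
).to_record()
# record matches the JSON shape shown above, with duration_ms == 4444
```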

2. Profile Trace API

  • Expose pipeline traces via a new API endpoint: GET /api/system/profile-trace?session_id=X
  • Returns the full pipeline trace for a given message/session
  • Can be consumed by the frontend for a developer/debug view or by external tools
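A consumer of the endpoint (frontend debug view or an external tool) only needs the list of stage records to rank bottlenecks. The response shape below is an assumption based on the record format in section 1, not a finalized API contract:

```python
# Hypothetical response body from
# GET /api/system/profile-trace?session_id=X
trace = {
    "session_id": "abc123",
    "stages": [
        {"stage": "session_init", "duration_ms": 120},
        {"stage": "rag_retrieval", "duration_ms": 4444},
        {"stage": "llm_inference", "duration_ms": 2100},
    ],
}

# Find the single slowest stage in this trace.
slowest = max(trace["stages"], key=lambda s: s["duration_ms"])
```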

3. MCP Integration for Automated Profiling

  • Expose profiling data via MCP so external agents can:
    • Query current pipeline performance
    • Identify the top bottleneck stages
    • Compare traces across different configurations (model size, RAG settings, etc.)
  • This enables automated performance testing loops: send prompt → measure trace → report bottlenecks
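The bottleneck-ranking step of that loop can be a pure function over collected traces, which could then be registered as an MCP tool. This is a sketch; `top_bottlenecks` and the trace shape are assumptions, and the MCP registration itself is omitted:

```python
from collections import defaultdict
from statistics import mean


def top_bottlenecks(traces: list[dict], n: int = 3) -> list[tuple[str, float]]:
    """Rank pipeline stages by mean duration (ms) across many traces.

    Hypothetical helper: exposing this via MCP would let an external
    agent query "what are the top N bottleneck stages right now?"
    """
    by_stage: dict[str, list[int]] = defaultdict(list)
    for trace in traces:
        for rec in trace["stages"]:
            by_stage[rec["stage"]].append(rec["duration_ms"])
    ranked = sorted(
        ((stage, mean(durations)) for stage, durations in by_stage.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:n]


traces = [
    {"stages": [{"stage": "rag_retrieval", "duration_ms": 4000},
                {"stage": "llm_inference", "duration_ms": 2000}]},
    {"stages": [{"stage": "rag_retrieval", "duration_ms": 5000},
                {"stage": "llm_inference", "duration_ms": 2200}]},
]
worst = top_bottlenecks(traces, n=1)[0]
# worst identifies rag_retrieval as the top bottleneck
```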

4. Frontend Profile View (Optional)

  • Developer-mode toggle in settings to show pipeline trace per message
  • Waterfall visualization showing each stage's duration
  • Highlight stages exceeding configurable thresholds

Technical Approach

Use Python's time.perf_counter_ns() for high-resolution timing. Wrap each pipeline stage in a context manager that records start/end. Store traces in memory (ring buffer, last N traces) — no database dependency.
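The approach above can be sketched as a context manager feeding a `deque`-backed ring buffer. Names (`timed_stage`, `TRACES`) are illustrative; note that `perf_counter_ns()` values are relative to an arbitrary epoch, so `start_ms`/`end_ms` are only meaningful within one process:

```python
import time
from collections import deque
from contextlib import contextmanager

# Ring buffer: keeps only the last 100 traces, no database dependency.
TRACES: deque = deque(maxlen=100)


@contextmanager
def timed_stage(trace: list, stage: str, **metadata):
    """Wrap one pipeline stage, appending a timing record to `trace`."""
    start_ns = time.perf_counter_ns()
    try:
        yield
    finally:
        # Record the end time even if the stage raised.
        end_ns = time.perf_counter_ns()
        trace.append({
            "stage": stage,
            "start_ms": start_ns // 1_000_000,
            "end_ms": end_ns // 1_000_000,
            "duration_ms": (end_ns - start_ns) // 1_000_000,
            "metadata": metadata,
        })


# Usage: one trace per message, pushed into the ring buffer when complete.
trace: list = []
with timed_stage(trace, "rag_retrieval", documents_searched=3):
    time.sleep(0.01)  # stand-in for the real RAG lookup
TRACES.append(trace)
```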

The lazy loading fix (boot everything at app start) was the first optimization found via manual profiling. This issue systematizes that approach so future bottlenecks are found automatically.

Files Likely Affected

  • src/gaia/agents/ — agent pipeline stages (ChatAgent, RAG mixin, tool execution)
  • src/gaia_ui/backend/routers/ — new profile-trace API endpoint
  • src/gaia/agents/ — MCP tool for exposing profiling data

Acceptance Criteria

  • Every pipeline stage is instrumented with timing data
  • Profile trace API endpoint returns structured timing data
  • Traces include metadata (document count, model name, tool count)
  • Ring buffer stores last 100 traces without persistence overhead
  • MCP tool exposes profiling data for external consumption

Metadata

Labels

  • domain:agent-core — Framework, tools, registry, memory, skills, orchestration
  • enhancement — New feature or request
  • p2 — low priority
  • track:consumer-app — Hermes-competitor consumer product: mobile-first, voice + messaging + memory + skills
