Commit 8fbfaec

chore: spec update 260626
1 parent 30f5c59 commit 8fbfaec

12 files changed

Lines changed: 255 additions & 188 deletions

.ai/spec/README.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -45,7 +45,7 @@ AI agents. Content is optimized for precision and machine consumption.
4545

4646
## Conventions
4747

48-
- **Rule numbering:** behavioral rules are numbered sequentially within each what/ file.
48+
- **Rule format:** behavioral rules use bullet points (not numbered) within each what/ file to allow insertion without renumbering.
4949
- **Planned changes:** unimplemented behavior is marked with `[PLANNED]` or `[PLANNED: TICKET-XXXX]` inline next to the rule it affects.
5050
- **Constraints:** component-specific and cross-cutting constraints go in the relevant what/ file's Constraints section, co-located with behavioral rules. Development conventions go in CLAUDE.md.
5151
- **Authority:** what/ specs are authoritative for behavior. how/ specs are authoritative for implementation. When they conflict, what/ wins.

.ai/spec/how/agent-drivers.md

Lines changed: 43 additions & 20 deletions
@@ -4,45 +4,68 @@
44

55
| File | Key Symbols | Responsibility |
66
|---|---|---|
7-
| `pipeline/evaluation/driver.py` | `AgentDriver`, `HttpApiDriver`, `AgentDriverRegistry` | Driver abstraction, HTTP implementation, driver factory |
8-
| `pipeline/evaluation/amender.py` | `APIDataAmender` | Mutates turn data with agent response, tokens, latency, streaming metrics |
9-
| `core/api/client.py` | `APIClient` | HTTP client with caching, retries, streaming support |
7+
| `pipeline/evaluation/driver.py` | `AgentDriver`, `HttpApiDriver`, `ProposalDriver`, `TerminalOutcome` | Driver abstraction, HTTP and Proposal implementations |
8+
| `pipeline/evaluation/registry.py` | `AgentDriverRegistry`, `AGENT_DRIVERS` | Driver type registry and factory |
9+
| `pipeline/evaluation/amender.py` | `APIDataAmender` | Mutates turn data with HTTP agent response, tokens, latency, streaming metrics |
10+
| `pipeline/evaluation/proposal_amender.py` | `ProposalAmender` | Fetches child Result CRs, builds Markdown summary, amends proposal turn data |
11+
| `pipeline/evaluation/cli.py` | `CLIClient`, `KubeCLI` | Abstract CLI interface and Kubernetes (oc/kubectl) implementation |
12+
| `core/api/client.py` | `APIClient` | HTTP client with caching, retries; supports query/streaming/infer/responses endpoints |
1013
| `core/api/streaming_parser.py` | `parse_streaming_response()`, `StreamingContext` | SSE parsing with TTFT/throughput tracking |
11-
| `core/models/agents.py` | `HttpApiAgentConfig`, `AgentsConfig`, `AgentDefaultConfig` | Agent configuration models; `AgentsConfig.resolve_agent_config()` handles config merge |
14+
| `core/proposal/phase.py` | `derive_phase()` | Proposal phase derivation from CRD conditions |
15+
| `core/metrics/custom/proposal_eval.py` | `evaluate_proposal_status()` | Proposal status assertion metric |
16+
| `core/models/agents.py` | `HttpApiAgentConfig`, `ProposalAgentConfig`, `AgentsConfig`, `AgentDefaultConfig` | Agent configuration models; `AgentsConfig.resolve_agent_config()` handles config merge |
1217

1318
## Data Flow
1419

15-
1. `EvaluationPipeline._initialize_components()` creates an `AgentDriverRegistry` with registered driver types (default: `http_api` → `HttpApiDriver`).
16-
2. For each conversation, `_resolve_driver_for_conversation()` either reuses the default driver or creates a per-conversation driver if that conversation has agent config overrides.
17-
3. `ConversationProcessor._process_turn_api()` calls `driver.execute_turn(turn_data, conversation_id)` before metrics evaluation.
18-
4. `HttpApiDriver` delegates to `APIDataAmender.amend_single_turn()`, which calls `APIClient.query()`.
19-
5. `APIClient` sends the HTTP request (standard POST, streaming SSE, or RLSAPI /infer depending on endpoint type).
20-
6. `APIDataAmender` mutates `TurnData` in-place: response text, contexts, tool_calls, token counts, agent latency, and streaming metrics (TTFT, duration, throughput).
21-
7. The amended turn data is then passed to `MetricsEvaluator` for scoring.
20+
### HttpApiDriver Flow
21+
22+
- `EvaluationPipeline._initialize_components()` creates an `AgentDriverRegistry` with registered driver types (`http_api` → HttpApiDriver, `proposal` → ProposalDriver).
23+
- For each conversation, `_resolve_driver_for_conversation()` either reuses the default driver or creates a per-conversation driver if that conversation has agent config overrides.
24+
- `ConversationProcessor._process_turn_api()` calls `driver.execute_turn(turn_data, conversation_id)` before metrics evaluation.
25+
- `HttpApiDriver` delegates to `APIDataAmender.amend_single_turn()`, which calls `APIClient.query()`.
26+
- `APIClient` sends the HTTP request (standard POST, streaming SSE, RLSAPI /infer, or OpenAI Responses API depending on endpoint type).
27+
- `APIDataAmender` mutates `TurnData` in-place: response text, contexts, tool_calls, token counts, agent latency, and streaming metrics.
28+
29+
### ProposalDriver Flow
30+
31+
- `ProposalDriver.execute_turn()` builds a Proposal CR manifest from `turn_data.proposal_spec`.
32+
- `KubeCLI.apply()` creates the Proposal CR in the configured namespace.
33+
- If `auto_approve` is enabled, the driver polls until Analyzed=True, then creates a ProposalApproval CR.
34+
- The driver polls `KubeCLI.get_resource()` for the Proposal's status conditions until a terminal outcome is reached (Completed, Failed, Denied, Escalated) or timeout.
35+
- `derive_phase()` evaluates conditions to determine the current phase, handling retry logic (RetryingExecution reason).
36+
- `ProposalAmender.amend()` fetches child Result CRs (analysisresults, executionresults, verificationresults, escalationresults) and builds a Markdown summary.
37+
- Turn data is amended in-place: response (Markdown), proposal_status, proposal_results, proposal_phases.
38+
- If `cleanup_proposals` is enabled, the Proposal CR is deleted after processing.
2239

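The polling step in the ProposalDriver flow above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `wait_for_terminal`, `get_phase`, `poll_interval`, and the injectable `clock`/`sleep` parameters are hypothetical names, and only the four terminal outcomes come from the spec text.

```python
import time
from typing import Callable, Optional

# Terminal outcomes named in the spec text above.
TERMINAL_PHASES = {"Completed", "Failed", "Denied", "Escalated"}


def wait_for_terminal(
    get_phase: Callable[[], str],  # hypothetical: returns the current derived phase
    timeout: float,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll get_phase() until a terminal phase appears or the timeout elapses.

    Returns the terminal phase, or None if the deadline passes first.
    """
    deadline = clock() + timeout
    while True:
        phase = get_phase()
        if phase in TERMINAL_PHASES:
            return phase
        if clock() >= deadline:
            return None
        sleep(poll_interval)
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting. In the real driver the phase would presumably come from `derive_phase()` applied to the conditions returned by `KubeCLI.get_resource()`.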
2340
## Key Abstractions
2441

25-
**AgentDriverRegistry** maps driver type strings to driver classes. Adding a new driver type means: (1) subclass `AgentDriver`, (2) register in the registry's `_driver_types` dict. Currently only `http_api` is registered.
42+
**AgentDriverRegistry** maps driver type strings to driver classes. Two types registered: `http_api` and `proposal`. Adding a new driver type: subclass `AgentDriver`, add to `AGENT_DRIVERS` dict in `registry.py`.
43+
44+
**AgentDriver** is the abstract interface with `execute_turn()`, `validate_config()`, `enabled`, and `close()`. Returns `(error_message, conversation_id)` tuple.
2645

27-
**AgentDriver** is the abstract interface with `execute_turn()`, `validate_config()`, `enabled`, and `close()`. The `execute_turn()` method returns a tuple of `(error_message, conversation_id)` — the error message is None on success, and the conversation_id may be updated by the agent (for multi-turn conversation tracking).
46+
**APIClient** handles four query modes: standard POST (`/query`), streaming SSE, RLSAPI `/infer`, and OpenAI Responses API (`/responses`). Manages disk-based caching and automatic retries on 429/5xx.
2847

29-
**APIClient** handles three query modes based on endpoint configuration: standard POST (`/query`), streaming SSE, and RLSAPI `/infer`. It manages disk-based caching (keyed by SHA256 of query+model+params) and automatic retries on 429/5xx responses.
48+
**ProposalAmender** maps CRD step names to resource types (`analysis` → `analysisresults`, etc.), fetches each via KubeCLI, and builds a structured Markdown response with sections for Analysis, Execution, Verification, and Escalation.
3049

31-
**Config resolution** follows three-tier priority: eval_data agent overrides > named agent config > system defaults. `resolve_agent_config()` merges these layers into the final config dict passed to the driver.
50+
**CLIClient** abstracts CLI operations (apply, get_resource, delete). `KubeCLI` resolves `oc` or `kubectl` on PATH, runs commands with namespace and JSON output flags.
3251

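As an illustration of the registry and driver abstractions described above, here is a minimal sketch. It assumes a plain dict stands in for `TurnData`; `EchoDriver` and `create_driver` are hypothetical names invented for this example, and `validate_config()` and `enabled` are omitted for brevity.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional, Tuple, Type


class AgentDriver(ABC):
    """Simplified stand-in for the abstract driver interface."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.config = config

    @abstractmethod
    def execute_turn(
        self, turn_data: Dict[str, Any], conversation_id: Optional[str]
    ) -> Tuple[Optional[str], Optional[str]]:
        """Amend turn_data in place; return (error_message, conversation_id)."""

    def close(self) -> None:
        """Release any resources held by the driver."""


class EchoDriver(AgentDriver):
    """Hypothetical driver used only for this illustration."""

    def execute_turn(self, turn_data, conversation_id):
        turn_data["response"] = turn_data.get("query", "")
        # error_message is None on success.
        return None, conversation_id or "conv-1"


# Type string -> driver class, mirroring the AGENT_DRIVERS idea.
AGENT_DRIVERS: Dict[str, Type[AgentDriver]] = {"echo": EchoDriver}


def create_driver(driver_type: str, config: Dict[str, Any]) -> AgentDriver:
    """Look up the driver class for driver_type and instantiate it."""
    try:
        driver_cls = AGENT_DRIVERS[driver_type]
    except KeyError:
        known = sorted(AGENT_DRIVERS)
        raise ValueError(f"Unknown driver type {driver_type!r}; known: {known}") from None
    return driver_cls(config)
```

Under this shape, adding a driver type is a subclass plus one dict entry, which matches the extension path the spec describes.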
3352
## Integration Points
3453

3554
| Consumer | Provider | Mechanism |
3655
|---|---|---|
3756
| `EvaluationPipeline` | `AgentDriverRegistry` | Creates drivers from config |
3857
| `ConversationProcessor` | `AgentDriver.execute_turn()` | Invokes driver per turn |
39-
| `HttpApiDriver` | `APIDataAmender` | Delegates turn amendment |
40-
| `APIDataAmender` | `APIClient` | Sends HTTP requests |
41-
| `APIClient` | `StreamingParser` | Parses SSE responses |
58+
| `HttpApiDriver` | `APIDataAmender` → `APIClient` | HTTP request chain |
59+
| `ProposalDriver` | `KubeCLI` | CR lifecycle (apply, get, delete) |
60+
| `ProposalDriver` | `ProposalAmender` | Fetch child CRs and build summary |
61+
| `derive_phase()` | CRD conditions | Phase determination logic |
4262

4363
## Implementation Notes
4464

4565
- **Per-conversation drivers** are created when a conversation has agent config overrides and are cleaned up after that conversation completes. The default driver persists across all conversations.
4666
- **Disk caching** in `APIClient` uses `diskcache` with SHA256 keys. Cache can be disabled per-agent or globally via `core.cache_enabled`.
47-
- **Streaming metrics** (TTFT, duration, tokens/second) are only populated when the endpoint is configured for streaming. Non-streaming endpoints leave these fields as None.
48-
- **The amender mutates TurnData in-place** — there is no copy. The original response (if pre-populated in eval data) is overwritten by the agent's response.
67+
- **Streaming metrics** (TTFT, duration, tokens/second) are populated for streaming and responses endpoint types.
68+
- **The amender mutates TurnData in-place** — the original response is overwritten.
69+
- **Proposal CR naming** uses `eval-{safe_conv_id}-{uuid8}` to avoid namespace collisions.
70+
- **KubeCLI timeout** is per-command (`cli_timeout`), while ProposalDriver `timeout` is the overall lifecycle timeout for reaching a terminal state.
71+
- **Responses endpoint** uses OpenAI Responses API schema — maps query→input, system_prompt→instructions, extracts file_search_call for RAG contexts and mcp_call for tool calls.
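
The `eval-{safe_conv_id}-{uuid8}` naming note can be illustrated with a short sketch. The sanitization rule, the 63-character cap (the conservative Kubernetes DNS-label limit), and the `conv` fallback are assumptions made for this example; the spec only states the name pattern. `proposal_cr_name` is a hypothetical helper.

```python
import re
import uuid


def proposal_cr_name(conversation_id: str) -> str:
    """Build a CR name of the form eval-{safe_conv_id}-{uuid8}."""
    # Assumption: lowercase, and collapse anything outside [a-z0-9-] into '-'.
    safe = re.sub(r"[^a-z0-9-]+", "-", conversation_id.lower()).strip("-")
    # Eight hex characters from a random UUID keep repeated runs from colliding.
    suffix = uuid.uuid4().hex[:8]
    # Assumption: keep the whole name within 63 characters.
    max_safe = 63 - len("eval-") - len("-") - len(suffix)
    safe = safe[:max_safe].strip("-") or "conv"
    return f"eval-{safe}-{suffix}"
```

Because the suffix is random, two turns from the same conversation get distinct names, which is what lets them share a namespace.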

.ai/spec/how/configuration-and-models.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
66
|---|---|---|
77
| `core/models/system.py` | `SystemConfig` | Top-level system config Pydantic model |
88
| `core/models/data.py` | `EvaluationData`, `TurnData`, `MetricResult`, `EvaluationResult` | Evaluation dataset, turn, and result models |
9-
| `core/models/agents.py` | `AgentConfig` | Agent driver configuration |
9+
| `core/models/agents.py` | `HttpApiAgentConfig`, `ProposalAgentConfig`, `AgentsConfig`, `AgentDefaultConfig` | Agent driver configuration models |
1010
| `core/models/api.py` | Legacy API config | Backward-compatible API config (deprecated) |
1111
| `core/models/llm.py` | LLM config models | LLM pool and judge panel models |
1212
| `core/models/summary.py` | `EvaluationSummary` | Result aggregation models |

.ai/spec/how/metrics-implementation.md

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@
1313
| `core/metrics/custom/custom.py` | `CustomMetrics` | Custom LLM-based metric handler |
1414
| `core/metrics/custom/keywords_eval.py` || Keyword matching evaluation logic |
1515
| `core/metrics/custom/tool_eval.py` || Tool use evaluation logic |
16+
| `core/metrics/custom/proposal_eval.py` | `evaluate_proposal_status()` | Proposal status assertion metric (phase, duration, attempts, conditions) |
1617
| `core/metrics/custom/prompts.py` || Prompt templates for custom metrics |
1718
| `pipeline/evaluation/evaluator.py` | `MetricsEvaluator` | Metric dispatch, multi-expected-response logic, status determination |
1819
| `pipeline/evaluation/judges.py` | `JudgeOrchestrator` | Panel scoring, aggregation strategies |

.ai/spec/how/output-and-storage.md

Lines changed: 16 additions & 12 deletions
@@ -4,32 +4,34 @@
44

55
| File | Key Symbols | Responsibility |
66
|---|---|---|
7-
| `core/output/generator.py` | `OutputHandler` | Orchestrates report generation |
8-
| `core/output/visualization.py` || Graph generation (matplotlib, seaborn) |
7+
| `core/output/generator.py` | `OutputHandler` | Orchestrates report generation (CSV, JSON, TXT, quality report) |
8+
| `core/output/visualization.py` || Graph generation (matplotlib, seaborn): pass_rates, score_distribution, status_breakdown, conversation_heatmap |
99
| `core/output/statistics.py` || Statistical computations (bootstrap CI, distributions) |
1010
| `core/output/data_persistence.py` || File writing (CSV, JSON, TXT) |
1111
| `core/storage/protocol.py` | `BaseStorageBackend` | Abstract storage interface |
12-
| `core/storage/factory.py` | `create_pipeline_storage_backend()` | Backend instantiation |
12+
| `core/storage/factory.py` | `create_pipeline_storage_backend()` | Backend instantiation from config |
1313
| `core/storage/file_storage.py` | `FileStorageBackend` | File output + report generation |
1414
| `core/storage/sql_storage.py` | `SQLStorageBackend` | Database persistence |
15+
| `core/storage/langfuse_storage.py` | `LangfuseStorageBackend` | Langfuse observability platform persistence |
1516
| `core/storage/composite_storage.py` | `CompositeStorageBackend` | Multi-backend chaining |
16-
| `core/storage/config.py` | | Storage configuration models |
17+
| `core/storage/config.py` | `FileBackendConfig`, `DatabaseBackendConfig`, `LangfuseBackendConfig` | Storage configuration models |
1718

1819
## Data Flow
1920

20-
1. During evaluation, `EvaluationPipeline` calls `storage.save_run(results)` after each conversation completes.
21-
2. After all conversations finish, `set_evaluation_context()` provides the full dataset, then `finalize()` is called.
22-
3. `FileStorageBackend.save_run()` accumulates results in memory only — no disk writes. `finalize()` triggers `OutputHandler` to generate all reports from accumulated results.
23-
4. `SQLStorageBackend.save_run()` commits results to the database immediately per conversation. `finalize()` is a no-op (logs a count).
24-
5. `CompositeStorageBackend` delegates all calls to its child backends in order.
21+
- During evaluation, `EvaluationPipeline` calls `storage.save_run(results)` after each conversation completes.
22+
- After all conversations finish, `set_evaluation_context()` provides the full dataset, then `finalize()` is called.
23+
- `FileStorageBackend.save_run()` accumulates results in memory only — no disk writes. `finalize()` triggers `OutputHandler` to generate all reports from accumulated results.
24+
- `SQLStorageBackend.save_run()` commits results to the database immediately per conversation. `finalize()` is a no-op (logs a count).
25+
- `LangfuseStorageBackend` accumulates results during `save_run()`, then creates a trace span and writes individual scores via `create_score()` on `finalize()`.
26+
- `CompositeStorageBackend` delegates all calls to its child backends in order.
2527

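The contrast between deferred and immediate backends in the flow above can be sketched as follows. All class names here (`StorageBackend`, `DeferredBackend`, `ImmediateBackend`, `CompositeBackend`) are hypothetical simplifications, with in-memory lists standing in for files and database rows; only the lifecycle method names come from the spec.

```python
from typing import Any, Iterable, List, Protocol


class StorageBackend(Protocol):
    """Lifecycle: initialize -> save_run (repeated) -> set_evaluation_context -> finalize -> close."""

    def initialize(self) -> None: ...
    def save_run(self, results: List[Any]) -> None: ...
    def set_evaluation_context(self, dataset: List[Any]) -> None: ...
    def finalize(self) -> None: ...
    def close(self) -> None: ...


class DeferredBackend:
    """File-style: accumulate in memory, write everything on finalize()."""

    def __init__(self) -> None:
        self.pending: List[Any] = []
        self.written: List[Any] = []
        self.dataset: List[Any] = []

    def initialize(self) -> None:
        pass

    def save_run(self, results: List[Any]) -> None:
        self.pending.extend(results)  # nothing is persisted yet

    def set_evaluation_context(self, dataset: List[Any]) -> None:
        self.dataset = dataset

    def finalize(self) -> None:
        self.written, self.pending = self.pending, []

    def close(self) -> None:
        pass


class ImmediateBackend(DeferredBackend):
    """SQL-style: persist on every save_run(); finalize() is a no-op."""

    def save_run(self, results: List[Any]) -> None:
        self.written.extend(results)

    def finalize(self) -> None:
        pass


class CompositeBackend:
    """Delegates every lifecycle call to its children, in order."""

    def __init__(self, children: Iterable[StorageBackend]) -> None:
        self.children = list(children)

    def initialize(self) -> None:
        for child in self.children:
            child.initialize()

    def save_run(self, results: List[Any]) -> None:
        for child in self.children:
            child.save_run(results)

    def set_evaluation_context(self, dataset: List[Any]) -> None:
        for child in self.children:
            child.set_evaluation_context(dataset)

    def finalize(self) -> None:
        for child in self.children:
            child.finalize()

    def close(self) -> None:
        for child in self.children:
            child.close()
```

The deferred variant holds every result until `finalize()`, which is the source of the memory pressure the implementation notes mention for file storage.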
2628
## Key Abstractions
2729

28-
**Storage lifecycle** is protocol-driven: `initialize()` → `save_run()` (repeated per conversation) → `set_evaluation_context()` → `finalize()` → `close()`. File and SQL backends implement this lifecycle differently: file storage defers all writes to `finalize()`, while SQL storage commits immediately in each `save_run()`.
30+
**Storage lifecycle** is protocol-driven: `initialize()` → `save_run()` (repeated per conversation) → `set_evaluation_context()` → `finalize()` → `close()`. Each backend implements this differently: file defers writes, SQL commits incrementally, Langfuse accumulates then flushes.
2931

30-
**The factory pattern** in `create_pipeline_storage_backend()` reads the config's storage list and instantiates the appropriate backends. If multiple backends are configured, they're wrapped in a `CompositeStorageBackend`. When no storage is configured, a `NoOpStorageBackend` is returned.
32+
**The factory pattern** in `create_pipeline_storage_backend()` reads the config's storage list and instantiates the appropriate backends (file, sql, langfuse). If multiple backends are configured, they're wrapped in a `CompositeStorageBackend`. When no storage is configured, a `NoOpStorageBackend` is returned.
3133

32-
**FileStorageBackend** accumulates results in memory during `save_run()` and needs `SystemConfig` plus the full evaluation dataset (`set_evaluation_context()`) to generate reports in `finalize()`. **SQLStorageBackend** commits to the database immediately per conversation and its `finalize()` is a no-op.
34+
**FileStorageBackend** accumulates results in memory during `save_run()` and needs `SystemConfig` plus the full evaluation dataset (`set_evaluation_context()`) to generate reports in `finalize()`. **SQLStorageBackend** commits to the database immediately per conversation and its `finalize()` is a no-op. **LangfuseStorageBackend** accumulates results and writes traces/scores to Langfuse on `finalize()`.
3335

3436
## Integration Points
3537

@@ -39,6 +41,7 @@
3941
| `FileStorageBackend` | `OutputHandler` | Delegates report generation on finalize |
4042
| `OutputHandler` | `EvaluationSummary` | Computes statistics for reports |
4143
| `SQLStorageBackend` | SQLAlchemy | Database operations |
44+
| `LangfuseStorageBackend` | Langfuse SDK | Trace and score creation |
4245

4346
## Implementation Notes
4447

@@ -47,3 +50,4 @@
4750
- **Graph generation** imports matplotlib and seaborn at call time, not at module level, because they're slow to import and not always needed.
4851
- **Report paths**: Output files are written to the directory specified in config, with timestamped subdirectories per run.
4952
- **File storage memory pressure**: Because file storage accumulates all results in memory until `finalize()`, very large evaluation runs may consume significant memory. SQL storage does not have this issue since it commits incrementally.
53+
- **Langfuse** requires the `langfuse` optional dependency (>=4.0.0). Config supports inline credentials or environment variable fallback.

0 commit comments