lightspeed-core
diff --git a/.ai/spec/README.md b/.ai/spec/README.md
Lines changed: 1 addition & 1 deletion
diff --git a/.ai/spec/how/agent-drivers.md b/.ai/spec/how/agent-drivers.md
Lines changed: 43 additions & 20 deletions
diff --git a/.ai/spec/how/configuration-and-models.md b/.ai/spec/how/configuration-and-models.md
Lines changed: 1 addition & 1 deletion
diff --git a/.ai/spec/how/metrics-implementation.md b/.ai/spec/how/metrics-implementation.md
Lines changed: 1 addition & 0 deletions
diff --git a/.ai/spec/how/output-and-storage.md b/.ai/spec/how/output-and-storage.md
Lines changed: 16 additions & 12 deletions
@@ -45,7 +45,7 @@ AI agents. Content is optimized for precision and machine consumption.
 
 ## Conventions
 
-- **Rule numbering:** behavioral rules are numbered sequentially within each what/ file.
+- **Rule format:** behavioral rules use bullet points (not numbered) within each what/ file to allow insertion without renumbering.
 - **Planned changes:** unimplemented behavior is marked with `[PLANNED]` or `[PLANNED: TICKET-XXXX]` inline next to the rule it affects.
 - **Constraints:** component-specific and cross-cutting constraints go in the relevant what/ file's Constraints section, co-located with behavioral rules. Development conventions go in CLAUDE.md.
 - **Authority:** what/ specs are authoritative for behavior. how/ specs are authoritative for implementation. When they conflict, what/ wins.
 
@@ -4,45 +4,68 @@
 
 | File | Key Symbols | Responsibility |
 |---|---|---|
-| `pipeline/evaluation/driver.py` | `AgentDriver`, `HttpApiDriver`, `AgentDriverRegistry` | Driver abstraction, HTTP implementation, driver factory |
-| `pipeline/evaluation/amender.py` | `APIDataAmender` | Mutates turn data with agent response, tokens, latency, streaming metrics |
-| `core/api/client.py` | `APIClient` | HTTP client with caching, retries, streaming support |
+| `pipeline/evaluation/driver.py` | `AgentDriver`, `HttpApiDriver`, `ProposalDriver`, `TerminalOutcome` | Driver abstraction, HTTP and Proposal implementations |
+| `pipeline/evaluation/registry.py` | `AgentDriverRegistry`, `AGENT_DRIVERS` | Driver type registry and factory |
+| `pipeline/evaluation/amender.py` | `APIDataAmender` | Mutates turn data with HTTP agent response, tokens, latency, streaming metrics |
+| `pipeline/evaluation/proposal_amender.py` | `ProposalAmender` | Fetches child Result CRs, builds Markdown summary, amends proposal turn data |
+| `pipeline/evaluation/cli.py` | `CLIClient`, `KubeCLI` | Abstract CLI interface and Kubernetes (oc/kubectl) implementation |
+| `core/api/client.py` | `APIClient` | HTTP client with caching, retries; supports query/streaming/infer/responses endpoints |
 | `core/api/streaming_parser.py` | `parse_streaming_response()`, `StreamingContext` | SSE parsing with TTFT/throughput tracking |
-| `core/models/agents.py` | `HttpApiAgentConfig`, `AgentsConfig`, `AgentDefaultConfig` | Agent configuration models; `AgentsConfig.resolve_agent_config()` handles config merge |
+| `core/proposal/phase.py` | `derive_phase()` | Proposal phase derivation from CRD conditions |
+| `core/metrics/custom/proposal_eval.py` | `evaluate_proposal_status()` | Proposal status assertion metric |
+| `core/models/agents.py` | `HttpApiAgentConfig`, `ProposalAgentConfig`, `AgentsConfig`, `AgentDefaultConfig` | Agent configuration models; `AgentsConfig.resolve_agent_config()` handles config merge |
 
 ## Data Flow
 
-1. `EvaluationPipeline._initialize_components()` creates an `AgentDriverRegistry` with registered driver types (default: `http_api` → `HttpApiDriver`).
-2. For each conversation, `_resolve_driver_for_conversation()` either reuses the default driver or creates a per-conversation driver if that conversation has agent config overrides.
-3. `ConversationProcessor._process_turn_api()` calls `driver.execute_turn(turn_data, conversation_id)` before metrics evaluation.
-4. `HttpApiDriver` delegates to `APIDataAmender.amend_single_turn()`, which calls `APIClient.query()`.
-5. `APIClient` sends the HTTP request (standard POST, streaming SSE, or RLSAPI /infer depending on endpoint type).
-6. `APIDataAmender` mutates `TurnData` in-place: response text, contexts, tool_calls, token counts, agent latency, and streaming metrics (TTFT, duration, throughput).
-7. The amended turn data is then passed to `MetricsEvaluator` for scoring.
+### HttpApiDriver Flow
+
+- `EvaluationPipeline._initialize_components()` creates an `AgentDriverRegistry` with registered driver types (`http_api` → `HttpApiDriver`, `proposal` → `ProposalDriver`).
+- For each conversation, `_resolve_driver_for_conversation()` either reuses the default driver or creates a per-conversation driver if that conversation has agent config overrides.
+- `ConversationProcessor._process_turn_api()` calls `driver.execute_turn(turn_data, conversation_id)` before metrics evaluation.
+- `HttpApiDriver` delegates to `APIDataAmender.amend_single_turn()`, which calls `APIClient.query()`.
+- `APIClient` sends the HTTP request (standard POST, streaming SSE, RLSAPI /infer, or OpenAI Responses API depending on endpoint type).
+- `APIDataAmender` mutates `TurnData` in-place: response text, contexts, tool_calls, token counts, agent latency, and streaming metrics.
+
+### ProposalDriver Flow
+
+- `ProposalDriver.execute_turn()` builds a Proposal CR manifest from `turn_data.proposal_spec`.
+- `KubeCLI.apply()` creates the Proposal CR in the configured namespace.
+- If `auto_approve` is enabled, the driver polls until Analyzed=True, then creates a ProposalApproval CR.
+- The driver polls `KubeCLI.get_resource()` for the Proposal's status conditions until a terminal outcome is reached (Completed, Failed, Denied, Escalated) or timeout.
+- `derive_phase()` evaluates conditions to determine the current phase, handling retry logic (RetryingExecution reason).
+- `ProposalAmender.amend()` fetches child Result CRs (analysisresults, executionresults, verificationresults, escalationresults) and builds a Markdown summary.
+- Turn data is amended in-place: response (Markdown), proposal_status, proposal_results, proposal_phases.
+- If `cleanup_proposals` is enabled, the Proposal CR is deleted after processing.
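
The polling step in the flow above can be sketched as a deadline-bounded loop. This is a minimal illustration, not the project's implementation: `poll_until_terminal`, its parameters, and the non-terminal phase names in the usage lines are hypothetical. Only the four terminal outcomes come from the spec text.

```python
import time
from typing import Callable, Optional

# Terminal outcomes named in the ProposalDriver flow.
TERMINAL_PHASES = frozenset({"Completed", "Failed", "Denied", "Escalated"})


def poll_until_terminal(
    get_phase: Callable[[], str],
    timeout: float,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the terminal phase, or None if the deadline passes first.

    `get_phase` is a hypothetical stand-in for "fetch the Proposal and
    derive its phase"; the real driver's signature may differ.
    """
    deadline = time.monotonic() + timeout
    while True:
        phase = get_phase()
        if phase in TERMINAL_PHASES:
            return phase
        if time.monotonic() >= deadline:
            return None
        sleep(interval)


# Usage with a scripted phase sequence standing in for the cluster.
# The non-terminal names here are illustrative only.
phases = iter(["Pending", "Analyzing", "Executing", "Completed"])
outcome = poll_until_terminal(lambda: next(phases), timeout=30, sleep=lambda _: None)
print(outcome)
```

Passing `sleep` as a parameter keeps the loop testable without real delays. Checking the phase before the deadline means a Proposal that is already terminal is reported even when `timeout` is zero.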
 
 ## Key Abstractions
 
-**AgentDriverRegistry** maps driver type strings to driver classes. Adding a new driver type means: (1) subclass `AgentDriver`, (2) register in the registry's `_driver_types` dict. Currently only `http_api` is registered.
+**AgentDriverRegistry** maps driver type strings to driver classes. Two types are registered: `http_api` and `proposal`. To add a new driver type, subclass `AgentDriver` and add it to the `AGENT_DRIVERS` dict in `registry.py`.
+
+**AgentDriver** is the abstract interface with `execute_turn()`, `validate_config()`, `enabled`, and `close()`. `execute_turn()` returns an `(error_message, conversation_id)` tuple.
 
-**AgentDriver** is the abstract interface with `execute_turn()`, `validate_config()`, `enabled`, and `close()`. The `execute_turn()` method returns a tuple of `(error_message, conversation_id)` — the error message is None on success, and the conversation_id may be updated by the agent (for multi-turn conversation tracking).
+**APIClient** handles four query modes: standard POST (`/query`), streaming SSE, RLSAPI `/infer`, and OpenAI Responses API (`/responses`). It manages disk-based caching and automatic retries on 429/5xx responses.
 
-**APIClient** handles three query modes based on endpoint configuration: standard POST (`/query`), streaming SSE, and RLSAPI `/infer`. It manages disk-based caching (keyed by SHA256 of query+model+params) and automatic retries on 429/5xx responses.
+**ProposalAmender** maps CRD step names to resource types (`analysis` → `analysisresults`, etc.), fetches each via KubeCLI, and builds a structured Markdown response with sections for Analysis, Execution, Verification, and Escalation.
 
-**Config resolution** follows three-tier priority: eval_data agent overrides > named agent config > system defaults. `resolve_agent_config()` merges these layers into the final config dict passed to the driver.
+**CLIClient** abstracts CLI operations (apply, get_resource, delete). `KubeCLI` resolves `oc` or `kubectl` on PATH, runs commands with namespace and JSON output flags.
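
The registry pattern described above (a type string mapped to a driver class, with new drivers added to a dict) might look roughly like the following. This is a sketch under stated assumptions: the constructor signature, `create_driver`, and `EchoDriver` are hypothetical, and the real `AgentDriver` also declares `validate_config()`, `enabled`, and `close()`, which are omitted here.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional, Tuple, Type


class AgentDriver(ABC):
    """Reduced stand-in for the abstract driver interface."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.config = config

    @abstractmethod
    def execute_turn(
        self, turn_data: Dict[str, Any], conversation_id: Optional[str]
    ) -> Tuple[Optional[str], Optional[str]]:
        """Amend turn_data in place; return (error_message, conversation_id)."""


class EchoDriver(AgentDriver):
    """Hypothetical driver used only to show registration."""

    def execute_turn(self, turn_data, conversation_id):
        turn_data["response"] = turn_data["query"]
        return None, conversation_id


# Adding a driver type means adding one entry here.
AGENT_DRIVERS: Dict[str, Type[AgentDriver]] = {"echo": EchoDriver}


def create_driver(driver_type: str, config: Dict[str, Any]) -> AgentDriver:
    """Look up the class for driver_type and instantiate it."""
    try:
        driver_cls = AGENT_DRIVERS[driver_type]
    except KeyError:
        known = ", ".join(sorted(AGENT_DRIVERS))
        raise ValueError(f"unknown driver type {driver_type!r}; known: {known}") from None
    return driver_cls(config)


turn = {"query": "hello"}
error, conv_id = create_driver("echo", {}).execute_turn(turn, "conv-1")
```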
 
 ## Integration Points
 
 | Consumer | Provider | Mechanism |
 |---|---|---|
 | `EvaluationPipeline` | `AgentDriverRegistry` | Creates drivers from config |
 | `ConversationProcessor` | `AgentDriver.execute_turn()` | Invokes driver per turn |
-| `HttpApiDriver` | `APIDataAmender` | Delegates turn amendment |
-| `APIDataAmender` | `APIClient` | Sends HTTP requests |
-| `APIClient` | `StreamingParser` | Parses SSE responses |
+| `HttpApiDriver` | `APIDataAmender` → `APIClient` | HTTP request chain |
+| `ProposalDriver` | `KubeCLI` | CR lifecycle (apply, get, delete) |
+| `ProposalDriver` | `ProposalAmender` | Fetch child CRs and build summary |
+| `derive_phase()` | CRD conditions | Phase determination logic |
 
 ## Implementation Notes
 
 - **Per-conversation drivers** are created when a conversation has agent config overrides and are cleaned up after that conversation completes. The default driver persists across all conversations.
 - **Disk caching** in `APIClient` uses `diskcache` with SHA256 keys. Cache can be disabled per-agent or globally via `core.cache_enabled`.
-- **Streaming metrics** (TTFT, duration, tokens/second) are only populated when the endpoint is configured for streaming. Non-streaming endpoints leave these fields as None.
-- **The amender mutates TurnData in-place** — there is no copy. The original response (if pre-populated in eval data) is overwritten by the agent's response.
+- **Streaming metrics** (TTFT, duration, tokens/second) are populated for streaming and responses endpoint types.
+- **The amender mutates TurnData in-place** — the original response is overwritten.
+- **Proposal CR naming** uses `eval-{safe_conv_id}-{uuid8}` to avoid namespace collisions.
+- **KubeCLI timeout** is per-command (`cli_timeout`), while ProposalDriver `timeout` is the overall lifecycle timeout for reaching a terminal state.
+- **Responses endpoint** uses the OpenAI Responses API schema: it maps `query` to `input` and `system_prompt` to `instructions`, and extracts `file_search_call` items for RAG contexts and `mcp_call` items for tool calls.
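
As an illustration of the `eval-{safe_conv_id}-{uuid8}` naming note, here is one way such a name could be built. The sanitization rule is an assumption (lowercase alphanumerics and hyphens only, truncated, with a fallback), chosen because Kubernetes object names are restricted to lowercase alphanumerics, `-`, and `.`. The spec does not say how `safe_conv_id` is actually derived, and both function names below are hypothetical.

```python
import re
import uuid


def safe_conv_id(conv_id: str, max_len: int = 40) -> str:
    """Assumed sanitizer: lowercase, collapse other characters to '-', trim."""
    cleaned = re.sub(r"[^a-z0-9-]+", "-", conv_id.lower()).strip("-")
    # Truncate, then strip again in case the cut lands on a hyphen.
    return cleaned[:max_len].strip("-") or "conv"


def proposal_name(conv_id: str) -> str:
    """Build eval-{safe_conv_id}-{uuid8}; the suffix is 8 random hex characters."""
    return f"eval-{safe_conv_id(conv_id)}-{uuid.uuid4().hex[:8]}"


name = proposal_name("My Conv_01")
print(name)
```

The random suffix is what lets repeated runs of the same conversation coexist in one namespace.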
@@ -6,7 +6,7 @@
 |---|---|---|
 | `core/models/system.py` | `SystemConfig` | Top-level system config Pydantic model |
 | `core/models/data.py` | `EvaluationData`, `TurnData`, `MetricResult`, `EvaluationResult` | Evaluation dataset, turn, and result models |
-| `core/models/agents.py` | `AgentConfig` | Agent driver configuration |
+| `core/models/agents.py` | `HttpApiAgentConfig`, `ProposalAgentConfig`, `AgentsConfig`, `AgentDefaultConfig` | Agent driver configuration models |
 | `core/models/api.py` | Legacy API config | Backward-compatible API config (deprecated) |
 | `core/models/llm.py` | LLM config models | LLM pool and judge panel models |
 | `core/models/summary.py` | `EvaluationSummary` | Result aggregation models |
 
@@ -13,6 +13,7 @@
 | `core/metrics/custom/custom.py` | `CustomMetrics` | Custom LLM-based metric handler |
 | `core/metrics/custom/keywords_eval.py` | — | Keyword matching evaluation logic |
 | `core/metrics/custom/tool_eval.py` | — | Tool use evaluation logic |
+| `core/metrics/custom/proposal_eval.py` | `evaluate_proposal_status()` | Proposal status assertion metric (phase, duration, attempts, conditions) |
 | `core/metrics/custom/prompts.py` | — | Prompt templates for custom metrics |
 | `pipeline/evaluation/evaluator.py` | `MetricsEvaluator` | Metric dispatch, multi-expected-response logic, status determination |
 | `pipeline/evaluation/judges.py` | `JudgeOrchestrator` | Panel scoring, aggregation strategies |
 
@@ -4,32 +4,34 @@
 
 | File | Key Symbols | Responsibility |
 |---|---|---|
-| `core/output/generator.py` | `OutputHandler` | Orchestrates report generation |
-| `core/output/visualization.py` | — | Graph generation (matplotlib, seaborn) |
+| `core/output/generator.py` | `OutputHandler` | Orchestrates report generation (CSV, JSON, TXT, quality report) |
+| `core/output/visualization.py` | — | Graph generation (matplotlib, seaborn): pass_rates, score_distribution, status_breakdown, conversation_heatmap |
 | `core/output/statistics.py` | — | Statistical computations (bootstrap CI, distributions) |
 | `core/output/data_persistence.py` | — | File writing (CSV, JSON, TXT) |
 | `core/storage/protocol.py` | `BaseStorageBackend` | Abstract storage interface |
-| `core/storage/factory.py` | `create_pipeline_storage_backend()` | Backend instantiation |
+| `core/storage/factory.py` | `create_pipeline_storage_backend()` | Backend instantiation from config |
 | `core/storage/file_storage.py` | `FileStorageBackend` | File output + report generation |
 | `core/storage/sql_storage.py` | `SQLStorageBackend` | Database persistence |
+| `core/storage/langfuse_storage.py` | `LangfuseStorageBackend` | Langfuse observability platform persistence |
 | `core/storage/composite_storage.py` | `CompositeStorageBackend` | Multi-backend chaining |
-| `core/storage/config.py` | — | Storage configuration models |
+| `core/storage/config.py` | `FileBackendConfig`, `DatabaseBackendConfig`, `LangfuseBackendConfig` | Storage configuration models |
 
 ## Data Flow
 
-1. During evaluation, `EvaluationPipeline` calls `storage.save_run(results)` after each conversation completes.
-2. After all conversations finish, `set_evaluation_context()` provides the full dataset, then `finalize()` is called.
-3. `FileStorageBackend.save_run()` accumulates results in memory only — no disk writes. `finalize()` triggers `OutputHandler` to generate all reports from accumulated results.
-4. `SQLStorageBackend.save_run()` commits results to the database immediately per conversation. `finalize()` is a no-op (logs a count).
-5. `CompositeStorageBackend` delegates all calls to its child backends in order.
+- During evaluation, `EvaluationPipeline` calls `storage.save_run(results)` after each conversation completes.
+- After all conversations finish, `set_evaluation_context()` provides the full dataset, then `finalize()` is called.
+- `FileStorageBackend.save_run()` accumulates results in memory only — no disk writes. `finalize()` triggers `OutputHandler` to generate all reports from accumulated results.
+- `SQLStorageBackend.save_run()` commits results to the database immediately per conversation. `finalize()` is a no-op (logs a count).
+- `LangfuseStorageBackend` accumulates results during `save_run()`, then creates a trace span and writes individual scores via `create_score()` on `finalize()`.
+- `CompositeStorageBackend` delegates all calls to its child backends in order.
 
 ## Key Abstractions
 
-**Storage lifecycle** is protocol-driven: `initialize()` → `save_run()` (repeated per conversation) → `set_evaluation_context()` → `finalize()` → `close()`. File and SQL backends implement this lifecycle differently: file storage defers all writes to `finalize()`, while SQL storage commits immediately in each `save_run()`.
+**Storage lifecycle** is protocol-driven: `initialize()` → `save_run()` (repeated per conversation) → `set_evaluation_context()` → `finalize()` → `close()`. Each backend implements this differently: file defers writes, SQL commits incrementally, Langfuse accumulates then flushes.
 
-**The factory pattern** in `create_pipeline_storage_backend()` reads the config's storage list and instantiates the appropriate backends. If multiple backends are configured, they're wrapped in a `CompositeStorageBackend`. When no storage is configured, a `NoOpStorageBackend` is returned.
+**The factory pattern** in `create_pipeline_storage_backend()` reads the config's storage list and instantiates the appropriate backends (file, sql, langfuse). If multiple backends are configured, they're wrapped in a `CompositeStorageBackend`. When no storage is configured, a `NoOpStorageBackend` is returned.
 
-**FileStorageBackend** accumulates results in memory during `save_run()` and needs `SystemConfig` plus the full evaluation dataset (`set_evaluation_context()`) to generate reports in `finalize()`. **SQLStorageBackend** commits to the database immediately per conversation and its `finalize()` is a no-op.
+**FileStorageBackend** accumulates results in memory during `save_run()` and needs `SystemConfig` plus the full evaluation dataset (`set_evaluation_context()`) to generate reports in `finalize()`. **SQLStorageBackend** commits to the database immediately per conversation and its `finalize()` is a no-op. **LangfuseStorageBackend** accumulates results and writes traces/scores to Langfuse on `finalize()`.
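
The lifecycle and the deferred-versus-immediate distinction described above can be sketched with in-memory stand-ins. None of these classes are the project's: `DeferredBackend` mimics the file backend's "accumulate, then write on `finalize()`" behavior, `ImmediateBackend` mimics the SQL backend's per-conversation commit, and `CompositeBackend` shows in-order delegation. Only the five lifecycle method names are taken from the spec.

```python
from typing import Any, Iterable, List


class DeferredBackend:
    """Accumulates in save_run(); writes only in finalize()."""

    def __init__(self) -> None:
        self.pending: List[Any] = []
        self.written: List[Any] = []
        self.dataset: List[Any] = []

    def initialize(self) -> None:
        pass

    def save_run(self, results: Iterable[Any]) -> None:
        self.pending.extend(results)

    def set_evaluation_context(self, dataset: Iterable[Any]) -> None:
        self.dataset = list(dataset)

    def finalize(self) -> None:
        self.written = list(self.pending)

    def close(self) -> None:
        pass


class ImmediateBackend(DeferredBackend):
    """Commits in save_run(); finalize() has nothing left to do."""

    def save_run(self, results: Iterable[Any]) -> None:
        self.written.extend(results)

    def finalize(self) -> None:
        pass


class CompositeBackend:
    """Delegates every lifecycle call to its children, in order."""

    def __init__(self, children: Iterable[Any]) -> None:
        self.children = list(children)

    def _each(self, method: str, *args: Any) -> None:
        for child in self.children:
            getattr(child, method)(*args)

    def initialize(self) -> None:
        self._each("initialize")

    def save_run(self, results: Iterable[Any]) -> None:
        self._each("save_run", list(results))

    def set_evaluation_context(self, dataset: Iterable[Any]) -> None:
        self._each("set_evaluation_context", list(dataset))

    def finalize(self) -> None:
        self._each("finalize")

    def close(self) -> None:
        self._each("close")


deferred, immediate = DeferredBackend(), ImmediateBackend()
storage = CompositeBackend([deferred, immediate])
storage.initialize()
for conversation_results in (["r1", "r2"], ["r3"]):
    storage.save_run(conversation_results)
written_before_finalize = list(deferred.written)  # still empty at this point
storage.set_evaluation_context(["conv-a", "conv-b"])
storage.finalize()
storage.close()
```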
 
 ## Integration Points
 
@@ -39,6 +41,7 @@
 | `FileStorageBackend` | `OutputHandler` | Delegates report generation on finalize |
 | `OutputHandler` | `EvaluationSummary` | Computes statistics for reports |
 | `SQLStorageBackend` | SQLAlchemy | Database operations |
+| `LangfuseStorageBackend` | Langfuse SDK | Trace and score creation |
 
 ## Implementation Notes
 
@@ -47,3 +50,4 @@
 - **Graph generation** imports matplotlib and seaborn at call time, not at module level, because they're slow to import and not always needed.
 - **Report paths**: Output files are written to the directory specified in config, with timestamped subdirectories per run.
 - **File storage memory pressure**: Because file storage accumulates all results in memory until `finalize()`, very large evaluation runs may consume significant memory. SQL storage does not have this issue since it commits incrementally.
+- **Langfuse** requires the `langfuse` optional dependency (>=4.0.0). Config supports inline credentials or environment variable fallback.