
[agent-idea] Agent QA Pipeline — Meta-Circular Testing with Framework Evolution Proposals #4286

@felipestenzel

Description


Summary

A goal-driven agent that performs quality assessment on other Hive agents — static analysis, functional testing, resilience testing, and security auditing — with a PASS / CONDITIONAL / FAIL verdict and iterative fix/re-test cycles.

This proposal deliberately pushes the framework's boundaries. It works today for static analysis and spec-level reasoning, and proposes 3 concrete framework additions (with API designs and code locations) to enable full runtime testing.

Per the contribution guidelines: "Proposals that reveal missing capabilities, integrations, or tools are just as valuable as working code. They help us shape the roadmap."


Problem Statement

Agent developers building with Hive have no systematic way to validate their agents before deployment. Questions like:

  • Does my graph have unreachable nodes or broken edge conditions?
  • Will my agent handle tool failures gracefully?
  • Is my agent vulnerable to prompt injection via tool results?
  • Does my fan-out/fan-in pattern actually converge correctly?

Currently answered by: manual testing, hoping for the best.


Target Users

  • Agent developers validating their graphs before deployment
  • Teams running CI/CD pipelines for agent quality gates
  • The Hive framework team itself, for validating sample templates

Graph Architecture

intake → load-agent → static-analysis → generate-test-plan → review-test-plan (HITL pause)
  → [fan-out: run-functional | run-resilience | run-security]
  → [fan-in: aggregate-results]
  → judge-quality
    → PASS/FAIL → generate-report → deliver-report
    → CONDITIONAL → request-fixes → load-agent (feedback cycle, max 3x)

Nodes (13)

| Node | Type | Purpose |
|---|---|---|
| intake | event_loop (client_facing) | Collect agent spec from user (file path, URL, or raw JSON) |
| load-agent | function | Parse and validate agent spec (deterministic, no LLM) |
| static-analysis | function | Structural analysis: topology, patterns, edge consistency |
| generate-test-plan | event_loop | LLM generates test plan across 3 categories |
| review-test-plan | event_loop (client_facing, HITL pause) | User reviews and approves test plan |
| run-functional | event_loop | Execute functional correctness tests |
| run-resilience | event_loop | Execute resilience and fault tolerance tests |
| run-security | event_loop | Execute security tests (OWASP LLM Top 10) |
| aggregate-results | function | Merge results from 3 parallel runners (deterministic) |
| judge-quality | event_loop | Evaluate results, produce verdict |
| generate-report | event_loop | Generate HTML quality report |
| deliver-report | event_loop (client_facing) | Present report with download link |
| request-fixes | event_loop (client_facing) | Present fix suggestions, collect updated spec |

Edges (17)

| Pattern | Edges | Description |
|---|---|---|
| Sequential | 6 | intake→load, load→analysis, analysis→testplan, testplan→review, aggregate→judge, report→deliver |
| Fan-out | 3 | review-test-plan → run-functional, run-resilience, run-security |
| Fan-in | 3 | run-functional, run-resilience, run-security → aggregate-results |
| Conditional routing | 2 | judge → generate-report (verdict in PASS,FAIL), judge → request-fixes (verdict == CONDITIONAL) |
| Feedback cycle | 2 | request-fixes → load-agent (re-test), request-fixes → generate-report (skip) |
| On-failure | 1 | load-agent → generate-report (graceful error handling) |

Framework Features Demonstrated

This would be the first template to demonstrate these features (all currently at 0% template coverage):

| Feature | Where in this agent | Current coverage |
|---|---|---|
| function node type | load-agent, static-analysis, aggregate-results | 0/4 templates |
| Fan-out / fan-in | 3 parallel test runners | 0/4 templates |
| on_failure edge | load-agent → generate-report | 0/4 templates |
| HITL pause_nodes | review-test-plan | 0/4 templates |
| Conditional routing (multi-path) | judge → PASS/FAIL vs CONDITIONAL | 0/4 templates |
| Feedback loop with max_node_visits | request-fixes → load-agent (max 3) | 0/4 templates |
| nullable_output_keys | test_preferences, load_errors, fix_suggestions | 0/4 templates |
| prompt_injection_shield | Graph-level "warn" mode | 0/4 templates |

What Works Today vs What Needs Framework Additions

Works Today (no changes needed)

  • Static analysis of agent specs — function nodes can parse agent.json, validate graph topology, detect patterns, check edge consistency
  • LLM-powered test plan generation — event_loop nodes can reason about agent specs and generate test scenarios
  • Spec-level security auditing — check for missing prompt_injection_shield, overly broad tool access, missing HITL gates
  • Fan-out/fan-in execution — 3 parallel test runners converging to aggregator
  • HITL approval gates — user reviews test plan before execution
  • Feedback loop — fix/re-test cycle with max 3 iterations
  • Quality verdicting and HTML reporting — judge + report generation
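As an illustration of the "works today" static-analysis path, here is a minimal sketch of unreachable-node detection. The spec shape used here (`entry`, `nodes`, `edges` keys) is a simplifying assumption for the example, not the actual Hive agent.json schema:

```python
from collections import deque

def find_unreachable_nodes(spec: dict) -> set[str]:
    """BFS from the entry node; any node never visited is unreachable.

    Assumes a simplified spec shape: {"entry": str, "nodes": [str],
    "edges": [{"from": str, "to": str}]} -- not the real Hive schema.
    """
    adjacency: dict[str, list[str]] = {n: [] for n in spec["nodes"]}
    for edge in spec["edges"]:
        adjacency[edge["from"]].append(edge["to"])

    seen = {spec["entry"]}
    queue = deque([spec["entry"]])
    while queue:
        for nxt in adjacency[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return set(spec["nodes"]) - seen
```

The same traversal generalizes to the other structural checks (broken edge conditions, dangling fan-ins) by walking the parsed spec rather than executing it.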

Needs 3 Framework Additions (proposals below)

To move from spec-level reasoning to actual runtime testing of target agents, 3 capabilities are needed. Each has existing code to build on:


Framework Addition Proposals

Proposal 1: Sub-Graph Execution Node

What: A node type that executes another agent's graph as a child process within the current execution.

Why it matters: The test runners need to actually run the target agent to test it, not just reason about its JSON spec.

Existing foundation: This is already partially designed:

  • ActionType.SUB_GRAPH = "sub_graph" exists in core/framework/graph/plan.py:25
  • WorkerNode._execute_sub_graph() exists in core/framework/graph/worker_node.py:479-520
  • ActionSpec.graph_id field exists in core/framework/graph/plan.py:122

What's missing is wiring this into GraphExecutor:

# New node type (extends the existing SubGraphNode skeleton)
class SubGraphNode(NodeProtocol):
    async def execute(self, ctx: NodeContext) -> NodeResult:
        # Load the child graph and its goal from the path in the node spec
        graph, goal = self._loader(ctx.node_spec.sub_graph_path)
        # Child executor shares the parent's runtime, LLM, and tools
        child_executor = GraphExecutor(
            runtime=self._runtime,
            llm=ctx.llm,
            tools=self._tools,
            tool_executor=self._tool_executor,
        )
        # Run the child graph to completion on the parent node's input
        result = await child_executor.execute(graph=graph, goal=goal, input_data=ctx.input_data)
        # Surface the child run's outcome as this node's result
        return NodeResult(success=result.success, output=result.output, tokens_used=result.total_tokens)

Key insertion points: executor.py _get_node_implementation(), node.py new SubGraphNode class, add "sub_graph" to VALID_NODE_TYPES.

Edge cases to handle: Recursive depth control, token budget sharing, memory isolation between parent/child.
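One way the recursive-depth edge case could be handled is a depth counter threaded from parent to child executor. Everything here is a sketch: `SubGraphDepthError`, `check_depth`, and the default limit of 3 are hypothetical names and values, not framework constants:

```python
class SubGraphDepthError(RuntimeError):
    """Raised when nested sub-graph execution exceeds the allowed depth."""

def check_depth(current_depth: int, max_depth: int = 3) -> int:
    """Return the depth a child executor should run at, or raise if the
    parent is already at the limit.

    current_depth is the parent's nesting level (0 for the root graph);
    max_depth of 3 is an illustrative default, not a framework setting.
    """
    if current_depth >= max_depth:
        raise SubGraphDepthError(
            f"sub-graph nesting depth {current_depth} reached limit {max_depth}"
        )
    return current_depth + 1
```

The parent would call this before constructing the child GraphExecutor, so a QA agent testing another QA agent cannot recurse unboundedly.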


Proposal 2: Tool Interception / Mocking

What: A middleware layer in ToolRegistry that intercepts tool calls to inject failures, mock responses, or modify inputs.

Why it matters: Resilience testing needs to simulate tool failures (timeouts, rate limits, connection errors) without actually breaking things.

Existing foundation: ToolRegistry.get_executor() (tool_registry.py:222-254) is the single dispatch point for all tool calls. Adding an interceptor chain here is clean:

class ToolInterceptor(Protocol):
    def intercept(self, tool_use: ToolUse) -> ToolInterception: ...

@dataclass
class ToolInterception:
    intercept: bool = False                 # True = short-circuit the real tool call
    mock_result: ToolResult | None = None   # canned result to return instead
    inject_error: str | None = None         # simulated failure message

# In ToolRegistry.get_executor():
def executor(tool_use: ToolUse) -> ToolResult:
    # Give each registered interceptor a chance to hijack the call
    for interceptor in self._interceptors:
        interception = interceptor.intercept(tool_use)
        if interception.intercept:
            if interception.inject_error:
                return ToolResult(content=json.dumps({"error": interception.inject_error}), is_error=True)
            return interception.mock_result
    # ... existing dispatch ...

Key insertion point: tool_registry.py after the _tools dict initialization.

Bonus: This pattern complements the existing PromptInjectionShield (post-execution scan) by adding pre-execution interception.


Proposal 3: Execution Snapshot & Comparison

What: Capture a structured recording of every node execution during a graph run, and diff two recordings.

Why it matters: Functional testing needs to compare "expected execution" against "actual execution" — same path? Same outputs? Same quality?

Existing foundation: ExecutionResult (executor.py:42-68) already captures path, node_visit_counts, execution_quality, retry_details. Extending it:

@dataclass
class NodeSnapshot:
    node_id: str
    success: bool
    output: dict[str, Any]
    tokens_used: int
    latency_ms: int

@dataclass
class SnapshotDiff:
    path_matches: bool          # did both runs visit the same nodes in the same order?
    output_matches: bool        # did every node produce identical outputs?
    node_diffs: list[NodeDiff]  # per-node deltas (NodeDiff defined alongside)
    quality_change: str         # "same", "improved", "degraded"

    @property
    def is_regression(self) -> bool:
        return not self.output_matches or self.quality_change == "degraded"

Key insertion point: executor.py main execution loop (line ~579, after result = await node_impl.execute(ctx)), new execution_snapshot.py module.
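A minimal sketch of the comparison itself, runnable standalone. `changed_nodes` stands in for the richer `node_diffs` list above, and `quality_change` is passed in rather than derived from `execution_quality`, purely for brevity:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class NodeSnapshot:
    node_id: str
    success: bool
    output: dict[str, Any]
    tokens_used: int
    latency_ms: int

@dataclass
class SnapshotDiff:
    path_matches: bool
    output_matches: bool
    changed_nodes: list[str]  # simplified stand-in for node_diffs
    quality_change: str       # "same", "improved", "degraded"

    @property
    def is_regression(self) -> bool:
        return not self.output_matches or self.quality_change == "degraded"

def diff_snapshots(
    expected: list[NodeSnapshot],
    actual: list[NodeSnapshot],
    quality_change: str = "same",
) -> SnapshotDiff:
    """Compare two recorded runs node-by-node."""
    path_matches = [s.node_id for s in expected] == [s.node_id for s in actual]
    actual_by_id = {s.node_id: s for s in actual}
    changed = [
        s.node_id
        for s in expected
        if s.node_id not in actual_by_id or actual_by_id[s.node_id].output != s.output
    ]
    return SnapshotDiff(
        path_matches=path_matches,
        output_matches=not changed,
        changed_nodes=changed,
        quality_change=quality_change,
    )
```

The functional runner would record one snapshot list per run of the target agent and flag `is_regression` diffs in the aggregated results.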


Non-Overlap Verification

| Existing Proposal | Overlap? | Why Not |
|---|---|---|
| #4050 Support Debugger | No | Debugs customer issues, not agent specs |
| #1851 Code Review Agent | No | Reviews source code quality, not agent graphs |
| #4224 Vulnerability Auditor | No | Scans project dependencies for CVEs, not agent architecture |
| #3803 Agent Unit Testing Harness | Partial | #3803 proposes a CI test runner; this proposes an agent-based QA pipeline with HITL. Different execution model, complementary goals |

Expected Behavior

Input: Path to any agent.json or agent.py

Output:

  1. Static analysis report (topology, patterns, issues)
  2. Test plan (approved by user via HITL)
  3. Test results (functional + resilience + security)
  4. Quality verdict (PASS / CONDITIONAL / FAIL with score 0-100)
  5. HTML report with detailed findings
  6. If CONDITIONAL: fix suggestions + re-test cycle (max 3x)

Example interaction:

User: Test my agent at examples/templates/tech_news_reporter/agent.json

Agent: I'll analyze your Tech News Reporter agent.

[load-agent: Parsed 3 nodes, 2 edges]
[static-analysis: Linear pipeline, no error recovery, no HITL gates]

Here's the test plan (12 tests across 3 categories):
- Functional (5): output key validation, edge routing, goal criteria coverage...
- Resilience (4): tool failure handling, missing web_search results, retry behavior...
- Security (3): prompt injection via scraped content, data exposure in reports...

Approve this plan? [HITL pause — user reviews and approves]

[fan-out: running 3 test categories in parallel...]
[fan-in: aggregating results...]

Verdict: CONDITIONAL (score: 62/100)
- ✅ Functional: 5/5 passed
- ⚠️ Resilience: 2/4 passed — no on_failure edges, no retry_on configuration
- ⚠️ Security: 1/3 passed — no prompt_injection_shield, web_scrape results unscanned

Fix suggestions:
1. Add on_failure edge from research → compile-report for graceful degradation
2. Add prompt_injection_shield: "warn" to graph spec
3. Add retry_on: ["ToolExecutionError"] to research node

Apply fixes and re-test? [HITL — user provides updated spec]
[feedback cycle: reload → re-analyze → re-test...]
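The score-to-verdict mapping in the example above could be as simple as a pair of thresholds. The 80/50 cutoffs here are illustrative defaults only; the proposal does not fix specific values:

```python
def verdict_from_score(score: int, pass_at: int = 80, conditional_at: int = 50) -> str:
    """Map a 0-100 quality score to a verdict.

    pass_at and conditional_at are hypothetical cutoffs, configurable
    by whoever wires up the judge-quality node.
    """
    if score >= pass_at:
        return "PASS"
    if score >= conditional_at:
        return "CONDITIONAL"
    return "FAIL"
```

With these defaults, the example's 62/100 lands in the CONDITIONAL band and triggers the fix/re-test cycle.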

Scope & Implementation Path

Phase 1 (works today): Static analysis + spec-level LLM reasoning for all 3 test categories. Demonstrates all graph patterns. Useful as-is for structural validation.

Phase 2 (needs Proposal 1): Sub-graph execution enables actual runtime testing of target agents.

Phase 3 (needs Proposals 2+3): Tool interception enables resilience testing. Snapshot comparison enables regression testing.

Each phase is independently valuable and can be implemented incrementally.
