
[agent-idea] Agent QA Pipeline — Meta-Circular Testing with Framework Evolution Proposals #4286

@felipestenzel

Description


Summary

A goal-driven agent that performs quality assessment on other Hive agents — static analysis, functional testing, resilience testing, and security auditing — with a PASS / CONDITIONAL / FAIL verdict and iterative fix/re-test cycles.

This proposal deliberately pushes the framework's boundaries. It works today for static analysis and spec-level reasoning, and proposes 3 concrete framework additions (with API designs and code locations) to enable full runtime testing.

Per the contribution guidelines: "Proposals that reveal missing capabilities, integrations, or tools are just as valuable as working code. They help us shape the roadmap."


Problem Statement

Agent developers building with Hive have no systematic way to validate their agents before deployment. Questions like:

  • Does my graph have unreachable nodes or broken edge conditions?
  • Will my agent handle tool failures gracefully?
  • Is my agent vulnerable to prompt injection via tool results?
  • Does my fan-out/fan-in pattern actually converge correctly?

Currently answered by: manual testing, hoping for the best.


Target Users

  • Agent developers validating their graphs before deployment
  • Teams running CI/CD pipelines for agent quality gates
  • The Hive framework team itself, for validating sample templates

Graph Architecture

intake → load-agent → static-analysis → generate-test-plan → review-test-plan (HITL pause)
  → [fan-out: run-functional | run-resilience | run-security]
  → [fan-in: aggregate-results]
  → judge-quality
    → PASS/FAIL → generate-report → deliver-report
    → CONDITIONAL → request-fixes → load-agent (feedback cycle, max 3x)

Nodes (13)

| Node | Type | Purpose |
|---|---|---|
| intake | event_loop (client_facing) | Collect agent spec from user (file path, URL, or raw JSON) |
| load-agent | function | Parse and validate agent spec (deterministic, no LLM) |
| static-analysis | function | Structural analysis: topology, patterns, edge consistency |
| generate-test-plan | event_loop | LLM generates test plan across 3 categories |
| review-test-plan | event_loop (client_facing, HITL pause) | User reviews and approves test plan |
| run-functional | event_loop | Execute functional correctness tests |
| run-resilience | event_loop | Execute resilience and fault tolerance tests |
| run-security | event_loop | Execute security tests (OWASP LLM Top 10) |
| aggregate-results | function | Merge results from 3 parallel runners (deterministic) |
| judge-quality | event_loop | Evaluate results, produce verdict |
| generate-report | event_loop | Generate HTML quality report |
| deliver-report | event_loop (client_facing) | Present report with download link |
| request-fixes | event_loop (client_facing) | Present fix suggestions, collect updated spec |

Edges (17)

| Pattern | Edges | Description |
|---|---|---|
| Sequential | 6 | intake→load, load→analysis, analysis→testplan, testplan→review, aggregate→judge, report→deliver |
| Fan-out | 3 | review-test-plan → run-functional, run-resilience, run-security |
| Fan-in | 3 | run-functional, run-resilience, run-security → aggregate-results |
| Conditional routing | 2 | judge → generate-report (verdict in PASS,FAIL), judge → request-fixes (verdict == CONDITIONAL) |
| Feedback cycle | 2 | request-fixes → load-agent (re-test), request-fixes → generate-report (skip) |
| On-failure | 1 | load-agent → generate-report (graceful error handling) |

Framework Features Demonstrated

This would be the first template to demonstrate these features (all currently at 0% template coverage):

| Feature | Where in this agent | Current coverage |
|---|---|---|
| function node type | load-agent, static-analysis, aggregate-results | 0/4 templates |
| Fan-out / fan-in | 3 parallel test runners | 0/4 templates |
| on_failure edge | load-agent → generate-report | 0/4 templates |
| HITL pause_nodes | review-test-plan | 0/4 templates |
| Conditional routing (multi-path) | judge → PASS/FAIL vs CONDITIONAL | 0/4 templates |
| Feedback loop with max_node_visits | request-fixes → load-agent (max 3) | 0/4 templates |
| nullable_output_keys | test_preferences, load_errors, fix_suggestions | 0/4 templates |
| prompt_injection_shield | Graph-level "warn" mode | 0/4 templates |

What Works Today vs What Needs Framework Additions

Works Today (no changes needed)

  • Static analysis of agent specs — function nodes can parse agent.json, validate graph topology, detect patterns, check edge consistency
  • LLM-powered test plan generation — event_loop nodes can reason about agent specs and generate test scenarios
  • Spec-level security auditing — check for missing prompt_injection_shield, overly broad tool access, missing HITL gates
  • Fan-out/fan-in execution — 3 parallel test runners converging to aggregator
  • HITL approval gates — user reviews test plan before execution
  • Feedback loop — fix/re-test cycle with max 3 iterations
  • Quality verdicting and HTML reporting — judge + report generation
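As an illustration of the "works today" static-analysis path, here is a minimal sketch of unreachable-node detection. The spec shape used here (`entry`, `nodes`, `edges` keys) is a simplifying assumption for the example, not the actual Hive agent.json schema:

```python
from collections import deque

def find_unreachable_nodes(spec: dict) -> set[str]:
    """BFS from the entry node; any node never visited is unreachable.

    Assumes a simplified spec shape: {"entry": str, "nodes": [str],
    "edges": [{"from": str, "to": str}]} -- not the real Hive schema.
    """
    adjacency: dict[str, list[str]] = {n: [] for n in spec["nodes"]}
    for edge in spec["edges"]:
        adjacency[edge["from"]].append(edge["to"])

    seen = {spec["entry"]}
    queue = deque([spec["entry"]])
    while queue:
        for nxt in adjacency[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return set(spec["nodes"]) - seen
```

The same traversal generalizes to the other structural checks (broken edge conditions, dangling fan-ins) by walking the parsed spec rather than executing it.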

Needs 3 Framework Additions (proposals below)

To move from spec-level reasoning to actual runtime testing of target agents, 3 capabilities are needed. Each has existing code to build on:


Framework Addition Proposals

Proposal 1: Sub-Graph Execution Node

What: A node type that executes another agent's graph as a child process within the current execution.

Why it matters: The test runners need to actually run the target agent to test it, not just reason about its JSON spec.

Existing foundation: This is already partially designed:

  • ActionType.SUB_GRAPH = "sub_graph" exists in core/framework/graph/plan.py:25
  • WorkerNode._execute_sub_graph() exists in core/framework/graph/worker_node.py:479-520
  • ActionSpec.graph_id field exists in core/framework/graph/plan.py:122

What's missing is wiring this into GraphExecutor:

# New node type (extends the existing SubGraphNode skeleton)
class SubGraphNode(NodeProtocol):
    async def execute(self, ctx: NodeContext) -> NodeResult:
        # Load the child graph and its goal from the path in the node spec
        graph, goal = self._loader(ctx.node_spec.sub_graph_path)
        # Child executor shares the parent's runtime, LLM, and tools
        child_executor = GraphExecutor(
            runtime=self._runtime,
            llm=ctx.llm,
            tools=self._tools,
            tool_executor=self._tool_executor,
        )
        # Run the child graph to completion on the parent node's input
        result = await child_executor.execute(graph=graph, goal=goal, input_data=ctx.input_data)
        # Surface the child run's outcome as this node's result
        return NodeResult(success=result.success, output=result.output, tokens_used=result.total_tokens)

Key insertion points: executor.py _get_node_implementation(), node.py new SubGraphNode class, add "sub_graph" to VALID_NODE_TYPES.

Edge cases to handle: Recursive depth control, token budget sharing, memory isolation between parent/child.
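One way the recursive-depth edge case could be handled is a depth counter threaded from parent to child executor. Everything here is a sketch: `SubGraphDepthError`, `check_depth`, and the default limit of 3 are hypothetical names and values, not framework constants:

```python
class SubGraphDepthError(RuntimeError):
    """Raised when nested sub-graph execution exceeds the allowed depth."""

def check_depth(current_depth: int, max_depth: int = 3) -> int:
    """Return the depth a child executor should run at, or raise if the
    parent is already at the limit.

    current_depth is the parent's nesting level (0 for the root graph);
    max_depth of 3 is an illustrative default, not a framework setting.
    """
    if current_depth >= max_depth:
        raise SubGraphDepthError(
            f"sub-graph nesting depth {current_depth} reached limit {max_depth}"
        )
    return current_depth + 1
```

The parent would call this before constructing the child GraphExecutor, so a QA agent testing another QA agent cannot recurse unboundedly.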


Proposal 2: Tool Interception / Mocking

What: A middleware layer in ToolRegistry that intercepts tool calls to inject failures, mock responses, or modify inputs.

Why it matters: Resilience testing needs to simulate tool failures (timeouts, rate limits, connection errors) without actually breaking things.

Existing foundation: ToolRegistry.get_executor() (tool_registry.py:222-254) is the single dispatch point for all tool calls. Adding an interceptor chain here is clean:

class ToolInterceptor(Protocol):
    def intercept(self, tool_use: ToolUse) -> ToolInterception: ...

@dataclass
class ToolInterception:
    intercept: bool = False                 # True = short-circuit the real tool call
    mock_result: ToolResult | None = None   # canned result to return instead
    inject_error: str | None = None         # simulated failure message

# In ToolRegistry.get_executor():
def executor(tool_use: ToolUse) -> ToolResult:
    # Give each registered interceptor a chance to hijack the call
    for interceptor in self._interceptors:
        interception = interceptor.intercept(tool_use)
        if interception.intercept:
            if interception.inject_error:
                return ToolResult(content=json.dumps({"error": interception.inject_error}), is_error=True)
            return interception.mock_result
    # ... existing dispatch ...

Key insertion point: tool_registry.py after the _tools dict initialization.

Bonus: This pattern complements the existing PromptInjectionShield (post-execution scan) by adding pre-execution interception.


Proposal 3: Execution Snapshot & Comparison

What: Capture a structured recording of every node execution during a graph run, and diff two recordings.

Why it matters: Functional testing needs to compare "expected execution" against "actual execution" — same path? Same outputs? Same quality?

Existing foundation: ExecutionResult (executor.py:42-68) already captures path, node_visit_counts, execution_quality, retry_details. Extending it:

@dataclass
class NodeSnapshot:
    node_id: str
    success: bool
    output: dict[str, Any]
    tokens_used: int
    latency_ms: int

@dataclass
class SnapshotDiff:
    path_matches: bool          # did both runs visit the same nodes in the same order?
    output_matches: bool        # did every node produce identical outputs?
    node_diffs: list[NodeDiff]  # per-node deltas (NodeDiff defined alongside)
    quality_change: str         # "same", "improved", "degraded"

    @property
    def is_regression(self) -> bool:
        return not self.output_matches or self.quality_change == "degraded"

Key insertion point: executor.py main execution loop (line ~579, after result = await node_impl.execute(ctx)), new execution_snapshot.py module.
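A minimal sketch of the comparison itself, runnable standalone. `changed_nodes` stands in for the richer `node_diffs` list above, and `quality_change` is passed in rather than derived from `execution_quality`, purely for brevity:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class NodeSnapshot:
    node_id: str
    success: bool
    output: dict[str, Any]
    tokens_used: int
    latency_ms: int

@dataclass
class SnapshotDiff:
    path_matches: bool
    output_matches: bool
    changed_nodes: list[str]  # simplified stand-in for node_diffs
    quality_change: str       # "same", "improved", "degraded"

    @property
    def is_regression(self) -> bool:
        return not self.output_matches or self.quality_change == "degraded"

def diff_snapshots(
    expected: list[NodeSnapshot],
    actual: list[NodeSnapshot],
    quality_change: str = "same",
) -> SnapshotDiff:
    """Compare two recorded runs node-by-node."""
    path_matches = [s.node_id for s in expected] == [s.node_id for s in actual]
    actual_by_id = {s.node_id: s for s in actual}
    changed = [
        s.node_id
        for s in expected
        if s.node_id not in actual_by_id or actual_by_id[s.node_id].output != s.output
    ]
    return SnapshotDiff(
        path_matches=path_matches,
        output_matches=not changed,
        changed_nodes=changed,
        quality_change=quality_change,
    )
```

The functional runner would record one snapshot list per run of the target agent and flag `is_regression` diffs in the aggregated results.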


Non-Overlap Verification

| Existing Proposal | Overlap? | Why Not |
|---|---|---|
| #4050 Support Debugger | No | Debugs customer issues, not agent specs |
| #1851 Code Review Agent | No | Reviews source code quality, not agent graphs |
| #4224 Vulnerability Auditor | No | Scans project dependencies for CVEs, not agent architecture |
| #3803 Agent Unit Testing Harness | Partial | #3803 proposes a CI test runner; this proposes an agent-based QA pipeline with HITL. Different execution model, complementary goals |

Expected Behavior

Input: Path to any agent.json or agent.py

Output:

  1. Static analysis report (topology, patterns, issues)
  2. Test plan (approved by user via HITL)
  3. Test results (functional + resilience + security)
  4. Quality verdict (PASS / CONDITIONAL / FAIL with score 0-100)
  5. HTML report with detailed findings
  6. If CONDITIONAL: fix suggestions + re-test cycle (max 3x)

Example interaction:

User: Test my agent at examples/templates/tech_news_reporter/agent.json

Agent: I'll analyze your Tech News Reporter agent.

[load-agent: Parsed 3 nodes, 2 edges]
[static-analysis: Linear pipeline, no error recovery, no HITL gates]

Here's the test plan (12 tests across 3 categories):
- Functional (5): output key validation, edge routing, goal criteria coverage...
- Resilience (4): tool failure handling, missing web_search results, retry behavior...
- Security (3): prompt injection via scraped content, data exposure in reports...

Approve this plan? [HITL pause — user reviews and approves]

[fan-out: running 3 test categories in parallel...]
[fan-in: aggregating results...]

Verdict: CONDITIONAL (score: 62/100)
- ✅ Functional: 5/5 passed
- ⚠️ Resilience: 2/4 passed — no on_failure edges, no retry_on configuration
- ⚠️ Security: 1/3 passed — no prompt_injection_shield, web_scrape results unscanned

Fix suggestions:
1. Add on_failure edge from research → compile-report for graceful degradation
2. Add prompt_injection_shield: "warn" to graph spec
3. Add retry_on: ["ToolExecutionError"] to research node

Apply fixes and re-test? [HITL — user provides updated spec]
[feedback cycle: reload → re-analyze → re-test...]
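The score-to-verdict mapping in the example above could be as simple as a pair of thresholds. The 80/50 cutoffs here are illustrative defaults only; the proposal does not fix specific values:

```python
def verdict_from_score(score: int, pass_at: int = 80, conditional_at: int = 50) -> str:
    """Map a 0-100 quality score to a verdict.

    pass_at and conditional_at are hypothetical cutoffs, configurable
    by whoever wires up the judge-quality node.
    """
    if score >= pass_at:
        return "PASS"
    if score >= conditional_at:
        return "CONDITIONAL"
    return "FAIL"
```

With these defaults, the example's 62/100 lands in the CONDITIONAL band and triggers the fix/re-test cycle.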

Scope & Implementation Path

Phase 1 (works today): Static analysis + spec-level LLM reasoning for all 3 test categories. Demonstrates all graph patterns. Useful as-is for structural validation.

Phase 2 (needs Proposal 1): Sub-graph execution enables actual runtime testing of target agents.

Phase 3 (needs Proposals 2+3): Tool interception enables resilience testing. Snapshot comparison enables regression testing.

Each phase is independently valuable and can be implemented incrementally.
