Summary
A goal-driven agent that performs quality assessment on other Hive agents — static analysis, functional testing, resilience testing, and security auditing — with a PASS / CONDITIONAL / FAIL verdict and iterative fix/re-test cycles.
This proposal deliberately pushes the framework's boundaries. It works today for static analysis and spec-level reasoning, and proposes 3 concrete framework additions (with API designs and code locations) to enable full runtime testing.
Per the contribution guidelines: "Proposals that reveal missing capabilities, integrations, or tools are just as valuable as working code. They help us shape the roadmap."
Problem Statement
Agent developers building with Hive have no systematic way to validate their agents before deployment. Questions like:
- Does my graph have unreachable nodes or broken edge conditions?
- Will my agent handle tool failures gracefully?
- Is my agent vulnerable to prompt injection via tool results?
- Does my fan-out/fan-in pattern actually converge correctly?
Today, these questions are answered by manual testing and hoping for the best.
Target Users
- Agent developers validating their graphs before deployment
- Teams running CI/CD pipelines for agent quality gates
- The Hive framework team itself, for validating sample templates
Graph Architecture
```
intake → load-agent → static-analysis → generate-test-plan → review-test-plan (HITL pause)
  → [fan-out: run-functional | run-resilience | run-security]
  → [fan-in: aggregate-results]
  → judge-quality
      → PASS/FAIL → generate-report → deliver-report
      → CONDITIONAL → request-fixes → load-agent (feedback cycle, max 3x)
```
Nodes (13)
| Node | Type | Purpose |
|---|---|---|
| `intake` | event_loop (client_facing) | Collect agent spec from user (file path, URL, or raw JSON) |
| `load-agent` | function | Parse and validate agent spec (deterministic, no LLM) |
| `static-analysis` | function | Structural analysis: topology, patterns, edge consistency |
| `generate-test-plan` | event_loop | LLM generates test plan across 3 categories |
| `review-test-plan` | event_loop (client_facing, HITL pause) | User reviews and approves test plan |
| `run-functional` | event_loop | Execute functional correctness tests |
| `run-resilience` | event_loop | Execute resilience and fault tolerance tests |
| `run-security` | event_loop | Execute security tests (OWASP LLM Top 10) |
| `aggregate-results` | function | Merge results from 3 parallel runners (deterministic) |
| `judge-quality` | event_loop | Evaluate results, produce verdict |
| `generate-report` | event_loop | Generate HTML quality report |
| `deliver-report` | event_loop (client_facing) | Present report with download link |
| `request-fixes` | event_loop (client_facing) | Present fix suggestions, collect updated spec |
Edges (17)
| Pattern | Edges | Description |
|---|---|---|
| Sequential | 6 | intake→load, load→analysis, analysis→testplan, testplan→review, aggregate→judge, report→deliver |
| Fan-out | 3 | review-test-plan → run-functional, run-resilience, run-security |
| Fan-in | 3 | run-functional, run-resilience, run-security → aggregate-results |
| Conditional routing | 2 | judge → generate-report (verdict in PASS,FAIL), judge → request-fixes (verdict == CONDITIONAL) |
| Feedback cycle | 2 | request-fixes → load-agent (re-test), request-fixes → generate-report (skip) |
| On-failure | 1 | load-agent → generate-report (graceful error handling) |
Framework Features Demonstrated
This would be the first template to demonstrate these features (all currently at 0% template coverage):
| Feature | Where in this agent | Current coverage |
|---|---|---|
| `function` node type | `load-agent`, `static-analysis`, `aggregate-results` | 0/4 templates |
| Fan-out / fan-in | 3 parallel test runners | 0/4 templates |
| `on_failure` edge | `load-agent` → `generate-report` | 0/4 templates |
| HITL `pause_nodes` | `review-test-plan` | 0/4 templates |
| Conditional routing (multi-path) | judge → PASS/FAIL vs CONDITIONAL | 0/4 templates |
| Feedback loop with `max_node_visits` | `request-fixes` → `load-agent` (max 3) | 0/4 templates |
| `nullable_output_keys` | `test_preferences`, `load_errors`, `fix_suggestions` | 0/4 templates |
| `prompt_injection_shield` | Graph-level "warn" mode | 0/4 templates |
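For concreteness, the sketch below shows how these features might be declared in the agent spec, written out as a Python dict. Every field name (`pause_nodes`, `max_node_visits`, `nullable_output_keys`, `on_failure`, `prompt_injection_shield`) mirrors a feature name from the table but is an assumption about the schema, not the authoritative Hive format:

```python
# Illustrative fragment of the agent spec as a Python dict. Every field name
# below is an assumption based on the feature names above, not the real schema.
AGENT_SPEC_SKETCH = {
    "graph": {
        "prompt_injection_shield": "warn",           # graph-level shield, warn mode
        "pause_nodes": ["review-test-plan"],         # HITL approval gate
        "max_node_visits": {"load-agent": 3},        # caps the fix/re-test cycle
        "nullable_output_keys": ["test_preferences", "load_errors", "fix_suggestions"],
        "nodes": [
            {"id": "load-agent", "type": "function"},
            {"id": "static-analysis", "type": "function"},
            {"id": "aggregate-results", "type": "function"},
            # ... event_loop nodes omitted for brevity ...
        ],
        "edges": [
            # conditional routing out of judge-quality
            {"from": "judge-quality", "to": "generate-report",
             "condition": "verdict in ['PASS', 'FAIL']"},
            {"from": "judge-quality", "to": "request-fixes",
             "condition": "verdict == 'CONDITIONAL'"},
            # graceful degradation if the target spec cannot be loaded
            {"from": "load-agent", "to": "generate-report", "on_failure": True},
        ],
    }
}
```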
What Works Today vs What Needs Framework Additions
Works Today (no changes needed)
- Static analysis of agent specs — function nodes can parse agent.json, validate graph topology, detect patterns, and check edge consistency (see the sketch after this list)
- LLM-powered test plan generation — event_loop nodes can reason about agent specs and generate test scenarios
- Spec-level security auditing — check for missing prompt_injection_shield, overly broad tool access, missing HITL gates
- Fan-out/fan-in execution — 3 parallel test runners converging to aggregator
- HITL approval gates — user reviews test plan before execution
- Feedback loop — fix/re-test cycle with max 3 iterations
- Quality verdicting and HTML reporting — judge + report generation
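As a sanity check on the works-today claim, here is a minimal sketch of the kind of structural checks the static-analysis function node can run over a parsed spec with no LLM calls at all. The assumed spec shape (`nodes`/`edges`/`entry` keys) is illustrative, not the actual Hive schema:

```python
from collections import deque
from typing import Any

def static_analysis(spec: dict[str, Any]) -> dict[str, Any]:
    """Structural checks over a parsed agent spec (assumed shape:
    {"nodes": [{"id": ...}], "edges": [{"from": ..., "to": ...}], "entry": ...})."""
    node_ids = {n["id"] for n in spec.get("nodes", [])}
    edges = spec.get("edges", [])
    issues: list[str] = []

    # Edge consistency: every edge must reference declared nodes.
    for e in edges:
        for endpoint in (e["from"], e["to"]):
            if endpoint not in node_ids:
                issues.append(f"edge references unknown node '{endpoint}'")

    # Reachability: BFS from the entry node to find unreachable nodes.
    adjacency: dict[str, list[str]] = {n: [] for n in node_ids}
    for e in edges:
        if e["from"] in adjacency:
            adjacency[e["from"]].append(e["to"])
    entry = spec.get("entry") or (spec["nodes"][0]["id"] if spec.get("nodes") else None)
    seen: set[str] = set()
    queue = deque([entry] if entry else [])
    while queue:
        current = queue.popleft()
        if current in seen or current not in adjacency:
            continue
        seen.add(current)
        queue.extend(adjacency[current])
    issues.extend(f"node '{n}' is unreachable from entry" for n in sorted(node_ids - seen))

    return {"node_count": len(node_ids), "edge_count": len(edges), "issues": issues}
```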
Needs 3 Framework Additions (proposals below)
To move from spec-level reasoning to actual runtime testing of target agents, 3 capabilities are needed. Each has existing code to build on:
Framework Addition Proposals
Proposal 1: Sub-Graph Execution Node
What: A node type that executes another agent's graph as a child process within the current execution.
Why it matters: The test runners need to actually run the target agent to test it, not just reason about its JSON spec.
Existing foundation: This is already partially designed:
- `ActionType.SUB_GRAPH = "sub_graph"` exists in `core/framework/graph/plan.py:25`
- `WorkerNode._execute_sub_graph()` exists in `core/framework/graph/worker_node.py:479-520`
- `ActionSpec.graph_id` field exists in `core/framework/graph/plan.py:122`
What's missing is wiring this into GraphExecutor:
```python
# New node type (extends existing SubGraphNode skeleton)
class SubGraphNode(NodeProtocol):
    async def execute(self, ctx: NodeContext) -> NodeResult:
        graph, goal = self._loader(ctx.node_spec.sub_graph_path)
        child_executor = GraphExecutor(
            runtime=self._runtime,
            llm=ctx.llm,
            tools=self._tools,
            tool_executor=self._tool_executor,
        )
        result = await child_executor.execute(graph=graph, goal=goal, input_data=ctx.input_data)
        return NodeResult(success=result.success, output=result.output, tokens_used=result.total_tokens)
```

Key insertion points: `executor.py` `_get_node_implementation()`, a new `SubGraphNode` class in `node.py`, and adding `"sub_graph"` to `VALID_NODE_TYPES`.
Edge cases to handle: Recursive depth control, token budget sharing, memory isolation between parent/child.
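As one possible shape for that handling, a small sketch of a pre-spawn guard. `SubGraphLimits` and both of its parameters are hypothetical and do not exist in `GraphExecutor` today:

```python
from dataclasses import dataclass

# Hypothetical guard values; neither parameter exists in the framework today.
@dataclass
class SubGraphLimits:
    max_depth: int = 3            # cap on nested sub-graph executions
    token_budget: int = 50_000    # tokens a child run may draw from the parent's budget

def check_sub_graph_limits(current_depth: int, tokens_spent: int, limits: SubGraphLimits) -> None:
    """Would be called before spawning a child executor in SubGraphNode.execute()."""
    if current_depth >= limits.max_depth:
        raise RuntimeError(f"sub-graph nesting exceeded max_depth={limits.max_depth}")
    if tokens_spent >= limits.token_budget:
        raise RuntimeError(f"sub-graph token budget of {limits.token_budget} exhausted")
```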
Proposal 2: Tool Interception / Mocking
What: A middleware layer in ToolRegistry that intercepts tool calls to inject failures, mock responses, or modify inputs.
Why it matters: Resilience testing needs to simulate tool failures (timeouts, rate limits, connection errors) without actually breaking things.
Existing foundation: ToolRegistry.get_executor() (tool_registry.py:222-254) is the single dispatch point for all tool calls. Adding an interceptor chain here is clean:
```python
class ToolInterceptor(Protocol):
    def intercept(self, tool_use: ToolUse) -> ToolInterception: ...

@dataclass
class ToolInterception:
    intercept: bool = False
    mock_result: ToolResult | None = None
    inject_error: str | None = None

# In ToolRegistry.get_executor():
def executor(tool_use: ToolUse) -> ToolResult:
    for interceptor in self._interceptors:
        interception = interceptor.intercept(tool_use)
        if interception.intercept:
            if interception.inject_error:
                return ToolResult(content=json.dumps({"error": interception.inject_error}), is_error=True)
            return interception.mock_result
    # ... existing dispatch ...
```

Key insertion point: `tool_registry.py`, after the `_tools` dict initialization.
Bonus: This pattern complements the existing PromptInjectionShield (post-execution scan) by adding pre-execution interception.
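To show how a resilience runner could exercise the proposed interface, here is a hypothetical interceptor that fails a specific tool a fixed number of times before letting real calls through. `FailTimeoutInterceptor`, the `tool_use.name` attribute, and the `register_interceptor()` call are all assumptions layered on the `ToolInterception` sketch above:

```python
# Hypothetical resilience-test interceptor built on the ToolInterception sketch above;
# the class, tool_use.name, and register_interceptor() are assumptions, not existing APIs.
class FailTimeoutInterceptor:
    def __init__(self, target_tool: str, fail_times: int = 1):
        self.target_tool = target_tool
        self.remaining = fail_times

    def intercept(self, tool_use) -> ToolInterception:
        if tool_use.name == self.target_tool and self.remaining > 0:
            self.remaining -= 1
            # Simulate a timeout; the executor sketch turns this into an error ToolResult.
            return ToolInterception(intercept=True, inject_error="Request timed out after 30s")
        return ToolInterception()  # fall through to the real tool

# Usage in the run-resilience node (hypothetical registration API):
# registry.register_interceptor(FailTimeoutInterceptor("web_search", fail_times=2))
```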
Proposal 3: Execution Snapshot & Comparison
What: Capture a structured recording of every node execution during a graph run, and diff two recordings.
Why it matters: Functional testing needs to compare "expected execution" against "actual execution" — same path? Same outputs? Same quality?
Existing foundation: ExecutionResult (executor.py:42-68) already captures path, node_visit_counts, execution_quality, retry_details. Extending it:
```python
@dataclass
class NodeSnapshot:
    node_id: str
    success: bool
    output: dict[str, Any]
    tokens_used: int
    latency_ms: int

@dataclass
class SnapshotDiff:
    path_matches: bool
    output_matches: bool
    node_diffs: list[NodeDiff]
    quality_change: str  # "same", "improved", "degraded"

    @property
    def is_regression(self) -> bool:
        return not self.output_matches or self.quality_change == "degraded"
```

Key insertion point: `executor.py` main execution loop (line ~579, after `result = await node_impl.execute(ctx)`), plus a new `execution_snapshot.py` module.
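To illustrate how two recordings could be compared, a minimal sketch of the diffing step built on the dataclasses above. `NodeDiff`'s fields and the order-based matching rule are assumptions, not part of the proposal's API:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class NodeDiff:
    """Per-node difference between expected and actual output (assumed shape)."""
    node_id: str
    expected: dict[str, Any]
    actual: dict[str, Any]

def diff_snapshots(expected: list[NodeSnapshot], actual: list[NodeSnapshot]) -> SnapshotDiff:
    """Compare two recordings node by node; positional matching is a simplification."""
    path_matches = [s.node_id for s in expected] == [s.node_id for s in actual]
    node_diffs = [
        NodeDiff(node_id=exp.node_id, expected=exp.output, actual=act.output)
        for exp, act in zip(expected, actual)
        if exp.output != act.output
    ]
    return SnapshotDiff(
        path_matches=path_matches,
        output_matches=not node_diffs,
        node_diffs=node_diffs,
        quality_change="same",  # a real diff would compare execution_quality between runs
    )
```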
Non-Overlap Verification
| Existing Proposal | Overlap? | Why Not |
|---|---|---|
| #4050 Support Debugger | ❌ | Debugs customer issues, not agent specs |
| #1851 Code Review Agent | ❌ | Reviews source code quality, not agent graphs |
| #4224 Vulnerability Auditor | ❌ | Scans project dependencies for CVEs, not agent architecture |
| #3803 Agent Unit Testing Harness | Partial | #3803 proposes a CI test runner; this proposes an agent-based QA pipeline with HITL. Different execution model, complementary goals |
Expected Behavior
Input: Path to any agent.json or agent.py
Output:
- Static analysis report (topology, patterns, issues)
- Test plan (approved by user via HITL)
- Test results (functional + resilience + security)
- Quality verdict (PASS / CONDITIONAL / FAIL with score 0-100)
- HTML report with detailed findings
- If CONDITIONAL: fix suggestions + re-test cycle (max 3x)
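The score-to-verdict mapping is not pinned down in this proposal; one illustrative possibility, with thresholds that are purely assumed (chosen to be consistent with the CONDITIONAL verdict at score 62 in the example below):

```python
# Illustrative only: the 80/50 thresholds are assumptions, not part of the proposal.
def verdict_from_score(score: int) -> str:
    if score >= 80:
        return "PASS"
    if score >= 50:
        return "CONDITIONAL"
    return "FAIL"
```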
Example interaction:
```
User: Test my agent at examples/templates/tech_news_reporter/agent.json

Agent: I'll analyze your Tech News Reporter agent.
[load-agent: Parsed 3 nodes, 2 edges]
[static-analysis: Linear pipeline, no error recovery, no HITL gates]

Here's the test plan (12 tests across 3 categories):
- Functional (5): output key validation, edge routing, goal criteria coverage...
- Resilience (4): tool failure handling, missing web_search results, retry behavior...
- Security (3): prompt injection via scraped content, data exposure in reports...

Approve this plan? [HITL pause — user reviews and approves]

[fan-out: running 3 test categories in parallel...]
[fan-in: aggregating results...]

Verdict: CONDITIONAL (score: 62/100)
- ✅ Functional: 5/5 passed
- ⚠️ Resilience: 2/4 passed — no on_failure edges, no retry_on configuration
- ⚠️ Security: 1/3 passed — no prompt_injection_shield, web_scrape results unscanned

Fix suggestions:
1. Add on_failure edge from research → compile-report for graceful degradation
2. Add prompt_injection_shield: "warn" to graph spec
3. Add retry_on: ["ToolExecutionError"] to research node

Apply fixes and re-test? [HITL — user provides updated spec]
[feedback cycle: reload → re-analyze → re-test...]
```
Scope & Implementation Path
Phase 1 (works today): Static analysis + spec-level LLM reasoning for all 3 test categories. Demonstrates all graph patterns. Useful as-is for structural validation.
Phase 2 (needs Proposal 1): Sub-graph execution enables actual runtime testing of target agents.
Phase 3 (needs Proposals 2+3): Tool interception enables resilience testing. Snapshot comparison enables regression testing.
Each phase is independently valuable and can be implemented incrementally.