feat: add behavioral tests and EvalHub integration for agentic_rag agent#102
Conversation
📝 WalkthroughWalkthroughAdded a comprehensive behavioral test suite for the LangGraph Agentic RAG agent. The PR introduces pytest configuration, shared test fixtures with golden query datasets, four test modules validating tool usage, latency, reliability, and response quality, plus full EvalHub container and e2e test integration. ChangesAgentic RAG Behavioral Testing
🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
agents/langgraph/agentic_rag/tests/behavioral/conftest.py (1)
120-130: ⚡ Quick win
streamis accepted but silently ignored inrun_eval.This can hide test intent bugs when callers pass
stream=True. If the interface must stay compatible, make the override explicit.Suggested guard
async def _run( query: str, expected_tools: list[str] | None = None, timeout_seconds: float = 30.0, max_tokens_budget: int | None = None, model: str | None = None, stream: bool = False, ) -> TaskResult: + if stream: + warnings.warn( + "agentic_rag run_eval forces stream=False; ignoring stream=True", + stacklevel=2, + ) config = TaskConfig( agent_url=agent_url, query=query, expected_tools=expected_tools, timeout_seconds=timeout_seconds, max_tokens_budget=max_tokens_budget, model=model, stream=False, )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@agents/langgraph/agentic_rag/tests/behavioral/conftest.py` around lines 120 - 130, The run_eval function accepts a stream parameter but currently ignores it by hardcoding stream=False in the TaskConfig; update the TaskConfig instantiation in conftest.py (the call that builds TaskConfig) to use the passed stream variable (stream=stream) or, if you must preserve a default override, make that explicit by asserting or logging the override when stream is True so callers are not silently ignored; locate the TaskConfig creation near the function that builds task results (the run_eval/task runner in this file) and replace the hardcoded value with the passed-in variable or add an explicit guard/notice.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@agents/langgraph/agentic_rag/tests/behavioral/conftest.py`:
- Around line 24-30: RETRIEVER_EVIDENCE contains overly generic tokens (e.g.,
"information", "relevant") that cause heuristic false positives; update the list
used by tests (RETRIEVER_EVIDENCE) to remove or replace generic terms with more
specific gating phrases such as "retrieved from knowledge base", "source:",
"document id:", "Retrieved document:", "evidence:", or other explicit retrieval
markers so the heuristic only matches clear retrieval responses rather than
common language.
In `@agents/langgraph/agentic_rag/tests/behavioral/test_reliability.py`:
- Around line 36-37: Validate the retrieved pass_at_k value (k =
agentic_rag_thresholds.get("pass_at_k", 8)) before any loop or division that
computes pass rates: add a guard that raises or skips the test if k <= 0 to
avoid division by zero or invalid rates, and apply the same guard near other
occurrences where pass_at_k is read (the other test blocks that use k to compute
pass fractions). Ensure the guard runs once per test before entering loops or
performing divisions so subsequent code that uses k can assume k > 0.
---
Nitpick comments:
In `@agents/langgraph/agentic_rag/tests/behavioral/conftest.py`:
- Around line 120-130: The run_eval function accepts a stream parameter but
currently ignores it by hardcoding stream=False in the TaskConfig; update the
TaskConfig instantiation in conftest.py (the call that builds TaskConfig) to use
the passed stream variable (stream=stream) or, if you must preserve a default
override, make that explicit by asserting or logging the override when stream is
True so callers are not silently ignored; locate the TaskConfig creation near
the function that builds task results (the run_eval/task runner in this file)
and replace the hardcoded value with the passed-in variable or add an explicit
guard/notice.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 66159ad6-ea1a-4dcd-b598-3cbfc5480a66
📒 Files selected for processing (18)
.gitignoreREADME.mdagents/langgraph/agentic_rag/README.mdagents/langgraph/agentic_rag/evalhub/tool_use.yamlagents/langgraph/agentic_rag/tests/behavioral/conftest.pyagents/langgraph/agentic_rag/tests/behavioral/fixtures/golden_queries.yamlagents/langgraph/agentic_rag/tests/behavioral/test_cost_latency.pyagents/langgraph/agentic_rag/tests/behavioral/test_reliability.pyagents/langgraph/agentic_rag/tests/behavioral/test_response_quality.pyagents/langgraph/agentic_rag/tests/behavioral/test_tool_usage.pydocs/adding-behavioral-tests.mddocs/adding-evalhub-agent-integration.mdevals/evalhub_adapter/Containerfileevals/evalhub_adapter/README.mdevals/evalhub_adapter/tests/run-e2e.shpyproject.tomltests/behavioral/configs/thresholds.yamltests/behavioral/conftest.py
Replace generic terms like "information" and "relevant" with multi-word phrases that only match actual retrieval output. Addresses CodeRabbit review comment on PR #102. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Missing evidence of API contract and adversarial test execution The Jira acceptance criterion #4 requires "API contract (7) and adversarial (4) tests pass," but the PR only documents the 11 agent-specific behavioral tests. The validation section mentions "Cross-agent consistency 13/13" which may cover this, but it's ambiguous — could you confirm the shared Assisted by Claude Opus 4.6 (1M context) |
|
Test plan checkboxes unchecked The three test plan checkboxes are unchecked, but the validation section above reports all gates passed. If the tests were run, please tick the boxes. Assisted by Claude Opus 4.6 (1M context) |
mpk-droid
left a comment
There was a problem hiding this comment.
overall looks good. added few comments.
|
Hey @andrewdonheiser - a repo-level ruleset is added that now requires Unit Tests and lint checks to pass before merge, plus approval from the agentic-starter-kits-maintainers team. This PR is currently blocked because the Unit Tests check hasn't run on it. A rebase onto main should pick up the updated workflow and trigger the required checks. Please rebase when you get a chance |
Add pytest behavioral test suite (tool usage, response quality, cost/latency, reliability) with MLflow trace enrichment and EvalHub fixture for the LangGraph agentic_rag agent. Update shared configs, Containerfile, run-e2e.sh, and documentation. Ref: RHAIENG-4223 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… tests Fixes from PR review: - Add agentic_rag to root conftest _AGENT_URL_MAP and report header - Remove duplicated _load_golden, import load_golden from conftest - Add test_response_completeness (parametrized, +4 test cases) - Add used_fallback tracking and warning in test_reliability - Sync evalhub/tool_use.yaml with golden_queries.yaml - Centralize RETRIEVER_EVIDENCE in conftest, use in greeting test - Update run-e2e.sh header comment (four -> five agents) - Keep stream parameter in run_eval for interface consistency (RHAIENG-5146) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace generic terms like "information" and "relevant" with multi-word phrases that only match actual retrieval output. Addresses CodeRabbit review comment on PR #102. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… test Address PR review feedback: - Replace generic RETRIEVER_EVIDENCE terms with domain-specific terms from the agent's knowledge base (langchain, langgraph, milvus, etc.) to avoid false positives matching non-retrieval responses - Add rejected_elements to adversarial golden query and a dedicated test_adversarial_prompt_injection_resistance test to verify the agent doesn't leak system prompt content Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1afcbb2 to
6f789a8
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@agents/langgraph/agentic_rag/tests/behavioral/test_tool_usage.py`:
- Around line 62-73: The tests currently allow missing telemetry by warning when
result.tool_calls is absent; change this to a hard failure so missing
MLflow-enriched tool_calls fails the test: replace the warnings.warn branch with
an assertion or raise (e.g., assert False or raise AssertionError) that mentions
result and golden["expected_tools"], and do the same fix for the other
occurrences around lines 85-87, 102-103, and 143-147; keep the
score_tool_selection(result, golden["expected_tools"]) flow intact when
result.tool_calls exists so only the absence of result.tool_calls triggers the
failure.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 5b8d3ecc-1459-4a80-b6b0-6a2477879eab
📒 Files selected for processing (18)
.gitignoreREADME.mdagents/langgraph/agentic_rag/README.mdagents/langgraph/agentic_rag/evalhub/tool_use.yamlagents/langgraph/agentic_rag/tests/behavioral/conftest.pyagents/langgraph/agentic_rag/tests/behavioral/fixtures/golden_queries.yamlagents/langgraph/agentic_rag/tests/behavioral/test_cost_latency.pyagents/langgraph/agentic_rag/tests/behavioral/test_reliability.pyagents/langgraph/agentic_rag/tests/behavioral/test_response_quality.pyagents/langgraph/agentic_rag/tests/behavioral/test_tool_usage.pydocs/adding-behavioral-tests.mddocs/adding-evalhub-agent-integration.mdevals/evalhub_adapter/Containerfileevals/evalhub_adapter/README.mdevals/evalhub_adapter/tests/run-e2e.shpyproject.tomltests/behavioral/configs/thresholds.yamltests/behavioral/conftest.py
✅ Files skipped from review due to trivial changes (7)
- README.md
- docs/adding-evalhub-agent-integration.md
- agents/langgraph/agentic_rag/tests/behavioral/fixtures/golden_queries.yaml
- .gitignore
- agents/langgraph/agentic_rag/README.md
- docs/adding-behavioral-tests.md
- evals/evalhub_adapter/README.md
🚧 Files skipped from review as they are similar to previous changes (10)
- agents/langgraph/agentic_rag/evalhub/tool_use.yaml
- pyproject.toml
- evals/evalhub_adapter/Containerfile
- evals/evalhub_adapter/tests/run-e2e.sh
- agents/langgraph/agentic_rag/tests/behavioral/test_response_quality.py
- tests/behavioral/configs/thresholds.yaml
- agents/langgraph/agentic_rag/tests/behavioral/test_reliability.py
- tests/behavioral/conftest.py
- agents/langgraph/agentic_rag/tests/behavioral/test_cost_latency.py
- agents/langgraph/agentic_rag/tests/behavioral/conftest.py
Yes they were run. I was not even looking at these checkbox items I was treating them as claude bloat. In future tickets I will make sure these are clarified |
Summary
langgraph/agentic_ragagent: tool usage, response quality, cost/latency, and reliability tests with MLflow trace enrichmentevalhub/tool_use.yaml) and update Containerfile,run-e2e.shfor orchestrated evaluationJira
RHAIENG-4223 — Eval coverage: Agentic RAG — deploy agent and validate test suite
Validation (Phase 11 — all gates passed)
Test plan
pytest agents/langgraph/agentic_rag/tests/behavioral/ --collect-only— 11 tests collectedevals/evalhub_adapter/tests/run-e2e.sh🤖 Generated with Claude Code