feat: add behavioral tests and EvalHub integration for agentic_rag agent #102

Merged
andrewdonheiser merged 5 commits into main from RHAIENG-4223-eval-coverage-agentic-rag-deploy-agent-and-validate-test-suite
May 19, 2026

Conversation

@andrewdonheiser
Contributor

@andrewdonheiser andrewdonheiser commented May 14, 2026

Summary

  • Add pytest behavioral test suite for langgraph/agentic_rag agent: tool usage, response quality, cost/latency, and reliability tests with MLflow trace enrichment
  • Add EvalHub fixture (evalhub/tool_use.yaml) and update Containerfile, run-e2e.sh for orchestrated evaluation
  • Update shared configs (thresholds, pyproject markers), documentation, and agent README

Jira

RHAIENG-4223 — Eval coverage: Agentic RAG — deploy agent and validate test suite

Validation (Phase 11 — all gates passed)

  • 11a: 11/11 pytest tests passed (188.69s) with MLflow enrichment active
  • 11b: HARD GATE passed — tool_calls populated from MLflow traces (F1 scoring, not content heuristics)
  • 11c: MLflow trace structure verified — TOOL, CHAT_MODEL, CHAIN spans confirmed
  • 11d: Agent pod logs clean
  • 11e: EvalHub E2E completed — all scores 1.0 (tool_selection, tool_sequence, hallucinated_tools, tool_call_validity)
  • 11f: Cross-agent consistency 13/13 (2 pre-existing deviations in other agents tracked in RHAIENG-5126 and RHAIENG-5127)
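
Gate 11b distinguishes F1 scoring over traced tool calls from content heuristics. As a rough illustration only, a set-based F1 over tool names could look like the sketch below. The function name `tool_selection_f1` and the tool names are hypothetical, and the repository's actual scorer may weight or order calls differently.

```python
def tool_selection_f1(called: list[str], expected: list[str]) -> float:
    """Set-based F1 between called and expected tool names (illustrative only)."""
    called_set, expected_set = set(called), set(expected)
    if not called_set and not expected_set:
        # Nothing expected and nothing called counts as full agreement.
        return 1.0
    true_positives = len(called_set & expected_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(called_set)
    recall = true_positives / len(expected_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tool names: one correct call plus one unexpected call.
score = tool_selection_f1(["retriever", "web_search"], ["retriever"])
```

An extra, unexpected call lowers precision while recall stays at 1.0, so the score drops below 1.0 without reaching zero.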

Test plan

  • [x] pytest agents/langgraph/agentic_rag/tests/behavioral/ --collect-only — 11 tests collected
  • [x] Run full suite against deployed agent with MLflow env vars set
  • [x] Verify EvalHub E2E via evals/evalhub_adapter/tests/run-e2e.sh

🤖 Generated with Claude Code

@coderabbitai
Contributor

coderabbitai Bot commented May 14, 2026

Walkthrough

Added a comprehensive behavioral test suite for the LangGraph Agentic RAG agent. The PR introduces pytest configuration, shared test fixtures with golden query datasets, four test modules validating tool usage, latency, reliability, and response quality, plus full EvalHub container and e2e test integration.

Changes

Agentic RAG Behavioral Testing

| Layer / File(s) | Summary |
|---|---|
| **Configuration and Evaluation Thresholds**<br>`pyproject.toml`, `tests/behavioral/configs/thresholds.yaml`, `.gitignore`, `tests/behavioral/conftest.py` | Registers agentic_rag pytest marker, adds thresholds for tool accuracy, response coherence, latency p95, and pass@k metrics; maps marker to AGENTIC_RAG_AGENT_URL environment variable; excludes test validation reports from version control. |
| **Test Harness and Fixtures**<br>`agents/langgraph/agentic_rag/tests/behavioral/conftest.py`, `agents/langgraph/agentic_rag/tests/behavioral/fixtures/golden_queries.yaml` | Pytest fixtures provide agent URL resolution (with localhost default), async HTTP client, threshold config loading, known tools list, and core run_eval runner that builds TaskConfig, forces stream=False, and optionally enriches results via MLflow. Golden queries fixture defines test cases across difficulty levels with expected tools and response elements. |
| **Tool Usage Validation Tests**<br>`agents/langgraph/agentic_rag/tests/behavioral/test_tool_usage.py` | Parametrized and standalone tests validate tool selection accuracy (with response-content fallback), prevent hallucinated tool calls, verify tool argument JSON validity, ensure adversarial prompts don't expose rejected elements, and confirm greeting input does not trigger tools. |
| **Latency and Reliability Tests**<br>`agents/langgraph/agentic_rag/tests/behavioral/test_cost_latency.py`, `agents/langgraph/agentic_rag/tests/behavioral/test_reliability.py` | Latency test scores against p95 threshold; pass@k tests run repeated queries and measure tool selection and response coherence consistency with computed pass rates against thresholds. |
| **Response Quality Tests**<br>`agents/langgraph/agentic_rag/tests/behavioral/test_response_quality.py` | Plan coherence and response completeness tests validate that generated responses meet expected semantic quality and contain required elements from golden query specs. |
| **Documentation and Guides**<br>`agents/langgraph/agentic_rag/README.md`, `README.md`, `docs/adding-behavioral-tests.md`, `docs/adding-evalhub-agent-integration.md` | Agent README documents test setup and pytest commands; root README lists environment variable; behavioral testing guide references conftest example; EvalHub guide lists fixture paths. |
| **EvalHub Integration and E2E Tests**<br>`agents/langgraph/agentic_rag/evalhub/tool_use.yaml`, `evals/evalhub_adapter/Containerfile`, `evals/evalhub_adapter/README.md`, `evals/evalhub_adapter/tests/run-e2e.sh` | Tool-use fixture added for EvalHub; container build extended to include agentic_rag fixtures and validate YAML; e2e script discovers agent route via env and heuristics, performs health checks, generates eval config with MLflow parameters, submits job, and reports results. |
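
The marker-to-environment-variable mapping and the localhost default described in the walkthrough can be pictured roughly as follows. This is a sketch under assumptions: only the `agentic_rag` to `AGENTIC_RAG_AGENT_URL` pairing comes from the PR, while `resolve_agent_url`, `AGENT_URL_MAP`, and the default port are hypothetical stand-ins for whatever the shared conftest actually defines.

```python
import os
from collections.abc import Mapping

# Marker name -> environment variable. Only this entry is taken from the PR text.
AGENT_URL_MAP = {"agentic_rag": "AGENTIC_RAG_AGENT_URL"}

# Hypothetical default: the PR mentions a localhost default but not the port.
DEFAULT_AGENT_URL = "http://localhost:8000"


def resolve_agent_url(marker: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return the agent URL for a pytest marker, falling back to localhost."""
    env_var = AGENT_URL_MAP[marker]
    url = environ.get(env_var, "").strip()
    # Drop a trailing slash so callers can append paths consistently.
    return (url or DEFAULT_AGENT_URL).rstrip("/")
```

Passing the environment as a parameter keeps the lookup easy to exercise in a unit test without mutating `os.environ`.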

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding behavioral tests and EvalHub integration for the agentic_rag agent. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 95.24%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description provides a comprehensive overview of the changes including test suite additions, EvalHub integration, and validation results that directly correspond to the file modifications. |

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
agents/langgraph/agentic_rag/tests/behavioral/conftest.py (1)

120-130: ⚡ Quick win

stream is accepted but silently ignored in run_eval.

This can hide test intent bugs when callers pass stream=True. If the interface must stay compatible, make the override explicit.

Suggested guard
     async def _run(
         query: str,
         expected_tools: list[str] | None = None,
         timeout_seconds: float = 30.0,
         max_tokens_budget: int | None = None,
         model: str | None = None,
         stream: bool = False,
     ) -> TaskResult:
+        if stream:
+            warnings.warn(
+                "agentic_rag run_eval forces stream=False; ignoring stream=True",
+                stacklevel=2,
+            )
         config = TaskConfig(
             agent_url=agent_url,
             query=query,
             expected_tools=expected_tools,
             timeout_seconds=timeout_seconds,
             max_tokens_budget=max_tokens_budget,
             model=model,
             stream=False,
         )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@agents/langgraph/agentic_rag/tests/behavioral/conftest.py` around lines 120 -
130, The run_eval function accepts a stream parameter but currently ignores it
by hardcoding stream=False in the TaskConfig; update the TaskConfig
instantiation in conftest.py (the call that builds TaskConfig) to use the passed
stream variable (stream=stream) or, if you must preserve a default override,
make that explicit by asserting or logging the override when stream is True so
callers are not silently ignored; locate the TaskConfig creation near the
function that builds task results (the run_eval/task runner in this file) and
replace the hardcoded value with the passed-in variable or add an explicit
guard/notice.
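
If the explicit-override route is taken, the behaviour can be pinned with a small check. The sketch below is self-contained and uses a stand-in `TaskConfig` dataclass and a hypothetical `build_config` helper rather than the project's real types, so it illustrates the pattern only.

```python
import warnings
from dataclasses import dataclass


@dataclass
class TaskConfig:
    """Stand-in for the project's TaskConfig; only the fields needed here."""
    query: str
    stream: bool = False


def build_config(query: str, stream: bool = False) -> TaskConfig:
    """Build a config that always disables streaming, warning on an override."""
    if stream:
        warnings.warn(
            "run_eval forces stream=False; ignoring stream=True",
            stacklevel=2,
        )
    return TaskConfig(query=query, stream=False)


# Record warnings so the override can be asserted on instead of printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    config = build_config("What is LangGraph?", stream=True)
```

After this runs, `config.stream` is `False` and `caught` holds the single override warning, which a test can assert on.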
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@agents/langgraph/agentic_rag/tests/behavioral/conftest.py`:
- Around line 24-30: RETRIEVER_EVIDENCE contains overly generic tokens (e.g.,
"information", "relevant") that cause heuristic false positives; update the list
used by tests (RETRIEVER_EVIDENCE) to remove or replace generic terms with more
specific gating phrases such as "retrieved from knowledge base", "source:",
"document id:", "Retrieved document:", "evidence:", or other explicit retrieval
markers so the heuristic only matches clear retrieval responses rather than
common language.
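
To make the false-positive risk concrete, here is a minimal sketch of a substring heuristic run with a generic list and a specific list. The helper `looks_like_retrieval` and both sample responses are invented for illustration; the marker phrases are lowercased versions of the ones suggested above.

```python
GENERIC_EVIDENCE = ["information", "relevant"]
SPECIFIC_EVIDENCE = ["retrieved from knowledge base", "source:", "document id:"]


def looks_like_retrieval(response: str, evidence: list[str]) -> bool:
    """Case-insensitive substring match against a list of evidence markers."""
    text = response.lower()
    return any(marker in text for marker in evidence)


# Invented sample responses: a plain greeting and a reply citing a document.
greeting = "Hello! Let me know if you need any information."
retrieval = "Source: doc-17. Retrieved from knowledge base: LangGraph builds stateful agents."

generic_hits_greeting = looks_like_retrieval(greeting, GENERIC_EVIDENCE)
specific_hits_greeting = looks_like_retrieval(greeting, SPECIFIC_EVIDENCE)
specific_hits_retrieval = looks_like_retrieval(retrieval, SPECIFIC_EVIDENCE)
```

With the generic list, the greeting is misclassified as retrieval output because it contains the word "information"; the specific list matches only the response that carries explicit retrieval markers.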

In `@agents/langgraph/agentic_rag/tests/behavioral/test_reliability.py`:
- Around line 36-37: Validate the retrieved pass_at_k value (k =
agentic_rag_thresholds.get("pass_at_k", 8)) before any loop or division that
computes pass rates: add a guard that raises or skips the test if k <= 0 to
avoid division by zero or invalid rates, and apply the same guard near other
occurrences where pass_at_k is read (the other test blocks that use k to compute
pass fractions). Ensure the guard runs once per test before entering loops or
performing divisions so subsequent code that uses k can assume k > 0.
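
A guard of this kind might look like the sketch below. The `pass_rate` helper and the `thresholds` dictionary are hypothetical; only the `pass_at_k` key and its default of 8 come from the comment above.

```python
def pass_rate(outcomes: list[bool], k: int) -> float:
    """Fraction of passing runs out of k, validating k before dividing."""
    if k <= 0:
        raise ValueError(f"pass_at_k must be a positive integer, got {k}")
    if len(outcomes) != k:
        raise ValueError(f"expected {k} outcomes, got {len(outcomes)}")
    return sum(outcomes) / k


# Hypothetical thresholds mapping mirroring the lookup quoted in the review.
thresholds = {"pass_at_k": 8}
k = thresholds.get("pass_at_k", 8)
rate = pass_rate([True] * 6 + [False] * 2, k)
```

Raising early produces a configuration error with a clear message instead of a `ZeroDivisionError` deep inside the test loop. Inside a pytest test, `pytest.skip` would be an alternative if a misconfigured threshold should not count as a failure.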

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 66159ad6-ea1a-4dcd-b598-3cbfc5480a66

📥 Commits

Reviewing files that changed from the base of the PR and between a327361 and 0848a38.

📒 Files selected for processing (18)
  • .gitignore
  • README.md
  • agents/langgraph/agentic_rag/README.md
  • agents/langgraph/agentic_rag/evalhub/tool_use.yaml
  • agents/langgraph/agentic_rag/tests/behavioral/conftest.py
  • agents/langgraph/agentic_rag/tests/behavioral/fixtures/golden_queries.yaml
  • agents/langgraph/agentic_rag/tests/behavioral/test_cost_latency.py
  • agents/langgraph/agentic_rag/tests/behavioral/test_reliability.py
  • agents/langgraph/agentic_rag/tests/behavioral/test_response_quality.py
  • agents/langgraph/agentic_rag/tests/behavioral/test_tool_usage.py
  • docs/adding-behavioral-tests.md
  • docs/adding-evalhub-agent-integration.md
  • evals/evalhub_adapter/Containerfile
  • evals/evalhub_adapter/README.md
  • evals/evalhub_adapter/tests/run-e2e.sh
  • pyproject.toml
  • tests/behavioral/configs/thresholds.yaml
  • tests/behavioral/conftest.py

Comment thread agents/langgraph/agentic_rag/tests/behavioral/conftest.py
Comment thread agents/langgraph/agentic_rag/tests/behavioral/test_reliability.py
andrewdonheiser added a commit that referenced this pull request May 15, 2026
Replace generic terms like "information" and "relevant" with
multi-word phrases that only match actual retrieval output.
Addresses CodeRabbit review comment on PR #102.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mpk-droid
Contributor

Missing evidence of API contract and adversarial test execution

The Jira acceptance criterion #4 requires "API contract (7) and adversarial (4) tests pass," but the PR only documents the 11 agent-specific behavioral tests. The validation section mentions "Cross-agent consistency 13/13" which may cover this, but it's ambiguous — could you confirm the shared tests/behavioral/ API contract and adversarial tests were run against the deployed agentic_rag agent?

Assisted by Claude Opus 4.6 (1M context)

@mpk-droid
Contributor

Test plan checkboxes unchecked

The three test plan checkboxes are unchecked, but the validation section above reports all gates passed. If the tests were run, please tick the boxes.

Assisted by Claude Opus 4.6 (1M context)

Comment thread agents/langgraph/agentic_rag/tests/behavioral/conftest.py
Contributor

@mpk-droid mpk-droid left a comment

Overall looks good. Added a few comments.

@tarun-etikala
Contributor

tarun-etikala commented May 18, 2026

Hey @andrewdonheiser - a repo-level ruleset has been added that requires the Unit Tests and lint checks to pass before merge, plus approval from the agentic-starter-kits-maintainers team.

This PR is currently blocked because the Unit Tests check hasn't run on it. A rebase onto main should pick up the updated workflow and trigger the required checks. Please rebase when you get a chance.

andrewdonheiser and others added 5 commits May 19, 2026 08:15
Add pytest behavioral test suite (tool usage, response quality, cost/latency,
reliability) with MLflow trace enrichment and EvalHub fixture for the
LangGraph agentic_rag agent. Update shared configs, Containerfile, run-e2e.sh,
and documentation.

Ref: RHAIENG-4223

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… tests

Fixes from PR review:
- Add agentic_rag to root conftest _AGENT_URL_MAP and report header
- Remove duplicated _load_golden, import load_golden from conftest
- Add test_response_completeness (parametrized, +4 test cases)
- Add used_fallback tracking and warning in test_reliability
- Sync evalhub/tool_use.yaml with golden_queries.yaml
- Centralize RETRIEVER_EVIDENCE in conftest, use in greeting test
- Update run-e2e.sh header comment (four -> five agents)
- Keep stream parameter in run_eval for interface consistency (RHAIENG-5146)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace generic terms like "information" and "relevant" with
multi-word phrases that only match actual retrieval output.
Addresses CodeRabbit review comment on PR #102.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… test

Address PR review feedback:
- Replace generic RETRIEVER_EVIDENCE terms with domain-specific terms
  from the agent's knowledge base (langchain, langgraph, milvus, etc.)
  to avoid false positives matching non-retrieval responses
- Add rejected_elements to adversarial golden query and a dedicated
  test_adversarial_prompt_injection_resistance test to verify the agent
  doesn't leak system prompt content

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@andrewdonheiser andrewdonheiser force-pushed the RHAIENG-4223-eval-coverage-agentic-rag-deploy-agent-and-validate-test-suite branch from 1afcbb2 to 6f789a8 Compare May 19, 2026 12:20
@andrewdonheiser andrewdonheiser requested a review from a team as a code owner May 19, 2026 12:20
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@agents/langgraph/agentic_rag/tests/behavioral/test_tool_usage.py`:
- Around line 62-73: The tests currently allow missing telemetry by warning when
result.tool_calls is absent; change this to a hard failure so missing
MLflow-enriched tool_calls fails the test: replace the warnings.warn branch with
an assertion or raise (e.g., assert False or raise AssertionError) that mentions
result and golden["expected_tools"], and do the same fix for the other
occurrences around lines 85-87, 102-103, and 143-147; keep the
score_tool_selection(result, golden["expected_tools"]) flow intact when
result.tool_calls exists so only the absence of result.tool_calls triggers the
failure.
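
The requested change from a warning to a hard failure could be sketched as below. `TaskResult` here is a stand-in dataclass and `require_tool_calls` is a hypothetical helper, not the project's API; the sketch only shows the shape of the assertion.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Stand-in for the project's TaskResult; only the fields needed here."""
    response: str
    tool_calls: list[str] = field(default_factory=list)


def require_tool_calls(result: TaskResult, expected_tools: list[str]) -> list[str]:
    """Fail loudly when trace-derived tool_calls are missing."""
    if not result.tool_calls:
        raise AssertionError(
            f"result.tool_calls is empty but expected_tools={expected_tools}; "
            "check that MLflow trace enrichment is active"
        )
    return result.tool_calls


# Hypothetical enriched result that passes the guard.
calls = require_tool_calls(TaskResult("answer", ["retriever"]), ["retriever"])
```

A case that legitimately expects no tool calls, such as the greeting test, would need a separate signal that enrichment ran, since an empty list is the correct outcome there.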
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 5b8d3ecc-1459-4a80-b6b0-6a2477879eab

📥 Commits

Reviewing files that changed from the base of the PR and between 0848a38 and 6f789a8.

📒 Files selected for processing (18)
  • .gitignore
  • README.md
  • agents/langgraph/agentic_rag/README.md
  • agents/langgraph/agentic_rag/evalhub/tool_use.yaml
  • agents/langgraph/agentic_rag/tests/behavioral/conftest.py
  • agents/langgraph/agentic_rag/tests/behavioral/fixtures/golden_queries.yaml
  • agents/langgraph/agentic_rag/tests/behavioral/test_cost_latency.py
  • agents/langgraph/agentic_rag/tests/behavioral/test_reliability.py
  • agents/langgraph/agentic_rag/tests/behavioral/test_response_quality.py
  • agents/langgraph/agentic_rag/tests/behavioral/test_tool_usage.py
  • docs/adding-behavioral-tests.md
  • docs/adding-evalhub-agent-integration.md
  • evals/evalhub_adapter/Containerfile
  • evals/evalhub_adapter/README.md
  • evals/evalhub_adapter/tests/run-e2e.sh
  • pyproject.toml
  • tests/behavioral/configs/thresholds.yaml
  • tests/behavioral/conftest.py
✅ Files skipped from review due to trivial changes (7)
  • README.md
  • docs/adding-evalhub-agent-integration.md
  • agents/langgraph/agentic_rag/tests/behavioral/fixtures/golden_queries.yaml
  • .gitignore
  • agents/langgraph/agentic_rag/README.md
  • docs/adding-behavioral-tests.md
  • evals/evalhub_adapter/README.md
🚧 Files skipped from review as they are similar to previous changes (10)
  • agents/langgraph/agentic_rag/evalhub/tool_use.yaml
  • pyproject.toml
  • evals/evalhub_adapter/Containerfile
  • evals/evalhub_adapter/tests/run-e2e.sh
  • agents/langgraph/agentic_rag/tests/behavioral/test_response_quality.py
  • tests/behavioral/configs/thresholds.yaml
  • agents/langgraph/agentic_rag/tests/behavioral/test_reliability.py
  • tests/behavioral/conftest.py
  • agents/langgraph/agentic_rag/tests/behavioral/test_cost_latency.py
  • agents/langgraph/agentic_rag/tests/behavioral/conftest.py

Comment thread agents/langgraph/agentic_rag/tests/behavioral/test_tool_usage.py
@andrewdonheiser
Contributor Author

andrewdonheiser commented May 19, 2026

> Missing evidence of API contract and adversarial test execution
>
> The Jira acceptance criterion #4 requires "API contract (7) and adversarial (4) tests pass," but the PR only documents the 11 agent-specific behavioral tests. The validation section mentions "Cross-agent consistency 13/13" which may cover this, but it's ambiguous — could you confirm the shared tests/behavioral/ API contract and adversarial tests were run against the deployed agentic_rag agent?
>
> Assisted by Claude Opus 4.6 (1M context)

Yes, they were run. I was not even looking at these checkbox items; I was treating them as Claude bloat. In future tickets I will make sure these are clarified.

Contributor

@mpk-droid mpk-droid left a comment

lgtm

@andrewdonheiser andrewdonheiser merged commit 554082b into main May 19, 2026
8 checks passed
@andrewdonheiser andrewdonheiser deleted the RHAIENG-4223-eval-coverage-agentic-rag-deploy-agent-and-validate-test-suite branch May 19, 2026 16:12
3 participants