feat: add behavioral tests and EvalHub integration for agentic_rag agent by andrewdonheiser · Pull Request #102 · red-hat-data-services/agentic-starter-kits

andrewdonheiser · 2026-05-14T21:38:17Z

Summary

Add pytest behavioral test suite for langgraph/agentic_rag agent: tool usage, response quality, cost/latency, and reliability tests with MLflow trace enrichment
Add EvalHub fixture (evalhub/tool_use.yaml) and update Containerfile, run-e2e.sh for orchestrated evaluation
Update shared configs (thresholds, pyproject markers), documentation, and agent README

Jira

RHAIENG-4223 — Eval coverage: Agentic RAG — deploy agent and validate test suite

Validation (Phase 11 — all gates passed)

11a: 11/11 pytest tests passed (188.69s) with MLflow enrichment active
11b: HARD GATE passed — tool_calls populated from MLflow traces (F1 scoring, not content heuristics)
11c: MLflow trace structure verified — TOOL, CHAT_MODEL, CHAIN spans confirmed
11d: Agent pod logs clean
11e: EvalHub E2E completed — all scores 1.0 (tool_selection, tool_sequence, hallucinated_tools, tool_call_validity)
11f: Cross-agent consistency 13/13 (2 pre-existing deviations in other agents tracked in RHAIENG-5126 and RHAIENG-5127)

Test plan

[ x] pytest agents/langgraph/agentic_rag/tests/behavioral/ --collect-only — 11 tests collected
[ x] Run full suite against deployed agent with MLflow env vars set
[ x] Verify EvalHub E2E via evals/evalhub_adapter/tests/run-e2e.sh

🤖 Generated with Claude Code

coderabbitai · 2026-05-14T21:40:56Z

📝 Walkthrough

Walkthrough

Added a comprehensive behavioral test suite for the LangGraph Agentic RAG agent. The PR introduces pytest configuration, shared test fixtures with golden query datasets, four test modules validating tool usage, latency, reliability, and response quality, plus full EvalHub container and e2e test integration.

Changes

Agentic RAG Behavioral Testing

Layer / File(s)	Summary
Configuration and Evaluation Thresholds `pyproject.toml`, `tests/behavioral/configs/thresholds.yaml`, `.gitignore`, `tests/behavioral/conftest.py`	Registers `agentic_rag` pytest marker, adds thresholds for tool accuracy, response coherence, latency p95, and pass@k metrics; maps marker to `AGENTIC_RAG_AGENT_URL` environment variable; excludes test validation reports from version control.
Test Harness and Fixtures `agents/langgraph/agentic_rag/tests/behavioral/conftest.py`, `agents/langgraph/agentic_rag/tests/behavioral/fixtures/golden_queries.yaml`	Pytest fixtures provide agent URL resolution (with localhost default), async HTTP client, threshold config loading, known tools list, and core `run_eval` runner that builds TaskConfig, forces stream=False, and optionally enriches results via MLflow. Golden queries fixture defines test cases across difficulty levels with expected tools and response elements.
Tool Usage Validation Tests `agents/langgraph/agentic_rag/tests/behavioral/test_tool_usage.py`	Parametrized and standalone tests validate tool selection accuracy (with response-content fallback), prevent hallucinated tool calls, verify tool argument JSON validity, ensure adversarial prompts don't expose rejected elements, and confirm greeting input does not trigger tools.
Latency and Reliability Tests `agents/langgraph/agentic_rag/tests/behavioral/test_cost_latency.py`, `agents/langgraph/agentic_rag/tests/behavioral/test_reliability.py`	Latency test scores against p95 threshold; pass@k tests run repeated queries and measure tool selection and response coherence consistency with computed pass rates against thresholds.
Response Quality Tests `agents/langgraph/agentic_rag/tests/behavioral/test_response_quality.py`	Plan coherence and response completeness tests validate that generated responses meet expected semantic quality and contain required elements from golden query specs.
Documentation and Guides `agents/langgraph/agentic_rag/README.md`, `README.md`, `docs/adding-behavioral-tests.md`, `docs/adding-evalhub-agent-integration.md`	Agent README documents test setup and pytest commands; root README lists environment variable; behavioral testing guide references conftest example; EvalHub guide lists fixture paths.
EvalHub Integration and E2E Tests `agents/langgraph/agentic_rag/evalhub/tool_use.yaml`, `evals/evalhub_adapter/Containerfile`, `evals/evalhub_adapter/README.md`, `evals/evalhub_adapter/tests/run-e2e.sh`	Tool-use fixture added for EvalHub; container build extended to include agentic_rag fixtures and validate YAML; e2e script discovers agent route via env and heuristics, performs health checks, generates eval config with MLflow parameters, submits job, and reports results.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: adding behavioral tests and EvalHub integration for the agentic_rag agent.
Docstring Coverage	✅ Passed	Docstring coverage is 95.24% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description provides a comprehensive overview of the changes including test suite additions, EvalHub integration, and validation results that directly correspond to the file modifications.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch RHAIENG-4223-eval-coverage-agentic-rag-deploy-agent-and-validate-test-suite

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

agents/langgraph/agentic_rag/tests/behavioral/conftest.py (1)

120-130: ⚡ Quick win

stream is accepted but silently ignored in run_eval.

This can hide test intent bugs when callers pass stream=True. If the interface must stay compatible, make the override explicit.

Suggested guard

     async def _run(
         query: str,
         expected_tools: list[str] | None = None,
         timeout_seconds: float = 30.0,
         max_tokens_budget: int | None = None,
         model: str | None = None,
         stream: bool = False,
     ) -> TaskResult:
+        if stream:
+            warnings.warn(
+                "agentic_rag run_eval forces stream=False; ignoring stream=True",
+                stacklevel=2,
+            )
         config = TaskConfig(
             agent_url=agent_url,
             query=query,
             expected_tools=expected_tools,
             timeout_seconds=timeout_seconds,
             max_tokens_budget=max_tokens_budget,
             model=model,
             stream=False,
         )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@agents/langgraph/agentic_rag/tests/behavioral/conftest.py` around lines 120 -
130, The run_eval function accepts a stream parameter but currently ignores it
by hardcoding stream=False in the TaskConfig; update the TaskConfig
instantiation in conftest.py (the call that builds TaskConfig) to use the passed
stream variable (stream=stream) or, if you must preserve a default override,
make that explicit by asserting or logging the override when stream is True so
callers are not silently ignored; locate the TaskConfig creation near the
function that builds task results (the run_eval/task runner in this file) and
replace the hardcoded value with the passed-in variable or add an explicit
guard/notice.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@agents/langgraph/agentic_rag/tests/behavioral/conftest.py`:
- Around line 24-30: RETRIEVER_EVIDENCE contains overly generic tokens (e.g.,
"information", "relevant") that cause heuristic false positives; update the list
used by tests (RETRIEVER_EVIDENCE) to remove or replace generic terms with more
specific gating phrases such as "retrieved from knowledge base", "source:",
"document id:", "Retrieved document:", "evidence:", or other explicit retrieval
markers so the heuristic only matches clear retrieval responses rather than
common language.

In `@agents/langgraph/agentic_rag/tests/behavioral/test_reliability.py`:
- Around line 36-37: Validate the retrieved pass_at_k value (k =
agentic_rag_thresholds.get("pass_at_k", 8)) before any loop or division that
computes pass rates: add a guard that raises or skips the test if k <= 0 to
avoid division by zero or invalid rates, and apply the same guard near other
occurrences where pass_at_k is read (the other test blocks that use k to compute
pass fractions). Ensure the guard runs once per test before entering loops or
performing divisions so subsequent code that uses k can assume k > 0.

---

Nitpick comments:
In `@agents/langgraph/agentic_rag/tests/behavioral/conftest.py`:
- Around line 120-130: The run_eval function accepts a stream parameter but
currently ignores it by hardcoding stream=False in the TaskConfig; update the
TaskConfig instantiation in conftest.py (the call that builds TaskConfig) to use
the passed stream variable (stream=stream) or, if you must preserve a default
override, make that explicit by asserting or logging the override when stream is
True so callers are not silently ignored; locate the TaskConfig creation near
the function that builds task results (the run_eval/task runner in this file)
and replace the hardcoded value with the passed-in variable or add an explicit
guard/notice.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 66159ad6-ea1a-4dcd-b598-3cbfc5480a66

📥 Commits

Reviewing files that changed from the base of the PR and between a327361 and 0848a38.

📒 Files selected for processing (18)

.gitignore
README.md
agents/langgraph/agentic_rag/README.md
agents/langgraph/agentic_rag/evalhub/tool_use.yaml
agents/langgraph/agentic_rag/tests/behavioral/conftest.py
agents/langgraph/agentic_rag/tests/behavioral/fixtures/golden_queries.yaml
agents/langgraph/agentic_rag/tests/behavioral/test_cost_latency.py
agents/langgraph/agentic_rag/tests/behavioral/test_reliability.py
agents/langgraph/agentic_rag/tests/behavioral/test_response_quality.py
agents/langgraph/agentic_rag/tests/behavioral/test_tool_usage.py
docs/adding-behavioral-tests.md
docs/adding-evalhub-agent-integration.md
evals/evalhub_adapter/Containerfile
evals/evalhub_adapter/README.md
evals/evalhub_adapter/tests/run-e2e.sh
pyproject.toml
tests/behavioral/configs/thresholds.yaml
tests/behavioral/conftest.py

Replace generic terms like "information" and "relevant" with multi-word phrases that only match actual retrieval output. Addresses CodeRabbit review comment on PR #102. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mpk-droid · 2026-05-18T18:15:15Z

Missing evidence of API contract and adversarial test execution

The Jira acceptance criterion #4 requires "API contract (7) and adversarial (4) tests pass," but the PR only documents the 11 agent-specific behavioral tests. The validation section mentions "Cross-agent consistency 13/13" which may cover this, but it's ambiguous — could you confirm the shared tests/behavioral/ API contract and adversarial tests were run against the deployed agentic_rag agent?

Assisted by Claude Opus 4.6 (1M context)

mpk-droid · 2026-05-18T18:15:17Z

Test plan checkboxes unchecked

The three test plan checkboxes are unchecked, but the validation section above reports all gates passed. If the tests were run, please tick the boxes.

Assisted by Claude Opus 4.6 (1M context)

mpk-droid

overall looks good. added few comments.

tarun-etikala · 2026-05-18T18:51:50Z

Hey @andrewdonheiser - a repo-level ruleset is added that now requires Unit Tests and lint checks to pass before merge, plus approval from the agentic-starter-kits-maintainers team.

This PR is currently blocked because the Unit Tests check hasn't run on it. A rebase onto main should pick up the updated workflow and trigger the required checks. Please rebase when you get a chance

Add pytest behavioral test suite (tool usage, response quality, cost/latency, reliability) with MLflow trace enrichment and EvalHub fixture for the LangGraph agentic_rag agent. Update shared configs, Containerfile, run-e2e.sh, and documentation. Ref: RHAIENG-4223 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… tests Fixes from PR review: - Add agentic_rag to root conftest _AGENT_URL_MAP and report header - Remove duplicated _load_golden, import load_golden from conftest - Add test_response_completeness (parametrized, +4 test cases) - Add used_fallback tracking and warning in test_reliability - Sync evalhub/tool_use.yaml with golden_queries.yaml - Centralize RETRIEVER_EVIDENCE in conftest, use in greeting test - Update run-e2e.sh header comment (four -> five agents) - Keep stream parameter in run_eval for interface consistency (RHAIENG-5146) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace generic terms like "information" and "relevant" with multi-word phrases that only match actual retrieval output. Addresses CodeRabbit review comment on PR #102. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… test Address PR review feedback: - Replace generic RETRIEVER_EVIDENCE terms with domain-specific terms from the agent's knowledge base (langchain, langgraph, milvus, etc.) to avoid false positives matching non-retrieval responses - Add rejected_elements to adversarial golden query and a dedicated test_adversarial_prompt_injection_resistance test to verify the agent doesn't leak system prompt content Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@agents/langgraph/agentic_rag/tests/behavioral/test_tool_usage.py`:
- Around line 62-73: The tests currently allow missing telemetry by warning when
result.tool_calls is absent; change this to a hard failure so missing
MLflow-enriched tool_calls fails the test: replace the warnings.warn branch with
an assertion or raise (e.g., assert False or raise AssertionError) that mentions
result and golden["expected_tools"], and do the same fix for the other
occurrences around lines 85-87, 102-103, and 143-147; keep the
score_tool_selection(result, golden["expected_tools"]) flow intact when
result.tool_calls exists so only the absence of result.tool_calls triggers the
failure.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 5b8d3ecc-1459-4a80-b6b0-6a2477879eab

📥 Commits

Reviewing files that changed from the base of the PR and between 0848a38 and 6f789a8.

📒 Files selected for processing (18)

.gitignore
README.md
agents/langgraph/agentic_rag/README.md
agents/langgraph/agentic_rag/evalhub/tool_use.yaml
agents/langgraph/agentic_rag/tests/behavioral/conftest.py
agents/langgraph/agentic_rag/tests/behavioral/fixtures/golden_queries.yaml
agents/langgraph/agentic_rag/tests/behavioral/test_cost_latency.py
agents/langgraph/agentic_rag/tests/behavioral/test_reliability.py
agents/langgraph/agentic_rag/tests/behavioral/test_response_quality.py
agents/langgraph/agentic_rag/tests/behavioral/test_tool_usage.py
docs/adding-behavioral-tests.md
docs/adding-evalhub-agent-integration.md
evals/evalhub_adapter/Containerfile
evals/evalhub_adapter/README.md
evals/evalhub_adapter/tests/run-e2e.sh
pyproject.toml
tests/behavioral/configs/thresholds.yaml
tests/behavioral/conftest.py

✅ Files skipped from review due to trivial changes (7)

README.md
docs/adding-evalhub-agent-integration.md
agents/langgraph/agentic_rag/tests/behavioral/fixtures/golden_queries.yaml
.gitignore
agents/langgraph/agentic_rag/README.md
docs/adding-behavioral-tests.md
evals/evalhub_adapter/README.md

🚧 Files skipped from review as they are similar to previous changes (10)

agents/langgraph/agentic_rag/evalhub/tool_use.yaml
pyproject.toml
evals/evalhub_adapter/Containerfile
evals/evalhub_adapter/tests/run-e2e.sh
agents/langgraph/agentic_rag/tests/behavioral/test_response_quality.py
tests/behavioral/configs/thresholds.yaml
agents/langgraph/agentic_rag/tests/behavioral/test_reliability.py
tests/behavioral/conftest.py
agents/langgraph/agentic_rag/tests/behavioral/test_cost_latency.py
agents/langgraph/agentic_rag/tests/behavioral/conftest.py

andrewdonheiser · 2026-05-19T14:57:39Z

Missing evidence of API contract and adversarial test execution

The Jira acceptance criterion #4 requires "API contract (7) and adversarial (4) tests pass," but the PR only documents the 11 agent-specific behavioral tests. The validation section mentions "Cross-agent consistency 13/13" which may cover this, but it's ambiguous — could you confirm the shared tests/behavioral/ API contract and adversarial tests were run against the deployed agentic_rag agent?

Assisted by Claude Opus 4.6 (1M context)

Yes they were run. I was not even looking at these checkbox items I was treating them as claude bloat. In future tickets I will make sure these are clarified

mpk-droid

lgtm

github-actions Bot added area/langgraph area/docs area/tests size/l labels May 14, 2026

coderabbitai Bot reviewed May 15, 2026

View reviewed changes

Comment thread agents/langgraph/agentic_rag/tests/behavioral/conftest.py

Comment thread agents/langgraph/agentic_rag/tests/behavioral/test_reliability.py

mpk-droid reviewed May 18, 2026

View reviewed changes

Comment thread agents/langgraph/agentic_rag/tests/behavioral/conftest.py

mpk-droid reviewed May 18, 2026

View reviewed changes

Comment thread agents/langgraph/agentic_rag/tests/behavioral/fixtures/golden_queries.yaml

mpk-droid requested changes May 18, 2026

View reviewed changes

andrewdonheiser and others added 5 commits May 19, 2026 08:15

style: apply ruff formatting to agentic_rag behavioral tests

6dcf021

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

andrewdonheiser force-pushed the RHAIENG-4223-eval-coverage-agentic-rag-deploy-agent-and-validate-test-suite branch from 1afcbb2 to 6f789a8 Compare May 19, 2026 12:20

andrewdonheiser requested a review from a team as a code owner May 19, 2026 12:20

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Comment thread agents/langgraph/agentic_rag/tests/behavioral/test_tool_usage.py

mpk-droid approved these changes May 19, 2026

View reviewed changes

andrewdonheiser merged commit 554082b into main May 19, 2026
8 checks passed

andrewdonheiser deleted the RHAIENG-4223-eval-coverage-agentic-rag-deploy-agent-and-validate-test-suite branch May 19, 2026 16:12

Conversation

andrewdonheiser commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Jira

Validation (Phase 11 — all gates passed)

Test plan

Uh oh!

coderabbitai Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

mpk-droid commented May 18, 2026

Uh oh!

mpk-droid commented May 18, 2026

Uh oh!

Uh oh!

Uh oh!

mpk-droid left a comment

Choose a reason for hiding this comment

Uh oh!

tarun-etikala commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

andrewdonheiser commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mpk-droid left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

andrewdonheiser commented May 14, 2026 •

edited

Loading

coderabbitai Bot commented May 14, 2026 •

edited

Loading

tarun-etikala commented May 18, 2026 •

edited

Loading

andrewdonheiser commented May 19, 2026 •

edited

Loading