RHAIENG-5593: add eval coverage for Google ADK agent by andrewdonheiser · Pull Request #183 · red-hat-data-services/agentic-starter-kits

andrewdonheiser · 2026-06-12T19:44:11Z

Summary

Add behavioral test suite (12 tests) and EvalHub integration for the Google ADK agent
Add MLflow tracing integration (tracing.py, enable_tracing(), tool span wrapping)
Add Makefile deploy target MLflow flags (conditional --set for MLflow env vars)
Fix run-btests-pytest.sh AGENTS array paths to include templates/ subdirectory (broken since RHAIENG-5413 move)

Apologies for the larger PR — RHAIENG-5605 (MLflow tracing) and RHAIENG-5606 (Makefile deploy flags) were included because the risk was low and the implementation pattern is well-established across all other agents. Both were discovered as blockers during behavioral test implementation and follow the exact same pattern as the 8 existing agents.

Tickets covered

| Ticket | Scope |
| --- | --- |
| RHAIENG-5593 | Behavioral tests + EvalHub fixture + shared infra wiring |
| RHAIENG-5605 | MLflow tracing integration (tracing.py, main.py, pyproject.toml, Dockerfile) |
| RHAIENG-5606 | Makefile deploy/dry-run MLflow flags |

New files

agents/google/templates/adk/src/adk_agent/tracing.py — MLflow tracing init + tool span wrapper
agents/google/templates/adk/tests/behavioral/ — conftest, 5 test files, golden queries fixture
agents/google/templates/adk/evalhub/tool_use.yaml — EvalHub fixture

Validation results

| Check | Result |
| --- | --- |
| Pytest behavioral tests | 12/12 PASS |
| run-btests-pytest.sh | google-adk PASS (other failures pre-existing) |
| MLflow enrichment gate | PASS (no fallback warnings) |
| MLflow trace structure | CHAIN → CHAT_MODEL + TOOL with proper nesting |
| EvalHub E2E | tool_selection=1.0, hallucinated=1.0, validity=1.0, sequence=1.0 |
| All mlflow_run_ids non-null | PASS (all 10 agents) |
| Cross-agent consistency | 13/13 points |

Also discovered

RHAIENG-5612: Langflow agent was never added to _AGENT_URL_MAP or run-btests-pytest.sh (filed, not in this PR)
langgraph-hitl-agent had hardcoded MLFLOW_TRACKING_TOKEN (fixed during validation via make deploy)

Test plan

uv run --extra test --extra test-mlflow pytest agents/google/templates/adk/tests/behavioral/ -v — all 12 pass
./tests/behavioral/deterministic/run-btests-pytest.sh google/templates/adk — PASS
OC_NAMESPACE=adonheis-testing ./evals/evalhub_adapter/tests/run-e2e.sh — all 10 agents complete, scores reasonable
CI checks pass

🤖 Generated with Claude Code

coderabbitai · 2026-06-12T19:44:23Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 59d1d95e-8e64-4060-8d57-45ed04ec1a77

📥 Commits

Reviewing files that changed from the base of the PR and between 5a1c1ac and cc0f8be.

⛔ Files ignored due to path filters (1)

agents/google/templates/adk/uv.lock is excluded by !**/*.lock

📒 Files selected for processing (27)

README.md
agents/google/templates/adk/.env.example
agents/google/templates/adk/Dockerfile
agents/google/templates/adk/Makefile
agents/google/templates/adk/README.md
agents/google/templates/adk/evalhub/tool_use.yaml
agents/google/templates/adk/examples/ai_service.py
agents/google/templates/adk/main.py
agents/google/templates/adk/pyproject.toml
agents/google/templates/adk/src/adk_agent/agent.py
agents/google/templates/adk/src/adk_agent/tracing.py
agents/google/templates/adk/tests/behavioral/conftest.py
agents/google/templates/adk/tests/behavioral/fixtures/golden_queries.yaml
agents/google/templates/adk/tests/behavioral/test_cost_latency.py
agents/google/templates/adk/tests/behavioral/test_reliability.py
agents/google/templates/adk/tests/behavioral/test_response_quality.py
agents/google/templates/adk/tests/behavioral/test_streaming_parity.py
agents/google/templates/adk/tests/behavioral/test_tool_usage.py
docs/adding-behavioral-tests.md
docs/adding-evalhub-agent-integration.md
evals/evalhub_adapter/Containerfile
evals/evalhub_adapter/README.md
evals/evalhub_adapter/tests/run-e2e.sh
pyproject.toml
tests/behavioral/configs/thresholds.yaml
tests/behavioral/conftest.py
tests/behavioral/deterministic/run-btests-pytest.sh

✅ Files skipped from review due to trivial changes (7)

README.md
pyproject.toml
docs/adding-evalhub-agent-integration.md
tests/behavioral/conftest.py
docs/adding-behavioral-tests.md
agents/google/templates/adk/pyproject.toml
agents/google/templates/adk/.env.example

🚧 Files skipped from review as they are similar to previous changes (16)

tests/behavioral/configs/thresholds.yaml
agents/google/templates/adk/tests/behavioral/fixtures/golden_queries.yaml
agents/google/templates/adk/README.md
agents/google/templates/adk/Dockerfile
evals/evalhub_adapter/Containerfile
agents/google/templates/adk/evalhub/tool_use.yaml
agents/google/templates/adk/main.py
agents/google/templates/adk/tests/behavioral/test_response_quality.py
agents/google/templates/adk/tests/behavioral/test_streaming_parity.py
agents/google/templates/adk/tests/behavioral/test_reliability.py
agents/google/templates/adk/tests/behavioral/test_cost_latency.py
agents/google/templates/adk/Makefile
evals/evalhub_adapter/README.md
agents/google/templates/adk/tests/behavioral/conftest.py
evals/evalhub_adapter/tests/run-e2e.sh
agents/google/templates/adk/tests/behavioral/test_tool_usage.py

📝 Walkthrough

Adds optional MLflow tracing to Google ADK (tool/agent spans and startup wiring), comprehensive behavioral pytest suites and fixtures for Google ADK, and EvalHub fixture + e2e integration; updates build, deploy, and test configurations to wire tracing and evaluation settings.

Changes

MLflow Tracing Infrastructure & Agent Integration

| Layer / File(s) | Summary |
| --- | --- |
| MLflow tracing module and dependencies: `agents/google/templates/adk/src/adk_agent/tracing.py`, `agents/google/templates/adk/pyproject.toml` | New `tracing.py` module implements MLflow health checking, function wrapping with trace decorators (tool/agent span types), and startup initialization that configures experiment tracking and async logging. Adds `mlflow>=3.10.0` as an optional `tracing` dependency group. |
| Agent initialization with tracing integration: `agents/google/templates/adk/src/adk_agent/agent.py`, `agents/google/templates/adk/main.py`, `agents/google/templates/adk/examples/ai_service.py` | Agent tool functions are wrapped with `wrap_func_with_mlflow_trace` to produce `traced_tools`, which are passed to `LlmAgent` construction. FastAPI application startup and example service call `enable_tracing()` before runner initialization. `ChatCompletionResponse.context` field documentation is updated to clarify it includes intermediate agent messages with tool calls and responses. |
| Build, deployment, and environment configuration: `agents/google/templates/adk/Dockerfile`, `agents/google/templates/adk/.env.example`, `agents/google/templates/adk/Makefile`, `agents/google/templates/adk/README.md` | Dockerfile installs with `.[tracing]` extras to enable MLflow dependencies. `.env.example` adds optional MLflow configuration templates. Makefile `run-*` targets conditionally pass `--extra tracing` when `MLFLOW_TRACKING_URI` is set; `deploy` and `dry-run` targets conditionally set Helm environment variables for MLflow tracking/experiment parameters. README documents behavioral testing with MLflow trace enrichment. |
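
As a rough illustration of the tool-wrapping step summarized above, the sketch below wraps a plain function in an MLflow span and falls back to the unwrapped function when `mlflow` is not installed. It is not the PR's `wrap_func_with_mlflow_trace`; the names `wrap_tool_with_trace` and `search` are hypothetical, and the only library call assumed is `mlflow.trace(func, name=..., span_type=...)`.

```python
def wrap_tool_with_trace(func, span_type: str = "TOOL"):
    """Wrap func in an MLflow span if mlflow is importable; otherwise return it unchanged."""
    try:
        import mlflow  # optional dependency, e.g. installed through a `tracing` extra
    except ImportError:
        return func
    # mlflow.trace can be applied as a plain function call instead of a decorator.
    return mlflow.trace(func, name=func.__name__, span_type=span_type)


def search(query: str) -> str:
    """Hypothetical tool function standing in for the agent's real tools."""
    return f"results for {query}"


# The wrapped callables would then be handed to the agent in place of the raw tools.
traced_tools = [wrap_tool_with_trace(tool) for tool in (search,)]
```

Keeping the import inside the wrapper means the agent still starts when the optional dependency is absent, which matches the "optional tracing" framing in the walkthrough.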

Behavioral Testing Framework

| Layer / File(s) | Summary |
| --- | --- |
| Test framework, fixtures, and configuration: `agents/google/templates/adk/tests/behavioral/conftest.py`, `agents/google/templates/adk/tests/behavioral/fixtures/golden_queries.yaml`, `tests/behavioral/configs/thresholds.yaml`, `tests/behavioral/conftest.py`, `README.md`, `pyproject.toml` | Agent-specific conftest provides repository-root discovery, fixtures for agent URL, async HTTP client, evaluation configuration, and trace enrichment. Golden dataset YAML defines five test prompts with expected tools and response elements. Root conftest maps `google_adk` marker to `GOOGLE_ADK_AGENT_URL`. Root configuration adds `google_adk` pytest marker and threshold parameters for accuracy, latency, and pass@k evaluation. |
| Individual behavioral test modules: `agents/google/templates/adk/tests/behavioral/test_*.py` (5 modules) | Five test modules evaluate distinct aspects: `test_cost_latency` measures p95 response latency, `test_reliability` computes tool-usage and response-quality pass rates across multiple runs, `test_response_quality` scores plan coherence, `test_streaming_parity` verifies consistent tool calls between streaming and non-streaming modes, and `test_tool_usage` validates tool selection accuracy, prevents hallucination, checks argument validity, and ensures greetings do not invoke tools. |

EvalHub Agent Integration & E2E Setup

| Layer / File(s) | Summary |
| --- | --- |
| EvalHub fixture definition and documentation: `agents/google/templates/adk/evalhub/tool_use.yaml`, `evals/evalhub_adapter/README.md`, `docs/adding-behavioral-tests.md`, `docs/adding-evalhub-agent-integration.md` | New `tool_use.yaml` fixture defines five evaluation prompts with expected tools and response elements. EvalHub adapter README documents Google ADK as a supported agent with fixture mappings and supported tool coverage. Documentation references updated to include Google ADK agent examples in behavioral testing and EvalHub integration guides. |
| EvalHub container image and e2e test integration: `evals/evalhub_adapter/Containerfile`, `evals/evalhub_adapter/tests/run-e2e.sh`, `tests/behavioral/deterministic/run-btests-pytest.sh` | Containerfile copies Google ADK fixtures into image and extends build validation. E2E script discovers `google-adk-agent` route, performs health checks, generates agent-specific eval configuration, submits eval runs, and prints MLflow-backed results. Deterministic test script updates agent path layout to `/templates/` structure and includes additional agents including `google/templates/adk`. |

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main objective of the changeset: adding evaluation coverage for the Google ADK agent, which aligns with the behavioral tests, EvalHub integration, and related infrastructure changes. |
| Description check | ✅ Passed | The description comprehensively covers the changeset, including behavioral test suite additions, MLflow tracing integration, Makefile updates, and path fixes, with detailed validation results and ticket references. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 92.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

…tion for Google ADK agent

Add full behavioral test suite and evaluation infrastructure for the Google ADK agent (RHAIENG-5593), along with MLflow tracing integration (RHAIENG-5605) and Makefile deploy MLflow flags (RHAIENG-5606).

Behavioral tests (RHAIENG-5593):
- 5 test files: tool_usage, response_quality, cost_latency, reliability, streaming_parity (12 tests total, all passing)
- Golden queries fixture with 4 search queries + 1 adversarial
- EvalHub tool_use.yaml fixture, Containerfile COPY, run-e2e.sh blocks
- Shared infra: conftest _AGENT_URL_MAP, thresholds.yaml, pyproject.toml marker, run-btests-pytest.sh AGENTS array
- Documentation updates (adding-behavioral-tests.md, evalhub README)

MLflow tracing (RHAIENG-5605):
- New src/adk_agent/tracing.py with enable_tracing() and wrap_func_with_mlflow_trace() for TOOL span creation
- main.py lifespan calls enable_tracing()
- pyproject.toml [tracing] optional dependency group
- Dockerfile installs [tracing] extra
- .env.example documents MLflow env vars

Makefile deploy flags (RHAIENG-5606):
- Conditional --set flags for MLFLOW_TRACKING_URI, MLFLOW_EXPERIMENT_NAME, MLFLOW_TRACKING_INSECURE_TLS, MLFLOW_WORKSPACE, MLFLOW_TRACKING_TOKEN
- run-app/run-app-fresh/run-cli conditionally add --extra tracing

Also fixes run-btests-pytest.sh AGENTS array paths to include templates/ subdirectory (broken since RHAIENG-5413 template move).

Validated: 12/12 pytest pass, EvalHub E2E 1.0/1.0/1.0/1.0, MLflow enrichment confirmed, 13/13 cross-agent consistency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@agents/google/templates/adk/src/adk_agent/tracing.py`:
- Around line 49-50: Logs currently include the full mlflow_url (variable
mlflow_url) which can leak credentials or sensitive query params; update every
logging call that prints mlflow_url (the MLflow health check/log lines and other
occurrences) to redact secrets by parsing the URL (use urllib.parse.urlparse),
remove username/password and query params, and rebuild a safe string (e.g.,
scheme + "://" + hostname + optional port + path or replace netloc credentials
with "***") before logging; apply the same redaction logic wherever mlflow_url
is logged (search for mlflow_url in tracing.py) and add a small helper function
(e.g., redact_url(url: str) -> str) to centralize this behavior and use it in
the health-check and other log statements.
- Around line 46-47: The health check uses a raw requests.get(mlflow_url) which
ignores MLflow auth/TLS envs and can falsely disable tracing; update the check
around response = requests.get(mlflow_url, ...) to perform the request using the
same auth/TLS configuration the rest of the code uses: read
MLFLOW_TRACKING_TOKEN (set an Authorization: Bearer header if present) and
respect any insecure-TLS flag or cert verification setting (pass verify=False
when configured), or alternatively invoke the MlflowClient/Mlflow tracking API
to probe /health so the request uses MLflow's configured transport; ensure both
places (the current request and the duplicate around lines 120-122) use this
unified approach so authenticated or insecure-TLS endpoints are handled
correctly.
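
The two `tracing.py` findings above could be addressed along the following lines. This is a minimal sketch rather than the PR's actual fix: `redact_url` follows the helper name suggested in the review, `health_check_kwargs` is a hypothetical name, and the accepted truthy values for `MLFLOW_TRACKING_INSECURE_TLS` are an assumption.

```python
import os
from urllib.parse import urlsplit, urlunsplit


def redact_url(url: str) -> str:
    """Return url without userinfo, query string, or fragment, for safe logging."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if ":" in host:  # IPv6 literal: restore the brackets that .hostname strips
        host = f"[{host}]"
    try:
        port = parts.port
    except ValueError:  # malformed port; leave it out rather than fail while logging
        port = None
    netloc = f"{host}:{port}" if port else host
    return urlunsplit((parts.scheme, netloc, parts.path, "", ""))


def health_check_kwargs(env=None) -> dict:
    """Build keyword arguments for requests.get() from MLflow-related env vars."""
    env = os.environ if env is None else env
    kwargs = {"timeout": 5}
    token = env.get("MLFLOW_TRACKING_TOKEN")
    if token:
        kwargs["headers"] = {"Authorization": f"Bearer {token}"}
    # Assumption: "1" or "true" means "skip certificate verification".
    if env.get("MLFLOW_TRACKING_INSECURE_TLS", "").lower() in ("1", "true"):
        kwargs["verify"] = False
    return kwargs
```

A health check would then call something like `requests.get(mlflow_url, **health_check_kwargs())` and log `redact_url(mlflow_url)` instead of the raw URL.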

In `@agents/google/templates/adk/tests/behavioral/test_reliability.py`:
- Line 24: The _SEARCH_EVIDENCE list in test_reliability.py is too broad
(contains "openshift" and "red hat") and can falsely mark runs as tool-usage
passes; replace those generic tokens with targeted evidence strings that
reliably indicate a tool call (e.g., phrases the agent/tool emits such as
"invoked search", "called search tool", "tool_response:", "Search results for")
and update the same evidence usage at the other occurrence (lines ~55-58) so
tests only count explicit tool-invocation outputs rather than generic topic
words.
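
To make the `test_reliability.py` concern concrete, here is a hedged sketch of a stricter evidence check. The marker strings are taken from the examples in the comment above and are not confirmed outputs of the agent; `shows_tool_usage` and its `tool_calls` parameter are hypothetical.

```python
# Hypothetical markers; the real strings depend on what the agent's search tool emits.
TOOL_EVIDENCE = ("tool_response:", "search results for")


def shows_tool_usage(response_text: str, tool_calls=()) -> bool:
    """Return True only when a run carries explicit evidence of a tool invocation."""
    if tool_calls:  # structured tool-call records are the strongest signal
        return True
    text = response_text.lower()
    return any(marker in text for marker in TOOL_EVIDENCE)
```

With this check, a response that merely mentions "Red Hat" or "OpenShift" no longer counts as a tool-usage pass.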

In `@agents/google/templates/adk/tests/behavioral/test_streaming_parity.py`:
- Around line 44-49: The parity check currently skips asserting when both
result_sync.tool_calls and result_stream.tool_calls are empty, allowing vacuous
passes; update the logic around sync_tools/stream_tools (the sets built from
result_sync.tool_calls and result_stream.tool_calls) to first assert that at
least one of them is non-empty (e.g., assert sync_tools or stream_tools, "No
tool calls emitted in either run; expected at least one"), then perform the
equality assertion sync_tools == stream_tools so the test fails if both runs
stop emitting tool calls unexpectedly.
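
The non-vacuous parity check described for `test_streaming_parity.py` might look like the sketch below. `AgentResult` is a hypothetical stand-in for the suite's result object, and the `{"name": ...}` shape of each tool call is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class AgentResult:
    """Hypothetical stand-in for the result object returned by the test client."""
    tool_calls: list = field(default_factory=list)


def assert_tool_parity(result_sync: AgentResult, result_stream: AgentResult) -> None:
    """Fail if neither run called a tool, or if the two runs called different tools."""
    sync_tools = {call["name"] for call in result_sync.tool_calls}
    stream_tools = {call["name"] for call in result_stream.tool_calls}
    assert sync_tools or stream_tools, (
        "No tool calls emitted in either run; expected at least one"
    )
    assert sync_tools == stream_tools, f"sync={sync_tools} stream={stream_tools}"
```

The first assertion is what closes the gap: two empty sets compare equal, so without it the test passes even when the agent stops calling tools entirely.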

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 76f604fb-84db-4444-b9f7-ea7840e52b78

📥 Commits

Reviewing files that changed from the base of the PR and between ba5c5a4 and 5a1c1ac.

📒 Files selected for processing (26)

README.md
agents/google/templates/adk/.env.example
agents/google/templates/adk/Dockerfile
agents/google/templates/adk/Makefile
agents/google/templates/adk/README.md
agents/google/templates/adk/evalhub/tool_use.yaml
agents/google/templates/adk/main.py
agents/google/templates/adk/pyproject.toml
agents/google/templates/adk/src/adk_agent/agent.py
agents/google/templates/adk/src/adk_agent/tracing.py
agents/google/templates/adk/tests/behavioral/conftest.py
agents/google/templates/adk/tests/behavioral/fixtures/golden_queries.yaml
agents/google/templates/adk/tests/behavioral/test_cost_latency.py
agents/google/templates/adk/tests/behavioral/test_reliability.py
agents/google/templates/adk/tests/behavioral/test_response_quality.py
agents/google/templates/adk/tests/behavioral/test_streaming_parity.py
agents/google/templates/adk/tests/behavioral/test_tool_usage.py
docs/adding-behavioral-tests.md
docs/adding-evalhub-agent-integration.md
evals/evalhub_adapter/Containerfile
evals/evalhub_adapter/README.md
evals/evalhub_adapter/tests/run-e2e.sh
pyproject.toml
tests/behavioral/configs/thresholds.yaml
tests/behavioral/conftest.py
tests/behavioral/deterministic/run-btests-pytest.sh

…ion, streaming parity

- Health check now uses MLFLOW_TRACKING_TOKEN and MLFLOW_TRACKING_INSECURE_TLS so secured endpoints don't falsely disable tracing
- Add _safe_uri() helper to strip credentials/query params from logged URIs
- Streaming parity test now asserts tool calls are present (prevents vacuous pass when both modes emit nothing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Call enable_tracing() in ai_service.py before get_runner() so that `make run-cli` with MLFLOW_TRACKING_URI produces traces. Add missing trailing newlines to tool_use.yaml and golden_queries.yaml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

andrewdonheiser requested a review from a team as a code owner June 12, 2026 19:44

andrewdonheiser requested review from mpk-droid and shricharan-ks June 12, 2026 19:44

github-actions Bot added area/google-adk area/docs area/tests area/tracing labels Jun 12, 2026

github-actions Bot added the size/l label Jun 12, 2026

andrewdonheiser and others added 2 commits June 12, 2026 15:52

style: apply ruff format and update uv.lock for Google ADK agent

47c14a5

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

andrewdonheiser force-pushed the RHAIENG-5593-eval-coverage-google-adk-agent branch from 1f2791e to 47c14a5 Compare June 12, 2026 19:52

coderabbitai Bot reviewed Jun 12, 2026

andrewdonheiser and others added 2 commits June 12, 2026 16:10

andrewdonheiser wants to merge 4 commits into `main` from `RHAIENG-5593-eval-coverage-google-adk-agent`
