RHAIENG-5593: add eval coverage for Google ADK agent #183

Open
andrewdonheiser wants to merge 4 commits into main from RHAIENG-5593-eval-coverage-google-adk-agent

andrewdonheiser commented Jun 12, 2026

Contributor

Summary

  • Add behavioral test suite (12 tests) and EvalHub integration for the Google ADK agent
  • Add MLflow tracing integration (tracing.py, enable_tracing(), tool span wrapping)
  • Add Makefile deploy target MLflow flags (conditional --set for MLflow env vars)
  • Fix run-btests-pytest.sh AGENTS array paths to include templates/ subdirectory (broken since RHAIENG-5413 move)

Apologies for the larger PR: RHAIENG-5605 (MLflow tracing) and RHAIENG-5606 (Makefile deploy flags) are included because both surfaced as blockers during behavioral test implementation. The risk is low, and both follow the same well-established pattern as the 8 existing agents.

Tickets covered

| Ticket | Scope |
| --- | --- |
| RHAIENG-5593 | Behavioral tests + EvalHub fixture + shared infra wiring |
| RHAIENG-5605 | MLflow tracing integration (tracing.py, main.py, pyproject.toml, Dockerfile) |
| RHAIENG-5606 | Makefile deploy/dry-run MLflow flags |

New files

  • agents/google/templates/adk/src/adk_agent/tracing.py — MLflow tracing init + tool span wrapper
  • agents/google/templates/adk/tests/behavioral/ — conftest, 5 test files, golden queries fixture
  • agents/google/templates/adk/evalhub/tool_use.yaml — EvalHub fixture

Validation results

| Check | Result |
| --- | --- |
| Pytest behavioral tests | 12/12 PASS |
| run-btests-pytest.sh google-adk | PASS (other failures pre-existing) |
| MLflow enrichment gate | PASS (no fallback warnings) |
| MLflow trace structure | CHAIN → CHAT_MODEL + TOOL with proper nesting |
| EvalHub E2E | tool_selection=1.0, hallucinated=1.0, validity=1.0, sequence=1.0 |
| All mlflow_run_ids non-null | PASS (all 10 agents) |
| Cross-agent consistency | 13/13 points |

Also discovered

  • RHAIENG-5612: Langflow agent was never added to _AGENT_URL_MAP or run-btests-pytest.sh (filed, not in this PR)
  • langgraph-hitl-agent had hardcoded MLFLOW_TRACKING_TOKEN (fixed during validation via make deploy)

Test plan

  • uv run --extra test --extra test-mlflow pytest agents/google/templates/adk/tests/behavioral/ -v — all 12 pass
  • ./tests/behavioral/deterministic/run-btests-pytest.sh google/templates/adk — PASS
  • OC_NAMESPACE=adonheis-testing ./evals/evalhub_adapter/tests/run-e2e.sh — all 10 agents complete, scores reasonable
  • CI checks pass

🤖 Generated with Claude Code

coderabbitai Bot commented Jun 12, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 59d1d95e-8e64-4060-8d57-45ed04ec1a77

📥 Commits

Reviewing files that changed from the base of the PR and between 5a1c1ac and cc0f8be.

⛔ Files ignored due to path filters (1)
  • agents/google/templates/adk/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (27)
  • README.md
  • agents/google/templates/adk/.env.example
  • agents/google/templates/adk/Dockerfile
  • agents/google/templates/adk/Makefile
  • agents/google/templates/adk/README.md
  • agents/google/templates/adk/evalhub/tool_use.yaml
  • agents/google/templates/adk/examples/ai_service.py
  • agents/google/templates/adk/main.py
  • agents/google/templates/adk/pyproject.toml
  • agents/google/templates/adk/src/adk_agent/agent.py
  • agents/google/templates/adk/src/adk_agent/tracing.py
  • agents/google/templates/adk/tests/behavioral/conftest.py
  • agents/google/templates/adk/tests/behavioral/fixtures/golden_queries.yaml
  • agents/google/templates/adk/tests/behavioral/test_cost_latency.py
  • agents/google/templates/adk/tests/behavioral/test_reliability.py
  • agents/google/templates/adk/tests/behavioral/test_response_quality.py
  • agents/google/templates/adk/tests/behavioral/test_streaming_parity.py
  • agents/google/templates/adk/tests/behavioral/test_tool_usage.py
  • docs/adding-behavioral-tests.md
  • docs/adding-evalhub-agent-integration.md
  • evals/evalhub_adapter/Containerfile
  • evals/evalhub_adapter/README.md
  • evals/evalhub_adapter/tests/run-e2e.sh
  • pyproject.toml
  • tests/behavioral/configs/thresholds.yaml
  • tests/behavioral/conftest.py
  • tests/behavioral/deterministic/run-btests-pytest.sh
✅ Files skipped from review due to trivial changes (7)
  • README.md
  • pyproject.toml
  • docs/adding-evalhub-agent-integration.md
  • tests/behavioral/conftest.py
  • docs/adding-behavioral-tests.md
  • agents/google/templates/adk/pyproject.toml
  • agents/google/templates/adk/.env.example
🚧 Files skipped from review as they are similar to previous changes (16)
  • tests/behavioral/configs/thresholds.yaml
  • agents/google/templates/adk/tests/behavioral/fixtures/golden_queries.yaml
  • agents/google/templates/adk/README.md
  • agents/google/templates/adk/Dockerfile
  • evals/evalhub_adapter/Containerfile
  • agents/google/templates/adk/evalhub/tool_use.yaml
  • agents/google/templates/adk/main.py
  • agents/google/templates/adk/tests/behavioral/test_response_quality.py
  • agents/google/templates/adk/tests/behavioral/test_streaming_parity.py
  • agents/google/templates/adk/tests/behavioral/test_reliability.py
  • agents/google/templates/adk/tests/behavioral/test_cost_latency.py
  • agents/google/templates/adk/Makefile
  • evals/evalhub_adapter/README.md
  • agents/google/templates/adk/tests/behavioral/conftest.py
  • evals/evalhub_adapter/tests/run-e2e.sh
  • agents/google/templates/adk/tests/behavioral/test_tool_usage.py

📝 Walkthrough

Adds optional MLflow tracing to Google ADK (tool/agent spans and startup wiring), comprehensive behavioral pytest suites and fixtures for Google ADK, and EvalHub fixture + e2e integration; updates build, deploy, and test configurations to wire tracing and evaluation settings.

Changes

MLflow Tracing Infrastructure & Agent Integration

| Layer / File(s) | Summary |
| --- | --- |
| MLflow tracing module and dependencies: agents/google/templates/adk/src/adk_agent/tracing.py, agents/google/templates/adk/pyproject.toml | New tracing.py module implements MLflow health checking, function wrapping with trace decorators (tool/agent span types), and startup initialization that configures experiment tracking and async logging. Adds mlflow>=3.10.0 as an optional tracing dependency group. |
| Agent initialization with tracing integration: agents/google/templates/adk/src/adk_agent/agent.py, agents/google/templates/adk/main.py, agents/google/templates/adk/examples/ai_service.py | Agent tool functions are wrapped with wrap_func_with_mlflow_trace to produce traced_tools, which are passed to LlmAgent construction. FastAPI application startup and example service call enable_tracing() before runner initialization. ChatCompletionResponse.context field documentation is updated to clarify it includes intermediate agent messages with tool calls and responses. |
| Build, deployment, and environment configuration: agents/google/templates/adk/Dockerfile, agents/google/templates/adk/.env.example, agents/google/templates/adk/Makefile, agents/google/templates/adk/README.md | Dockerfile installs with .[tracing] extras to enable MLflow dependencies. .env.example adds optional MLflow configuration templates. Makefile run-* targets conditionally pass --extra tracing when MLFLOW_TRACKING_URI is set; deploy and dry-run targets conditionally set Helm environment variables for MLflow tracking/experiment parameters. README documents behavioral testing with MLflow trace enrichment. |
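
As a rough illustration of the tool-wrapping step summarized above, the sketch below wraps plain tool functions so each call is recorded as a span while the function's name and docstring stay intact (agent frameworks commonly read those to describe tools to the model). This is not the code from tracing.py: `wrap_tool`, `recording_tracer`, and `search` are hypothetical names, and the in-memory recorder stands in for MLflow so the example runs without it. With MLflow installed, the assumption is that `mlflow.trace(span_type="TOOL")` could be passed as `trace_decorator` instead.

```python
import functools
from typing import Any, Callable, List, Optional

Decorator = Callable[[Callable[..., Any]], Callable[..., Any]]


def wrap_tool(func: Callable[..., Any],
              trace_decorator: Optional[Decorator] = None) -> Callable[..., Any]:
    """Return func wrapped for tracing, or func unchanged when tracing is off."""
    if trace_decorator is None:
        return func
    traced = trace_decorator(func)
    # Copy __name__, __doc__ and set __wrapped__ so anything that inspects
    # the tool (for example to build a tool schema) sees the original metadata.
    return functools.wraps(func)(traced)


def recording_tracer(spans: List[dict]) -> Decorator:
    """Hypothetical stand-in for a real tracer: appends one dict per call."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        def inner(*args: Any, **kwargs: Any) -> Any:
            result = func(*args, **kwargs)
            spans.append({
                "name": func.__name__,
                "type": "TOOL",
                "inputs": kwargs or args,
                "output": result,
            })
            return result
        return inner
    return decorator


def search(query: str) -> str:
    """Hypothetical tool: return a canned search result."""
    return f"results for {query}"


spans: List[dict] = []
# Tracing enabled: every tool call appends a span to `spans`.
traced_tools = [wrap_tool(tool, recording_tracer(spans)) for tool in [search]]
# Tracing disabled: the original functions are passed through untouched.
untraced_tools = [wrap_tool(tool) for tool in [search]]
```

Returning the original function when no tracer is supplied keeps tracing strictly optional, which matches the PR's framing of MLflow as an optional dependency group.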

Behavioral Testing Framework

| Layer / File(s) | Summary |
| --- | --- |
| Test framework, fixtures, and configuration: agents/google/templates/adk/tests/behavioral/conftest.py, agents/google/templates/adk/tests/behavioral/fixtures/golden_queries.yaml, tests/behavioral/configs/thresholds.yaml, tests/behavioral/conftest.py, README.md, pyproject.toml | Agent-specific conftest provides repository-root discovery, fixtures for agent URL, async HTTP client, evaluation configuration, and trace enrichment. Golden dataset YAML defines five test prompts with expected tools and response elements. Root conftest maps google_adk marker to GOOGLE_ADK_AGENT_URL. Root configuration adds google_adk pytest marker and threshold parameters for accuracy, latency, and pass@k evaluation. |
| Individual behavioral test modules: agents/google/templates/adk/tests/behavioral/test_*.py (5 modules) | Five test modules evaluate distinct aspects: test_cost_latency measures p95 response latency, test_reliability computes tool-usage and response-quality pass rates across multiple runs, test_response_quality scores plan coherence, test_streaming_parity verifies consistent tool calls between streaming and non-streaming modes, and test_tool_usage validates tool selection accuracy, prevents hallucination, checks argument validity, and ensures greetings do not invoke tools. |
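
For readers unfamiliar with the p95 latency metric mentioned above, here is a minimal sketch of one common way to compute it (the nearest-rank method). Which percentile method test_cost_latency.py actually uses is not stated in this PR, so treat this as an assumption; `percentile_nearest_rank` is a hypothetical helper and the latency values are made up for illustration.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Illustrative latencies in seconds (not measured from the agent).
latencies = [0.8, 1.1, 0.9, 1.0, 4.2, 1.2, 0.7, 1.3, 1.0, 0.9]
p95 = percentile_nearest_rank(latencies, 95)
```

With only ten samples, a single slow request determines the p95, so a latency test of this shape is sensitive to one outlier unless the sample count is raised.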

EvalHub Agent Integration & E2E Setup

| Layer / File(s) | Summary |
| --- | --- |
| EvalHub fixture definition and documentation: agents/google/templates/adk/evalhub/tool_use.yaml, evals/evalhub_adapter/README.md, docs/adding-behavioral-tests.md, docs/adding-evalhub-agent-integration.md | New tool_use.yaml fixture defines five evaluation prompts with expected tools and response elements. EvalHub adapter README documents Google ADK as a supported agent with fixture mappings and supported tool coverage. Documentation references updated to include Google ADK agent examples in behavioral testing and EvalHub integration guides. |
| EvalHub container image and e2e test integration: evals/evalhub_adapter/Containerfile, evals/evalhub_adapter/tests/run-e2e.sh, tests/behavioral/deterministic/run-btests-pytest.sh | Containerfile copies Google ADK fixtures into image and extends build validation. E2E script discovers google-adk-agent route, performs health checks, generates agent-specific eval configuration, submits eval runs, and prints MLflow-backed results. Deterministic test script updates agent path layout to */templates/* structure and includes additional agents including google/templates/adk. |

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main objective of the changeset: adding evaluation coverage for the Google ADK agent, which aligns with the behavioral tests, EvalHub integration, and related infrastructure changes. |
| Description check | ✅ Passed | The description comprehensively covers the changeset, including behavioral test suite additions, MLflow tracing integration, Makefile updates, and path fixes, with detailed validation results and ticket references. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 92.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


andrewdonheiser and others added 2 commits June 12, 2026 15:52
…tion for Google ADK agent

Add full behavioral test suite and evaluation infrastructure for the
Google ADK agent (RHAIENG-5593), along with MLflow tracing integration
(RHAIENG-5605) and Makefile deploy MLflow flags (RHAIENG-5606).

Behavioral tests (RHAIENG-5593):
- 5 test files: tool_usage, response_quality, cost_latency, reliability,
  streaming_parity (12 tests total, all passing)
- Golden queries fixture with 4 search queries + 1 adversarial
- EvalHub tool_use.yaml fixture, Containerfile COPY, run-e2e.sh blocks
- Shared infra: conftest _AGENT_URL_MAP, thresholds.yaml, pyproject.toml
  marker, run-btests-pytest.sh AGENTS array
- Documentation updates (adding-behavioral-tests.md, evalhub README)

MLflow tracing (RHAIENG-5605):
- New src/adk_agent/tracing.py with enable_tracing() and
  wrap_func_with_mlflow_trace() for TOOL span creation
- main.py lifespan calls enable_tracing()
- pyproject.toml [tracing] optional dependency group
- Dockerfile installs [tracing] extra
- .env.example documents MLflow env vars

Makefile deploy flags (RHAIENG-5606):
- Conditional --set flags for MLFLOW_TRACKING_URI, MLFLOW_EXPERIMENT_NAME,
  MLFLOW_TRACKING_INSECURE_TLS, MLFLOW_WORKSPACE, MLFLOW_TRACKING_TOKEN
- run-app/run-app-fresh/run-cli conditionally add --extra tracing

Also fixes run-btests-pytest.sh AGENTS array paths to include templates/
subdirectory (broken since RHAIENG-5413 template move).

Validated: 12/12 pytest pass, EvalHub E2E 1.0/1.0/1.0/1.0,
MLflow enrichment confirmed, 13/13 cross-agent consistency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
andrewdonheiser force-pushed the RHAIENG-5593-eval-coverage-google-adk-agent branch from 1f2791e to 47c14a5 on June 12, 2026 19:52

coderabbitai Bot left a comment

Contributor

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@agents/google/templates/adk/src/adk_agent/tracing.py`:
- Around line 49-50: Logs currently include the full mlflow_url (variable
mlflow_url) which can leak credentials or sensitive query params; update every
logging call that prints mlflow_url (the MLflow health check/log lines and other
occurrences) to redact secrets by parsing the URL (use urllib.parse.urlparse),
remove username/password and query params, and rebuild a safe string (e.g.,
scheme + "://" + hostname + optional port + path or replace netloc credentials
with "***") before logging; apply the same redaction logic wherever mlflow_url
is logged (search for mlflow_url in tracing.py) and add a small helper function
(e.g., redact_url(url: str) -> str) to centralize this behavior and use it in
the health-check and other log statements.
- Around line 46-47: The health check uses a raw requests.get(mlflow_url) which
ignores MLflow auth/TLS envs and can falsely disable tracing; update the check
around response = requests.get(mlflow_url, ...) to perform the request using the
same auth/TLS configuration the rest of the code uses: read
MLFLOW_TRACKING_TOKEN (set an Authorization: Bearer header if present) and
respect any insecure-TLS flag or cert verification setting (pass verify=False
when configured), or alternatively invoke the MlflowClient/Mlflow tracking API
to probe /health so the request uses MLflow's configured transport; ensure both
places (the current request and the duplicate around lines 120-122) use this
unified approach so authenticated or insecure-TLS endpoints are handled
correctly.
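
The two tracing.py findings above could be addressed along the following lines. This is a minimal sketch, not the fix that landed in the PR: `redact_url` follows the helper name suggested in the comment, `health_check_kwargs` is a hypothetical helper, the 5-second timeout is an arbitrary choice, and treating "1" or "true" as the truthy values of MLFLOW_TRACKING_INSECURE_TLS is an assumption.

```python
import os
from typing import Mapping, Optional
from urllib.parse import urlsplit, urlunsplit


def redact_url(url: str) -> str:
    """Drop userinfo, query string and fragment so the URL is safe to log."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if ":" in host:  # IPv6 literal: restore the brackets that urlsplit strips
        host = f"[{host}]"
    netloc = f"{host}:{parts.port}" if parts.port else host
    return urlunsplit((parts.scheme, netloc, parts.path, "", ""))


def health_check_kwargs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build keyword arguments for an HTTP GET from MLflow-related env vars."""
    env = os.environ if env is None else env
    kwargs: dict = {"timeout": 5, "verify": True}
    token = env.get("MLFLOW_TRACKING_TOKEN")
    if token:
        # Send the same bearer token the tracking client would use.
        kwargs["headers"] = {"Authorization": f"Bearer {token}"}
    insecure = env.get("MLFLOW_TRACKING_INSECURE_TLS", "").strip().lower()
    if insecure in {"1", "true"}:
        # Only skip certificate verification when explicitly configured.
        kwargs["verify"] = False
    return kwargs
```

A health probe could then be issued as `requests.get(url, **health_check_kwargs())` and any log line would print `redact_url(url)` rather than the raw value.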

In `@agents/google/templates/adk/tests/behavioral/test_reliability.py`:
- Line 24: The _SEARCH_EVIDENCE list in test_reliability.py is too broad
(contains "openshift" and "red hat") and can falsely mark runs as tool-usage
passes; replace those generic tokens with targeted evidence strings that
reliably indicate a tool call (e.g., phrases the agent/tool emits such as
"invoked search", "called search tool", "tool_response:", "Search results for")
and update the same evidence usage at the other occurrence (lines ~55-58) so
tests only count explicit tool-invocation outputs rather than generic topic
words.
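
To make the concern about broad evidence strings concrete, the sketch below scores responses against explicit tool-invocation markers only. The marker phrases are the ones proposed in the comment above; whether the agent actually emits any of them is an assumption, and `has_tool_evidence` and `pass_rate` are hypothetical helpers rather than code from test_reliability.py.

```python
from typing import Iterable, Sequence

# Phrases suggested in the review comment (assumed, not confirmed agent output).
TOOL_EVIDENCE: Sequence[str] = (
    "invoked search",
    "called search tool",
    "tool_response:",
    "search results for",
)


def has_tool_evidence(response_text: str,
                      markers: Iterable[str] = TOOL_EVIDENCE) -> bool:
    """True when the response contains at least one explicit tool marker."""
    lowered = response_text.lower()
    return any(marker.lower() in lowered for marker in markers)


def pass_rate(responses: Iterable[str]) -> float:
    """Fraction of runs whose response shows tool-invocation evidence."""
    results = [has_tool_evidence(text) for text in responses]
    return sum(results) / len(results) if results else 0.0
```

Under this check, a response such as "OpenShift is a Red Hat product." no longer counts as a tool-usage pass, whereas it would have matched the generic "openshift" and "red hat" tokens.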

In `@agents/google/templates/adk/tests/behavioral/test_streaming_parity.py`:
- Around line 44-49: The parity check currently skips asserting when both
result_sync.tool_calls and result_stream.tool_calls are empty, allowing vacuous
passes; update the logic around sync_tools/stream_tools (the sets built from
result_sync.tool_calls and result_stream.tool_calls) to first assert that at
least one of them is non-empty (e.g., assert sync_tools or stream_tools, "No
tool calls emitted in either run; expected at least one"), then perform the
equality assertion sync_tools == stream_tools so the test fails if both runs
stop emitting tool calls unexpectedly.
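
The non-vacuous parity check described above might look like the sketch below. It assumes each tool call is a mapping with a "name" key, which is a guess about the shape of `result_sync.tool_calls` and `result_stream.tool_calls`; `tool_names` and `assert_tool_parity` are hypothetical helpers.

```python
from typing import Iterable, Mapping, Set


def tool_names(tool_calls: Iterable[Mapping[str, object]]) -> Set[str]:
    """Collect the distinct tool names from a sequence of tool-call records."""
    return {str(call["name"]) for call in tool_calls}


def assert_tool_parity(sync_calls: Iterable[Mapping[str, object]],
                       stream_calls: Iterable[Mapping[str, object]]) -> None:
    """Fail if neither mode called a tool, or if the two modes disagree."""
    sync_tools = tool_names(sync_calls)
    stream_tools = tool_names(stream_calls)
    # Guard against a vacuous pass: two empty sets are trivially equal.
    assert sync_tools or stream_tools, (
        "No tool calls emitted in either run; expected at least one"
    )
    assert sync_tools == stream_tools, (
        f"Tool mismatch: sync={sorted(sync_tools)} stream={sorted(stream_tools)}"
    )
```

Comparing sets of names means a tool called twice in one mode and once in the other still passes; comparing ordered lists instead would be the stricter choice if call counts matter.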
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 76f604fb-84db-4444-b9f7-ea7840e52b78

📥 Commits

Reviewing files that changed from the base of the PR and between ba5c5a4 and 5a1c1ac.

📒 Files selected for processing (26)
  • README.md
  • agents/google/templates/adk/.env.example
  • agents/google/templates/adk/Dockerfile
  • agents/google/templates/adk/Makefile
  • agents/google/templates/adk/README.md
  • agents/google/templates/adk/evalhub/tool_use.yaml
  • agents/google/templates/adk/main.py
  • agents/google/templates/adk/pyproject.toml
  • agents/google/templates/adk/src/adk_agent/agent.py
  • agents/google/templates/adk/src/adk_agent/tracing.py
  • agents/google/templates/adk/tests/behavioral/conftest.py
  • agents/google/templates/adk/tests/behavioral/fixtures/golden_queries.yaml
  • agents/google/templates/adk/tests/behavioral/test_cost_latency.py
  • agents/google/templates/adk/tests/behavioral/test_reliability.py
  • agents/google/templates/adk/tests/behavioral/test_response_quality.py
  • agents/google/templates/adk/tests/behavioral/test_streaming_parity.py
  • agents/google/templates/adk/tests/behavioral/test_tool_usage.py
  • docs/adding-behavioral-tests.md
  • docs/adding-evalhub-agent-integration.md
  • evals/evalhub_adapter/Containerfile
  • evals/evalhub_adapter/README.md
  • evals/evalhub_adapter/tests/run-e2e.sh
  • pyproject.toml
  • tests/behavioral/configs/thresholds.yaml
  • tests/behavioral/conftest.py
  • tests/behavioral/deterministic/run-btests-pytest.sh

Comment thread agents/google/templates/adk/src/adk_agent/tracing.py Outdated
Comment thread agents/google/templates/adk/src/adk_agent/tracing.py Outdated
Comment thread agents/google/templates/adk/tests/behavioral/test_reliability.py
Comment thread agents/google/templates/adk/tests/behavioral/test_streaming_parity.py Outdated
andrewdonheiser and others added 2 commits June 12, 2026 16:10
…ion, streaming parity

- Health check now uses MLFLOW_TRACKING_TOKEN and MLFLOW_TRACKING_INSECURE_TLS
  so secured endpoints don't falsely disable tracing
- Add _safe_uri() helper to strip credentials/query params from logged URIs
- Streaming parity test now asserts tool calls are present (prevents vacuous
  pass when both modes emit nothing)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Call enable_tracing() in ai_service.py before get_runner() so that
`make run-cli` with MLFLOW_TRACKING_URI produces traces. Add missing
trailing newlines to tool_use.yaml and golden_queries.yaml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>