RHAIENG-5593: add eval coverage for Google ADK agent#183
RHAIENG-5593: add eval coverage for Google ADK agent#183andrewdonheiser wants to merge 4 commits into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Enterprise Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (27)
✅ Files skipped from review due to trivial changes (7)
🚧 Files skipped from review as they are similar to previous changes (16)
📝 WalkthroughWalkthroughAdds optional MLflow tracing to Google ADK (tool/agent spans and startup wiring), comprehensive behavioral pytest suites and fixtures for Google ADK, and EvalHub fixture + e2e integration; updates build, deploy, and test configurations to wire tracing and evaluation settings. ChangesMLflow Tracing Infrastructure & Agent Integration
Behavioral Testing Framework
EvalHub Agent Integration & E2E Setup
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…tion for Google ADK agent Add full behavioral test suite and evaluation infrastructure for the Google ADK agent (RHAIENG-5593), along with MLflow tracing integration (RHAIENG-5605) and Makefile deploy MLflow flags (RHAIENG-5606). Behavioral tests (RHAIENG-5593): - 5 test files: tool_usage, response_quality, cost_latency, reliability, streaming_parity (12 tests total, all passing) - Golden queries fixture with 4 search queries + 1 adversarial - EvalHub tool_use.yaml fixture, Containerfile COPY, run-e2e.sh blocks - Shared infra: conftest _AGENT_URL_MAP, thresholds.yaml, pyproject.toml marker, run-btests-pytest.sh AGENTS array - Documentation updates (adding-behavioral-tests.md, evalhub README) MLflow tracing (RHAIENG-5605): - New src/adk_agent/tracing.py with enable_tracing() and wrap_func_with_mlflow_trace() for TOOL span creation - main.py lifespan calls enable_tracing() - pyproject.toml [tracing] optional dependency group - Dockerfile installs [tracing] extra - .env.example documents MLflow env vars Makefile deploy flags (RHAIENG-5606): - Conditional --set flags for MLFLOW_TRACKING_URI, MLFLOW_EXPERIMENT_NAME, MLFLOW_TRACKING_INSECURE_TLS, MLFLOW_WORKSPACE, MLFLOW_TRACKING_TOKEN - run-app/run-app-fresh/run-cli conditionally add --extra tracing Also fixes run-btests-pytest.sh AGENTS array paths to include templates/ subdirectory (broken since RHAIENG-5413 template move). Validated: 12/12 pytest pass, EvalHub E2E 1.0/1.0/1.0/1.0, MLflow enrichment confirmed, 13/13 cross-agent consistency. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1f2791e to
47c14a5
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@agents/google/templates/adk/src/adk_agent/tracing.py`:
- Around line 49-50: Logs currently include the full mlflow_url (variable
mlflow_url) which can leak credentials or sensitive query params; update every
logging call that prints mlflow_url (the MLflow health check/log lines and other
occurrences) to redact secrets by parsing the URL (use urllib.parse.urlparse),
remove username/password and query params, and rebuild a safe string (e.g.,
scheme + "://" + hostname + optional port + path or replace netloc credentials
with "***") before logging; apply the same redaction logic wherever mlflow_url
is logged (search for mlflow_url in tracing.py) and add a small helper function
(e.g., redact_url(url: str) -> str) to centralize this behavior and use it in
the health-check and other log statements.
- Around line 46-47: The health check uses a raw requests.get(mlflow_url) which
ignores MLflow auth/TLS envs and can falsely disable tracing; update the check
around response = requests.get(mlflow_url, ...) to perform the request using the
same auth/TLS configuration the rest of the code uses: read
MLFLOW_TRACKING_TOKEN (set an Authorization: Bearer header if present) and
respect any insecure-TLS flag or cert verification setting (pass verify=False
when configured), or alternatively invoke the MlflowClient/Mlflow tracking API
to probe /health so the request uses MLflow's configured transport; ensure both
places (the current request and the duplicate around lines 120-122) use this
unified approach so authenticated or insecure-TLS endpoints are handled
correctly.
In `@agents/google/templates/adk/tests/behavioral/test_reliability.py`:
- Line 24: The _SEARCH_EVIDENCE list in test_reliability.py is too broad
(contains "openshift" and "red hat") and can falsely mark runs as tool-usage
passes; replace those generic tokens with targeted evidence strings that
reliably indicate a tool call (e.g., phrases the agent/tool emits such as
"invoked search", "called search tool", "tool_response:", "Search results for")
and update the same evidence usage at the other occurrence (lines ~55-58) so
tests only count explicit tool-invocation outputs rather than generic topic
words.
In `@agents/google/templates/adk/tests/behavioral/test_streaming_parity.py`:
- Around line 44-49: The parity check currently skips asserting when both
result_sync.tool_calls and result_stream.tool_calls are empty, allowing vacuous
passes; update the logic around sync_tools/stream_tools (the sets built from
result_sync.tool_calls and result_stream.tool_calls) to first assert that at
least one of them is non-empty (e.g., assert sync_tools or stream_tools, "No
tool calls emitted in either run; expected at least one"), then perform the
equality assertion sync_tools == stream_tools so the test fails if both runs
stop emitting tool calls unexpectedly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 76f604fb-84db-4444-b9f7-ea7840e52b78
📒 Files selected for processing (26)
README.mdagents/google/templates/adk/.env.exampleagents/google/templates/adk/Dockerfileagents/google/templates/adk/Makefileagents/google/templates/adk/README.mdagents/google/templates/adk/evalhub/tool_use.yamlagents/google/templates/adk/main.pyagents/google/templates/adk/pyproject.tomlagents/google/templates/adk/src/adk_agent/agent.pyagents/google/templates/adk/src/adk_agent/tracing.pyagents/google/templates/adk/tests/behavioral/conftest.pyagents/google/templates/adk/tests/behavioral/fixtures/golden_queries.yamlagents/google/templates/adk/tests/behavioral/test_cost_latency.pyagents/google/templates/adk/tests/behavioral/test_reliability.pyagents/google/templates/adk/tests/behavioral/test_response_quality.pyagents/google/templates/adk/tests/behavioral/test_streaming_parity.pyagents/google/templates/adk/tests/behavioral/test_tool_usage.pydocs/adding-behavioral-tests.mddocs/adding-evalhub-agent-integration.mdevals/evalhub_adapter/Containerfileevals/evalhub_adapter/README.mdevals/evalhub_adapter/tests/run-e2e.shpyproject.tomltests/behavioral/configs/thresholds.yamltests/behavioral/conftest.pytests/behavioral/deterministic/run-btests-pytest.sh
…ion, streaming parity - Health check now uses MLFLOW_TRACKING_TOKEN and MLFLOW_TRACKING_INSECURE_TLS so secured endpoints don't falsely disable tracing - Add _safe_uri() helper to strip credentials/query params from logged URIs - Streaming parity test now asserts tool calls are present (prevents vacuous pass when both modes emit nothing) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Call enable_tracing() in ai_service.py before get_runner() so that `make run-cli` with MLFLOW_TRACKING_URI produces traces. Add missing trailing newlines to tool_use.yaml and golden_queries.yaml. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
tracing.py,enable_tracing(), tool span wrapping)--setfor MLflow env vars)run-btests-pytest.shAGENTS array paths to includetemplates/subdirectory (broken since RHAIENG-5413 move)Apologies for the larger PR — RHAIENG-5605 (MLflow tracing) and RHAIENG-5606 (Makefile deploy flags) were included because the risk was low and the implementation pattern is well-established across all other agents. Both were discovered as blockers during behavioral test implementation and follow the exact same pattern as the 8 existing agents.
Tickets covered
New files
agents/google/templates/adk/src/adk_agent/tracing.py— MLflow tracing init + tool span wrapperagents/google/templates/adk/tests/behavioral/— conftest, 5 test files, golden queries fixtureagents/google/templates/adk/evalhub/tool_use.yaml— EvalHub fixtureValidation results
Also discovered
_AGENT_URL_MAPorrun-btests-pytest.sh(filed, not in this PR)langgraph-hitl-agenthad hardcodedMLFLOW_TRACKING_TOKEN(fixed during validation viamake deploy)Test plan
uv run --extra test --extra test-mlflow pytest agents/google/templates/adk/tests/behavioral/ -v— all 12 pass./tests/behavioral/deterministic/run-btests-pytest.sh google/templates/adk— PASSOC_NAMESPACE=adonheis-testing ./evals/evalhub_adapter/tests/run-e2e.sh— all 10 agents complete, scores reasonable🤖 Generated with Claude Code