This project implements a Continuous Evaluation Pipeline for a multi-agent system built with the Google Agent Development Kit (ADK) and the Agent2Agent (A2A) protocol on Cloud Run. It features a team of microservice agents that research, judge, and build content, orchestrated to deliver high-quality results.
The goal of this project is to demonstrate Agentic Engineering practices for Continuous Evaluation: safely deploying agents to shadow revisions, running automated evaluation suites using Vertex AI, and making data-driven decisions on agent deployments and improvements.
It is a companion code repository to the codelab From "vibe checks" to data-driven Agent Evaluation.
It uses the Vertex AI Gen AI Evaluation Service, which provides enterprise-grade tools for objective, data-driven assessment of generative AI models and agents.
The system uses a distributed microservices architecture where each agent runs in its own container and communicates via the A2A protocol:
- Orchestrator Service (orchestrator): The main entry point and "brain" of the operation. It manages the workflow using LoopAgent and SequentialAgent patterns, delegating tasks to other agents.
- Researcher Service (researcher): A standalone agent equipped with a Wikipedia Search tool. It gathers information based on queries.
- Judge Service (judge): A standalone agent that evaluates the quality and relevance of the research provided by the Researcher.
- Content Builder Service (content_builder): A standalone agent that compiles the verified information into a final comprehensive report or course.
- Agent App (app): A user-facing web application that talks to the Orchestrator, allowing users to trigger runs and view progress.
multi-agent-eval/
├── agents/ # Source code for the agents
│ ├── orchestrator/ # Main Orchestrator agent (ADK API Service)
│ ├── researcher/ # Researcher agent (with Wikipedia Search Tool)
│ ├── judge/ # Judge agent (Quality Assurance)
│ └── content_builder/ # Content Builder agent (Writer)
├── app/ # Web App service application
│ └── frontend/ # Frontend application that uses Web App service API
├── evaluator/ # Evaluation Logic
│ ├── evaluate_agent.py # Main script to run Vertex AI evaluations
│ ├── eval_data_*.json # Golden Datasets for agents
│ ├── show_evaluation_run.ipynb # Notebook to visualize results
│ └── show_evaluation_run_local.ipynb # Notebook to visualize results of a local run
├── shared/ # Common libraries (symlinked to agents)
│ ├── evaluation/ # Shared evaluation logic (engine & metrics)
│ ├── a2a_utils.py # Utilities for A2A Service-to-Service calls
│ ├── adk_app.py # ADK application wrapper
│ └── traced_authenticated_httpx.py # Auth handling for Service-to-Service calls
├── deploy.sh # Deployment Automation Script
└── evaluate.sh # CI/CD Entry point for Evaluation
Each agent is a separate Cloud Run service.
- Orchestrator: Implements the high-level logic. It breaks down the user request, asks the Researcher for info, asks the Judge to verify it, and loops until the Judge is satisfied before sending data to the Content Builder.
- Researcher: A specialized tool-use agent. It has access to external tools (Google Search/Wikipedia) and is optimized for information retrieval.
- Judge: A critic agent. It compares the research against the original query to ensure relevance.
- Content Builder: A creative agent. It takes raw text and formats it into educational content.
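The research, judge, and build flow described above can be sketched in plain Python. This is only an illustration of the control flow, not the repository's ADK implementation: the function names, the feedback string, and the max_iterations cap are all assumptions introduced for the example, and the three agents are passed in as ordinary callables.

```python
from typing import Callable


def run_pipeline(
    request: str,
    research: Callable[[str, str], str],
    judge: Callable[[str, str], tuple[bool, str]],
    build: Callable[[str], str],
    max_iterations: int = 3,  # hypothetical safety cap on the loop
) -> str:
    """Research and judge in a loop, then hand the findings to the builder."""
    findings = ""
    feedback = ""
    for _ in range(max_iterations):
        findings = research(request, feedback)
        approved, feedback = judge(request, findings)
        if approved:
            break
    return build(findings)


# Stub agents (hypothetical) that stand in for the remote services.
def fake_research(request: str, feedback: str) -> str:
    suffix = " (revised)" if feedback else ""
    return f"notes on {request}{suffix}"


def fake_judge(request: str, findings: str) -> tuple[bool, str]:
    if "(revised)" in findings:
        return True, ""
    return False, "add more detail"


def fake_build(findings: str) -> str:
    return findings.upper()


result = run_pipeline("solar power", fake_research, fake_judge, fake_build)
print(result)
```

With these stubs the Judge rejects the first draft and accepts the second, so the loop runs twice before the builder is called.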
deploy.sh: Not just a deployment script, but a Revision Manager.
- It enables necessary Google Cloud APIs.
- It identifies the current Project and Region.
- It supports Shadow Deployment via --revision-tag. This allows deploying a new version of the code alongside the live version without routing public traffic to it.
evaluate.sh: The heart of the pipeline.
- Captures the current Git Commit Hash.
- Calls deploy.sh to create a tagged revision (e.g., c-a1b2c3d).
- Runs the Python evaluation suite against that specific revision URL.
The core evaluation logic is decoupled from the specific agent tests and lives in shared/evaluation.
evaluate.py: The Evaluation Engine.
- Parallel Inference: Runs the evaluation dataset against the agent API in parallel using asyncio.
- Data Management: Uploads both the inference results/traces and the original dataset to GCS.
- Vertex Integration: Triggers a Vertex Gen AI Evaluation Service run to calculate metrics (both Rubric and Custom).
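As a rough illustration of the parallel inference step, the sketch below fans a list of prompts out to an agent with asyncio and a concurrency limit. It is not the code in evaluate.py: run_inference, call_agent, max_concurrency, and the fake agent are hypothetical names, and a real run would replace the fake agent with an HTTP call to the tagged revision.

```python
import asyncio
from typing import Awaitable, Callable


async def run_inference(
    prompts: list[str],
    call_agent: Callable[[str], Awaitable[str]],
    max_concurrency: int = 4,  # hypothetical limit on in-flight requests
) -> list[str]:
    """Call the agent once per prompt, keeping results in prompt order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt: str) -> str:
        # The semaphore bounds how many calls run at the same time.
        async with semaphore:
            return await call_agent(prompt)

    return await asyncio.gather(*(one(p) for p in prompts))


async def fake_agent(prompt: str) -> str:
    """Stand-in for a request to the deployed agent."""
    await asyncio.sleep(0.01)
    return f"answer to: {prompt}"


responses = asyncio.run(
    run_inference(["q1", "q2", "q3"], fake_agent, max_concurrency=2)
)
print(responses)
```

asyncio.gather returns results in the order the awaitables were passed, so each response can be matched back to its dataset row by index.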
tool_metrics.py: Custom Metric Definitions.
- Implements Trajectory metrics that usually require custom logic not found in standard LLM evaluators.
  - trajectory_exact_match: Did the agent call the exact sequence of tools?
  - trajectory_precision / trajectory_recall: Information retrieval style metrics for tool usage.
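To make the trajectory metrics concrete, here is a minimal sketch that compares tool calls by name only. The definitions in tool_metrics.py may differ (for example, by also comparing tool arguments); the handling of empty trajectories and the tool names in the example are assumptions.

```python
def trajectory_exact_match(predicted: list[str], reference: list[str]) -> float:
    """1.0 if the agent called exactly the reference tools in the same order."""
    return 1.0 if predicted == reference else 0.0


def trajectory_precision(predicted: list[str], reference: list[str]) -> float:
    """Fraction of the agent's tool calls that appear in the reference."""
    if not predicted:
        return 0.0  # assumption: no calls means nothing was correct
    expected = set(reference)
    return sum(1 for tool in predicted if tool in expected) / len(predicted)


def trajectory_recall(predicted: list[str], reference: list[str]) -> float:
    """Fraction of the reference tool calls that the agent actually made."""
    if not reference:
        return 0.0  # assumption: an empty reference scores zero
    called = set(predicted)
    return sum(1 for tool in reference if tool in called) / len(reference)


# Hypothetical tool names, for illustration only.
predicted = ["wikipedia_search", "wikipedia_search", "calculator"]
reference = ["wikipedia_search"]

exact = trajectory_exact_match(predicted, reference)
precision = trajectory_precision(predicted, reference)
recall = trajectory_recall(predicted, reference)
print(exact, precision, recall)
```

In this example the agent made every expected call (recall is 1.0), but one of its three calls was not expected, which lowers precision to 2/3 and makes the exact match fail.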
The evaluator/ directory contains the specific test definitions for this project.
- We use Vertex AI Gen AI Evaluation Service.
- Metrics:
  - Final Response Match: Checks whether the Researcher's response supports the correct answer from the Golden Dataset.
  - Tool Use Quality: Checks whether tool calls are malformed or unnecessary.
  - Hallucination: Verifies that the Orchestrator's final output is grounded in the retrieved context.
The shared adk_app.py script is used for all agents. It provides:
- ADK API Server wrapper
- A2A service registration and AgentCard
- Robust Cloud Trace integration for end-to-end tracing, including A2A subagents
  Note: The deployment script (deploy.sh) sets the OTEL_TRACES_SAMPLER environment variable to always_on. In production deployments, to avoid high trace volume, you may want to set it to parentbased_traceidratio or another value appropriate for a high request rate. It also sets the ADK_CAPTURE_MESSAGE_CONTENT_IN_SPANS environment variable to false to prevent personally identifiable information (PII) from being attached to tracing spans as attributes.
- Health checks
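The tracing note above can be summarized as a small helper that picks the environment variables for a deployment. This is a sketch, not part of deploy.sh: the tracing_env function and the 0.1 default ratio are assumptions. OTEL_TRACES_SAMPLER_ARG is the standard OpenTelemetry variable that carries the sampling ratio for the traceidratio samplers.

```python
def tracing_env(production: bool, sample_ratio: float = 0.1) -> dict[str, str]:
    """Return tracing-related environment variables for a deployment."""
    env = {
        # Keep message content (possible PII) out of span attributes.
        "ADK_CAPTURE_MESSAGE_CONTENT_IN_SPANS": "false",
    }
    if production:
        # Sample a fraction of root traces and follow the parent's decision.
        env["OTEL_TRACES_SAMPLER"] = "parentbased_traceidratio"
        env["OTEL_TRACES_SAMPLER_ARG"] = str(sample_ratio)
    else:
        # Record every trace, which is convenient for evaluation runs.
        env["OTEL_TRACES_SAMPLER"] = "always_on"
    return env


print(tracing_env(production=False))
print(tracing_env(production=True, sample_ratio=0.05))
```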
- uv: Python package manager (required for local development).
- Google Cloud SDK: For GCP services and authentication.
- Docker: If building containers locally (optional).
- Install Dependencies:
  uv sync
- Set up credentials:
  If you haven't set up your Google Cloud credentials for the gcloud CLI yet, run:
  gcloud auth login --update-adc
This project follows a "Deploy-then-Test" workflow, often called Shadow Testing.
Modify the agent code (e.g., change the prompt in agents/researcher/agent.py).
Instead of testing manually, run the full suite:
./evaluate.sh
What happens:
- Your code is deployed to Cloud Run as a new revision with a tag derived from the commit hash (e.g., https://c-1234abcd---researcher-xyz.run.app).
- The evaluator.evaluate_agent module evaluates the Researcher and Orchestrator agents using, respectively, the eval_data_researcher.json and eval_data_orchestrator.json datasets.
  - It sends test prompts to the tagged revisions of the deployed agents.
  - It evaluates the results using Vertex AI Gen AI Evaluation Service.
  - It prints a summary of Pass/Fail metrics.
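The tagged revision URL follows the pattern visible in the example above: the tag, three hyphens, then the service host. The sketch below derives a tag from a commit hash and builds that URL. The helper names are hypothetical, and the default of 7 hash characters is an assumption (the examples in this document show tags with both 7 and 8 characters), so check deploy.sh for the actual length.

```python
from urllib.parse import urlsplit, urlunsplit


def revision_tag(commit_hash: str, length: int = 7) -> str:
    """Build a tag such as c-a1b2c3d from a commit hash (length is assumed)."""
    return f"c-{commit_hash[:length].lower()}"


def tagged_url(service_url: str, tag: str) -> str:
    """Prefix the service host with the tag, as in the example URL above."""
    parts = urlsplit(service_url)
    return urlunsplit(parts._replace(netloc=f"{tag}---{parts.netloc}"))


tag = revision_tag("a1b2c3d4e5f6")
url = tagged_url("https://researcher-xyz.run.app", tag)
print(tag)
print(url)
```

Requests sent to the tagged URL reach only that revision, which is what lets the evaluation run without affecting live traffic.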
If the evaluation fails or you want to see details:
- Open evaluator/show_evaluation_run.ipynb in Google Colab.
- Set the GOOGLE_CLOUD_PROJECT, GOOGLE_CLOUD_REGION, and EVAL_RUN_ID variables.
- Visualize the traces and metric breakdowns to debug.
If your run experienced a fallback to local execution, use evaluator/show_evaluation_run_local.ipynb.
Once you are happy with the evaluation results:
./deploy.sh
(Without flags, this deploys the code as the latest revision and routes 100% of traffic to it.)
In a production system, the agent evaluation should be run as part of the CI/CD pipeline. Cloud Build is a good choice for that.
.cloudbuild/cloudbuild.yaml is an example of a Cloud Build configuration file that defines the following steps:
- Deploy the code to Cloud Run as a new revision with a tag made of a commit hash.
- Run the evaluation (and probably your unit tests before that).
- If the tests or the evaluation fail, the deployment will stop here.
- If the tests and the evaluation pass, it will continue with promoting the revisions to serve 100% of traffic.
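The gate formed by these steps can be sketched as a single function: deploy the tagged revision, evaluate it, and promote only if every metric clears its threshold. This is an illustration rather than the contents of cloudbuild.yaml. The callables, metric names, and threshold values are hypothetical, and the sketch assumes that a higher score is better for every metric.

```python
from typing import Callable


def run_gate(
    deploy_tagged: Callable[[], None],
    run_evaluation: Callable[[], dict[str, float]],
    promote: Callable[[], None],
    thresholds: dict[str, float],
) -> bool:
    """Deploy, evaluate, and promote only when all thresholds are met."""
    deploy_tagged()
    scores = run_evaluation()
    # A metric missing from the results is treated as a failure (score 0.0).
    failures = {
        name: scores.get(name, 0.0)
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    }
    if failures:
        print(f"Evaluation failed, not promoting: {failures}")
        return False
    promote()
    return True


# Stubbed steps (hypothetical) that record what the gate did.
calls: list[str] = []
promoted = run_gate(
    deploy_tagged=lambda: calls.append("deploy"),
    run_evaluation=lambda: {"final_response_match": 0.9, "hallucination": 0.6},
    promote=lambda: calls.append("promote"),
    thresholds={"final_response_match": 0.8, "hallucination": 0.7},
)
print(promoted, calls)
```

Here the second metric falls below its threshold, so the tagged revision stays deployed for inspection but never receives production traffic.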
.cloudbuild/run_cloud_build.sh is an example of a script that invokes the Cloud Build pipeline. It also shows how to create a Service Account with the necessary permissions to run the pipeline.
You may need to enable the Cloud Build API for the project to use it.
gcloud services enable cloudbuild.googleapis.com
In a real system, you would want to create a Cloud Build Trigger that runs the pipeline when a new commit is pushed to the repository. In that case, the SHORT_SHA substitution variable is automatically set to the commit hash of the new commit, and cloudbuild.yaml handles that.