This guide covers the DeepEval-based evaluation framework for testing how well LLM agents use the RHOAI MCP server's tools to accomplish real-world tasks.
The evaluation framework measures whether an LLM agent can effectively use the MCP tools provided by the RHOAI server. Instead of checking tool implementations directly (that's what unit tests do), evals answer the question: "Given a natural-language task, does the agent call the right tools in the right order and produce a useful result?"
The framework uses:
- A real LLM agent (OpenAI, Anthropic, or Google Gemini) that receives tasks and calls MCP tools
- The real RHOAI MCP server running in-process with all plugins loaded
- A mock K8s cluster (or optionally a live cluster) providing realistic data
- DeepEval metrics with a judge LLM that scores the agent's tool usage and task completion
This replaces the earlier self-instrumentation approach (ENABLE_EVALUATION hooks) with an external, LLM-judged evaluation that better reflects real-world agent behavior.
- Python 3.10+
- uv package manager
- An API key for at least one supported LLM provider (for the agent LLM and the DeepEval judge LLM)
- (Optional) A live OpenShift cluster with RHOAI installed, for live-cluster evals
- Copy the example environment file and fill in your API keys:

  ```bash
  cp .env.eval.example .env.eval
  ```

  At minimum, set the provider and API key for both the agent and the judge:

  ```bash
  RHOAI_EVAL_LLM_API_KEY=sk-...
  RHOAI_EVAL_EVAL_API_KEY=sk-...
  ```

- Install the eval dependency group:

  ```bash
  uv sync --group eval
  ```
```bash
# Run all mock-cluster scenarios
make eval

# Run all scenarios including live-cluster tests
make eval-live

# Run a single scenario by name
make eval-scenario SCENARIO=cluster_exploration
make eval-scenario SCENARIO=training_workflow
make eval-scenario SCENARIO=model_deployment
make eval-scenario SCENARIO=troubleshooting
make eval-scenario SCENARIO=tool_discovery
```

You can also invoke pytest directly:

```bash
# Mock-cluster scenarios only
uv run --group eval pytest evals/ -v -m "eval and not live" --tb=short

# All scenarios
uv run --group eval pytest evals/ -v -m "eval" --tb=short

# Single scenario file
uv run --group eval pytest evals/scenarios/test_cluster_exploration.py -v --tb=short
```

All variables use the `RHOAI_EVAL_` prefix and can be set in `.env.eval` or as environment variables.
| Variable | Default | Description |
|---|---|---|
| `LLM_PROVIDER` | `openai` | Agent LLM provider (see Supported Providers) |
| `LLM_MODEL` | `gpt-4o` | Model name for the agent LLM |
| `LLM_API_KEY` | (none) | API key for the agent LLM |
| `LLM_BASE_URL` | (none) | Base URL for vLLM or Azure endpoints |
| `EVAL_PROVIDER` | `openai` | Judge LLM provider (see Supported Providers) |
| `EVAL_MODEL` | `gpt-4o` | Model name for the DeepEval judge LLM |
| `EVAL_API_KEY` | (none) | API key for the judge LLM |
| `EVAL_MODEL_BASE_URL` | (none) | Base URL for a custom judge endpoint |
| `VERTEX_PROJECT_ID` | (none) | Google Cloud project ID (for `anthropic-vertex` and `google-vertex`) |
| `VERTEX_LOCATION` | `us-central1` | Google Cloud region (for `anthropic-vertex` and `google-vertex`) |
| `CLUSTER_MODE` | `mock` | `mock` (no cluster needed) or `live` (real cluster) |
| `MCP_USE_THRESHOLD` | `0.5` | Minimum score for MCP tool usage metrics (0.0-1.0) |
| `TASK_COMPLETION_THRESHOLD` | `0.6` | Minimum score for task completion metrics (0.0-1.0) |
| `MAX_AGENT_TURNS` | `20` | Maximum LLM turns per scenario (1-100) |
| Provider | Value | SDK | Notes |
|---|---|---|---|
| OpenAI | `openai` | `openai` | Default. Uses the OpenAI API directly. |
| vLLM | `vllm` | `openai` | OpenAI-compatible. Requires `LLM_BASE_URL`. |
| Azure OpenAI | `azure` | `openai` | OpenAI-compatible. Requires `LLM_BASE_URL`. |
| Anthropic | `anthropic` | `anthropic` | Claude models via the direct API. |
| Anthropic on Vertex AI | `anthropic-vertex` | `anthropic` | Claude models via Google Vertex AI. Requires `VERTEX_PROJECT_ID`. |
| Google Gemini | `google-genai` | `google-genai` | Gemini models via API key. |
| Google Gemini on Vertex AI | `google-vertex` | `google-genai` | Gemini models via Vertex AI. Requires `VERTEX_PROJECT_ID`. |
```
evals/
├── config.py               # EvalConfig (pydantic-settings)
├── conftest.py             # Shared pytest fixtures
├── mcp_harness.py          # In-process MCP server lifecycle
├── agent.py                # Provider-agnostic agent loop
├── deepeval_helpers.py     # AgentResult -> DeepEval test case conversion
├── providers/
│   ├── __init__.py             # Exports factory functions
│   ├── base.py                 # AgentLLMProvider ABC + dataclasses
│   ├── openai_provider.py      # OpenAI/Azure/vLLM provider
│   ├── anthropic_provider.py   # Anthropic/Anthropic-Vertex provider
│   ├── google_provider.py      # Google GenAI/Vertex provider
│   ├── judge.py                # DeepEvalBaseLLM subclasses per provider
│   └── factory.py              # create_agent_provider(), create_judge_llm()
├── reporting/
│   ├── __init__.py         # Exports EvalRecorder, evaluate_and_record
│   ├── models.py           # Dataclasses for JSONL schema
│   ├── recorder.py         # EvalRecorder + evaluate_and_record wrapper
│   ├── reader.py           # JSONL loading
│   ├── formatting.py       # Table rendering (terminal + markdown)
│   ├── comparison.py       # Provider comparison report
│   ├── trending.py         # Score trend report
│   ├── cli.py              # CLI (summary, compare, trend)
│   └── __main__.py         # python -m entry point
├── results/
│   └── eval_history.jsonl  # Persisted eval results (gitignored)
├── mock_k8s/
│   ├── cluster_state.py    # ClusterState dataclass + default data
│   └── mock_client.py      # MockK8sClient (subclasses K8sClient)
├── metrics/
│   └── config.py           # Metric factory functions
└── scenarios/
    ├── test_cluster_exploration.py  # Cluster discovery scenario
    ├── test_training_workflow.py    # Training job creation scenario
    ├── test_model_deployment.py     # Model serving scenario
    ├── test_troubleshooting.py      # Failed job diagnosis scenario
    └── test_tool_discovery.py       # Meta tool usage scenario
```
- **EvalConfig** (`config.py`) loads settings from `RHOAI_EVAL_*` env vars or `.env.eval` using pydantic-settings.
- **MCPHarness** (`mcp_harness.py`) starts the real RHOAI MCP server in-process. In mock mode, it injects a `MockK8sClient` before the server lifespan begins, so all domain logic, plugin loading, and tool registration execute for real — only the K8s API calls are faked. In live mode, it uses the server's normal lifespan with a real cluster connection.
- **Provider abstraction** (`providers/`) decouples the agent loop from any specific LLM SDK. Each provider implements `AgentLLMProvider`, handling tool schema conversion, message formatting, API communication, and conversion back to OpenAI-style dicts for DeepEval. The factory function `create_agent_provider()` dispatches on the configured provider.
- **MCPAgent** (`agent.py`) implements a provider-agnostic agent loop: it calls provider methods to format tools, build messages, send completions, and append results. It records all tool calls and messages in an `AgentResult`, converting messages to OpenAI format via `messages_for_deepeval()` at the end.
- **`deepeval_helpers.py`** converts the `AgentResult` into DeepEval test case objects (`ConversationalTestCase` for multi-turn scenarios, `LLMTestCase` for single-turn), attaching the `MCPServer` tool definitions and `MCPToolCall` records.
- **Metrics** (`metrics/config.py`) wrap DeepEval's built-in MCP metrics with configured thresholds. The `create_judge_llm()` factory creates the appropriate judge LLM based on the configured `eval_provider`.
- **Scenarios** (`scenarios/`) are pytest test classes marked with `@pytest.mark.eval`. Each defines a natural-language `TASK`, runs the agent, builds a DeepEval test case, and asserts that all metrics pass.
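The provider abstraction described above can be sketched roughly as follows. This is an illustrative outline, not the actual `evals/providers/base.py` API — the method names, `ToolCall`, and `CompletionResult` dataclasses here are assumptions about its shape:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ToolCall:
    # Hypothetical record of one tool invocation requested by the model
    name: str
    arguments: dict[str, Any]

@dataclass
class CompletionResult:
    # Hypothetical result of one provider round-trip
    text: str
    tool_calls: list[ToolCall] = field(default_factory=list)

class AgentLLMProvider(ABC):
    """Each SDK-specific provider converts MCP tool schemas and messages
    to its native format, so the agent loop never imports an LLM SDK."""

    @abstractmethod
    def format_tools(self, mcp_tools: list[dict[str, Any]]) -> Any:
        """Convert MCP tool definitions to the SDK's tool schema."""

    @abstractmethod
    def send(self, messages: list[dict[str, Any]], tools: Any) -> CompletionResult:
        """Send a completion request and normalize the response."""

    @abstractmethod
    def messages_for_deepeval(self, messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
        """Convert the conversation back to OpenAI-style dicts for DeepEval."""
```

Because `MCPAgent` only talks to this interface, adding a new provider means implementing one subclass and registering it in the factory.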
```
Scenario TASK ──> MCPAgent.run()
                      │
                      ├──> AgentLLMProvider.send() ──> tool_calls
                      │    (OpenAI / Anthropic / Google)
                      │                                   │
                      ├──< MCPHarness.call_tool() <───────┘
                      │             │
                      │             └──> RHOAI MCP Server ──> MockK8sClient
                      │
                      └──> AgentResult
                               │
                               ├──> deepeval_helpers ──> ConversationalTestCase
                               │                              │
                               └──> DeepEval evaluate() <─────┘
                                        │
                                        └──> Judge LLM scores metrics
```
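The agent side of this flow reduces to a small loop. The sketch below is a simplified stand-in for `MCPAgent.run()` — the names `run_agent`, `provider.send`, and `harness.call_tool`, and the tuple-based return shape, are assumptions for illustration:

```python
def run_agent(task, provider, harness, max_turns=20):
    """Run the agent loop: ask the model, execute requested tools,
    stop when the model returns no tool calls or turns run out."""
    messages = [{"role": "user", "content": task}]
    tool_calls_made = []
    for _ in range(max_turns):
        # Hypothetical provider contract: returns (text, [(name, args), ...])
        text, tool_calls = provider.send(messages)
        messages.append({"role": "assistant", "content": text})
        if not tool_calls:
            return messages, tool_calls_made  # agent considers the task done
        for name, args in tool_calls:
            result = harness.call_tool(name, args)  # real MCP server, mock K8s
            tool_calls_made.append((name, args))
            messages.append({"role": "tool", "name": name, "content": result})
    return messages, tool_calls_made  # turn budget exhausted
```

The recorded `tool_calls_made` and message history are what later become the DeepEval test case.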
| Scenario | File | Task | Metrics |
|---|---|---|---|
| Cluster Exploration | `test_cluster_exploration.py` | Discover projects, running workbenches, and GPU availability | MultiTurnMCPUseMetric, MCPTaskCompletionMetric |
| Training Workflow | `test_training_workflow.py` | Fine-tune Llama 3.1-8B with LoRA: check prerequisites, plan resources, create the job | MultiTurnMCPUseMetric, MCPTaskCompletionMetric |
| Model Deployment | `test_model_deployment.py` | Deploy granite model via vLLM runtime and verify status | MultiTurnMCPUseMetric, MCPTaskCompletionMetric |
| Troubleshooting | `test_troubleshooting.py` | Diagnose why `failed-training-001` failed (OOMKilled) | MultiTurnMCPUseMetric, MCPTaskCompletionMetric |
| Tool Discovery | `test_tool_discovery.py` | Discover which tools to use for project setup with storage and workbench | MCPUseMetric (single-turn) |
When `CLUSTER_MODE=mock`, the `create_default_cluster_state()` function in `evals/mock_k8s/cluster_state.py` pre-populates a realistic RHOAI cluster:

| Resource Type | Name | Namespace | Details |
|---|---|---|---|
| Namespace/Project | `ml-experiments` | — | "ML Experiments" |
| Namespace/Project | `production-models` | — | "Production Models" |
| DataScienceCluster | `default-dsc` | — | All components ready |
| AcceleratorProfile | `nvidia-a100` | — | NVIDIA A100 80GB GPU |
| Notebook (Workbench) | `my-workbench` | `ml-experiments` | Running, Minimal Python image |
| TrainJob (completed) | `llama-finetune-001` | `ml-experiments` | Llama 3.1-8B fine-tune, completed |
| TrainJob (failed) | `failed-training-001` | `ml-experiments` | OOMKilled: GPU out of memory |
| ClusterTrainingRuntime | `torchtune-llama` | — | TorchTune LLaMA runtime |
| TrainingRuntime | `custom-training-runtime` | `ml-experiments` | Custom runtime |
| InferenceService | `granite-serving` | `production-models` | Granite 3B via vLLM, ready |
| ServingRuntime | `vllm-runtime` | `production-models` | vLLM serving runtime |
| DSPA | `dspa-default` | `ml-experiments` | Pipeline server, ready |
| Secret | `aws-connection-models` | `ml-experiments` | S3 data connection |
| PVC | `workbench-storage` | `ml-experiments` | 20Gi, bound |
The `MockK8sClient` subclasses the real `K8sClient` and overrides all methods to return data from this state. This means the MCP server's domain logic runs unmodified — only the underlying K8s API calls are replaced.
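The subclass-and-override pattern looks roughly like this. The class and method names below are simplified stand-ins, not the actual `K8sClient` API surface:

```python
class K8sClient:
    """Stand-in for the real client, whose methods hit a live cluster."""
    def list_namespaces(self) -> list[str]:
        raise RuntimeError("would call the Kubernetes API")

class ClusterState:
    """Stand-in for the ClusterState dataclass holding canned resources."""
    def __init__(self, namespaces: list[str]):
        self.namespaces = namespaces

class MockK8sClient(K8sClient):
    """Serves canned data from ClusterState instead of calling the API."""
    def __init__(self, state: ClusterState):
        self.state = state

    def list_namespaces(self) -> list[str]:
        # Same signature and return shape as the parent, different source
        return list(self.state.namespaces)
```

Because only the client is swapped, any server code written against the `K8sClient` interface runs unchanged against the mock.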
- Create a new file `evals/scenarios/test_<name>.py`:

  ```python
  """Scenario: <Description>.

  <What this scenario tests>.
  """

  from __future__ import annotations

  from typing import TYPE_CHECKING, Any

  import pytest

  from evals.agent import MCPAgent
  from evals.config import EvalConfig
  from evals.deepeval_helpers import build_mcp_server, result_to_conversational_test_case
  from evals.mcp_harness import MCPHarness
  from evals.metrics.config import create_multi_turn_mcp_use_metric, create_task_completion_metric

  if TYPE_CHECKING:
      from collections.abc import Callable

      from evals.agent import AgentResult


  @pytest.mark.eval
  class TestMyScenario:
      """Evaluate agent's ability to <do something>."""

      TASK = (
          "Natural language description of what the agent should accomplish. "
          "Be specific about resource names, namespaces, and expected actions."
      )

      @pytest.mark.eval
      async def test_my_scenario(
          self,
          eval_config: EvalConfig,
          harness: MCPHarness,
          agent: MCPAgent,
          evaluate_and_record: Callable[[str, AgentResult, list[Any], list[Any]], Any],
      ) -> None:
          """Agent should <expected behavior>."""
          result = await agent.run(self.TASK)

          # Basic sanity checks
          tool_names = result.tool_names_used
          assert len(tool_names) > 0, "Agent should call at least one tool"

          # Build DeepEval test case and evaluate
          mcp_server = build_mcp_server(harness)
          test_case = result_to_conversational_test_case(result, mcp_server)
          metrics = [
              create_multi_turn_mcp_use_metric(eval_config),
              create_task_completion_metric(eval_config),
          ]
          eval_result = evaluate_and_record(
              scenario="my_scenario",
              agent_result=result,
              test_cases=[test_case],
              metrics=metrics,
          )

          for metric_result in eval_result.test_results[0].metrics_data:
              assert metric_result.success, (
                  f"Metric {metric_result.metric_name} failed: {metric_result.reason}"
              )
  ```

- If the scenario needs mock data that doesn't exist yet, add resources to `create_default_cluster_state()` in `evals/mock_k8s/cluster_state.py`.
- Run the new scenario:

  ```bash
  make eval-scenario SCENARIO=my_scenario
  ```

The framework uses three DeepEval metrics, created via factory functions in `evals/metrics/config.py`:
**MCPUseMetric** evaluates whether the agent selected and called appropriate MCP tools for a single-turn interaction. The judge LLM scores tool selection against the available tool set. Used by the tool discovery scenario.

**MultiTurnMCPUseMetric** is like MCPUseMetric, but evaluates the full multi-turn conversation, considering the sequence and combination of tool calls across turns. Used by most scenarios.

**MCPTaskCompletionMetric** evaluates whether the agent actually accomplished the task based on the tool call results and final output. It checks not just that the right tools were called, but that the overall task goal was met.

All metrics accept a threshold (0.0-1.0), configurable via `RHOAI_EVAL_MCP_USE_THRESHOLD` and `RHOAI_EVAL_TASK_COMPLETION_THRESHOLD`.
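A factory in this style reads its threshold from the environment and validates the range before constructing the metric. This is a stdlib-only sketch of the pattern; the `Metric` class and `create_metric()` signature are stand-ins, not the actual DeepEval classes or the real `evals/metrics/config.py` API:

```python
import os

class Metric:
    """Stand-in for a DeepEval metric that takes a pass/fail threshold."""
    def __init__(self, name: str, threshold: float):
        self.name = name
        self.threshold = threshold

def create_metric(name: str, env_var: str, default: float) -> Metric:
    """Build a metric whose threshold comes from env_var, else default."""
    threshold = float(os.environ.get(env_var, default))
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"{env_var} must be in [0.0, 1.0], got {threshold}")
    return Metric(name, threshold)
```

A scenario would then call something like `create_metric("MCPUseMetric", "RHOAI_EVAL_MCP_USE_THRESHOLD", 0.5)` and pass the result to the evaluator.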
**OpenAI (default):**

```bash
RHOAI_EVAL_LLM_PROVIDER=openai
RHOAI_EVAL_LLM_MODEL=gpt-4o
RHOAI_EVAL_LLM_API_KEY=sk-...
```

**vLLM:** set the provider to `vllm` and provide the endpoint URL:

```bash
RHOAI_EVAL_LLM_PROVIDER=vllm
RHOAI_EVAL_LLM_MODEL=meta-llama/Llama-3.1-8B-Instruct
RHOAI_EVAL_LLM_API_KEY=token-placeholder
RHOAI_EVAL_LLM_BASE_URL=http://localhost:8000/v1
```

**Azure OpenAI:**

```bash
RHOAI_EVAL_LLM_PROVIDER=azure
RHOAI_EVAL_LLM_MODEL=gpt-4o
RHOAI_EVAL_LLM_API_KEY=your-azure-key
RHOAI_EVAL_LLM_BASE_URL=https://your-resource.openai.azure.com/openai/deployments/gpt-4o
```

**Anthropic:**

```bash
RHOAI_EVAL_LLM_PROVIDER=anthropic
RHOAI_EVAL_LLM_MODEL=claude-sonnet-4-20250514
RHOAI_EVAL_LLM_API_KEY=sk-ant-...
```

**Anthropic on Vertex AI:**

```bash
RHOAI_EVAL_LLM_PROVIDER=anthropic-vertex
RHOAI_EVAL_LLM_MODEL=claude-sonnet-4@20250514
RHOAI_EVAL_VERTEX_PROJECT_ID=my-gcp-project
RHOAI_EVAL_VERTEX_LOCATION=us-east5
```

Authentication uses Application Default Credentials (ADC). Ensure `gcloud auth application-default login` has been run or a service account key is configured.

**Google Gemini:**

```bash
RHOAI_EVAL_LLM_PROVIDER=google-genai
RHOAI_EVAL_LLM_MODEL=gemini-2.0-flash
RHOAI_EVAL_LLM_API_KEY=AIza...
```

**Google Gemini on Vertex AI:**

```bash
RHOAI_EVAL_LLM_PROVIDER=google-vertex
RHOAI_EVAL_LLM_MODEL=gemini-2.0-flash
RHOAI_EVAL_VERTEX_PROJECT_ID=my-gcp-project
RHOAI_EVAL_VERTEX_LOCATION=us-central1
```

Authentication uses Application Default Credentials (ADC).
The judge provider can be different from the agent provider. Set `EVAL_PROVIDER` to control which LLM evaluates the agent:

```bash
# Use Anthropic as the agent, OpenAI as the judge
RHOAI_EVAL_LLM_PROVIDER=anthropic
RHOAI_EVAL_LLM_MODEL=claude-sonnet-4-20250514
RHOAI_EVAL_LLM_API_KEY=sk-ant-...
RHOAI_EVAL_EVAL_PROVIDER=openai
RHOAI_EVAL_EVAL_MODEL=gpt-4o
RHOAI_EVAL_EVAL_API_KEY=sk-...
```

For self-hosted judge endpoints (vLLM, Ollama), set the base URL:

```bash
RHOAI_EVAL_EVAL_PROVIDER=vllm
RHOAI_EVAL_EVAL_MODEL=my-judge-model
RHOAI_EVAL_EVAL_API_KEY=token
RHOAI_EVAL_EVAL_MODEL_BASE_URL=http://localhost:8001/v1
```

The GitHub Actions workflow (`.github/workflows/eval.yml`) runs mock-cluster evals on manual dispatch:
- Trigger: `workflow_dispatch` with `agent_provider`, `agent_model`, `judge_provider`, and `judge_model` inputs
- Defaults: `openai` provider, `gpt-4o-mini` for the agent, `gpt-4o` for the judge
- Required secrets (depend on provider selection):
  - OpenAI/vLLM/Azure: `OPENAI_API_KEY`
  - Anthropic: `ANTHROPIC_API_KEY`
  - Google: `GOOGLE_API_KEY`
  - Vertex AI: `VERTEX_PROJECT_ID`, `VERTEX_LOCATION`
- Output: JUnit XML results, JSONL history, and markdown summary uploaded as artifacts
To trigger manually from the GitHub UI or CLI:

```bash
# Default (OpenAI)
gh workflow run eval.yml --field agent_model=gpt-4o --field judge_model=gpt-4o

# Anthropic agent, OpenAI judge
gh workflow run eval.yml \
  --field agent_provider=anthropic \
  --field agent_model=claude-sonnet-4-20250514 \
  --field judge_provider=openai \
  --field judge_model=gpt-4o
```

Maintainers can trigger evals directly from a pull request by commenting `@run_evals` on the PR. This provides a quick way to validate changes without navigating to the Actions tab.
Who can trigger: repository owners, organization members, and collaborators (based on `author_association`).

What happens:

- The workflow adds an `eyes` reaction to acknowledge the comment
- The PR's head branch is checked out (not the default branch)
- Evals run using `google-genai`/`gemini-2.0-flash` for both the agent and judge LLMs
- Results are posted as a PR comment with the markdown summary table
- A `rocket` reaction is added on success, or `thumbsdown` on failure

Required secret: `GOOGLE_API_KEY` (Gemini API key from Google AI Studio).
Default configuration for `@run_evals`:

| Setting | Value |
|---|---|
| Agent provider | `google-genai` |
| Agent model | `gemini-2.0-flash` |
| Judge provider | `google-genai` |
| Judge model | `gemini-2.0-flash` |
| Cluster mode | `mock` |

The `workflow_dispatch` trigger remains available for full provider/model flexibility.
Eval results are persisted across CI runs using GitHub Actions cache:
- Before evals run, the workflow restores `evals/results/eval_history.jsonl` from cache using the key pattern `eval-results-{branch}-{run_id}`, falling back to `eval-results-{branch}-` (latest from the same branch), then `eval-results-main-` (baseline from main).
- During evals, each scenario appends a JSONL record to the history file automatically via the `evaluate_and_record` fixture.
- After evals, the workflow generates a summary report and score trend table, posting both to the GitHub Actions step summary. The updated JSONL file is saved back to cache via the `actions/cache` post-action.
This means PR runs can see and compare against results from previous main branch runs, making regressions immediately visible in the step summary.
Eval results are automatically recorded to `evals/results/eval_history.jsonl` during each run. Each scenario produces one JSONL line containing the run ID, git metadata, environment config, metric scores, and timing data.
Three Make targets provide terminal reports:

```bash
# Summary of the latest eval run (all scenarios in a table)
make eval-report

# Compare scores across different providers/models
make eval-compare

# Show score trends over time
make eval-trend
```

The reporting CLI supports additional filtering and formatting:
```bash
# Summary for a specific run ID
uv run --group eval python -m evals.reporting.cli summary --run-id a1b2c3d4e5f6

# Compare a specific scenario across providers
uv run --group eval python -m evals.reporting.cli compare --scenario cluster_exploration

# Trend for a specific provider, last 5 records, markdown output
uv run --group eval python -m evals.reporting.cli trend --provider openai/gpt-4o --last 5 --format markdown
```

Each line in `eval_history.jsonl` is a self-contained JSON object:
```json
{
  "run_id": "a1b2c3d4e5f6",
  "timestamp": "2026-02-18T14:30:00+00:00",
  "scenario": "cluster_exploration",
  "git": {"commit": "b9d6777", "branch": "main"},
  "environment": {
    "llm_provider": "openai", "llm_model": "gpt-4o",
    "eval_provider": "openai", "eval_model": "gpt-4o",
    "cluster_mode": "mock",
    "mcp_use_threshold": 0.5, "task_completion_threshold": 0.6,
    "max_agent_turns": 20
  },
  "metrics": [
    {"name": "MultiTurnMCPUseMetric", "score": 0.85, "success": true, "threshold": 0.5, "reason": "..."}
  ],
  "turns": 5,
  "tool_names_used": ["list_projects", "list_workbenches"],
  "passed": true,
  "duration_seconds": 12.3
}
```

The file is append-only and gitignored locally. No external dependencies are required for reporting — all formatting uses stdlib only.
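Because each line is self-contained JSON, ad-hoc analysis needs nothing beyond the stdlib. The snippet below, in the spirit of `evals/reporting/reader.py` (whose actual API may differ), loads every record and computes the mean metric score per scenario:

```python
import json
from collections import defaultdict
from pathlib import Path

def mean_scores(history_path: str) -> dict[str, float]:
    """Average all metric scores per scenario across the JSONL history."""
    totals: dict[str, list[float]] = defaultdict(list)
    for line in Path(history_path).read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines in an append-only file
        record = json.loads(line)
        for metric in record.get("metrics", []):
            totals[record["scenario"]].append(metric["score"])
    return {scenario: sum(v) / len(v) for scenario, v in totals.items()}
```

Running `mean_scores("evals/results/eval_history.jsonl")` after a few runs gives a quick per-scenario baseline without invoking the reporting CLI.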
**`openai.AuthenticationError: Error code: 401`**

Ensure `RHOAI_EVAL_LLM_API_KEY` and `RHOAI_EVAL_EVAL_API_KEY` are set in `.env.eval` or the environment. For Anthropic, the key should start with `sk-ant-`. For Google, use a Gemini API key.

**`NotFoundError: <resource type> '<name>' not found`**

The agent asked for a resource that doesn't exist in the mock cluster state. If this is expected for your scenario, add the resource to `create_default_cluster_state()` in `evals/mock_k8s/cluster_state.py`.

**Agent reached maximum turns (20) without completing the task**

The agent couldn't finish within the turn limit. Try increasing `RHOAI_EVAL_MAX_AGENT_TURNS` or simplifying the task. This may also indicate the agent is stuck in a loop, calling the same tools repeatedly.

**`Metric MCPTaskCompletionMetric failed: <reason>`**

The judge LLM determined the agent didn't complete the task successfully. Check the `reason` field for details. You can lower the threshold temporarily to see partial scores:

```bash
RHOAI_EVAL_TASK_COMPLETION_THRESHOLD=0.3 make eval
```

**`openai.APIConnectionError: Connection error.`**

Verify your vLLM endpoint is running and accessible at the URL specified in `RHOAI_EVAL_LLM_BASE_URL`. The URL should include `/v1` (e.g., `http://localhost:8000/v1`).
For `anthropic-vertex` and `google-vertex` providers, authentication uses Google Cloud Application Default Credentials. Ensure you have authenticated:

```bash
gcloud auth application-default login
```

Or set a service account key:

```bash
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
```

For more detail on what the agent is doing, enable debug logging:

```bash
RHOAI_EVAL_LLM_MODEL=gpt-4o uv run --group eval pytest evals/ -v -m "eval and not live" --tb=long -s --log-cli-level=DEBUG
```