
Conversation

@mainred (Collaborator) commented Dec 29, 2025

Background: when HolmesGPT is deployed as a pod in a Kubernetes cluster, we want to share the model configuration via /etc/holmes/config/model_list.yaml and the toolset (MCP) configuration via /etc/holmes/config/custom_toolset.yaml.

Two main changes are included:

  • Assign the MCP type to MCP servers when they are loaded from YAML, so an MCP server can be loaded and used without extra changes (see the sketch after this list)
  • Share MODEL_LIST_FILE_LOCATION, which is used in server mode
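
For illustration, here is a minimal sketch of the idea behind the first change, assuming a hypothetical /etc/holmes/config/custom_toolset.yaml layout; the mcp_servers field names (description, url) and the literal "mcp" type string are assumptions, not the actual Holmes schema:

import yaml  # pip install pyyaml

# Hypothetical config content; only the top-level toolsets/mcp_servers split
# mirrors the PR description, everything else is illustrative.
EXAMPLE_CONFIG = """
toolsets:
  kubernetes/logs:
    enabled: true
mcp_servers:
  my_mcp_server:
    description: "Example MCP server"
    url: "http://mcp-server.default.svc.cluster.local:8000"
"""

def tag_mcp_entries(raw_yaml: str) -> dict:
    # Parse the YAML and mark every mcp_servers entry with an "mcp" type so
    # it can flow through the same loading path as ordinary toolsets.
    parsed = yaml.safe_load(raw_yaml) or {}
    toolsets = dict(parsed.get("toolsets") or {})
    for name, definition in (parsed.get("mcp_servers") or {}).items():
        definition["type"] = "mcp"  # stands in for ToolsetType.MCP
        toolsets[name] = definition
    return toolsets

print(tag_mcp_entries(EXAMPLE_CONFIG))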

Summary by CodeRabbit

  • New Features

    • MCP servers can now be configured in YAML toolset files
    • Loaded toolsets now track their source file path
  • Chores

    • Expanded publicly available toolset classes for integration support
    • Updated test utilities to support MCP toolset validation


github-actions bot (Contributor) commented Dec 29, 2025

Results of HolmesGPT evals

Duration: 4m 28s | View workflow logs

  • ask_holmes: 9/9 test cases were successful, 0 regressions
Test suite | Test case | Status
ask | 09_crashpod | ✅
ask | 101_loki_historical_logs_pod_deleted | ✅
ask | 111_pod_names_contain_service | ✅
ask | 12_job_crashing | ✅
ask | 162_get_runbooks | ✅
ask | 176_network_policy_blocking_traffic_no_runbooks | ✅
ask | 24_misconfigured_pvc | ✅
ask | 43_current_datetime_from_prompt | ✅
ask | 61_exact_match_counting | ✅

Legend

  • ✅ the test was successful
  • ➖ the test was skipped
  • ⚠️ the test failed but is known to be flaky or known to fail
  • 🚧 the test had a setup failure (not a code regression)
  • 🔧 the test failed due to mock data issues (not a code regression)
  • 🚫 the test was throttled by API rate limits/overload
  • ❌ the test failed and should be fixed before merging the PR

🔄 Re-run evals manually

Option 1: Comment on this PR with /eval:

/eval

Or with options (one per line):

/eval
model: gpt-4o
filter: 09_crashpod
iterations: 5
Option | Description
model | Model(s) to test (default: same as automatic runs)
markers | Pytest markers (default: regression)
filter | Pytest -k filter
iterations | Number of runs, max 10

Option 2: Trigger via GitHub Actions UI → "Run workflow"

coderabbitai bot (Contributor) commented Dec 29, 2025

Walkthrough

The PR introduces MCP (Model Context Protocol) server support to the toolset loading system, allowing toolsets to be loaded from YAML files with an mcp_servers block. It adds path attributes to loaded toolsets, expands public exports with additional toolset classes, and adjusts test mocking to permit MCP-type toolsets.

Changes

  • MCP Toolset Loading & Exports (holmes/plugins/toolsets/__init__.py)
    Adds MCP server recognition in load_toolsets_from_file() to parse the mcp_servers block and inject each entry as a toolset with type=ToolsetType.MCP. Sets a path attribute on all loaded toolsets pointing to the source YAML file. Adds public imports for DatadogGeneralToolset, DatadogTracesToolset, CoreInvestigationToolset, and relocates GrafanaToolset. Reorders USE_LEGACY_KUBERNETES_LOGS in env var imports.
  • Test Validation & Infrastructure (tests/llm/utils/mock_toolset.py)
    Modifies toolset validation to permit MCP-type toolsets referencing non-builtin names by adding a definition.type != ToolsetType.MCP condition before raising the validation error. Reorganizes imports (threading, urllib, Path, ToolsetType, load_mock_dal).
  • Test Import Reorganizations (tests/llm/utils/test_case_utils.py, tests/llm/utils/test_env_vars.py)
    Reorganizes imports in test_case_utils.py to explicitly include yaml, pydantic utilities, and new references (Config, MODEL_LIST_FILE_LOCATION, DefaultLLM, ResourceInstructions, CLASSIFIER_MODEL, MODEL). Removes the MODEL_LIST_FILE_LOCATION environment variable definition from test_env_vars.py.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • moshemorad
  • RoiGlinik
  • arikalon1

Pre-merge checks

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 50.00%, which is insufficient; the required threshold is 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Title check (❓ Inconclusive): The title mentions "support running evals in containerized holmesgpt", but the actual changes focus on supporting MCP servers in YAML configuration and sharing model/toolset files for containerized deployments, not specifically evaluations. Resolution: consider revising the title to reflect the actual focus (supporting containerized deployment with shared model/toolset configuration files), or clarify whether "evals" refers to a specific internal concept.

✅ Passed checks (1 passed)

  • Description Check (✅ Passed): Check skipped because CodeRabbit’s high-level summary is enabled.

Comment @coderabbitai help to get the list of available commands and usage tips.

github-actions bot (Contributor) commented Dec 29, 2025

Docker image ready for 643e773 (built in 4m 7s)

⚠️ Warning: does not support ARM (ARM images are built on release only - not on every PR)

Use this tag to pull the image for testing.

📋 Copy commands

⚠️ Temporary images are deleted after 30 days. Copy to a permanent registry before using them:

gcloud auth configure-docker us-central1-docker.pkg.dev
docker pull us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:643e773
docker tag us-central1-docker.pkg.dev/robusta-development/temporary-builds/holmes:643e773 me-west1-docker.pkg.dev/robusta-development/development/holmes-dev:643e773
docker push me-west1-docker.pkg.dev/robusta-development/development/holmes-dev:643e773

Patch Helm values in one line (choose the chart you use):

HolmesGPT chart:

helm upgrade --install holmesgpt ./helm/holmes \
  --set registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set image=holmes-dev:643e773

Robusta wrapper chart:

helm upgrade --install robusta robusta/robusta \
  --reuse-values \
  --set holmes.registry=me-west1-docker.pkg.dev/robusta-development/development \
  --set holmes.image=holmes-dev:643e773

from holmes.plugins.toolsets.azure_sql.azure_sql_toolset import AzureSQLToolset
from holmes.plugins.toolsets.bash.bash_toolset import BashExecutorToolset
from holmes.plugins.toolsets.coralogix.toolset_coralogix import CoralogixToolset
from holmes.plugins.toolsets.datadog.toolset_datadog_general import (
@mainred (Collaborator, Author) commented:
These package import-sequence changes can be ignored if we introduce #1252.

@mainred changed the title from "Chore: support running evals in containerized holmesgpt" to "test: support running evals in containerized holmesgpt" on Dec 29, 2025
coderabbitai bot (Contributor) left a comment
Actionable comments posted: 0

🧹 Nitpick comments (1)
holmes/plugins/toolsets/__init__.py (1)

149-155: Minor typos in comments.

The logic is correct—builtin toolsets are properly typed and their paths are cleared to avoid exposing internal paths.

🔎 Proposed fix for typos
     # disable built-in toolsets by default, and the user can enable them explicitly in config.
     for toolset in all_toolsets:
-        # It's safe to set type here as we don't have mcp build-in toolsets.
+        # It's safe to set type here as we don't have mcp built-in toolsets.
         toolset.type = ToolsetType.BUILTIN
-        # dont' expose build-in toolsets path
+        # don't expose built-in toolsets path
         toolset.path = None
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e6ae36b and b281711.

📒 Files selected for processing (4)
  • holmes/plugins/toolsets/__init__.py
  • tests/llm/utils/mock_toolset.py
  • tests/llm/utils/test_case_utils.py
  • tests/llm/utils/test_env_vars.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Use Ruff for formatting and linting (configured in pyproject.toml)
Type hints required (mypy configuration in pyproject.toml)
ALWAYS place Python imports at the top of the file, not inside functions or methods

Files:

  • tests/llm/utils/test_case_utils.py
  • tests/llm/utils/mock_toolset.py
  • tests/llm/utils/test_env_vars.py
  • holmes/plugins/toolsets/__init__.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Tests: match source structure under tests/

Files:

  • tests/llm/utils/test_case_utils.py
  • tests/llm/utils/mock_toolset.py
  • tests/llm/utils/test_env_vars.py
holmes/plugins/toolsets/**

📄 CodeRabbit inference engine (CLAUDE.md)

Toolsets: organize as holmes/plugins/toolsets/{name}.yaml or {name}/ directories

Files:

  • holmes/plugins/toolsets/__init__.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-29T08:35:37.668Z
Learning: New toolsets require integration tests
📚 Learning: 2025-12-29T08:35:37.668Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-29T08:35:37.668Z
Learning: New toolsets require integration tests

Applied to files:

  • tests/llm/utils/mock_toolset.py
📚 Learning: 2025-12-29T08:35:37.668Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-29T08:35:37.668Z
Learning: Applies to holmes/plugins/toolsets/** : Toolsets: organize as `holmes/plugins/toolsets/{name}.yaml` or `{name}/` directories

Applied to files:

  • tests/llm/utils/mock_toolset.py
  • holmes/plugins/toolsets/__init__.py
📚 Learning: 2025-12-29T08:35:37.668Z
Learnt from: CR
Repo: HolmesGPT/holmesgpt PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-29T08:35:37.668Z
Learning: Applies to holmes/plugins/toolsets/**/*.yaml : All toolsets MUST return detailed error messages from underlying APIs to enable LLM self-correction, including exact query/command executed, time ranges and parameters used, and full API error response

Applied to files:

  • tests/llm/utils/mock_toolset.py
🧬 Code graph analysis (3)
tests/llm/utils/test_case_utils.py (6)
holmes/config.py (1)
  • Config (46-526)
holmes/core/llm.py (2)
  • DefaultLLM (138-527)
  • models (681-683)
holmes/core/models.py (2)
  • InvestigateRequest (94-107)
  • WorkloadHealthRequest (219-229)
holmes/core/prompt.py (1)
  • append_file_to_user_prompt (9-13)
holmes/core/resource_instruction.py (1)
  • ResourceInstructions (15-17)
tests/llm/utils/constants.py (1)
  • get_allowed_tags_list (47-49)
tests/llm/utils/mock_toolset.py (4)
holmes/core/tools.py (1)
  • ToolsetType (152-155)
holmes/plugins/toolsets/__init__.py (2)
  • load_builtin_toolsets (126-156)
  • load_toolsets_from_file (57-79)
tests/llm/utils/mock_dal.py (1)
  • load_mock_dal (225-254)
tests/llm/utils/test_case_utils.py (1)
  • HolmesTestCase (113-144)
holmes/plugins/toolsets/__init__.py (3)
holmes/core/tools.py (2)
  • Toolset (542-786)
  • ToolsetType (152-155)
holmes/plugins/toolsets/datadog/toolset_datadog_general.py (1)
  • DatadogGeneralToolset (198-291)
holmes/plugins/toolsets/investigator/core_investigation.py (1)
  • CoreInvestigationToolset (134-154)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: build
  • GitHub Check: llm_evals
  • GitHub Check: build (3.12)
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.10)
🔇 Additional comments (8)
tests/llm/utils/test_env_vars.py (1)

1-6: LGTM!

Removing MODEL_LIST_FILE_LOCATION from this test environment-variables file and relocating its import to test_case_utils.py (from holmes.core.llm) aligns with the PR objective of exposing and sharing this constant for containerized deployments.

tests/llm/utils/test_case_utils.py (3)

8-18: LGTM!

The import reorganization properly centralizes MODEL_LIST_FILE_LOCATION from holmes.core.llm, enabling shared model configuration for containerized deployments. The additional imports (Config, DefaultLLM, ResourceInstructions) support the new model list handling functions added later in the file.


37-84: LGTM!

The model list handling functions are well-structured:

  • _model_list_exists() provides clear warning when the env var is set but file doesn't exist
  • _get_models_from_model_list() properly falls back when model list is unavailable
  • _filter_models_from_env() validates that requested models exist with helpful error messaging
  • get_models() correctly requires CLASSIFIER_MODEL when multiple models are specified
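
A rough sketch of how these helpers might fit together, assuming the environment-variable names mentioned in the review (MODEL_LIST_FILE_LOCATION, MODEL, CLASSIFIER_MODEL); the real functions in test_case_utils.py may have different signatures and behavior, and the fallback default model below is illustrative only:

import logging
import os

import yaml  # pip install pyyaml


def _model_list_exists() -> bool:
    # Warn when the env var points at a missing file; otherwise report
    # whether a shared model list is available at all.
    path = os.environ.get("MODEL_LIST_FILE_LOCATION")
    if path and not os.path.isfile(path):
        logging.warning("MODEL_LIST_FILE_LOCATION is set but %s does not exist", path)
        return False
    return bool(path)


def _get_models_from_model_list() -> list[str]:
    # Assumes model_list.yaml is a mapping of model name -> config.
    with open(os.environ["MODEL_LIST_FILE_LOCATION"]) as f:
        return list((yaml.safe_load(f) or {}).keys())


def _filter_models_from_env(available: list[str]) -> list[str]:
    requested = [m for m in os.environ.get("MODEL", "").split(",") if m]
    missing = [m for m in requested if m not in available]
    if missing:
        raise ValueError(f"requested models {missing} are not in the model list {available}")
    return requested or available


def get_models() -> list[str]:
    if _model_list_exists():
        models = _filter_models_from_env(_get_models_from_model_list())
    else:
        models = [m for m in os.environ.get("MODEL", "").split(",") if m] or ["gpt-4o"]  # illustrative default
    if len(models) > 1 and not os.environ.get("CLASSIFIER_MODEL"):
        raise ValueError("CLASSIFIER_MODEL must be set when multiple models are specified")
    return models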

549-553: LGTM!

The create_eval_llm function cleanly delegates to Config._get_llm when the model list file exists, enabling containerized deployments to use shared model configuration, while providing a simple fallback for standalone usage.

tests/llm/utils/mock_toolset.py (2)

7-28: LGTM!

The import additions (threading, urllib, Path, ToolsetType, load_mock_dal) support the MCP toolset handling and are properly organized at the top of the file per coding guidelines.


704-710: LGTM!

The validation logic correctly allows MCP toolsets to bypass the builtin name check. Since MCP toolsets are external servers (not references to builtin toolsets), they should be permitted regardless of whether their names match existing builtins.
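
A simplified sketch of the guard this comment refers to; the enum values, function name, and exception type below are stand-ins for illustration, not the actual mock_toolset.py code:

from enum import Enum


class ToolsetType(str, Enum):
    # Reduced stand-in for holmes.core.tools.ToolsetType; values are assumed.
    BUILTIN = "built_in"
    MCP = "mcp"


def check_toolset_name(name: str, definition_type: ToolsetType, builtin_names: set[str]) -> None:
    # Reject unknown names only when the definition is not an MCP server,
    # since MCP servers are external and need not match a builtin toolset.
    if name not in builtin_names and definition_type != ToolsetType.MCP:
        raise ValueError(f"toolset {name!r} does not reference a builtin toolset")


check_toolset_name("my_mcp_server", ToolsetType.MCP, {"kubernetes/logs"})  # passes silently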

holmes/plugins/toolsets/__init__.py (2)

20-36: LGTM!

The new imports for DatadogGeneralToolset, DatadogTracesToolset, GrafanaToolset, and CoreInvestigationToolset are correctly organized at the top of the file and align with the toolset ecosystem expansion.


68-78: Approve with minor note on potential name collision.

The MCP server loading logic correctly:

  1. Reads the mcp_servers block from YAML
  2. Assigns the ToolsetType.MCP type to each entry
  3. Merges them into toolsets_dict for unified processing
  4. Sets the source path for traceability

Note that if an MCP server name matches an entry in toolsets, the MCP server will silently overwrite it. This may be intentional (allowing MCP to take precedence), but if explicit warnings for collisions are desired, consider logging a warning when name is already present in toolsets_dict before the assignment, as in the sketch below.
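
A minimal sketch of that warning, with variable names mirroring the comment above rather than the actual load_toolsets_from_file code:

import logging
from typing import Any


def merge_mcp_servers(
    toolsets_dict: dict[str, Any],
    mcp_servers: dict[str, Any],
    path: str,
) -> dict[str, Any]:
    # Log a warning before an mcp_servers entry silently replaces a toolset
    # of the same name parsed from the toolsets block.
    for name, definition in mcp_servers.items():
        if name in toolsets_dict:
            logging.warning("MCP server %r overrides a toolset of the same name defined in %s", name, path)
        toolsets_dict[name] = definition
    return toolsets_dict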

THIS_DIR = os.path.abspath(os.path.dirname(__file__))


def load_toolsets_from_file(
A collaborator commented:
There are two implementations that load toolset files: one in the toolset_manager, which also loads MCP servers, and this one, which until now has only been used to load builtin toolsets.

I think we should either implement the toolset file loading once and reuse it, or keep them separate and not introduce the MCP logic into the builtin loading function (see the sketch below for the first option).
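
One possible shape of a shared parser that both call sites could reuse; the function and parameter names here are hypothetical, and the "mcp" string stands in for ToolsetType.MCP:

from typing import Any

import yaml  # pip install pyyaml


def parse_toolsets_file(path: str, include_mcp: bool = False) -> dict[str, Any]:
    # Shared parse step: builtin loading would call this with include_mcp=False,
    # while the toolset_manager path would pass include_mcp=True.
    with open(path) as f:
        parsed = yaml.safe_load(f) or {}
    toolsets: dict[str, Any] = dict(parsed.get("toolsets") or {})
    if include_mcp:
        for name, definition in (parsed.get("mcp_servers") or {}).items():
            definition["type"] = "mcp"
            toolsets[name] = definition
    return toolsets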

@mainred (Collaborator, Author) replied:
OK, I was trying to merge the toolset loading for both places, but the one in toolset_manager requires built_toolsets as context.
I can revert the change in load_toolsets_from_file since it is used specifically for built-in toolsets.
