Commit d251274

Release v0.2.1: fix silent skip in Bedrock provider validator
The validator in handle_run_evaluation only scanned the .py task file for provider IDs, but pipeline evals keep model IDs in the sibling .json config and only reference them via CONFIG[...] in the .py. As a result, validation found no providers, silently skipped, and invalid Bedrock IDs flowed through to Inspect, wasting samples on 0-token responses before failing.

Now scans both the .py and the .json, so invalid IDs short-circuit with "Invalid model ID" before any sample runs.

Also expands docs/DEVELOPMENT.md with the actual MCP dev workflow: editable install + Claude Code integration test as the primary loop, with pytest as a narrow supplement for deterministic logic.
1 parent f81d625 commit d251274

4 files changed

Lines changed: 68 additions & 21 deletions

docs/DEVELOPMENT.md

Lines changed: 54 additions & 9 deletions
@@ -4,7 +4,14 @@ Technical reference for developers working on the LLM Evaluation Platform.

 ## Working on the MCP locally

-If you're changing eval_mcp code (tools, server, viewer) and want to run your changes against your own IDE:
+If you're changing eval_mcp code (tools, server, viewer) and want to run your changes against your own IDE.
+
+### Dev loop at a glance
+
+End-users install via `uvx --from llm-evaluation-system@latest eval-mcp` — do **not** iterate against that, it pulls the published package. For development, use these two layers:
+
+1. **pytest** against narrow, deterministic logic (regex, parsing, validation) — milliseconds, catches a specific class of regression cheaply. Not a substitute for end-to-end coverage because the tools spawn real subprocesses, call Bedrock, and write user dirs — mocks give false green builds.
+2. **Claude Code / Desktop** pointed at your local build — the primary integration test. This is the ground truth: you exercise the exact same path users will get when you publish. Plan for this to be the main way you verify changes.

 ### 1. Clone and install editable

@@ -17,24 +24,49 @@ uv pip install -e .

 `-e` installs editable — edits in `eval_mcp/` take effect on next MCP restart without reinstalling.

-### 2. Point your IDE at the local build
+### 2. Pytest for narrow deterministic logic (optional supplement)
+
+For pure functions (regex parsing, validation, config munging), a handler-level pytest is a cheap safety net. The official MCP servers repo (`github.com/modelcontextprotocol/servers`) imports handlers directly — no server boot, no transport mocking. Same pattern here:
+
+```python
+# tests/test_run_eval.py
+from eval_mcp.tools.run_eval import _validate_providers
+
+async def test_invalid_bedrock_id_fails_validation():
+    result = await _validate_providers(["bedrock/nonexistent-model"])
+    assert result["valid"] is False
+    assert "Invalid model ID" in result["failed_providers"][0]["error"]
+```
+
+Run with `.venv/bin/pytest tests/`. Useful for pinning down specific regressions after you hit them — not useful as a proof that a feature works end-to-end.
+
+### 3. Point Claude Code at your local build (primary integration test)

-Instead of the uvx snippet from the README, use the venv's direct binary so your IDE runs the code you're editing:
+This is how you test what you're about to push. Use the venv's direct binary so your IDE runs the code you're editing:
+
+**a. Edit `~/.claude.json`.** Find the `"eval"` MCP entry and replace its `command` / `args` so it points at your editable install:

 ```json
 {
   "mcpServers": {
     "eval": {
+      "type": "stdio",
       "command": "/absolute/path/to/llm-evaluation-system/.venv/bin/eval-mcp",
-      "timeout": 120000
+      "env": {}
     }
   }
 }
 ```

-Restart the IDE after every meaningful change (MCP tools are loaded at session start).
+**b. Reload the MCP.** In Claude Code: `/mcp` → disconnect `eval` → reconnect. If tools don't refresh, fully restart the IDE (MCP tools are loaded at session start).
+
+**c. Exercise the change.** Call the tool you just edited through Claude Code and verify the behavior you expect. This is the exact same code path users will hit after publish — no mocks, no shortcuts.
+
+**d. After every edit** to `eval_mcp/*.py`, repeat step b (the editable install picks up source changes, but the running MCP process doesn't hot-reload — you have to reconnect to restart it).
+
+**e. When you're done developing**, restore the original `uvx --from llm-evaluation-system@latest` block in `~/.claude.json` so your normal use is back on the published version.

-### 3. Rebuild the viewer frontend
+### 4. Rebuild the viewer frontend

 The viewer is a pre-built Next.js export served by the Python viewer app. When you change `frontend/` source:

@@ -46,14 +78,20 @@ npm run build:viewer

 This compiles the frontend and copies the static output into `eval_mcp/viewer_static/`. The viewer picks it up on next `eval-mcp view`.

-### 4. Running evals against your changes
+### 5. Running evals against your changes

 ```bash
 .venv/bin/eval-mcp view # results viewer on :4001
 .venv/bin/inspect eval <task.py> # run an eval directly via Inspect AI CLI
 ```

-### 5. Publishing a new version
+### 6. Publishing a new version
+
+Users pin to `@latest`, so every publish is immediately live for every user. Treat it accordingly:
+
+- Every main-branch commit should be releasable. Land behind a CI gate (pytest + lint + type-check).
+- Bump the SemVer appropriately: patch for bug fixes, minor for additive changes, major for breaking tool-signature changes.
+- Update the release notes / changelog entry before tagging.

 ```bash
 # Bump version in pyproject.toml, then
@@ -67,7 +105,14 @@ Verify from a clean venv:
 uvx --refresh --from 'llm-evaluation-system==<new-version>' eval-mcp --help
 ```

-### 6. Running the MCP in Docker (optional)
+### 7. Adding a new tool (checklist)
+
+1. Implement the handler in `eval_mcp/tools/<name>.py` as an async function.
+2. Register it in `eval_mcp/server.py` with a typed signature. The docstring is the tool description the LLM sees — keep it specific about expected ID formats, required prerequisites, and failure modes.
+3. (Optional) Write a pytest case for any narrow deterministic logic (parsing, validation) so regressions get caught cheaply next time.
+4. Point Claude Code at your local build (see section 3) and exercise the tool end-to-end through the IDE before publishing.
+
+### 8. Running the MCP in Docker (optional)

 The repo root `Dockerfile` builds a slim container that runs `eval-mcp serve` as an HTTP MCP (for self-hosting on EC2/ECS/AgentCore). Local dev rarely needs this — use the editable install above.
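
Steps 3a and 3e of the updated DEVELOPMENT.md (switch the `"eval"` entry in `~/.claude.json` to the editable install, then restore the published block) could be scripted roughly as below. This is a minimal sketch and not part of the commit. The helper names `use_local_build` and `use_published` are hypothetical. The shape of `PUBLISHED_ENTRY` is an assumption inferred from the `uvx --from llm-evaluation-system@latest eval-mcp` command quoted in the docs, and the sketch assumes the entry lives under a top-level `mcpServers` key. Check both against your own config before relying on it.

```python
import copy
import json

# Assumed shape of the published entry, inferred from the uvx command quoted
# in the docs. The real block in your ~/.claude.json may differ.
PUBLISHED_ENTRY = {
    "type": "stdio",
    "command": "uvx",
    "args": ["--from", "llm-evaluation-system@latest", "eval-mcp"],
    "env": {},
}


def use_local_build(config: dict, binary: str) -> dict:
    """Return a copy of config whose "eval" server runs the editable install."""
    updated = copy.deepcopy(config)
    servers = updated.setdefault("mcpServers", {})
    # Mirrors the JSON snippet in the docs: stdio transport, direct venv binary.
    servers["eval"] = {"type": "stdio", "command": binary, "env": {}}
    return updated


def use_published(config: dict) -> dict:
    """Return a copy of config whose "eval" server is back on the uvx entry."""
    updated = copy.deepcopy(config)
    updated.setdefault("mcpServers", {})["eval"] = copy.deepcopy(PUBLISHED_ENTRY)
    return updated


if __name__ == "__main__":
    # Hypothetical starting config with one unrelated server.
    config = {"mcpServers": {"other": {"command": "other-mcp"}}}
    local = use_local_build(
        config, "/absolute/path/to/llm-evaluation-system/.venv/bin/eval-mcp"
    )
    print(json.dumps(local, indent=2))
```

Both helpers work on plain dicts and return copies, so reading and writing the actual file stays a separate, explicit step. Other `mcpServers` entries are left untouched.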

eval_mcp/tools/run_eval.py

Lines changed: 12 additions & 10 deletions
@@ -260,17 +260,19 @@ async def handle_run_evaluation(args: Dict[str, Any]) -> List[TextContent]:
             )
         ]

-    # Read the task file to extract provider list for validation
+    # Read the task file (and sibling JSON config, if present) to extract
+    # provider list for validation. Pipeline evals keep model IDs in the
+    # JSON config and only reference them via CONFIG[...] in the .py — so
+    # scanning just the .py misses them and validation silently skips.
     try:
-        task_content = Path(task_file).read_text()
-        # Extract providers from the task file for validation
-        providers = []
-        for line in task_content.split("\n"):
-            if "bedrock/" in line and '"' in line:
-                # Extract model IDs from lines like: "bedrock/us.anthropic.claude-..."
-                import re as _re
-                matches = _re.findall(r'"(bedrock/[^"]+)"', line)
-                providers.extend(matches)
+        sources = [Path(task_file).read_text()]
+        json_config = Path(task_file).with_suffix(".json")
+        if json_config.exists():
+            sources.append(json_config.read_text())
+
+        import re as _re
+        provider_pattern = _re.compile(r'"(bedrock/[^"]+)"')
+        providers = list({m for src in sources for m in provider_pattern.findall(src)})

         if providers:
             validation = await _validate_providers(providers)
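
The extraction step in this hunk is the kind of narrow, deterministic logic the updated docs suggest covering with pytest, since it needs no Bedrock call. Below is a minimal standalone sketch of the same idea, not the project's code. The function name `extract_bedrock_providers`, the `demo` helper, and the model ID `bedrock/example-model-a` are hypothetical. It uses the same regex as the commit, with one deliberate difference: duplicates are removed with `dict.fromkeys`, which keeps first-seen order, instead of a set.

```python
import re
import tempfile
from pathlib import Path

# Same pattern as the commit: a double-quoted string starting with "bedrock/".
PROVIDER_PATTERN = re.compile(r'"(bedrock/[^"]+)"')


def extract_bedrock_providers(task_file: str) -> list[str]:
    """Collect Bedrock model IDs from a task .py and its sibling .json, if any."""
    task_path = Path(task_file)
    sources = [task_path.read_text()]

    # Pipeline evals keep model IDs in <task>.json next to <task>.py.
    json_config = task_path.with_suffix(".json")
    if json_config.exists():
        sources.append(json_config.read_text())

    # dict.fromkeys de-duplicates while keeping first-seen order, so the
    # result is deterministic (a plain set does not guarantee ordering).
    found = [m for src in sources for m in PROVIDER_PATTERN.findall(src)]
    return list(dict.fromkeys(found))


def demo() -> list[str]:
    """Build a throwaway pipeline-style eval and extract its provider IDs."""
    with tempfile.TemporaryDirectory() as tmp:
        task = Path(tmp) / "pipeline_eval.py"
        # The .py only references the config, so it holds no literal IDs.
        task.write_text('MODEL = CONFIG["model"]\n')
        task.with_suffix(".json").write_text(
            '{"model": "bedrock/example-model-a", "judge": "bedrock/example-model-a"}'
        )
        return extract_bedrock_providers(str(task))


if __name__ == "__main__":
    print(demo())
```

Scanning only the `.py` in this demo would return an empty list, which is the silent-skip case the commit message describes. With the sibling `.json` included, the single ID is found once.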

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [project]
 name = "llm-evaluation-system"
-version = "0.2.0"
+version = "0.2.1"
 description = "MCP server for agentic LLM evaluation: jury scoring, agent tracing via OpenTelemetry, document-grounded QA generation, PDF reports."
 readme = "README.md"
 requires-python = ">=3.12,<3.15"
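
The change from 0.2.0 to 0.2.1 is a patch bump, which matches the rule added to docs/DEVELOPMENT.md in this commit (patch for bug fixes, minor for additive changes, major for breaking tool-signature changes). A minimal sketch of that rule follows, assuming plain `MAJOR.MINOR.PATCH` strings with no pre-release or build suffix. The helper `bump_version` is hypothetical and is not part of the repository.

```python
import re

# Plain MAJOR.MINOR.PATCH only; pre-release and build metadata are out of scope.
_SEMVER = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def bump_version(version: str, part: str) -> str:
    """Return version with the given part ("major", "minor", "patch") bumped."""
    match = _SEMVER.fullmatch(version)
    if match is None:
        raise ValueError(f"not a plain MAJOR.MINOR.PATCH version: {version!r}")
    major, minor, patch = (int(group) for group in match.groups())

    if part == "major":  # breaking tool-signature change
        return f"{major + 1}.0.0"
    if part == "minor":  # additive change
        return f"{major}.{minor + 1}.0"
    if part == "patch":  # bug fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


if __name__ == "__main__":
    print(bump_version("0.2.0", "patch"))
```

Lower-order parts reset to zero on a minor or major bump, as the SemVer specification requires.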

uv.lock

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.
