Release v0.2.1: fix silent skip in Bedrock provider validator
The validator in handle_run_evaluation only scanned the .py task file for
provider IDs, but pipeline evals keep model IDs in the sibling .json config
and only reference them via CONFIG[...] in the .py. As a result, validation
found no providers, silently skipped, and invalid Bedrock IDs flowed through
to Inspect — wasting samples on 0-token responses before failing.
Now scans both the .py and the .json, so invalid IDs short-circuit with
"Invalid model ID" before any sample runs.
Also expands docs/DEVELOPMENT.md with the actual MCP dev workflow: editable
install + Claude Code integration test as the primary loop, with pytest as
a narrow supplement for deterministic logic.
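
The fix described above can be sketched roughly as follows. This is a minimal illustration, not the code in handle_run_evaluation: the function name `collect_provider_ids`, the `PROVIDER_ID` pattern, and the assumption that the config shares the task file's stem are all hypothetical.

```python
import re
from pathlib import Path

# Hypothetical pattern: matches strings such as "bedrock/<model-id>".
PROVIDER_ID = re.compile(r"bedrock/[A-Za-z0-9._:\-]+")


def collect_provider_ids(task_path: Path) -> list[str]:
    """Collect provider IDs from a task .py file and its sibling .json, if any."""
    texts = [task_path.read_text()]
    # Assumption: the config sits next to the task file and shares its stem.
    config_path = task_path.with_suffix(".json")
    if config_path.exists():
        texts.append(config_path.read_text())

    found: list[str] = []
    for text in texts:
        for match in PROVIDER_ID.findall(text):
            if match not in found:  # keep first-seen order, drop duplicates
                found.append(match)
    return found
```

Scanning only the .py file would return an empty list for a pipeline eval that reads its model from CONFIG[...], which is the silent-skip case this release fixes.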
docs/DEVELOPMENT.md: 54 additions & 9 deletions
@@ -4,7 +4,14 @@ Technical reference for developers working on the LLM Evaluation Platform.
 
 ## Working on the MCP locally
 
-If you're changing eval_mcp code (tools, server, viewer) and want to run your changes against your own IDE:
+Use this section if you're changing eval_mcp code (tools, server, viewer) and want to run your changes against your own IDE.
+
+### Dev loop at a glance
+
+End users install via `uvx --from llm-evaluation-system@latest eval-mcp`. Do **not** iterate against that, because it pulls the published package. For development, use these two layers:
+
+1. **pytest** against narrow, deterministic logic (regex, parsing, validation): it runs in milliseconds and catches a specific class of regression cheaply. It is not a substitute for end-to-end coverage, because the tools spawn real subprocesses, call Bedrock, and write to user directories, so mocked tests can pass while the real path is broken.
+2. **Claude Code / Desktop** pointed at your local build: the primary integration test. This is the ground truth, because you exercise the same path users get once you publish. Plan for this to be the main way you verify changes.
 
 ### 1. Clone and install editable
 
@@ -17,24 +24,49 @@ uv pip install -e .
 
 `-e` installs editable: edits in `eval_mcp/` take effect on the next MCP restart without reinstalling.
 
-### 2. Point your IDE at the local build
+### 2. Pytest for narrow deterministic logic (optional supplement)
+
+For pure functions (regex parsing, validation, config munging), a handler-level pytest is a cheap safety net. The official MCP servers repo (`github.com/modelcontextprotocol/servers`) imports handlers directly, with no server boot and no transport mocking. Same pattern here:
+
+```python
+# tests/test_run_eval.py
+import pytest
+
+from eval_mcp.tools.run_eval import _validate_providers
+
+@pytest.mark.asyncio  # assumes the pytest-asyncio plugin is installed
+async def test_invalid_bedrock_id():  # hypothetical test name
+    result = await _validate_providers(["bedrock/nonexistent-model"])
+    assert result["valid"] is False
+    assert "Invalid model ID" in result["failed_providers"][0]["error"]
+```
+
+Run with `.venv/bin/pytest tests/`. This is useful for pinning down specific regressions after you hit them, but it is not proof that a feature works end-to-end.
+
+### 3. Point Claude Code at your local build (primary integration test)
 
-Instead of the uvx snippet from the README, use the venv's direct binary so your IDE runs the code you're editing:
+This is how you test what you're about to push. Use the venv's direct binary so your IDE runs the code you're editing:
+
+**a. Edit `~/.claude.json`.** Find the `"eval"` MCP entry and replace its `command` / `args` so it points at your editable install:
-Restart the IDE after every meaningful change (MCP tools are loaded at session start).
+**b. Reload the MCP.** In Claude Code: `/mcp` → disconnect `eval` → reconnect. If tools don't refresh, fully restart the IDE (MCP tools are loaded at session start).
+
+**c. Exercise the change.** Call the tool you just edited through Claude Code and verify the behavior you expect. This is the same code path users will hit after publish: no mocks, no shortcuts.
+
+**d. After every edit** to `eval_mcp/*.py`, repeat step b. The editable install picks up source changes, but the running MCP process doesn't hot-reload, so you have to reconnect to restart it.
+
+**e. When you're done developing**, restore the original `uvx --from llm-evaluation-system@latest` block in `~/.claude.json` so your normal use is back on the published version.
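
Steps a and e amount to swapping one MCP entry back and forth. The sketch below shows that swap on an in-memory dict rather than on the real file. It assumes the entry lives under a top-level `"mcpServers"` key and that the local binary takes no extra arguments; both are assumptions, as is the helper name `point_eval_at_local`. Check your own `~/.claude.json` before relying on either.

```python
import copy


def point_eval_at_local(config: dict, venv_bin: str) -> dict:
    """Return a copy of config whose "eval" entry runs the venv's eval-mcp binary."""
    updated = copy.deepcopy(config)  # leave the published config untouched
    entry = updated["mcpServers"]["eval"]  # assumed key layout
    entry["command"] = f"{venv_bin}/eval-mcp"
    entry["args"] = []  # assumption: the local binary needs no extra args
    return updated


published = {
    "mcpServers": {
        "eval": {
            "command": "uvx",
            "args": ["--from", "llm-evaluation-system@latest", "eval-mcp"],
        }
    }
}
local = point_eval_at_local(published, "/path/to/repo/.venv/bin")
```

Because the helper copies instead of mutating, `published` is still available to write back when you finish developing (step e).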
 
-### 3. Rebuild the viewer frontend
+### 4. Rebuild the viewer frontend
 
 The viewer is a pre-built Next.js export served by the Python viewer app. When you change `frontend/` source:
 
@@ -46,14 +78,20 @@ npm run build:viewer
 
 This compiles the frontend and copies the static output into `eval_mcp/viewer_static/`. The viewer picks it up on the next `eval-mcp view`.
 
-### 4. Running evals against your changes
+### 5. Running evals against your changes
 
 ```bash
 .venv/bin/eval-mcp view            # results viewer on :4001
 .venv/bin/inspect eval <task.py>   # run an eval directly via Inspect AI CLI
 ```
 
-### 5. Publishing a new version
+### 6. Publishing a new version
+
+Users install with `@latest` rather than a pinned version, so every publish is immediately live for every user. Treat it accordingly:
+
+- Every main-branch commit should be releasable. Land changes behind a CI gate (pytest + lint + type-check).
+- Bump the SemVer appropriately: patch for bug fixes, minor for additive changes, major for breaking tool-signature changes.
+- Update the release notes / changelog entry before tagging.
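
The bump rule in the list above can be written down as a small helper. This is only a sketch of the SemVer arithmetic; `bump_version` is a hypothetical name and not part of the repo's release tooling.

```python
def bump_version(version: str, change: str) -> str:
    """Return the next MAJOR.MINOR.PATCH version for the given kind of change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":  # breaking tool-signature change
        return f"{major + 1}.0.0"
    if change == "minor":  # additive change
        return f"{major}.{minor + 1}.0"
    if change == "patch":  # bug fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")


print(bump_version("0.2.0", "patch"))  # 0.2.1
```

A validator bug fix like the one in this release is a patch bump, which is how 0.2.0 becomes 0.2.1.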
+1. Implement the handler in `eval_mcp/tools/<name>.py` as an async function.
+2. Register it in `eval_mcp/server.py` with a typed signature. The docstring is the tool description the LLM sees, so keep it specific about expected ID formats, required prerequisites, and failure modes.
+3. (Optional) Write a pytest case for any narrow deterministic logic (parsing, validation) so regressions get caught cheaply next time.
+4. Point Claude Code at your local build (see section 3) and exercise the tool end-to-end through the IDE before publishing.
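
To illustrate steps 1 and 2, here is a self-contained sketch of an async handler with a typed signature and a docstring that spells out ID format and failure mode. The `TOOLS` dict and `tool` decorator are hypothetical stand-ins for whatever registration mechanism `eval_mcp/server.py` actually uses, and `check_model_id` is an invented example tool.

```python
import asyncio
import inspect

TOOLS: dict[str, dict] = {}  # hypothetical stand-in for the server's registry


def tool(func):
    """Register an async handler, using its docstring as the tool description."""
    TOOLS[func.__name__] = {
        "handler": func,
        "description": inspect.getdoc(func) or "",
    }
    return func


@tool
async def check_model_id(model_id: str) -> dict:
    """Check that model_id has the form "bedrock/<model-id>".

    Returns {"valid": bool, "error": str | None}. Does not call Bedrock.
    Fails with an "Invalid model ID" error when the prefix or ID is missing.
    """
    prefix = "bedrock/"
    if not model_id.startswith(prefix) or len(model_id) == len(prefix):
        return {"valid": False, "error": f"Invalid model ID: {model_id!r}"}
    return {"valid": True, "error": None}


result = asyncio.run(check_model_id("bedrock/nonexistent-model"))
```

Because the handler is a plain async function, a pytest case (step 3) can import and await it directly without booting a server.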
+
+### 8. Running the MCP in Docker (optional)
 
 The repo root `Dockerfile` builds a slim container that runs `eval-mcp serve` as an HTTP MCP (for self-hosting on EC2/ECS/AgentCore). Local dev rarely needs this; use the editable install above.