feat(skill-creator): leverage Monitor tool for streaming eval progress

**Problem**

When /skill-creator runs its eval loop (`run_loop.py`), the entire loop blocks in a single Bash call. The user sees nothing for minutes while 5 iterations of 10+ eval queries run in parallel. This is especially painful for description optimization, where each iteration takes 30-60 seconds.

**Proposed Solution**

Use the new `Monitor` tool to stream eval progress as live notifications. `run_loop.py` already prints per-iteration progress to stderr in verbose mode:

```
Iteration 2/5
Train: 8/10 correct, precision=90% recall=80% accuracy=85% (42.3s)
[PASS] rate=3/3 expected=True: write a feature file
[FAIL] rate=0/3 expected=True: generate step definitions
Best score: 9/10 (iteration 3)
```

The SKILL.md would use Monitor instead of Bash to run the loop:

```
Monitor(
  command: "python3 run_loop.py --verbose ... 2>&1 | grep -E --line-buffered 'Iteration|Train:|Test :|\\[PASS\\]|\\[FAIL\\]|Best score|Exit reason|Improving|Proposed'",
  description: "skill-creator eval loop progress"
)
```
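To sanity-check which lines this filter would forward, here is a rough Python equivalent of the grep pattern, run against the sample output above. It is only a sketch: the JSON and traceback lines in `sample` are made up for illustration and are not real `run_loop.py` output.

```python
import re

# Same alternation as the grep -E pattern in the Monitor command.
PROGRESS_PATTERN = re.compile(
    r"Iteration|Train:|Test :|\[PASS\]|\[FAIL\]|Best score|Exit reason|Improving|Proposed"
)


def filter_progress(lines):
    """Return only the lines the Monitor filter would forward."""
    return [line for line in lines if PROGRESS_PATTERN.search(line)]


sample = [
    "Iteration 2/5",
    "Train: 8/10 correct, precision=90% recall=80% accuracy=85% (42.3s)",
    "[PASS] rate=3/3 expected=True: write a feature file",
    "[FAIL] rate=0/3 expected=True: generate step definitions",
    "Best score: 9/10 (iteration 3)",
    '{"status": "done"}',                  # hypothetical JSON line, dropped by the filter
    "Traceback (most recent call last):",  # hypothetical crash line, also dropped
]

kept = filter_progress(sample)
for line in kept:
    print(line)
```

One thing this surfaces: a plain Python traceback does not match the pattern as written, so a crash inside `run_loop.py` might still show up as silence. Adding something like `Traceback|Error` to the alternation may be worth considering.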

**Benefits**

1. User sees live progress instead of waiting for the full batch
2. The agent is not blocked during eval runs, so it can prepare improvement prompts or explain results as they stream in
3. Failure is visible immediately (a crash loop or hung eval shows up as a notification, not as silence)

**Implementation Notes**

- `run_loop.py` prints progress to stderr, so the Monitor command needs `2>&1` to merge streams
- The grep filter should cover both success and failure signals per Monitor best practices
- The final JSON output goes to stdout, which `2>&1` merges into the same pipe as stderr; the grep filter drops it unless a line happens to match the pattern, so the agent reads `results.json` after the Monitor exits
- Consider adding a `--progress` flag to `run_loop.py` that prints machine-readable one-line summaries to stdout (instead of relying on stderr + grep)
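
A minimal sketch of what the suggested `--progress` output could look like, assuming one JSON object per line. The flag does not exist yet, and the field names and the `format_progress` / `emit_progress` helpers are hypothetical, not part of `run_loop.py`.

```python
import json
import sys


def format_progress(iteration, total, correct, queries, elapsed_s):
    """Build a one-line JSON summary for one iteration (hypothetical schema)."""
    record = {
        "event": "iteration",
        "iteration": iteration,
        "total": total,
        "correct": correct,
        "queries": queries,
        "elapsed_s": round(elapsed_s, 1),
    }
    # Compact separators keep the summary on a single short line.
    return json.dumps(record, separators=(",", ":"))


def emit_progress(line, stream=sys.stdout):
    # flush=True so each line reaches the pipe right away instead of
    # sitting in a block buffer when stdout is not a TTY.
    print(line, file=stream, flush=True)


emit_progress(format_progress(2, 5, 8, 10, 42.3))
```

If progress lines and the final JSON both go to stdout, they would need to be distinguishable, for example via the `event` field used here, or the final result could be written only to `results.json`.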


feat(skill-creator): leverage Monitor tool for streaming eval progress #955
