feat(skill-creator): leverage Monitor tool for streaming eval progress #955

@gsmethells

Problem

When /skill-creator runs its eval loop (run_loop.py), the entire loop blocks in a single Bash call. The user sees nothing for minutes while 5 iterations, each running 10+ eval queries in parallel, complete. This is especially painful for description optimization, where each iteration takes 30-60 seconds.

Proposed Solution

Use the new Monitor tool to stream eval progress as live notifications. run_loop.py already prints per-iteration progress to stderr in verbose mode:

Iteration 2/5
Train: 8/10 correct, precision=90% recall=80% accuracy=85% (42.3s)
[PASS] rate=3/3 expected=True: write a feature file
[FAIL] rate=0/3 expected=True: generate step definitions
Best score: 9/10 (iteration 3)

The SKILL.md would use Monitor instead of Bash to run the loop:

Monitor(
  command: "python3 run_loop.py --verbose ... 2>&1 | grep -E --line-buffered 'Iteration|Train:|Test :|\\[PASS\\]|\\[FAIL\\]|Best score|Exit reason|Improving|Proposed'",
  description: "skill-creator eval loop progress"
)
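To sanity-check the filter before wiring it into SKILL.md, the same alternation can be exercised in Python against the sample output above. This is only a sketch: `PROGRESS_RE` and `filter_progress` are hypothetical names, not part of run_loop.py, and the JSON line in the sample is a made-up stand-in for the final stdout output.

```python
import re

# Same alternatives as the grep -E pattern in the proposed Monitor command.
PROGRESS_RE = re.compile(
    r"Iteration|Train:|Test :|\[PASS\]|\[FAIL\]|Best score|Exit reason|Improving|Proposed"
)


def filter_progress(lines):
    """Return only the lines the grep filter would forward as notifications."""
    return [line for line in lines if PROGRESS_RE.search(line)]


# Lines taken from the verbose output quoted in this issue, plus one
# hypothetical JSON line standing in for the final stdout result.
sample = [
    "Iteration 2/5",
    "Train: 8/10 correct, precision=90% recall=80% accuracy=85% (42.3s)",
    "[PASS] rate=3/3 expected=True: write a feature file",
    "[FAIL] rate=0/3 expected=True: generate step definitions",
    '{"hypothetical_final_json": true}',
    "Best score: 9/10 (iteration 3)",
]

kept = filter_progress(sample)
for line in kept:
    print(line)
```

All five progress lines pass through and the JSON stand-in is dropped, which matches the intent of keeping the final result out of the notification stream.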

Benefits

  1. User sees live progress instead of waiting for the full batch
  2. Agent context is freed during eval runs — can prepare improvement prompts or explain results as they stream in
  3. Failure is visible immediately (a crash loop or hung eval shows up as a notification, not as silence)

Implementation Notes

  • run_loop.py prints progress to stderr, so the Monitor command needs 2>&1 to merge streams
  • The grep filter should cover both success and failure signals per Monitor best practices
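  • The final JSON output goes to stdout (not stderr). Because the command pipes both streams through grep, the filter drops that JSON unless a line happens to match, so Monitor won't surface it; the agent reads results.json after the Monitor exits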
  • Consider adding a --progress flag to run_loop.py that prints machine-readable one-line summaries to stdout (instead of relying on stderr + grep)
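The `--progress` idea in the last note could look roughly like the sketch below. Everything here is an assumption for illustration: the flag does not exist yet, and `build_parser`, `format_progress`, `emit_progress`, and the JSON field names are hypothetical, not taken from run_loop.py.

```python
import argparse
import json


def build_parser():
    # Hypothetical CLI surface; the real run_loop.py arguments are not shown here.
    parser = argparse.ArgumentParser(description="run_loop.py --progress sketch")
    parser.add_argument(
        "--progress",
        action="store_true",
        help="print one machine-readable summary line per iteration to stdout",
    )
    return parser


def format_progress(iteration, total, correct, queries, elapsed_s):
    """Build a single-line JSON summary (field names are assumptions)."""
    return json.dumps(
        {
            "event": "iteration",
            "iteration": iteration,
            "total": total,
            "correct": correct,
            "queries": queries,
            "elapsed_s": round(elapsed_s, 1),
        },
        sort_keys=True,
    )


def emit_progress(enabled, **fields):
    """Print the summary immediately; flush so a consumer sees it without buffering delay."""
    if enabled:
        print(format_progress(**fields), flush=True)


args = build_parser().parse_args(["--progress"])
emit_progress(args.progress, iteration=2, total=5, correct=8, queries=10, elapsed_s=42.3)
```

With one JSON object per line, the Monitor command would no longer need `2>&1` or a grep alternation. One open design question: if progress lines share stdout with the final JSON output, the final result would have to be distinguishable (for example by an `event` field) or be read only from results.json.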
