Skip to content

Releases: razroo/iso

@razroo/iso-eval v0.1.0

19 Apr 05:36

Choose a tag to compare

Initial release of @razroo/iso-eval — behavioral eval runner for AI coding agents.

agentmd lints prompt structure, isolint lints prompt prose, iso-harness fans out the compiled source into every harness file layout. None of them answer did the agent actually do the task? — that's what iso-eval scores.

You give it a suite of tasks (baseline workspace + prompt + checks); it snapshots the workspace per trial, hands it to a runner, and verifies the resulting filesystem / command state against your checks.

v0.1 scope

  • Deterministic `fake` runner (executes `$ …` lines as shell in the snapshotted workspace) — exercises the orchestration layer offline and in CI
  • Checks: `command`, `file_exists`, `file_contains`, `file_not_contains`, `file_matches`, `llm_judge`
  • Real-agent runners (`claude-code`, `codex`, `cursor-agent`) coming in v0.2; the library API already accepts any `RunnerFn` today

Install

```bash
npm install -D @razroo/iso-eval
iso-eval run eval.yml
```

See the package README for the full suite shape and library API.

iso-v0.1.1

18 Apr 16:29

Choose a tag to compare

iso-harness-v0.3.0

18 Apr 16:29

Choose a tag to compare

iso-harness v0.2.0

18 Apr 15:38

Choose a tag to compare

What's new

iso-harness validate

Schema-check the iso/ source directory without writing anything:

iso-harness validate --source iso/
iso-harness validate --source iso/ --format json

Catches: missing command on an MCP server, non-string env values, duplicate agent names, unknown target-override keys (typos like cursor: skip written as Cursor: skip), non-string model fields, and empty descriptions/bodies.

build now gates on validation

iso-harness build runs the validator first and refuses to write output if the source has schema errors. Warnings (empty description, empty body, unknown-harness overrides) are surfaced in the build summary but do not block.

This is a behavior change: if your iso/ source had a latent schema bug, previous versions would silently generate wrong output across all four harnesses. 0.2.0 fails fast instead.

Test coverage

18 unit tests (was: 1 smoke-build script). Covers validation, build, source loading, frontmatter skip rules, TOML escaping, and the refuse-to-emit-on-error guarantee.

Full changelog: packages/iso-harness/CHANGELOG.md

agentmd v0.3.0

18 Apr 15:37

Choose a tag to compare

What's new

lint --format sarif

agentmd lint can now emit SARIF 2.1.0 for upload to GitHub code scanning. Same dialect as isolint --format sarif — driver name agentmd, rule IDs are the L-codes, severities map to error / warning / note.

agentmd lint prompts/*.md --format sarif > agentmd.sarif
# then upload with github/codeql-action/upload-sarif@v3

Full changelog: packages/agentmd/CHANGELOG.md

agentmd v0.2.0

18 Apr 15:21

Choose a tag to compare

Highlights

  • lint accepts multiple files, shell globs, and stdin (-).
  • Machine-readable lint output: --format json (for CI tools) and --format github (workflow annotations).
  • render - reads source from stdin for pipeline composition.
  • --flag=value works anywhere alongside --flag value.
  • test --timeout <ms> kills hung backends instead of stalling.
  • New: agentmd --version.

Parser

  • Strip UTF-8 BOM and normalize CRLF / bare CR so Windows and editor-exported files parse correctly (previously the # Agent: heading would silently fail).
  • Flag duplicate # Agent: headings as warning L12 instead of silently dropping them.

Linter

  • L9 split into L9a (agent heading), L9b (procedure), L9c (at least one rule) — each check is now individually filterable.
  • L3 duplicate-ID diagnostics now carry a line number and point at the first definition.

Tests

  • 85 tests (was 69). New subprocess CLI coverage for stdin, multi-file, globs, JSON/github formats, --flag=value.

agentmd v0.1.0

18 Apr 14:54

Choose a tag to compare

First release of @razroo/agentmd — a structured-markdown dialect for authoring agent prompts, with a linter for structure and a fixture-driven harness that measures per-rule adherence against a target model.