
Add eval-harness to LLMOps section #538

Open
hoainho wants to merge 1 commit into tensorchord:main from nano-step:add-eval-harness

hoainho commented Jun 1, 2026

Adding eval-harness

Project: https://github.com/nano-step/eval-harness
License: MIT
Language: Bash (+ jq, python3 stdlib)
Released: v0.4.2 on 2026-05-30

What it does

Behavior-regression testing for LLM agents. It detects when an agent's behavior drifts from a baseline, attributes the cause to one of 4 deterministic classes (SKILL_CHANGED / FIXTURE_STALE / MODEL_CHANGED / UNKNOWN_DRIFT), and emits a 6-field FAIL schema that includes transcript_span and env_delta. It ships a composite GitHub Action and a git pre-push hook.
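To give reviewers a feel for what SHA-comparison attribution looks like, here is a minimal Python sketch. It is illustrative only and is not eval-harness's actual code (the project itself is Bash): the `Snapshot` type, its field names, and the order in which the checks run are all assumptions. Only the four class names come from the project.

```python
import hashlib
from dataclasses import dataclass


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of some bytes."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical fingerprint of one eval run (field names are assumptions)."""
    skill_sha: str    # digest of the agent's skill/prompt files
    fixture_sha: str  # digest of the test fixture
    model_id: str     # model identifier reported for the run


def attribute(baseline: Snapshot, current: Snapshot) -> str:
    """Pick one of the four classes for a run that has already FAILed.

    The check order below is an assumption; it only shows that the decision
    is a fixed sequence of equality comparisons, hence deterministic.
    """
    if baseline.skill_sha != current.skill_sha:
        return "SKILL_CHANGED"
    if baseline.fixture_sha != current.fixture_sha:
        return "FIXTURE_STALE"
    if baseline.model_id != current.model_id:
        return "MODEL_CHANGED"
    # Nothing tracked changed, yet behavior drifted.
    return "UNKNOWN_DRIFT"
```

Because every branch is a plain equality check on recorded values, the same pair of snapshots always yields the same class.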

Why this fits the LLMOps section

Testing and observability for LLMOps has a known gap: existing tools report that a test failed but not why. eval-harness covers the regression-detection and attribution slice. It complements broader entries on this list (LangSmith, Arize-Phoenix, Langfuse, Helicone, etc.) rather than replacing them. For an honest comparison with promptfoo, see docs/why-not-promptfoo.md.

Distinctive features

  • 4-class failure attribution (deterministic SHA-comparison decision tree)
  • 6-field FAIL schema including transcript_span + env_delta
  • 3-sample byte-identical stability check: flakes are tagged as first-class results instead of being retried until they pass
  • Hard $-cost ceiling with daily budget enforcement (default EVAL_BUDGET_USD=2.00)
  • Per-(case,trigger) flock lockfile for safe concurrent CI runs
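The stability check in the list above can be sketched roughly as follows. This is a hedged Python illustration, not the project's Bash implementation: the `stability_tag` function, the `run_case` callable, and the `STABLE` / `FLAKY` tag names are hypothetical. The idea shown is only that a case is run a fixed number of times and tagged from whether the outputs are byte-identical, with no retry loop.

```python
import hashlib
from typing import Callable


def stability_tag(run_case: Callable[[], bytes], samples: int = 3) -> str:
    """Run a case `samples` times and tag it by output stability.

    `run_case` is a hypothetical callable returning the raw output bytes of
    one run. Comparing SHA-256 digests is equivalent, for practical purposes,
    to comparing the outputs byte for byte.
    """
    digests = {hashlib.sha256(run_case()).hexdigest() for _ in range(samples)}
    # One distinct digest means every sample was byte-identical.
    return "STABLE" if len(digests) == 1 else "FLAKY"
```

A flaky case is reported as flaky after exactly `samples` runs, rather than being re-run until one attempt happens to pass.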

Project hygiene

  • v0.4.2 closes 8 audit-surfaced BLOCKERs (sandboxed score_shell, fixture path-traversal blocking, GNU/BSD grep portability, etc.)
  • 20/20 test suites green on main
  • CONTRIBUTING.md, CODE_OF_CONDUCT.md, SECURITY.md present
  • DCO sign-off on commit
  • Open issues labeled "good first issue" and "help wanted" for new contributors

Entry placement

Inserted between Deepchecks and Evidently (alphabetical, case-insensitive).

Thanks for maintaining this list.

Signed-off-by: Hoài Nhớ <nhoxtvt@gmail.com>