Commit a310acd

feat(perf): infra perf-benchmark substrate — journeys, integrity contracts, percentile ratchet (0.90.0)
New domain-agnostic /perf subpath for infra performance benchmarking, complementing the judge-panel BenchmarkRunner (quality) with latency / reliability scoring over flat metric records:

- JourneySpec + expandMatrix + scenarioKey: journeys × free-form axes cartesian matrix with sorted-dim stable keys and a combo filter.
- checkRecordIntegrity + assertRecordIntegrity: a pass=true record must carry its journey's requiredFields / minimums / phaseFields; failed records are exempt.
- summarizeRecords + gatePerf: nearest-rank p50/p90 PerfStat baselines and a tolerance ratchet with improvements, missing/new scenario detection, and a minSamples floor; null metrics never become fake zeros.

Exported from the root barrel and the new ./perf subpath (tsup entry + package.json exports). Version 0.90.0 across npm + PyPI; CHANGELOG entry added. 25 vitest cases, each mutation-verified (7/7 mutants killed).
1 parent 250c1ec commit a310acd

11 files changed

Lines changed: 667 additions & 4 deletions

CHANGELOG.md

Lines changed: 9 additions & 1 deletion
@@ -4,7 +4,15 @@ All notable changes to `@tangle-network/agent-eval` and its sibling `agent-eval-
 
 ---
 
-## [0.86.0] — 2026-06-09 — fleet-rebuilt eval primitives
+## [0.90.0] — 2026-06-10 — infra perf-benchmark substrate (`/perf`)
+
+Domain-agnostic infra-performance benchmarking: a journeys × axes scenario matrix, record-integrity contracts over flat metric records, and a percentile ratchet. Complements the judge-panel `BenchmarkRunner` (root) — that one scores QUALITY via judges; `/perf` scores LATENCY / RELIABILITY. All additive — no existing export changed.
+
+### Added
+
+- **`JourneySpec` + `expandMatrix` + `scenarioKey` (`/perf` + root).** A journey is one measurable user path (`provision.cold`, `chat.ttft`) carrying its own data contract: `requiredFields` (must be non-null on a passing record), `minimums` (numeric floors, e.g. `event_count ≥ 1` for streaming), `phaseFields` (per-phase breakdown, reported separately), and `requiresLLM` (nightly vs per-PR scheduling). `expandMatrix` does the cartesian expansion over free-form `ScenarioAxes` (driver × region × …) with a `filter` for invalid combos; scenario keys are `journeyId|dim=value|…` with dims sorted, so the key is stable across axes-object insertion order.
+- **`checkRecordIntegrity` + `assertRecordIntegrity` (`/perf` + root).** A record claiming `pass === true` must actually carry its journey's required measurements — a "passing" run with a null `total_ms` is an integrity violation (`null-required-field` / `below-minimum`), not a pass. Failed records are exempt (an errored run legitimately has nulls); `resolveJourney` returning null skips the record. The assert variant throws listing every violation.
+- **`summarizeRecords` + `gatePerf` (`/perf` + root).** Percentile ratchet: fold flat records into per-scenario `PerfStat` (`p50` / `p90` / `n`, nearest-rank on sorted values), then gate a current `PerfBaseline` against a committed one. Null / non-numeric metric values are excluded from `n` and a zero-sample field is omitted — no fake zeros. Regressions trip when p50 OR p90 exceed `tolerancePct` (default 10) over baseline; strict improvements are reported with negative `overBy`; scenarios under `minSamples` (default 3) in current are surfaced in `missingScenarios` and never gated; baseline/current key drift lands in `missingScenarios` / `newScenarios`.
 
 One clean, canonical version of five generic patterns the fleet kept hand-rolling across 2–4 product agents each. All additive — no existing export changed.
 
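The ratchet source (`src/perf/ratchet.ts`) is not part of this excerpt, so the following is only a minimal sketch of the behaviour the changelog entry describes: nearest-rank percentiles over sorted values, null / non-numeric samples dropped rather than counted as zero, and a regression when p50 or p90 exceeds the baseline by more than the tolerance. The names `Stat`, `nearestRank`, `summarize`, and `regressed` are hypothetical, and both the exact comparison (`current > baseline * (1 + tolerancePct / 100)`) and the use of `Number.isFinite` are assumptions, not the package's confirmed logic.

```typescript
// Hypothetical sketch; these names are illustrative, not the package's API.
interface Stat { p50: number; p90: number; n: number }

// Nearest-rank percentile: value at 1-based rank ceil(p/100 * n) of the sorted samples.
// Callers must pass a non-empty, ascending-sorted array.
function nearestRank(sorted: number[], p: number): number {
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length))
  return sorted[rank - 1] as number
}

// Fold raw values into a stat. Null / non-numeric samples are dropped,
// never counted as 0 (assumption: non-finite numbers are dropped too).
function summarize(values: ReadonlyArray<unknown>): Stat | undefined {
  const nums = values
    .filter((v): v is number => typeof v === 'number' && Number.isFinite(v))
    .sort((a, b) => a - b)
  if (nums.length === 0) return undefined // a zero-sample field is omitted
  return { p50: nearestRank(nums, 50), p90: nearestRank(nums, 90), n: nums.length }
}

// Assumed gate: a regression trips when p50 OR p90 is more than tolerancePct over baseline.
function regressed(baseline: Stat, current: Stat, tolerancePct = 10): boolean {
  const limit = 1 + tolerancePct / 100
  return current.p50 > baseline.p50 * limit || current.p90 > baseline.p90 * limit
}

const baseline = summarize([100, 110, 120, 130, 140]) // { p50: 120, p90: 140, n: 5 }
const current = summarize([100, null, 150, 160, 'n/a', 170]) // { p50: 150, p90: 170, n: 4 }
if (baseline && current) {
  console.log(regressed(baseline, current)) // true: 150 > 120 * 1.1
}
```

In this sketch the two non-numeric samples shrink `n` from 6 to 4 instead of dragging the percentiles toward zero, which is the "no fake zeros" property the entry calls out.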

clients/python/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "agent-eval-rpc"
-version = "0.89.0"
+version = "0.90.0"
 description = "Python RPC client for @tangle-network/agent-eval — judge content against rubrics over HTTP or stdio RPC. Eval logic runs in the Node runtime; this package is a thin wire client."
 readme = "README.md"
 requires-python = ">=3.10"

clients/python/src/agent_eval_rpc/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -58,7 +58,7 @@
 try:
     __version__ = version("agent-eval-rpc")
 except PackageNotFoundError:
-    __version__ = "0.89.0"
+    __version__ = "0.90.0"
 
 __all__ = [
     "Client",

package.json

Lines changed: 6 additions & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "@tangle-network/agent-eval",
-  "version": "0.89.0",
+  "version": "0.90.0",
   "description": "Evaluate and improve AI agents from runs, traces, judges, and feedback. Compare candidates, cluster failures, measure lift, and gate releases.",
   "homepage": "https://github.com/tangle-network/agent-eval#readme",
   "repository": {
@@ -109,6 +109,11 @@
       "import": "./dist/matrix/index.js",
       "default": "./dist/matrix/index.js"
     },
+    "./perf": {
+      "types": "./dist/perf/index.d.ts",
+      "import": "./dist/perf/index.js",
+      "default": "./dist/perf/index.js"
+    },
     "./multishot": {
       "types": "./dist/multishot/index.d.ts",
       "import": "./dist/multishot/index.js",

src/index.ts

Lines changed: 24 additions & 0 deletions
@@ -1333,6 +1333,30 @@ export type {
   AttestedReport,
 } from './attestation'
 export { ATTESTATION_ALGORITHM, attest, verifyAttestation } from './attestation'
+// ── Perf — infra-performance benchmarking substrate ──────────────────
+// Journeys × axes scenario matrix, record-integrity contracts, and the
+// percentile ratchet (summarize → baseline → gate). Scores LATENCY /
+// RELIABILITY over flat metric records; the judge-panel BenchmarkRunner
+// (./benchmark) scores QUALITY. Also on the `/perf` subpath.
+export type {
+  IntegrityResult,
+  IntegrityViolation,
+  JourneySpec,
+  PerfBaseline,
+  PerfGateResult,
+  PerfRegression,
+  PerfScenario,
+  PerfStat,
+  ScenarioAxes,
+} from './perf'
+export {
+  assertRecordIntegrity,
+  checkRecordIntegrity,
+  expandMatrix,
+  gatePerf,
+  scenarioKey,
+  summarizeRecords,
+} from './perf'
 // ── Anytime-valid sequential testing (e-process core) ────────────────
 // The betting test-martingale behind the sequential gates. Gate-level
 // machinery (sequentialPairedGate, sequentialDecide) lives on the /campaign

src/perf/index.ts

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+/**
+ * @tangle-network/agent-eval/perf
+ *
+ * Domain-agnostic infra-performance benchmarking substrate: a journeys ×
+ * axes scenario matrix, record-integrity contracts over flat metric
+ * records, and a percentile ratchet (summarize → baseline → gate).
+ *
+ * Complements the judge-panel `BenchmarkRunner` (root): that one scores
+ * QUALITY; this one scores LATENCY / RELIABILITY over flat metric records.
+ */
+
+export type { IntegrityResult, IntegrityViolation } from './integrity'
+export { assertRecordIntegrity, checkRecordIntegrity } from './integrity'
+export type { JourneySpec, PerfScenario, ScenarioAxes } from './journey'
+export { expandMatrix, scenarioKey } from './journey'
+export type { PerfBaseline, PerfGateResult, PerfRegression, PerfStat } from './ratchet'
+export { gatePerf, summarizeRecords } from './ratchet'

src/perf/integrity.ts

Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
+/**
+ * Record-integrity contracts for perf metric records.
+ *
+ * A record that claims `pass === true` must actually carry the journey's
+ * required measurements — a "passing" provision run with a null
+ * `total_ms` is a lying record, not a pass. Failed records are exempt:
+ * a run that errored mid-flight legitimately has nulls.
+ */
+
+import type { JourneySpec } from './journey'
+
+export interface IntegrityViolation {
+  recordIndex: number
+  journeyId: string
+  field: string
+  reason: 'null-required-field' | 'below-minimum'
+  detail: string
+}
+
+export interface IntegrityResult {
+  succeeded: boolean
+  violations: IntegrityViolation[]
+}
+
+function isMissing(value: unknown): boolean {
+  return value === null || value === undefined
+}
+
+/**
+ * Validates flat metric records (Record<string, unknown> with a boolean
+ * `pass` field) against their journey contract. Only records with
+ * pass === true are checked — a failed record may legitimately have nulls.
+ * resolveJourney maps a record to its JourneySpec (or null to skip).
+ */
+export function checkRecordIntegrity(
+  records: ReadonlyArray<Record<string, unknown>>,
+  resolveJourney: (record: Record<string, unknown>) => JourneySpec | null,
+): IntegrityResult {
+  const violations: IntegrityViolation[] = []
+  for (const [recordIndex, record] of records.entries()) {
+    if (record.pass !== true) continue
+    const journey = resolveJourney(record)
+    if (journey === null) continue
+    for (const field of journey.requiredFields) {
+      if (isMissing(record[field])) {
+        violations.push({
+          recordIndex,
+          journeyId: journey.id,
+          field,
+          reason: 'null-required-field',
+          detail: `required field '${field}' is ${record[field] === null ? 'null' : 'undefined'} on a passing '${journey.id}' record`,
+        })
+      }
+    }
+    for (const field of journey.phaseFields ?? []) {
+      if (isMissing(record[field])) {
+        violations.push({
+          recordIndex,
+          journeyId: journey.id,
+          field,
+          reason: 'null-required-field',
+          detail: `phase field '${field}' is ${record[field] === null ? 'null' : 'undefined'} on a passing '${journey.id}' record`,
+        })
+      }
+    }
+    for (const { field, min } of journey.minimums ?? []) {
+      const value = record[field]
+      if (isMissing(value)) continue // null-ness is the required/phase fields' contract
+      if (typeof value !== 'number' || Number.isNaN(value)) {
+        violations.push({
+          recordIndex,
+          journeyId: journey.id,
+          field,
+          reason: 'below-minimum',
+          detail: `field '${field}' has non-numeric value ${JSON.stringify(value)} on a passing '${journey.id}' record (minimum ${min})`,
+        })
+        continue
+      }
+      if (value < min) {
+        violations.push({
+          recordIndex,
+          journeyId: journey.id,
+          field,
+          reason: 'below-minimum',
+          detail: `field '${field}' is ${value}, below minimum ${min} on a passing '${journey.id}' record`,
+        })
+      }
+    }
+  }
+  return { succeeded: violations.length === 0, violations }
+}
+
+/** Throws an Error listing every violation when the result fails. */
+export function assertRecordIntegrity(
+  records: ReadonlyArray<Record<string, unknown>>,
+  resolveJourney: (record: Record<string, unknown>) => JourneySpec | null,
+): void {
+  const result = checkRecordIntegrity(records, resolveJourney)
+  if (result.succeeded) return
+  const lines = result.violations.map(
+    (v) => ` [record ${v.recordIndex}] ${v.journeyId}.${v.field} (${v.reason}): ${v.detail}`,
+  )
+  throw new Error(
+    `Record integrity check failed with ${result.violations.length} violation(s):\n${lines.join('\n')}`,
+  )
+}
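
To make the exemption rule in `src/perf/integrity.ts` concrete, here is a small standalone sketch of the same contract applied to three records. It deliberately does not import the package: `Journey`, `Rec`, and `violationsFor` are hypothetical, trimmed stand-ins (required fields and minimums only, no phase fields, a single journey instead of a `resolveJourney` callback), and the sample field values are made up for illustration.

```typescript
// Trimmed, hypothetical stand-in for the check in the diff above; not the package's API.
interface Journey {
  id: string
  requiredFields: string[]
  minimums?: { field: string; min: number }[]
}
type Rec = Record<string, unknown>

function violationsFor(records: Rec[], journey: Journey): string[] {
  const out: string[] = []
  records.forEach((record, i) => {
    if (record['pass'] !== true) return // failed records are exempt
    for (const field of journey.requiredFields) {
      if (record[field] === null || record[field] === undefined) {
        out.push(`${i}:${field}:null-required-field`)
      }
    }
    for (const { field, min } of journey.minimums ?? []) {
      const value = record[field]
      if (value === null || value === undefined) continue // null-ness is the required fields' job
      if (typeof value !== 'number' || Number.isNaN(value) || value < min) {
        out.push(`${i}:${field}:below-minimum`)
      }
    }
  })
  return out
}

const chatTtft: Journey = {
  id: 'chat.ttft',
  requiredFields: ['total_ms'],
  minimums: [{ field: 'event_count', min: 1 }],
}

const found = violationsFor(
  [
    { pass: true, total_ms: 412, event_count: 7 }, // clean pass
    { pass: true, total_ms: null, event_count: 0 }, // "passing" record with two violations
    { pass: false, total_ms: null }, // errored run: exempt
  ],
  chatTtft,
)
console.log(found) // ['1:total_ms:null-required-field', '1:event_count:below-minimum']
```

Only the second record is flagged. The third has the same null `total_ms` but reports `pass: false`, so it is skipped.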

src/perf/journey.ts

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
+/**
+ * Journey × axes matrix for infra performance benchmarks.
+ *
+ * A journey is one measurable user path ("provision.cold", "chat.ttft");
+ * axes are free-form scenario dimensions (driver, region, image…). The
+ * matrix expansion is pure bookkeeping — running the scenarios and
+ * recording metrics is the caller's job. This module complements the
+ * judge-panel `BenchmarkRunner` (src/benchmark.ts): that one scores
+ * QUALITY via judges, this one structures LATENCY / RELIABILITY runs
+ * over flat metric records.
+ */
+
+/** One measurable user journey (e.g. "provision.cold", "chat.ttft"). */
+export interface JourneySpec {
+  id: string
+  description: string
+  /** Needs a real LLM call — schedule nightly, not per-PR. */
+  requiresLLM: boolean
+  /**
+   * Fields that MUST be non-null on a passing record of this journey.
+   * A "passing" record missing one is an integrity violation, not a pass.
+   */
+  requiredFields: ReadonlyArray<string>
+  /** Numeric floors, e.g. {field: 'event_count', min: 1} for streaming. */
+  minimums?: ReadonlyArray<{ field: string; min: number }>
+  /** Per-phase breakdown fields expected non-null (subset of requiredFields semantics, reported separately). */
+  phaseFields?: ReadonlyArray<string>
+}
+
+export interface ScenarioAxes {
+  /** e.g. driver: ['docker','firecracker'] — every key is a free-form dimension. */
+  [dimension: string]: ReadonlyArray<string>
+}
+
+export interface PerfScenario {
+  /** `${journeyId}|${dim1}=${v1}|${dim2}=${v2}` (dims sorted). */
+  key: string
+  journey: JourneySpec
+  axes: Record<string, string>
+}
+
+/** Stable scenario key: journey id then `dim=value` pairs in sorted-dim order. */
+export function scenarioKey(journeyId: string, axes: Record<string, string>): string {
+  const parts = Object.keys(axes)
+    .sort()
+    .map((dim) => `${dim}=${axes[dim]}`)
+  return [journeyId, ...parts].join('|')
+}
+
+/** Cartesian expansion; `filter` lets callers drop invalid combos (e.g. firecracker×resume). */
+export function expandMatrix(
+  journeys: ReadonlyArray<JourneySpec>,
+  axes: ScenarioAxes,
+  filter?: (journeyId: string, combo: Record<string, string>) => boolean,
+): PerfScenario[] {
+  const dims = Object.keys(axes).sort()
+  let combos: Record<string, string>[] = [{}]
+  for (const dim of dims) {
+    const values = axes[dim] as ReadonlyArray<string>
+    const next: Record<string, string>[] = []
+    for (const combo of combos) {
+      for (const value of values) {
+        next.push({ ...combo, [dim]: value })
+      }
+    }
+    combos = next
+  }
+  const scenarios: PerfScenario[] = []
+  for (const journey of journeys) {
+    for (const combo of combos) {
+      if (filter && !filter(journey.id, combo)) continue
+      scenarios.push({ key: scenarioKey(journey.id, combo), journey, axes: combo })
+    }
+  }
+  return scenarios
+}
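
As a worked example of the matrix expansion and key stability in `src/perf/journey.ts`, the sketch below expands two journeys over two axes and drops one invalid combination. To stay runnable without the package it uses local stand-ins: `scenarioKey` is copied from the diff, while `expandKeys` is a hypothetical simplification of `expandMatrix` that takes journey ids and returns only the keys. The journey id `provision.resume` and the axis values are illustrative assumptions.

```typescript
// Local stand-ins so the snippet runs standalone; expandKeys is a hypothetical simplification.
type Combo = Record<string, string>

function scenarioKey(journeyId: string, axes: Combo): string {
  const parts = Object.keys(axes)
    .sort()
    .map((dim) => `${dim}=${axes[dim]}`)
  return [journeyId, ...parts].join('|')
}

function expandKeys(
  journeyIds: string[],
  axes: Record<string, ReadonlyArray<string>>,
  filter?: (journeyId: string, combo: Combo) => boolean,
): string[] {
  let combos: Combo[] = [{}]
  for (const dim of Object.keys(axes).sort()) {
    const next: Combo[] = []
    for (const combo of combos) {
      for (const value of axes[dim] ?? []) {
        next.push({ ...combo, [dim]: value })
      }
    }
    combos = next
  }
  const keys: string[] = []
  for (const id of journeyIds) {
    for (const combo of combos) {
      if (filter && !filter(id, combo)) continue
      keys.push(scenarioKey(id, combo))
    }
  }
  return keys
}

// Same axes in a different insertion order give the same key.
const a = scenarioKey('chat.ttft', { region: 'us', driver: 'docker' })
const b = scenarioKey('chat.ttft', { driver: 'docker', region: 'us' })
console.log(a === b) // true

// 2 journeys x 2 drivers x 2 regions = 8 combos, minus 2 filtered (resume on firecracker).
const keys = expandKeys(
  ['provision.cold', 'provision.resume'],
  { region: ['us', 'eu'], driver: ['docker', 'firecracker'] },
  (id, combo) => !(id === 'provision.resume' && combo['driver'] === 'firecracker'),
)
console.log(keys.length) // 6
console.log(keys[0]) // provision.cold|driver=docker|region=us
```

Because dimensions are sorted before expansion and before key construction, `driver` precedes `region` in every key even though the axes object lists `region` first.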
