This directory contains the experimental Yanex agent-interface benchmark harness. It compares how different agent-facing surfaces affect correctness, token cost, tool calls, and bytes read over the same fixed Yanex experiment corpus.
This is not a storage-engine throughput benchmark. The product benchmark for
metadata, training-read, and checkpoint workloads lives in the root bench/
crate and is documented in docs/benchmarks.md.
The NoKV product crates expose the low-level native namespace surface used by the benchmark:
nokv-meta:stat_card,list_page,find_paths,grep_paths, andread_page, plus native indexedaggregate_paths;nokv-protocol:StatCard,ListPage,FindPaths,AggregatePaths,GrepPaths, andReadPageRPC DTOs;nokv-client: SDK methods plus the product-native agent adapter exposingls,stat,catalog,read,find,aggregate, andgrep;nokv-server: framed metadata RPC handlers for the same operations.
Native grep is now implemented through nokv-meta, nokv-protocol, and
nokv-server as a product-native file-content scan. The current five-task
Phase 1 registry exposes ls, stat, catalog, read, aggregate, find,
and grep. Native grep matches a case-insensitive literal substring and
returns matching lines with line numbers, so log-extraction tasks resolve
body facts without full file reads.
The benchmark arm named nokv_native_v1 uses the nokv-client agent adapter
with the Phase 1 tool profile above. The harness translates OpenAI tool calls
into product API calls, but does not own the measured card, find, index catalog,
aggregation, pagination, or consistency semantics.
The default Phase 1 API surface is openai_agents_responses_schema_once. The
Rust harness still owns batch planning, local judging, telemetry JSONL, and
NoKV/SQLite tool execution. A Python runner uses openai-agents>=0.7.0,<0.8.0
for the model/tool loop and a custom Responses model wrapper. Continuation
requests keep the same tool schemas available so the model can make additional
tool calls after observing tool outputs. Tool calls are
posted back to a per-run loopback bridge in the Rust harness.
The benchmark uses a fixed Yanex experiment tracking corpus. The materialized state includes:
- experiment metadata;
- params;
- metrics;
- dependencies;
- artifacts;
- git state;
- generated read-only index files.
Default local data root:
bench/data/yanex-demo
Expected local layout after preparation:
bench/data/yanex-demo/
corpus/
sqlite/yanex.db
nokv/meta
rustfs/
manifest.json
results/
The local corpus archive path is intentionally not committed. Pass it with
--archive when preparing data.
The Phase 1 harness compares two read-only arms:
| Arm | Surface |
|---|---|
sqlite_raw_v1 |
Raw SQLite schema/query/blob tools (including line-oriented grep_blob) plus ETL-maintained agent-index materialization tables. |
nokv_native_v1 |
NoKV product-native agent adapter. |
The fixed Phase 1 tasks live in tasks/phase1_readonly.yaml. The rubric
lives in rubric/phase1_readonly.yaml.
The active read-only workload is a five-task researcher-shaped set:
| Task | Shape |
|---|---|
train_top_configs_report |
Report the 5 best completed train.py runs by minimum finite val_loss with learning rate, batch size, stdout size, and git dirty state. |
eval_fidelity_leaderboard |
Report the 5 completed eval.py runs with the highest latest fidelity plus related utility, detection, and privacy metrics and stderr size. |
tabdiff_ddxplus_dcr_checkpoint_provenance |
For every ddxplus_dcr TabDiff sampling run, extract the loaded checkpoint file and model parameter count from the sampler stdout log. |
best_detection_eval_method_audit |
Find the completed eval.py run with the highest latest detection_roc_auc and extract the detection method name from its eval log. |
cancelled_train_interrupt_triage |
For every non-completed run, report stderr size and the stderr line number of the last KeyboardInterrupt occurrence. |
The two structured report tasks are judged against gold SQL. The three log-extraction tasks are judged against harness-side file-body oracles; oracles are judge-side data and are never exposed to either arm.
The benchmark has one core A/B comparison:
- Raw SQLite tools vs NoKV Native Namespace.
Precomputed indexes are a normal experiment-tracking system behavior and are part of the intended benchmark model. They make the benchmark closer to a real agent workload, where agents inspect catalogs and facets instead of scanning all raw blobs for every question.
The fairness rule is that every valid A/B comparison must expose logically equal introspection over the benchmarked facts. The syntax and access pattern can differ by surface, but the visible catalog fields, index facts, and limitations must not give one arm hidden task answers that the paired arm cannot discover through its own public interface.
In this benchmark, "NoKV native" must mean that the agent-facing tools map to NoKV product APIs. The harness may adapt OpenAI tool calls into SDK calls and enforce limits, but it must not own the measured metadata semantics.
Target native behavior:
ls/statreturn typed directory/file cards, not flat file entries.entry_count,record_count,schema,sample, compact body descriptors, catalog fields, and indexed values are available throughstat.statcards avoid evidence, snapshot, generation, and storage-internal body fields in the agent-visible payload.readdefaults to structured pagination; raw bytes require explicitformat = "bytes".find(path, predicates, sort, fields, facets, cursor, limit)searches paths and projects indexed field values; body/schema/sample inspection usesstatorread.finduses a constrained predicate grammar declared by stat/catalog cards.catalog(path, field_prefix, include_facets)provides compact field discovery without requiring a full stat card; facets are opt-in.aggregate(path, predicates, group_by, measures, sort, limit)provides compact summaries over indexed namespace facts.- paginated results include truncation state and
next_cursorwhen more results exist.
Generated /index/*.json files must not become hidden answer files. They can
support facet-count tasks, but they are not a substitute for product-level
typed index namespaces or future derived metric/set-pipeline APIs.
Start RustFS:
./bench/agent-interface/scripts/start_rustfs.shAlways prepare from a clean slate with --reset before measuring; verify
checks that the agent-visible NoKV catalog matches the registered index
fields and fails on stale fields from older encoders.
Prepare the fixed corpus:
cargo run -p nokv-bench --bin yanex-agent-bench -- prepare \
--archive /path/to/yanex-experiment-metadata-origami-data-gen-project.tar.gz \
--data-root bench/data/yanex-demo \
--resetVerify materialization:
cargo run -p nokv-bench --bin yanex-agent-bench -- verify \
--data-root bench/data/yanex-demoSet OpenAI credentials and model:
export OPENAI_API_KEY=...
export OPENAI_MODEL=gpt-5.5Run one task for one arm:
./bench/agent-interface/scripts/run_phase1_batch.sh \
--arm nokv_native_v1 \
--task-id cancelled_train_interrupt_triage \
--repeats 1 \
--output-jsonl bench/data/yanex-demo/results/phase1.jsonlInstall runner dependencies in the local benchmark virtual environment. The
wrapper automatically uses this environment when PYTHON is not set:
python3.12 -m venv bench/agent-interface/.venv
bench/agent-interface/.venv/bin/python -m pip install -r bench/agent-interface/agents_runner/requirements.txtSet YANEX_BENCH_AGENT_SDK_LIVE_PROBE=1 to make the Agent SDK runner perform a
real schema-once Responses continuation probe before a run. Probe failure is a
run failure.
Run all fixed Phase 1 tasks for all arms:
./bench/agent-interface/scripts/run_phase1_batch.sh \
--repeats 10 \
--output-jsonl bench/data/yanex-demo/results/phase1.jsonlInspect the tool registry:
cargo run -p nokv-bench --bin yanex-agent-bench -- tools \
--arm nokv_native_v1Inspect a NoKV raw namespace path:
cargo run -p nokv-bench --bin yanex-agent-bench -- nokv-stat \
--data-root bench/data/yanex-demo \
--path /yanex/runs/00023013/metadata.jsonThe nokv-* direct commands above remain raw debugging commands. The benchmark
arm uses the product-native ls/stat/catalog/read/find/aggregate/grep
adapter exposed by nokv-client; the harness passes tool calls through
without owning any namespace semantics.
Inspect SQLite schema:
cargo run -p nokv-bench --bin yanex-agent-bench -- sqlite-show-schema \
--db bench/data/yanex-demo/sqlite/yanex.dbSet pricing environment variables before a batch to record per-run USD cost
in the telemetry (all_in_cost_usd):
export OPENAI_INPUT_USD_PER_1M_TOKENS=0.75
export OPENAI_CACHED_INPUT_USD_PER_1M_TOKENS=0.075
export OPENAI_OUTPUT_USD_PER_1M_TOKENS=4.50Cached prompt tokens are billed at the cached rate; uncached prompt tokens at the input rate; completion tokens at the output rate.
- Run at least 5 repeats per arm/task pair before quoting numbers; small models show high per-run variance on compound tasks.
- Judge-side gold (gold SQL or file-body oracles) is never exposed to either arm.
- Do not implement one-sided semantics in the harness as benchmark-only
shortcuts; the harness stays a thin adapter over
nokv-clientandnokv-meta, and the raw SQLite arm keeps line-orientedgrep_blobparity for body search. - The published benchmark report lives at
bench/agent-interface/BENCHMARK_REPORT.md; its raw telemetry is committed underbench/agent-interface/results/.