@@ -52,8 +52,8 @@ benchmarks/data/splits/ ← 5 stratified CV splits (seeds 42, 0,
 
 | Subset | Questions | Purpose |
 | ---| ---| ---|
-| `dev` | 50 | Hyperparameter tuning only (~7 min) |
-| `held_out` | 450 | Single clean evaluation (~60 min) |
+| `dev` | 50 | Hyperparameter tuning only |
+| `held_out` | 450 | Single clean evaluation |
 
 **Integrity rule:** all alpha decisions are made on `dev` only. The held-out
 result is run once; no parameters are adjusted after observing held-out failures.
@@ -241,7 +241,7 @@ regenerate them, as the held-out IDs must be identical across runs.
 
 ### Fixed split — canonical held-out result
 
-**Step 1 — Find optimal alphas on dev (~7 min)**
+**Step 1 — Find optimal alphas on dev**
 
 ```bash
 .venv/bin/python -u benchmarks/longmemeval_bench.py \
@@ -255,7 +255,7 @@ Expected best combo: `ECR α=0.3 IDF α=0.6 CAATB α=0.2`
 The sweep evaluates 27 alpha combinations without re-embedding — raw vector rows
 are cached once per question and all combos are applied offline.
 
-**Step 2 — Evaluate on held-out (~60 min)**
+**Step 2 — Evaluate on held-out**
 
 ```bash
 .venv/bin/python -u benchmarks/longmemeval_bench.py \
@@ -277,7 +277,7 @@ Results are written to `benchmarks/results/results_mw_ecr{α}_idf{α}_caatb{α}_
 
 ---
 
-### 5-seed cross-validated results (~5–6 hours)
+### 5-seed cross-validated results
 
 ```bash
 .venv/bin/python -u benchmarks/multiseed_sweep.py \