A reproducible benchmark harness for Hebb Mind's retrieval and memory quality. Every benchmark runs against an isolated, ephemeral Hebb Mind server (its own port, workdir, and hebb.db) and drives the same production code paths a real user hits — the Claude Code ingest hooks (integrations/claude_code/{write,stop}.py) and the /api/v1/search endpoint. The numbers are what ships, not an eval-only pipeline.
Two homes, on purpose. This README is the contributor/operator guide: what we measure, how to run it, how reports are laid out. The user-facing result pages (headline numbers + framework comparisons) live in `repo_pages/benchmarks/` and are published to the docs site. Keep methodology/regeneration notes here; keep polished results there.
Pick the metric that matches each dataset's ground truth — don't default to LLM-judged QA when a clean retrieval-level identifier exists.
| Dataset | `--dataset` | Metric | `eval_version` | Notes |
|---|---|---|---|---|
| LoCoMo | `locomo` | session Recall@k | v4 | evidence is session-tagged; 1,986 q (1,978 scorable) |
| LoCoMo (QA) | `locomo-qa` | end-to-end QA | v1 | same retrieval, adds generate + judge |
| LongMemEval | `longmemeval` | session Recall@k + end-to-end QA | v3 | 500 q; QA uses the official reader + `get_anscheck_prompt` judge |
| LongMemEval (session-doc) | `longmemeval-session` | session Recall@k | v1 | one verbatim doc per session variant |
| ConvoMem | `convomem` | end-to-end QA judge | v3 | free-text answers; we deliberately skip the noisy substring metric |
| ConvoMem (substring) | `convomem-substring` | substring match | v1 | the dataset's own (noisy) metric, for reference |
| MemBench | `membench` | turn-level Hit@k | v1 | ground truth is a turn pointer (`target_step_id`); MCQ → no LLM judge |
| PersonaMem | `personamem` | end-to-end MCQ accuracy | v1 | 589 q, 4-option (chance 25%); one partition per (context, end_index) cut point; exact-match on the chosen letter, no LLM judge |
| MemoryArena | `memoryarena` | n/a | n/a | dataset adapter only; not in the runnable set |
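As a rough illustration of the session-level metric, the sketch below scores a question as a hit when any gold evidence session appears among the top-k retrieved sessions. The function and argument names are hypothetical, and reading the metric as "any gold session in the top k" is an assumption about the harness rather than a copy of its code. Rejecting questions without gold evidence is likewise an assumed way of handling unscorable items.

```python
from typing import Iterable, Sequence


def recall_any_at_k(retrieved: Sequence[str], gold: Iterable[str], k: int) -> float:
    """Return 1.0 if any gold session id is in the top-k retrieved ids, else 0.0."""
    gold_set = set(gold)
    if not gold_set:
        # Assumption: a question without gold evidence cannot be scored.
        raise ValueError("question has no gold evidence; it is not scorable")
    return 1.0 if gold_set.intersection(retrieved[:k]) else 0.0


def mean_recall(per_question: Sequence[float]) -> float:
    """Average the per-question hits into a single Recall@k figure."""
    return sum(per_question) / len(per_question)


scores = [
    recall_any_at_k(["s3", "s7", "s1"], gold=["s1"], k=3),  # hit
    recall_any_at_k(["s3", "s7", "s1"], gold=["s1"], k=2),  # miss: s1 is ranked third
    recall_any_at_k(["s2", "s9"], gold=["s9", "s4"], k=2),  # hit
]
print(mean_recall(scores))
```

The same shape works for turn-level Hit@k: swap session ids for turn ids.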
eval_version is sticky per methodology and lives on the benchmark class. Bump it only when the protocol changes (chunking, scoring metric, ingest mirror) — never per run.

    # 1. install eval extras
    pip install -e ".[eval]"

    # 2. judge-LLM credentials: eval/eval.json (model, base_url, api keys, concurrency,
    #    search weights). User-provided; do NOT commit real keys. Env overrides: HEBB_LLM_*.

    # 3. download a dataset
    python -m eval download --dataset longmemeval

    # 4. run (writes a fresh report under eval/reports/...)
    python -m eval run --dataset longmemeval --mode raw --top-k 10

    python -m eval list   # what's available / downloaded

Retrieval config is inherited from your `~/.hebb/hebb.json` (or a project-root `hebb.json`) via `prepare_workdir`, so the eval matches what you ship. Shipped default: `all-MiniLM-L6-v2` (384-d) + `bge-reranker-base` rerank.
Useful flags on run:
- `--mode raw` (no consolidation; always wipes + re-ingests) · `--mode consolidated` (runs consolidation; reuses the db unless `--rebuild`).
- `--skip-qa`: retrieval `Recall@k` only (fast); omit it to also run end-to-end QA.
- `--enable-rerank` / `--disable-rerank`, `--rerank-model`: override the inherited rerank setting without touching your config.
- `--disable-vector` / `--disable-fts5` / `--disable-graph-search` / `--disable-lexical-boost` / `--disable-temporal-boost` / `--disable-graph-expand`: 3-way-RRF + boost ablations.
- `--max-scenarios N` (+ `--scenario-offset`): smoke subset / fixed-size batches.
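For an ablation sweep, one option is to generate one `run` command per disabled component. The sketch below only assembles argument lists from the flags listed above; the `ablation_commands` helper is hypothetical, and the dataset and `--max-scenarios` values are arbitrary examples.

```python
import shlex
import sys

# Component switches documented for `run`.
ABLATION_FLAGS = [
    "--disable-vector",
    "--disable-fts5",
    "--disable-graph-search",
    "--disable-lexical-boost",
    "--disable-temporal-boost",
    "--disable-graph-expand",
]


def ablation_commands(dataset: str, max_scenarios: int) -> list[list[str]]:
    """Build a baseline command plus one command per disabled component."""
    base = [
        sys.executable, "-m", "eval", "run",
        "--dataset", dataset,
        "--mode", "raw",
        "--skip-qa",  # retrieval-only keeps the sweep fast
        "--max-scenarios", str(max_scenarios),
    ]
    return [base] + [base + [flag] for flag in ABLATION_FLAGS]


for cmd in ablation_commands("locomo", max_scenarios=20):
    print(shlex.join(cmd))  # pass each list to subprocess.run(...) to execute it
```

Because `--mode raw` always wipes and re-ingests, each command in the sweep starts from a clean database.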
Isolation (important). Each benchmark gets its own port (eval.client.BENCHMARK_PORTS, 8401–8409) and its own eval/workdirs/<name>-<mode>/ (dedicated hebb.json + hebb.db). The harness pins the server's HEBB_HOME to that workdir and never touches the daily hebb service on 8321. recall_strengthening_enabled is forced off during eval so retrieval is a fixed snapshot (query order can't perturb scores).
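The isolation scheme can be pictured roughly as follows. This is a sketch under stated assumptions: the port assignments are made up (only the 8401 to 8409 range is documented here), `isolated_env` is a hypothetical helper, and the real harness may launch the server differently.

```python
import os
from pathlib import Path

# Hypothetical assignments; the real table is eval.client.BENCHMARK_PORTS.
BENCHMARK_PORTS = {"locomo": 8401, "longmemeval": 8402}
DAILY_SERVICE_PORT = 8321  # the everyday hebb service, never used by the eval


def isolated_env(name: str, mode: str, root: Path = Path("eval/workdirs")):
    """Return (port, workdir, env) for one benchmark's ephemeral server."""
    port = BENCHMARK_PORTS[name]
    assert port != DAILY_SERVICE_PORT, "eval must not share the daily service port"
    workdir = root / f"{name}-{mode}"
    env = dict(os.environ)
    env["HEBB_HOME"] = str(workdir)  # pin the server to its own hebb.json + hebb.db
    return port, workdir, env


port, workdir, env = isolated_env("longmemeval", "raw")
print(port, workdir, env["HEBB_HOME"])
```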
The QA path follows the official LongMemEval protocol verbatim, so the number is comparable to the leaderboard / Zep / Mem0:
- Reader: `LLMJudge.generate_answer_official`, the neutral official `run_generation.py` prompt (no benchmark-specific tuning). Our tuned `_GENERATE_PROMPT` is kept for LoCoMo/ConvoMem.
- Judge: `LLMJudge.judge_anscheck`, a verbatim port of `get_anscheck_prompt`: per-question-type grading (preference → rubric, temporal → off-by-one tolerated, knowledge-update → updated value, abstention → `_abs` question ids).
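The per-question-type grading can be thought of as a lookup from question type to rubric. The sketch below is illustrative only: the type keys, the fallback rule, and the `grading_rule` helper are assumptions, the rule strings paraphrase the list above, and the actual prompt wording lives in the official `get_anscheck_prompt`.

```python
# Paraphrased from the list above; not the official prompt wording.
GRADING_RULES = {
    "preference": "grade against the rubric",
    "temporal": "tolerate off-by-one errors",
    "knowledge-update": "require the updated value",
}
DEFAULT_RULE = "response must contain the correct answer"  # assumed fallback


def grading_rule(question_id: str, question_type: str) -> str:
    """Pick the grading rule; abstention is detected from the question id."""
    if "_abs" in question_id:
        return "model should decline: the question is unanswerable"
    return GRADING_RULES.get(question_type, DEFAULT_RULE)


print(grading_rule("q17", "temporal"))
print(grading_rule("q42_abs", "temporal"))
```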
Reports are written to:

    eval/reports/<benchmark>/<eval_version>/run-N/<benchmark>.{md,json}

- `run-N` auto-increments (highest existing + 1): no dates in the path; stacked runs of the same protocol. Cleanup is the operator's call.
- The JSON carries per-question `individual_results` + `retrieval_metrics` (`recall_any@k`, `ndcg@k`, `qa_accuracy[_<category>]`); the MD is the human-readable summary with the config block.
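The `run-N` numbering rule (highest existing + 1) can be sketched as below. `next_run_dir` is a hypothetical helper, not the harness's function, and starting at `run-1` when no runs exist is an assumption.

```python
import re
import tempfile
from pathlib import Path

RUN_DIR = re.compile(r"run-(\d+)$")


def next_run_dir(version_dir: Path) -> Path:
    """Return version_dir/run-N where N is the highest existing run number + 1."""
    existing = [
        int(m.group(1))
        for p in version_dir.glob("run-*")
        if p.is_dir() and (m := RUN_DIR.match(p.name))
    ]
    # Assumption: numbering starts at run-1 when the directory has no runs yet.
    return version_dir / f"run-{max(existing, default=0) + 1}"


with tempfile.TemporaryDirectory() as tmp:
    version_dir = Path(tmp) / "longmemeval" / "v3"
    for n in (1, 2, 5):
        (version_dir / f"run-{n}").mkdir(parents=True)
    print(next_run_dir(version_dir).name)  # run-6
```

Taking the maximum rather than counting directories means a deleted intermediate run never causes a number to be reused.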
To refresh a published number:
- Re-run the benchmark → a new `run-N`.
- Update the figure and the `Source:` path in the relevant `repo_pages/benchmarks/<dataset>/*.md`, and its `zh/` mirror (per-language pages; don't mix languages in one file).
- Commit the referenced `run-N` report alongside the doc change: public pages cite the report path, so it must be in-tree to resolve. Intermediate/smoke runs stay untracked.
To add a benchmark:
- Adapter in `eval/datasets/` producing `EvalScenario` / `EvalQuestion`; register in `ADAPTERS`.
- Benchmark in `eval/benchmarks/` subclassing `BaseBenchmark` (set `dataset_name`, `eval_version`); register in `BENCHMARKS`.
- Allocate a port in `eval.client.BENCHMARK_PORTS`.
- Choose the metric from the table above by what the dataset's ground truth actually is.
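The steps above might look roughly like this in code. Everything here is a stand-in so the sketch runs on its own: the field names on `EvalScenario` / `EvalQuestion`, the shape of the registries, the `mybench` dataset, and the port number are all hypothetical, and the real classes under `eval/` may differ.

```python
from dataclasses import dataclass, field


# Stand-ins for the real harness types (field names are assumptions).
@dataclass
class EvalQuestion:
    question: str
    gold_ids: list[str]


@dataclass
class EvalScenario:
    scenario_id: str
    questions: list[EvalQuestion] = field(default_factory=list)


class BaseBenchmark:
    dataset_name: str = ""
    eval_version: str = ""


ADAPTERS: dict = {}
BENCHMARKS: dict = {}
BENCHMARK_PORTS: dict = {}


def load_mybench(rows: list[dict]) -> list[EvalScenario]:
    """Adapter: turn raw dataset rows into EvalScenario objects."""
    return [
        EvalScenario(r["id"], [EvalQuestion(q["text"], q["evidence"]) for q in r["qa"]])
        for r in rows
    ]


class MyBenchBenchmark(BaseBenchmark):
    dataset_name = "mybench"
    eval_version = "v1"  # bump only when the protocol changes, never per run


ADAPTERS["mybench"] = load_mybench
BENCHMARKS["mybench"] = MyBenchBenchmark
BENCHMARK_PORTS["mybench"] = 8409  # placeholder; pick a port no other benchmark uses

rows = [{"id": "s1", "qa": [{"text": "Where did we meet?", "evidence": ["t4"]}]}]
scenarios = ADAPTERS["mybench"](rows)
print(scenarios[0].questions[0].gold_ids)
```

In this hypothetical dataset the ground truth is a turn pointer, so a turn-level Hit@k would be the natural metric and no LLM judge is needed.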