README.md: 36 additions & 0 deletions

@@ -15,6 +15,42 @@ memweave is a zero-infrastructure, async-first Python library that gives AI agen

---
## 📊 Benchmark — LongMemEval-S
Evaluated on [LongMemEval-S](https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned) — a 500-question benchmark covering multi-session memory, temporal reasoning, knowledge updates, and user preferences. Primary metric: **Recall@5** (any correct session in the top-5 results).
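
The Recall@5 definition above (a question counts as a hit when any correct session appears in the top-5 results) can be sketched as follows. This is a minimal illustration, not the benchmark harness in `benchmarks/`: the function names and the `(ranked_ids, correct_ids)` input shape are assumptions made for this example, and the toy data is not benchmark output.

```python
from typing import Iterable, Sequence, Set, Tuple


def hit_at_k(ranked_ids: Sequence[str], correct_ids: Set[str], k: int = 5) -> bool:
    """Return True if any correct session id appears in the first k results."""
    return any(sid in correct_ids for sid in ranked_ids[:k])


def recall_at_k(results: Iterable[Tuple[Sequence[str], Set[str]]], k: int = 5) -> float:
    """Fraction of questions with at least one correct session in the top k.

    `results` holds one (ranked_ids, correct_ids) pair per question
    (a hypothetical input shape chosen for this sketch).
    """
    results = list(results)
    if not results:
        raise ValueError("no questions to evaluate")
    hits = sum(hit_at_k(ranked, correct, k) for ranked, correct in results)
    return hits / len(results)


# Toy data, illustrative only.
toy = [
    (["s3", "s1", "s9"], {"s1"}),                    # first hit at rank 2
    (["s4", "s5", "s6", "s7", "s8", "s2"], {"s2"}),  # first hit at rank 6
]
print(recall_at_k(toy, k=5))   # 0.5
print(recall_at_k(toy, k=10))  # 1.0
```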
### Comparison with mempalace — held-out split (450 questions)
Same conditions: same dataset, same 50/450 dev/held-out split, same embedding model (`all-MiniLM-L6-v2` via Ollama — local, no API key). Parameters tuned on dev only; held-out is a single clean measurement with no post-hoc tuning. **No LLM, no API key, and no cloud service at any stage.**
> ECR: confidence-adaptive entity boost · IDF: corpus-relative keyword boost · CAATB: additive confidence-adaptive temporal boost. Three lightweight heuristic post-processors with zero neural inference, implemented as custom plugins via `mem.register_postprocessor()` and not bundled with `pip install memweave`. Details and source in [`benchmarks/`](benchmarks/).

**memweave achieves 100% recall at R@23, 7 ranks earlier than [mempalace (R@30)](https://github.com/MemPalace/mempalace/blob/main/benchmarks/results_mempal_hybrid_v4_held_out_session_20260414_1634.jsonl).** A downstream re-ranker or LLM pass that operates on a fixed top-K window can therefore use a smaller window (K = 23 rather than 30) and still have a correct session in context for every question in this comparison.
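
The "100% recall at R@K" figure is the smallest K at which every question has a correct session in its top-K results, which equals the worst first-hit rank across questions. A minimal sketch, assuming the same hypothetical `(ranked_ids, correct_ids)` pairs as input (these names are not part of memweave or mempalace):

```python
from typing import Iterable, Optional, Sequence, Set, Tuple


def first_hit_rank(ranked_ids: Sequence[str], correct_ids: Set[str]) -> Optional[int]:
    """1-based rank of the first correct session, or None if none was retrieved."""
    for rank, sid in enumerate(ranked_ids, start=1):
        if sid in correct_ids:
            return rank
    return None


def min_k_for_full_recall(
    results: Iterable[Tuple[Sequence[str], Set[str]]]
) -> Optional[int]:
    """Smallest K with Recall@K == 100%, or None if some question never hits."""
    ranks = [first_hit_rank(ranked, correct) for ranked, correct in results]
    if not ranks or any(rank is None for rank in ranks):
        return None
    # Full recall needs a window wide enough for the hardest question.
    return max(ranks)


# Toy data, illustrative only.
toy = [
    (["s3", "s1", "s9"], {"s1"}),                    # first hit at rank 2
    (["s4", "s5", "s6", "s7", "s8", "s2"], {"s2"}),  # first hit at rank 6
]
print(min_k_for_full_recall(toy))  # 6
```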

mempalace Hybrid v4 injects synthetic preference documents at ingestion time: 16 heuristic regex patterns (`"I prefer…"`, `"always use…"`, etc.) generate additional index entries per session. memweave reaches 98.00% without any ingestion-time augmentation.

The pipeline was re-evaluated on 5 independent stratified 50/450 splits (seeds 42, 0, 1, 2, 3), each with its own hyperparameter search on its own dev set. No information leaks across splits.

| Metric | Mean | ±Std |
|--------|------|------|
| **R@5** | **97.24%** | **±0.12%** |
| R@10 | 98.76% | ±0.12% |
| R@25 | 100.00% | ±0.00% |
| NDCG@5 | 92.28% | ±0.69% |

The ±0.12% standard deviation in R@5 indicates that results are stable across the five splits.
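
The multi-split protocol (one stratified dev/held-out split per seed, tuning on dev only, then mean and standard deviation over the held-out scores) can be sketched as below. This is an illustration under stated assumptions rather than the code in `benchmarks/`: stratification by question type, the `tune` and `evaluate` callables, and the use of the sample standard deviation are all assumptions, since the text does not specify them.

```python
import random
import statistics
from collections import defaultdict

SEEDS = [42, 0, 1, 2, 3]  # the seeds listed in the text


def stratified_split(questions, dev_size, seed):
    """Split (question_id, question_type) pairs into dev and held-out ids.

    Each type contributes roughly dev_size / len(questions) of its items to dev.
    Because of rounding, the dev set may differ from dev_size by a few items.
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for qid, qtype in questions:
        by_type[qtype].append(qid)
    total = len(questions)
    dev, held_out = [], []
    for qtype in sorted(by_type):  # sorted for reproducibility
        ids = by_type[qtype][:]
        rng.shuffle(ids)
        n_dev = round(len(ids) * dev_size / total)
        dev.extend(ids[:n_dev])
        held_out.extend(ids[n_dev:])
    return dev, held_out


def run_protocol(questions, dev_size, tune, evaluate, seeds=SEEDS):
    """Return (mean, sample std) of the held-out score across seeds.

    `tune(dev_ids)` and `evaluate(held_out_ids, params)` are hypothetical
    user-supplied callables standing in for the real search and scoring.
    """
    scores = []
    for seed in seeds:
        dev, held_out = stratified_split(questions, dev_size, seed)
        params = tune(dev)                         # sees this split's dev set only
        scores.append(evaluate(held_out, params))  # single held-out measurement
    return statistics.mean(scores), statistics.stdev(scores)
```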

Full benchmark methodology, per-type breakdown, and step-by-step reproduction instructions: [`benchmarks/`](benchmarks/).

---
## 💡 Why memweave?
- 📄 **Human-readable by design.** Memories live in plain `.md` files on disk. Open them in your editor, inspect them in your terminal, or `git diff` what your agent learned between runs.