An enterprise-scale situational benchmark for AI memory systems — 4.3M tokens, 2,708 questions.
Store 4.3M tokens (2.3M documents + 105K conversation turns), then prove your memory system can retrieve the right information for real-world situations.
WMB-100K measures: can your memory system retrieve the right information for real situations?
- Not LLM reasoning ability
- Not response generation quality
- Only situational retrieval accuracy + false memory defense
Memory systems don't answer questions — they provide information to LLMs. WMB-100K tests whether the memory system returned the right memories for the situation. The LLM interpretation is out of scope.
| | V1 | V2 |
|---|---|---|
| Questions | Fact lookup ("What time does the user wake up?") | Situational ("Should we schedule a morning meeting?") |
| Scoring | Keyword matching | GPT-4o-mini semantic judge |
| Focus | Did you find the fact? | Did you bring the right memories for the situation? |
| Benchmark | Turns | Tokens | Questions | False Memory Test |
|---|---|---|---|---|
| LOCOMO (Maharana et al., 2024) | 600 | ~50K | ~1,500 | No |
| LongMemEval (Wu et al., 2024) | ~1,000 | ~100K | 500 | No |
| WMB-100K | 105,591 | 4.3M | 2,708 | Yes (400) |
| Part | Data | Tokens |
|---|---|---|
| Part A | 10 document domains (Wikipedia, public domain) | 2.3M |
| Part B | 10 conversation categories (~10K turns each) | ~2.0M |
| Total | Store all 4.3M tokens to answer all questions | ~4.3M |
Feed your memory system all data: 2.3M tokens of documents (10 domains) + 105,591 turns of conversation (10 categories).
Ask 2,708 situational questions. Your system returns relevant memories.
GPT-4o-mini evaluates whether returned memories contain the information needed for each situation.
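The three steps above can be sketched end to end. Everything in this sketch is illustrative: `NaiveMemory` is a keyword-overlap stand-in for a real memory system, and the substring check stands in for the GPT-4o-mini semantic judge — neither is part of the repo.

```python
from dataclasses import dataclass, field

@dataclass
class NaiveMemory:
    """Stand-in memory system: stores raw turns, retrieves by word overlap."""
    turns: list = field(default_factory=list)

    def store(self, user_id: str, content: str) -> None:
        self.turns.append(content)

    def search(self, user_id: str, query: str) -> list[str]:
        words = set(query.lower().split())
        return [t for t in self.turns if words & set(t.lower().split())]

def evaluate(memory, questions):
    """Judge each question by whether every required memory string appears
    in the returned text (a crude proxy for the semantic judge)."""
    correct = 0
    for q in questions:
        returned = " ".join(memory.search("bench", q["text"]))
        if all(req in returned for req in q["required_memories"]):
            correct += 1
    return correct / len(questions)
```

The point of the harness shape: the memory system only stores and retrieves; correctness is decided afterwards by the judge, never by the system itself.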
Single-memory situational questions. One relevant memory needed to address the situation.
- Part A S1: 1,000 questions (documents)
- Part B S1: 1,000 questions (conversation)
Example: "A friend invites the user to an early breakfast at 7am. What should they know about the user's morning routine?"
| Type | Description | Memories Needed | Difficulty |
|---|---|---|---|
| S2 Multi-Memory | Combine 2-3 memories | 2-3 | ★★ |
| S3 Cross-Category | Connect different domains | 2-3 | ★★★ |
| S4 Temporal | Track changes over time | 2+ | ★★★ |
| S5 Adversarial | Wrong premise, retrieve correct memory | 1-2 | ★★★★ |
| S6 Contradiction | User said conflicting things, retrieve both | 2+ | ★★★★ |
| S7 Reasoning Chain | 3+ memories needed in sequence | 3+ | ★★★★★ |
S2-S7 use the same GPT-4o-mini judge with CORRECT/WRONG binary scoring, reported as accuracy percentages.
400 questions about things never mentioned. Correct response: return nothing.
Part A: 1,000 S1 questions × 0.1 = 100 points max
Part B: 1,000 S1 questions × 0.1 = 100 points max
Score = Part A / 2 + Part B / 2 - FM Penalty (100 points max)
FM Penalty: each false positive × -0.25 (400 probes, max -100)
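As a sanity check, the formula above works out as follows. `wmb_score` is a hypothetical helper, not part of the repo; the clamp at zero matches how net scores are reported (see the V1 results note below).

```python
def wmb_score(correct_a: int, correct_b: int, false_positives: int) -> float:
    """Net WMB-100K score: each S1 question is worth 0.1 points,
    each false-memory probe answered with invented content costs 0.25."""
    part_a = correct_a * 0.1             # max 100 (1,000 Part A questions)
    part_b = correct_b * 0.1             # max 100 (1,000 Part B questions)
    fm_penalty = false_positives * 0.25  # max 100 (400 probes)
    return max(0.0, part_a / 2 + part_b / 2 - fm_penalty)
```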
Each question includes required_memories — the specific information the system must return.
Judge Input:

```
Question: "Should we schedule a morning meeting with the user?"
Required: ["user wakes up at 7:15", "user is not a morning person"]
Returned: [what your system returned]
```

Judge Output: CORRECT or WRONG
The exact judge prompt is in scripts/score.py. Temperature: 0. No partial credit.
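For orientation only, a judge call could be framed as below. The wording is illustrative and is not the prompt used in scripts/score.py; `build_judge_messages` and `parse_verdict` are hypothetical names.

```python
def build_judge_messages(question: str, required: list[str],
                         returned: list[str]) -> list[dict]:
    """Frame a binary CORRECT/WRONG grading request (to be sent with
    temperature 0 to the judge model)."""
    prompt = (
        "You are grading a memory system.\n"
        f"Question: {question}\n"
        f"Required memories: {required}\n"
        f"Returned memories: {returned}\n"
        "Reply with exactly one word: CORRECT if the returned memories "
        "contain all the required information, otherwise WRONG."
    )
    return [{"role": "user", "content": prompt}]

def parse_verdict(raw: str) -> bool:
    """Map the judge's text back to a binary outcome; no partial credit."""
    return raw.strip().upper().startswith("CORRECT")
```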
The 2.5x penalty reflects that false memories are more harmful than missing memories. A missing memory means "I don't know" — inconvenient but safe. A false memory means confidently returning wrong information — potentially dangerous in production (wrong medical history, wrong legal details, wrong user preferences).
| Score | Grade |
|---|---|
| 90-100 | Exceptional |
| 80-89 | Excellent |
| 70-79 | Good |
| 60-69 | Fair |
| 50-59 | Below Average |
| 0-49 | Failing |
| System | Part A (/100) | Part B (/100) | Score (/100) | Grade | FM Defense |
|---|---|---|---|---|---|
| LangChain (FAISS) | — | Testing | — | — | — |
| Mem0 (OSS v1.0.7) | — | Testing | — | — | — |
S2-S7 Accuracy:
| System | S2 | S3 | S4 | S5 | S6 | S7 |
|---|---|---|---|---|---|---|
| LangChain (FAISS) | — | — | — | — | — | — |
| Mem0 (OSS v1.0.7) | — | — | — | — | — | — |
Results will be updated after V2 testing completes.
Note on V1 results: In V1 testing with keyword matching, Mem0 retrieved 84 correct memories out of 2,224 questions (3.8%) and LangChain retrieved 527 (23.7%). However, both scored 0.0 net because FM penalty (-100) exceeded raw points. The 0.0 score reflects net-after-penalty, not zero retrieval. V2 results with semantic scoring may differ.
| System | LOCOMO (600 turns) | LongMemEval (1K turns) | WMB-100K (100K turns) |
|---|---|---|---|
| Full Context (GPT-4) | ~85% | — | $1,638+ per run |
| Mem0 | 66.9% | 49.0% | Testing |
| OpenAI Memory | 52.9% | — | Not tested |
| Model | Estimated cost per run |
|---|---|
| GPT-4o-mini | ~$98 |
| GPT-4o | ~$1,638 |
| Claude Sonnet | ~$1,967 |
| Claude Opus | ~$9,835 |
Estimates based on per-question context loading at March 2026 pricing.
To add your system to the leaderboard, open a GitHub Issue with your result.json file. We will verify and add it.
Conversation data (Part B) was generated using Claude Haiku (Anthropic). Each category contains ~10K turns of synthetic dialogue with ~100 facts naturally embedded in noise. The data is synthetic — not real user conversations.
Known limitations of synthetic data:
- Conversations may be more structured than real human dialogue
- Fact distribution may be more uniform than organic conversations
- Emotional/social dynamics may be simplified
Document data is sourced from Wikipedia (licensed under Creative Commons BY-SA). 10 domains, ~230K tokens each.
Question format (all_questions.json):

```json
{
  "id": "daily_life.S1.001",
  "category": "daily_life",
  "qtype": "S1Situational",
  "text": "A friend invites the user to breakfast at 7am. What should they know?",
  "gold_answer": "The user wakes up at 7:15 and is not a morning person.",
  "required_memories": ["user wakes up at 7:15", "user is not a morning person"],
  "gold_turn_ids": [120, 453],
  "points": 0.1,
  "false_penalty": 0.0
}
```

Conversation turn format ({category}.jsonl):
```json
{
  "turn_id": 120,
  "speaker": "user",
  "text": "I usually wake up around 7:15, barely making it on time...",
  "embedded_facts": ["daily_life.004"]
}
```

| # | Category | Topics | Facts |
|---|---|---|---|
| 1 | daily_life | Routines, meals, habits | 100 |
| 2 | relationships | Family, friends, partner | 100 |
| 3 | work_career | Projects, salary, promotion | 100 |
| 4 | health_fitness | Exercise, injuries, diet | 100 |
| 5 | travel_places | Trips, restaurants, moving | 100 |
| 6 | media_taste | Movies, books, music, games | 100 |
| 7 | finance_goals | Savings, loans, investments | 100 |
| 8 | pets_hobbies | Photography, climbing, cat | 100 |
| 9 | education_skills | Languages, courses, certs | 100 |
| 10 | beliefs_values | Philosophy, politics, goals | 100 |
| # | Domain | Source | Tokens |
|---|---|---|---|
| 1-10 | Daily Life, Economics, History, Law, Literature, Medicine, Philosophy, Psychology, Science, Technology | Wikipedia | ~230K each |
- Python 3.10+
- `pip install openai anthropic`
- OpenAI API key (for GPT-4o-mini scoring, ~$2-3)
- Your memory system with a store/search interface
```shell
# 1. Clone
git clone https://github.com/Irina1920/WMB-100K
cd WMB-100K

# 2. Install dependencies
pip install openai anthropic

# 3. Write an adapter (see scripts/test_mem0.py for an example)

# 4. Run your adapter
export OPENAI_API_KEY=sk-...
python scripts/your_adapter.py full

# 5. Score with the GPT-4o-mini judge
python scripts/score.py
```

An adapter implements two functions:

```python
def store(user_id: str, content: str) -> None:
    """Store a memory."""
    your_system.add(content, user_id=user_id)

def search(user_id: str, query: str) -> list[str]:
    """Search memories, return relevant text."""
    results = your_system.search(query, user_id=user_id)
    return [r["text"] for r in results]
```

See scripts/test_mem0.py and scripts/test_langchain.py for working examples.
- Synthetic conversations: Generated by Claude Haiku, not real user data. Real conversations are messier, more ambiguous, and less structured.
- English only: All questions and data are in English. Performance on other languages is untested.
- Two systems tested: Only Mem0 and LangChain FAISS have been tested so far. Results may not generalize to all memory systems.
- GPT-4o-mini judge: Semantic judgment is model-dependent. Different judge models may produce different scores.
- Wikipedia documents: Part A uses Wikipedia text, which may not represent domain-specific enterprise documents.
- FM question design: False memory probes are synthetically generated and may not cover all realistic hallucination patterns.
| Step | Cost |
|---|---|
| Dataset (included in repo) | $0 |
| Scoring (GPT-4o-mini, 2,708 questions) | ~$2-3 |
| Your system's ingestion costs | Varies (4.3M tokens) |
```bibtex
@misc{wmb100k2026,
  title={WMB-100K: A 100,000-Turn Situational Benchmark for AI Memory Systems},
  author={Wontopos},
  year={2026},
  url={https://github.com/Irina1920/WMB-100K}
}
```

Apache 2.0 — Dataset, benchmark tool, and scoring system are free to use.
Maintained by Wontopos.
| Contact | Email |
|---|---|
| General | official@wontopos.com |
| CEO | sunwoo.ceo@wontopos.com |
| Marketing | xcx135@wontopos.com |
| Frontend | LoseWoo@wontopos.com |