An enterprise-scale situational benchmark for AI memory systems — 4.3M tokens, 2,708 questions.
Store 4.3M tokens (2.3M documents + 105K conversation turns), then prove your memory system can retrieve the right information for real-world situations.
WMB-100K measures: can your memory system retrieve the right information for real situations?
- Not LLM reasoning ability
- Not response generation quality
- Only situational retrieval accuracy + false memory defense
Memory systems don't answer questions — they provide information to LLMs. WMB-100K tests whether the memory system returned the right memories for the situation. The LLM interpretation is out of scope.
| | V1 | V2 |
|---|---|---|
| Questions | Fact lookup ("What time does the user wake up?") | Situational ("Should we schedule a morning meeting?") |
| Scoring | Keyword matching | GPT-4o-mini semantic judge |
| Focus | Did you find the fact? | Did you bring the right memories for the situation? |
| Benchmark | Turns | Tokens | Questions | False Memory Test |
|---|---|---|---|---|
| LOCOMO (Maharana et al., 2024) | 600 | ~50K | ~1,500 | No |
| LongMemEval (Wu et al., 2024) | ~1,000 | ~100K | 500 | No |
| WMB-100K | 105,591 | 4.3M | 2,708 | Yes (400) |
| Part | Data | Tokens |
|---|---|---|
| Part A | 10 document domains (Wikipedia, public domain) | 2.3M |
| Part B | 10 conversation categories (~10K turns each) | ~2.0M |
| Total | Store all 4.3M tokens to answer all questions | ~4.3M |
Feed your memory system all data: 2.3M tokens of documents (10 domains) + 105,591 turns of conversation (10 categories).
Ask 2,708 situational questions. Your system returns relevant memories.
GPT-4o-mini evaluates whether returned memories contain the information needed for each situation.
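The three steps above can be sketched end to end. Everything in this sketch is illustrative: `NaiveMemory` is a keyword-overlap stand-in for a real memory system, and the substring check stands in for the GPT-4o-mini semantic judge — neither is part of the repo.

```python
from dataclasses import dataclass, field

@dataclass
class NaiveMemory:
    """Stand-in memory system: stores raw turns, retrieves by word overlap."""
    turns: list = field(default_factory=list)

    def store(self, user_id: str, content: str) -> None:
        self.turns.append(content)

    def search(self, user_id: str, query: str) -> list[str]:
        words = set(query.lower().split())
        return [t for t in self.turns if words & set(t.lower().split())]

def evaluate(memory, questions):
    """Judge each question by whether every required memory string appears
    in the returned text (a crude proxy for the semantic judge)."""
    correct = 0
    for q in questions:
        returned = " ".join(memory.search("bench", q["text"]))
        if all(req in returned for req in q["required_memories"]):
            correct += 1
    return correct / len(questions)
```

The point of the harness shape: the memory system only stores and retrieves; correctness is decided afterwards by the judge, never by the system itself.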
Single-memory situational questions. One relevant memory needed to address the situation.
- Part A S1: 1,000 questions (documents)
- Part B S1: 1,000 questions (conversation)
Example: "A friend invites the user to an early breakfast at 7am. What should they know about the user's morning routine?"
| Type | Description | Memories Needed | Difficulty |
|---|---|---|---|
| S2 Multi-Memory | Combine 2-3 memories | 2-3 | ★★ |
| S3 Cross-Category | Connect different domains | 2-3 | ★★★ |
| S4 Temporal | Track changes over time | 2+ | ★★★ |
| S5 Adversarial | Wrong premise, retrieve correct memory | 1-2 | ★★★★ |
| S6 Contradiction | User said conflicting things, retrieve both | 2+ | ★★★★ |
| S7 Reasoning Chain | 3+ memories needed in sequence | 3+ | ★★★★★ |
S2-S7 use the same GPT-4o-mini judge with CORRECT/WRONG binary scoring, reported as accuracy percentages.
400 questions about things never mentioned. Correct response: return nothing.
Part A: 1,000 S1 questions × 0.1 = 100 points max
Part B: 1,000 S1 questions × 0.1 = 100 points max
Score = Part A / 2 + Part B / 2 - FM Penalty (100 points max)
FM Penalty: each false positive × -0.25 (400 probes, max -100)
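As a sanity check, the formula above works out as follows. `wmb_score` is a hypothetical helper, not part of the repo; the clamp at zero matches how net scores are reported (see the V1 results note below).

```python
def wmb_score(correct_a: int, correct_b: int, false_positives: int) -> float:
    """Net WMB-100K score: each S1 question is worth 0.1 points,
    each false-memory probe answered with invented content costs 0.25."""
    part_a = correct_a * 0.1             # max 100 (1,000 Part A questions)
    part_b = correct_b * 0.1             # max 100 (1,000 Part B questions)
    fm_penalty = false_positives * 0.25  # max 100 (400 probes)
    return max(0.0, part_a / 2 + part_b / 2 - fm_penalty)
```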
Each question includes required_memories — the specific information the system must return.
Judge Input:

```
Question: "Should we schedule a morning meeting with the user?"
Required: ["user wakes up at 7:15", "user is not a morning person"]
Returned: [what your system returned]
```

Judge Output: CORRECT or WRONG
The exact judge prompt is in scripts/score.py. Temperature: 0. No partial credit.
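For orientation only, a judge call could be framed as below. The wording is illustrative and is not the prompt used in scripts/score.py; `build_judge_messages` and `parse_verdict` are hypothetical names.

```python
def build_judge_messages(question: str, required: list[str],
                         returned: list[str]) -> list[dict]:
    """Frame a binary CORRECT/WRONG grading request (to be sent with
    temperature 0 to the judge model)."""
    prompt = (
        "You are grading a memory system.\n"
        f"Question: {question}\n"
        f"Required memories: {required}\n"
        f"Returned memories: {returned}\n"
        "Reply with exactly one word: CORRECT if the returned memories "
        "contain all the required information, otherwise WRONG."
    )
    return [{"role": "user", "content": prompt}]

def parse_verdict(raw: str) -> bool:
    """Map the judge's text back to a binary outcome; no partial credit."""
    return raw.strip().upper().startswith("CORRECT")
```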
The 2.5x penalty reflects that false memories are more harmful than missing memories. A missing memory means "I don't know" — inconvenient but safe. A false memory means confidently returning wrong information — potentially dangerous in production (wrong medical history, wrong legal details, wrong user preferences).
| Score | Grade |
|---|---|
| 90-100 | Exceptional |
| 80-89 | Excellent |
| 70-79 | Good |
| 60-69 | Fair |
| 50-59 | Below Average |
| 0-49 | Failing |
| System | Part A (/100) | Part B (/100) | Score (/100) | Grade | FM Defense |
|---|---|---|---|---|---|
| LangChain (FAISS) | — | Testing | — | — | — |
| Mem0 (OSS v1.0.7) | — | Testing | — | — | — |
S2-S7 Accuracy:
| System | S2 | S3 | S4 | S5 | S6 | S7 |
|---|---|---|---|---|---|---|
| LangChain (FAISS) | — | — | — | — | — | — |
| Mem0 (OSS v1.0.7) | — | — | — | — | — | — |
Results will be updated after V2 testing completes.
Note on V1 results: In V1 testing with keyword matching, Mem0 retrieved 84 correct memories out of 2,224 questions (3.8%) and LangChain retrieved 527 (23.7%). However, both scored 0.0 net because FM penalty (-100) exceeded raw points. The 0.0 score reflects net-after-penalty, not zero retrieval. V2 results with semantic scoring may differ.
| System | LOCOMO (600 turns) | LongMemEval (1K turns) | WMB-100K (100K turns) |
|---|---|---|---|
| Full Context (GPT-4) | ~85% | — | $1,638+ per run |
| Mem0 | 66.9% | 49.0% | Testing |
| OpenAI Memory | 52.9% | — | Not tested |
| Model | Estimated cost per run |
|---|---|
| GPT-4o-mini | ~$98 |
| GPT-4o | ~$1,638 |
| Claude Sonnet | ~$1,967 |
| Claude Opus | ~$9,835 |
Estimates based on per-question context loading at March 2026 pricing.
To add your system to the leaderboard, open a GitHub Issue with your result.json file. We will verify and add it.
Conversation data (Part B) was generated using Claude Haiku (Anthropic). Each category contains ~10K turns of synthetic dialogue with ~100 facts naturally embedded in noise. The data is synthetic — not real user conversations.
Known limitations of synthetic data:
- Conversations may be more structured than real human dialogue
- Fact distribution may be more uniform than organic conversations
- Emotional/social dynamics may be simplified
Document data is sourced from Wikipedia (licensed under Creative Commons BY-SA). 10 domains, ~230K tokens each.
Question format (all_questions.json):

```json
{
  "id": "daily_life.S1.001",
  "category": "daily_life",
  "qtype": "S1Situational",
  "text": "A friend invites the user to breakfast at 7am. What should they know?",
  "gold_answer": "The user wakes up at 7:15 and is not a morning person.",
  "required_memories": ["user wakes up at 7:15", "user is not a morning person"],
  "gold_turn_ids": [120, 453],
  "points": 0.1,
  "false_penalty": 0.0
}
```

Conversation turn format ({category}.jsonl):
```json
{
  "turn_id": 120,
  "speaker": "user",
  "text": "I usually wake up around 7:15, barely making it on time...",
  "embedded_facts": ["daily_life.004"]
}
```

| # | Category | Topics | Facts |
|---|---|---|---|
| 1 | daily_life | Routines, meals, habits | 100 |
| 2 | relationships | Family, friends, partner | 100 |
| 3 | work_career | Projects, salary, promotion | 100 |
| 4 | health_fitness | Exercise, injuries, diet | 100 |
| 5 | travel_places | Trips, restaurants, moving | 100 |
| 6 | media_taste | Movies, books, music, games | 100 |
| 7 | finance_goals | Savings, loans, investments | 100 |
| 8 | pets_hobbies | Photography, climbing, cat | 100 |
| 9 | education_skills | Languages, courses, certs | 100 |
| 10 | beliefs_values | Philosophy, politics, goals | 100 |
| # | Domain | Source | Tokens |
|---|---|---|---|
| 1-10 | Daily Life, Economics, History, Law, Literature, Medicine, Philosophy, Psychology, Science, Technology | Wikipedia | ~230K each |
- Python 3.10+
- `pip install openai anthropic`
- OpenAI API key (for GPT-4o-mini scoring, ~$2-3)
- Your memory system with a store/search interface
```shell
# 1. Clone
git clone https://github.com/Irina1920/WMB-100K
cd WMB-100K

# 2. Install dependencies
pip install openai anthropic

# 3. Write an adapter (see scripts/test_mem0.py for an example)

# 4. Run your adapter
export OPENAI_API_KEY=sk-...
python scripts/your_adapter.py full

# 5. Score with the GPT-4o-mini judge
python scripts/score.py
```

An adapter implements two functions:

```python
def store(user_id: str, content: str) -> None:
    """Store a memory."""
    your_system.add(content, user_id=user_id)

def search(user_id: str, query: str) -> list[str]:
    """Search memories, return relevant text."""
    results = your_system.search(query, user_id=user_id)
    return [r["text"] for r in results]
```

See scripts/test_mem0.py and scripts/test_langchain.py for working examples.
- Synthetic conversations: Generated by Claude Haiku, not real user data. Real conversations are messier, more ambiguous, and less structured.
- English only: All questions and data are in English. Performance on other languages is untested.
- Two systems tested: Only Mem0 and LangChain FAISS have been tested so far. Results may not generalize to all memory systems.
- GPT-4o-mini judge: Semantic judgment is model-dependent. Different judge models may produce different scores.
- Wikipedia documents: Part A uses Wikipedia text, which may not represent domain-specific enterprise documents.
- FM question design: False memory probes are synthetically generated and may not cover all realistic hallucination patterns.
| Step | Cost |
|---|---|
| Dataset (included in repo) | $0 |
| Scoring (GPT-4o-mini, 2,708 questions) | ~$2-3 |
| Your system's ingestion costs | Varies (4.3M tokens) |
```bibtex
@misc{wmb100k2026,
  title={WMB-100K: A 100,000-Turn Situational Benchmark for AI Memory Systems},
  author={Wontopos},
  year={2026},
  url={https://github.com/Irina1920/WMB-100K}
}
```

Apache 2.0 — Dataset, benchmark tool, and scoring system are free to use.
Maintained by Wontopos.
| Contact | Email |
|---|---|
| General | official@wontopos.com |
| CEO | sunwoo.ceo@wontopos.com |
| Marketing | xcx135@wontopos.com |
| Frontend | LoseWoo@wontopos.com |