Irina1920/WMB-100K

WMB-100K — Wontopos Memory Benchmark v2.0

An enterprise-scale situational benchmark for AI memory systems: 4.3M tokens, 2,708 questions.

Store 4.3M tokens (2.3M tokens of documents plus 105,591 conversation turns, about 2.0M tokens), then show that your memory system can retrieve the right information for real-world situations.


What WMB-100K Tests

WMB-100K measures one thing: whether your memory system can retrieve the right information for real situations.

  • Not LLM reasoning ability
  • Not response generation quality
  • Only situational retrieval accuracy + false memory defense

Memory systems don't answer questions — they provide information to LLMs. WMB-100K tests whether the memory system returned the right memories for the situation. The LLM interpretation is out of scope.


What Changed in V2

| | V1 | V2 |
|---|---|---|
| Questions | Fact lookup ("What time does the user wake up?") | Situational ("Should we schedule a morning meeting?") |
| Scoring | Keyword matching | GPT-4o-mini semantic judge |
| Focus | Did you find the fact? | Did you bring the right memories for the situation? |

Scale

| Benchmark | Turns | Tokens | Questions | False Memory Test |
|---|---|---|---|---|
| LOCOMO (Maharana et al., 2024) | 600 | ~50K | ~1,500 | No |
| LongMemEval (Wu et al., 2024) | ~1,000 | ~100K | 500 | No |
| WMB-100K | 105,591 | 4.3M | 2,708 | Yes (400) |

| Part | Data | Tokens |
|---|---|---|
| Part A | 10 document domains (Wikipedia, public domain) | 2.3M |
| Part B | 10 conversation categories (~10K turns each) | ~2.0M |
| Total | Store all 4.3M tokens to answer all questions | ~4.3M |

How It Works

Phase 1: Ingestion

Feed your memory system all data: 2.3M tokens of documents (10 domains) + 105,591 turns of conversation (10 categories).

Phase 2: Query

Ask 2,708 situational questions. Your system returns relevant memories.

Phase 3: Score

GPT-4o-mini evaluates whether returned memories contain the information needed for each situation.


Question Types

S1 — Scored (determines your final score)

Single-memory situational questions. One relevant memory needed to address the situation.

  • Part A S1: 1,000 questions (documents)
  • Part B S1: 1,000 questions (conversation)

Example: "A friend invites the user to an early breakfast at 7am. What should they know about the user's morning routine?"

S2-S7 — Analysis (accuracy % reported separately, not scored)

| Type | Description | Memories Needed | Difficulty |
|---|---|---|---|
| S2 Multi-Memory | Combine 2-3 memories | 2-3 | ★★ |
| S3 Cross-Category | Connect different domains | 2-3 | ★★★ |
| S4 Temporal | Track changes over time | 2+ | ★★★ |
| S5 Adversarial | Wrong premise, retrieve correct memory | 1-2 | ★★★★ |
| S6 Contradiction | User said conflicting things, retrieve both | 2+ | ★★★★ |
| S7 Reasoning Chain | 3+ memories needed in sequence | 3+ | ★★★★★ |

S2-S7 use the same GPT-4o-mini judge with binary CORRECT/WRONG verdicts. Results are reported as accuracy percentages and do not count toward the final score.
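As a rough sketch of how per-type accuracy could be tallied, the snippet below groups judge verdicts by question type. It assumes question IDs follow the `category.TYPE.number` pattern shown in the Data Schema example (for instance `daily_life.S1.001`); `accuracy_by_type` is a hypothetical helper, not a function from the repository scripts.

```python
from collections import defaultdict


def accuracy_by_type(verdicts: dict[str, bool]) -> dict[str, float]:
    """Return accuracy (in percent) per question type, e.g. {"S2": 50.0}."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for question_id, is_correct in verdicts.items():
        # Assumed ID layout: "<category>.<type>.<number>", e.g. "daily_life.S2.001".
        qtype = question_id.split(".")[1]
        totals[qtype] += 1
        correct[qtype] += int(is_correct)
    return {t: 100.0 * correct[t] / totals[t] for t in sorted(totals)}
```

Passing only S2-S7 verdicts keeps the analysis numbers separate from the scored S1 set.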

FM — False Memory (penalty)

400 questions about things never mentioned. Correct response: return nothing.


Scoring

Part A:  1,000 S1 questions × 0.1 = 100 points max
Part B:  1,000 S1 questions × 0.1 = 100 points max

Score = Part A / 2 + Part B / 2 - FM Penalty   (100 max)

FM Penalty = 0.25 × number of false positives (400 probes, so at most 100 points deducted)
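The scoring rule can be written as a small function. This is a minimal sketch of the formula as stated here, not the code in scripts/score.py; `wmb_score` and its parameter names are hypothetical, and the floor at zero is an assumption inferred from the V1 note, which reports 0.0 when the penalty exceeded the raw points.

```python
S1_POINTS = 0.1    # points per correct S1 question
FM_PENALTY = 0.25  # points deducted per false positive on an FM probe


def wmb_score(part_a_correct: int, part_b_correct: int, fm_false_positives: int) -> float:
    """Combine S1 results and FM false positives into a final score."""
    part_a = part_a_correct * S1_POINTS        # 0..100
    part_b = part_b_correct * S1_POINTS        # 0..100
    penalty = fm_false_positives * FM_PENALTY  # 0..100
    raw = part_a / 2 + part_b / 2 - penalty
    # Assumption: the reported score is floored at 0 (the V1 note reports 0.0
    # when the penalty exceeded the raw points).
    return max(0.0, raw)
```

For example, 800 correct Part A answers, 600 correct Part B answers, and 40 false positives work out to 80/2 + 60/2 - 10 = 60 points. Because each part is halved, a correct S1 answer adds 0.05 to the final total under this formula, while a false positive removes 0.25.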

GPT-4o-mini Judge

Each question includes required_memories — the specific information the system must return.

Judge Input:
  Question: "Should we schedule a morning meeting with the user?"
  Required: ["user wakes up at 7:15", "user is not a morning person"]
  Returned: [what your system returned]

Judge Output: CORRECT or WRONG

The exact judge prompt is in scripts/score.py. Temperature: 0. No partial credit.
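To make the judge's input and output concrete, here is a hedged sketch that formats one request in the Question/Required/Returned layout and parses the binary verdict. The real prompt lives in scripts/score.py and may be worded differently; `build_judge_input` and `parse_verdict` are hypothetical names used only for illustration.

```python
def build_judge_input(question: str, required: list[str], returned: list[str]) -> str:
    """Format one judge request in the Question/Required/Returned layout."""
    # An empty result is made explicit so the judge sees that nothing came back.
    returned_lines = [f"- {m}" for m in returned] or ["- (nothing returned)"]
    lines = [f"Question: {question}", "Required:"]
    lines += [f"- {m}" for m in required]
    lines.append("Returned:")
    lines += returned_lines
    return "\n".join(lines)


def parse_verdict(raw: str) -> bool:
    """Map the judge's reply to True (CORRECT) or False (WRONG)."""
    verdict = raw.strip().upper()
    if verdict.startswith("CORRECT"):
        return True
    if verdict.startswith("WRONG"):
        return False
    # No partial credit: anything else is treated as a malformed reply.
    raise ValueError(f"Unexpected judge output: {raw!r}")
```

Raising on unexpected output, rather than silently counting it as WRONG, makes malformed judge replies visible during a scoring run.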

FM Penalty Ratio (-0.25 vs +0.1)

The 2.5x penalty reflects that false memories are more harmful than missing memories. A missing memory means "I don't know" — inconvenient but safe. A false memory means confidently returning wrong information — potentially dangerous in production (wrong medical history, wrong legal details, wrong user preferences).

Grades

| Score | Grade |
|---|---|
| 90-100 | Exceptional |
| 80-89 | Excellent |
| 70-79 | Good |
| 60-69 | Fair |
| 50-59 | Below Average |
| 0-49 | Failing |
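The grade bands translate into a simple lookup. The sketch below assumes each band starts at its lower bound, so a fractional score such as 89.5 falls in the 80-89 band; the table does not say how fractional scores are handled, and `grade` is a hypothetical helper.

```python
# (lower bound, label) pairs, checked from highest to lowest.
GRADE_BANDS = [
    (90, "Exceptional"),
    (80, "Excellent"),
    (70, "Good"),
    (60, "Fair"),
    (50, "Below Average"),
]


def grade(score: float) -> str:
    """Return the grade label for a final score between 0 and 100."""
    for lower_bound, label in GRADE_BANDS:
        if score >= lower_bound:
            return label
    return "Failing"
```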

Results

WMB-100K v2.0

System Part A (/100) Part B (/100) Score (/100) Grade FM Defense
LangChain (FAISS) Testing
Mem0 (OSS v1.0.7) Testing

S2-S7 Accuracy:

System S2 S3 S4 S5 S6 S7
LangChain (FAISS)
Mem0 (OSS v1.0.7)

Results will be updated after V2 testing completes.

Note on V1 results: In V1 testing with keyword matching, Mem0 retrieved 84 correct memories out of 2,224 questions (3.8%) and LangChain retrieved 527 (23.7%). However, both scored 0.0 net because FM penalty (-100) exceeded raw points. The 0.0 score reflects net-after-penalty, not zero retrieval. V2 results with semantic scoring may differ.

Cross-Benchmark Comparison

System LOCOMO (600 turns) LongMemEval (1K turns) WMB-100K (100K turns)
Full Context (GPT-4) ~85% $1,638+ per run
Mem0 66.9% 49.0% Testing
OpenAI Memory 52.9% Not tested

Cost of Full Context Approach

| Model | Estimated cost per run |
|---|---|
| GPT-4o-mini | ~$98 |
| GPT-4o | ~$1,638 |
| Claude Sonnet | ~$1,967 |
| Claude Opus | ~$9,835 |

Estimates based on per-question context loading at March 2026 pricing.

Submit Your Results

To add your system to the leaderboard, open a GitHub Issue with your result.json file. We will verify and add it.


Data

Synthetic Conversation Data

Conversation data (Part B) was generated using Claude Haiku (Anthropic). Each category contains ~10K turns of synthetic dialogue with ~100 facts naturally embedded in noise. The data is synthetic — not real user conversations.

Known limitations of synthetic data:

  • Conversations may be more structured than real human dialogue
  • Fact distribution may be more uniform than organic conversations
  • Emotional/social dynamics may be simplified

Document Data (Part A)

Document data is sourced from Wikipedia (public domain, Creative Commons). 10 domains, ~230K tokens each.

Data Schema

Question format (all_questions.json):

{
  "id": "daily_life.S1.001",
  "category": "daily_life",
  "qtype": "S1Situational",
  "text": "A friend invites the user to breakfast at 7am. What should they know?",
  "gold_answer": "The user wakes up at 7:15 and is not a morning person.",
  "required_memories": ["user wakes up at 7:15", "user is not a morning person"],
  "gold_turn_ids": [120, 453],
  "points": 0.1,
  "false_penalty": 0.0
}
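A typed loader for this format might look like the following sketch. It assumes all_questions.json holds a JSON array of objects shaped like the example above; the `Question` dataclass and both helper functions are hypothetical and do not come from the repository.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Question:
    id: str
    category: str
    qtype: str
    text: str
    gold_answer: str
    required_memories: list[str]
    gold_turn_ids: list[int]
    points: float
    false_penalty: float


def parse_question(raw: dict) -> Question:
    """Build a Question, ignoring any keys not listed in the example schema."""
    known = {f.name for f in fields(Question)}
    return Question(**{k: v for k, v in raw.items() if k in known})


def load_questions(path: str) -> list[Question]:
    # Assumption: the file holds a JSON array of question objects.
    with open(path, encoding="utf-8") as fh:
        return [parse_question(item) for item in json.load(fh)]
```

A record missing one of the listed fields raises a `TypeError`, which surfaces schema drift early instead of producing partially filled questions.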

Conversation turn format ({category}.jsonl):

{
  "turn_id": 120,
  "speaker": "user",
  "text": "I usually wake up around 7:15, barely making it on time...",
  "embedded_facts": ["daily_life.004"]
}
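Since each line of a category file is one JSON object, the turns can be streamed without loading the whole file. The sketch below builds a lookup from fact ID to the turns that embed it; `index_facts` is a hypothetical helper, and it assumes that turns without facts either omit `embedded_facts` or carry an empty list.

```python
import json
from collections import defaultdict
from collections.abc import Iterable


def index_facts(lines: Iterable[str]) -> dict[str, list[int]]:
    """Map each embedded fact ID to the turn IDs that mention it."""
    index: dict[str, list[int]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        turn = json.loads(line)
        # Assumption: turns without facts omit the key or carry an empty list.
        for fact_id in turn.get("embedded_facts", []):
            index[fact_id].append(turn["turn_id"])
    return dict(index)
```

An open file handle can be passed directly as `lines`, since iterating a text file yields one line at a time.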

10 Conversation Categories

| # | Category | Topics | Facts |
|---|---|---|---|
| 1 | daily_life | Routines, meals, habits | 100 |
| 2 | relationships | Family, friends, partner | 100 |
| 3 | work_career | Projects, salary, promotion | 100 |
| 4 | health_fitness | Exercise, injuries, diet | 100 |
| 5 | travel_places | Trips, restaurants, moving | 100 |
| 6 | media_taste | Movies, books, music, games | 100 |
| 7 | finance_goals | Savings, loans, investments | 100 |
| 8 | pets_hobbies | Photography, climbing, cat | 100 |
| 9 | education_skills | Languages, courses, certs | 100 |
| 10 | beliefs_values | Philosophy, politics, goals | 100 |

10 Document Domains

| # | Domain | Source | Tokens |
|---|---|---|---|
| 1-10 | Daily Life, Economics, History, Law, Literature, Medicine, Philosophy, Psychology, Science, Technology | Wikipedia | ~230K each |

Quick Start

Requirements

  • Python 3.10+
  • pip install openai anthropic
  • OpenAI API key (for GPT-4o-mini scoring, ~$2-3)
  • Your memory system with store/search interface

Run

# 1. Clone
git clone https://github.com/Irina1920/WMB-100K
cd WMB-100K

# 2. Install dependencies
pip install openai anthropic

# 3. Write an adapter (see scripts/test_mem0.py for example)

# 4. Run your adapter
export OPENAI_API_KEY=sk-...
python scripts/your_adapter.py full

# 5. Score with GPT-4o-mini judge
python scripts/score.py

Adapter Template

# `your_system` is a placeholder for your own memory client object.

def store(user_id: str, content: str) -> None:
    """Store a memory."""
    your_system.add(content, user_id=user_id)

def search(user_id: str, query: str) -> list[str]:
    """Search memories, return relevant text."""
    results = your_system.search(query, user_id=user_id)
    return [r["text"] for r in results]

See scripts/test_mem0.py and scripts/test_langchain.py for working examples.
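For a dry run before wiring in a real backend, the template can be exercised against a toy in-memory store. Everything in this sketch is hypothetical: `InMemoryStore`, its word-overlap matching, and the `min_overlap` and `top_k` parameters are stand-ins chosen for illustration, not how Mem0 or LangChain behave.

```python
import re


def _words(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9:]+", text.lower()))


class InMemoryStore:
    """Toy stand-in for `your_system`; not a real memory backend."""

    def __init__(self) -> None:
        self._memories: dict[str, list[str]] = {}

    def add(self, content: str, user_id: str) -> None:
        self._memories.setdefault(user_id, []).append(content)

    def search(self, query: str, user_id: str,
               min_overlap: int = 2, top_k: int = 5) -> list[dict]:
        query_words = _words(query)
        scored = []
        for text in self._memories.get(user_id, []):
            overlap = len(query_words & _words(text))
            if overlap >= min_overlap:  # below the threshold, return nothing
                scored.append((overlap, text))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [{"text": text} for _, text in scored[:top_k]]


your_system = InMemoryStore()


def store(user_id: str, content: str) -> None:
    """Store a memory."""
    your_system.add(content, user_id=user_id)


def search(user_id: str, query: str) -> list[str]:
    """Search memories, return relevant text."""
    results = your_system.search(query, user_id=user_id)
    return [r["text"] for r in results]
```

The overlap threshold matters for the FM probes: a search that always returns its nearest matches, however weak, will return something for questions about things never mentioned, and each such result counts as a false positive.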


Limitations

  • Synthetic conversations: Generated by Claude Haiku, not real user data. Real conversations are messier, more ambiguous, and less structured.
  • English only: All questions and data are in English. Performance on other languages is untested.
  • Two systems tested: Only Mem0 and LangChain FAISS have been tested so far. Results may not generalize to all memory systems.
  • GPT-4o-mini judge: Semantic judgment is model-dependent. Different judge models may produce different scores.
  • Wikipedia documents: Part A uses Wikipedia text, which may not represent domain-specific enterprise documents.
  • FM question design: False memory probes are synthetically generated and may not cover all realistic hallucination patterns.

Cost to Run

| Step | Cost |
|---|---|
| Dataset (included in repo) | $0 |
| Scoring (GPT-4o-mini, 2,708 questions) | ~$2-3 |
| Your system's ingestion costs | Varies (4.3M tokens) |

Citation

@misc{wmb100k2026,
  title={WMB-100K: A 100,000-Turn Situational Benchmark for AI Memory Systems},
  author={Wontopos},
  year={2026},
  url={https://github.com/Irina1920/WMB-100K}
}

License

Apache 2.0 — Dataset, benchmark tool, and scoring system are free to use.


Contact

Maintained by Wontopos.

General official@wontopos.com
CEO sunwoo.ceo@wontopos.com
Marketing xcx135@wontopos.com
Frontend LoseWoo@wontopos.com
