Name	Name	Last commit message	Last commit date
parent directory ..
README.md	README.md
hotpotqa-eval.ts	hotpotqa-eval.ts
locomo-eval.ts	locomo-eval.ts
mem0-hotpotqa.py	mem0-hotpotqa.py

Name

Last commit message

Last commit date

hotpotqa-eval.ts

locomo-eval.ts

mem0-hotpotqa.py

Benchmarks

Datasets

HotpotQA — Multi-hop question answering. Each question requires finding and combining information from exactly 2 documents out of 10 (2 gold, 8 distractors). Tests graph-relative retrieval.
LoCoMo — Long-context conversational memory. 10 conversations, 695 questions across single-hop, multi-hop, and temporal categories.

Evaluation Scripts

HotpotQA (Ori flat vs Ori explore vs Mem0)

# Ori (flat + explore, head-to-head)
npx tsx bench/hotpotqa-eval.ts --n 50 --json

# Mem0 comparison (requires pip install mem0ai)
python bench/mem0-hotpotqa.py

LoCoMo

# Full benchmark (695 questions, all categories)
npx tsx bench/locomo-eval.ts --json

# Single conversation
npx tsx bench/locomo-eval.ts --sample 0

# Filter by question type (1=multi-hop, 2=single-hop, 3=temporal)
npx tsx bench/locomo-eval.ts --categories 1,2,3

Latest Results

HotpotQA (50 questions, same dataset, same scoring)

System	R@5	F1	LLM-F1	Speed	API for ingestion
Ori flat	87.0%	50.6%	40.3%	142s	None (local)
Ori explore	90.0%	52.3%	41.0%	142s	None (local)
Mem0	29.0%	25.7%	18.8%	1347s	~500 LLM calls

LoCoMo (695 questions)

Category	Count	Recall	F1	MRR	AnsF1
Single-hop	321	55.6%	19.9%	29.9%	70.6%
Multi-hop	282	38.2%	24.6%	38.8%	61.3%
Temporal	92	26.6%	12.5%	21.8%	45.5%
Overall	695	44.7%	20.8%	32.4%	63.5%

Data

data/hotpotqa-dev.json — HotpotQA dev set (200 questions)
data/locomo10.json — LoCoMo 10-conversation dataset

Results

JSON output from each benchmark run stored in results/ with timestamps.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

Benchmarks

Datasets

Evaluation Scripts

HotpotQA (Ori flat vs Ori explore vs Mem0)

LoCoMo

Latest Results

HotpotQA (50 questions, same dataset, same scoring)

LoCoMo (695 questions)

Data

Results

FilesExpand file tree

bench

Directory actions

More options

Directory actions

More options

Latest commit

History

bench

Folders and files

parent directory

README.md

Benchmarks

Datasets

Evaluation Scripts

HotpotQA (Ori flat vs Ori explore vs Mem0)

LoCoMo

Latest Results

HotpotQA (50 questions, same dataset, same scoring)

LoCoMo (695 questions)

Data

Results