Hippo Benchmarks

Two benchmarks, two different questions.

1. Sequential Learning Benchmark

Question: Does the memory system help agents learn from mistakes over time?

No other public benchmark tests this. LongMemEval, LoCoMo, and ConvoMem all test retrieval accuracy on a fixed corpus. None of them measure whether an agent with memory performs better over a sequence of tasks than one without.

cd sequential-learning
node run.mjs --adapter all

Results (hippo v0.11.0):

Condition	Overall	Early	Mid	Late	Learns?
No memory	100%	100%	100%	100%	No
Static memory	20%	33%	11%	14%	No
Hippo	40%	78%	22%	14%	Yes

All figures in the table are trap-hit rates, so lower is better. The hippo agent's trap-hit rate drops from 78% (early) to 14% (late) as it accumulates error memories. The no-memory baseline hits every trap. Static memory helps but doesn't improve over time.
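The early/mid/late comparison can be reproduced from per-task outcomes. The following is a minimal sketch, assuming each task yields a boolean (True when the agent hit the trap) and that the phases are contiguous thirds of the task sequence. The actual phase boundaries and record format used by run.mjs are not stated here, and `trap_hit_rates` and `learns` are hypothetical helpers.

```python
from typing import Dict, List


def trap_hit_rates(hits: List[bool]) -> Dict[str, float]:
    """Return the trap-hit rate overall and for the early/mid/late thirds."""
    n = len(hits)
    if n == 0:
        raise ValueError("need at least one task outcome")
    third = n // 3
    # Assumption: phases are contiguous thirds; the late phase absorbs any remainder.
    phases = {
        "early": hits[:third],
        "mid": hits[third:2 * third],
        "late": hits[2 * third:],
    }
    rates = {"overall": sum(hits) / n}
    for name, chunk in phases.items():
        rates[name] = sum(chunk) / len(chunk) if chunk else float("nan")
    return rates


def learns(rates: Dict[str, float]) -> bool:
    """Assumed criterion: the late trap-hit rate is lower than the early one."""
    return rates["late"] < rates["early"]


# Made-up outcomes for illustration only (True = the agent hit the trap).
example = [True, True, False, True, False, False, False, False, False]
rates = trap_hit_rates(example)
```

An agent that hits every trap, like the no-memory baseline, has equal early and late rates, so `learns` returns False for it.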

Adding your own memory system: Implement the adapter interface in sequential-learning/adapters/interface.mjs and run:

node run.mjs --adapter your-adapter

Zero dependencies. Node.js 22.5+ only.

2. LongMemEval Integration

Question: How accurately can hippo retrieve the right memory from a large corpus?

LongMemEval (ICLR 2025) is the industry-standard benchmark for AI agent memory. It contains 500 questions covering 5 abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.

cd longmemeval
pip install -r requirements.txt

# Download data from HuggingFace (see README)
# Then run:
python run.py --data data/longmemeval_oracle.json

Hippo v0.11.0 results (BM25 only, zero dependencies):

Metric	Hippo	MemPalace (raw)	MemPalace (reranked)
R@1	50.4%	—	—
R@5	74.0%	96.6%	100%
R@10	82.6%	—	—

Question Type	R@5
single-session-assistant	94.6%
knowledge-update	88.5%
temporal-reasoning	73.7%
multi-session	72.2%
single-session-user	65.7%
single-session-preference	26.7%
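For readers unfamiliar with the metric, here is a rough sketch of how R@K and the per-type breakdown could be computed. It assumes the "any hit" definition (a question counts as recalled when at least one gold memory appears in the top K) and a hypothetical row format with `question_type`, `ranked_ids`, and `gold_ids` fields. The actual fields and definition used by evaluate_retrieval.py may differ.

```python
from collections import defaultdict
from typing import Dict, List, Set


def recall_at_k(ranked_ids: List[str], gold_ids: Set[str], k: int) -> float:
    """Return 1.0 if any gold id appears in the top-k ranked ids, else 0.0."""
    return 1.0 if any(doc_id in gold_ids for doc_id in ranked_ids[:k]) else 0.0


def recall_by_type(rows: List[dict], k: int) -> Dict[str, float]:
    """Average recall@k per question type (the row schema is an assumption)."""
    per_type = defaultdict(list)
    for row in rows:
        score = recall_at_k(row["ranked_ids"], set(row["gold_ids"]), k)
        per_type[row["question_type"]].append(score)
    return {qtype: sum(scores) / len(scores) for qtype, scores in per_type.items()}


# Made-up rows for illustration only.
rows = [
    {"question_type": "knowledge-update", "ranked_ids": ["a", "b", "c"], "gold_ids": ["b"]},
    {"question_type": "knowledge-update", "ranked_ids": ["d", "e", "f"], "gold_ids": ["z"]},
    {"question_type": "multi-session", "ranked_ids": ["g", "h", "i"], "gold_ids": ["i"]},
]
by_type_at_2 = recall_by_type(rows, k=2)
by_type_at_3 = recall_by_type(rows, k=3)
```

Raising K can only keep or raise the score for a given ranking, which is why R@1, R@5, and R@10 increase monotonically in the table above.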

Hippo achieves 74% R@5 with BM25 keyword matching and zero runtime dependencies. MemPalace's 96.6% uses ChromaDB embeddings. Adding hippo embed (hybrid BM25 + cosine) should close the gap.
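To illustrate what "hybrid BM25 + cosine" can mean, the sketch below fuses a lexical score with an embedding similarity using a weighted sum. This is one common fusion scheme, not necessarily what hippo embed does: the min-max normalization, the `alpha` weight, and the names `hybrid_rank`, `min_max`, and `cosine` are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so they are comparable with cosine similarity."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(
    bm25_scores: Dict[str, float],
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    alpha: float = 0.5,
) -> List[str]:
    """Rank doc ids by alpha * lexical + (1 - alpha) * semantic (assumed scheme)."""
    lexical = min_max(bm25_scores)
    fused = {
        doc_id: alpha * lex + (1 - alpha) * cosine(query_vec, doc_vecs[doc_id])
        for doc_id, lex in lexical.items()
    }
    return sorted(fused, key=fused.get, reverse=True)


# Made-up scores and 2-d "embeddings" for illustration only.
bm25_scores = {"a": 5.0, "b": 1.0, "c": 3.0}  # higher = better lexical match
doc_vecs = {"a": [0.0, 1.0], "b": [1.0, 0.0], "c": [1.0, 1.0]}
query_vec = [1.0, 0.0]

lexical_only = hybrid_rank(bm25_scores, query_vec, doc_vecs, alpha=1.0)
semantic_only = hybrid_rank(bm25_scores, query_vec, doc_vecs, alpha=0.0)
blended = hybrid_rank(bm25_scores, query_vec, doc_vecs, alpha=0.5)
```

With `alpha=1.0` the ranking is pure keyword order, and with `alpha=0.0` it is pure embedding order. In this toy data the blended ranking puts "c" first, because it scores moderately on both signals.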

Fast retrieval mode (recommended): retrieve_fast.py queries SQLite FTS5 directly and processes 500 questions in 2 seconds. The CLI-based retrieve.py takes hours due to subprocess overhead.
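The speedup comes from querying the index in-process instead of launching a CLI per question. As a rough illustration of a direct FTS5 query, the sketch below builds a small in-memory index and ranks matches with FTS5's built-in `bm25()` function. The table name `memories`, its columns, and the helper names are hypothetical; the real schema used by ingest_direct.py and retrieve_fast.py is not shown here. It also assumes your Python build ships SQLite with FTS5 enabled.

```python
import re
import sqlite3
from typing import List, Tuple


def build_index(docs: List[Tuple[str, str]]) -> sqlite3.Connection:
    """Create an in-memory FTS5 table (hypothetical schema) and load the docs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE memories USING fts5(doc_id UNINDEXED, content)")
    conn.executemany("INSERT INTO memories (doc_id, content) VALUES (?, ?)", docs)
    return conn


def to_match_query(text: str) -> str:
    """Quote each word and OR them, so punctuation cannot break FTS5 syntax."""
    tokens = re.findall(r"\w+", text.lower())
    return " OR ".join(f'"{token}"' for token in tokens)


def search(conn: sqlite3.Connection, question: str, k: int = 5) -> List[str]:
    """Return up to k doc ids, best match first."""
    match = to_match_query(question)
    if not match:
        return []
    # bm25() returns lower values for better matches, so ascending order is best-first.
    rows = conn.execute(
        "SELECT doc_id FROM memories WHERE memories MATCH ? "
        "ORDER BY bm25(memories) LIMIT ?",
        (match, k),
    ).fetchall()
    return [row[0] for row in rows]


# Made-up memories for illustration only.
conn = build_index([
    ("m1", "the user prefers dark roast coffee"),
    ("m2", "meeting moved to Tuesday"),
    ("m3", "coffee machine is broken, order a new coffee grinder"),
])
coffee_hits = search(conn, "Which coffee?")
tuesday_hits = search(conn, "tuesday")
```

Each `search` call is a single in-process SQL query, so there is no per-question process startup cost.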

cd longmemeval
python ingest_direct.py --data data/longmemeval_oracle.json --store-dir ./store
python retrieve_fast.py --data data/longmemeval_oracle.json --store-dir ./store --output results/retrieval.jsonl
python evaluate_retrieval.py --retrieval results/retrieval.jsonl --data data/longmemeval_oracle.json

For the full pipeline with LLM answer generation and evaluation (requires ANTHROPIC_API_KEY):

python run.py --data data/longmemeval_oracle.json

What each benchmark proves

	Sequential Learning	LongMemEval
Tests	Agent improvement over time	Retrieval accuracy on fixed corpus
Unique to hippo?	Yes (no other benchmark tests this)	No (industry standard)
Hippo result	78% -> 14% trap rate (learns)	74.0% R@5 (BM25 only)
What it proves	Decay + strengthening + outcome feedback produce learning curves	BM25 keyword search reaches 74.0% R@5, against 96.6% for an embedding-based system, at zero dependency cost
Metric	Trap-hit-rate decline (early vs late)	Recall@K, answer-in-content
Dependencies	Node.js 22.5+	Python 3.9+ (retrieval eval needs no API key)
