Official implementation of our ACL 2026 paper HyperMem: Hypergraph Memory for Long-Term Conversations.
Long-term memory for conversational agents requires modelling high-order associations, i.e., joint dependencies among multiple related episodes and facts, which pairwise relations in existing RAG and graph-based memory systems cannot capture. HyperMem addresses this by structuring memory as a three-level hypergraph (topics → episodes → facts) connected through weighted hyperedges, and retrieving information via a coarse-to-fine top-down traversal.
On the LoCoMo benchmark, HyperMem reaches 92.73% LLM-as-a-judge accuracy, outperforming the strongest RAG baseline (HyperGraphRAG, 86.49%) by 6.24 points and the strongest memory system (MemOS, 75.80%) by 16.93 points.
The memory is organised as a three-level hypergraph, with hyperedges linking nodes of the same level.
Given a dialogue stream $\mathcal{D} = \{u_1, u_2, \ldots, u_T\}$, where $u_t$ denotes the $t$-th utterance, HyperMem incrementally organises memory into three levels of nodes:
| Level | Node | Semantics |
|---|---|---|
| L3 | Topic | Long-horizon theme grouping topically related episodes |
| L2 | Episode | Temporally contiguous dialogue segment describing one event |
| L1 | Fact | Atomic queryable knowledge unit extracted from an episode |
- Episode Detection: an LLM-driven streaming boundary detector partitions the raw dialogue into semantically complete episodes, each summarised and timestamped.
- Topic Aggregation: streaming topic matching against historical topics lazily groups related episodes under shared topics; new topics are created when no sufficient match exists.
- Fact Extraction: atomic facts are extracted from each episode, each annotated with potential queries, keywords, and a summary; facts originating from the same episode are then bound together via a weighted hyperedge.
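For orientation, here is a minimal sketch of the three node types. The field names are illustrative assumptions; the authoritative definitions live in `hypermem/types.py`:

```python
from dataclasses import dataclass, field

@dataclass
class Fact:
    """L1: atomic queryable knowledge unit extracted from an episode."""
    text: str                     # the atomic fact itself
    keywords: list[str]           # keywords for sparse (BM25) matching
    potential_queries: list[str]  # questions this fact could answer

@dataclass
class Episode:
    """L2: temporally contiguous dialogue segment describing one event."""
    summary: str                  # LLM-generated summary of the segment
    timestamp: str                # when the segment occurred
    facts: list[Fact] = field(default_factory=list)

@dataclass
class Topic:
    """L3: long-horizon theme grouping topically related episodes."""
    name: str
    episodes: list[Episode] = field(default_factory=list)
```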
Node embeddings are refined by aggregating information from incident hyperedges. A hyperedge embedding is computed as an attention-weighted sum of its member nodes,
and each node is updated as $\mathbf{h}'_v = \mathbf{h}_v + \lambda \cdot \mathrm{Agg}_{e \in \mathcal{N}(v)}(\mathbf{h}_e)$, where $\mathcal{N}(v)$ is the set of hyperedges incident to node $v$ and $\lambda$ controls the update strength.
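A minimal NumPy sketch of this refinement, assuming a parameter-free attention (scoring members against their mean embedding) and mean aggregation over incident hyperedges; the paper's learned parameterisation may differ:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def hyperedge_embedding(member_embs: np.ndarray) -> np.ndarray:
    """Attention-weighted sum of member node embeddings.
    Members are scored against their mean embedding, a simple
    parameter-free stand-in for the paper's attention."""
    query = member_embs.mean(axis=0)      # (dim,)
    weights = softmax(member_embs @ query)  # (n_members,)
    return weights @ member_embs           # (dim,)

def refine_node(h_v: np.ndarray, incident_edge_embs: np.ndarray,
                lam: float = 0.5) -> np.ndarray:
    """h'_v = h_v + lam * Agg_{e in N(v)}(h_e), with mean as Agg."""
    return h_v + lam * incident_edge_embs.mean(axis=0)
```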
For a query $q$, retrieval proceeds top-down in three stages:

- Stage 1, Topic Retrieval: BM25 and dense rankings are fused by Reciprocal Rank Fusion,
  $$\mathrm{RRF}(d) = \sum_{m=1}^{M} \frac{1}{k + \mathrm{rank}_m(d)},$$
  and the top-$k^T$ topics are kept after optional reranking (see the RRF sketch after this list).
- Stage 2, Episode Retrieval: episodes in the retained topic subgraph are scored, and the top-$k^E$ are kept.
- Stage 3, Fact Retrieval: facts linked to the retained episodes are scored, and the top-$k^F$ are used as evidence.
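The Stage 1 fusion follows the RRF formula directly. A sketch is below; `k = 60` is assumed here as the conventional constant from the literature, and the repository's default may differ:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse M ranked lists of ids: RRF(d) = sum_m 1 / (k + rank_m(d)).
    Ranks are 1-based; an id missing from a list contributes nothing."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a BM25 ranking with a dense ranking of topic ids.
fused = reciprocal_rank_fusion([["t3", "t1", "t7"], ["t1", "t3", "t9"]])
```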
The final answer is generated by an LLM conditioned on the retrieved episodes (with their summaries) and facts.
HyperMem is tested with Python 3.12 and CUDA 12.1.
```bash
git clone https://github.com/<org>/HyperMem.git
cd HyperMem
conda create -n hypermem python=3.12 -y
conda activate hypermem
pip install -r requirements.txt
```

Create a `.env` file at the repository root:
```bash
# LLM backend (OpenAI-compatible; we use OpenRouter in the paper)
OPENROUTER_API_KEY=sk-...

# Local model endpoints
EMBEDDING_BASE_URL=http://localhost:11810/v1/embeddings
RERANKER_BASE_URL=http://localhost:12810
```

HyperMem uses Qwen3-Embedding-4B for semantic encoding and Qwen3-Reranker-4B for reranking. Both are served via vLLM:
```bash
bash scripts/serve_embedding.sh   # GPUs 0-3, port 11810
bash scripts/serve_reranker.sh    # GPUs 4-7, port 12810
```

The full LoCoMo evaluation pipeline is launched with a single command:

```bash
bash scripts/run_eval.sh
```

The script sequentially runs six stages; all artefacts are written under `results/<experiment_name>/`.
| Stage | Script | Purpose |
|---|---|---|
| 1 | `stage1_memory_extraction.py` | Episode detection from raw dialogues |
| 2 | `stage2_hypergraph_extraction.py` | Topic aggregation + fact extraction + hypergraph construction |
| 3 | `stage3_hypergraph_index.py` | BM25 and dense indices over the hypergraph |
| 4 | `stage4_hypergraph_retrieval.py` | Top-down hierarchical retrieval |
| 5 | `stage5_response.py` | LLM answer generation from retrieved evidence |
| 6 | `stage6_eval.py` | LLM-as-judge evaluation (3 rounds, averaged) |
Individual stages can be run via:

```bash
python hypermem/main/eval.py --stages 4 5 6
```

All hyper-parameters live in `hypermem/config.py` and can be overridden through environment variables:
```bash
export HYPERMEM_EXPERIMENT_NAME="HyperMem-v3"
export HYPERMEM_USE_RERANKER=false
export HYPERMEM_INITIAL_CANDIDATES=100   # pre-fusion candidate pool
export HYPERMEM_TOPIC_TOP_K=15           # k^T
export HYPERMEM_EPISODE_TOP_K=25         # k^E
export HYPERMEM_FACT_TOP_K=30            # k^F
```

This setting uses $k^T = 15$, $k^E = 25$, and $k^F = 30$, with the reranker disabled.
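For illustration, a minimal sketch of the env-override pattern these variables suggest; the actual `hypermem/config.py` may be structured differently:

```python
import os
from dataclasses import dataclass, field

def _env(name, default, cast=str):
    """Read HYPERMEM_<name> from the environment, else return default.
    (Hypothetical helper; the real config.py may use another mechanism.)"""
    raw = os.getenv(f"HYPERMEM_{name}")
    return cast(raw) if raw is not None else default

@dataclass
class Config:
    experiment_name: str = field(
        default_factory=lambda: _env("EXPERIMENT_NAME", "HyperMem"))
    use_reranker: bool = field(
        default_factory=lambda: _env("USE_RERANKER", True,
                                     lambda s: s.lower() == "true"))
    initial_candidates: int = field(
        default_factory=lambda: _env("INITIAL_CANDIDATES", 100, int))
    topic_top_k: int = field(      # k^T
        default_factory=lambda: _env("TOPIC_TOP_K", 15, int))
    episode_top_k: int = field(    # k^E
        default_factory=lambda: _env("EPISODE_TOP_K", 25, int))
    fact_top_k: int = field(       # k^F
        default_factory=lambda: _env("FACT_TOP_K", 30, int))
```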
Accuracy is reported as the LLM-as-judge score (GPT-4o-mini), averaged over 3 evaluation rounds.
| Method | Single-hop | Multi-hop | Temporal | Open Domain | Overall |
|---|---|---|---|---|---|
| GraphRAG | 79.55 | 54.96 | 50.16 | 58.33 | 67.60 |
| LightRAG | 86.68 | 84.04 | 60.75 | 71.88 | 79.87 |
| HippoRAG 2 | 86.44 | 75.89 | 78.50 | 66.67 | 81.62 |
| HyperGraphRAG | 90.61 | 80.85 | 85.36 | 70.83 | 86.49 |
| OpenAI | 63.79 | 42.92 | 21.71 | 63.22 | 52.90 |
| LangMem | 62.23 | 47.92 | 23.43 | 72.20 | 58.10 |
| Zep | 61.70 | 41.35 | 49.31 | 76.60 | 65.99 |
| A-Mem | 39.79 | 18.85 | 49.91 | 54.05 | 48.38 |
| Mem0 | 67.13 | 51.15 | 55.51 | 72.93 | 66.88 |
| Mem0$^g$ | 65.71 | 47.19 | 58.13 | 75.71 | 68.44 |
| MIRIX | 85.11 | 83.70 | 88.39 | 65.62 | 85.38 |
| Memobase | 73.12 | 64.65 | 81.20 | 53.12 | 72.01 |
| MemU | 66.34 | 63.12 | 27.10 | 50.56 | 56.55 |
| MemOS | 81.09 | 67.49 | 75.18 | 55.90 | 75.80 |
| HyperMem (Ours) | 96.08 | 93.62 | 89.72 | 70.83 | 92.73 |
```
HyperMem/
├── hypermem/
│   ├── config.py              # Experiment configuration
│   ├── types.py               # Episode / Topic / Fact data classes
│   ├── structure.py           # Hypergraph nodes and hyperedges
│   ├── extractors/            # LLM-driven extraction modules
│   │   ├── episode_extractor.py
│   │   ├── topic_extractor.py
│   │   ├── fact_extractor.py
│   │   └── hypergraph_extractor.py
│   ├── llm/                   # OpenAI-compatible LLM / embedding / reranker clients
│   ├── prompts/               # Prompt templates (episode / topic / fact / answer)
│   ├── utils/                 # Utility functions
│   └── main/                  # Six-stage pipeline entry points
├── scripts/
│   ├── run_eval.sh            # End-to-end evaluation driver
│   ├── serve_embedding.sh     # Qwen3-Embedding-4B server
│   └── serve_reranker.sh      # Qwen3-Reranker-4B server
├── data/                      # LoCoMo-10 and auxiliary benchmarks
├── results/                   # Per-experiment artefacts
├── requirements.txt
└── README.md
```
Each experiment directory under `results/` contains the extracted `episodes/`, `topics/`, and `facts/`, the built `hypergraphs/`, `bm25_index/`, and `vectors/`, along with `search_results.json`, `retrieval_logs.json`, `responses.json`, and the final `judged.json`.
If HyperMem is useful in your research, please cite our paper:
```bibtex
@inproceedings{yue2026hypermem,
  title     = {HyperMem: Hypergraph Memory for Long-Term Conversations},
  author    = {Yue, Juwei and Hu, Chuanrui and Sheng, Jiawei and Zhou, Zuyi and Zhang, Wenyuan and Liu, Tingwen and Guo, Li and Deng, Yafeng},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL)},
  year      = {2026}
}
```