# HyperMem: Hypergraph Memory for Long-Term Conversations

Official implementation of our ACL 2026 paper *HyperMem: Hypergraph Memory for Long-Term Conversations*.

Long-term memory for conversational agents requires modelling high-order associations, i.e., joint dependencies among multiple related episodes and facts, which pairwise relations in existing RAG and graph-based memory systems cannot capture. HyperMem addresses this by structuring memory as a three-level hypergraph (topics → episodes → facts) connected through weighted hyperedges, and retrieving information via a coarse-to-fine top-down traversal.

On the LoCoMo benchmark, HyperMem reaches 92.73% LLM-as-a-judge accuracy, outperforming the strongest RAG baseline (HyperGraphRAG, 86.49%) by 6.24 points and the strongest memory system (MemOS, 75.80%) by 16.93 points.


## Method

*Three-level hypergraph with hyperedges linking nodes of the same level.*

Given a dialogue stream $X = \{x_t\}_{t=1}^{T}$, HyperMem constructs a memory hypergraph

$$\mathcal{H} = (\mathcal{V}^T \cup \mathcal{V}^E \cup \mathcal{V}^F,\; \mathcal{E}^E \cup \mathcal{E}^F),$$

where $\mathcal{V}^T, \mathcal{V}^E, \mathcal{V}^F$ denote topic, episode and fact nodes respectively. Episode hyperedges $\mathcal{E}^E$ connect episode nodes under the same topic with weights $w^E \in [0,1]$; fact hyperedges $\mathcal{E}^F$ connect fact nodes belonging to the same episode with weights $w^F \in [0,1]$.

| Level | Node | Semantics |
|---|---|---|
| L3 | Topic | Long-horizon theme grouping topically related episodes |
| L2 | Episode | Temporally contiguous dialogue segment describing one event |
| L1 | Fact | Atomic, queryable knowledge unit extracted from an episode |
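To make the three levels concrete, here is a hypothetical Python sketch of the node and hyperedge types. The actual definitions live in `hypermem/types.py` and `hypermem/structure.py`; all field names below are illustrative assumptions, not the project's API.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the three node levels and weighted hyperedges.
# Field names are assumptions; see hypermem/types.py for the real types.

@dataclass
class Node:
    node_id: str
    text: str  # topic/episode summary, or an atomic fact statement

@dataclass
class TopicNode(Node):                                   # L3
    episode_ids: list[str] = field(default_factory=list)

@dataclass
class EpisodeNode(Node):                                 # L2
    topic_id: str = ""
    timestamp: str = ""
    fact_ids: list[str] = field(default_factory=list)

@dataclass
class FactNode(Node):                                    # L1
    episode_id: str = ""

@dataclass
class Hyperedge:
    member_ids: list[str]       # nodes of the same level
    weights: dict[str, float]   # per-member weight in [0, 1]
```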

### Hypergraph Construction

  1. **Episode Detection:** an LLM-driven streaming boundary detector partitions the raw dialogue into semantically complete episodes, each summarised and timestamped.
  2. **Topic Aggregation:** streaming topic matching against historical topics lazily groups related episodes under shared topics; a new topic is created when no sufficiently close match exists.
  3. **Fact Extraction:** atomic facts (together with potential queries, keywords, and a summary) are extracted from each episode, and the facts of the same episode are then bound together by a weighted fact hyperedge.
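The three construction steps can be sketched as a streaming loop. Everything below is a toy illustration: the real extractors in `hypermem/extractors/` are LLM-driven, whereas the stubs here (`detect_boundary`, `match_topic`, `extract_facts`) only mimic the control flow.

```python
# Toy sketch of the streaming construction loop (steps 1-3 above).
# All three helpers are stand-ins for the paper's LLM-driven extractors.

def detect_boundary(buffer):
    # Stub: close an episode every 3 turns (the paper uses an LLM detector).
    return len(buffer) >= 3

def match_topic(episode, topics):
    # Stub: lazily match on the episode's first word; create topic if absent.
    key = episode.split()[0].lower()
    topics.setdefault(key, [])
    return key

def extract_facts(episode):
    # Stub: one "fact" per sentence (the paper extracts atomic facts).
    return [s.strip() for s in episode.split(".") if s.strip()]

def build_memory(dialogue):
    topics, episodes, facts = {}, [], {}
    buffer = []

    def flush():
        episode = " ".join(buffer)
        episodes.append(episode)
        topics[match_topic(episode, topics)].append(episode)
        facts[episode] = extract_facts(episode)  # fact hyperedge members
        buffer.clear()

    for turn in dialogue:
        buffer.append(turn)
        if detect_boundary(buffer):
            flush()
    if buffer:
        flush()
    return topics, episodes, facts
```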

### Hypergraph Embedding Propagation

Node embeddings are refined by aggregating information from incident hyperedges. A hyperedge embedding is computed as an attention-weighted sum of its member nodes,

$$\mathbf{h}_e = \sum_{v \in V(e)} \alpha_{e,v} \mathbf{h}_v,\quad \alpha_{e,v} = \frac{\exp(w_{e,v})}{\sum_{u \in V(e)} \exp(w_{e,u})},$$

and each node is updated as $\mathbf{h}'_v = \mathbf{h}_v + \lambda \cdot \mathrm{Agg}_{e \in \mathcal{N}(v)}(\mathbf{h}_e)$ with $\lambda = 0.5$.
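As a concrete sketch of the two equations above in NumPy (dimensions and weights are illustrative; `Agg` is taken to be a sum, matching the configuration reported in this README):

```python
import numpy as np

def softmax(w):
    e = np.exp(w - np.max(w))
    return e / e.sum()

def hyperedge_embedding(member_embs, member_weights):
    # h_e = sum_v alpha_{e,v} h_v, with alpha a softmax over hyperedge weights
    alpha = softmax(np.asarray(member_weights, dtype=float))
    return alpha @ np.stack(member_embs)

def propagate(node_embs, hyperedges, lam=0.5):
    """node_embs: dict node_id -> vector.
    hyperedges: list of (member_ids, member_weights) pairs."""
    edge_embs = [
        hyperedge_embedding([node_embs[v] for v in members], weights)
        for members, weights in hyperedges
    ]
    updated = {}
    for v, h in node_embs.items():
        # h'_v = h_v + lam * sum of incident hyperedge embeddings
        incident = [h_e for h_e, (members, _) in zip(edge_embs, hyperedges)
                    if v in members]
        updated[v] = (h + lam * np.sum(incident, axis=0)) if incident else h.copy()
    return updated
```

For a two-node hyperedge with equal weights, the attention reduces to a plain average of the member embeddings, so each member moves halfway (with $\lambda = 0.5$) toward that average.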

### Coarse-to-Fine Retrieval

For a query $q$, retrieval proceeds top-down:

- **Stage 1, Topic Retrieval:** BM25 and dense rankings are fused by Reciprocal Rank Fusion,

  $$\mathrm{RRF}(d) = \sum_{m=1}^{M} \frac{1}{k + \mathrm{rank}_m(d)},$$

  and the top-$k^T$ topics are kept after optional reranking.

- **Stage 2, Episode Retrieval:** episodes in the retained topic subgraph are scored and the top-$k^E$ are retained.

- **Stage 3, Fact Retrieval:** facts linked to the retained episodes are scored and the top-$k^F$ are used as evidence.

The final answer is generated by an LLM conditioned on the retrieved episodes (with their summaries) and facts.
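The rank fusion in Stage 1 can be written in a few lines. This is a generic RRF implementation following the formula above (with the $k = 60$ default used in this README's configuration), not the project's own code:

```python
# Minimal Reciprocal Rank Fusion: each ranking contributes 1 / (k + rank)
# per document, and documents are sorted by their summed score.

def rrf(rankings, k=60):
    """rankings: list of ordered doc-id lists (best first).
    Returns doc ids sorted by fused score, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked highly by both BM25 and the dense retriever accumulates two large reciprocal terms, so agreement between retrievers dominates either retriever's raw score scale.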


## Installation

HyperMem is tested with Python 3.12 and CUDA 12.1.

```bash
git clone https://github.com/<org>/HyperMem.git
cd HyperMem

conda create -n hypermem python=3.12 -y
conda activate hypermem
pip install -r requirements.txt
```

### Environment variables

Create a `.env` file at the repository root:

```bash
# LLM backend (OpenAI-compatible; we use OpenRouter in the paper)
OPENROUTER_API_KEY=sk-...

# Local model endpoints
EMBEDDING_BASE_URL=http://localhost:11810/v1/embeddings
RERANKER_BASE_URL=http://localhost:12810
```

### Local model services

HyperMem uses Qwen3-Embedding-4B for semantic encoding and Qwen3-Reranker-4B for reranking. Both are served via vLLM:

```bash
bash scripts/serve_embedding.sh   # GPUs 0-3, port 11810
bash scripts/serve_reranker.sh    # GPUs 4-7, port 12810
```

## Reproducing the paper

The full LoCoMo evaluation pipeline is launched with a single command:

```bash
bash scripts/run_eval.sh
```

The script sequentially runs six stages; all artefacts are written under results/<experiment_name>/.

| Stage | Script | Purpose |
|---|---|---|
| 1 | `stage1_memory_extraction.py` | Episode detection from raw dialogues |
| 2 | `stage2_hypergraph_extraction.py` | Topic aggregation + fact extraction + hypergraph construction |
| 3 | `stage3_hypergraph_index.py` | BM25 and dense indices over the hypergraph |
| 4 | `stage4_hypergraph_retrieval.py` | Top-down hierarchical retrieval |
| 5 | `stage5_response.py` | LLM answer generation from retrieved evidence |
| 6 | `stage6_eval.py` | LLM-as-judge evaluation (3 rounds, averaged) |

Individual stages can be run via:

```bash
python hypermem/main/eval.py --stages 4 5 6
```

## Configuration

All hyper-parameters live in `hypermem/config.py` and can be overridden through environment variables:

```bash
export HYPERMEM_EXPERIMENT_NAME="HyperMem-v3"
export HYPERMEM_USE_RERANKER=false
export HYPERMEM_INITIAL_CANDIDATES=100       # pre-fusion candidate pool
export HYPERMEM_TOPIC_TOP_K=15               # k^T
export HYPERMEM_EPISODE_TOP_K=25             # k^E
export HYPERMEM_FACT_TOP_K=30                # k^F
```

The default configuration uses $\lambda = 0.5$, $(k^T, k^E, k^F) = (15, 25, 30)$, BM25 + dense retrieval with RRF ($k = 60$), and sum aggregation for hyperedge embedding propagation.
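The environment-override pattern can be sketched as follows. This is a hypothetical illustration of the mechanism; the actual logic lives in `hypermem/config.py`, and the `env` helper and `CONFIG` dict here are assumptions.

```python
import os

# Hypothetical sketch of env-var overrides with defaults and type casting.
# The real configuration object is defined in hypermem/config.py.

def env(name, default, cast):
    raw = os.getenv(name)
    if raw is None:
        return default
    if cast is bool:
        return raw.lower() in {"1", "true", "yes"}
    return cast(raw)

CONFIG = {
    "experiment_name": env("HYPERMEM_EXPERIMENT_NAME", "HyperMem", str),
    "use_reranker": env("HYPERMEM_USE_RERANKER", True, bool),
    "initial_candidates": env("HYPERMEM_INITIAL_CANDIDATES", 100, int),
    "topic_top_k": env("HYPERMEM_TOPIC_TOP_K", 15, int),    # k^T
    "episode_top_k": env("HYPERMEM_EPISODE_TOP_K", 25, int),  # k^E
    "fact_top_k": env("HYPERMEM_FACT_TOP_K", 30, int),      # k^F
}
```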


## Results

### LoCoMo benchmark

Accuracy is reported as the LLM-as-judge score (GPT-4o-mini), averaged over 3 evaluation rounds.

| Method | Single-hop | Multi-hop | Temporal | Open Domain | Overall |
|---|---|---|---|---|---|
| GraphRAG | 79.55 | 54.96 | 50.16 | 58.33 | 67.60 |
| LightRAG | 86.68 | 84.04 | 60.75 | 71.88 | 79.87 |
| HippoRAG 2 | 86.44 | 75.89 | 78.50 | 66.67 | 81.62 |
| HyperGraphRAG | 90.61 | 80.85 | 85.36 | 70.83 | 86.49 |
| OpenAI | 63.79 | 42.92 | 21.71 | 63.22 | 52.90 |
| LangMem | 62.23 | 47.92 | 23.43 | 72.20 | 58.10 |
| Zep | 61.70 | 41.35 | 49.31 | **76.60** | 65.99 |
| A-Mem | 39.79 | 18.85 | 49.91 | 54.05 | 48.38 |
| Mem0 | 67.13 | 51.15 | 55.51 | 72.93 | 66.88 |
| Mem0$^g$ | 65.71 | 47.19 | 58.13 | 75.71 | 68.44 |
| MIRIX | 85.11 | 83.70 | 88.39 | 65.62 | 85.38 |
| Memobase | 73.12 | 64.65 | 81.20 | 53.12 | 72.01 |
| MemU | 66.34 | 63.12 | 27.10 | 50.56 | 56.55 |
| MemOS | 81.09 | 67.49 | 75.18 | 55.90 | 75.80 |
| **HyperMem (Ours)** | **96.08** | **93.62** | **89.72** | 70.83 | **92.73** |

## Project Structure

```
HyperMem/
├── hypermem/
│   ├── config.py                 # Experiment configuration
│   ├── types.py                  # Episode / Topic / Fact data classes
│   ├── structure.py              # Hypergraph nodes and hyperedges
│   ├── extractors/               # LLM-driven extraction modules
│   │   ├── episode_extractor.py
│   │   ├── topic_extractor.py
│   │   ├── fact_extractor.py
│   │   └── hypergraph_extractor.py
│   ├── llm/                      # OpenAI-compatible LLM / embedding / reranker clients
│   ├── prompts/                  # Prompt templates (episode / topic / fact / answer)
│   ├── utils/                    # Utility functions
│   └── main/                     # Six-stage pipeline entry points
├── scripts/
│   ├── run_eval.sh               # End-to-end evaluation driver
│   ├── serve_embedding.sh        # Qwen3-Embedding-4B server
│   └── serve_reranker.sh         # Qwen3-Reranker-4B server
├── data/                         # LoCoMo-10 and auxiliary benchmarks
├── results/                      # Per-experiment artefacts
├── requirements.txt
└── README.md
```

Each experiment directory under `results/` contains the extracted `episodes/`, `topics/`, and `facts/`, the built `hypergraphs/`, `bm25_index/`, and `vectors/`, along with `search_results.json`, `retrieval_logs.json`, `responses.json`, and the final `judged.json`.


## Citation

If HyperMem is useful in your research, please cite our paper:

```bibtex
@inproceedings{yue2026hypermem,
  title     = {HyperMem: Hypergraph Memory for Long-Term Conversations},
  author    = {Yue, Juwei and Hu, Chuanrui and Sheng, Jiawei and Zhou, Zuyi and Zhang, Wenyuan and Liu, Tingwen and Guo, Li and Deng, Yafeng},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL)},
  year      = {2026}
}
```