Commit 1411d48

Merge branch 'mem9-ai:main' into main
2 parents 24fe455 + c0ee90b commit 1411d48

19 files changed: +4277 −370 lines

.gitignore

Lines changed: 2 additions & 0 deletions
````diff
@@ -42,4 +42,6 @@ __pycache__/
 /benchmark/MR-NIAH/origin/
 /benchmark/MR-NIAH/output/
 /benchmark/MR-NIAH/results*/
+/benchmark/MR-NIAH/logs/
+/benchmark/MR-NIAH/results-logs/
 /benchmark/MR-NIAH/.cache/
````

benchmark/MR-NIAH/AGENTS.md

Lines changed: 10 additions & 10 deletions
````diff
@@ -8,14 +8,14 @@ MR-NIAH is a bridge from the MiniMax benchmark corpus to OpenClaw sessions and m

 ## Files and workflow

-| File | Role |
-|------|------|
-| `fetch_data.py` | Mirror/update upstream dataset into `origin/` |
-| `mr-niah-transcript.py` | Convert raw turns into OpenClaw session JSON |
-| `run_batch.py` | Replay generated sessions through an OpenClaw profile |
-| `run_mem_compare.sh` | Compare baseline vs mem9-enabled profile |
-| `score.py` | Apply MR-NIAH scoring rubric to predictions |
-| `USAGE.md` | Full prerequisites and end-to-end usage |
+| File                    | Role                                                  |
+| ----------------------- | ----------------------------------------------------- |
+| `fetch_data.py`         | Mirror/update upstream dataset into `origin/`         |
+| `mr-niah-transcript.py` | Convert raw turns into OpenClaw session JSON          |
+| `run_batch.py`          | Replay generated sessions through an OpenClaw profile |
+| `run_mem_compare.sh`    | Compare baseline vs mem9-enabled profile              |
+| `score.py`              | Apply MR-NIAH scoring rubric to predictions           |
+| `USAGE.md`              | Full prerequisites and end-to-end usage               |

 ## Where to look

@@ -30,8 +30,8 @@ MR-NIAH is a bridge from the MiniMax benchmark corpus to OpenClaw sessions and m
 ```bash
 cd benchmark/MR-NIAH && python3 fetch_data.py
 cd benchmark/MR-NIAH && python3 mr-niah-transcript.py
-cd benchmark/MR-NIAH && python3 run_batch.py --profile mrniah_local --agent main --local --limit 30
-cd benchmark/MR-NIAH && MRNIAH_LIMIT=30 bash run_mem_compare.sh
+cd benchmark/MR-NIAH && python3 run_batch.py --profile mrniah_local --agent main --limit 30
+cd benchmark/MR-NIAH && SAMPLE_LIMIT=30 bash run_mem_compare.sh
 cd benchmark/MR-NIAH && python3 score.py results/predictions.jsonl
 ```
````

benchmark/MR-NIAH/README.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -11,7 +11,7 @@ Research teams have produced numerous memory benchmarks for chatbots, but their
 - **Dataset mirroring (`fetch_data.py`)** – keeps a local `origin/` mirror of MiniMax’s MR-NIAH dumps.
 - **Transcript bridge (`mr-niah-transcript.py`)** – rewrites raw turns into OpenClaw `session` JSON plus an `index`.
 - **Batch execution (`run_batch.py`)** – rehydrates sessions into a profile, calls `openclaw agent`, and stores `results/`.
-- **Comparison runner (`run_mem_compare.sh`)** – clones the baseline profile, installs the mem9 plugin, provisions a fresh mem9 space on the hosted mem9 API (or another configured mem9 endpoint), runs both profiles, and prints accuracy deltas.
+- **Comparison runner (`run_mem_compare.sh`)** – clones the baseline profile, installs the mem9 plugin, provisions a fresh mem9 space on the hosted mem9 API (or another configured mem9 endpoint), runs both profiles, prints accuracy deltas, and (on successful full comparisons) writes a tar.gz archive containing results + logs. Supports `--model` / `--compact`, plus managed profiles (template + `.env`) to avoid baseline/mem drift; defaults to `benchmark/MR-NIAH/config/openclaw/`.
 - **Scoring (`score.py`)** – invokes the MR-NIAH exact-match rubric so downstream results remain comparable to prior papers.

 Directory layout, helper scripts, and agent responsibilities are summarized below:
````
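As an aside, the comparison runner's final step (printing accuracy for both runs and the delta) can be sketched in a few lines of Python. This is an illustrative sketch, not the script's actual implementation; in particular, the per-record `score` field name is an assumption about the `predictions.jsonl` schema.

```python
import json


def accuracy(path):
    """Mean per-sample score across a predictions.jsonl file.

    Assumes each JSONL record carries a numeric "score" field (a
    hypothetical field name; the real schema may differ).
    """
    scores = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                scores.append(float(json.loads(line)["score"]))
    return sum(scores) / len(scores) if scores else 0.0


def report_delta(baseline_path, mem_path):
    """Print baseline accuracy, mem9 accuracy, and their difference."""
    base, mem = accuracy(baseline_path), accuracy(mem_path)
    print(f"baseline={base:.3f} mem9={mem:.3f} delta={mem - base:+.3f}")
```

Pointing `report_delta` at `results-mrniah_local/predictions.jsonl` and `results-mrniah_mem/predictions.jsonl` would reproduce the delta summary in spirit.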

benchmark/MR-NIAH/USAGE.md

Lines changed: 129 additions & 28 deletions
````diff
@@ -6,10 +6,51 @@ This document explains how to prepare an OpenClaw profile, set up the required d

 ### OpenClaw profiles

-1. Pick a baseline profile name (defaults to `mrniah_local` throughout the scripts) and initialize it with the OpenClaw CLI so that `~/.openclaw-<profile>/openclaw.json` exists.
-2. Ensure the profile includes at least one agent (default: `main`) because `run_batch.py` and `run_mem_compare.sh` drop regenerated transcripts into `<profile>/agents/<agent>/sessions/` and update `sessions.json` automatically.
-3. When you plan to run comparisons, keep the baseline profile pristine. The comparison script clones it to `~/.openclaw-${MRNIAH_MEM_PROFILE}` (default `mrniah_mem`) before installing the mem9 plugin, so everything that should be shared—API keys, transports, tools—must already live in the baseline directory.
-4. If you are working from an existing team profile, copy the entire folder into `~/.openclaw-<profile>` (or let the CLI initialize it) before you start the pipeline; the scripts never copy data back into the repo.
+There are two ways to run:
+
+1) **Full baseline-vs-mem comparison (recommended)**: `run_mem_compare.sh` defaults to managed profiles and recreates fresh OpenClaw profiles per run from `benchmark/MR-NIAH/config/openclaw/`. You do not need to manually initialize `~/.openclaw-<profile>` beforehand.
+
+2) **Single-profile batch runs** (e.g. calling `run_batch.py` directly): initialize your profile with the OpenClaw CLI so that `~/.openclaw-<profile>/openclaw.json` exists.
+
+#### (Optional) Managed profiles (template + .env)
+
+If you do not want to manually maintain two profiles (baseline + mem) and risk configuration drift (e.g. compaction settings), `run_mem_compare.sh` can recreate profiles from a template directory each run.
+
+For full baseline-vs-mem comparisons, managed profiles are enabled by default to avoid accidental reuse of existing profiles. If you do not pass `--base-profile`/`--mem-profile`, the runner appends a `_yyyymmddhhmmss` suffix automatically.
+
+Requirements:
+
+- A template directory that contains at least an `openclaw.json` (it can also include `agents/`, `workspace/`, etc.).
+- An `.env` file that contains your secret API keys and any other required environment variables.
+  - The runner treats it as opaque and never prints it.
+  - `.env` is gitignored.
+
+Default locations (in this repo):
+
+- Template dir: `benchmark/MR-NIAH/config/openclaw/` (must contain `openclaw.json`)
+- Env file: `benchmark/MR-NIAH/config/openclaw/.env`
+
+Setup:
+
+1) Copy `example.env` to `.env`:
+
+```
+cp benchmark/MR-NIAH/config/openclaw/example.env benchmark/MR-NIAH/config/openclaw/.env
+```
+
+2) Edit `benchmark/MR-NIAH/config/openclaw/.env` to set your keys.
+
+3) Ensure `benchmark/MR-NIAH/config/openclaw/openclaw.json` references the same variable names (it typically uses `${ENV_VAR}` placeholders).
+
+Example (recreate baseline + mem from a local template, set model + compaction preset):
+
+```
+./run_mem_compare.sh \
+  --model "dashscope/qwen3.5-plus" \
+  --compact "safeguard-20k"
+```
+
+Compaction presets live under `benchmark/MR-NIAH/openclaw/compact/` (a default `safeguard-20k` preset is included).

 ### Software and infrastructure

````
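As a rough illustration of the managed-profile mechanism added above (reading `.env`, then filling `${ENV_VAR}` placeholders in a template `openclaw.json`), a minimal sketch might look like this. The function names and parsing rules here are hypothetical; the runner's actual behavior may differ.

```python
import re


def load_env(path):
    """Parse simple KEY=VALUE lines; blank lines and '#' comments are skipped."""
    env = {}
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env


def render_template(text, env):
    """Replace ${VAR} placeholders with env values; fail early on unknown names."""
    def sub(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"missing env var: {name}")
        return env[name]
    return re.sub(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}", sub, text)
```

Failing on unknown placeholders (rather than leaving them in place) keeps a misconfigured `.env` from silently producing a broken profile.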
````diff
@@ -18,7 +59,7 @@ This document explains how to prepare an OpenClaw profile, set up the required d
 | Python 3.10+ & pip | Runs `fetch_data.py`, `mr-niah-transcript.py`, `run_batch.py`, and `score.py`. | Install dependencies with `python3 -m pip install -r requirements.txt` from the repo root if available, or install `requests`, `click`, and `rich` manually. |
 | Git + network access to MiniMax’s MR-NIAH repo | `fetch_data.py` mirrors upstream datasets via GitHub. | Works with anonymous HTTPS; provide a token if your network requires it. |
 | OpenClaw CLI (latest) | Executes agents for each regenerated session. | Verify `openclaw --version` works and that the CLI can run your chosen profile interactively. |
-| Access to the hosted mem9 API (or another mem9-compatible endpoint) | Stores mem9 state whenever you run the comparison flow. | By default the script uses `https://api.mem9.ai`; set `MEM9_BASE_URL` if you want a different endpoint. |
+| Access to the hosted mem9 API (or another mem9-compatible endpoint) | Stores mem9 state whenever you run the comparison flow. | By default the script uses `https://api.mem9.ai`; pass `--mem9-base-url` if you want a different endpoint. |

 ## Pipeline

````
````diff
@@ -43,46 +84,106 @@ python3 mr-niah-transcript.py [--lang LANG] [--tokens BUCKET ...] [--input FILE
 - `output/sessions/<uuid>.jsonl` – session history ready for OpenClaw.
 - `output/index.jsonl` – metadata that downstream steps consume.
 - The defaults read all files in `origin/` if present; pass explicit files with `--input` or disable auto-selection via `--lang none`.
+- If `benchmark/MR-NIAH/output/` is not writable (or you want to keep it immutable), write transcripts somewhere else:
+
+```
+python3 mr-niah-transcript.py --output-dir /tmp/mrniah-output
+```

 ### 3. Run OpenClaw batches

 ```
-python3 run_batch.py --profile mrniah_local --agent main --local --limit 30
+python3 run_batch.py --profile mrniah_local --agent main --limit 30
 ```

 - The script copies each transcript into `<profile>/agents/<agent>/sessions/`, registers it in `sessions.json`, calls `openclaw agent --session-id ... --message "<question>" --json`, and stores both structured JSON and raw logs under `results/`.
 - Key flags:
   - `--profile` – target OpenClaw profile (must already exist as described above).
   - `--agent` – agent directory name inside the profile. Defaults to `main`.
-  - `--local` – forwards OpenClaw’s `--local` flag, useful when the agent relies on local transports.
   - `--limit` – cap the number of MR-NIAH samples processed.
+  - `--output-dir` – where to read `index.jsonl` and `sessions/*.jsonl` from (default: `output/`).
+  - `--import-sessions` – uploads the session transcript to mem9 via `/imports` before each agent turn. Requires mem9 tenant details via `--mem9-api-url`/`--mem9-tenant-id` (or env vars / profile config).
 - Artifacts land in `results/predictions.jsonl` plus `results/raw/*.stdout.json` / `.stderr.txt`.

 ### 4. (Optional) Baseline vs mem9 comparison

 ```
-MRNIAH_LIMIT=30 ./run_mem_compare.sh
+./run_mem_compare.sh --limit 30
+```
+
+If you generated transcripts into a non-default output directory, pass the same location to the runner:
+
+```
+./run_mem_compare.sh --output-dir /tmp/mrniah-output --limit 10
+```
+
+To rerun only one side (useful when baseline already exists and you just want to retry the mem9 run):
+
+```
+./run_mem_compare.sh --profile mrniah_mem --limit 30
 ```

+To resume a failed single-profile run from a specific sample id (keeps `benchmark/MR-NIAH/results-<profile>/` and appends to `predictions.jsonl`):
+
+```
+./run_mem_compare.sh --profile mrniah_mem --resume 91
+```
+
+To re-run a single case (useful for patching up failures after the batch finishes):
+
+```
+./run_mem_compare.sh --profile mrniah_mem --case 91
+```
+
+By default, the runner continues on per-case failures and records them into `predictions.jsonl`.
+To stop immediately on the first failure, add `--fail-fast`.
+
+To compare existing runs without re-running (e.g. baseline succeeded earlier, mem was re-run later):
+
+```
+./run_mem_compare.sh --compare
+```
+
+#### Common options
+
+- `--model <provider/model>`: sets `agents.defaults.model.primary` for both baseline + mem profiles.
+- `--compact <preset|path.json>`: applies a compaction preset to both baseline + mem profiles (`agents.defaults.contextTokens` + `agents.defaults.compaction`).
+- `--model-context-window <n>`: best-effort patch of the selected model catalog entry in `openclaw.json` (`models.providers.*.models[].contextWindow`). This is only applied when the profile `openclaw.json` contains a matching model entry.
+- `--mem9-base-url <url>`: overrides the default mem9 base URL for this run.
+
+#### Post-processing (archive)
+
+When you run a full baseline-vs-mem comparison (not `--profile`, not `--compare`, not `--case`, not `--resume`) and the script completes successfully, it automatically creates a tarball in `results-logs/` containing:
+
+- both `results-<profile>/` directories
+- the main compare log file
+
 1. Verifies `output/index.jsonl` exists (generate it if missing).
-2. Clones `~/.openclaw-${MRNIAH_BASE_PROFILE}` to `~/.openclaw-${MRNIAH_MEM_PROFILE}` unless you export `MRNIAH_RESET_MEM_PROFILE=1`.
-3. Uses the hosted mem9 API by default (`https://api.mem9.ai`), or the endpoint you provide via `MEM9_BASE_URL`.
-4. Provisions a fresh mem9 space for the run.
-5. Installs the `openclaw-plugin` into the memory profile, adds `plugins.allow=["mem9"]`, and writes the tenant credentials into `plugins.entries.mem9.config`.
-6. Calls `run_batch.py` twice (baseline vs mem), renaming each `results/` directory to `results-${profile}`.
-7. Prints accuracy for both runs and the delta.
-
-Common environment variables:
-
-| Variable | Default | Purpose |
-| -------------------------- | --------------------- | ------------------------------------------------------ |
-| `MRNIAH_BASE_PROFILE` | `mrniah_local` | Baseline OpenClaw profile. |
-| `MRNIAH_MEM_PROFILE` | `mrniah_mem` | Copy of the baseline with mem9 enabled. |
-| `MRNIAH_AGENT` | `main` | Agent passed through to `run_batch.py`. |
-| `MRNIAH_LIMIT` | `300` | Samples processed per run. |
-| `MRNIAH_LOCAL` | `1` | When `1`, adds `--local` to every OpenClaw invocation. |
-| `MEM9_BASE_URL` | `https://api.mem9.ai` | mem9 API endpoint used for the comparison run. |
-| `MRNIAH_RESET_MEM_PROFILE` | `0` | Set to `1` to delete the mem profile before cloning. |
+2. Creates `~/.openclaw-<mem-profile>` by cloning `~/.openclaw-<base-profile>` when the mem profile is missing, or when you pass `--reset-mem-profile`.
+3. Uses the hosted mem9 API by default (`https://api.mem9.ai`), or the endpoint you provide via `--mem9-base-url`.
+4. Chooses a mem9 isolation strategy via `--mem9-isolation`:
+   - `tenant` (default): provisions a fresh mem9 space per case (strong isolation; recommended).
+   - `clear`: provisions one mem9 space for the run and clears memories before/after each case.
+5. Chooses a mem9 history load strategy via `--mem9-load-method`:
+   - `line-write` (default): replays the transcript by posting each JSONL message line to `v1alpha2 /memories` sequentially.
+   - `import-session`: uploads the full transcript via `v1alpha1 /imports` (`file_type=session`) and polls the task.
+6. Installs the `openclaw-plugin` into the memory profile, adds `plugins.allow=["mem9"]`, and writes the tenant credentials into `plugins.entries.mem9.config`.
+7. Calls `run_batch.py` twice (baseline vs mem), writing into `results-${profile}` for baseline and `results-${mem_profile}` for the mem run.
+8. Prints accuracy for both runs and the delta.
+
+Key flags for reproducibility:
+
+- `--base-profile` / `--mem-profile` / `--agent` / `--limit`
+- `--mem9-base-url` / `--mem9-isolation` / `--mem9-load-method`
+- `--mem9-line-write-*` and `--mem9-import-*` (depending on load method)
+- `--mem9-trace-*`
+- `--parallel` / `--sequential`
+- `--openclaw-timeout`
+- `--reset-mem-profile`
+
+Workspace note:
+
+- The scripts configure each OpenClaw profile to use a benchmark workspace under `~/.openclaw-<profile>/workspace` (not under `~/.openclaw/`).

 ### 5. Score predictions

````
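To make the `line-write` load strategy from step 5 concrete, here is a minimal sketch of sequential transcript replay. The `post` callable stands in for an HTTP POST; the `/v1alpha2/memories` path echoes the text above, but the payload shape is purely an assumption for illustration, not the real mem9 API contract.

```python
import json


def replay_transcript(jsonl_text, post):
    """Replay a session transcript one JSONL message line at a time.

    `post` is any callable taking (path, payload); in the real runner this
    would be an HTTP POST against the mem9 endpoint. The payload wrapper
    below ({"message": ...}) is illustrative only.
    """
    sent = 0
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        message = json.loads(line)
        post("/v1alpha2/memories", {"message": message})  # sequential, in order
        sent += 1
    return sent
```

Replaying line by line preserves message order, which is what distinguishes `line-write` from the bulk `import-session` upload.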
````diff
@@ -92,10 +193,10 @@ python3 score.py [results/predictions.jsonl] [--max-errors 5]

 - Splits each ground-truth answer into key phrases and checks whether each phrase appears as a substring in the model prediction (case-sensitive). The per-sample score is the fraction of matched phrases. Refusal responses are scored as 0.
 - Use `--max-errors` to print mismatched samples for manual inspection.
-- Point the script at the comparison artifacts (`results-mrniah_local/predictions.jsonl`, `results-mrniah_mem/predictions.jsonl`) to evaluate each profile independently.
+- Point the script at the comparison artifacts (`results-mrniah_local/predictions.jsonl`, `results-mrniah_mem/predictions.jsonl`) to evaluate each run independently.

 ### Troubleshooting tips

 - Regenerating transcripts is safe—`mr-niah-transcript.py` deletes and recreates `output/` on every run.
 - If OpenClaw logs include ANSI escape sequences, `run_batch.py` strips them before parsing JSON. Check `results/raw/*.stderr.txt` when a session fails.
-- If the hosted mem9 API rejects provisioning or rate-limits requests, wait a bit and rerun, or point `MEM9_BASE_URL` to another mem9-compatible endpoint.
+- If the hosted mem9 API rejects provisioning or rate-limits requests, wait a bit and rerun, or point `--mem9-base-url` to another mem9-compatible endpoint.
````
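For reference, the scoring rubric described in the USAGE.md hunk above (case-sensitive substring containment of key phrases, refusals scored 0, per-sample score as the fraction of matched phrases) can be sketched as follows. How `score.py` actually extracts key phrases is not shown in this diff; splitting the answer on a separator here is an assumption.

```python
def score_sample(prediction, ground_truth, is_refusal=False, sep=";"):
    """Fraction of ground-truth key phrases found verbatim in the prediction.

    Matching is case-sensitive substring containment, per the USAGE.md text.
    Splitting on `sep` is a stand-in; score.py's real phrase extraction may
    differ. Refusal responses score 0.
    """
    if is_refusal:
        return 0.0
    phrases = [p.strip() for p in ground_truth.split(sep) if p.strip()]
    if not phrases:
        return 0.0
    matched = sum(1 for p in phrases if p in prediction)
    return matched / len(phrases)
```

Averaging `score_sample` over all records in `predictions.jsonl` would yield the run-level accuracy that the comparison step reports.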
