benchmark/MR-NIAH/README.md
- **Dataset mirroring (`fetch_data.py`)** – keeps a local `origin/` mirror of MiniMax’s MR-NIAH dumps.
- **Transcript bridge (`mr-niah-transcript.py`)** – rewrites raw turns into OpenClaw `session` JSON plus an `index`.
- **Batch execution (`run_batch.py`)** – rehydrates sessions into a profile, calls `openclaw agent`, and stores `results/`.
- **Comparison runner (`run_mem_compare.sh`)** – clones the baseline profile, installs the mem9 plugin, provisions a fresh mem9 space on the hosted mem9 API (or another configured mem9 endpoint), runs both profiles, prints accuracy deltas, and (on successful full comparisons) writes a `tar.gz` archive containing results and logs. Supports `--model` / `--compact`, plus managed profiles (template + `.env`) to avoid baseline/mem drift; defaults to `benchmark/MR-NIAH/config/openclaw/`.
- **Scoring (`score.py`)** – invokes the MR-NIAH exact-match rubric so downstream results remain comparable to prior papers.
Directory layout, helper scripts, and agent responsibilities are summarized below:
### OpenClaw profiles
There are two ways to run:
1) **Full baseline-vs-mem comparison (recommended)**: `run_mem_compare.sh` defaults to managed profiles and recreates fresh OpenClaw profiles per run from `benchmark/MR-NIAH/config/openclaw/`. You do not need to manually initialize `~/.openclaw-<profile>` beforehand.

2) **Single-profile batch runs** (e.g. calling `run_batch.py` directly): initialize your profile with the OpenClaw CLI so that `~/.openclaw-<profile>/openclaw.json` exists.
If you do not want to manually maintain two profiles (baseline + mem) and risk configuration drift (e.g. compaction settings), `run_mem_compare.sh` can recreate profiles from a template directory each run.
For full baseline-vs-mem comparisons, managed profiles are enabled by default to avoid accidental reuse of existing profiles. If you do not pass `--base-profile/--mem-profile`, the runner appends a `_yyyymmddhhmmss` suffix automatically.
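The automatic `_yyyymmddhhmmss` suffix mentioned above can be reproduced with a short sketch (illustrative only; the runner itself is a shell script, and this helper name is hypothetical):

```python
from datetime import datetime

def profile_suffix(now=None):
    """Build a _yyyymmddhhmmss suffix like the one appended to managed profile names."""
    now = now or datetime.now()
    return "_" + now.strftime("%Y%m%d%H%M%S")
```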
Requirements:
- A template directory that contains at least an `openclaw.json` (it can also include `agents/`, `workspace/`, etc.).
- An `.env` file that contains your secret API keys and any other required environment variables.
  - The runner treats it as opaque and never prints it.
2) Edit `benchmark/MR-NIAH/config/openclaw/.env` to set your keys.
3) Ensure `benchmark/MR-NIAH/config/openclaw/openclaw.json` references the same variable names (it typically uses `${ENV_VAR}` placeholders).
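The `${ENV_VAR}` placeholder convention can be illustrated with a minimal sketch (an assumption about the substitution behavior, not the actual OpenClaw config loader):

```python
import os
import re

def expand_placeholders(text: str) -> str:
    """Replace ${VAR} placeholders with environment values; leave unknown vars untouched."""
    return re.sub(
        r"\$\{([A-Z0-9_]+)\}",
        lambda m: os.environ.get(m.group(1), m.group(0)),
        text,
    )
```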
Example (recreate baseline + mem from a local template, set model + compaction preset):
```
./run_mem_compare.sh \
  --model "dashscope/qwen3.5-plus" \
  --compact "safeguard-20k"
```
Compaction presets live under `benchmark/MR-NIAH/openclaw/compact/` (a default `safeguard-20k` preset is included).
### Software and infrastructure
| Requirement | Purpose | Notes |
| --- | --- | --- |
| Python 3.10+ & pip | Runs `fetch_data.py`, `mr-niah-transcript.py`, `run_batch.py`, and `score.py`. | Install dependencies with `python3 -m pip install -r requirements.txt` from the repo root if available, or install `requests`, `click`, and `rich` manually. |
| Git + network access to MiniMax’s MR-NIAH repo | `fetch_data.py` mirrors upstream datasets via GitHub. | Works with anonymous HTTPS; provide a token if your network requires it. |
| OpenClaw CLI (latest) | Executes agents for each regenerated session. | Verify `openclaw --version` works and that the CLI can run your chosen profile interactively. |
| Access to the hosted mem9 API (or another mem9-compatible endpoint) | Stores mem9 state whenever you run the comparison flow. | By default the script uses `https://api.mem9.ai`; pass `--mem9-base-url` if you want a different endpoint. |
```
python3 run_batch.py --profile mrniah_local --agent main --limit 30
```
- The script copies each transcript into `<profile>/agents/<agent>/sessions/`, registers it in `sessions.json`, calls `openclaw agent --session-id ... --message "<question>" --json`, and stores both structured JSON and raw logs under `results/`.
- Key flags:
  - `--profile` – target OpenClaw profile (must already exist as described above).
  - `--agent` – agent directory name inside the profile. Defaults to `main`.
  - `--limit` – cap the number of MR-NIAH samples processed.
  - `--output-dir` – where to read `index.jsonl` and `sessions/*.jsonl` from (default: `output/`).
  - `--import-sessions` – uploads the session transcript to mem9 via `/imports` before each agent turn. Requires mem9 tenant details via `--mem9-api-url`/`--mem9-tenant-id` (or env vars / profile config).
- Artifacts land in `results/predictions.jsonl` plus `results/raw/*.stdout.json` / `.stderr.txt`.
### 4. (Optional) Baseline vs mem9 comparison
```
./run_mem_compare.sh --limit 30
```
If you generated transcripts into a non-default output directory, pass the same location to the runner.

By default, the runner continues on per-case failures and records them into `predictions.jsonl`.
To stop immediately on the first failure, add `--fail-fast`.
To compare existing runs without re-running (e.g. baseline succeeded earlier, mem was re-run later):
```
./run_mem_compare.sh --compare
```
#### Common options
- `--model <provider/model>`: sets `agents.defaults.model.primary` for both baseline + mem profiles.
- `--compact <preset|path.json>`: applies a compaction preset to both baseline + mem profiles (`agents.defaults.contextTokens` + `agents.defaults.compaction`).
- `--model-context-window <n>`: best-effort patch of the selected model catalog entry in `openclaw.json` (`models.providers.*.models[].contextWindow`). Only applied when the profile `openclaw.json` contains a matching model entry.
- `--mem9-base-url <url>`: overrides the default mem9 base URL for this run.
#### Post-processing (archive)
When you run a full baseline-vs-mem comparison (not `--profile`, not `--compare`, not `--case`, not `--resume`) and the script completes successfully, it automatically creates a tarball in `results-logs/` containing:
- both `results-<profile>/` directories
- the main compare log file
Under the hood, a full comparison run does the following:
1. Verifies `output/index.jsonl` exists (generate it if missing).
2. Creates `~/.openclaw-<mem-profile>` by cloning `~/.openclaw-<base-profile>` when the mem profile is missing, or when you pass `--reset-mem-profile`.
3. Uses the hosted mem9 API by default (`https://api.mem9.ai`), or the endpoint you provide via `--mem9-base-url`.
4. Chooses a mem9 isolation strategy via `--mem9-isolation`:
   - `tenant` (default): provisions a fresh mem9 space per case (strong isolation; recommended).
   - `clear`: provisions one mem9 space for the run and clears memories before/after each case.
5. Chooses a mem9 history load strategy via `--mem9-load-method`:
   - `line-write` (default): replays the transcript by posting each JSONL message line to the `v1alpha2` `/memories` endpoint sequentially.
   - `import-session`: uploads the full transcript via the `v1alpha1` `/imports` endpoint (`file_type=session`) and polls the task.
6. Installs the `openclaw-plugin` into the memory profile, adds `plugins.allow=["mem9"]`, and writes the tenant credentials into `plugins.entries.mem9.config`.
7. Calls `run_batch.py` twice (baseline vs mem), writing into `results-${profile}` for baseline and `results-${mem_profile}` for the mem run.
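The `line-write` strategy can be sketched as follows (a minimal illustration with a pluggable `post` callable standing in for the HTTP client; the payload shape is an assumption, not the real `/memories` schema):

```python
import json

def replay_transcript(jsonl_text: str, post) -> int:
    """Replay a session transcript line by line, posting each message in order."""
    count = 0
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines in the JSONL transcript
        message = json.loads(line)
        post(message)  # e.g. POST to the configured /memories endpoint
        count += 1
    return count
```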
- Splits each ground-truth answer into key phrases and checks whether each phrase appears as a substring in the model prediction (case-sensitive). The per-sample score is the fraction of matched phrases. Refusal responses are scored as 0.
- Use `--max-errors` to print mismatched samples for manual inspection.
- Point the script at the comparison artifacts (`results-mrniah_local/predictions.jsonl`, `results-mrniah_mem/predictions.jsonl`) to evaluate each run independently.
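The phrase-matching rubric described above can be sketched as follows (a minimal illustration, not the actual `score.py` implementation; the refusal heuristic shown is an assumed placeholder):

```python
def score_sample(answer_phrases, prediction, refusal_markers=("I cannot", "I'm sorry")):
    """Fraction of ground-truth key phrases found verbatim (case-sensitive) in the prediction."""
    if any(marker in prediction for marker in refusal_markers):
        return 0.0  # refusal responses are scored as 0
    if not answer_phrases:
        return 0.0
    matched = sum(1 for phrase in answer_phrases if phrase in prediction)
    return matched / len(answer_phrases)
```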
### Troubleshooting tips
- Regenerating transcripts is safe—`mr-niah-transcript.py` deletes and recreates `output/` on every run.
- If OpenClaw logs include ANSI escape sequences, `run_batch.py` strips them before parsing JSON. Check `results/raw/*.stderr.txt` when a session fails.
- If the hosted mem9 API rejects provisioning or rate-limits requests, wait a bit and rerun, or pass `--mem9-base-url` to point at another mem9-compatible endpoint.
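The ANSI-stripping step mentioned above can be illustrated with a small sketch (a generic technique, not the exact regex `run_batch.py` uses):

```python
import json
import re

# CSI escape sequences (e.g. terminal color codes) that would break json.loads
ANSI_RE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")

def parse_agent_stdout(raw: str):
    """Strip ANSI escape sequences, then parse the remaining text as JSON."""
    return json.loads(ANSI_RE.sub("", raw))
```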