
Commit fe52083

MakiDevelop and claude committed
docs: ADR 0006 (HttpEmbedder) + 2026-04-20 incident log + deploy footguns + CHANGELOG
Governance-layer writeup of memory-hall's first production incident, 2026-04-20. Two cascading issues: (1) the Ollama LLM queue starved bge-m3 embeds → new HttpEmbedder path; (2) `docker compose --force-recreate` treated the volume created by `docker run` as an orphan and removed it → 47 entries recovered thanks to the morning's JSONL dump.

- ADR 0006: HttpEmbedder design rationale, alternatives, API shape, separate health timeout
- docs/operations/incident-2026-04-20-embed-queue.md: full timeline + root cause + operator guidance
- docs/deploy.md: new "Deploy footguns" section covering four items: embed isolation / backup before force-recreate / macOS keychain / port alignment
- CHANGELOG.md: created for the first time, with Unreleased (HttpEmbedder + health_embed_timeout_s) plus the 0.2.0 / 0.1.0 history

Follow-up (TODO, not in this commit):
- deploy/memhall-dump.sh (daily JSONL dump cron; turning the dump that saved the data this time into a cron job)
- Switch docker-compose.yml to a bind mount to match the primary example in docs/deploy.md (currently a named volume, which is exactly the root cause of incident #2)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent a80d8cc commit fe52083

4 files changed

Lines changed: 254 additions & 0 deletions


CHANGELOG.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
# Changelog

All notable changes to memory-hall are documented here.

Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). memory-hall uses versioned 0.x releases; see [ADR 0005](docs/adr/0005-v0.2-minimum-viable-contract.md) for what's frozen vs free-to-change at each 0.x version.

## [Unreleased]

### Added

- **`HttpEmbedder`** — second embedder backend alongside `OllamaEmbedder`. Speaks a minimal `POST /embed` / `{"texts": [...]}` → `{"dense_vecs": [...]}` contract. Opt in with `MH_EMBEDDER_KIND=http` + `MH_EMBED_BASE_URL=...`. Rationale in [ADR 0006](docs/adr/0006-http-embedder-embed-queue-isolation.md).
- **`health_embed_timeout_s`** config (default `3.0`) — separate knob for the `/v1/health` embed-probe timeout, independent from the write-path `embed_timeout_s`. Fixes a 1-second hardcoded timeout that was too tight for remote embed services.
- **Operator footgun docs** in [`docs/deploy.md`](docs/deploy.md) and a full incident writeup at [`docs/operations/incident-2026-04-20-embed-queue.md`](docs/operations/incident-2026-04-20-embed-queue.md). Covers Ollama-queue starvation, named-volume replacement during compose recreate, and macOS keychain non-interactive builds.

### Changed

- **`docker-compose.yml`** host port is now `9100:9000` (was `9000:9000`). Matches what existing operator docs and examples already assumed. If you pinned the old `9000` port in downstream callers, update before recreating the container.
- **Health probe** (`_refresh_health_cache`) uses `health_embed_timeout_s` instead of `min(1.0, embed_timeout_s)`. This is a behavioral change to the v0.2 `/v1/health` contract affecting only the `degraded` threshold — the response shape is unchanged.

### Fixed

- Health probe no longer false-degrades when the embedder is a remote HTTP service with typical (~500 ms–1 s) cold-path latency.

## 0.2.0 — 2026-04-19

v0.2 minimum viable contract freeze. See [ADR 0005](docs/adr/0005-v0.2-minimum-viable-contract.md) for the full frozen surface.

Highlights:
- `/v1/memory/write`, `/v1/memory/search`, `/v1/health` contracts declared stable for the v0.2.x line.
- sqlite-vec v0.1.6 as the default vector store.
- Content-hash–based deduplication (`(agent_id, namespace, type, content)` → deterministic `entry_id`).
- Multi-tenant data model (single-tenant runtime in 0.2, multi-tenant deferred to 0.3+).

## 0.1.0 — 2026-04-18

Initial public release. See [ADR 0001](docs/adr/0001-drop-mem0.md) for the project's founding rationale (why memory-hall exists vs mem0 / LangMem / Zep).

Core:
- SQLite + sqlite-vec storage.
- Ollama (`bge-m3`) embeddings.
- HTTP API (`/v1/memory/write`, `/v1/memory/search`, `/v1/memory/{id}`, `/v1/health`).
- CLI (`memory-hall write / search / list / reindex-fts`).
- Python embedded usage (`from memory_hall import Settings, build_runtime`).
docs/adr/0006-http-embedder-embed-queue-isolation.md

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
# ADR 0006 — HttpEmbedder: embed path isolation from LLM queue

- **Status**: Accepted
- **Date**: 2026-04-20
- **Related**: ADR 0005 (v0.2 Minimum Viable Contract), incident log `docs/operations/incident-2026-04-20-embed-queue.md`

## Context

Until 0.2.0, memory-hall had exactly one embedder backend: `OllamaEmbedder` against `MH_OLLAMA_BASE_URL`. On 2026-04-20 the production seven-agent stack hit an incident that exposed a structural problem with this setup:

**Ollama is a shared runner pool.** When LLM clients hammer `/api/generate` or `/v1/chat/completions` with multi-GB models (qwen3-vl, qwen3.5:35b-a3b, etc.), Ollama's scheduler evicts/loads models to fit GPU+system memory. `bge-m3` (small, fast, embed-only) gets starved: every embed request triggers a cold load that never wins against the LLM traffic.

Observed symptoms on the day:
- `/v1/health` returned `embedder: degraded` continuously (1s probe timeout, but a cold load of bge-m3 through Ollama's queue took >10s).
- `POST /v1/memory/write` succeeded with `202 Accepted` but entries stayed `sync_status: pending`, `indexed_at: null` indefinitely.
- Direct `curl http://dgx:11434/api/embed` from the memory-hall container timed out at 30s even with all other models stopped — Ollama's eviction/load loop was saturated by a separate LLM client.

Meanwhile a **dedicated bge-m3 HTTP service** on the same embedder host (`:8790`, FastAPI + transformers) was consistently healthy. It does one thing — serve bge-m3 embeddings — and is not subject to Ollama's scheduler.

## Decision

**Add `HttpEmbedder` as a first-class embedder backend, selectable at runtime via `MH_EMBEDDER_KIND=http` + `MH_EMBED_BASE_URL=...`.**

The existing `OllamaEmbedder` remains the default for backward compatibility. Operators who already have a dedicated bge-m3 HTTP service (or any service with the same API shape) can opt in without touching code.
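
For orientation, a minimal sketch of the selection logic. The stand-in classes and field names below are illustrative assumptions (only `embedder_kind`, `embed_base_url`, `embed_dim`, and the `MH_*` variables are confirmed by this ADR); the real branch lives in `src/memory_hall/server/app.py`:

```python
from dataclasses import dataclass

# Illustrative stand-ins only -- not memory-hall's real classes or signatures.
@dataclass
class OllamaEmbedder:
    base_url: str
    timeout_s: float

@dataclass
class HttpEmbedder:
    base_url: str
    dim: int
    timeout_s: float

@dataclass
class Settings:
    embedder_kind: str = "ollama"                    # MH_EMBEDDER_KIND
    ollama_base_url: str = "http://localhost:11434"  # MH_OLLAMA_BASE_URL
    embed_base_url: str = ""                         # MH_EMBED_BASE_URL
    embed_dim: int = 1024                            # bge-m3 dense dimension
    embed_timeout_s: float = 30.0

def build_embedder(settings: Settings):
    # Opt-in HTTP path; default stays Ollama for backward compatibility.
    if settings.embedder_kind == "http":
        return HttpEmbedder(settings.embed_base_url, settings.embed_dim,
                            settings.embed_timeout_s)
    return OllamaEmbedder(settings.ollama_base_url, settings.embed_timeout_s)
```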
### API shape assumed by HttpEmbedder

```
POST /embed
Request : {"texts": [str, str, ...]}
Response : {"model": str, "dimension": int, "count": int, "dense_vecs": [[float, ...], ...]}
```

This matches the reference dedicated bge-m3 service (and makes it trivial to wrap any embedding service that returns a list of vectors).
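
A minimal client for this contract, roughly the shape of what `HttpEmbedder` does internally (a stdlib-only sketch under the contract above; the real class additionally validates dimensions and propagates errors, per the tests listed below):

```python
import json
import urllib.request

def embed(base_url: str, texts: list[str], timeout_s: float = 30.0) -> list[list[float]]:
    """POST texts to a /embed service and return its dense vectors.

    Sketch of the documented contract only; memory-hall's actual
    HttpEmbedder lives in src/memory_hall/embedder/http_embedder.py.
    """
    req = urllib.request.Request(
        f"{base_url.rstrip('/')}/embed",
        data=json.dumps({"texts": texts}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        body = json.load(resp)
    return body["dense_vecs"]
```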
### Health probe separation

A secondary finding from the same incident: the health probe hardcoded `timeout=min(1.0, embed_timeout_s)`. That 1-second cap is fine for local Ollama but unreasonable for a remote HTTP service. Added `health_embed_timeout_s: float = 3.0` as a separate setting so operators can tune health-probe strictness independently from the write-path timeout.
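
Schematically, the probe logic becomes the following. This is a hedged sketch, not the actual `_refresh_health_cache` body, and it assumes the embedder exposes a per-call timeout:

```python
def embed_probe_status(embedder, settings) -> str:
    """Sketch of the separated health-probe timeout, not memory-hall's
    real probe. The call previously used min(1.0, embed_timeout_s)."""
    try:
        # Canary embed bounded by the new, independent health knob.
        embedder.embed(["health-probe"], timeout_s=settings.health_embed_timeout_s)
        return "ok"
    except Exception:
        return "degraded"
```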
## Consequences

### Gains

- **No more LLM-queue starvation for embeddings.** An operator who points memory-hall at a dedicated embed service gets a hard isolation boundary from whatever else is hammering the LLM runner.
- **Swappable embed backends.** The protocol is documented and minimal; anyone can write a ~20-line wrapper in front of bge-m3, nomic-embed, or a cloud embed API (see the sketch after this list), and memory-hall consumes it unchanged.
- **Backward compatible.** Default remains `MH_EMBEDDER_KIND=ollama`; existing deployments do nothing.
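
As an illustration of that claim, a hedged sketch of such a wrapper service, assuming the `FlagEmbedding` package (this is neither memory-hall code nor the reference `:8790` service; batching, auth, and empty-input handling are elided):

```python
# Hypothetical standalone embed service implementing the contract above.
# pip install FlagEmbedding fastapi uvicorn
from fastapi import FastAPI
from pydantic import BaseModel
from FlagEmbedding import BGEM3FlagModel

app = FastAPI()
model = BGEM3FlagModel("BAAI/bge-m3")  # downloads ~2 GB of weights on first run

class EmbedRequest(BaseModel):
    texts: list[str]

@app.post("/embed")
def embed(req: EmbedRequest):
    # encode() returns a dict whose "dense_vecs" is a (n, 1024) numpy array.
    dense = model.encode(req.texts)["dense_vecs"]
    return {
        "model": "bge-m3",
        "dimension": int(dense.shape[1]),
        "count": len(req.texts),
        "dense_vecs": dense.tolist(),
    }
```

Run it with `uvicorn`, point `MH_EMBED_BASE_URL` at it, and memory-hall consumes it unchanged.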
### Costs

- **Two embedder codepaths** to maintain. Both are ~60 lines; drift risk is low but real. Covered by `tests/test_http_embedder.py` + `tests/test_smoke.py::test_health_uses_health_embed_timeout`.
- **Operators now have two settings to understand.** `MH_EMBEDDER_KIND` is explicit and documented in `docker-compose.yml` comments; acceptable overhead.

### Non-goals

- Not solving "multi-embedder with automatic failover". A single kind at a time; if the chosen backend is down, the embedder is down. Failover is the operator's circuit-breaker concern, not the engine's.
- Not abstracting into a plugin system. Two concrete classes implementing the `Embedder` protocol is enough; adding plugin discovery is premature.

## Alternatives considered

### A. Stay on Ollama, preload bge-m3 permanently with `OLLAMA_KEEP_ALIVE=-1`

Rejected after a direct test on the day of the incident: even with bge-m3 pinned, Ollama's scheduler still evicted it when LLM clients requested models whose total memory need exceeded free VRAM. The pin is advisory, not a hard reservation.

### B. Put nginx in front of Ollama to rewrite `/api/embed` → `:8790/embed`

Rejected: payload shapes differ (`{"input": ...}` vs `{"texts": [...]}`, `embeddings` vs `dense_vecs`). A translation layer in nginx is possible but ugly; a 60-line Python class is cleaner and testable.

### C. Make memory-hall embed in-process (no HTTP hop)

Rejected for now: it would require shipping ~2 GB of bge-m3 weights inside the memory-hall image or as a sidecar. The "engine stays small" philosophy (README) argues against it. Operators who want in-process embedding can wrap the CLI or Python entry points; the server path stays HTTP.

## Implementation summary

- `src/memory_hall/embedder/http_embedder.py` — new class, ~60 lines.
- `src/memory_hall/config.py` — add `embedder_kind`, `embed_base_url`, `embed_dim`, `health_embed_timeout_s`.
- `src/memory_hall/server/app.py` — factory branch on `embedder_kind`; health probe uses `health_embed_timeout_s`.
- `docker-compose.yml` — pass-through envs with sane defaults.
- `tests/test_http_embedder.py`, `tests/test_smoke.py` — coverage including dim mismatch, error propagation, empty input, and the new health-probe timeout behavior.

Total: +249 / -9, 7 files. 12 new/updated tests pass; existing test suite unaffected.

docs/deploy.md

Lines changed: 27 additions & 0 deletions
@@ -94,3 +94,30 @@ For production with zero-downtime needs, see v0.2 roadmap — not supported in v
- **Cold standby only**: running primary and backup simultaneously will split-brain the writes.
- **No automatic failover**: manual DNS / config switch.
- **No encryption at rest**: SQLite files are plaintext. Use disk-level encryption (FileVault / LUKS) if your data warrants it.

## Deploy footguns (learned the hard way)

See [`docs/operations/incident-2026-04-20-embed-queue.md`](operations/incident-2026-04-20-embed-queue.md) for the full story. Short version:

### Don't embed through a shared Ollama

If the same Ollama instance serves large LLM clients, `bge-m3` will starve. Either point memory-hall at a dedicated embed service (`MH_EMBEDDER_KIND=http` + `MH_EMBED_BASE_URL=...`) or keep Ollama exclusive to embeddings. See [ADR 0006](adr/0006-http-embedder-embed-queue-isolation.md).

### Back up before `docker compose up --force-recreate`

If your existing deployment was created with plain `docker run`, compose may replace your data volume on recreate. Always snapshot first:

```bash
docker run --rm -v memory-hall_mh-data:/backup alpine \
  tar czf - /backup > memhall-backup-$(date +%F).tar.gz
```

Or — and this is the pattern this doc has recommended since v0.1 — use a **bind mount** (`-v ~/data/memory-hall:/data`) instead of a named volume. Bind mounts are transparent, trivially backed up via `rsync`, and compose cannot silently swap them.

### macOS-specific: keychain must be unlocked for `docker compose build`

Docker Desktop's credential helper requires GUI keychain access. `ssh` into a Mac to build and you'll see `keychain cannot be accessed because the current session does not allow user interaction`. Run `security -v unlock-keychain ~/Library/Keychains/login.keychain-db` in an interactive session first, or build elsewhere and transfer the image with `docker save | docker load`.

### Port alignment

This repo's `docker-compose.yml` exposes `9100:9000` (host:container). If your existing deployment was started with a different host port, callers coded against the old port will break on the first `force-recreate`. Grep your agent stack for the literal port number before redeploying.
docs/operations/incident-2026-04-20-embed-queue.md

Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
# Incident 2026-04-20 — Embed-path starvation + volume replacement during deploy

**Severity**: High (memhall write path degraded for ~6h; data loss narrowly avoided via backup)
**Duration**: ~6 hours of `embedder: degraded` / `sync_status: pending` before detection
**Resolved**: 2026-04-20 18:25 Taipei

Two cascading issues hit memhall's primary deployment (single Mac mini) on the same day. Both are operator-facing, not engine bugs. Documented here so others deploying memory-hall in similar topologies can avoid them.

---

## Issue 1 — Ollama LLM queue starves bge-m3

### Symptom

- `/v1/health` reported `embedder: degraded` for hours without recovery.
- `POST /v1/memory/write` returned 202 but entries persisted with `sync_status: pending`, `indexed_at: null`.
- Reading worked (lexical/FTS fallback), but semantic search scores collapsed because new writes weren't in the vector index.

### Root cause

memory-hall pointed its embedder at a shared Ollama instance (`MH_OLLAMA_BASE_URL=...:11434`). That Ollama was simultaneously serving large LLM clients (qwen3-vl, qwen3.5:35b, etc.) whose combined model weights exceeded available GPU memory. Ollama's scheduler entered a constant evict/load loop. `bge-m3` (small, fast) could not win a slot: every embed request saw a cold load that timed out before bge-m3 got loaded.

Direct test on the day:
- `curl .../api/tags` — 200 in <1s (Ollama metadata is fine).
- `curl .../api/embed -d '{"model":"bge-m3", ...}'` — 30s timeout, no response, even after `ollama stop` on the blocking LLM.

A dedicated bge-m3 HTTP service on the same host (`:8790`, just FastAPI + transformers) was consistently healthy throughout.

### Resolution

Introduced `HttpEmbedder` (see [ADR 0006](../adr/0006-http-embedder-embed-queue-isolation.md)). Set:

```
MH_EMBEDDER_KIND=http
MH_EMBED_BASE_URL=http://<dedicated-embed-host>:8790
```

After redeploy: `/v1/health` immediately returned `ok`; new writes completed with `embedded: true` synchronously; semantic search scores recovered (RRF 0.033 / semantic 0.638 on the canonical test query, vs 0.016 / unavailable while degraded).

### Operator guidance

If you share one Ollama instance across multiple agent stacks, **do not use it for embeddings**. Ollama's scheduler is not designed for mixed small-frequent (embed) + large-rare (LLM) workloads. Either:

1. Run a dedicated embed service (any service with the `POST /embed {"texts":[...]}` → `{"dense_vecs":[...]}` shape works with `HttpEmbedder`), or
2. Dedicate an Ollama instance exclusively to embedding models (no LLM clients allowed).
---

## Issue 2 — Named volume replaced when switching from `docker run` to `docker compose`

### Symptom

During the fix for Issue 1, redeploying via `docker compose up -d --force-recreate memory-hall` silently created a new empty `memory-hall_mh-data` named volume. The running container came up healthy but with **zero existing entries** visible.

### Root cause

The original deployment used `docker run -v memory-hall_mh-data:/data ...` (or a similarly-named volume), created ad hoc. When `docker-compose.yml` declared a volume with the same short name (`mh-data`), Compose namespaced it by project: the effective volume becomes `${project}_mh-data` = `memory-hall_mh-data` — but **only when Compose manages it**. An existing volume with the same literal name, created outside Compose, does not automatically inherit Compose project labels.

What actually happened in this deploy (reconstructed from `docker volume inspect` timestamps): the pre-existing volume was treated as an orphan by Compose and replaced with a freshly created empty volume carrying the correct `com.docker.compose.project` labels. The old volume's data was not mounted into the new container.

Data was recovered from a JSONL dump that happened to be taken for unrelated reasons ~9 hours earlier. Without that dump, the 47 pre-existing entries would have been lost.

### Operator guidance (critical)

Before running `docker compose up --force-recreate` against a service that was previously started via plain `docker run`:

1. **Back up the data directory first.** For memhall:
   ```bash
   docker run --rm -v memory-hall_mh-data:/backup alpine \
     tar czf - /backup > memhall-backup-$(date +%F).tar.gz
   ```
   Or use the bind-mount layout recommended in [`docs/deploy.md`](../deploy.md) and snapshot the host path directly.

2. **Confirm which volume Compose will use.** `docker compose config` prints the resolved volume references. If Compose would create `${project}_<name>` but your old data is under just `<name>` (or a different path), you must either rename the old volume to match Compose's expected name, or reshape the compose file to point at the existing one explicitly.

3. **Prefer bind mounts over named volumes** for primary production data (the pattern `docs/deploy.md` already recommends). Bind mounts are transparent: the data is at a host path you control, backup is `rsync`, and Compose can't silently swap it.

4. **Keep a daily dump**, not just for disasters. A scheduled `GET /v1/memory?limit=1000&cursor=...` (or a CLI export) writing JSONL to a separate host or NAS is cheap insurance. We'll add a reference script under `deploy/` in a follow-up; a sketch of the idea follows this list.
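
A minimal sketch of such a dump script (Python stdlib only). The pagination response shape assumed here (an `entries` list plus a `next_cursor` that is null on the last page) is an assumption to verify against your version, as is the host port:

```python
#!/usr/bin/env python3
"""Nightly JSONL dump sketch -- NOT the promised deploy/memhall-dump.sh.
Check your version's actual /v1/memory pagination shape before relying on it."""
import json
import urllib.parse
import urllib.request
from datetime import date

BASE = "http://localhost:9100"  # host port from this repo's docker-compose.yml

def dump(path: str) -> int:
    count, cursor = 0, None
    with open(path, "w", encoding="utf-8") as out:
        while True:
            query = {"limit": "1000", **({"cursor": cursor} if cursor else {})}
            url = f"{BASE}/v1/memory?{urllib.parse.urlencode(query)}"
            with urllib.request.urlopen(url, timeout=30) as resp:
                page = json.load(resp)
            for entry in page["entries"]:          # assumed response key
                out.write(json.dumps(entry, ensure_ascii=False) + "\n")
                count += 1
            cursor = page.get("next_cursor")       # assumed response key
            if not cursor:
                return count

if __name__ == "__main__":
    n = dump(f"memhall-dump-{date.today()}.jsonl")
    print(f"dumped {n} entries")
```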
---

## Timeline

| Time (Taipei) | Event |
|---------------|-------|
| ~10:15 | memhall container started (original deployment; stayed Up 7h until intervention) |
| 17:00 | User noticed `/v1/health` returned `degraded`; investigation started |
| 17:10 | Root-caused to Ollama queue starvation; dedicated `:8790` bge-m3 service confirmed healthy |
| 17:30 | `HttpEmbedder` class + config + tests implemented |
| 17:55 | Deploy attempted via SSH; blocked by macOS keychain non-interactive limitation |
| 18:00 | Deploy script re-run in the mini's local Terminal after `security -v unlock-keychain` |
| 18:10 | Port 6333 / 9100 conflicts resolved (`--no-deps`, compose port alignment 9000 → 9100) |
| 18:15 | New container up and healthy — but the pre-existing 47 entries missing from the new volume |
| 18:20 | Restored from the JSONL dump taken earlier in the day for unrelated reasons |
| 18:25 | Full recovery: 49 entries visible, embedder=ok, semantic search scores recovered |

---

## Action items

- [x] Land `HttpEmbedder` + `health_embed_timeout_s` — [ADR 0006](../adr/0006-http-embedder-embed-queue-isolation.md), merged.
- [x] Document both issues in operator-facing docs — this file + the `docs/deploy.md` footgun section.
- [ ] Ship `deploy/memhall-dump.sh` — nightly JSONL dump to a separate host. Tracked as a follow-up.
- [ ] Align `docker-compose.yml`'s default volume strategy with `docs/deploy.md` (bind mount). Compose currently uses a named volume while deploy.md recommends a bind mount; that gap is the core of Issue 2.
