
Commit fe52083

MakiDevelop and claude committed
docs: ADR 0006 (HttpEmbedder) + 2026-04-20 incident log + deploy footguns + CHANGELOG
Governance-layer writeup of memory-hall's first production incident, 2026-04-20. Two cascading issues: (1) the Ollama LLM queue starved bge-m3 embeds → new HttpEmbedder path; (2) `docker compose --force-recreate` treated the volume created by `docker run` as an orphan and removed it → 47 entries recovered thanks to the morning's JSONL dump.

- ADR 0006: HttpEmbedder design rationale, alternatives, API shape, separate health timeout
- docs/operations/incident-2026-04-20-embed-queue.md: full timeline + root cause + operator guidance
- docs/deploy.md: new "Deploy footguns" section covering four items: embed isolation / backup before force-recreate / macOS keychain / port alignment
- CHANGELOG.md: created for the first time, with Unreleased (HttpEmbedder + health_embed_timeout_s) plus the 0.2.0 / 0.1.0 history

Follow-up (TODO, not in this commit):
- deploy/memhall-dump.sh (daily JSONL dump cron; turning the dump that saved the data this time into a cron job)
- Switch docker-compose.yml to a bind mount to match the primary example in docs/deploy.md (currently a named volume, which is exactly the root cause of incident #2)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent a80d8cc commit fe52083

4 files changed

Lines changed: 254 additions & 0 deletions


CHANGELOG.md

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
# Changelog

All notable changes to memory-hall are documented here.

Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). memory-hall uses versioned 0.x releases; see [ADR 0005](docs/adr/0005-v0.2-minimum-viable-contract.md) for what's frozen vs free-to-change at each 0.x version.

## [Unreleased]

### Added

- **`HttpEmbedder`** — second embedder backend alongside `OllamaEmbedder`. Speaks a minimal `POST /embed` / `{"texts": [...]}` → `{"dense_vecs": [...]}` contract. Opt in with `MH_EMBEDDER_KIND=http` + `MH_EMBED_BASE_URL=...`. Rationale in [ADR 0006](docs/adr/0006-http-embedder-embed-queue-isolation.md).
- **`health_embed_timeout_s`** config (default `3.0`) — separate knob for the `/v1/health` embed-probe timeout, independent from the write-path `embed_timeout_s`. Fixes a 1-second hardcoded timeout that was too tight for remote embed services.
- **Operator footgun docs** in [`docs/deploy.md`](docs/deploy.md) and a full incident writeup at [`docs/operations/incident-2026-04-20-embed-queue.md`](docs/operations/incident-2026-04-20-embed-queue.md). Covers Ollama-queue starvation, named-volume replacement during compose recreate, and macOS keychain non-interactive builds.

### Changed

- **`docker-compose.yml`** host port is now `9100:9000` (was `9000:9000`). Matches what existing operator docs and examples already assumed. If you pinned the old `9000` port in downstream callers, update before recreating the container.
- **Health probe** (`_refresh_health_cache`) uses `health_embed_timeout_s` instead of `min(1.0, embed_timeout_s)`. This is a behavioral change to the v0.2 `/v1/health` contract affecting only the `degraded` threshold — the response shape is unchanged.

### Fixed

- Health probe no longer false-degrades when the embedder is a remote HTTP service with typical (~500 ms–1 s) cold-path latency.

## 0.2.0 — 2026-04-19

v0.2 minimum viable contract freeze. See [ADR 0005](docs/adr/0005-v0.2-minimum-viable-contract.md) for the full frozen surface.

Highlights:
- `/v1/memory/write`, `/v1/memory/search`, `/v1/health` contracts declared stable for the v0.2.x line.
- sqlite-vec v0.1.6 as the default vector store.
- Content-hash–based deduplication (`(agent_id, namespace, type, content)` → deterministic `entry_id`).
- Multi-tenant data model (single-tenant runtime in 0.2, multi-tenant deferred to 0.3+).

## 0.1.0 — 2026-04-18

Initial public release. See [ADR 0001](docs/adr/0001-drop-mem0.md) for the project's founding rationale (why memory-hall exists vs mem0 / LangMem / Zep).

Core:
- SQLite + sqlite-vec storage.
- Ollama (`bge-m3`) embeddings.
- HTTP API (`/v1/memory/write`, `/v1/memory/search`, `/v1/memory/{id}`, `/v1/health`).
- CLI (`memory-hall write / search / list / reindex-fts`).
- Python embedded usage (`from memory_hall import Settings, build_runtime`).
docs/adr/0006-http-embedder-embed-queue-isolation.md

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
# ADR 0006 — HttpEmbedder: embed path isolation from LLM queue

- **Status**: Accepted
- **Date**: 2026-04-20
- **Related**: ADR 0005 (v0.2 Minimum Viable Contract), incident log `docs/operations/incident-2026-04-20-embed-queue.md`

## Context

Until 0.2.0, memory-hall had exactly one embedder backend: `OllamaEmbedder` against `MH_OLLAMA_BASE_URL`. On 2026-04-20 the production seven-agent stack hit an incident that exposed a structural problem with this setup:

**Ollama is a shared runner pool.** When LLM clients hammer `/api/generate` or `/v1/chat/completions` with multi-GB models (qwen3-vl, qwen3.5:35b-a3b, etc.), Ollama's scheduler evicts/loads models to fit GPU+system memory. `bge-m3` (small, fast, embed-only) gets starved: every embed request triggers a cold load that never wins against the LLM traffic.

Observed symptoms on the day:
- `/v1/health` returned `embedder: degraded` continuously (1s probe timeout, but a cold load of bge-m3 through Ollama's queue took >10s).
- `POST /v1/memory/write` succeeded with `202 Accepted` but entries stayed `sync_status: pending`, `indexed_at: null` indefinitely.
- Direct `curl http://dgx:11434/api/embed` from the memory-hall container timed out at 30s even with all other models stopped — Ollama's eviction/load loop was saturated by a separate LLM client.

Meanwhile a **dedicated bge-m3 HTTP service** on the same embedder host (`:8790`, FastAPI + transformers) was consistently healthy. It does one thing — serve bge-m3 embeddings — and is not subject to Ollama's scheduler.

## Decision

**Add `HttpEmbedder` as a first-class embedder backend, selectable at runtime via `MH_EMBEDDER_KIND=http` + `MH_EMBED_BASE_URL=...`.**

The existing `OllamaEmbedder` remains the default for backward compatibility. Operators who already have a dedicated bge-m3 HTTP service (or any service with the same API shape) can opt in without touching code.
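
For orientation, a minimal sketch of the selection logic. The stand-in classes and field names below are illustrative assumptions (only `embedder_kind`, `embed_base_url`, `embed_dim`, and the `MH_*` variables are confirmed by this ADR); the real branch lives in `src/memory_hall/server/app.py`:

```python
from dataclasses import dataclass

# Illustrative stand-ins only -- not memory-hall's real classes or signatures.
@dataclass
class OllamaEmbedder:
    base_url: str
    timeout_s: float

@dataclass
class HttpEmbedder:
    base_url: str
    dim: int
    timeout_s: float

@dataclass
class Settings:
    embedder_kind: str = "ollama"                    # MH_EMBEDDER_KIND
    ollama_base_url: str = "http://localhost:11434"  # MH_OLLAMA_BASE_URL
    embed_base_url: str = ""                         # MH_EMBED_BASE_URL
    embed_dim: int = 1024                            # bge-m3 dense dimension
    embed_timeout_s: float = 30.0

def build_embedder(settings: Settings):
    # Opt-in HTTP path; default stays Ollama for backward compatibility.
    if settings.embedder_kind == "http":
        return HttpEmbedder(settings.embed_base_url, settings.embed_dim,
                            settings.embed_timeout_s)
    return OllamaEmbedder(settings.ollama_base_url, settings.embed_timeout_s)
```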
### API shape assumed by HttpEmbedder

```
POST /embed
Request : {"texts": [str, str, ...]}
Response : {"model": str, "dimension": int, "count": int, "dense_vecs": [[float, ...], ...]}
```

This matches the reference dedicated bge-m3 service (and makes it trivial to wrap any embedding service that returns a list of vectors).
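
A minimal client for this contract, roughly the shape of what `HttpEmbedder` does internally (a stdlib-only sketch under the contract above; the real class additionally validates dimensions and propagates errors, per the tests listed below):

```python
import json
import urllib.request

def embed(base_url: str, texts: list[str], timeout_s: float = 30.0) -> list[list[float]]:
    """POST texts to a /embed service and return its dense vectors.

    Sketch of the documented contract only; memory-hall's actual
    HttpEmbedder lives in src/memory_hall/embedder/http_embedder.py.
    """
    req = urllib.request.Request(
        f"{base_url.rstrip('/')}/embed",
        data=json.dumps({"texts": texts}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        body = json.load(resp)
    return body["dense_vecs"]
```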
### Health probe separation

A secondary finding from the same incident: the health probe hardcoded `timeout=min(1.0, embed_timeout_s)`. That 1-second cap is fine for local Ollama but unreasonable for a remote HTTP service. Added `health_embed_timeout_s: float = 3.0` as a separate setting so operators can tune health-probe strictness independently from the write-path timeout.
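
Schematically, the probe logic becomes the following. This is a hedged sketch, not the actual `_refresh_health_cache` body, and it assumes the embedder exposes a per-call timeout:

```python
def embed_probe_status(embedder, settings) -> str:
    """Sketch of the separated health-probe timeout, not memory-hall's
    real probe. The call previously used min(1.0, embed_timeout_s)."""
    try:
        # Canary embed bounded by the new, independent health knob.
        embedder.embed(["health-probe"], timeout_s=settings.health_embed_timeout_s)
        return "ok"
    except Exception:
        return "degraded"
```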
## Consequences

### Gains

- **No more LLM-queue starvation for embeddings.** An operator who points memory-hall at a dedicated embed service gets a hard isolation boundary from whatever else is hammering the LLM runner.
- **Swappable embed backends.** The protocol is documented and minimal; anyone can write a ~20-line wrapper in front of bge-m3, nomic-embed, or a cloud embed API (see the sketch after this list), and memory-hall consumes it unchanged.
- **Backward compatible.** Default remains `MH_EMBEDDER_KIND=ollama`; existing deployments do nothing.
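
As an illustration of that claim, a hedged sketch of such a wrapper service, assuming the `FlagEmbedding` package (this is neither memory-hall code nor the reference `:8790` service; batching, auth, and empty-input handling are elided):

```python
# Hypothetical standalone embed service implementing the contract above.
# pip install FlagEmbedding fastapi uvicorn
from fastapi import FastAPI
from pydantic import BaseModel
from FlagEmbedding import BGEM3FlagModel

app = FastAPI()
model = BGEM3FlagModel("BAAI/bge-m3")  # downloads ~2 GB of weights on first run

class EmbedRequest(BaseModel):
    texts: list[str]

@app.post("/embed")
def embed(req: EmbedRequest):
    # encode() returns a dict whose "dense_vecs" is a (n, 1024) numpy array.
    dense = model.encode(req.texts)["dense_vecs"]
    return {
        "model": "bge-m3",
        "dimension": int(dense.shape[1]),
        "count": len(req.texts),
        "dense_vecs": dense.tolist(),
    }
```

Run it with `uvicorn`, point `MH_EMBED_BASE_URL` at it, and memory-hall consumes it unchanged.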
### Costs

- **Two embedder codepaths** to maintain. Both are ~60 lines; drift risk is low but real. Covered by `tests/test_http_embedder.py` + `tests/test_smoke.py::test_health_uses_health_embed_timeout`.
- **Operators now have two settings to understand.** `MH_EMBEDDER_KIND` is explicit and documented in `docker-compose.yml` comments; acceptable overhead.

### Non-goals

- Not solving "multi-embedder with automatic failover". A single kind at a time; if the chosen backend is down, the embedder is down. Failover is the operator's circuit-breaker concern, not the engine's.
- Not abstracting into a plugin system. Two concrete classes implementing the `Embedder` protocol is enough; adding plugin discovery is premature.

## Alternatives considered

### A. Stay on Ollama, preload bge-m3 permanently with `OLLAMA_KEEP_ALIVE=-1`

Rejected after a direct test on the day of the incident: even with bge-m3 pinned, Ollama's scheduler still evicted it when LLM clients requested models whose total memory need exceeded free VRAM. The pin is advisory, not a hard reservation.

### B. Put nginx in front of Ollama to rewrite `/api/embed` → `:8790/embed`

Rejected: payload shapes differ (`{"input": ...}` vs `{"texts": [...]}`, `embeddings` vs `dense_vecs`). A translation layer in nginx is possible but ugly; a 60-line Python class is cleaner and testable.

### C. Make memory-hall embed in-process (no HTTP hop)

Rejected for now: it would require shipping ~2 GB of bge-m3 weights inside the memory-hall image or as a sidecar. The "engine stays small" philosophy (README) argues against it. Operators who want in-process embedding can wrap the CLI or Python entry points; the server path stays HTTP.

## Implementation summary

- `src/memory_hall/embedder/http_embedder.py` — new class, ~60 lines.
- `src/memory_hall/config.py` — add `embedder_kind`, `embed_base_url`, `embed_dim`, `health_embed_timeout_s`.
- `src/memory_hall/server/app.py` — factory branch on `embedder_kind`; health probe uses `health_embed_timeout_s`.
- `docker-compose.yml` — pass-through envs with sane defaults.
- `tests/test_http_embedder.py`, `tests/test_smoke.py` — coverage including dim mismatch, error propagation, empty input, and the new health-probe timeout behavior.

Total: +249 / -9, 7 files. 12 new/updated tests pass; existing test suite unaffected.

docs/deploy.md

Lines changed: 27 additions & 0 deletions
@@ -94,3 +94,30 @@ For production with zero-downtime needs, see v0.2 roadmap — not supported in v
- **Cold standby only**: running primary and backup simultaneously will split-brain the writes.
- **No automatic failover**: manual DNS / config switch.
- **No encryption at rest**: SQLite files are plaintext. Use disk-level encryption (FileVault / LUKS) if your data warrants it.

## Deploy footguns (learned the hard way)

See [`docs/operations/incident-2026-04-20-embed-queue.md`](operations/incident-2026-04-20-embed-queue.md) for the full story. Short version:

### Don't embed through a shared Ollama

If the same Ollama instance serves large LLM clients, `bge-m3` will starve. Either point memory-hall at a dedicated embed service (`MH_EMBEDDER_KIND=http` + `MH_EMBED_BASE_URL=...`) or keep Ollama exclusive to embeddings. See [ADR 0006](adr/0006-http-embedder-embed-queue-isolation.md).

### Back up before `docker compose up --force-recreate`

If your existing deployment was created with plain `docker run`, compose may replace your data volume on recreate. Always snapshot first:

```bash
docker run --rm -v memory-hall_mh-data:/backup alpine \
  tar czf - /backup > memhall-backup-$(date +%F).tar.gz
```

Or — and this is the pattern this doc has recommended since v0.1 — use a **bind mount** (`-v ~/data/memory-hall:/data`) instead of a named volume. Bind mounts are transparent, trivially backed up via `rsync`, and compose cannot silently swap them.

### macOS-specific: keychain must be unlocked for `docker compose build`

Docker Desktop's credential helper requires GUI keychain access. `ssh` into a Mac to build and you'll see `keychain cannot be accessed because the current session does not allow user interaction`. Run `security -v unlock-keychain ~/Library/Keychains/login.keychain-db` in an interactive session first, or build elsewhere and transfer the image with `docker save | docker load`.

### Port alignment

This repo's `docker-compose.yml` exposes `9100:9000` (host:container). If your existing deployment was started with a different host port, callers coded against the old port will break on the first `force-recreate`. Grep your agent stack for the literal port number before redeploying.
docs/operations/incident-2026-04-20-embed-queue.md

Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
# Incident 2026-04-20 — Embed-path starvation + volume replacement during deploy

**Severity**: High (memhall write path degraded for ~6h; data loss narrowly avoided via backup)
**Duration**: ~6 hours of `embedder: degraded` / `sync_status: pending` before detection
**Resolved**: 2026-04-20 18:25 Taipei

Two cascading issues hit memhall's primary deployment (single Mac mini) on the same day. Both are operator-facing, not engine bugs. Documented here so others deploying memory-hall in similar topologies can avoid them.

---

## Issue 1 — Ollama LLM queue starves bge-m3

### Symptom

- `/v1/health` reported `embedder: degraded` for hours without recovery.
- `POST /v1/memory/write` returned 202 but entries persisted with `sync_status: pending`, `indexed_at: null`.
- Reading worked (lexical/FTS fallback), but semantic search scores collapsed because new writes weren't in the vector index.

### Root cause

memory-hall pointed its embedder at a shared Ollama instance (`MH_OLLAMA_BASE_URL=...:11434`). That Ollama was simultaneously serving large LLM clients (qwen3-vl, qwen3.5:35b, etc.) whose combined model weights exceeded available GPU memory. Ollama's scheduler entered a constant evict/load loop. `bge-m3` (small, fast) could not win a slot: every embed request saw a cold load that timed out before bge-m3 got loaded.

Direct test on the day:
- `curl .../api/tags` — 200 in <1s (Ollama metadata is fine).
- `curl .../api/embed -d '{"model":"bge-m3", ...}'` — 30s timeout, no response, even after `ollama stop` on the blocking LLM.

A dedicated bge-m3 HTTP service on the same host (`:8790`, just FastAPI + transformers) was consistently healthy throughout.

### Resolution

Introduced `HttpEmbedder` (see [ADR 0006](../adr/0006-http-embedder-embed-queue-isolation.md)). Set:

```
MH_EMBEDDER_KIND=http
MH_EMBED_BASE_URL=http://<dedicated-embed-host>:8790
```

After redeploy: `/v1/health` immediately returned `ok`; new writes completed with `embedded: true` synchronously; semantic search scores recovered (RRF 0.033 / semantic 0.638 on the canonical test query, vs 0.016 / unavailable while degraded).

### Operator guidance

If you share one Ollama instance across multiple agent stacks, **do not use it for embeddings**. Ollama's scheduler is not designed for mixed small-frequent (embed) + large-rare (LLM) workloads. Either:

1. Run a dedicated embed service (any service with the `POST /embed {"texts":[...]}` → `{"dense_vecs":[...]}` shape works with `HttpEmbedder`), or
2. Dedicate an Ollama instance exclusively to embedding models (no LLM clients allowed).
---

## Issue 2 — Named volume replaced when switching from `docker run` to `docker compose`

### Symptom

During the fix for Issue 1, redeploying via `docker compose up -d --force-recreate memory-hall` silently created a new empty `memory-hall_mh-data` named volume. The running container came up healthy but with **zero existing entries** visible.

### Root cause

The original deployment used `docker run -v memory-hall_mh-data:/data ...` (or a similarly-named volume), created ad hoc. When `docker-compose.yml` declared a volume with the same short name (`mh-data`), Compose namespaced it by project: the effective volume becomes `${project}_mh-data` = `memory-hall_mh-data` — but **only when Compose manages it**. An existing volume with the same literal name, created outside Compose, does not automatically inherit Compose project labels.

What actually happened in this deploy (reconstructed from `docker volume inspect` timestamps): the pre-existing volume was treated as an orphan by Compose and replaced with a freshly created empty volume carrying the correct `com.docker.compose.project` labels. The old volume's data was not mounted into the new container.

Data was recovered from a JSONL dump that happened to be taken for unrelated reasons ~9 hours earlier. Without that dump, the 47 pre-existing entries would have been lost.

### Operator guidance (critical)

Before running `docker compose up --force-recreate` against a service that was previously started via plain `docker run`:

1. **Back up the data directory first.** For memhall:
   ```bash
   docker run --rm -v memory-hall_mh-data:/backup alpine \
     tar czf - /backup > memhall-backup-$(date +%F).tar.gz
   ```
   Or use the bind-mount layout recommended in [`docs/deploy.md`](../deploy.md) and snapshot the host path directly.

2. **Confirm which volume Compose will use.** `docker compose config` prints the resolved volume references. If Compose would create `${project}_<name>` but your old data is under just `<name>` (or a different path), you must either rename the old volume to match Compose's expected name, or reshape the compose file to point at the existing one explicitly.

3. **Prefer bind mounts over named volumes** for primary production data (the pattern `docs/deploy.md` already recommends). Bind mounts are transparent: the data is at a host path you control, backup is `rsync`, and Compose can't silently swap it.

4. **Keep a daily dump**, not just for disasters. A scheduled `GET /v1/memory?limit=1000&cursor=...` (or a CLI export) writing JSONL to a separate host or NAS is cheap insurance. We'll add a reference script under `deploy/` in a follow-up; a sketch of the idea follows this list.
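
A minimal sketch of such a dump script (Python stdlib only). The pagination response shape assumed here (an `entries` list plus a `next_cursor` that is null on the last page) is an assumption to verify against your version, as is the host port:

```python
#!/usr/bin/env python3
"""Nightly JSONL dump sketch -- NOT the promised deploy/memhall-dump.sh.
Check your version's actual /v1/memory pagination shape before relying on it."""
import json
import urllib.parse
import urllib.request
from datetime import date

BASE = "http://localhost:9100"  # host port from this repo's docker-compose.yml

def dump(path: str) -> int:
    count, cursor = 0, None
    with open(path, "w", encoding="utf-8") as out:
        while True:
            query = {"limit": "1000", **({"cursor": cursor} if cursor else {})}
            url = f"{BASE}/v1/memory?{urllib.parse.urlencode(query)}"
            with urllib.request.urlopen(url, timeout=30) as resp:
                page = json.load(resp)
            for entry in page["entries"]:          # assumed response key
                out.write(json.dumps(entry, ensure_ascii=False) + "\n")
                count += 1
            cursor = page.get("next_cursor")       # assumed response key
            if not cursor:
                return count

if __name__ == "__main__":
    n = dump(f"memhall-dump-{date.today()}.jsonl")
    print(f"dumped {n} entries")
```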
---

## Timeline

| Time (Taipei) | Event |
|---------------|-------|
| ~10:15 | memhall container started (original deployment; stayed Up 7h until intervention) |
| 17:00 | User noticed `/v1/health` returned `degraded`; investigation started |
| 17:10 | Root-caused to Ollama queue starvation; dedicated `:8790` bge-m3 service confirmed healthy |
| 17:30 | `HttpEmbedder` class + config + tests implemented |
| 17:55 | Deploy attempted via SSH; blocked by macOS keychain non-interactive limitation |
| 18:00 | Deploy script re-run in the mini's local Terminal after `security -v unlock-keychain` |
| 18:10 | Port 6333 / 9100 conflicts resolved (`--no-deps`, compose port alignment 9000 → 9100) |
| 18:15 | New container up and healthy — but the pre-existing 47 entries missing from the new volume |
| 18:20 | Restored from the JSONL dump taken earlier in the day for unrelated reasons |
| 18:25 | Full recovery: 49 entries visible, embedder=ok, semantic search scores recovered |

---

## Action items

- [x] Land `HttpEmbedder` + `health_embed_timeout_s` — [ADR 0006](../adr/0006-http-embedder-embed-queue-isolation.md), merged.
- [x] Document both issues in operator-facing docs — this file + the `docs/deploy.md` footgun section.
- [ ] Ship `deploy/memhall-dump.sh` — nightly JSONL dump to a separate host. Tracked as a follow-up.
- [ ] Align `docker-compose.yml`'s default volume strategy with `docs/deploy.md` (bind mount). Compose currently uses a named volume while deploy.md recommends a bind mount; that gap is the core of Issue 2.
