# Incident 2026-04-20 — Embed-path starvation + volume replacement during deploy

**Severity**: High (memhall write path degraded for ~6h; data loss narrowly avoided via backup)
**Duration**: ~6 hours of `embedder: degraded` / `sync_status: pending` before detection
**Resolved**: 2026-04-20 18:25 Taipei

Two cascading issues hit memhall's primary deployment (a single Mac mini) on the same day. Both are operator-facing, not engine bugs. They are documented here so others deploying memory-hall in similar topologies can avoid them.

---
| 10 | + |
| 11 | +## Issue 1 — Ollama LLM queue starves bge-m3 |
| 12 | + |
| 13 | +### Symptom |
| 14 | + |
| 15 | +- `/v1/health` reported `embedder: degraded` for hours without recovery. |
| 16 | +- `POST /v1/memory/write` returned 202 but entries persisted with `sync_status: pending`, `indexed_at: null`. |
| 17 | +- Reading worked (lexical/FTS fallback), but semantic search scores collapsed because new writes weren't in the vector index. |
| 18 | + |
### Root cause

memory-hall pointed its embedder at a shared Ollama instance (`MH_OLLAMA_BASE_URL=...:11434`). That Ollama was simultaneously serving large LLM clients (qwen3-vl, qwen3.5:35b, etc.) whose combined model weights exceeded available GPU memory, so Ollama's scheduler entered a constant evict/load loop. `bge-m3` (small, and fast once loaded) could never win a slot: every embed request hit a cold load that timed out before bge-m3 finished loading.

Direct test on the day:

- `curl .../api/tags` — 200 in <1s (Ollama's metadata path is fine).
- `curl .../api/embed -d '{"model":"bge-m3", ...}'` — 30s timeout with no response, even after `ollama stop` on the blocking LLM.

A dedicated bge-m3 HTTP service on the same host (`:8790`, just FastAPI + transformers) was consistently healthy throughout.
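This starvation pattern can be confirmed with a timeout-bounded probe before swapping embedders. A minimal sketch in Python; the ports come from the setup above, but the exact URLs (in particular a `/health` path on the `:8790` service) are placeholders, not memhall configuration:

```python
import time
import urllib.request

def probe(url: str, timeout: float = 5.0):
    """Return round-trip seconds if `url` answers within `timeout`, else None.

    A starved embed path shows the asymmetry described above: the
    metadata endpoint answers in well under a second, while the embed
    side never comes back and this returns None.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            pass
    except OSError:  # covers URLError, timeouts, refused connections
        return None
    return time.monotonic() - start

# Placeholder endpoints -- substitute your own hosts.
for url in ("http://localhost:11434/api/tags", "http://localhost:8790/health"):
    print(url, probe(url, timeout=1.0))
```

Run it during an incident and a healthy endpoint prints a sub-second latency while a starved one prints `None`.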

### Resolution

Introduced `HttpEmbedder` (see [ADR 0006](../adr/0006-http-embedder-embed-queue-isolation.md)) and set:

```
MH_EMBEDDER_KIND=http
MH_EMBED_BASE_URL=http://<dedicated-embed-host>:8790
```

After redeploy, `/v1/health` immediately returned `ok`, new writes completed with `embedded: true` synchronously, and semantic search scores recovered (RRF 0.033 / semantic 0.638 on the canonical test query, vs 0.016 / unavailable while degraded).
### Operator guidance

If you share one Ollama instance across multiple agent stacks, **do not use it for embeddings**. Ollama's scheduler is not designed for mixed workloads of small, frequent embed calls and large, rare LLM calls. Either:

1. Run a dedicated embed service (any service exposing the `POST /embed {"texts":[...]}` → `{"dense_vecs":[...]}` shape works with `HttpEmbedder`), or
2. Dedicate an Ollama instance exclusively to embedding models (no LLM clients allowed).
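The contract in option 1 is small enough to show end to end. A stdlib-only sketch of the request/response shape; the hash-based `embed_texts` is a deterministic stand-in for a real model (bge-m3's dense vectors are 1024-d, and in production you would back this with FastAPI + transformers as on the `:8790` service above), and `DIM` is illustrative:

```python
import hashlib
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

DIM = 8  # stand-in; real bge-m3 dense vectors are 1024-d

def embed_texts(texts):
    """One DIM-dim vector per text. Hash-based placeholder: swap in a
    real model (e.g. transformers/FlagEmbedding bge-m3) for production."""
    return [
        [b / 255.0 for b in hashlib.sha256(t.encode("utf-8")).digest()[:DIM]]
        for t in texts
    ]

class EmbedHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/embed":
            self.send_error(404)
            return
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        texts = json.loads(body)["texts"]
        payload = json.dumps({"dense_vecs": embed_texts(texts)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

def serve(port: int = 8790):
    HTTPServer(("0.0.0.0", port), EmbedHandler).serve_forever()

# serve()  # <- uncomment to run on the dedicated embed host
```

The key property is isolation: this process loads one model once and never competes with LLM weights for memory.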

---

## Issue 2 — Named volume replaced when switching from `docker run` to `docker compose`

### Symptom

During the fix for Issue 1, redeploying via `docker compose up -d --force-recreate memory-hall` silently created a new, empty `memory-hall_mh-data` named volume. The container came up healthy but with **zero existing entries** visible.
### Root cause

The original deployment used `docker run -v memory-hall_mh-data:/data ...` (or a similarly named volume), created ad hoc. When `docker-compose.yml` later declared a volume with the short name `mh-data`, Compose namespaced it by project: the effective volume name becomes `${project}_mh-data` = `memory-hall_mh-data` — but **only when Compose manages it**. An existing volume with the same literal name, created outside Compose, does not automatically inherit Compose's project labels.

What actually happened in this deploy (reconstructed from `docker volume inspect` timestamps): Compose treated the pre-existing volume as an orphan and replaced it with a freshly created empty volume carrying the correct `com.docker.compose.project` labels. The old volume's data was never mounted into the new container.
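Whether Compose considers itself the owner of an existing volume is visible in its labels, so this condition can be checked before redeploying. A small sketch; `compose_project` does the label check, and `volume_owner` assumes the `docker` CLI is on PATH:

```python
import json
import subprocess

def compose_project(inspect_entry: dict):
    """Return the Compose project that owns a volume, or None for ad-hoc volumes.

    `inspect_entry` is one element of `docker volume inspect <name>` output.
    Volumes created by plain `docker run -v name:/path` carry no Compose labels.
    """
    labels = inspect_entry.get("Labels") or {}
    return labels.get("com.docker.compose.project")

def volume_owner(name: str):
    """Inspect a live volume by name (requires the docker CLI)."""
    out = subprocess.run(
        ["docker", "volume", "inspect", name],
        check=True, capture_output=True, text=True,
    ).stdout
    return compose_project(json.loads(out)[0])

# volume_owner("memory-hall_mh-data") returning None means Compose did not
# create the volume and may not adopt it on the next `compose up`.
```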

Data was recovered from a JSONL dump that happened to have been taken for unrelated reasons ~9 hours earlier. Without that dump, the 47 pre-existing entries would have been lost.

### Operator guidance (critical)

Before running `docker compose up --force-recreate` against a service that was previously started via plain `docker run`:

1. **Back up the data directory first.** For memhall:
   ```bash
   docker run --rm -v memory-hall_mh-data:/backup alpine \
     tar czf - /backup > memhall-backup-$(date +%F).tar.gz
   ```
   Or use the bind-mount layout recommended in [`docs/deploy.md`](../deploy.md) and snapshot the host path directly.

2. **Confirm which volume Compose will use.** `docker compose config` prints the resolved volume references. If Compose would create `${project}_<name>` but your old data lives under just `<name>` (or a different path), either copy the old volume's contents into the name Compose expects, or point the compose file at the existing volume explicitly (a top-level volume declared `external: true` with its literal `name:`).

3. **Prefer bind mounts over named volumes** for primary production data (the pattern `docs/deploy.md` already recommends). Bind mounts are transparent: the data lives at a host path you control, backup is an `rsync`, and Compose cannot silently swap it out.

4. **Keep a daily dump**, not just for disasters. A scheduled `GET /v1/memory?limit=1000&cursor=...` (or a CLI export) writing JSONL to a separate host or NAS is cheap insurance. We'll add a reference script under `deploy/` in a follow-up.
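The dump in step 4 is small enough to sketch. This walks the cursor to exhaustion and writes one JSON object per line; the `BASE` URL and the response fields (`entries`, `next_cursor`) are assumptions about the API shape, so verify them against your memhall version before relying on it:

```python
import json
import urllib.parse
import urllib.request

BASE = "http://localhost:8700"  # placeholder memhall base URL

def fetch_page(cursor=None, limit=1000):
    """One GET /v1/memory page; assumed {"entries": [...], "next_cursor": ...}."""
    params = {"limit": str(limit)}
    if cursor:
        params["cursor"] = cursor
    url = f"{BASE}/v1/memory?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)

def dump_jsonl(out_path, get_page=fetch_page):
    """Follow the cursor until exhausted, writing JSONL; returns entry count."""
    count, cursor = 0, None
    with open(out_path, "w", encoding="utf-8") as f:
        while True:
            page = get_page(cursor)
            for entry in page["entries"]:
                f.write(json.dumps(entry, ensure_ascii=False) + "\n")
                count += 1
            cursor = page.get("next_cursor")
            if not cursor:
                return count
```

Scheduled from cron on a separate host, this gives you exactly the restore path that saved this incident.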

---

## Timeline

| Time (Taipei) | Event |
|---------------|-------|
| ~10:15 | memhall container started (original deployment; stayed Up ~7h until intervention) |
| 17:00 | User noticed `/v1/health` returned `degraded`; investigation started |
| 17:10 | Root-caused to Ollama queue starvation; dedicated `:8790` bge-m3 service confirmed healthy |
| 17:30 | `HttpEmbedder` class + config + tests implemented |
| 17:55 | Deploy attempted via SSH; blocked by macOS keychain non-interactive limitation |
| 18:00 | Deploy script re-run in the mini's local Terminal after `security -v unlock-keychain` |
| 18:10 | Port 6333 / 9100 conflicts resolved (`--no-deps`, compose port alignment 9000→9100) |
| 18:15 | New container up and healthy, but the pre-existing 47 entries were missing from the new volume |
| 18:20 | Restored from the JSONL dump taken earlier in the day for unrelated reasons |
| 18:25 | Full recovery: 49 entries visible, embedder=ok, semantic search scores recovered |

---

## Action items

- [x] Land `HttpEmbedder` + `health_embed_timeout_s` — [ADR 0006](../adr/0006-http-embedder-embed-queue-isolation.md), merged.
- [x] Document both issues in operator-facing docs — this file + the `docs/deploy.md` footgun section.
- [ ] Ship `deploy/memhall-dump.sh` — nightly JSONL dump to a separate host. Tracked in a follow-up.
- [ ] Align the `docker-compose.yml` default volume strategy with `docs/deploy.md` (bind mount). Compose currently uses a named volume while deploy.md recommends a bind mount; that gap is the core of Issue 2.