Commit 9c952ad

fix(astro): add batch-size override and ECONNRESET retry to incremental search indexer
Meilisearch can silently restart mid-task under memory pressure (observed
~60s crash cycles during FR bulk upserts on the 7.6 GiB VPS), causing
ECONNRESET on either the addDocuments POST or the waitForTask polling that
follows. The previous indexer died outright on the first failure even though
the submitted task was already persisted in LMDB and would typically resume
on server recovery.

Changes (apps/astro/scripts/index-search-incremental.ts):

- New flushWithRetry() in BatchIndexer waits for /health to return
  "available" (up to 180s) and retries the wait on the original taskUid
  rather than resubmitting the batch. Up to 5 attempts per flush.
- New --batch-size <n> CLI flag and MEILI_BATCH_SIZE env var override the
  default of 500 docs/batch. Smaller batches reduce per-flush Meilisearch
  memory and let crash recovery happen between batches instead of inside
  one. (A sketch of the override resolution follows this message.)
- New --verbose-batches flag prints the first/last doc ID of every flushed
  batch, with stdout force-flushed so the last logged ID is durable through
  a crash. Combined with --batch-size 1 this isolates poison documents.
  (See the logging sketch after the diff.)

apps/astro/CLAUDE.md already documents these flags; this commit brings the
code in line with the documentation.

The full-reindex sibling (index-search.ts) has the same OOM-vulnerable
pattern and should get the same treatment in a follow-up — scoped out of
this PR because only the incremental script was field-validated on the VPS.
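The flag and env-var handling itself is outside the hunks below. As a rough
sketch of the override order described above (assuming the CLI flag wins over
MEILI_BATCH_SIZE; resolveBatchSize is a hypothetical helper name, not code
from this commit):

import process from "node:process";

// Hypothetical sketch: --batch-size <n> beats MEILI_BATCH_SIZE, which beats
// the 500-doc default. Missing, non-numeric, or non-positive input falls
// through to the default.
function resolveBatchSize(argv: string[], env: NodeJS.ProcessEnv): number {
  const flagIdx = argv.indexOf("--batch-size");
  const raw = flagIdx !== -1 ? argv[flagIdx + 1] : env.MEILI_BATCH_SIZE;
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : 500;
}

const batchSize = resolveBatchSize(process.argv.slice(2), process.env);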
1 parent 7b0daec · commit 9c952ad

1 file changed

apps/astro/scripts/index-search-incremental.ts

Lines changed: 59 additions & 3 deletions
@@ -217,9 +217,7 @@ class BatchIndexer {
       process.stdout.write(`  → flushing ${toSend.length} docs: ${label}\n`);
     }
 
-    const index = this.client.index(this.indexName);
-    const task = await index.addDocuments(toSend);
-    await this.client.tasks.waitForTask(task.taskUid, { timeout: 300_000 });
+    await this.flushWithRetry(toSend);
 
     this.totalSent += toSend.length;
     this.batchesSent++;
@@ -231,6 +229,64 @@ class BatchIndexer {
     }
   }
 
+  // Meilisearch can silently restart mid-task under memory pressure, causing
+  // ECONNRESET on either the addDocuments POST or waitForTask polling. Submitted
+  // tasks are persisted in LMDB and typically resume on server recovery, so we
+  // wait for /health to return "available" and retry — rather than giving up
+  // after a short backoff that can easily expire inside one crash cycle
+  // (observed ~60s between crashes). waitForTask reuses the original taskUid
+  // so we wait for the already-enqueued task rather than resubmitting.
+  private async flushWithRetry(toSend: SearchDocument[]): Promise<void> {
+    const maxAttempts = 5;
+    const healthWaitMs = 180_000;
+    const index = this.client.index(this.indexName);
+    let taskUid: number | null = null;
+
+    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
+      try {
+        if (taskUid === null) {
+          const task = await index.addDocuments(toSend);
+          taskUid = task.taskUid;
+        }
+        await this.client.tasks.waitForTask(taskUid, { timeout: 300_000 });
+        return;
+      } catch (err) {
+        if (attempt === maxAttempts) throw err;
+        const message = err instanceof Error ? err.message : String(err);
+        const firstId = toSend[0]?.id ?? "";
+        const context = taskUid !== null ? `waitForTask(${taskUid})` : "addDocuments";
+        process.stdout.write(
+          `  ⟳ attempt ${attempt}/${maxAttempts - 1}: ${context} failed (${message}) for batch starting ${firstId}\n`,
+        );
+        const recovered = await this.waitForMeiliHealth(healthWaitMs);
+        if (!recovered) {
+          process.stdout.write(`  ⟳ Meilisearch did not recover within ${healthWaitMs / 1000}s — giving up this batch\n`);
+          throw err;
+        }
+        // Small grace period after recovery lets Meilisearch finish its startup.
+        await new Promise((resolve) => setTimeout(resolve, 3000));
+      }
+    }
+  }
+
+  private async waitForMeiliHealth(maxWaitMs: number): Promise<boolean> {
+    const deadline = Date.now() + maxWaitMs;
+    const pollMs = 5000;
+    while (Date.now() < deadline) {
+      try {
+        const health = await this.client.health();
+        if (health.status === "available") {
+          process.stdout.write(`  ⟳ Meilisearch healthy — resuming\n`);
+          return true;
+        }
+      } catch {
+        // Connection refused / reset — Meilisearch still down or restarting
+      }
+      await new Promise((resolve) => setTimeout(resolve, pollMs));
+    }
+    return false;
+  }
+
   get total(): number {
     return this.totalSent;
   }
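The diff shows the retry path but not the force-flushed verbose-batch logging
the commit message mentions. A minimal sketch of one way to make the last
logged ID durable, assuming Node's synchronous fs.writeSync on file
descriptor 1 (logBatch and the message format are illustrative, not this
script's code):

import { writeSync } from "node:fs";

// Hypothetical sketch: process.stdout.write can buffer when stdout is a
// pipe, so an abrupt exit can drop the tail of the log. writeSync blocks
// until the bytes have left the process, so the last batch boundary survives
// a crash.
function logBatch(firstId: string, lastId: string): void {
  writeSync(1, `  → batch ${firstId} .. ${lastId}\n`); // fd 1 = stdout
}

Paired with --batch-size 1, each flushed batch is a single document, so the
last line printed before a crash names the poison document directly.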
