Commit 8bed840

Merge branch 'main' into dependabot/npm_and_yarn/npm_and_yarn-da47bb892b
2 parents 7e3f6f4 + 5c0f34b commit 8bed840

42 files changed

Lines changed: 694 additions & 155 deletions

CLAUDE.md

Lines changed: 5 additions & 0 deletions
@@ -130,6 +130,8 @@ pnpm turbo build:api --filter=@lexbuild/api # Production build
 ./scripts/deploy.sh --search-docker-seed # Seed Docker volume from VPS (recover after volume loss)
 
 # Incremental content updates (from monorepo root)
+# Search indexing runs locally in Docker, not on the VPS — each update script's
+# final step delegates to `deploy.sh --search-docker --source <name>`.
 ./scripts/update.sh # All sources incrementally
 ./scripts/update.sh --source ecfr # One source
 ./scripts/update.sh --skip-deploy # Local only
@@ -298,6 +300,9 @@ Note: identifiers use `/us/cfr/` (content type) not `/us/ecfr/` (data source). B
 - **Docker search index checkpoints**: The incremental indexing script writes checkpoint files (`.search-indexed-at-{source}`) into the content directory. For Docker runs, these are persisted in `downloads/.search-checkpoints/` and restored into the temp content dir on each run. If this directory is deleted, the next Docker index run will scan all files from scratch.
 - **Docker volume profiles**: `MEILI_PROFILE=dev|full` selects volume (`meili-data-dev` or `meili-data-full`). Dev mode runs without master key (`MEILI_ENV=development`). Full mode requires `MEILI_MASTER_KEY` for VPS-compatible data.
 - **Cloudflare "Managed robots.txt"**: When enabled, Cloudflare overwrites the site's `robots.txt` to block AI crawlers. For LexBuild (public domain legal content), this should be **OFF**. The custom `robots.txt` at `apps/astro/public/robots.txt` blocks AI crawlers from `/_astro/` (hashed static assets), `/nav/` (internal JSON), and `/api/` while allowing legal content.
+- **VPS PM2 logs live at `/home/ubuntu/pm2/logs/lexbuild/`**, not `~/.pm2/logs/`. The latter is legacy — only `pm2-logrotate-out.log` still writes there. Check the new path when debugging PM2-managed services.
+- **VPS has 6 GiB swap** at `/swapfile` (persisted in `/etc/fstab`). Added as defense against Meilisearch OOM during bulk upserts on a 7.6 GiB RAM Lightsail box. Don't remove.
+- **Stuck Meilisearch tasks crash-loop across restarts**: document-addition tasks that OOM Meilisearch are persisted in LMDB and re-attempted after every PM2 restart (observed ~60s crash cycle, 160+ restarts in 2.5 hours). Cancel via `curl -XPOST -H "Authorization: Bearer $MEILI_MASTER_KEY" "http://127.0.0.1:7700/tasks/cancel?uids=<list>"` — the cancellation typically executes during a healthy window even if the stuck task itself can't complete.
 
 ## When Adding New Source Types
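
The cancel call quoted in the stuck-task bullet can also be issued from a script. The following is a minimal TypeScript sketch, assuming only the `POST /tasks/cancel?uids=<list>` endpoint and bearer-token header shown in that bullet. `buildCancelRequest` and `CancelRequest` are hypothetical names introduced for illustration and are not part of the repository.

```typescript
// Hypothetical helper (not from the repository): builds the URL and fetch
// options for Meilisearch's task-cancel endpoint quoted in CLAUDE.md.
interface CancelRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string> };
}

function buildCancelRequest(
  baseUrl: string,
  masterKey: string,
  uids: number[],
): CancelRequest {
  if (uids.length === 0) {
    // Refuse to build a cancel request that names no tasks.
    throw new Error("At least one task uid is required");
  }
  const url = new URL("/tasks/cancel", baseUrl);
  // The uids filter is a comma-separated list; URLSearchParams handles encoding.
  url.searchParams.set("uids", uids.join(","));
  return {
    url: url.toString(),
    init: {
      method: "POST",
      headers: { Authorization: `Bearer ${masterKey}` },
    },
  };
}
```

A caller would pass the result to `fetch(url, init)`, retrying if Meilisearch is mid-restart, since the note above says the cancellation typically lands during a healthy window.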

apps/astro/CLAUDE.md

Lines changed: 1 addition & 1 deletion
@@ -158,7 +158,7 @@ npx tsx scripts/index-search-incremental.ts --set-checkpoint # Set checkpoint w
 
 Script notes:
 - **generate-highlights.ts**: Forks child processes in 2k-file chunks (default, tunable via `--chunk-size N`) to avoid Shiki OOM. Each child is heap-capped at 2GB (`--max-old-space-size`). Uses `matter(raw, { cache: false })` to prevent gray-matter from caching every file in memory. Supports `--limit N` for testing. Changing themes requires updating both this script and `src/lib/shiki.ts`, then deleting existing `.highlighted.html` files.
-- **index-search.ts** and **index-search-incremental.ts**: Must be kept in sync — sources indexed, `SearchDocument` shape, and `configureIndex` settings must match. Both index USC, eCFR, and FR. Full reindex deletes and rebuilds; incremental upserts only changed files (mtime-based per-source checkpoints in `.search-indexed-at-{usc,ecfr,fr}`). Checkpoints are always written after indexing, even with `--source` — each source tracks independently. 500 docs/batch, 300s waitForTask timeout. Document IDs sanitized (dots/colons → underscores).
+- **index-search.ts** and **index-search-incremental.ts**: Must be kept in sync — sources indexed, `SearchDocument` shape, and `configureIndex` settings must match. Both index USC, eCFR, and FR. Full reindex deletes and rebuilds; incremental upserts only changed files (mtime-based per-source checkpoints in `.search-indexed-at-{usc,ecfr,fr}`). Checkpoints are always written after indexing, even with `--source` — each source tracks independently. Default 500 docs/batch (override with `--batch-size N` or `MEILI_BATCH_SIZE` env var), 300s waitForTask timeout. `--verbose-batches` logs first/last doc ID per flush — pair with `--batch-size 1` to bisect poison docs. Document IDs sanitized (dots/colons → underscores).
 - **generate-nav.ts**: Includes reserved title placeholders (USC 53, eCFR 35). Chapter grouping for eCFR derived from filesystem directories, not `_meta.json`.
 - **All pipeline scripts support `--source usc|ecfr|fr`**: `generate-nav.ts`, `generate-sitemap.ts`, `generate-highlights.ts`, `index-search.ts` (full), and `index-search-incremental.ts` all accept `--source` to process a single source. Sitemap `--source` doesn't rewrite the sitemap index (run without `--source` to rebuild the full index). Highlights `--source` filters by content path prefix.
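
Two behaviors named in the script notes above, ID sanitization (dots/colons to underscores) and mtime-based checkpoint filtering, can be illustrated in a few lines. This is a hedged sketch only: the real helpers are not shown in this diff, so `sanitizeDocumentId`, `changedSince`, and `FileEntry` are hypothetical names, and treating a missing checkpoint as "index everything" is an assumption based on the note that a deleted checkpoint directory causes a scan from scratch.

```typescript
// Hypothetical helpers illustrating the script notes; not the repository's code.

// Replace dots and colons with underscores. Meilisearch document IDs may only
// contain alphanumerics, hyphens, and underscores.
function sanitizeDocumentId(rawId: string): string {
  return rawId.replace(/[.:]/g, "_");
}

interface FileEntry {
  path: string;
  mtimeMs: number; // modification time in milliseconds, as reported by fs.stat
}

// Keep only files modified after the checkpoint. A null checkpoint (assumed to
// mean "no checkpoint file found") selects every file.
function changedSince(files: FileEntry[], checkpointMs: number | null): FileEntry[] {
  if (checkpointMs === null) {
    return files;
  }
  return files.filter((file) => file.mtimeMs > checkpointMs);
}
```

Because each source keeps its own checkpoint file, a filter like this would run once per source with that source's timestamp.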

apps/astro/scripts/index-search-incremental.ts

Lines changed: 47 additions & 7 deletions
@@ -17,15 +17,22 @@
  *   npx tsx scripts/index-search-incremental.ts [content-dir] [--prune] [--source <name>]
  *
  * Options:
- *   content-dir       Path to content directory (default: ./content)
- *   --prune           Remove documents from the index for sections that no longer
- *                     exist on disk (compares Meilisearch IDs against filesystem)
- *   --source <name>   Only index a specific source: usc, ecfr, or fr
- *   --set-checkpoint  Write the checkpoint timestamp and exit (no indexing)
+ *   content-dir        Path to content directory (default: ./content)
+ *   --prune            Remove documents from the index for sections that no longer
+ *                      exist on disk (compares Meilisearch IDs against filesystem)
+ *   --source <name>    Only index a specific source: usc, ecfr, or fr
+ *   --set-checkpoint   Write the checkpoint timestamp and exit (no indexing)
+ *   --batch-size <n>   Override docs-per-batch (default 500). Smaller batches use
+ *                      less Meilisearch memory per flush — useful to isolate
+ *                      pathological docs or avoid OOM on memory-tight hosts.
+ *   --verbose-batches  Print the first/last doc ID of each flushed batch. Combined
+ *                      with --batch-size 1 this produces a per-doc log that makes
+ *                      a poison document obvious (last printed ID before a crash).
  *
  * Environment:
  *   MEILI_URL          Meilisearch endpoint (default: http://127.0.0.1:7700)
  *   MEILI_MASTER_KEY   Master key for admin operations (default: none for dev)
+ *   MEILI_BATCH_SIZE   Fallback for --batch-size when the flag is not set.
  */
 
 import { readdir, readFile, writeFile, stat } from "node:fs/promises";
@@ -38,7 +45,7 @@ import matter from "gray-matter";
 const MEILI_URL = process.env.MEILI_URL ?? "http://127.0.0.1:7700";
 const MEILI_MASTER_KEY = process.env.MEILI_MASTER_KEY ?? "";
 const INDEX_NAME = "lexbuild";
-const BATCH_SIZE = 500;
+const DEFAULT_BATCH_SIZE = 500;
 const BODY_TRUNCATE_CHARS = 5000;
 const CHECKPOINT_PREFIX = ".search-indexed-at";
 const SOURCES = ["usc", "ecfr", "fr"] as const;
@@ -184,6 +191,7 @@ class BatchIndexer {
     private readonly client: Meilisearch,
     private readonly indexName: string,
     private readonly batchSize: number,
+    private readonly verbose: boolean = false,
   ) {}
 
   async add(doc: SearchDocument): Promise<void> {
@@ -201,6 +209,14 @@ class BatchIndexer {
     const toSend = this.batch;
     this.batch = [];
 
+    if (this.verbose) {
+      const firstId = toSend[0]?.id ?? "";
+      const lastId = toSend[toSend.length - 1]?.id ?? "";
+      const label = toSend.length === 1 ? firstId : `${firstId} .. ${lastId}`;
+      // Force-flush stdout so the last logged batch is durable on crash.
+      process.stdout.write(` → flushing ${toSend.length} docs: ${label}\n`);
+    }
+
     const index = this.client.index(this.indexName);
     const task = await index.addDocuments(toSend);
     await this.client.tasks.waitForTask(task.taskUid, { timeout: 300_000 });
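
The hunk above logs each batch before sending it, so the last printed label identifies the batch that was in flight when Meilisearch went down. The sketch below illustrates that idea in isolation. It is not the script's code: `Doc`, `batchLabel`, and `firstFailingBatch` are hypothetical names, the ` .. ` separator is an illustrative choice, and a synchronous `send` callback that throws stands in for a real crash, which in production would show up as a dead process and not as an exception.

```typescript
// Hypothetical illustration of "log before send" batch bisection.
interface Doc {
  id: string;
}

// Label a batch by its first and last document ID (a single ID for one doc).
function batchLabel(batch: Doc[]): string {
  const firstId = batch[0]?.id ?? "";
  const lastId = batch[batch.length - 1]?.id ?? "";
  return batch.length === 1 ? firstId : `${firstId} .. ${lastId}`;
}

// Send docs in fixed-size batches and return the label of the first batch
// whose send fails, or null if every batch succeeds. The label is computed
// before sending, mirroring a log line written ahead of the risky call.
function firstFailingBatch(
  docs: Doc[],
  batchSize: number,
  send: (batch: Doc[]) => void,
): string | null {
  for (let i = 0; i < docs.length; i += batchSize) {
    const batch = docs.slice(i, i + batchSize);
    const label = batchLabel(batch);
    try {
      send(batch);
    } catch {
      return label;
    }
  }
  return null;
}
```

With a batch size of 1 the returned label is the single offending document ID, which is why the notes recommend pairing `--verbose-batches` with `--batch-size 1`.
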
@@ -533,13 +549,36 @@ async function main(): Promise<void> {
   let prune = false;
   let sourceFilter: "usc" | "ecfr" | "fr" | null = null;
   let setCheckpoint = false;
+  let verboseBatches = false;
+
+  // Resolve batch size from env first; a --batch-size flag overrides.
+  let batchSize = DEFAULT_BATCH_SIZE;
+  const envBatch = process.env.MEILI_BATCH_SIZE;
+  if (envBatch) {
+    const parsed = Number.parseInt(envBatch, 10);
+    if (!Number.isFinite(parsed) || parsed < 1) {
+      console.error(`Invalid MEILI_BATCH_SIZE: ${envBatch}. Must be a positive integer.`);
+      process.exit(1);
+    }
+    batchSize = parsed;
+  }
 
   for (let i = 0; i < args.length; i++) {
     const arg = args[i]!;
     if (arg === "--set-checkpoint") {
       setCheckpoint = true;
     } else if (arg === "--prune") {
       prune = true;
+    } else if (arg === "--verbose-batches") {
+      verboseBatches = true;
+    } else if (arg === "--batch-size" && args[i + 1]) {
+      const parsed = Number.parseInt(args[i + 1]!, 10);
+      if (!Number.isFinite(parsed) || parsed < 1) {
+        console.error(`Invalid --batch-size: ${args[i + 1]}. Must be a positive integer.`);
+        process.exit(1);
+      }
+      batchSize = parsed;
+      i++;
     } else if (arg === "--source" && args[i + 1]) {
       const val = args[i + 1]!;
       if (val !== "usc" && val !== "ecfr" && val !== "fr") {
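
The precedence in the hunk above (default, then `MEILI_BATCH_SIZE`, then `--batch-size`) can be expressed as one pure function, which makes it easy to test without touching `process.env` or `process.exit`. This is a sketch under assumptions, not the script's code: `resolveBatchSize` and `parsePositiveInt` are hypothetical names, and throwing an error replaces the script's `console.error` plus `process.exit(1)`.

```typescript
// Hypothetical refactoring sketch of the batch-size resolution shown above.
const DEFAULT_BATCH_SIZE = 500;

// Parse a positive integer, mirroring the diff's validation rule.
function parsePositiveInt(raw: string, label: string): number {
  const parsed = Number.parseInt(raw, 10);
  if (!Number.isFinite(parsed) || parsed < 1) {
    throw new Error(`Invalid ${label}: ${raw}. Must be a positive integer.`);
  }
  return parsed;
}

// Resolve the batch size: the env value replaces the default, and a
// --batch-size flag replaces both.
function resolveBatchSize(
  args: string[],
  env: Record<string, string | undefined>,
): number {
  let batchSize = DEFAULT_BATCH_SIZE;
  const envBatch = env["MEILI_BATCH_SIZE"];
  if (envBatch) {
    batchSize = parsePositiveInt(envBatch, "MEILI_BATCH_SIZE");
  }
  for (let i = 0; i < args.length; i++) {
    const next = args[i + 1];
    if (args[i] === "--batch-size" && next) {
      batchSize = parsePositiveInt(next, "--batch-size");
      i++; // skip the value we just consumed
    }
  }
  return batchSize;
}
```

One property worth knowing: `Number.parseInt` accepts a numeric prefix, so an input such as `12abc` resolves to 12 under this validation, both here and in the hunk above.
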
@@ -570,6 +609,7 @@ async function main(): Promise<void> {
   console.log(`Meilisearch URL: ${MEILI_URL}`);
   console.log(`Index name: ${INDEX_NAME}`);
   console.log(`Mode: incremental upsert${prune ? " + prune" : ""}${sourceFilter ? ` (${sourceFilter} only)` : ""}`);
+  console.log(`Batch size: ${batchSize}${verboseBatches ? " (verbose)" : ""}`);
   console.log(` Preserves the existing index. Adds new documents and updates existing ones.`);
 
   const client = new Meilisearch({
@@ -617,7 +657,7 @@ async function main(): Promise<void> {
     }
   }
 
-  const indexer = new BatchIndexer(client, INDEX_NAME, BATCH_SIZE);
+  const indexer = new BatchIndexer(client, INDEX_NAME, batchSize, verboseBatches);
 
   // Track all expected IDs for pruning
   const expectedIds = new Set<string>();

fixtures/expected/duplicate-first.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ positive_law: true
 currency: "unknown"
 last_updated: "2025-12-03"
 format_version: "1.1.0"
-generator: "lexbuild@1.24.0"
+generator: "lexbuild@1.24.1"
 source_credit: "(Pub. L. 108–458, Nov. 17, 2006.)"
 ---

fixtures/expected/duplicate-other.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ positive_law: true
 currency: "unknown"
 last_updated: "2025-12-03"
 format_version: "1.1.0"
-generator: "lexbuild@1.24.0"
+generator: "lexbuild@1.24.1"
 source_credit: "(Added Pub. L. 107–296.)"
 ---

fixtures/expected/duplicate-second.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ positive_law: true
 currency: "unknown"
 last_updated: "2025-12-03"
 format_version: "1.1.0"
-generator: "lexbuild@1.24.0"
+generator: "lexbuild@1.24.1"
 source_credit: "(Pub. L. 110–181, Jan. 28, 2008.)"
 ---

fixtures/expected/layout.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ positive_law: true
 currency: "unknown"
 last_updated: "2025-12-03"
 format_version: "1.1.0"
-generator: "lexbuild@1.24.0"
+generator: "lexbuild@1.24.1"
 source_credit: "(Test source.)"
 ---

fixtures/expected/notes-all.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ positive_law: true
 currency: "unknown"
 last_updated: "2025-12-03"
 format_version: "1.1.0"
-generator: "lexbuild@1.24.0"
+generator: "lexbuild@1.24.1"
 source_credit: "(Added Pub. L. 104–199, § 3(a), Sept. 21, 1996.)"
 ---

fixtures/expected/notes-amendments-only.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ positive_law: true
 currency: "unknown"
 last_updated: "2025-12-03"
 format_version: "1.1.0"
-generator: "lexbuild@1.24.0"
+generator: "lexbuild@1.24.1"
 source_credit: "(Added Pub. L. 104–199, § 3(a), Sept. 21, 1996.)"
 ---

fixtures/expected/notes-none.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ positive_law: true
 currency: "unknown"
 last_updated: "2025-12-03"
 format_version: "1.1.0"
-generator: "lexbuild@1.24.0"
+generator: "lexbuild@1.24.1"
 source_credit: "(Added Pub. L. 104–199, § 3(a), Sept. 21, 1996.)"
 ---
