Commit ffffd77

docs: update READMEs and docs for multi-granularity converter
1 parent: 6298069; commit: ffffd77

17 files changed

Lines changed: 321 additions & 14 deletions

Lines changed: 10 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,10 @@
1+
---
2+
"@lexbuild/ecfr": patch
3+
"@lexbuild/core": patch
4+
"@lexbuild/usc": patch
5+
"@lexbuild/fr": patch
6+
"@lexbuild/cli": patch
7+
"@lexbuild/mcp": patch
8+
---
9+
10+
Refresh documentation for the single-pass multi-granularity converter. README.md files across the monorepo (root, `@lexbuild/cli`, `@lexbuild/usc`, `@lexbuild/ecfr`, `@lexbuild/core`, `apps/astro`) now document the `--granularities <list>` and `--output-<granularity>` flags on `convert-usc`/`convert-ecfr`, the new `granularities` option on `convertTitle`/`convertEcfrTitle`, the builder's `ReadonlySet<LevelType>` emit mode, and the updated `update-usc.sh`/`update-ecfr.sh` single-invocation pattern. Public docs on the Astro site and internal docs under `.claude/internal/docs/` were updated in parallel. No code changes in this release — the bump exists to ship the refreshed package README copy to npm.

CLAUDE.md

Lines changed: 1 addition & 0 deletions
@@ -303,6 +303,7 @@ Note: identifiers use `/us/cfr/` (content type) not `/us/ecfr/` (data source). B
303303
- **VPS PM2 logs live at `/home/ubuntu/pm2/logs/lexbuild/`**, not `~/.pm2/logs/`. The latter is legacy — only `pm2-logrotate-out.log` still writes there. Check the new path when debugging PM2-managed services.
304304
- **VPS has 6 GiB swap** at `/swapfile` (persisted in `/etc/fstab`). Added as defense against Meilisearch OOM during bulk upserts on a 7.6 GiB RAM Lightsail box. Don't remove.
305305
- **Stuck Meilisearch tasks crash-loop across restarts**: document-addition tasks that OOM Meilisearch are persisted in LMDB and re-attempted after every PM2 restart (observed ~60s crash cycle, 160+ restarts in 2.5 hours). Cancel via `curl -XPOST -H "Authorization: Bearer $MEILI_MASTER_KEY" "http://127.0.0.1:7700/tasks/cancel?uids=<list>"` — the cancellation typically executes during a healthy window even if the stuck task itself can't complete.
306+
- **`_meta.json` / `README.md` carry wall-clock timestamps**: Converter outputs include a `generated_at` field. Byte-parity tests comparing outputs across runs must skip these files (assert existence, not content).
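
The skip rule in the bullet above could be captured in a small test helper along these lines. This is only a sketch: `parityCheckFor` and `TIMESTAMPED_FILES` are hypothetical names that do not come from the repository, the example paths are made up, and the sketch assumes that matching on the file name alone is enough to identify the timestamped outputs.

```typescript
// Hypothetical helper for a byte-parity test; names are illustrative only.
// Files listed here carry a wall-clock `generated_at` value, so their bytes
// differ between runs even when the conversion itself is deterministic.
const TIMESTAMPED_FILES: ReadonlySet<string> = new Set(["_meta.json", "README.md"]);

type ParityCheck = "bytes" | "existence";

/** Decide how a byte-parity test should treat one converter output file. */
function parityCheckFor(relativePath: string): ParityCheck {
  // Match on the final path segment so nested outputs are covered too.
  const fileName = relativePath.split(/[\\/]/).pop() ?? relativePath;
  return TIMESTAMPED_FILES.has(fileName) ? "existence" : "bytes";
}
```

A test would then assert only that `"existence"` files are present in both runs, and compare contents byte for byte for everything else.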
306307

307308
## When Adding New Source Types
308309

README.md

Lines changed: 26 additions & 0 deletions
@@ -143,6 +143,8 @@ Update scripts handle change detection, download, convert, and deploy in one com
143143
./scripts/update-usc.sh --skip-deploy
144144
```
145145

146+
`update-usc.sh` and `update-ecfr.sh` convert every granularity in one parse using the `--granularities` flag (see below), so the convert step parses each title's XML once instead of once per output granularity.
147+
146148
---
147149

148150
## Commands
@@ -173,6 +175,13 @@ lexbuild convert-usc --all # All downloaded ti
173175
lexbuild convert-usc --titles 1 -g chapter # Chapter-level output
174176
lexbuild convert-usc --titles 26 --dry-run # Preview without writing
175177
lexbuild convert-usc ./downloads/usc/xml/usc01.xml # Direct file path
178+
179+
# Emit section + chapter + title in a single parse
180+
lexbuild convert-usc --all \
181+
--granularities section,title,chapter \
182+
--output ./output \
183+
--output-title ./output-title \
184+
--output-chapter ./output-chapter
176185
```
177186

178187
| Option | Default | Description |
@@ -182,6 +191,10 @@ lexbuild convert-usc ./downloads/usc/xml/usc01.xml # Direct file path
182191
| `-i, --input-dir <dir>` | `./downloads/usc/xml` | Input XML directory |
183192
| `-o, --output <dir>` | `./output` | Output directory |
184193
| `-g, --granularity` | `section` | `section`, `chapter`, or `title` |
194+
| `--granularities <list>` || Comma-separated granularities. Mutually exclusive with `-g`. |
195+
| `--output-section <dir>` || Output dir for section when using `--granularities` (defaults to `--output`) |
196+
| `--output-chapter <dir>` || Output dir for chapter when using `--granularities` |
197+
| `--output-title <dir>` || Output dir for title when using `--granularities` |
185198
| `--link-style` | `plaintext` | `plaintext`, `canonical`, or `relative` |
186199
| `--no-include-source-credits` || Exclude source credits |
187200
| `--no-include-notes` || Exclude all notes |
@@ -235,6 +248,14 @@ lexbuild convert-ecfr --all # All downloaded ti
235248
lexbuild convert-ecfr --titles 17 -g part # Part-level output
236249
lexbuild convert-ecfr --all --dry-run # Preview without writing
237250
lexbuild convert-ecfr ./downloads/ecfr/xml/ECFR-title17.xml # Direct file path
251+
252+
# Emit all four granularities in a single parse
253+
lexbuild convert-ecfr --all \
254+
--granularities section,title,chapter,part \
255+
--output ./output \
256+
--output-title ./output-title \
257+
--output-chapter ./output-chapter \
258+
--output-part ./output-part
238259
```
239260

240261
| Option | Default | Description |
@@ -244,6 +265,11 @@ lexbuild convert-ecfr ./downloads/ecfr/xml/ECFR-title17.xml # Direct file path
244265
| `-i, --input-dir <dir>` | `./downloads/ecfr/xml` | Input XML directory |
245266
| `-o, --output <dir>` | `./output` | Output directory |
246267
| `-g, --granularity` | `section` | `section`, `part`, `chapter`, or `title` |
268+
| `--granularities <list>` || Comma-separated granularities. Mutually exclusive with `-g`. |
269+
| `--output-section <dir>` || Output dir for section when using `--granularities` (defaults to `--output`) |
270+
| `--output-part <dir>` || Output dir for part when using `--granularities` |
271+
| `--output-chapter <dir>` || Output dir for chapter when using `--granularities` |
272+
| `--output-title <dir>` || Output dir for title when using `--granularities` |
247273
| `--link-style` | `plaintext` | `plaintext`, `canonical`, or `relative` |
248274
| `--no-include-source-credits` || Exclude source credits |
249275
| `--no-include-notes` || Exclude all notes |

apps/astro/README.md

Lines changed: 24 additions & 4 deletions
@@ -22,16 +22,36 @@ The web application for [LexBuild](https://github.com/chris-c-thomas/LexBuild)
2222

2323
- **Node.js** >= 22
2424
- **pnpm** >= 10
25-
- **Converted content** — run the CLI to generate Markdown files before starting the app:
25+
- **Converted content** — run the CLI to generate Markdown files before starting the app. The app browses every granularity (section, chapter, title, and part for eCFR), so emit them all in a single parse using `--granularities`:
2626

2727
```bash
2828
# From the monorepo root
2929
pnpm turbo build
30-
node packages/cli/dist/index.js download-usc --all && node packages/cli/dist/index.js convert-usc --all
31-
node packages/cli/dist/index.js download-ecfr --all && node packages/cli/dist/index.js convert-ecfr --all
32-
node packages/cli/dist/index.js download-fr --recent 30 && node packages/cli/dist/index.js convert-fr --all
30+
31+
# USC: download once, then emit section + chapter + title from one parse
32+
node packages/cli/dist/index.js download-usc --all
33+
node packages/cli/dist/index.js convert-usc --all \
34+
--granularities section,title,chapter \
35+
--output ./output \
36+
--output-title ./output-title \
37+
--output-chapter ./output-chapter
38+
39+
# eCFR: download once, then emit all four granularities from one parse
40+
node packages/cli/dist/index.js download-ecfr --all
41+
node packages/cli/dist/index.js convert-ecfr --all \
42+
--granularities section,title,chapter,part \
43+
--output ./output \
44+
--output-title ./output-title \
45+
--output-chapter ./output-chapter \
46+
--output-part ./output-part
47+
48+
# Federal Register (no granularities — documents are atomic)
49+
node packages/cli/dist/index.js download-fr --recent 30
50+
node packages/cli/dist/index.js convert-fr --all
3351
```
3452

53+
For routine updates, use `./scripts/update.sh` — the wrapper scripts already use `--granularities` internally.
54+
3555
## Local Development
3656

3757
### 1. Link Content

apps/astro/src/content/docs/architecture/conversion-pipeline.md

Lines changed: 18 additions & 0 deletions
@@ -90,6 +90,24 @@ const builder = new ASTBuilder({
9090

9191
Levels above the emit level (for example, `title` and `chapter` when emitting at `section`) are tracked as lightweight `AncestorInfo` objects containing just the level type, identifier, number, and heading. Their child subtrees are never accumulated in memory.
9292

93+
### Multi-Level Emit
94+
95+
`emitAt` also accepts a `ReadonlySet<LevelType>`. Deeper levels fire first (sections before their containing title), and emitted nodes remain attached to their parents so that a higher-level emission sees the full subtree. A node is attached to its parent only when some enclosing stack frame is itself an emit target. This live stack check keeps the logic correct for USLM's permissive level nesting (for example, an appendix inside a part), where ordering by hierarchy index would be misleading.
96+
97+
```typescript
98+
const byLevel = new Map<LevelType, CollectedSection[]>();
99+
100+
const builder = new ASTBuilder({
101+
emitAt: new Set(["section", "chapter", "title"]),
102+
onEmit: (node, context) => {
103+
const bucket = byLevel.get(node.levelType);
104+
if (bucket) bucket.push({ node, context });
105+
},
106+
});
107+
```
108+
109+
The converter uses this to collect per-level buckets in a single pass and write every requested granularity from the matching bucket. This is how `--granularities` on the CLI produces multiple output trees without re-parsing the XML.
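
To make the "enclosing emit target" rule concrete, here is a minimal sketch of the stack check on its own. It is a simplified stand-in and not the real `ASTBuilder`: `MiniBuilder`, `Frame`, `LevelNode`, and the `open`/`close` methods are hypothetical names, and the sketch assumes levels arrive as plain open/close events rather than SAX callbacks.

```typescript
type LevelType = "title" | "chapter" | "part" | "section" | "appendix";

interface LevelNode {
  levelType: LevelType;
  identifier: string;
  children: LevelNode[];
}

interface Frame {
  node: LevelNode;
  isEmitTarget: boolean;
}

// Hypothetical, simplified builder used only to illustrate the rule.
class MiniBuilder {
  private stack: Frame[] = [];
  private emitAt: ReadonlySet<LevelType>;
  private onEmit: (node: LevelNode) => void;

  constructor(emitAt: ReadonlySet<LevelType>, onEmit: (node: LevelNode) => void) {
    this.emitAt = emitAt;
    this.onEmit = onEmit;
  }

  open(levelType: LevelType, identifier: string): void {
    this.stack.push({
      node: { levelType, identifier, children: [] },
      isEmitTarget: this.emitAt.has(levelType),
    });
  }

  close(): void {
    const frame = this.stack.pop();
    if (!frame) throw new Error("close() called without a matching open()");

    // Live stack check: keep the subtree only if some ancestor will be emitted.
    const parent = this.stack[this.stack.length - 1];
    if (parent && this.stack.some((f) => f.isEmitTarget)) {
      parent.node.children.push(frame.node);
    }

    // Deeper levels close first, so they are emitted before their ancestors.
    if (frame.isEmitTarget) this.onEmit(frame.node);
  }
}

// Emit at section and title, but not chapter.
const emitted: LevelNode[] = [];
const mini = new MiniBuilder(new Set<LevelType>(["section", "title"]), (node) => {
  emitted.push(node);
});
mini.open("title", "t1");
mini.open("chapter", "c1");
mini.open("section", "s1");
mini.close(); // section: attached to c1, then emitted
mini.close(); // chapter: attached to t1, not emitted
mini.close(); // title: emitted with the full subtree
```

In this sketch the chapter is not an emit target, yet it still keeps its section because the enclosing title is one. That is the case a check based only on the immediate parent would miss.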
110+
93111
## The Collect-Then-Write Pattern
94112

95113
Both USC and eCFR converters collect all emitted nodes synchronously during parsing, then write files after parsing completes. The collect phase pushes `{ node, context }` pairs into an array; the write phase iterates this array to render and write each file.

apps/astro/src/content/docs/cli/commands.md

Lines changed: 28 additions & 1 deletion
@@ -52,6 +52,10 @@ These flags are available on all convert commands:
5252
|---|---|---|
5353
| `-o, --output <dir>` | `./output` | Output directory for converted Markdown |
5454
| `-g, --granularity <level>` | `section` | Output granularity (varies by source) |
55+
| `--granularities <list>` | -- | Comma-separated granularities for single-pass, multi-granularity output. Mutually exclusive with `-g`. |
56+
| `--output-chapter <dir>` | -- | Output directory for chapter granularity (when using `--granularities`) |
57+
| `--output-title <dir>` | -- | Output directory for title granularity (when using `--granularities`) |
58+
| `--output-part <dir>` | -- | Output directory for part granularity, eCFR only (when using `--granularities`) |
5559
| `--link-style <style>` | `plaintext` | Cross-reference link style: `plaintext`, `relative`, or `canonical` |
5660
| `--dry-run` | off | Parse and report statistics without writing files |
5761
| `-v, --verbose` | off | Print detailed output including file paths |
@@ -63,6 +67,29 @@ These flags are available on all convert commands:
6367
> [!NOTE]
6468
> Setting any selective note flag (`--include-editorial-notes`, `--include-statutory-notes`, `--include-amendments`) automatically disables the broad `--include-notes` flag.
6569
70+
### Multi-Granularity Single-Pass Mode
71+
72+
Use `--granularities` to emit several granularity levels from a single parse of the source XML. Each listed granularity needs a matching output directory — section uses `--output` (or `--output-section`); chapter, title, and part (eCFR only) each take their own `--output-<granularity>` flag.
73+
74+
```bash
75+
# USC: three granularities in one parse
76+
lexbuild convert-usc --all \
77+
--granularities section,title,chapter \
78+
--output ./output \
79+
--output-title ./output-title \
80+
--output-chapter ./output-chapter
81+
82+
# eCFR: four granularities in one parse
83+
lexbuild convert-ecfr --all \
84+
--granularities section,title,chapter,part \
85+
--output ./output \
86+
--output-title ./output-title \
87+
--output-chapter ./output-chapter \
88+
--output-part ./output-part
89+
```
90+
91+
`--granularities` is mutually exclusive with `-g/--granularity`. The builder parses the XML once and emits at every requested level, so multi-granularity runs are roughly 40–50% faster than N separate single-granularity invocations.
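
The flag rules described in this section can be summarised as a small resolver. This is a sketch under stated assumptions, not the CLI's actual code: `resolveOutputDirs` and `ConvertFlags` are hypothetical names, the sketch assumes an explicitly passed `-g` can be told apart from its default, and it assumes a missing `--output-<granularity>` is treated as an error.

```typescript
type Granularity = "section" | "chapter" | "part" | "title";

// Hypothetical shape of the parsed flags; field names are illustrative.
interface ConvertFlags {
  granularity?: Granularity; // -g, --granularity (only set when passed explicitly)
  granularities?: string; // --granularities <list>
  output: string; // -o, --output
  outputSection?: string; // --output-section
  outputChapter?: string; // --output-chapter
  outputPart?: string; // --output-part (eCFR only)
  outputTitle?: string; // --output-title
}

/** Map each requested granularity to the directory it should be written to. */
function resolveOutputDirs(
  flags: ConvertFlags,
  allowed: readonly Granularity[],
): Map<Granularity, string> {
  // Single-granularity form: -g <level> -o <dir>.
  if (flags.granularities === undefined) {
    return new Map<Granularity, string>([[flags.granularity ?? "section", flags.output]]);
  }
  if (flags.granularity !== undefined) {
    throw new Error("--granularities is mutually exclusive with -g/--granularity");
  }

  // Section falls back to --output; the other levels need their own flag.
  const perLevel: Record<Granularity, string | undefined> = {
    section: flags.outputSection ?? flags.output,
    chapter: flags.outputChapter,
    part: flags.outputPart,
    title: flags.outputTitle,
  };

  const dirs = new Map<Granularity, string>();
  for (const raw of flags.granularities.split(",")) {
    const name = raw.trim();
    const level = allowed.find((g) => g === name);
    if (level === undefined) throw new Error(`Unknown granularity: ${name}`);
    const dir = perLevel[level];
    if (dir === undefined) throw new Error(`--output-${level} is required for ${level}`);
    dirs.set(level, dir);
  }
  return dirs;
}
```

Passing `["section", "chapter", "title"]` as `allowed` models `convert-usc`, while adding `"part"` models `convert-ecfr`.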
92+
6693
## Full Pipeline Examples
6794

6895
### U.S. Code
@@ -110,7 +137,7 @@ For routine updates, wrapper scripts handle the full pipeline (detect changes, d
110137
./scripts/update-usc.sh # USC, checks release point
111138
```
112139

113-
Each script auto-detects what changed and only processes updates. See [Incremental Updates](/docs/guides/bulk-download#incremental-updates) for details.
140+
Each script auto-detects what changed and only processes updates. `update-usc.sh` and `update-ecfr.sh` convert all granularities in one call using `--granularities` (see above), so the convert step parses the XML once per title rather than once per granularity. See [Incremental Updates](/docs/guides/bulk-download#incremental-updates) for details.
114141

115142
## Getting Help
116143

apps/astro/src/content/docs/cli/sources/ecfr.md

Lines changed: 15 additions & 0 deletions
@@ -99,6 +99,21 @@ lexbuild convert-ecfr --all -g part
9999
lexbuild convert-ecfr --titles 17 -g title
100100
```
101101

102+
### Multi-Granularity (Single Pass)
103+
104+
Emit all four granularities from one parse:
105+
106+
```bash
107+
lexbuild convert-ecfr --all \
108+
--granularities section,title,chapter,part \
109+
--output ./output \
110+
--output-title ./output-title \
111+
--output-chapter ./output-chapter \
112+
--output-part ./output-part
113+
```
114+
115+
`--granularities` is mutually exclusive with `-g`. Because the XML is parsed once and fanned out to each requested directory, this is roughly 40–50% faster than running `convert-ecfr` four times with different `-g` values.
116+
102117
### Notes Filtering
103118

104119
Notes filtering works the same as for the U.S. Code:

apps/astro/src/content/docs/cli/sources/us-code.md

Lines changed: 14 additions & 0 deletions
@@ -87,6 +87,20 @@ lexbuild convert-usc --all -g chapter
8787
lexbuild convert-usc --titles 26 -g title
8888
```
8989

90+
### Multi-Granularity (Single Pass)
91+
92+
Emit section, chapter, and title outputs from one parse:
93+
94+
```bash
95+
lexbuild convert-usc --all \
96+
--granularities section,title,chapter \
97+
--output ./output \
98+
--output-title ./output-title \
99+
--output-chapter ./output-chapter
100+
```
101+
102+
`--granularities` is mutually exclusive with `-g`. Because the XML is parsed once and fanned out to each requested directory, this is roughly 40–50% faster than running `convert-usc` three times with different `-g` values.
103+
90104
### Notes Filtering
91105

92106
All notes (editorial, statutory, and amendment history) are included by default. Disable them entirely or filter selectively:

apps/astro/src/content/docs/guides/bulk-download.md

Lines changed: 23 additions & 0 deletions
@@ -197,6 +197,29 @@ lexbuild convert-usc --all -g title -o ./output/title
197197

198198
Produces 54 files for USC, 50 for eCFR. Files can be large (1-100 MB). Title-level files include extra frontmatter fields: `chapter_count`, `section_count`, and `total_token_estimate`.
199199

200+
### All Granularities in One Pass
201+
202+
If you need more than one granularity, `--granularities` emits them from a single parse of the source XML (roughly 40–50% faster than running `convert-*` once per granularity):
203+
204+
```bash
205+
# USC: section + chapter + title from one parse
206+
lexbuild convert-usc --all \
207+
--granularities section,title,chapter \
208+
--output ./output \
209+
--output-title ./output-title \
210+
--output-chapter ./output-chapter
211+
212+
# eCFR: all four granularities from one parse
213+
lexbuild convert-ecfr --all \
214+
--granularities section,title,chapter,part \
215+
--output ./output \
216+
--output-title ./output-title \
217+
--output-chapter ./output-chapter \
218+
--output-part ./output-part
219+
```
220+
221+
`--granularities` is mutually exclusive with `-g/--granularity`.
222+
200223
> [!NOTE]
201224
> The `-o` flag appends source subdirectories automatically. `convert-usc -o /some/path` writes to `/some/path/usc/`, not `/some/path/` directly.
202225

apps/astro/src/content/docs/guides/rag-pipeline.md

Lines changed: 8 additions & 4 deletions
@@ -41,14 +41,18 @@ You can convert the same source at different granularity levels to support diffe
4141
| `chapter` / `part` | 10-100k tokens | Topic-level context, summarization with large context models |
4242
| `title` | 100k-2M tokens | Whole-title analysis, full corpus summarization |
4343

44-
Generate multiple granularities to different output directories:
44+
Generate multiple granularities to different output directories in a single parse:
4545

4646
```bash
47-
lexbuild convert-usc --all -g section -o ./output/section
48-
lexbuild convert-usc --all -g chapter -o ./output/chapter
49-
lexbuild convert-usc --all -g title -o ./output/title
47+
lexbuild convert-usc --all \
48+
--granularities section,title,chapter \
49+
--output ./output/section \
50+
--output-title ./output/title \
51+
--output-chapter ./output/chapter
5052
```
5153

54+
`--granularities` emits every requested level from one read of the XML. For a single granularity, use the older `-g <level> -o <dir>` form.
55+
5256
## Parsing Frontmatter
5357

5458
Use [gray-matter](https://github.com/jonschlinkert/gray-matter) (JavaScript) or any YAML frontmatter parser to separate metadata from body text.
