
Commit cbf5ea1

chore: update README.md
1 parent 0db1865 commit cbf5ea1

1 file changed

Lines changed: 97 additions & 23 deletions

README.md

@@ -2,7 +2,7 @@

 Convert U.S. legislative XML into structured Markdown for AI and RAG systems.

-> **Status: In Development** -- law2md is under active development. Phase 1 (foundation) is complete. The tool can convert U.S. Code XML to section-level Markdown with frontmatter metadata. See [Project Status](#project-status) for details on what works today and what is planned.
+> **Status: In Development** -- Phases 1 and 2 are complete. The tool converts U.S. Code XML to section-level or chapter-level Markdown with frontmatter, tables, filterable notes, cross-reference link resolution, and metadata indexes. See [Project Status](#project-status) for details.

 ---

@@ -20,9 +20,14 @@ Legal texts are among the most frequently cited sources in AI systems, yet there

 - **Streaming SAX parser** -- processes XML files of any size (including 100MB+ titles like Title 26) with bounded memory usage
 - **Section-level output** -- each section of the U.S. Code becomes its own Markdown file, sized appropriately for RAG chunk windows
-- **YAML frontmatter** -- every file includes structured metadata (identifier, title, chapter, section, positive law status, source credit, currency)
+- **Chapter-level output** -- optional mode that inlines all sections into per-chapter files
+- **YAML frontmatter** -- every file includes structured metadata (identifier, title, chapter, part, section, positive law status, source credit, currency)
 - **Structural fidelity** -- preserves the full USLM hierarchy from title down through subsubitem, using bold inline numbering that mirrors legal citation convention
-- **Source credits and notes** -- editorial notes, statutory notes, amendment history, and source credits are included with the statutory text
+- **Tables** -- XHTML tables and USLM layout tables converted to Markdown pipe tables
+- **Cross-reference links** -- resolve within the output corpus as relative links, or fall back to OLRC website URLs
+- **Filterable notes** -- editorial notes, statutory notes, and amendment history can be selectively included or excluded
+- **Metadata indexes** -- `_meta.json` sidecar files at title and chapter levels with section listings and token estimates
+- **Source credits** -- enactment source annotations included with each section

 ## Installation

@@ -53,10 +58,22 @@ Place the extracted XML files in a directory of your choice.
 ### Convert to Markdown

 ```bash
-# Convert a single title
+# Convert a single title (section-level output)
 node packages/cli/dist/index.js convert path/to/usc01.xml -o ./output

-# With verbose output
+# Chapter-level output (sections inlined into chapter files)
+node packages/cli/dist/index.js convert path/to/usc01.xml -o ./output -g chapter
+
+# With cross-reference links resolved to OLRC URLs
+node packages/cli/dist/index.js convert path/to/usc05.xml -o ./output --link-style canonical
+
+# Include only amendment notes
+node packages/cli/dist/index.js convert path/to/usc01.xml -o ./output --include-amendments
+
+# Exclude all notes
+node packages/cli/dist/index.js convert path/to/usc01.xml -o ./output --no-include-notes
+
+# Verbose output showing all written files
 node packages/cli/dist/index.js convert path/to/usc01.xml -o ./output -v
 ```

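The note flags used in these commands select which note categories reach the output. According to the option notes in this commit's README text, the selective flags combine additively, and passing any one of them switches the tool from "include all notes" to "include only selected categories." The following is a minimal sketch of that resolution logic, not law2md's actual implementation: `NoteFlags`, `NoteCategory`, and `resolveNoteCategories` are hypothetical names, and letting `--no-include-notes` take precedence over the selective flags is an assumption.

```typescript
// Hypothetical option shape; field names are illustrative, not law2md's internal API.
interface NoteFlags {
  includeNotes: boolean;          // false when --no-include-notes is passed
  includeEditorialNotes: boolean; // --include-editorial-notes
  includeStatutoryNotes: boolean; // --include-statutory-notes
  includeAmendments: boolean;     // --include-amendments
}

type NoteCategory = "editorial" | "statutory" | "amendments";

const ALL_CATEGORIES: NoteCategory[] = ["editorial", "statutory", "amendments"];

function resolveNoteCategories(flags: NoteFlags): NoteCategory[] {
  // Assumption: --no-include-notes wins over any selective flag.
  if (!flags.includeNotes) {
    return [];
  }

  // Selective flags combine additively.
  const selected: NoteCategory[] = [];
  if (flags.includeEditorialNotes) selected.push("editorial");
  if (flags.includeStatutoryNotes) selected.push("statutory");
  if (flags.includeAmendments) selected.push("amendments");

  // With no selective flag, fall back to the default "include all notes".
  return selected.length > 0 ? selected : ALL_CATEGORIES.slice();
}

// Equivalent of passing: --include-editorial-notes --include-amendments
const picked = resolveNoteCategories({
  includeNotes: true,
  includeEditorialNotes: true,
  includeStatutoryNotes: false,
  includeAmendments: true,
});
// picked is ["editorial", "amendments"]
```

Returning a list of categories rather than three booleans keeps the renderer's check to a single membership test per note.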
@@ -70,31 +87,54 @@ Arguments:

 Options:
   -o, --output <dir>            Output directory (default: "./output")
+  -g, --granularity <level>     Output granularity: "section" or "chapter"
+                                (default: "section")
   --link-style <style>          Cross-reference style: "plaintext", "canonical",
                                 or "relative" (default: "plaintext")
   --no-include-source-credits   Exclude source credit annotations
+  --no-include-notes            Exclude all notes
+  --include-editorial-notes     Include editorial notes only
+  --include-statutory-notes     Include statutory notes only
+  --include-amendments          Include amendment history notes only
   -v, --verbose                 Enable verbose logging
   -h, --help                    Display help
 ```

+When multiple `--include-*-notes` flags are specified, they combine additively. Specifying any selective flag automatically switches from the default "include all notes" behavior to "include only selected categories."
+
 ## Output Format

-### Directory Structure
+### Section-Level Directory Structure (default)

 ```
 output/
   usc/
     title-01/
+      _meta.json
       chapter-01/
+        _meta.json
         section-1.md
         section-2.md
         ...
       chapter-02/
+        _meta.json
         section-101.md
         ...
 ```

-Title directories are zero-padded to two digits (`title-01` through `title-54`). Chapter directories follow the same convention. Section files use the section number as-is, which may be alphanumeric (e.g., `section-106a.md`, `section-7801.md`).
+### Chapter-Level Directory Structure (`--granularity chapter`)
+
+```
+output/
+  usc/
+    title-01/
+      _meta.json
+      chapter-01.md
+      chapter-02.md
+      chapter-03.md
+```
+
+Title directories are zero-padded to two digits (`title-01` through `title-54`). Chapter directories and files follow the same convention. Section files use the section number as-is, which may be alphanumeric (e.g., `section-106a.md`, `section-7801.md`).

 ### Markdown Structure

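The naming rules stated in the hunk above (title and chapter numbers zero-padded to two digits, section numbers used as-is) can be sketched as small path builders. This is an illustration under stated assumptions, not code from law2md: the function names `pad2`, `sectionPath`, and `chapterPath` are hypothetical, chapter numbers are assumed to be plain integers, and the `usc/` prefix is taken from the directory trees shown in the README.

```typescript
// Zero-pad a number to at least two digits (1 -> "01", 26 -> "26").
function pad2(n: number): string {
  return n < 10 ? "0" + String(n) : String(n);
}

// Section-level layout: usc/title-NN/chapter-NN/section-<number>.md
// The section number is kept as-is because it may be alphanumeric ("106a").
function sectionPath(title: number, chapter: number, section: string): string {
  return `usc/title-${pad2(title)}/chapter-${pad2(chapter)}/section-${section}.md`;
}

// Chapter-level layout (--granularity chapter): usc/title-NN/chapter-NN.md
function chapterPath(title: number, chapter: number): string {
  return `usc/title-${pad2(title)}/chapter-${pad2(chapter)}.md`;
}

const example = sectionPath(1, 1, "106a");
// example is "usc/title-01/chapter-01/section-106a.md"
```

Keeping the section number as a string avoids losing suffixes such as the `a` in `106a`, which a numeric type would drop.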
@@ -114,8 +154,8 @@ positive_law: true
 currency: "119-73"
 last_updated: "2025-12-03"
 format_version: "1.0.0"
-generator: "law2md@0.1.0"
-source_credit: "(Added Pub. L. 104199, § 3(a), Sept. 21, 1996, ...)"
+generator: "law2md@0.2.0"
+source_credit: "(Added Pub. L. 104-199, § 3(a), Sept. 21, 1996, ...)"
 ---
 ```

@@ -130,50 +170,84 @@ Columbia, the Commonwealth of Puerto Rico, or any other territory...

 ---

-**Source Credit**: (Added Pub. L. 104199, § 3(a), Sept. 21, 1996, ...)
+**Source Credit**: (Added Pub. L. 104-199, § 3(a), Sept. 21, 1996, ...)

 ## Editorial Notes

 ### Amendments

-2022 amended section generally. Prior to amendment, text read as follows: ...
+2022-- amended section generally. Prior to amendment, text read as follows: ...
 ```

 Subsections and below use bold inline numbering (`**(a)**`, `**(1)**`, `**(A)**`, `**(i)**`) rather than Markdown headings. This preserves a flat document structure that works well with embedding models and chunking strategies. The numbering scheme follows standard legal citation convention.

+### Metadata Indexes
+
+Each title and chapter directory includes a `_meta.json` file with structured metadata for programmatic access:
+
+```json
+{
+  "format_version": "1.0.0",
+  "identifier": "/us/usc/t5",
+  "title_number": 5,
+  "title_name": "Government Organization and Employees",
+  "stats": {
+    "chapter_count": 63,
+    "section_count": 1162,
+    "total_tokens_estimate": 2207855
+  },
+  "chapters": [
+    {
+      "identifier": "/us/usc/t5/ptI/ch1",
+      "number": 1,
+      "name": "Organization",
+      "directory": "chapter-01",
+      "sections": [
+        {
+          "identifier": "/us/usc/t5/s101",
+          "number": "101",
+          "name": "Executive departments",
+          "file": "section-101.md",
+          "token_estimate": 4200,
+          "has_notes": true,
+          "status": "current"
+        }
+      ]
+    }
+  ]
+}
+```
+
 For the complete output format specification, see [docs/OUTPUT_FORMAT.md](docs/OUTPUT_FORMAT.md).

 ## Project Status

-### Phase 1: Foundation -- Complete
+### Phase 1: Foundation -- Complete (v0.1.0)

-The core conversion pipeline is functional. `law2md` can convert any U.S. Code title XML file to section-level Markdown with frontmatter, source credits, editorial notes, and statutory notes.
+Core conversion pipeline: SAX streaming parser, AST builder with section-emit pattern, Markdown renderer, YAML frontmatter generator, USC converter, and CLI `convert` command.

-Verified against Title 1 (General Provisions): 39 sections across 3 chapters converted in under 1 second.
+### Phase 2: Content Fidelity -- Complete (v0.2.0)

-### Phase 2: Content Fidelity -- Planned
+Content quality improvements: whitespace normalization, cross-reference link resolver (relative/canonical/plaintext), XHTML and USLM layout table conversion, notes filtering with CLI flags, `_meta.json` sidecar index generation, and chapter-level granularity mode.

-- Cross-reference link resolution (relative links within the output corpus, OLRC fallback URLs)
-- XHTML table and USLM layout table conversion
-- Table of contents generation
-- Notes filtering (selective inclusion of editorial, statutory, amendment notes)
-- Chapter-level granularity mode
-- `_meta.json` sidecar index generation
+Verified against Title 1 (39 sections) and Title 5 (1162 sections across 63 chapters, converted in under 1 second).

 ### Phase 3: Scale and Download -- Planned

 - Built-in OLRC download command (`law2md download`)
 - Memory profiling and optimization for large titles (Title 26, Title 42)
 - Concurrent file writes
 - Dry-run mode
-- Appendix title handling
+- Appendix title handling (Titles 5, 11, 18, 28)
+- Edge case handling for duplicate section numbers

 ### Phase 4: Polish and Publish -- Planned

 - npm package publication
 - GitHub Actions CI/CD
 - Snapshot test suite for output stability
 - Token estimation via `tiktoken` in metadata indexes
+- Title and chapter README.md generation

 ## Repository Structure

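The `_meta.json` index added in the hunk above is described as being for programmatic access. As one hedged illustration of such use, the sketch below walks a title-level index and collects section file paths until a token budget is reached, which is the kind of selection a RAG pipeline might perform. The interfaces mirror only the fields visible in the README example; `selectSectionFiles`, the `sample` object (a trimmed copy of that example), and the budget values are hypothetical, and in practice the object would come from `JSON.parse` on the file contents.

```typescript
// Field names follow the README's _meta.json example; the full schema may differ.
interface SectionMeta {
  identifier: string;
  number: string;
  name: string;
  file: string;
  token_estimate: number;
  has_notes: boolean;
  status: string;
}

interface ChapterMeta {
  identifier: string;
  number: number;
  name: string;
  directory: string;
  sections: SectionMeta[];
}

interface TitleMeta {
  identifier: string;
  title_number: number;
  title_name: string;
  chapters: ChapterMeta[];
}

// Collect "<chapter directory>/<section file>" paths in index order,
// stopping before the running token estimate would exceed the budget.
function selectSectionFiles(meta: TitleMeta, tokenBudget: number): string[] {
  const paths: string[] = [];
  let used = 0;
  for (const chapter of meta.chapters) {
    for (const section of chapter.sections) {
      if (used + section.token_estimate > tokenBudget) {
        return paths;
      }
      used += section.token_estimate;
      paths.push(`${chapter.directory}/${section.file}`);
    }
  }
  return paths;
}

// Trimmed copy of the README example, inlined so the sketch is self-contained.
const sample: TitleMeta = {
  identifier: "/us/usc/t5",
  title_number: 5,
  title_name: "Government Organization and Employees",
  chapters: [
    {
      identifier: "/us/usc/t5/ptI/ch1",
      number: 1,
      name: "Organization",
      directory: "chapter-01",
      sections: [
        {
          identifier: "/us/usc/t5/s101",
          number: "101",
          name: "Executive departments",
          file: "section-101.md",
          token_estimate: 4200,
          has_notes: true,
          status: "current",
        },
      ],
    },
  ],
};

const withinBudget = selectSectionFiles(sample, 5000);
// withinBudget is ["chapter-01/section-101.md"]
```

The returned paths are relative to the title directory, matching the `directory` and `file` fields in the index, so they can be joined onto `output/usc/title-05/` without further lookups.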
@@ -197,7 +271,7 @@ The project is structured as a monorepo managed with [pnpm](https://pnpm.io/) wo
 ```bash
 pnpm install          # Install dependencies
 pnpm turbo build      # Build all packages
-pnpm turbo test       # Run all tests
+pnpm turbo test       # Run all tests (103 tests)
 pnpm turbo lint       # Lint all packages
 pnpm turbo typecheck  # Type-check all packages
 ```
