README.md: 97 additions & 23 deletions (+97 -23)
@@ -2,7 +2,7 @@

 Convert U.S. legislative XML into structured Markdown for AI and RAG systems.

-> **Status: In Development** -- law2md is under active development. Phase 1 (foundation) is complete. The tool can convert U.S. Code XML to section-level Markdown with frontmatter metadata. See [Project Status](#project-status) for details on what works today and what is planned.
+> **Status: In Development** -- Phases 1 and 2 are complete. The tool converts U.S. Code XML to section-level or chapter-level Markdown with frontmatter, tables, filterable notes, cross-reference link resolution, and metadata indexes. See [Project Status](#project-status) for details.

 ---

@@ -20,9 +20,14 @@ Legal texts are among the most frequently cited sources in AI systems, yet there

 - **Streaming SAX parser** -- processes XML files of any size (including 100MB+ titles like Title 26) with bounded memory usage
 - **Section-level output** -- each section of the U.S. Code becomes its own Markdown file, sized appropriately for RAG chunk windows
-- **YAML frontmatter** -- every file includes structured metadata (identifier, title, chapter, section, positive law status, source credit, currency)
+- **Chapter-level output** -- optional mode that inlines all sections into per-chapter files
+- **YAML frontmatter** -- every file includes structured metadata (identifier, title, chapter, part, section, positive law status, source credit, currency)
 - **Structural fidelity** -- preserves the full USLM hierarchy from title down through subsubitem, using bold inline numbering that mirrors legal citation convention
-- **Source credits and notes** -- editorial notes, statutory notes, amendment history, and source credits are included with the statutory text
+- **Tables** -- XHTML tables and USLM layout tables converted to Markdown pipe tables
+- **Cross-reference links** -- resolve within the output corpus as relative links, or fall back to OLRC website URLs
+- **Filterable notes** -- editorial notes, statutory notes, and amendment history can be selectively included or excluded
+- **Metadata indexes** -- `_meta.json` sidecar files at title and chapter levels with section listings and token estimates
+- **Source credits** -- enactment source annotations included with each section

 ## Installation

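Note on the filterable-notes behavior this diff documents: selective note flags combine additively, and giving any selective flag replaces the default "include all notes" mode. A minimal sketch of that rule follows. It is an illustration only, not law2md's implementation; `NoteFilter` and `resolve_note_filter` are hypothetical names, and the sketch assumes each selective flag maps to one boolean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NoteFilter:
    """Which note categories end up in the output (hypothetical type)."""
    editorial: bool
    statutory: bool
    amendments: bool


def resolve_note_filter(editorial=False, statutory=False, amendments=False):
    """Turn the three selective flags into an effective filter.

    Assumption: each argument mirrors one selective CLI flag being present.
    """
    if not (editorial or statutory or amendments):
        # No selective flag given: keep the default "include all notes".
        return NoteFilter(editorial=True, statutory=True, amendments=True)
    # At least one selective flag: include only the selected categories.
    # Several flags simply add up.
    return NoteFilter(editorial=editorial, statutory=statutory, amendments=amendments)


print(resolve_note_filter())
print(resolve_note_filter(editorial=True, amendments=True))
```

The second call keeps editorial and amendment notes and drops statutory notes, which is the additive behavior the README text describes.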
@@ -53,10 +58,22 @@ Place the extracted XML files in a directory of your choice.
+--include-editorial-notes Include editorial notes only
+--include-statutory-notes Include statutory notes only
+--include-amendments Include amendment history notes only
 -v, --verbose Enable verbose logging
 -h, --help Display help
 ```

+When multiple `--include-*-notes` flags are specified, they combine additively. Specifying any selective flag automatically switches from the default "include all notes" behavior to "include only selected categories."
+
 ## Output Format

-### Directory Structure
+### Section-Level Directory Structure (default)

 ```
 output/
   usc/
     title-01/
+      _meta.json
       chapter-01/
+        _meta.json
         section-1.md
         section-2.md
         ...
       chapter-02/
+        _meta.json
         section-101.md
         ...
 ```

-Title directories are zero-padded to two digits (`title-01` through `title-54`). Chapter directories follow the same convention. Section files use the section number as-is, which may be alphanumeric (e.g., `section-106a.md`, `section-7801.md`).
+Title directories are zero-padded to two digits (`title-01` through `title-54`). Chapter directories and files follow the same convention. Section files use the section number as-is, which may be alphanumeric (e.g., `section-106a.md`, `section-7801.md`).
-2022— amended section generally. Prior to amendment, text read as follows: ...
+2022-- amended section generally. Prior to amendment, text read as follows: ...
 ```

 Subsections and below use bold inline numbering (`**(a)**`, `**(1)**`, `**(A)**`, `**(i)**`) rather than Markdown headings. This preserves a flat document structure that works well with embedding models and chunking strategies. The numbering scheme follows standard legal citation convention.

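Note on the bold inline numbering described above: the idea of flattening nested provisions into bold-numbered paragraphs, with no headings below the section level, can be sketched as follows. This is an illustration under assumptions, not law2md's renderer: `render_flat` and the dict shape (`num`, `text`, `children`) are hypothetical, the provision text is placeholder content, and emitting one paragraph per provision is an assumption about the layout.

```python
def render_flat(provisions):
    """Yield one Markdown paragraph per provision, depth-first, with no headings.

    Nesting depth is deliberately not turned into heading levels or indentation;
    the bold number alone carries the position in the hierarchy.
    """
    for provision in provisions:
        yield f"**{provision['num']}** {provision['text']}"
        yield from render_flat(provision.get("children", []))


# Placeholder hierarchy: subsection -> paragraph -> subparagraph.
tree = [
    {"num": "(a)", "text": "First subsection.", "children": [
        {"num": "(1)", "text": "A paragraph.", "children": [
            {"num": "(A)", "text": "A subparagraph."},
        ]},
    ]},
    {"num": "(b)", "text": "Second subsection."},
]

paragraphs = list(render_flat(tree))
print("\n\n".join(paragraphs))
```

Because every provision becomes an ordinary paragraph, a chunker that splits on blank lines sees a flat sequence rather than a heading tree.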
+### Metadata Indexes
+
+Each title and chapter directory includes a `_meta.json` file with structured metadata for programmatic access:
+
+```json
+{
+  "format_version": "1.0.0",
+  "identifier": "/us/usc/t5",
+  "title_number": 5,
+  "title_name": "Government Organization and Employees",
+  "stats": {
+    "chapter_count": 63,
+    "section_count": 1162,
+    "total_tokens_estimate": 2207855
+  },
+  "chapters": [
+    {
+      "identifier": "/us/usc/t5/ptI/ch1",
+      "number": 1,
+      "name": "Organization",
+      "directory": "chapter-01",
+      "sections": [
+        {
+          "identifier": "/us/usc/t5/s101",
+          "number": "101",
+          "name": "Executive departments",
+          "file": "section-101.md",
+          "token_estimate": 4200,
+          "has_notes": true,
+          "status": "current"
+        }
+      ]
+    }
+  ]
+}
+```
+
 For the complete output format specification, see [docs/OUTPUT_FORMAT.md](docs/OUTPUT_FORMAT.md).
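Note on consuming the metadata index: a reader of a title-level `_meta.json` might walk `chapters` and `sections` to list output files and total the per-section token estimates. The sketch below assumes only the fields shown in the example above; `load_meta` and `summarize_title` are hypothetical helper names, not part of law2md, and the sample dict is a trimmed copy of that example so the code runs without files on disk.

```python
import json
from pathlib import Path


def load_meta(path):
    """Read one _meta.json file into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize_title(meta):
    """Flatten chapters -> sections and total the listed token estimates."""
    rows = [
        (chapter["directory"], section)
        for chapter in meta.get("chapters", [])
        for section in chapter.get("sections", [])
    ]
    return {
        "identifier": meta["identifier"],
        # Paths are relative to the title directory that holds this index.
        "files": [f"{directory}/{section['file']}" for directory, section in rows],
        "listed_tokens": sum(section.get("token_estimate", 0) for _, section in rows),
    }


# Trimmed copy of the README example (one chapter, one section).
SAMPLE = {
    "identifier": "/us/usc/t5",
    "title_number": 5,
    "chapters": [
        {
            "directory": "chapter-01",
            "sections": [
                {"number": "101", "file": "section-101.md", "token_estimate": 4200},
            ],
        }
    ],
}

summary = summarize_title(SAMPLE)
print(summary["files"])          # ['chapter-01/section-101.md']
print(summary["listed_tokens"])  # 4200
```

With real output, `summarize_title(load_meta(...))` would be called on a title-level index file. Summing `token_estimate` over listed sections is one way to budget a retrieval context before opening any Markdown file.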

 ## Project Status

-### Phase 1: Foundation -- Complete
+### Phase 1: Foundation -- Complete (v0.1.0)

-The core conversion pipeline is functional. `law2md` can convert any U.S. Code title XML file to section-level Markdown with frontmatter, source credits, editorial notes, and statutory notes.