law2md converts U.S. legislative XML (USLM schema) into structured Markdown for AI/RAG ingestion. It is a monorepo built with Turborepo, pnpm workspaces, TypeScript, and Node.js.
law2md/
├── packages/
│ ├── core/ # @law2md/core — XML parsing, AST, Markdown rendering, shared utilities
│ ├── usc/ # @law2md/usc — U.S. Code-specific element handlers and downloader
│ └── cli/ # law2md — CLI binary (the published npm package users install)
├── downloads/
│ └── usc/
│ └── xml/ # Full USC XML files (usc01.xml ... usc54.xml) — gitignored
├── fixtures/
│ ├── fragments/ # Small synthetic XML snippets for unit tests
│ └── expected/ # Expected output snapshots for integration tests
├── docs/ # Architecture, output format spec, extension guide
├── turbo.json # Turborepo pipeline config
└── CLAUDE.md # This file
- Runtime: Node.js >= 20 LTS (ESM)
- Language: TypeScript 5.x, strict mode, no `any` unless explicitly justified
- XML Parsing: `saxes` (SAX streaming) + `@xmldom/xmldom` (DOM for fragments)
- CLI: `commander`
- Validation: `zod`
- YAML: `yaml` package
- Zip: `yauzl`
- Token Counting: character/4 heuristic
- Logging: `pino`
- Testing: `vitest`
- Build: `tsup`
- Linting: ESLint + `@typescript-eslint`
- Formatting: Prettier
- Monorepo: Turborepo + pnpm workspaces
# Install dependencies (from repo root)
pnpm install
# Build all packages
pnpm turbo build
# Build a specific package
pnpm turbo build --filter=@law2md/core
# Run all tests
pnpm turbo test
# Run tests for a specific package
pnpm turbo test --filter=@law2md/usc
# Type check
pnpm turbo typecheck
# Lint
pnpm turbo lint
# Dev mode (watch + rebuild)
pnpm turbo dev
# Run the CLI locally during development
node packages/cli/dist/index.js convert ./downloads/usc/xml/usc01.xml -o ./test-output
node packages/cli/dist/index.js convert --titles 1-5 -o ./test-output
node packages/cli/dist/index.js download --titles 1-
- pnpm workspaces with `workspace:*` protocol for internal deps
- ESM only (`"type": "module"` in all package.json files)
- Strict mode: `strict: true`, `noUncheckedIndexedAccess: true`, `exactOptionalPropertyTypes: true`
- Use `import type` for type-only imports
- Prefer `interface` over `type` for object shapes (better error messages, declaration merging)
- All exported functions and types must have JSDoc comments
- Use `unknown` over `any`; if `any` is truly needed, add `// eslint-disable-next-line @typescript-eslint/no-explicit-any` with a comment explaining why
- Barrel exports via `index.ts` in each package `src/`
- Files: `kebab-case.ts`
- Types/Interfaces: `PascalCase` (e.g., `SectionNode`, `ConvertOptions`)
- Functions: `camelCase` (e.g., `parseIdentifier`, `renderSection`)
- Constants: `UPPER_SNAKE_CASE` for true constants (e.g., `USLM_NAMESPACE`)
- Enum-like objects: `PascalCase` keys using the `as const satisfies` pattern
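A minimal sketch of the enum-like object convention follows. The `NoteTopic` name and its three members are illustrative assumptions (the values are taken from the note topics listed later in this file), not the project's actual definitions.

```typescript
// Hypothetical enum-like object: PascalCase keys, literal string values.
const NoteTopic = {
  Amendments: "amendments",
  Codification: "codification",
  CrossReferences: "crossReferences",
} as const satisfies Record<string, string>;

/** Union of the literal values: "amendments" | "codification" | "crossReferences". */
type NoteTopic = (typeof NoteTopic)[keyof typeof NoteTopic];

/** Narrows an arbitrary string to one of the known topic values. */
function isNoteTopic(value: string): value is NoteTopic {
  return (Object.values(NoteTopic) as readonly string[]).includes(value);
}
```

`as const` keeps the values as literal types, while `satisfies` checks the object's shape without widening those literals.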
- Use custom error classes extending `Error` with `cause` chaining
- XML parsing errors: warn and continue (log malformed elements, don't crash on anomalous structures)
- File I/O errors: throw with context (file path, operation attempted)
- Never swallow errors silently; at minimum, log at `warn` level
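The rules above might look like this in practice. This is a sketch only: `FileOperationError` and `readOrThrow` are hypothetical names, and the message format is an assumption. It relies on the standard `cause` option of the `Error` constructor (ES2022, available in Node.js 20).

```typescript
/** Hypothetical error type carrying the file path and the attempted operation. */
class FileOperationError extends Error {
  readonly filePath: string;
  readonly operation: string;

  constructor(operation: string, filePath: string, cause: unknown) {
    // Pass the original failure through as `cause` so the chain stays inspectable.
    super(`Failed to ${operation} ${filePath}`, { cause });
    this.name = "FileOperationError";
    this.filePath = filePath;
    this.operation = operation;
  }
}

/** Runs a reader callback and rethrows any failure with file context attached. */
function readOrThrow(filePath: string, read: (path: string) => string): string {
  try {
    return read(filePath);
  } catch (err) {
    throw new FileOperationError("read", filePath, err);
  }
}
```

The reader is injected as a callback so the sketch stays independent of any particular file API.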
- Co-locate test files: `parser.ts` → `parser.test.ts` in same directory
- Use `describe` blocks mirroring the module's exported API
- Snapshot tests for Markdown output stability (update snapshots intentionally, not casually)
- Name test cases descriptively: `it("converts <subsection> with chapeau to indented bold-lettered paragraph")`
Official USLM reference documents are in docs/reference/uslm/:
- `uslm-user-guide.pdf`: OLRC user guide (v0.1.4, Oct 2013). Covers abstract/concrete model, identification, referencing, metadata, versioning, and presentation models.
- `uslm-schema-and-css/USLM-1.0.xsd`: Original schema (July 2013)
- `uslm-schema-and-css/USLM-1.0.15.xsd`: Patched schema (Sept 2013, adds remote namespace resolution for DC and XHTML)
- `uslm-schema-and-css/usctitle.css`: Browser rendering stylesheet
- `uslm-schema-and-css/dc.xsd`, `dcterms.xsd`, `dcmitype.xsd`: Dublin Core metadata schemas
- `uslm-schema-and-css/xhtml-1.0.xsd`, `xml.xsd`: Supporting schemas
The XML files use the USLM 1.0 schema (patch level 1.0.15). Namespace: http://xml.house.gov/schemas/uslm/1.0
<uscDoc identifier="/us/usc/t1">
<meta>
<dc:title>Title 1</dc:title>
<dc:type>USCTitle</dc:type>
<docNumber>1</docNumber>
<property role="is-positive-law">yes</property>
</meta>
<main>
<title identifier="/us/usc/t1">
<num value="1">Title 1—</num>
<heading>GENERAL PROVISIONS</heading>
<chapter identifier="/us/usc/t1/ch1">
<num value="1">CHAPTER 1—</num>
<heading>RULES OF CONSTRUCTION</heading>
<section identifier="/us/usc/t1/s1">
<num value="1">§ 1.</num>
<heading>Words denoting number, gender, and so forth</heading>
<content>...</content>
<sourceCredit>(...)</sourceCredit>
<notes type="uscNote">...</notes>
</section>
</chapter>
</title>
</main>
</uscDoc>
title > subtitle > chapter > subchapter > article > subarticle > part > subpart > division > subdivision
> section (PRIMARY LEVEL)
> subsection > paragraph > subparagraph > clause > subclause > item > subitem > subsubitem
Additional level elements: <preliminary> (outside main hierarchy), <compiledAct>, <courtRules>/<courtRule>, <reorganizationPlans>/<reorganizationPlan> (title appendices).
Important: The schema intentionally does NOT enforce strict hierarchy — any <level> can nest inside any <level>. This is a deliberate design choice, not a bug.
| Element | Purpose | Key Attributes |
|---|---|---|
| `<uscDoc>` | Document root | `identifier` |
| `<title>` | USC title | `identifier` |
| `<chapter>` | Chapter container | `identifier` |
| `<section>` | Primary legal unit | `identifier` |
| `<num>` | Number designation | `value` (normalized) |
| `<heading>` | Element name/title | — |
| `<content>` | Text content block | — |
| `<chapeau>` | Text before sub-levels | — |
| `<continuation>` | Text after or between sub-levels | — |
| `<proviso>` | "Provided that..." text | — |
| `<ref>` | Cross-reference | `href` (canonical URI) |
| `<date>` | Date | `date` (ISO format) |
| `<sourceCredit>` | Enactment source | — |
| `<note>` | Note (various types) | `topic`, `role` |
| `<notes>` | Note container | `type` (e.g., `"uscNote"`) |
| `<quotedContent>` | Quoted legal text | `origin` |
| `<def>` / `<term>` | Definition / defined term | — |
| `<toc>` / `<tocItem>` | Table of contents | — |
| `<layout>` / `<column>` | Column-oriented display | `leaders`, `colspan` |
| `<table>` (XHTML ns) | HTML table | Standard HTML attrs |
USLM uses canonical URI paths as identifiers:
/us/usc/t{title}/s{section}/{subsection}/{paragraph}
Examples:
/us/usc/t1 — Title 1
/us/usc/t1/ch1 — Chapter 1 of Title 1
/us/usc/t1/s1 — Section 1 of Title 1
/us/usc/t1/s1/a — Subsection (a) of Section 1
/us/usc/t1/s1/a/2 — Paragraph (2) of Subsection (a)
Reference prefixes (big levels): t = title, st = subtitle, ch = chapter, sch = subchapter, art = article, p = part, sp = subpart, d = division, sd = subdivision, s = section. Small levels (subsection and below) use their number directly without a prefix.
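The identifier scheme above can be decomposed with a small parser. The following is a sketch under stated assumptions: `parseUscIdentifier` and `ParsedIdentifier` are hypothetical names (not the project's actual `parseIdentifier`), only `/us/usc/...` identifiers are handled, and everything after the section segment is treated as an unprefixed small level.

```typescript
/** Big-level prefixes from the list above, ordered so longer prefixes match before "s". */
const BIG_LEVEL_PREFIXES: ReadonlyArray<readonly [prefix: string, level: string]> = [
  ["sch", "subchapter"],
  ["art", "article"],
  ["st", "subtitle"],
  ["ch", "chapter"],
  ["sp", "subpart"],
  ["sd", "subdivision"],
  ["t", "title"],
  ["p", "part"],
  ["d", "division"],
  ["s", "section"],
];

interface ParsedIdentifier {
  /** Prefixed levels in document order, e.g. { level: "title", value: "1" }. */
  bigLevels: { level: string; value: string }[];
  /** Unprefixed levels below the section, e.g. ["a", "2"]. */
  smallLevels: string[];
}

/** Parses a /us/usc/... identifier; returns null for anything it does not recognize. */
function parseUscIdentifier(identifier: string): ParsedIdentifier | null {
  const root = "/us/usc/";
  if (!identifier.startsWith(root)) return null;

  const result: ParsedIdentifier = { bigLevels: [], smallLevels: [] };
  let pastSection = false;

  for (const segment of identifier.slice(root.length).split("/")) {
    if (segment === "") return null;
    if (pastSection) {
      // Small levels carry their number directly, with no prefix.
      result.smallLevels.push(segment);
      continue;
    }
    const match = BIG_LEVEL_PREFIXES.find(
      ([p]) => segment.startsWith(p) && segment.length > p.length,
    );
    if (match === undefined) return null;
    const [matchedPrefix, level] = match;
    result.bigLevels.push({ level, value: segment.slice(matchedPrefix.length) });
    if (level === "section") pastSection = true;
  }
  return result;
}
```

For `/us/usc/t1/s1/a/2` this yields the big levels title `1` and section `1`, followed by the small levels `a` and `2`.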
Full reference URL structure: [item][work][!lang][/portion][@temporal][.manifestation]
- Only `/us/usc/...` references are converted to relative Markdown links
- `/us/stat/...` (Statutes at Large), `/us/pl/...` (Public Law) render as plain text citations
- `@portion` on `<ref>` extends a reference established via `@idref` (composable)
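The first two rules can be sketched as a single rendering function. The `renderRef` name and the `resolvePath` callback are assumptions for illustration. The sketch takes a lookup because a section identifier alone does not say which chapter directory the file lands in. It leaves unresolved `/us/usc/...` references as plain text rather than guessing at a fallback URL.

```typescript
/**
 * Renders a <ref> as Markdown. `resolvePath` is a hypothetical lookup from a
 * USLM identifier to a relative output path; it returns undefined when the
 * target was not converted.
 */
function renderRef(
  href: string,
  text: string,
  resolvePath: (identifier: string) => string | undefined,
): string {
  if (href.startsWith("/us/usc/")) {
    const target = resolvePath(href);
    if (target !== undefined) return `[${text}](${target})`;
  }
  // /us/stat/..., /us/pl/..., and unresolved references stay as plain text here.
  return text;
}
```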
Default (USLM): http://xml.house.gov/schemas/uslm/1.0
Dublin Core: http://purl.org/dc/elements/1.1/
DC Terms: http://purl.org/dc/terms/
XHTML: http://www.w3.org/1999/xhtml
XSI: http://www.w3.org/2001/XMLSchema-instance
Tables use the XHTML namespace. Always check namespace when handling <table> elements — USLM <layout> uses the default namespace, XHTML <table> uses http://www.w3.org/1999/xhtml.
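A namespace check along these lines keeps the two tabular elements apart. This is a sketch: `QualifiedName` and `classifyTabularElement` are hypothetical names, and the `uri`/`local` fields assume a namespace-aware parser that reports the resolved namespace URI and local name for each tag.

```typescript
const USLM_NAMESPACE = "http://xml.house.gov/schemas/uslm/1.0";
const XHTML_NAMESPACE = "http://www.w3.org/1999/xhtml";

/** Resolved element name as a namespace-aware parser would report it (assumed shape). */
interface QualifiedName {
  uri: string;
  local: string;
}

type TabularKind = "xhtml-table" | "uslm-layout" | "other";

/** Distinguishes XHTML <table> from USLM <layout> by namespace and local name. */
function classifyTabularElement(name: QualifiedName): TabularKind {
  if (name.local === "table" && name.uri === XHTML_NAMESPACE) return "xhtml-table";
  if (name.local === "layout" && name.uri === USLM_NAMESPACE) return "uslm-layout";
  return "other";
}
```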
Notes have two independent classification axes:
- `@type` (placement): `"inline"`, `"footnote"`, `"endnote"`, `"uscNote"` (after sourceCredit)
- `@topic` (semantic category): `"amendments"`, `"codification"`, `"changeOfName"`, `"crossReferences"`, `"effectiveDateOfAmendment"`, `"miscellaneous"`, `"repeals"`, `"regulations"`, `"dispositionOfSections"`, `"enacting"`
The schema also defines concrete note subtypes: <sourceCredit>, <statutoryNote>, <editorialNote>, <changeNote> (records non-substantive changes, usually in square brackets).
Within <notes type="uscNote"> containers, <note role="crossHeading"> elements with <heading> containing "Editorial Notes" or "Statutory Notes" act as section dividers. Notes following a cross-heading belong to that category until the next cross-heading.
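The cross-heading rule amounts to a single pass that tracks the current category. The sketch below assumes a simplified `NoteLike` shape and an `"uncategorized"` bucket for notes that precede any cross-heading. Both are illustrative choices, not the project's AST types.

```typescript
/** Simplified stand-in for a parsed <note>; the real AST node may differ. */
interface NoteLike {
  role?: string;
  heading?: string;
  text: string;
}

type NoteCategory = "editorial" | "statutory" | "uncategorized";

/** Assigns each note to the category of the most recent cross-heading. */
function groupNotesByCrossHeading(notes: readonly NoteLike[]): Record<NoteCategory, NoteLike[]> {
  const groups: Record<NoteCategory, NoteLike[]> = {
    editorial: [],
    statutory: [],
    uncategorized: [],
  };
  let current: NoteCategory = "uncategorized";

  for (const note of notes) {
    if (note.role === "crossHeading") {
      // A cross-heading switches the category and is not itself emitted as a note.
      const heading = note.heading ?? "";
      if (heading.includes("Editorial Notes")) current = "editorial";
      else if (heading.includes("Statutory Notes")) current = "statutory";
      else current = "uncategorized";
      continue;
    }
    groups[current].push(note);
  }
  return groups;
}
```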
Elements can carry `@status` indicating their legal state. The schema defines the following values: proposed, withdrawn, cancelled, pending, operational, suspended, renumbered, repealed, expired, terminated, hadItsEffect, omitted, notAdopted, transferred, redesignated, reserved, vacant, crossReference, unknown.
Current release point page: https://uscode.house.gov/download/download.shtml
Individual title XML zip:
https://uscode.house.gov/download/releasepoints/us/pl/{congress}/{law}/xml_usc{NN}@{congress}-{law}.zip
All titles XML zip:
https://uscode.house.gov/download/releasepoints/us/pl/{congress}/{law}/xml_uscAll@{congress}-{law}.zip
Where `{NN}` is the zero-padded title number (01-54), `{congress}` is the Congress number, and `{law}` is the public law number.
Example (current as of early 2026): xml_usc01@119-73not60.zip
Note: Release points can include exclusion suffixes (e.g., 119-73not60 means "through PL 119-73, excluding PL 119-60"). The current release point is hardcoded in packages/usc/src/downloader.ts as CURRENT_RELEASE_POINT.
The zip contains a single XML file named like usc01.xml.
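The URL templates above can be filled in with a small helper. This is a sketch: `ReleasePoint`, `titleZipUrl`, and `allTitlesZipUrl` are hypothetical names, and the actual shape of `CURRENT_RELEASE_POINT` in `downloader.ts` may differ. The `law` field is a string here so that it can carry an exclusion suffix such as `73not60`.

```typescript
const RELEASE_POINT_BASE = "https://uscode.house.gov/download/releasepoints/us/pl";

/** Assumed release point shape, e.g. { congress: 119, law: "73not60" }. */
interface ReleasePoint {
  congress: number;
  law: string;
}

/** Builds the zip URL for one title (1-54), zero-padding the title number. */
function titleZipUrl(title: number, rp: ReleasePoint): string {
  if (!Number.isInteger(title) || title < 1 || title > 54) {
    throw new RangeError(`Title number out of range: ${title}`);
  }
  const nn = String(title).padStart(2, "0");
  return `${RELEASE_POINT_BASE}/${rp.congress}/${rp.law}/xml_usc${nn}@${rp.congress}-${rp.law}.zip`;
}

/** Builds the zip URL for the all-titles archive. */
function allTitlesZipUrl(rp: ReleasePoint): string {
  return `${RELEASE_POINT_BASE}/${rp.congress}/${rp.law}/xml_uscAll@${rp.congress}-${rp.law}.zip`;
}
```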
output/usc/title-{NN}/chapter-{NN}/section-{N}.md
- Title dirs: `title-01` through `title-54` (zero-padded to 2 digits)
- Chapter dirs: `chapter-01`, `chapter-02`, etc. (zero-padded to 2 digits)
- Section files: `section-{N}.md` where N is the section number (NOT zero-padded, since section numbers vary in length and can be alphanumeric; e.g., `section-7801`)
- Subchapter dirs nest inside chapter dirs when present
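The padding rules above can be sketched as a path builder. `sectionOutputPath` is a hypothetical helper that assumes numeric title and chapter numbers and ignores subchapter nesting. The section number is passed as a string because it is used verbatim.

```typescript
/** Zero-pads a title or chapter number to two digits. */
function padTwo(n: number): string {
  return String(n).padStart(2, "0");
}

/** Builds the output path for one section; the section number is not padded. */
function sectionOutputPath(title: number, chapter: number, section: string): string {
  return `output/usc/title-${padTwo(title)}/chapter-${padTwo(chapter)}/section-${section}.md`;
}
```

For example, `sectionOutputPath(1, 1, "1")` gives `output/usc/title-01/chapter-01/section-1.md`.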
- SAX over DOM: Large titles (26, 42) can exceed 100MB XML. SAX streaming keeps memory bounded. DOM is used only for small fragment inspection.
- Section as the atomic unit: A section is the smallest citable legal unit in the U.S. Code. Subsections, paragraphs, etc. are rendered within the section file, not as separate files.
- Frontmatter + sidecar index: Both YAML frontmatter on every .md file AND `_meta.json` per directory. Frontmatter enables file-level RAG ingestion. Sidecar enables index-based retrieval without parsing every file.
- Relative cross-reference links: Cross-refs within the converted corpus use relative Markdown links. Refs to unconverted titles fall back to OLRC website URLs.
- Notes are opt-in: By default, only the core statutory text and source credits are included. Notes (editorial, statutory, amendments) require explicit CLI flags. This keeps default output lean for RAG.
- Streaming output: Sections are written to disk as they are parsed. The converter never holds an entire title's worth of AST in memory simultaneously.
- Footnotes: Rendered as Markdown footnotes (`[^N]` at reference site, `[^N]: text` at bottom of section file).
- Appendix titles: Separate output directories (e.g., `title-05-appendix/`) for titles with appendices (5, 11, 18, 28).
- Token estimation: Uses character/4 heuristic for token counts in `_meta.json`.
- Table of Disposition: Excluded from section-level output. Included in title-level README.md.
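The character/4 heuristic from the list above is small enough to show in full. This is a sketch: the `estimateTokens` name and the choice to round up are assumptions, since the text specifies only the divisor.

```typescript
/** Estimates a token count as characters / 4, rounded up (rounding is an assumption). */
function estimateTokens(markdown: string): number {
  return Math.ceil(markdown.length / 4);
}

/** Sums estimates across several rendered files, e.g. for a per-directory total. */
function estimateTotalTokens(files: readonly string[]): number {
  return files.reduce((total, content) => total + estimateTokens(content), 0);
}
```

The heuristic avoids a tokenizer dependency, so the counts are approximate.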
- XHTML namespace tables: `<table>` elements in USC XML are in the XHTML namespace, not the USLM namespace. The SAX parser must handle namespace-aware element names.
- Anomalous structures: Some sections have non-standard nesting (e.g., `<paragraph>` directly under `<section>` without a `<subsection>`). Handlers must not assume strict hierarchy.
- Empty/repealed sections: Some sections contain only a `<note>` with status information (e.g., "Repealed" or "Transferred"). These should still produce an output file with appropriate frontmatter.
- Roman numeral numbering: Clauses use lowercase Roman numerals (i, ii, iii), subclauses use uppercase (I, II, III). The `<num>` element's `@value` attribute contains the normalized form.
- Inline XHTML in content: `<b>`, `<i>`, `<sub>`, `<sup>` elements appear inline within text content. They are in the USLM namespace, not XHTML.
- Multiple `<p>` elements in content: A single `<content>` or `<note>` may contain multiple `<p>` elements. Each should be a separate paragraph in Markdown output.
- Permissive content model: `<content>` uses `processContents="lax"` with `namespace="##any"`, so it can contain elements from any namespace, including embedded XHTML. The SAX parser must handle unexpected elements gracefully.
- `<continuation>` is interstitial: It appears not just after sub-levels but also between elements of the same level. Handle it as a text block in whatever position it appears.
- Element versioning: Elements can have `@startPeriod`/`@endPeriod`/`@status` for point-in-time variants. Multiple versions of the same element may coexist in the document.
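As one illustration of the multiple-`<p>` rule above, the text of each `<p>` can be joined with blank lines so that each becomes its own Markdown paragraph. `renderParagraphs` is a hypothetical helper. Collapsing internal whitespace and dropping empty paragraphs are assumptions about the desired output.

```typescript
/** Joins the text of several <p> elements into blank-line-separated Markdown paragraphs. */
function renderParagraphs(paragraphs: readonly string[]): string {
  return paragraphs
    .map((p) => p.replace(/\s+/g, " ").trim()) // collapse XML indentation and line breaks
    .filter((p) => p.length > 0) // skip paragraphs that were whitespace only
    .join("\n\n");
}
```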
Note: The extension architecture is aspirational. No pluggable handler interfaces exist yet; element handling is built into the `ASTBuilder` class. See `docs/extending.md` for details.
- Create a new package: `packages/cfr/` (or `packages/state-il/`, etc.)
- Implement a converter function analogous to `convertTitle()` in `@law2md/usc`
- Extend or adapt the `ASTBuilder` for source-specific elements
- Add a new CLI command in `packages/cli`
- Reuse `@law2md/core` for XML parsing, AST types, Markdown rendering, and frontmatter
- Add source-specific download logic if applicable
- Document the source's XML schema in the package README