law2md

Convert the United States Code into structured Markdown for AI and RAG systems.


Overview

law2md is a command-line tool that converts the United States Code XML published by the Office of the Law Revision Counsel (OLRC) into clean, structured Markdown optimized for AI ingestion, retrieval-augmented generation (RAG), and legal research workflows. The OLRC provides a user guide for United States Legislative Markup (USLM), the schema this XML uses.

The U.S. Code comprises 54 titles of federal statutory law. The official XML is deeply nested, laden with presentation markup, and difficult to work with directly. law2md transforms this XML into per-section, or optional per-chapter, Markdown files with YAML frontmatter, predictable file paths, and content sized for typical embedding models.

Features

  • Built-in downloader -- fetch individual titles or the entire U.S. Code directly from OLRC
  • Streaming SAX parser -- processes XML files of any size (including 100MB+ titles) with bounded memory
  • Section-level output -- each section becomes its own Markdown file, sized for RAG chunk windows
  • Chapter-level output -- optional mode that inlines all sections into per-chapter files
  • YAML frontmatter -- structured metadata on every file (identifier, title, chapter, section, status, source credit)
  • Structural fidelity -- preserves the full USLM hierarchy using bold inline numbering that mirrors legal citation convention
  • Cross-reference links -- resolved as relative links within the corpus, or as OLRC website URLs
  • Filterable notes -- editorial notes, statutory notes, and amendment history can be selectively included or excluded
  • Metadata indexes -- _meta.json sidecar files with section listings and token estimates
  • Tables -- XHTML tables and USLM layout tables converted to Markdown pipe tables
  • Dry-run mode -- preview conversion stats without writing files
  • Appendix handling -- titles with appendices (5, 11, 18, 28) output to separate directories
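
The streaming approach named in the feature list can be illustrated with a small SAX handler. This is a sketch in Python, not law2md's own parser; it assumes sections appear as `<section>` elements carrying an `identifier` attribute, and the class and function names are hypothetical.

```python
import xml.sax


class SectionCollector(xml.sax.ContentHandler):
    """Collect section identifiers while streaming; no document tree is built."""

    def __init__(self):
        super().__init__()
        self.identifiers = []

    def startElement(self, name, attrs):
        # Assumption: sections are <section> elements with an "identifier" attribute.
        if name == "section" and "identifier" in attrs:
            self.identifiers.append(attrs["identifier"])


def section_identifiers(xml_bytes):
    """Parse XML bytes event by event and return the section identifiers seen."""
    handler = SectionCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.identifiers


# Tiny illustrative document; real title files are far larger.
SAMPLE = b"""<?xml version="1.0"?>
<uscDoc><main><title identifier="/us/usc/t1">
  <chapter identifier="/us/usc/t1/ch1">
    <section identifier="/us/usc/t1/s1"></section>
    <section identifier="/us/usc/t1/s7"></section>
  </chapter>
</title></main></uscDoc>"""

print(section_identifiers(SAMPLE))
```

For a file on disk, `xml.sax.parse(path, handler)` reads incrementally in the same way, so memory use depends on what the handler retains rather than on the size of the file.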

Installation

From npm

npm install -g law2md

From source

Requires Node.js >= 20 and pnpm >= 10.

git clone https://github.com/chris-c-thomas/law2md.git
cd law2md
pnpm install
pnpm turbo build

Quick Start

# Download Title 1 (smallest title, good for testing)
law2md download --titles 1

# Convert to Markdown
law2md convert ./downloads/usc/xml/usc01.xml -o ./output

# Download and convert multiple titles at once
law2md download --titles 1-5 && law2md convert --titles 1-5

Usage

Download

Fetch U.S. Code XML files directly from the Office of the Law Revision Counsel:

# Download a single title
law2md download --titles 1

# Download multiple titles (range)
law2md download --titles 1-5

# Download specific titles (mixed)
law2md download --titles 1-5,8,11

# Download all 54 titles
law2md download --all

# Use a specific release point
law2md download --titles 26 --release-point 119-73not60

Or download manually from the OLRC download page.

Convert

# Convert a single XML file
law2md convert ./downloads/usc/xml/usc01.xml -o ./output

# Convert by title number (uses default input directory)
law2md convert --titles 1

# Convert multiple titles
law2md convert --titles 1-5,8,11

# Convert with a custom input directory
law2md convert --titles 1-5 -i ./my-xml-files

# Chapter-level output
law2md convert ./downloads/usc/xml/usc01.xml -o ./output -g chapter

# Cross-reference links resolved to OLRC URLs
law2md convert ./downloads/usc/xml/usc05.xml -o ./output --link-style canonical

# Include only amendment notes
law2md convert ./downloads/usc/xml/usc01.xml -o ./output --include-amendments

# Exclude all notes
law2md convert ./downloads/usc/xml/usc01.xml -o ./output --no-include-notes

# Dry-run: preview stats without writing files
law2md convert ./downloads/usc/xml/usc42.xml -o ./output --dry-run

CLI Reference

law2md convert [input] [options]

Arguments:
  input                          Path to a USC XML file (optional if --titles is used)

Options:
  --titles <spec>                Title(s) to convert: single (1), range (1-5),
                                 or mixed (1-5,8,11)
  -i, --input-dir <dir>          Input directory for XML files
                                 (default: "./downloads/usc/xml")
  -o, --output <dir>             Output directory (default: "./output")
  -g, --granularity <level>      "section" or "chapter" (default: "section")
  --link-style <style>           "plaintext", "canonical", or "relative"
                                 (default: "plaintext")
  --no-include-source-credits    Exclude source credit annotations
  --no-include-notes             Exclude all notes
  --include-editorial-notes      Include editorial notes only
  --include-statutory-notes      Include statutory notes only
  --include-amendments           Include amendment history notes only
  --dry-run                      Parse and report without writing files
  -v, --verbose                  Enable verbose logging
  -h, --help                     Display help

law2md download [options]

Options:
  --titles <spec>                Title(s) to download: single (1), range (1-5),
                                 or mixed (1-5,8,11)
  --all                          Download all 54 titles
  -o, --output <dir>             Output directory (default: "./downloads/usc/xml")
  --release-point <point>        OLRC release point (default: current)
  -h, --help                     Display help

When multiple selective note flags (--include-editorial-notes, --include-statutory-notes, --include-amendments) are specified, they combine additively.
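
The `--titles` spec grammar (single, range, mixed) is easy to reproduce in scripts that drive law2md. The following is a minimal sketch, not the CLI's own parser: `parse_title_spec` and `default_xml_path` are hypothetical helpers, the 1 to 54 bound comes from the title count stated above, and the `uscNN.xml` naming is assumed from the example paths in this README.

```python
def parse_title_spec(spec):
    """Expand a spec such as "1-5,8,11" into a sorted list of title numbers."""
    titles = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty item in title spec: {spec!r}")
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            titles.update(range(start, end + 1))
        else:
            titles.add(int(part))
    out_of_range = sorted(t for t in titles if not 1 <= t <= 54)
    if out_of_range:
        raise ValueError(f"title numbers outside 1-54: {out_of_range}")
    return sorted(titles)


def default_xml_path(title, input_dir="./downloads/usc/xml"):
    """Assumed file naming, inferred from paths such as usc01.xml and usc42.xml."""
    return f"{input_dir}/usc{title:02d}.xml"


print(parse_title_spec("1-5,8,11"))
print(default_xml_path(1))
```

Collecting into a set means overlapping items such as `1-5,3` yield each title once.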


Output Format

Directory Structure

output/
  usc/
    title-01/
      README.md
      _meta.json
      chapter-01/
        _meta.json
        section-1.md
        section-2.md
        ...
      chapter-02/
        _meta.json
        section-101.md
        ...

Title directories are zero-padded (title-01 through title-54). Chapter directories follow the same convention. Section files use the section number as-is, which may be purely numeric (e.g., section-7801.md) or alphanumeric (e.g., section-106a.md).
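
Because the paths are predictable, a consumer can compute a section's location without scanning the tree. This is a sketch under stated assumptions: `section_path` is a hypothetical helper, padding is taken to be two digits as in the tree above, and chapter numbers are assumed to be integers. The arguments in the example calls are illustrative.

```python
from pathlib import PurePosixPath


def section_path(title, chapter, section, root="output"):
    """Build the expected path of a section file from its numbers.

    title and chapter are integers (zero-padded to two digits);
    section is a string used as-is, since it may be alphanumeric.
    """
    return PurePosixPath(
        root,
        "usc",
        f"title-{title:02d}",
        f"chapter-{chapter:02d}",
        f"section-{section}.md",
    )


print(section_path(1, 1, "7"))
print(section_path(1, 2, "106a"))
```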

Markdown Structure

Each section file consists of YAML frontmatter followed by statutory text:

---
identifier: "/us/usc/t1/s7"
title: "1 USC § 7 - Marriage"
title_number: 1
title_name: "GENERAL PROVISIONS"
section_number: "7"
section_name: "Marriage"
chapter_number: 1
chapter_name: "RULES OF CONSTRUCTION"
positive_law: true
currency: "119-73"
last_updated: "2025-12-03"
format_version: "1.0.0"
generator: "law2md@0.5.0"
source_credit: "(Added Pub. L. 104-199, § 3(a), Sept. 21, 1996, ...)"
---
# § 7. Marriage

**(a)** For the purposes of any Federal law, rule, or regulation in which
marital status is a factor, an individual shall be considered married if...

**(b)** In this section, the term "State" means a State, the District of
Columbia, the Commonwealth of Puerto Rico, or any other territory...

---

**Source Credit**: (Added Pub. L. 104-199, § 3(a), Sept. 21, 1996, ...)

Subsections and below use bold inline numbering (**(a)**, **(1)**, **(A)**, **(i)**) rather than Markdown headings. This preserves a flat document structure optimized for embedding models and chunking strategies.
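
A consumer of these files typically separates the frontmatter from the body and then locates the bold markers. The sketch below uses only the Python standard library and assumes the frontmatter is a flat list of `key: value` scalars, as in the sample; a real pipeline would more likely use a YAML library. The function and constant names are hypothetical.

```python
import json
import re

# Matches a bold inline marker such as **(a)** or **(1)** at the start of a line.
MARKER = re.compile(r"^\*\*\(([0-9A-Za-z]+)\)\*\*", re.MULTILINE)


def split_frontmatter(text):
    """Return (metadata, body) for a section file with flat frontmatter."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    # Only the first closing '---' ends the frontmatter; a later '---'
    # in the body is a horizontal rule and must be left alone.
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        return {}, text
    metadata = {}
    for line in lines[1:end]:
        key, sep, raw = line.partition(":")
        if not sep:
            continue
        raw = raw.strip()
        try:
            # Quoted strings, numbers, and true/false parse as JSON scalars.
            metadata[key.strip()] = json.loads(raw)
        except json.JSONDecodeError:
            metadata[key.strip()] = raw
    return metadata, "\n".join(lines[end + 1:])


SAMPLE = """---
identifier: "/us/usc/t1/s7"
section_number: "7"
chapter_number: 1
positive_law: true
---
# § 7. Marriage

**(a)** For the purposes of any Federal law...

**(b)** In this section, the term "State" means...

---

**Source Credit**: (Added Pub. L. 104-199, ...)
"""

meta, body = split_frontmatter(SAMPLE)
print(meta["identifier"], MARKER.findall(body))
```

The marker pattern requires the parenthesized form, so the bold Source Credit label is not mistaken for a subsection.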

Metadata Indexes

Each title directory and each chapter directory includes a _meta.json file for programmatic access. A title-level example:

{
  "format_version": "1.0.0",
  "identifier": "/us/usc/t5",
  "title_number": 5,
  "title_name": "Government Organization and Employees",
  "stats": {
    "chapter_count": 63,
    "section_count": 1162,
    "total_tokens_estimate": 2207855
  },
  "chapters": [
    {
      "identifier": "/us/usc/t5/ptI/ch1",
      "number": 1,
      "name": "Organization",
      "directory": "chapter-01",
      "sections": [
        {
          "identifier": "/us/usc/t5/s101",
          "number": "101",
          "name": "Executive departments",
          "file": "section-101.md",
          "token_estimate": 4200,
          "has_notes": true,
          "status": "current"
        }
      ]
    }
  ]
}
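
One use of the index is to pick sections that fit an embedding model's context window before reading any Markdown. The following is a minimal sketch using the field names from the example above; `sections_within_budget` is a hypothetical helper, and the inline data is trimmed from that example rather than read from disk.

```python
import json


def sections_within_budget(meta, max_tokens):
    """List (identifier, relative path, token_estimate) for sections that fit."""
    picked = []
    for chapter in meta["chapters"]:
        for section in chapter["sections"]:
            if section["token_estimate"] <= max_tokens:
                path = f'{chapter["directory"]}/{section["file"]}'
                picked.append((section["identifier"], path, section["token_estimate"]))
    return picked


# Trimmed copy of the title-level index shown above.
META = json.loads("""
{
  "identifier": "/us/usc/t5",
  "chapters": [
    {
      "directory": "chapter-01",
      "sections": [
        {
          "identifier": "/us/usc/t5/s101",
          "file": "section-101.md",
          "token_estimate": 4200
        }
      ]
    }
  ]
}
""")

print(sections_within_budget(META, 8192))
```

In practice the dictionary would come from `json.load` on a title's `_meta.json`, and the returned paths are relative to that title directory.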

For the complete output format specification, see docs/output-format.md.


Project Structure

law2md/
  packages/
    core/          @law2md/core -- XML parsing, AST, Markdown rendering
    usc/           @law2md/usc -- U.S. Code conversion logic and downloader
    cli/           law2md -- CLI entry point (the published npm package)
  fixtures/
    fragments/     Small XML snippets for unit tests
    expected/      Expected output snapshots
  docs/            Architecture, output format spec, extension guide

The project is a monorepo managed with pnpm workspaces and Turborepo. The separation into core and usc packages is designed to support additional legal source types (CFR, state statutes) by adding new packages that share the core infrastructure.


Development

pnpm install               # Install dependencies
pnpm turbo build           # Build all packages
pnpm turbo test            # Run all tests
pnpm turbo lint            # Lint all packages
pnpm turbo typecheck       # Type-check all packages

See CONTRIBUTING.md for the full contributor guide.


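Documentation

The docs/ directory holds the architecture overview, the output format specification (docs/output-format.md), and the extension guide.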


Data Sources

law2md processes XML published by the Office of the Law Revision Counsel (OLRC) of the U.S. House of Representatives. The XML uses the United States Legislative Markup (USLM) 1.0 schema.

The U.S. Code XML is public domain and freely available at uscode.house.gov/download/download.shtml.


License

MIT. See LICENSE.