diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 013d283f..a9fdf977 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -1,10 +1,16 @@ # law2md Workspace Instructions ## Scope -These instructions apply to all work in this repository. Keep changes minimal, targeted, and consistent with existing package boundaries. + +These instructions apply to all work in this repository. Keep changes minimal, targeted, and consistent with existing package boundaries. See `CLAUDE.md` for the full USLM schema reference, design decisions, and common pitfalls. + +## Project Overview + +`law2md` converts U.S. legislative XML (USLM schema) into structured Markdown for AI/RAG ingestion. It is a monorepo built with Turborepo, pnpm workspaces, TypeScript, and Node.js. ## Build and Test -Run commands from the repository root. + +Run commands from the repository root. Always use `pnpm`, not `npm`. ```bash pnpm install @@ -14,47 +20,126 @@ pnpm turbo typecheck pnpm turbo lint ``` -Useful package-scoped pattern: +Package-scoped pattern: ```bash pnpm turbo --filter=@law2md/core +pnpm turbo --filter=@law2md/usc +pnpm turbo --filter=law2md +``` + +Run the CLI locally during development: + +```bash +node packages/cli/dist/index.js download --titles 1 +node packages/cli/dist/index.js convert --all +node packages/cli/dist/index.js convert --titles 1-5 -o ./test-output ``` ## Architecture + This is a Turborepo + pnpm monorepo with three packages: -- `packages/core` (`@law2md/core`): namespace-aware XML parsing, AST building, markdown rendering, shared utilities. -- `packages/usc` (`@law2md/usc`): USC-specific conversion and OLRC downloading logic. -- `packages/cli` (`law2md`): CLI commands (`convert`, `download`) and user-facing command surface. +- `packages/core` (`@law2md/core`): namespace-aware XML parsing (SAX via `saxes`), AST building, Markdown rendering, frontmatter generation, shared utilities. 
+- `packages/usc` (`@law2md/usc`): USC-specific conversion pipeline and OLRC downloader. Contains `convertTitle()` which orchestrates ReadStream → SAX → AST → Markdown → file writer. +- `packages/cli` (`law2md`): CLI commands (`convert`, `download`), terminal UI (`chalk`, `ora`, `cli-table3`), and user-facing command surface. + +Respect boundaries: keep generic parsing/rendering logic in `core`, USC-specific behavior in `usc`, and CLI orchestration in `cli`. Internal packages use `workspace:*` protocol for dependencies. + +### Key files -Respect boundaries: keep generic parsing/rendering logic in `core`, USC-specific behavior in `usc`, and CLI orchestration in `cli`. +- `packages/core/src/xml/parser.ts` — SAX streaming parser with namespace normalization +- `packages/core/src/ast/builder.ts` — Stack-based XML-to-AST construction with section-emit pattern +- `packages/core/src/markdown/renderer.ts` — Stateless AST-to-Markdown conversion +- `packages/core/src/markdown/frontmatter.ts` — YAML frontmatter generation +- `packages/core/src/xml/namespace.ts` — Namespace constants and element classification sets +- `packages/usc/src/converter.ts` — Full USC conversion pipeline orchestrator +- `packages/usc/src/downloader.ts` — OLRC download logic, `CURRENT_RELEASE_POINT` constant +- `packages/cli/src/ui.ts` — Terminal output formatting (spinners, tables, summary blocks) +- `packages/cli/src/parse-titles.ts` — Title spec parser (`1-5,8,11`) + +## Tech Stack + +- **Runtime**: Node.js >= 20 LTS (ESM only) +- **Language**: TypeScript 5.x, strict mode +- **XML Parsing**: `saxes` (SAX streaming) +- **CLI**: `commander`, `chalk`, `ora`, `cli-table3` +- **YAML**: `yaml` package +- **Zip**: `yauzl` +- **Testing**: `vitest` +- **Build**: `tsup` +- **Linting**: ESLint + `@typescript-eslint` + Prettier +- **Versioning**: `@changesets/cli` with lockstep versioning ## Code Style -- Use TypeScript strict mode conventions already configured in the repo. -- Use ESM imports/exports only. 
-- Prefer `interface` for object shapes. -- Use `import type` for type-only imports. -- Avoid `any`; use `unknown` unless a justified exception is required. -- Add JSDoc for exported functions and types. -- Keep file naming and symbol naming consistent with existing conventions. + +- TypeScript strict mode: `strict: true`, `noUncheckedIndexedAccess: true`, `exactOptionalPropertyTypes: true` +- ESM imports/exports only (`"type": "module"` in all package.json files) +- Prefer `interface` over `type` for object shapes +- Use `import type` for type-only imports +- Avoid `any`; use `unknown` unless a justified exception is required with an eslint-disable comment +- Add JSDoc for all exported functions and types +- Barrel exports via `index.ts` in each package `src/` +- Files: `kebab-case.ts` +- Types/Interfaces: `PascalCase` +- Functions: `camelCase` +- Constants: `UPPER_SNAKE_CASE` +- Prettier: double quotes, trailing commas, 100 char print width + +## Error Handling + +- Use custom error classes extending `Error` with `cause` chaining +- XML parsing errors: warn and continue (don't crash on anomalous structures) +- File I/O errors: throw with context (file path, operation attempted) +- Never swallow errors silently — at minimum, log at `warn` level ## Testing Conventions -- Co-locate tests with implementation files (`*.test.ts`). -- Prefer descriptive test names. -- Preserve and intentionally update markdown snapshots when behavior changes. 
+ +- Co-locate tests with implementation files (`parser.ts` → `parser.test.ts`) +- Use `describe` blocks mirroring the module's exported API +- Name test cases descriptively: `it("converts with chapeau to indented bold-lettered paragraph")` +- Snapshot tests in `packages/usc/src/snapshot.test.ts` with expected output in `fixtures/expected/` +- Update snapshots intentionally: `cd packages/usc && pnpm exec vitest run --update` +- Fixtures: `fixtures/fragments/` (synthetic XML, committed), `fixtures/expected/` (snapshots, committed) +- Commit messages: [conventional commits](https://www.conventionalcommits.org/) (e.g., `feat(core):`, `fix(usc):`, `docs:`) + +## Key Design Decisions + +- **SAX over DOM**: Large titles exceed 100MB. SAX streaming keeps memory bounded. +- **Section as atomic unit**: Each section is its own Markdown file. Subsections render inline, not as separate files. +- **Collect-then-write**: Sections are collected during SAX streaming and written after the stream completes. +- **Frontmatter + sidecar**: YAML frontmatter on every .md file AND `_meta.json` per directory. +- **Notes are opt-in**: Default output includes only statutory text and source credits. Notes require CLI flags. +- **Token estimation**: character/4 heuristic in `_meta.json`. ## XML/USLM Pitfalls -- Treat XML as namespace-aware: XHTML tables are in `http://www.w3.org/1999/xhtml`. -- Do not assume strict legal hierarchy nesting in input XML. -- Handle anomalous/repealed/empty sections without crashing; output should still be produced when applicable. -- Handle interstitial `` and multi-paragraph `` correctly. + +- Treat XML as namespace-aware: XHTML tables are in `http://www.w3.org/1999/xhtml`, inline ``/`` are in the USLM namespace. +- Do not assume strict legal hierarchy nesting — the schema is intentionally permissive. +- Handle anomalous/repealed/empty sections without crashing; output should still be produced. 
+- Handle interstitial `` (between same-level elements, not just after sub-levels). +- Handle multi-paragraph `` (multiple `<p>` elements). +- `<section>` inside `<quotedContent>` must not emit standalone files — track `quotedContentDepth`. +- Some titles have duplicate section numbers — output disambiguated with `-2` suffix. + +## Output File Naming + +``` +output/usc/title-{NN}/chapter-{NN}/section-{N}.md +``` + +- Title dirs: zero-padded (`title-01` through `title-54`) +- Chapter dirs: zero-padded (`chapter-01`, `chapter-02`) +- Section files: NOT zero-padded, may be alphanumeric (`section-7801.md`, `section-106a.md`) +- Appendix titles: separate directories (`title-05-appendix/`) ## References -Link to source docs instead of duplicating details: - -- `CLAUDE.md` -- `CONTRIBUTING.md` -- `docs/architecture.md` -- `docs/extending.md` -- `docs/output-format.md` -- `docs/xml-element-reference.md` + +See these docs for deeper detail: + +- `CLAUDE.md` — Full USLM schema reference, identifier format, namespaces, notes taxonomy, status values, download URLs +- `CONTRIBUTING.md` — Setup, workflow, PR checklist, changesets +- `docs/architecture.md` — System overview, package design, data flow +- `docs/output-format.md` — Directory layout, frontmatter schema, metadata indexes, RAG guidance +- `docs/xml-element-reference.md` — Element-by-element conversion reference +- `docs/extending.md` — Guide for adding new legal source types diff --git a/.gitignore b/.gitignore index 7ee8b3f4..afceaec7 100644 --- a/.gitignore +++ b/.gitignore @@ -49,9 +49,7 @@ coverage/ # Claude Code runtime state (personal MCP config, settings, local overrides) /.claude/ - -# Shared MCP config (.mcp.json) is intentionally NOT ignored — it contains -# no secrets and is committed so all contributors get shared MCP servers.
+.mcp.json # Cursor .cursorignore diff --git a/.mcp.json b/.mcp.json deleted file mode 100644 index 4eef5b40..00000000 --- a/.mcp.json +++ /dev/null @@ -1,10 +0,0 @@ -{ - "mcpServers": { - "context7": { - "type": "stdio", - "command": "npx", - "args": ["-y", "@anthropic-ai/mcp-remote@latest", "https://mcp.context7.com/mcp"], - "env": {} - } - } -} diff --git a/CHANGELOG.md b/CHANGELOG.md index 2700336a..50935a49 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,13 @@ and this project adheres to [Conventional Commits](https://www.conventionalcommi ## [Unreleased] +## [0.7.1] + +### Changed + +- **Organization**: General repository maintenance and cleanup. + + ## [0.7.0] ### Added diff --git a/CLAUDE.md b/CLAUDE.md index bf730260..da1a89bd 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -27,18 +27,18 @@ law2md/ - **Runtime**: Node.js >= 20 LTS (ESM) - **Language**: TypeScript 5.x, strict mode, no `any` unless explicitly justified -- **XML Parsing**: `saxes` (SAX streaming) + `@xmldom/xmldom` (DOM for fragments) +- **XML Parsing**: `saxes` (SAX streaming) - **CLI**: `commander` -- **Validation**: `zod` +- **CLI Output**: `chalk`, `ora`, `cli-table3` - **YAML**: `yaml` package - **Zip**: `yauzl` - **Token Counting**: character/4 heuristic -- **Logging**: `pino` - **Testing**: `vitest` - **Build**: `tsup` - **Linting**: ESLint + `@typescript-eslint` -- **Formatting**: Prettier +- **Formatting**: Prettier (double quotes, trailing commas, 100 char print width) - **Monorepo**: Turborepo + pnpm workspaces +- **Versioning**: `@changesets/cli` with lockstep versioning across all packages ## Build & Dev Commands @@ -68,9 +68,11 @@ pnpm turbo lint pnpm turbo dev # Run the CLI locally during development -node packages/cli/dist/index.js convert ./downloads/usc/xml/usc01.xml -o ./test-output -node packages/cli/dist/index.js convert --titles 1-5 -o ./test-output +node packages/cli/dist/index.js download --all node packages/cli/dist/index.js download --titles 1 +node 
packages/cli/dist/index.js convert --all +node packages/cli/dist/index.js convert --titles 1-5 -o ./test-output +node packages/cli/dist/index.js convert ./downloads/usc/xml/usc01.xml -o ./test-output ``` ## Code Conventions @@ -251,7 +253,7 @@ https://uscode.house.gov/download/releasepoints/us/pl/{congress}/{law}/xml_uscAl Where `{NN}` is zero-padded title number (01-54), `{congress}` is Congress number, `{law}` is public law number. -Example (current as of early 2026): `xml_usc01@119-73not60.zip` +Example: `xml_usc01@119-73not60.zip` Note: Release points can include exclusion suffixes (e.g., `119-73not60` means "through PL 119-73, excluding PL 119-60"). The current release point is hardcoded in `packages/usc/src/downloader.ts` as `CURRENT_RELEASE_POINT`. @@ -267,10 +269,12 @@ output/usc/title-{NN}/chapter-{NN}/section-{N}.md - Chapter dirs: `chapter-01`, `chapter-02`, etc. (zero-padded to 2 digits) - Section files: `section-{N}.md` where N is the section number (NOT zero-padded, since section numbers can be alphanumeric like `section-7801`) - Subchapter dirs nest inside chapter dirs when present +- Appendix titles: separate directories (e.g., `title-05-appendix/`) for titles 5, 11, 18, 28 +- Duplicate sections: disambiguated with `-2`, `-3` suffix (e.g., `section-3598.md`, `section-3598-2.md`) ## Key Design Decisions -1. **SAX over DOM**: Large titles (26, 42) can exceed 100MB XML. SAX streaming keeps memory bounded. DOM is used only for small fragment inspection. +1. **SAX over DOM**: Large titles (26, 42) can exceed 100MB XML. SAX streaming keeps memory bounded. DOM is not used. 2. **Section as the atomic unit**: A section is the smallest citable legal unit in the U.S. Code. Subsections, paragraphs, etc. are rendered within the section file, not as separate files. @@ -284,11 +288,11 @@ output/usc/title-{NN}/chapter-{NN}/section-{N}.md 7. **Footnotes**: Rendered as Markdown footnotes (`[^N]` at reference site, `[^N]: text` at bottom of section file). -8. 
**Appendix titles**: Separate output directories (e.g., `title-05-appendix/`) for titles with appendices (5, 11, 18, 28). +8. **Token estimation**: Uses character/4 heuristic for token counts in `_meta.json`. Precise `tiktoken`-based counting is a planned enhancement. -9. **Token estimation**: Uses character/4 heuristic for token counts in `_meta.json`. +9. **Table of Disposition**: Excluded from section-level output. Included in title-level README.md. -10. **Table of Disposition**: Excluded from section-level output. Included in title-level README.md. +10. **Collect-then-write pattern**: Sections are collected during SAX streaming and written after the stream completes, avoiding async issues during SAX event processing. ## Common Pitfalls @@ -301,6 +305,8 @@ output/usc/title-{NN}/chapter-{NN}/section-{N}.md - **Permissive content model**: `` uses `processContents="lax"` with `namespace="##any"` — it can contain elements from any namespace, including embedded XHTML. The SAX parser must handle unexpected elements gracefully. - **`` is interstitial**: Not just "after sub-levels" but also between elements of the same level. Handle as a text block in whatever position it appears. - **Element versioning**: Elements can have `@startPeriod`/`@endPeriod`/`@status` for point-in-time variants. Multiple versions of the same element may coexist in the document. +- **Quoted content sections**: `<section>` elements inside `<quotedContent>` (quoted bills in statutory notes) must not be emitted as standalone files. Track `quotedContentDepth` to suppress emission. +- **Duplicate section numbers**: Some titles have multiple sections with the same number within a chapter (e.g., Title 5). Output files are disambiguated with `-2` suffixes. ## When Adding New Source Types (CFR, State Statutes) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index e70c0786..f1c27f59 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,6 +1,6 @@ # Contributing to law2md -Thanks for your interest in contributing to law2md! This guide covers the basics for getting set up and submitting changes. +Thanks for your interest in contributing! This guide covers everything you need to get set up and submit changes. ## Prerequisites @@ -16,6 +16,12 @@ pnpm install pnpm turbo build ``` +Verify everything is working: + +```bash +pnpm turbo test && pnpm turbo lint && pnpm turbo typecheck +``` + ## Development Workflow ### Common Commands @@ -31,17 +37,19 @@ pnpm turbo dev # Watch mode (rebuild on change) To scope commands to a single package: ```bash -pnpm turbo test --filter=@law2md/core +pnpm turbo build --filter=@law2md/core pnpm turbo test --filter=@law2md/usc pnpm turbo test --filter=law2md ``` ### Running the CLI Locally +After building, run the CLI directly from the dist output: + ```bash -node packages/cli/dist/index.js convert path/to/usc01.xml -o ./output -node packages/cli/dist/index.js download --titles 1 # saves to ./downloads/usc/xml/ -node packages/cli/dist/index.js convert --titles 1-5 # convert multiple titles +node packages/cli/dist/index.js download --titles 1 +node packages/cli/dist/index.js convert --titles 1-5 -o ./output +node packages/cli/dist/index.js convert ./downloads/usc/xml/usc01.xml -o ./output ``` ### Formatting @@ -51,6 +59,8 @@ pnpm format # Auto-format all files pnpm format:check # Check formatting without writing ``` +Formatting is enforced by Prettier (double quotes, trailing
commas, 100 char print width). + ## Project Structure ``` @@ -60,21 +70,32 @@ packages/ cli/ law2md — CLI entry point (the published npm package) ``` -The `core` package provides the general-purpose pipeline. The `usc` package adds U.S. Code-specific handling. The `cli` package wires everything together as a command-line tool. +The `core` package provides the general-purpose XML-to-Markdown pipeline. The `usc` package adds U.S. Code-specific handling. The `cli` package wires everything together as a command-line tool. Internal packages use `workspace:*` protocol for dependencies. ## Code Conventions -- **TypeScript strict mode** — `strict: true`, `noUncheckedIndexedAccess: true` +### TypeScript + +- **Strict mode** — `strict: true`, `noUncheckedIndexedAccess: true`, `exactOptionalPropertyTypes: true` - **ESM only** — all packages use `"type": "module"` - **`import type`** for type-only imports - **`interface`** over `type` for object shapes - **`unknown`** over `any` — if `any` is truly needed, add an eslint-disable comment with justification -- **Files**: `kebab-case.ts` -- **Types/Interfaces**: `PascalCase` -- **Functions**: `camelCase` -- **Constants**: `UPPER_SNAKE_CASE` -See [CLAUDE.md](CLAUDE.md) for the full conventions reference, USLM schema details, and design decisions. 
+### Naming + +| Category | Convention | Example | +|----------|-----------|---------| +| Files | `kebab-case.ts` | `ast-builder.ts` | +| Types / Interfaces | `PascalCase` | `SectionNode`, `ConvertOptions` | +| Functions | `camelCase` | `parseIdentifier`, `renderSection` | +| Constants | `UPPER_SNAKE_CASE` | `USLM_NAMESPACE` | + +### Error Handling + +- XML parsing errors: warn and continue (don't crash on anomalous structures) +- File I/O errors: throw with context (file path, operation attempted) +- Never swallow errors silently — at minimum, log at `warn` level ## Testing @@ -85,6 +106,12 @@ pnpm turbo test # Run all tests pnpm turbo test --filter=@law2md/usc # Run one package ``` +Name test cases descriptively: + +```ts +it("converts with chapeau to indented bold-lettered paragraph") +``` + ### Snapshot Tests Output stability is protected by snapshot tests in `packages/usc/src/snapshot.test.ts`. Expected output files live in `fixtures/expected/`. @@ -107,7 +134,10 @@ Review the diff in `fixtures/expected/` to confirm only intended changes, then c 1. Fork the repository and create a feature branch from `main` 2. Make your changes -3. Ensure all checks pass: `pnpm turbo build && pnpm turbo test && pnpm turbo lint && pnpm turbo typecheck` +3. Ensure all checks pass: + ```bash + pnpm turbo build && pnpm turbo test && pnpm turbo lint && pnpm turbo typecheck + ``` 4. Write descriptive commit messages using [conventional commits](https://www.conventionalcommits.org/) (e.g., `feat(core):`, `fix(usc):`, `docs:`) 5. 
Open a pull request against `main` diff --git a/README.md b/README.md index 58ef8c76..24a49348 100644 --- a/README.md +++ b/README.md @@ -1,50 +1,71 @@ # law2md -[![CI](https://img.shields.io/github/actions/workflow/status/chris-c-thomas/law2md/ci.yml?style=flat-square&label=CI)](https://github.com/chris-c-thomas/law2md/actions/workflows/ci.yml) [![npm](https://img.shields.io/npm/v/law2md?style=flat-square)](https://www.npmjs.com/package/law2md) +[![CI](https://img.shields.io/github/actions/workflow/status/chris-c-thomas/law2md/ci.yml?style=flat-square&label=CI)](https://github.com/chris-c-thomas/law2md/actions/workflows/ci.yml) +[![TypeScript](https://img.shields.io/badge/TypeScript-strict-blue?style=flat-square)](https://www.typescriptlang.org/) +[![Node](https://img.shields.io/node/v/law2md?style=flat-square)](https://nodejs.org/) [![license](https://img.shields.io/github/license/chris-c-thomas/law2md?style=flat-square)](LICENSE) -[![issues](https://img.shields.io/github/issues/chris-c-thomas/law2md?style=flat-square)](https://github.com/chris-c-thomas/law2md/issues) -[![pull requests](https://img.shields.io/github/issues-pr/chris-c-thomas/law2md?style=flat-square)](https://github.com/chris-c-thomas/law2md/pulls) -Convert the United States Code into structured Markdown for AI and RAG Systems. +CLI tool to download and convert the entire [United States Code](https://uscode.house.gov/) from official XML (USLM Schema) into structured Markdown that's optimized for AI ingestion, RAG pipelines, and semantic search. 
+ +## Table of Contents + +- [Overview](#overview) +- [Features](#features) +- [Installation](#installation) +- [Quick Start](#quick-start) +- [Usage](#usage) +- [Output](#output) +- [Performance](#performance) +- [Project Structure](#project-structure) +- [Development](#development) +- [Documentation](#documentation) +- [Data Sources](#data-sources) +- [Roadmap](#roadmap) +- [Contributing](#contributing) +- [License](#license) --- ## Overview -`law2md` is a command-line tool that converts the [XML files](https://uscode.house.gov/download/download.shtml) of the United States Code, published by the [Office of the Law Revision Counsel](https://uscode.house.gov/), into clean, structured Markdown optimized for AI ingestion, retrieval-augmented generation (RAG), and legal research workflows. - -The U.S. Code comprises 54 titles of federal statutory law. The official XML is deeply nested, laden with presentation markup, and difficult to work with directly. `law2md` transforms this XML into per-section, or optional per-chapter, Markdown files with YAML frontmatter, predictable file paths, and content sized for typical embedding models. +The U.S. Code comprises 54 titles of federal statutory law. The [Office of the Law Revision Counsel](https://uscode.house.gov/about_office.xhtml) (OLRC) publishes the official text as [deeply nested XML](https://uscode.house.gov/download/download.shtml) using the [United States Legislative Markup](https://uscode.house.gov/download/resources/USLM-User-Guide.pdf) (USLM) schema. These files are dense, laden with presentation markup, and difficult to work with directly. -The [OLRC](https://uscode.house.gov/about_office.xhtml) provides a user guide for the [United States Legislative Markup](https://uscode.house.gov/download/resources/USLM-User-Guide.pdf). 
+`law2md` transforms this XML into per-section Markdown files with YAML frontmatter, predictable file paths, and content sized for typical embedding model context windows — making the entire U.S. Code accessible to LLMs, vector databases, and legal research tools. +## Features -### Features - -- **Built-in downloader** -- fetch individual titles or the entire U.S. Code directly from OLRC -- **Streaming SAX parser** -- processes XML files of any size (including 100MB+ titles) with bounded memory -- **Section-level output** -- each section becomes its own Markdown file, sized for RAG chunk windows -- **Chapter-level output** -- optional mode that inlines all sections into per-chapter files -- **YAML frontmatter** -- structured metadata on every file (identifier, title, chapter, section, status, source credit) -- **Structural fidelity** -- preserves the full USLM hierarchy using bold inline numbering that mirrors legal citation convention -- **Cross-reference links** -- resolved as relative links within the corpus, or as OLRC website URLs -- **Filterable notes** -- editorial notes, statutory notes, and amendment history can be selectively included or excluded -- **Metadata indexes** -- `_meta.json` sidecar files with section listings and token estimates -- **Tables** -- XHTML tables and USLM layout tables converted to Markdown pipe tables -- **Dry-run mode** -- preview conversion stats without writing files -- **Appendix handling** -- titles with appendices (5, 11, 18, 28) output to separate directories +- **Built-in downloader** — fetch individual titles or the entire U.S. 
Code directly from OLRC +- **Streaming SAX parser** — handles XML files of any size (100MB+) with bounded memory +- **Section-level output** — each section becomes its own Markdown file, sized for RAG chunk windows +- **Chapter-level output** — optional mode that inlines all sections into per-chapter files +- **YAML frontmatter** — structured metadata on every file (identifier, title, chapter, section, status, source credit) +- **Structural fidelity** — preserves the full USLM hierarchy using bold inline numbering that mirrors legal citation conventions +- **Cross-reference links** — resolved as relative links within the corpus, or as OLRC website URLs +- **Filterable notes** — editorial notes, statutory notes, and amendment history can be selectively included or excluded +- **Metadata indexes** — `_meta.json` sidecar files with section listings and token estimates +- **Tables** — XHTML tables and USLM layout tables converted to Markdown pipe tables +- **Dry-run mode** — preview conversion stats without writing files +- **Appendix handling** — titles with appendices (5, 11, 18, 28) output to separate directories --- -## Installation +## Install -### From npm +### npm ```bash npm install -g law2md ``` -### From source +### npx + +```bash +npx law2md download --all +npx law2md convert --all +``` + +### Source Requires [Node.js](https://nodejs.org/) >= 20 and [pnpm](https://pnpm.io/) >= 10. @@ -76,7 +97,7 @@ law2md download --titles 1-5 && law2md convert --titles 1-5 ### Download -Fetch U.S. Code XML files directly from the Office of the Law Revision Counsel: +Fetch U.S. 
Code XML files directly from the OLRC: ```bash # Download a single title @@ -88,7 +109,7 @@ law2md download --titles 1-5 # Download specific titles (mixed) law2md download --titles 1-5,8,11 -# Download all 54 titles +# Download all 54 titles (uses a single bulk zip) law2md download --all # Use a specific release point @@ -106,7 +127,7 @@ law2md convert --all # Convert a single XML file law2md convert ./downloads/usc/xml/usc01.xml -o ./output -# Convert by title number (uses default input directory) +# Convert by title number law2md convert --titles 1 # Convert multiple titles @@ -115,29 +136,27 @@ law2md convert --titles 1-5,8,11 # Convert with a custom input directory law2md convert --titles 1-5 -i ./my-xml-files -# Chapter-level output -law2md convert ./downloads/usc/xml/usc01.xml -o ./output -g chapter +# Chapter-level output (one file per chapter) +law2md convert --titles 1 -o ./output -g chapter # Cross-reference links resolved to OLRC URLs -law2md convert ./downloads/usc/xml/usc05.xml -o ./output --link-style canonical +law2md convert --titles 5 -o ./output --link-style canonical # Include only amendment notes -law2md convert ./downloads/usc/xml/usc01.xml -o ./output --include-amendments +law2md convert --titles 1 -o ./output --include-amendments # Exclude all notes -law2md convert ./downloads/usc/xml/usc01.xml -o ./output --no-include-notes +law2md convert --titles 1 -o ./output --no-include-notes -# Dry-run: preview stats without writing files -law2md convert ./downloads/usc/xml/usc42.xml -o ./output --dry-run +# Dry run — preview stats without writing files +law2md convert --titles 42 --dry-run ``` ### CLI Reference -```bash -law2md convert [input] [options] ``` +law2md convert [input] [options] -```text Arguments: input Path to a USC XML file (optional if --titles or --all is used) @@ -147,7 +166,7 @@ Options: or mixed (1-5,8,11) --all Convert all downloaded titles found in --input-dir - -i, --input-dir Input directory for XML files + -i, --input-dir 
Input directory for XML files (default: "./downloads/usc/xml") -o, --output Output directory (default: "./output") -g, --granularity "section" or "chapter" (default: "section") @@ -161,14 +180,17 @@ Options: --dry-run Parse and report without writing files -v, --verbose Enable verbose logging -h, --help Display help +``` +``` law2md download [options] Options: --titles Title(s) to download: single (1), range (1-5), or mixed (1-5,8,11) --all Download all 54 titles - -o, --output Output directory (default: "./downloads/usc/xml") + -o, --output Output directory + (default: "./downloads/usc/xml") --release-point OLRC release point (default: current) -h, --help Display help ``` @@ -177,11 +199,11 @@ When multiple `--include-*-notes` flags are specified, they combine additively. --- -## Output Format +## Output ### Directory Structure -```text +``` output/ usc/ title-01/ @@ -218,7 +240,7 @@ positive_law: true currency: "119-73" last_updated: "2025-12-03" format_version: "1.0.0" -generator: "law2md@0.5.0" +generator: "law2md@0.7.0" source_credit: "(Added Pub. L. 104-199, § 3(a), Sept. 21, 1996, ...)" --- ``` @@ -237,11 +259,11 @@ Columbia, the Commonwealth of Puerto Rico, or any other territory... **Source Credit**: (Added Pub. L. 104-199, § 3(a), Sept. 21, 1996, ...) ``` -Subsections and below use bold inline numbering (`**(a)**`, `**(1)**`, `**(A)**`, `**(i)**`) rather than Markdown headings. This preserves a flat document structure optimized for embedding models and chunking strategies. +Subsections and below use bold inline numbering (`**(a)**`, `**(1)**`, `**(A)**`, `**(i)**`) rather than Markdown headings, preserving a flat document structure optimized for embedding models and chunking strategies. 
### Metadata Indexes -Each title directory includes a `_meta.json` file for programmatic access: +Each directory includes a `_meta.json` sidecar file for programmatic access: ```json { @@ -280,21 +302,37 @@ For the complete output format specification, see [docs/output-format.md](docs/o --- +## Performance + +The full U.S. Code — all 54 titles (53 with content; Title 53 is reserved), over 60,000 sections, ~85 million estimated tokens — converts in under 20 seconds on a modern machine. SAX streaming keeps memory bounded even for the largest titles: + +| Title | XML Size | Sections | ~Tokens | Duration | +|-------|----------|----------|---------|----------| +| Title 1 - General Provisions | 0.3 MB | 39 | 35K | 0.04s | +| Title 10 - Armed Forces | 50.7 MB | 3,847 | 6.0M | 1.4s | +| Title 26 - Internal Revenue Code | 53.2 MB | 2,160 | 6.4M | 1.1s | +| Title 42 - Public Health | 107.3 MB | 8,460 | 14.7M | 2.7s | +| **All 54 titles** | **~650 MB** | **60,215** | **~85M** | **~18s** | + +--- + ## Project Structure -```text +``` law2md/ packages/ - core/ @law2md/core -- XML parsing, AST, Markdown rendering - usc/ @law2md/usc -- U.S. Code conversion logic and downloader - cli/ law2md -- CLI entry point (the published npm package) + core/ @law2md/core — XML parsing, AST, Markdown rendering + usc/ @law2md/usc — U.S. Code downloader and conversion logic + cli/ law2md — CLI entry point fixtures/ - fragments/ Small XML snippets for unit tests + fragments/ XML snippets for unit tests expected/ Expected output snapshots - docs/ Architecture, output format spec, extension guide + docs/ Architecture, XML reference, output format, extending ``` -The project is a monorepo managed with [pnpm](https://pnpm.io/) workspaces and [Turborepo](https://turbo.build/). The separation into `core` and `usc` packages is designed to support additional legal source types (CFR, state statutes) by adding new packages that share the core infrastructure.
+The project is a monorepo managed with [pnpm](https://pnpm.io/) workspaces and [Turborepo](https://turbo.build/). + +The separation into `core` and `usc` packages is designed to support additional legal source types (CFR, state statutes) in the future by adding new packages that share the core infrastructure. --- @@ -303,9 +341,10 @@ The project is a monorepo managed with [pnpm](https://pnpm.io/) workspaces and [ ```bash pnpm install # Install dependencies pnpm turbo build # Build all packages -pnpm turbo test # Run all tests +pnpm turbo test # Run all 176 tests pnpm turbo lint # Lint all packages pnpm turbo typecheck # Type-check all packages +pnpm turbo dev # Watch mode (rebuild on change) ``` See [CONTRIBUTING.md](CONTRIBUTING.md) for the full contributor guide. @@ -314,10 +353,12 @@ See [CONTRIBUTING.md](CONTRIBUTING.md) for the full contributor guide. ## Documentation -- [Output Format Specification](docs/output-format.md) -- directory layout, frontmatter schema, metadata indexes, RAG guidance -- [Architecture](docs/architecture.md) -- system overview, package design, data flow, memory profile -- [XML Element Reference](docs/xml-element-reference.md) -- USLM element mapping and Markdown output -- [Extending](docs/extending.md) -- guide for adding new legal source types +| Document | Description | +|----------|-------------| +| [Output Format](docs/output-format.md) | Directory layout, frontmatter schema, metadata indexes, RAG guidance | +| [Architecture](docs/architecture.md) | System overview, package design, data flow, memory profile | +| [XML Element Reference](docs/xml-element-reference.md) | USLM element mapping and Markdown output | +| [Extending](docs/extending.md) | Guide for adding new legal source types | --- @@ -325,10 +366,47 @@ See [CONTRIBUTING.md](CONTRIBUTING.md) for the full contributor guide. `law2md` processes XML published by the [Office of the Law Revision Counsel](https://uscode.house.gov/) (OLRC) of the U.S. House of Representatives. 
The XML uses the United States Legislative Markup (USLM) 1.0 schema. -The U.S. Code XML is public domain and freely available at [uscode.house.gov/download/download.shtml](https://uscode.house.gov/download/download.shtml). +The U.S. Code XML is **public domain** and freely available at [uscode.house.gov/download/download.shtml](https://uscode.house.gov/download/download.shtml). + +--- + +## Roadmap + +Features and enhancements that are currently planned. + +Feel free to open an [issue](https://github.com/chris-c-thomas/law2md/issues) or start a [discussion](https://github.com/chris-c-thomas/law2md/discussions) to talk about any of these. Ideas and contributions are always welcome. + +**Output** + +- [ ] Additional output formats — plain text, JSON, and JSONL +- [ ] Precise token counting via `tiktoken` (`--precise-tokens`) +- [ ] Section diff between OLRC release points + +**Sources** + +- [ ] Code of Federal Regulations (CFR) +- [ ] State statutes +- [ ] Incremental update support for new OLRC release points + +**Metadata** + +- [ ] Parent path metadata — full structural ancestry per section +- [ ] Related sections — sibling references for contextual RAG retrieval +- [ ] Cross-reference graph export (JSON/GraphML) + +**Tooling** + +- [ ] MCP server for AI-assisted legal research +- [ ] Embedding pipeline integration + +--- + +## Contributing + +Please see [CONTRIBUTING.md](CONTRIBUTING.md) for setup instructions, code conventions, testing guidelines, and the PR checklist. --- ## License -MIT. See [LICENSE](LICENSE). 
+[MIT](LICENSE) diff --git a/packages/cli/CHANGELOG.md b/packages/cli/CHANGELOG.md index b2608b4d..7b5714dd 100644 --- a/packages/cli/CHANGELOG.md +++ b/packages/cli/CHANGELOG.md @@ -1,5 +1,17 @@ # law2md +## 0.8.0 + +### Minor Changes + +- Cleanup repo + +### Patch Changes + +- Updated dependencies + - @law2md/core@0.8.0 + - @law2md/usc@0.8.0 + ## 0.7.0 ### Minor Changes diff --git a/packages/cli/package.json b/packages/cli/package.json index 83f480bb..383668d3 100644 --- a/packages/cli/package.json +++ b/packages/cli/package.json @@ -1,6 +1,6 @@ { "name": "law2md", - "version": "0.7.0", + "version": "0.8.0", "description": "Convert U.S. legislative XML (USLM) to structured Markdown for AI/RAG ingestion", "type": "module", "main": "./dist/index.js", diff --git a/packages/core/CHANGELOG.md b/packages/core/CHANGELOG.md index 5033151f..fb14d8b6 100644 --- a/packages/core/CHANGELOG.md +++ b/packages/core/CHANGELOG.md @@ -1,5 +1,11 @@ # @law2md/core +## 0.8.0 + +### Minor Changes + +- Cleanup repo + ## 0.7.0 ### Minor Changes diff --git a/packages/core/package.json b/packages/core/package.json index 7f19e206..424e574c 100644 --- a/packages/core/package.json +++ b/packages/core/package.json @@ -1,6 +1,6 @@ { "name": "@law2md/core", - "version": "0.7.0", + "version": "0.8.0", "description": "Core XML parsing, AST, and Markdown rendering for law2md", "type": "module", "main": "./dist/index.js", diff --git a/packages/usc/CHANGELOG.md b/packages/usc/CHANGELOG.md index 8fc50889..aebfc4bf 100644 --- a/packages/usc/CHANGELOG.md +++ b/packages/usc/CHANGELOG.md @@ -1,5 +1,16 @@ # @law2md/usc +## 0.8.0 + +### Minor Changes + +- Cleanup repo + +### Patch Changes + +- Updated dependencies + - @law2md/core@0.8.0 + ## 0.7.0 ### Minor Changes diff --git a/packages/usc/package.json b/packages/usc/package.json index 3db8521e..242d8a69 100644 --- a/packages/usc/package.json +++ b/packages/usc/package.json @@ -1,6 +1,6 @@ { "name": "@law2md/usc", - "version": "0.7.0", + "version": 
"0.8.0", "description": "U.S. Code-specific element handlers and downloader for law2md", "type": "module", "main": "./dist/index.js",