Merged
143 changes: 114 additions & 29 deletions .github/copilot-instructions.md
@@ -1,10 +1,16 @@
# law2md Workspace Instructions

## Scope
These instructions apply to all work in this repository. Keep changes minimal, targeted, and consistent with existing package boundaries.

These instructions apply to all work in this repository. Keep changes minimal, targeted, and consistent with existing package boundaries. See `CLAUDE.md` for the full USLM schema reference, design decisions, and common pitfalls.

## Project Overview

`law2md` converts U.S. legislative XML (USLM schema) into structured Markdown for AI/RAG ingestion. It is a monorepo built with Turborepo, pnpm workspaces, TypeScript, and Node.js.

## Build and Test
Run commands from the repository root.

Run commands from the repository root. Always use `pnpm`, not `npm`.

```bash
pnpm install
@@ -14,47 +20,126 @@ pnpm turbo typecheck
pnpm turbo lint
```

Useful package-scoped pattern:
Package-scoped pattern:

```bash
pnpm turbo <task> --filter=@law2md/core
pnpm turbo <task> --filter=@law2md/usc
pnpm turbo <task> --filter=law2md
```

Run the CLI locally during development:

```bash
node packages/cli/dist/index.js download --titles 1
node packages/cli/dist/index.js convert --all
node packages/cli/dist/index.js convert --titles 1-5 -o ./test-output
```

## Architecture

This is a Turborepo + pnpm monorepo with three packages:

- `packages/core` (`@law2md/core`): namespace-aware XML parsing, AST building, markdown rendering, shared utilities.
- `packages/usc` (`@law2md/usc`): USC-specific conversion and OLRC downloading logic.
- `packages/cli` (`law2md`): CLI commands (`convert`, `download`) and user-facing command surface.
- `packages/core` (`@law2md/core`): namespace-aware XML parsing (SAX via `saxes`), AST building, Markdown rendering, frontmatter generation, shared utilities.
- `packages/usc` (`@law2md/usc`): USC-specific conversion pipeline and OLRC downloader. Contains `convertTitle()` which orchestrates ReadStream → SAX → AST → Markdown → file writer.
- `packages/cli` (`law2md`): CLI commands (`convert`, `download`), terminal UI (`chalk`, `ora`, `cli-table3`), and user-facing command surface.

Respect boundaries: keep generic parsing/rendering logic in `core`, USC-specific behavior in `usc`, and CLI orchestration in `cli`. Internal packages use `workspace:*` protocol for dependencies.

### Key files

Respect boundaries: keep generic parsing/rendering logic in `core`, USC-specific behavior in `usc`, and CLI orchestration in `cli`.
- `packages/core/src/xml/parser.ts` — SAX streaming parser with namespace normalization
- `packages/core/src/ast/builder.ts` — Stack-based XML-to-AST construction with section-emit pattern
- `packages/core/src/markdown/renderer.ts` — Stateless AST-to-Markdown conversion
- `packages/core/src/markdown/frontmatter.ts` — YAML frontmatter generation
- `packages/core/src/xml/namespace.ts` — Namespace constants and element classification sets
- `packages/usc/src/converter.ts` — Full USC conversion pipeline orchestrator
- `packages/usc/src/downloader.ts` — OLRC download logic, `CURRENT_RELEASE_POINT` constant
- `packages/cli/src/ui.ts` — Terminal output formatting (spinners, tables, summary blocks)
- `packages/cli/src/parse-titles.ts` — Title spec parser (`1-5,8,11`)
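
The title spec format (`1-5,8,11`) can be illustrated with a minimal sketch. The function name `parseTitleSpec`, its error messages, and the choice to deduplicate and sort are assumptions for illustration; the actual code in `packages/cli/src/parse-titles.ts` is not shown in this diff and may behave differently.

```typescript
// Hypothetical sketch of a title spec parser; not the project's implementation.
// Accepts comma-separated numbers and inclusive ranges, e.g. "1-5,8,11".
function parseTitleSpec(spec: string): number[] {
  const titles = new Set<number>();
  for (const part of spec.split(",")) {
    const token = part.trim();
    const match = /^(\d+)(?:-(\d+))?$/.exec(token);
    if (match === null) {
      throw new Error(`Invalid title spec segment: "${token}"`);
    }
    const start = Number(match[1]);
    const end = match[2] === undefined ? start : Number(match[2]);
    if (start > end) {
      throw new Error(`Descending range in title spec: "${token}"`);
    }
    for (let n = start; n <= end; n++) {
      titles.add(n);
    }
  }
  // Deduplicated and sorted ascending (an assumption about desired output).
  return Array.from(titles).sort((a, b) => a - b);
}
```

Under these assumptions, `parseTitleSpec("1-5,8,11")` returns `[1, 2, 3, 4, 5, 8, 11]`.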

## Tech Stack

- **Runtime**: Node.js >= 20 LTS (ESM only)
- **Language**: TypeScript 5.x, strict mode
- **XML Parsing**: `saxes` (SAX streaming)
- **CLI**: `commander`, `chalk`, `ora`, `cli-table3`
- **YAML**: `yaml` package
- **Zip**: `yauzl`
- **Testing**: `vitest`
- **Build**: `tsup`
- **Linting**: ESLint + `@typescript-eslint` + Prettier
- **Versioning**: `@changesets/cli` with lockstep versioning

## Code Style
- Use TypeScript strict mode conventions already configured in the repo.
- Use ESM imports/exports only.
- Prefer `interface` for object shapes.
- Use `import type` for type-only imports.
- Avoid `any`; use `unknown` unless a justified exception is required.
- Add JSDoc for exported functions and types.
- Keep file naming and symbol naming consistent with existing conventions.

- TypeScript strict mode: `strict: true`, `noUncheckedIndexedAccess: true`, `exactOptionalPropertyTypes: true`
- ESM imports/exports only (`"type": "module"` in all package.json files)
- Prefer `interface` over `type` for object shapes
- Use `import type` for type-only imports
- Avoid `any`; use `unknown` unless a justified exception is required with an eslint-disable comment
- Add JSDoc for all exported functions and types
- Barrel exports via `index.ts` in each package `src/`
- Files: `kebab-case.ts`
- Types/Interfaces: `PascalCase`
- Functions: `camelCase`
- Constants: `UPPER_SNAKE_CASE`
- Prettier: double quotes, trailing commas, 100 char print width

## Error Handling

- Use custom error classes extending `Error` with `cause` chaining
- XML parsing errors: warn and continue (don't crash on anomalous structures)
- File I/O errors: throw with context (file path, operation attempted)
- Never swallow errors silently — at minimum, log at `warn` level
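
One way to read the "custom error classes with `cause` chaining" and "throw with context" rules together is sketched below. The class name `FileIoError`, its fields, and the helper `wrapIoError` are hypothetical and do not come from the repository; the sketch assumes the ES2022 `Error` constructor options, which Node.js >= 20 supports.

```typescript
// Hypothetical error class; names and fields are illustrative only.
class FileIoError extends Error {
  readonly filePath: string;
  readonly operation: string;

  constructor(filePath: string, operation: string, options?: { cause?: unknown }) {
    // Passing `options` through lets the runtime record `cause` on the error.
    super(`Failed to ${operation} ${filePath}`, options);
    this.name = "FileIoError";
    this.filePath = filePath;
    this.operation = operation;
  }
}

// Wraps a low-level failure with the file path and the operation attempted.
function wrapIoError(filePath: string, operation: string, cause: unknown): FileIoError {
  return new FileIoError(filePath, operation, { cause });
}
```

A caller would typically catch the underlying error, then `throw wrapIoError(path, "write", err)` so the original failure stays reachable through `error.cause`.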

## Testing Conventions
- Co-locate tests with implementation files (`*.test.ts`).
- Prefer descriptive test names.
- Preserve and intentionally update markdown snapshots when behavior changes.

- Co-locate tests with implementation files (`parser.ts` → `parser.test.ts`)
- Use `describe` blocks mirroring the module's exported API
- Name test cases descriptively: `it("converts <subsection> with chapeau to indented bold-lettered paragraph")`
- Snapshot tests in `packages/usc/src/snapshot.test.ts` with expected output in `fixtures/expected/`
- Update snapshots intentionally: `cd packages/usc && pnpm exec vitest run --update`
- Fixtures: `fixtures/fragments/` (synthetic XML, committed), `fixtures/expected/` (snapshots, committed)
- Commit messages: [conventional commits](https://www.conventionalcommits.org/) (e.g., `feat(core):`, `fix(usc):`, `docs:`)

## Key Design Decisions

- **SAX over DOM**: Large titles exceed 100MB. SAX streaming keeps memory bounded.
- **Section as atomic unit**: Each section is its own Markdown file. Subsections render inline, not as separate files.
- **Collect-then-write**: Sections are collected during SAX streaming and written after the stream completes.
- **Frontmatter + sidecar**: YAML frontmatter on every .md file AND `_meta.json` per directory.
- **Notes are opt-in**: Default output includes only statutory text and source credits. Notes require CLI flags.

Copilot AI Mar 4, 2026


The “Notes are opt-in” bullet doesn’t match current CLI behavior: convert defaults to including notes (--include-notes is true by default, with --no-include-notes to exclude). Consider rewording this to reflect that selective note categories are opt-in, or change it to “notes are included by default”.

Suggested change
- **Notes are opt-in**: Default output includes only statutory text and source credits. Notes require CLI flags.
- **Notes are included by default**: `convert` emits notes unless `--no-include-notes` is passed; selective note categories are opt-in via CLI flags.

- **Token estimation**: character/4 heuristic in `_meta.json`.
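
The character/4 heuristic amounts to dividing the character count by four. A minimal sketch follows; the function name `estimateTokens` and the choice to round up are assumptions, since the diff does not say how the project rounds.

```typescript
// Hypothetical helper: estimates tokens as characters / 4, rounded up (assumed).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}
```

For example, a 10,000-character section would be estimated at 2,500 tokens under this heuristic.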

## XML/USLM Pitfalls
- Treat XML as namespace-aware: XHTML tables are in `http://www.w3.org/1999/xhtml`.
- Do not assume strict legal hierarchy nesting in input XML.
- Handle anomalous/repealed/empty sections without crashing; output should still be produced when applicable.
- Handle interstitial `<continuation>` and multi-paragraph `<content>` correctly.

- Treat XML as namespace-aware: XHTML tables are in `http://www.w3.org/1999/xhtml`, inline `<b>`/`<i>` are in the USLM namespace.
- Do not assume strict legal hierarchy nesting — the schema is intentionally permissive.
- Handle anomalous/repealed/empty sections without crashing; output should still be produced.
- Handle interstitial `<continuation>` (between same-level elements, not just after sub-levels).
- Handle multi-paragraph `<content>` (multiple `<p>` elements).
- `<section>` inside `<quotedContent>` must not emit standalone files — track `quotedContentDepth`.
- Some titles have duplicate section numbers — output disambiguated with `-2` suffix.
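
The duplicate-number rule can be sketched as a per-directory counter. The factory name `makeSectionFileNamer` is hypothetical, and scoping one namer to one chapter directory is an assumption; the suffix pattern follows the `section-3598.md`, `section-3598-2.md` examples given in `CLAUDE.md`.

```typescript
// Hypothetical sketch: returns a namer that tracks section numbers already seen.
// Create one namer per output directory (assumed scope of the uniqueness check).
function makeSectionFileNamer(): (sectionNumber: string) => string {
  const seen = new Map<string, number>();
  return (sectionNumber: string): string => {
    const count = (seen.get(sectionNumber) ?? 0) + 1;
    seen.set(sectionNumber, count);
    // First occurrence keeps the plain name; later ones get -2, -3, ...
    return count === 1
      ? `section-${sectionNumber}.md`
      : `section-${sectionNumber}-${count}.md`;
  };
}
```

Calling the namer twice with `"3598"` yields `section-3598.md` and then `section-3598-2.md`.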

## Output File Naming

```
output/usc/title-{NN}/chapter-{NN}/section-{N}.md
```

- Title dirs: zero-padded (`title-01` through `title-54`)
- Chapter dirs: zero-padded (`chapter-01`, `chapter-02`)
- Section files: NOT zero-padded, may be alphanumeric (`section-7801.md`, `section-106a.md`)
- Appendix titles: separate directories (`title-05-appendix/`)
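
The padding rules above can be expressed as a small path builder. This is a sketch only: `sectionOutputPath` and `pad2` are hypothetical names, chapters are assumed to be numeric, and appendix titles, subchapter nesting, and duplicate suffixes are left out.

```typescript
// Zero-pads a number to two digits, e.g. 1 -> "01".
function pad2(n: number): string {
  return String(n).padStart(2, "0");
}

// Hypothetical builder for output/usc/title-{NN}/chapter-{NN}/section-{N}.md.
// The section number is a string because it may be alphanumeric and is not padded.
function sectionOutputPath(
  titleNumber: number,
  chapterNumber: number,
  sectionNumber: string,
): string {
  return `output/usc/title-${pad2(titleNumber)}/chapter-${pad2(chapterNumber)}/section-${sectionNumber}.md`;
}
```

With these assumptions, `sectionOutputPath(1, 1, "106a")` gives `output/usc/title-01/chapter-01/section-106a.md`.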

## References
Link to source docs instead of duplicating details:

- `CLAUDE.md`
- `CONTRIBUTING.md`
- `docs/architecture.md`
- `docs/extending.md`
- `docs/output-format.md`
- `docs/xml-element-reference.md`

See these docs for deeper detail:

- `CLAUDE.md` — Full USLM schema reference, identifier format, namespaces, notes taxonomy, status values, download URLs
- `CONTRIBUTING.md` — Setup, workflow, PR checklist, changesets
- `docs/architecture.md` — System overview, package design, data flow
- `docs/output-format.md` — Directory layout, frontmatter schema, metadata indexes, RAG guidance
- `docs/xml-element-reference.md` — Element-by-element conversion reference
- `docs/extending.md` — Guide for adding new legal source types
4 changes: 1 addition & 3 deletions .gitignore
@@ -49,9 +49,7 @@ coverage/

# Claude Code runtime state (personal MCP config, settings, local overrides)
/.claude/

# Shared MCP config (.mcp.json) is intentionally NOT ignored — it contains
# no secrets and is committed so all contributors get shared MCP servers.
.mcp.json

# Cursor
.cursorignore
10 changes: 0 additions & 10 deletions .mcp.json

This file was deleted.

7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,13 @@ and this project adheres to [Conventional Commits](https://www.conventionalcommi

## [Unreleased]

## [0.7.1]

### Changed

- **Organization**: General repository maintenance and cleanup.

Comment on lines +10 to +15

Copilot AI Mar 4, 2026


Root CHANGELOG adds a 0.7.1 entry, but all packages in this PR are bumped to 0.8.0 with their own 0.8.0 changelog entries. This mixed versioning is likely to confuse readers/users; consider updating the root changelog to 0.8.0 as well (or clarify what the root changelog versioning represents).


## [0.7.0]

### Added
28 changes: 17 additions & 11 deletions CLAUDE.md
@@ -27,18 +27,18 @@ law2md/

- **Runtime**: Node.js >= 20 LTS (ESM)
- **Language**: TypeScript 5.x, strict mode, no `any` unless explicitly justified
- **XML Parsing**: `saxes` (SAX streaming) + `@xmldom/xmldom` (DOM for fragments)
- **XML Parsing**: `saxes` (SAX streaming)
- **CLI**: `commander`
- **Validation**: `zod`
- **CLI Output**: `chalk`, `ora`, `cli-table3`
- **YAML**: `yaml` package
- **Zip**: `yauzl`
- **Token Counting**: character/4 heuristic
- **Logging**: `pino`
- **Testing**: `vitest`
- **Build**: `tsup`
- **Linting**: ESLint + `@typescript-eslint`
- **Formatting**: Prettier
- **Formatting**: Prettier (double quotes, trailing commas, 100 char print width)
- **Monorepo**: Turborepo + pnpm workspaces
- **Versioning**: `@changesets/cli` with lockstep versioning across all packages

## Build & Dev Commands

@@ -68,9 +68,11 @@ pnpm turbo lint
pnpm turbo dev

# Run the CLI locally during development
node packages/cli/dist/index.js convert ./downloads/usc/xml/usc01.xml -o ./test-output
node packages/cli/dist/index.js convert --titles 1-5 -o ./test-output
node packages/cli/dist/index.js download --all
node packages/cli/dist/index.js download --titles 1
node packages/cli/dist/index.js convert --all
node packages/cli/dist/index.js convert --titles 1-5 -o ./test-output
node packages/cli/dist/index.js convert ./downloads/usc/xml/usc01.xml -o ./test-output
```

## Code Conventions
@@ -251,7 +253,7 @@ https://uscode.house.gov/download/releasepoints/us/pl/{congress}/{law}/xml_uscAl

Where `{NN}` is zero-padded title number (01-54), `{congress}` is Congress number, `{law}` is public law number.

Example (current as of early 2026): `xml_usc01@119-73not60.zip`
Example: `xml_usc01@119-73not60.zip`

Note: Release points can include exclusion suffixes (e.g., `119-73not60` means "through PL 119-73, excluding PL 119-60"). The current release point is hardcoded in `packages/usc/src/downloader.ts` as `CURRENT_RELEASE_POINT`.
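
The release point format described in the note can be sketched as a small parser plus a filename builder. All names here (`ReleasePoint`, `parseReleasePoint`, `zipFileName`) are hypothetical, and the sketch assumes at most one `not{law}` exclusion suffix; it does not reflect how `downloader.ts` actually handles `CURRENT_RELEASE_POINT`.

```typescript
// Hypothetical shape for a parsed release point such as "119-73not60".
interface ReleasePoint {
  congress: number;
  law: number;
  excludedLaw: number | null;
}

// Parses "{congress}-{law}" with an optional single "not{law}" suffix (assumed).
function parseReleasePoint(value: string): ReleasePoint {
  const match = /^(\d+)-(\d+)(?:not(\d+))?$/.exec(value);
  if (match === null) {
    throw new Error(`Unrecognized release point: "${value}"`);
  }
  return {
    congress: Number(match[1]),
    law: Number(match[2]),
    excludedLaw: match[3] === undefined ? null : Number(match[3]),
  };
}

// Builds a per-title zip name in the style of the example xml_usc01@119-73not60.zip.
function zipFileName(titleNumber: number, releasePoint: string): string {
  const paddedTitle = String(titleNumber).padStart(2, "0");
  return `xml_usc${paddedTitle}@${releasePoint}.zip`;
}
```

Here `parseReleasePoint("119-73not60")` yields congress 119, law 73, and excluded law 60, and `zipFileName(1, "119-73not60")` reproduces the example filename above.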

@@ -267,10 +269,12 @@ output/usc/title-{NN}/chapter-{NN}/section-{N}.md
- Chapter dirs: `chapter-01`, `chapter-02`, etc. (zero-padded to 2 digits)
- Section files: `section-{N}.md` where N is the section number (NOT zero-padded, since section numbers can be alphanumeric like `section-7801`)
- Subchapter dirs nest inside chapter dirs when present
- Appendix titles: separate directories (e.g., `title-05-appendix/`) for titles 5, 11, 18, 28
- Duplicate sections: disambiguated with `-2`, `-3` suffix (e.g., `section-3598.md`, `section-3598-2.md`)

## Key Design Decisions

1. **SAX over DOM**: Large titles (26, 42) can exceed 100MB XML. SAX streaming keeps memory bounded. DOM is used only for small fragment inspection.
1. **SAX over DOM**: Large titles (26, 42) can exceed 100MB XML. SAX streaming keeps memory bounded. DOM is not used.

2. **Section as the atomic unit**: A section is the smallest citable legal unit in the U.S. Code. Subsections, paragraphs, etc. are rendered within the section file, not as separate files.

@@ -284,11 +288,11 @@ output/usc/title-{NN}/chapter-{NN}/section-{N}.md

7. **Footnotes**: Rendered as Markdown footnotes (`[^N]` at reference site, `[^N]: text` at bottom of section file).

8. **Appendix titles**: Separate output directories (e.g., `title-05-appendix/`) for titles with appendices (5, 11, 18, 28).
8. **Token estimation**: Uses character/4 heuristic for token counts in `_meta.json`. Precise `tiktoken`-based counting is a planned enhancement.

9. **Token estimation**: Uses character/4 heuristic for token counts in `_meta.json`.
9. **Table of Disposition**: Excluded from section-level output. Included in title-level README.md.

10. **Table of Disposition**: Excluded from section-level output. Included in title-level README.md.
10. **Collect-then-write pattern**: Sections are collected during SAX streaming and written after the stream completes, avoiding async issues during SAX event processing.

## Common Pitfalls

@@ -301,6 +305,8 @@ output/usc/title-{NN}/chapter-{NN}/section-{N}.md
- **Permissive content model**: `<content>` uses `processContents="lax"` with `namespace="##any"` — it can contain elements from any namespace, including embedded XHTML. The SAX parser must handle unexpected elements gracefully.
- **`<continuation>` is interstitial**: Not just "after sub-levels" but also between elements of the same level. Handle as a text block in whatever position it appears.
- **Element versioning**: Elements can have `@startPeriod`/`@endPeriod`/`@status` for point-in-time variants. Multiple versions of the same element may coexist in the document.
- **Quoted content sections**: `<section>` elements inside `<quotedContent>` (quoted bills in statutory notes) must not be emitted as standalone files. Track `quotedContentDepth` to suppress emission.
- **Duplicate section numbers**: Some titles have multiple sections with the same number within a chapter (e.g., Title 5). Output files are disambiguated with `-2` suffixes.
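
The `quotedContentDepth` idea can be shown with a simplified event loop. This is a sketch under stated assumptions: the `XmlEvent` type and `countEmittedSections` are hypothetical stand-ins for the real SAX handlers in the AST builder, and element names are assumed to be already namespace-normalized.

```typescript
// Hypothetical, simplified stand-in for SAX open/close tag events.
type XmlEvent = { kind: "open" | "close"; name: string };

// Counts sections that would be emitted as standalone files, skipping any
// <section> that closes while inside one or more <quotedContent> elements.
function countEmittedSections(events: XmlEvent[]): number {
  let quotedContentDepth = 0;
  let emitted = 0;
  for (const event of events) {
    if (event.name === "quotedContent") {
      quotedContentDepth += event.kind === "open" ? 1 : -1;
    } else if (event.name === "section" && event.kind === "close" && quotedContentDepth === 0) {
      emitted += 1;
    }
  }
  return emitted;
}
```

Using a depth counter rather than a boolean keeps the suppression correct if `<quotedContent>` elements are nested.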

## When Adding New Source Types (CFR, State Statutes)
