This file provides guidance to Codex (Codex.ai/code) when working with code in this repository.
pnpm install # Install dependencies (uses corepack, pinned in packageManager field)
pnpm test # Run all tests (vitest, watch mode)
pnpm exec vitest run # Run all tests once (no watch)
pnpm exec vitest run tests/extract/extractCase.test.ts # Run a single test file
pnpm exec vitest run -t "extracts volume" # Run tests matching name pattern
pnpm build # Build with tsdown (ESM + CJS + DTS)
pnpm typecheck # Type-check with tsc --noEmit
pnpm lint # Lint with Biome
pnpm format # Format with Biome (auto-fix)
pnpm size # Check bundle size limits
pnpm changeset # Create a changeset for the next release

This is a TypeScript port of Python eyecite, a legal citation extraction library with zero runtime dependencies.
Citations flow through a 4-stage pipeline: clean → tokenize → extract → (resolve)
- Clean (src/clean/): Strip HTML, normalize whitespace/Unicode, fix smart quotes. Builds a TransformationMap to track position shifts.
- Tokenize (src/tokenize/): Apply regex patterns from src/patterns/ to find citation candidates. Intentionally broad: captures potential matches without validation.
- Extract (src/extract/): Parse metadata from tokens (volume, reporter, page, court, year). Each citation type has its own extractor (extractCase.ts, extractStatute.ts, etc.). The main orchestrator is extractCitations.ts.
  - Case extraction is split into parser/semantic modules (caseCore, caseEnvelope, casePostfix, caseParentheticals, caseNameScanner, caseNameSemantics, casePartySemantics, caseReporterSemantics, caseCitationDraft). extractCase.ts should stay an orchestrator: parse syntax, interpret semantics, apply semantics to the draft, then finalize.
  - dates.ts provides date parsing utilities (parseMonth, parseDate, toIsoDate) for structured date extraction from parentheticals.
- Resolve (src/resolve/): Link short-form citations (Id., supra, short-form case) to their full antecedents. DocumentResolver uses scope boundaries and Levenshtein matching.
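The Levenshtein matching mentioned for the resolve stage can be pictured with a minimal sketch. The names levenshtein and closestAntecedent and the maxDistance threshold are hypothetical and are not taken from DocumentResolver; the sketch only shows the general technique of picking the nearest full-citation name for a short-form reference.

```typescript
// Hypothetical sketch: classic two-row Levenshtein edit distance.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        (prev[j] ?? 0) + 1, // deletion
        (curr[j - 1] ?? 0) + 1, // insertion
        (prev[j - 1] ?? 0) + cost, // substitution
      );
    }
    prev = curr;
  }
  return prev[b.length] ?? 0;
}

// Hypothetical helper: return the candidate name closest to the short-form
// name, or undefined when nothing is within maxDistance (an assumed threshold).
function closestAntecedent(
  shortName: string,
  candidates: string[],
  maxDistance = 3,
): string | undefined {
  let best: string | undefined;
  let bestDistance = maxDistance + 1;
  for (const candidate of candidates) {
    const distance = levenshtein(shortName.toLowerCase(), candidate.toLowerCase());
    if (distance < bestDistance) {
      best = candidate;
      bestDistance = distance;
    }
  }
  return best;
}
```

A fuzzy threshold like this tolerates small spelling differences between a supra reference and its full citation while rejecting unrelated party names.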
Opt-in via extractCitations(text, { detectFootnotes: true }). Runs before cleaning on the raw text to preserve newline structure. Two strategies:
- HTML (src/footnotes/htmlDetector.ts): Regex-based tag scanner for <footnote>, <fn>, and elements with footnote class/id attributes. No DOM dependency.
- Plain text (src/footnotes/textDetector.ts): Finds separator lines (5+ dashes/underscores) followed by numbered markers (1., FN1., [1], n.1).
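The plain-text strategy can be sketched roughly as follows. This is an illustration under assumptions, not the code in textDetector.ts: the regexes, the detectTextFootnotes name, and the rule that each zone runs until the next marker (or end of text) are all hypothetical.

```typescript
// Hypothetical zone shape, modeled on the { start, end, footnoteNumber } description.
interface FootnoteZone {
  start: number;
  end: number;
  footnoteNumber: number;
}

// Assumed regexes: a line of 5+ dashes/underscores, and a marker at line start
// in one of the forms FN1. / n.1 / [1] / 1.
const SEPARATOR = /^[ \t]*[-_]{5,}[ \t]*$/m;
const MARKER = /^[ \t]*(?:FN(\d+)\.|n\.(\d+)|\[(\d+)\]|(\d+)\.)/gm;

function detectTextFootnotes(text: string): FootnoteZone[] {
  const separator = SEPARATOR.exec(text);
  if (separator === null) return [];

  const zones: FootnoteZone[] = [];
  // Only look for markers after the separator line.
  MARKER.lastIndex = separator.index + separator[0].length;
  let match: RegExpExecArray | null;
  while ((match = MARKER.exec(text)) !== null) {
    const digits = match[1] ?? match[2] ?? match[3] ?? match[4];
    if (digits === undefined) continue;
    // Close the previous zone where this marker begins.
    const previous = zones[zones.length - 1];
    if (previous !== undefined) previous.end = match.index;
    zones.push({ start: match.index, end: text.length, footnoteNumber: Number(digits) });
  }
  return zones;
}
```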
detectFootnotes(text) selects the strategy (HTML first, text fallback) and returns a FootnoteMap (array of { start, end, footnoteNumber } zones). The pipeline maps zones through TransformationMap to clean-text coordinates, then tags citations with inFootnote/footnoteNumber via binary search. The "footnote" scope strategy in the resolver enforces zone-based isolation: Id. is strict (same zone only), supra/shortFormCase can cross from footnotes to body.
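The binary-search tagging step might look like the sketch below. It assumes zones are sorted by start, do not overlap, and use an exclusive end; findZone, tagCitation, and the TaggedCitation shape are hypothetical names for illustration only.

```typescript
// Hypothetical zone shape in clean-text coordinates.
interface FootnoteZone {
  start: number;
  end: number; // assumed exclusive
  footnoteNumber: number;
}

// Binary search over zones sorted by start (assumed non-overlapping).
function findZone(zones: FootnoteZone[], position: number): FootnoteZone | undefined {
  let lo = 0;
  let hi = zones.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const zone = zones[mid];
    if (zone === undefined) return undefined;
    if (position < zone.start) hi = mid - 1;
    else if (position >= zone.end) lo = mid + 1;
    else return zone;
  }
  return undefined;
}

// Hypothetical tagged result carrying the inFootnote/footnoteNumber fields.
interface TaggedCitation {
  cleanStart: number;
  inFootnote: boolean;
  footnoteNumber?: number;
}

function tagCitation(zones: FootnoteZone[], cleanStart: number): TaggedCitation {
  const zone = findZone(zones, cleanStart);
  return zone === undefined
    ? { cleanStart, inFootnote: false }
    : { cleanStart, inFootnote: true, footnoteNumber: zone.footnoteNumber };
}
```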
Annotation (src/annotate/) and reporter data (src/data/) are separate entry points to enable tree-shaking.
The Span type carries dual positions: cleanStart/cleanEnd (for internal parsing) and originalStart/originalEnd (for user-facing results). TransformationMap maps between them using a lookahead algorithm (maxLookAhead=20) in cleanText.ts:rebuildPositionMaps.
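To make the dual-position idea concrete, here is a deliberately simplified sketch. It is not the lookahead algorithm in rebuildPositionMaps: it uses a single hypothetical cleaner (collapseWhitespace) that records, for every cleaned character, the index it came from, and a hypothetical toSpan helper that fills in the four position fields.

```typescript
interface CleanResult {
  cleaned: string;
  // cleanToOriginal[i] is the index in the original text of cleaned[i].
  cleanToOriginal: number[];
}

// Hypothetical cleaner: collapse each whitespace run into one space.
function collapseWhitespace(original: string): CleanResult {
  let cleaned = "";
  const cleanToOriginal: number[] = [];
  let previousWasSpace = false;
  for (let i = 0; i < original.length; i++) {
    const ch = original.charAt(i);
    const isSpace = /\s/.test(ch);
    if (isSpace && previousWasSpace) continue; // drop repeated whitespace
    cleaned += isSpace ? " " : ch;
    cleanToOriginal.push(i);
    previousWasSpace = isSpace;
  }
  return { cleaned, cleanToOriginal };
}

interface Span {
  cleanStart: number;
  cleanEnd: number; // exclusive
  originalStart: number;
  originalEnd: number; // exclusive
}

// Map a non-empty clean-text range back to original coordinates.
function toSpan(map: number[], cleanStart: number, cleanEnd: number): Span {
  return {
    cleanStart,
    cleanEnd,
    originalStart: map[cleanStart] ?? 0,
    // Map the last included character, then step one past it.
    originalEnd: (map[cleanEnd - 1] ?? -1) + 1,
  };
}
```

Parsing works on the cleaned string, while slicing the original text with originalStart/originalEnd recovers exactly what the user wrote.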
fullSpan (optional) extends from the case name through the final closing parenthetical (including chained parens and subsequent history). The core span field remains citation-core-only for backward compatibility.
Citations use a discriminated union on the type field: case | statute | journal | neutral | publicLaw | federalRegister | statutesAtLarge | id | supra | shortFormCase. All share CitationBase (text, span, confidence, matchedText, processTimeMs). Switch on citation.type for type-safe field access.
- Volume is number | string: numeric for standard volumes, string for hyphenated (e.g., "1984-1")
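Switching on the type field could look like the sketch below. The types are simplified, hypothetical stand-ins declared locally (only three variants, with assumed fields such as reporter, page, and section); the library's real interfaces carry more fields and more variants.

```typescript
// Simplified stand-ins; real field sets may differ.
interface CitationBase {
  text: string;
  confidence: number;
}
interface CaseCitation extends CitationBase {
  type: "case";
  volume: number | string;
  reporter: string;
  page: number;
}
interface StatuteCitation extends CitationBase {
  type: "statute";
  section: string;
}
interface IdCitation extends CitationBase {
  type: "id";
}
type Citation = CaseCitation | StatuteCitation | IdCitation;

function describe(citation: Citation): string {
  switch (citation.type) {
    case "case":
      // Narrowed to CaseCitation: volume, reporter, and page are available.
      return `${citation.volume} ${citation.reporter} ${citation.page}`;
    case "statute":
      return `§ ${citation.section}`;
    case "id":
      return "Id.";
    default: {
      // Compile-time exhaustiveness check: fails if a variant is unhandled.
      const unreachable: never = citation;
      return unreachable;
    }
  }
}

// Hyphenated volumes such as "1984-1" have no numeric form.
function volumeAsNumber(volume: number | string): number | undefined {
  if (typeof volume === "number") return volume;
  const parsed = Number(volume);
  return Number.isNaN(parsed) ? undefined : parsed;
}
```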
Three package entry points configured in tsdown.config.ts and package.json:
- eyecite-ts → src/index.ts (core extraction + resolution)
- eyecite-ts/data → src/data/index.ts (reporter database, lazy-loaded)
- eyecite-ts/annotate → src/annotate/index.ts (text annotation)
@/* maps to src/* in both tsconfig.json and vitest.config.ts.
- Formatter/Linter: Biome 2.x (spaces, 100-char line width, double quotes, trailing commas, semicolons as needed)
  - noAssignInExpressions: off (regex exec loops use the assignment-in-while pattern)
  - noExplicitAny: error and noImplicitAnyLet: error (strict typing enforced)
  - noForEach: off (forEach is allowed)
- Patterns are defined in src/patterns/ with a Pattern interface (id, regex, description, type)
- Regex patterns must avoid nested quantifiers to prevent ReDoS
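A sketch of how these conventions fit together: a Pattern-shaped object, a regex with bounded, non-nested quantifiers, and the assignment-in-while exec loop. The Pattern field types, the example pattern, the Token shape, and the tokenize helper are assumptions for illustration, not the library's actual definitions.

```typescript
// Hypothetical Pattern shape based on the listed fields.
interface Pattern {
  id: string;
  regex: RegExp;
  description: string;
  type: string;
}

// Illustrative pattern only. Quantifiers are bounded and never nested,
// so matching cannot backtrack catastrophically.
const usReportsExample: Pattern = {
  id: "us-reports-example",
  regex: /\b(\d{1,4}) U\.S\. (\d{1,5})\b/g,
  description: "volume U.S. page",
  type: "case",
};

interface Token {
  patternId: string;
  text: string;
  start: number;
  end: number; // exclusive
}

function tokenize(text: string, pattern: Pattern): Token[] {
  // Copy the regex so lastIndex starts at 0, and force the global flag
  // so the exec loop advances instead of matching the same spot forever.
  const flags = pattern.regex.flags.includes("g") ? pattern.regex.flags : `${pattern.regex.flags}g`;
  const regex = new RegExp(pattern.regex.source, flags);
  const tokens: Token[] = [];
  let match: RegExpExecArray | null;
  while ((match = regex.exec(text)) !== null) {
    if (match[0].length === 0) {
      regex.lastIndex++; // guard against zero-length matches
      continue;
    }
    tokens.push({
      patternId: pattern.id,
      text: match[0],
      start: match.index,
      end: match.index + match[0].length,
    });
  }
  return tokens;
}
```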
Tests mirror source in tests/ with the same directory structure. Integration tests live in tests/integration/. Vitest 4 is used — test options go as the second argument: it(name, { timeout }, fn).
- CI: GitHub Actions — lint, typecheck, test (Node 18/20/22 matrix), build + size check
- Coverage: Vitest --coverage requires Node 20+ (node:inspector/promises). CI only runs coverage on Node 22.
- Releases: Changesets. Run pnpm changeset to add one; merging to main creates a "Version Packages" PR, and merging that PR publishes to npm with provenance.
- Package manager: pnpm 10 via corepack. Build script allowlist in pnpm-workspace.yaml.
- Each fix/feature branch needs a changeset: pnpm changeset → select patch/minor/major → write summary