
Commit aabad27

feat: Refactor 2026 — typed citation API, multi-engine orchestrator, format-aware I/O, published transformer model
Major refactoring of the extraction pipeline. The regex extraction logic is preserved; this change adds a typed API layer, multiple output adapters, short-form citation resolution, format-aware input handling, optional CRF and transformer inference engines, and a published German-legal-citation fine-tune on Hugging Face Hub. Regex F1 is unchanged on the validation split; the transformer engine adds +4.9 pp span-overlap F1 over the regex baseline.

## New public API

- **Typed citation models** (`refex.citations`): `LawCitation`, `CaseCitation`, `Span`, `CitationRelation`, `ExtractionResult` — frozen dataclasses with `__slots__`.
- **`CitationExtractor`** orchestrator with pluggable `Extractor` engines; default = regex for law + case.
- **`Document`** with format-aware normalization (plain / HTML / Markdown) and character-level offset maps for span round-tripping.
- **Output adapters** in `refex.serializers`: `to_jsonl` / `to_json` (primary), `to_spacy_doc`, `to_hf_bio`, `to_gliner`, `to_web_annotation`, `to_akn_ref` (Akoma Ntoso / LegalDocML.de).

## Extraction improvements

- **Artikel / Grundgesetz support**: `Art.` / `Artikel` patterns for constitutional and EU law citations.
- **Reporter citations**: `BGHZ 132, 105`, `NJW 2003, 1234`, etc. (~40 German reporter abbreviations).
- **Short-form resolution**: a bare `§ 5` inherits its book from a prior `§ 3 BGB`; reporter citations are linked to full case citations.
- **Relation detection**: `i.V.m.`, `vgl.`, `a.a.O.`, `ebenda`, `siehe dort` as citation edges.
- **Precise law book regex**: 1,948 codes loaded longest-first with a generic fallback; `REFEX_PRECISE_BOOK_REGEX` env var for A/B measurement.
- **`default_unit` hints** in `law_book_codes.txt` (optional TSV column): 23 well-known codes annotated as article/paragraph; overrides the text-prefix heuristic.
- **Interval-based marker masking** in the divide-and-conquer law extractor — one O(len) pass per phase instead of O(N × len).
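The interval-based masking optimization can be illustrated with a small self-contained sketch. This is a hypothetical helper mirroring the idea behind `_apply_mask_intervals`, not the package's actual code: instead of rewriting the string once per matched marker, each phase collects `(start, end)` spans and applies a single pass at the end.

```python
def apply_mask_intervals(content: str, intervals: list[tuple[int, int]], mask_char: str = "_") -> str:
    """Mask all matched spans in one O(len(content)) pass.

    Collecting spans during a phase and rewriting once replaces the
    O(N * len) cost of one str rewrite per marker.
    """
    if not intervals:
        return content
    # Sort and merge overlapping spans so each position is masked once.
    intervals = sorted(intervals)
    merged = [list(intervals[0])]
    for start, end in intervals[1:]:
        if start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Single pass: copy unmasked gaps, emit mask runs for merged spans.
    out, pos = [], 0
    for start, end in merged:
        out.append(content[pos:start])
        out.append(mask_char * (end - start))
        pos = end
    out.append(content[pos:])
    return "".join(out)
```

Masked regions can no longer match in later phases, which is what prevents the single-ref pass (`§`) from re-matching inside an already-consumed multi-ref (`§§`) span.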
## New inference engines (optional extras)

- **CRF engine** (`[crf]`): `sklearn-crfsuite` + streaming trainer.
- **Transformer engine** (`[transformers]`): HuggingFace token classification with sliding-window tokenisation, first-token-of-word aggregation, CPU/CUDA/MPS support, and batched `extract_batch(...)`.
- **Default transformer model published** to Hugging Face: `openlegaldata/legal-reference-extraction-base-de` (CC BY-NC 4.0) — a fine-tune of `EuroBERT/EuroBERT-210m`.
- **Training extras**: `[training]` adds `wandb` / `seqeval` / `datasets` / `accelerate`; `scripts/train_transformer.py` and `scripts/export_bio.py` for in-repo fine-tuning.

## Benchmark harness

- New `benchmarks/` package: adapter, metrics, runner, dataset validator, CI fixtures.
- Metrics: span F1 (exact + overlap), per-type F1, field accuracy (book, number, court, file_number, **structure** at key level), **relation-edge F1** via `(source_span, target_span, relation)` triples.
- `make bench-ci` (vendored fixtures) / `make bench-dev` (validation split) / `make bench-test` (final lock-in).

## Breaking

- Removed `refex.compat.to_ref_marker_string` and the `[ref=UUID]...[/ref]` inline-marker machinery (`RefMarker.replace_content`, `_MARKER_*_FORMAT`).
- Removed dead model fields and helpers: `BaseRef.sentence`, `RefMarker.get_length` / `get_start_position` / `get_end_position`, `Ref.get_law_repr` / `get_case_repr`, `@total_ordering` on `Ref`.
- Deleted the legacy `src/refex/extractors/law.py` (pre-refactor, 410 LOC); renamed `law_dnc.py` → `law.py`.
- Split the `[ml]` extra into `[crf]` + `[transformers]` (users pick one). CI matrix restructured: minimal install + tests on 3.11 / 3.12 / 3.13; full install + coverage on 3.12 only.

## Backward compatibility

- `RefExtractor` is preserved and fully working; `RefExtractor.extract_citations()` bridges to the typed API.
- `refex.compat.citations_to_ref_markers()` converts `ExtractionResult` → legacy `list[RefMarker]` for Open Legal Data's internal pipeline.
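The relation-edge F1 scoring in the benchmark harness reduces to set-intersection precision/recall/F1 over `(source_span, target_span, relation)` triples. A minimal sketch of that scoring (an illustration of the triple-based idea, not the `benchmarks/` package's actual implementation):

```python
def prf(pred: set, gold: set) -> tuple[float, float, float]:
    """Precision, recall, F1 over exact-match items in two sets."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Relation edges scored as (source_span, target_span, relation) triples;
# spans are (start, end) character offsets (example data, not real output).
pred = {((0, 11), (20, 28), "i.V.m."), ((30, 35), (40, 48), "vgl.")}
gold = {((0, 11), (20, 28), "i.V.m."), ((30, 35), (50, 58), "vgl.")}
p, r, f1 = prf(pred, gold)  # one of two predicted edges matches gold
```

Because an edge only counts when source span, target span, *and* relation label all match, a correct relation attached to a mis-resolved citation still scores as an error.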
## Metrics (validation split, 821 docs)

| Engine                     | span F1 exact | span F1 overlap | Throughput  |
|----------------------------|--------------:|----------------:|------------:|
| Regex (CPU)                | 0.734         | 0.815           | ~470 docs/s |
| Regex + CRF (CPU)          | 0.741         | 0.842           | ~90 docs/s  |
| Transformer EuroBERT (MPS) | 0.509         | **0.913**       | ~1.5 docs/s |
| Regex + Transformer (MPS)  | **0.743**     | 0.852           | ~1.5 docs/s |

## Tests

~347 tests cover the typed API, all engines, adapters, document normalization, benchmark metrics (structure + relations), law / case extractor edge cases, and internal helpers (`_apply_mask_intervals`, `REFEX_PRECISE_BOOK_REGEX`, `DEFAULT_MODEL`). Lint + format are clean; the CI coverage gate is ≥ 75 % on the full-install 3.12 job.

Refactor workspace docs (architecture review, implementation plan, optimization log, transformer training log, benchmark spec) are preserved on the `docs/refactor2026` branch and are not shipped on master.
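The exact vs overlap columns in the metrics table differ only in the span-matching predicate. A sketch of the distinction, assuming a greedy one-to-one matcher (an illustration, not the benchmark package's actual code — e.g. it says nothing about how ties are broken there):

```python
def span_f1(pred: list[tuple[int, int]], gold: list[tuple[int, int]], overlap: bool = False) -> float:
    """Span F1 with exact-match or any-overlap matching.

    Each gold span may be matched at most once (greedy, first come
    first served), so duplicated predictions are not double-counted.
    """
    matched, tp = set(), 0
    for ps, pe in pred:
        for i, (gs, ge) in enumerate(gold):
            if i in matched:
                continue
            hit = (ps < ge and gs < pe) if overlap else (ps, pe) == (gs, ge)
            if hit:
                matched.add(i)
                tp += 1
                break
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

This also explains the transformer row: a model that finds nearly every citation but trims or extends boundaries by a few characters scores low on exact F1 (0.509) while dominating on overlap F1 (0.913).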
1 parent 4473131 commit aabad27

61 files changed

Lines changed: 9973 additions & 1444 deletions


.github/workflows/bench.yml

Lines changed: 40 additions & 0 deletions
```yaml
name: Benchmark

on:
  pull_request:
    branches: [master]
    paths:
      - "src/**"
      - "benchmarks/**"
      - "pyproject.toml"

jobs:
  bench-ci:
    name: Benchmark (CI subset)
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python 3.12
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: make install

      - name: Validate fixtures
        run: make bench-validate BENCH_ARGS="-d benchmarks/fixtures"

      - name: Run benchmark on CI subset
        run: make bench-ci BENCH_ARGS="--json -o bench-result.json"

      - name: Show results
        run: cat bench-result.json | python -m json.tool

      - name: Upload benchmark result
        uses: actions/upload-artifact@v4
        with:
          name: bench-result
          path: bench-result.json
```
.github/workflows/ci.yml

Lines changed: 28 additions & 2 deletions
```diff
@@ -7,9 +7,13 @@ on:
     branches: [master]
 
 jobs:
-  check:
+  # Minimal install + test across all supported Python versions.
+  # No coverage gate — the minimal install skips the ML engines, so
+  # coverage on them would always be low here.
+  test:
     runs-on: ubuntu-latest
     strategy:
+      fail-fast: false
       matrix:
         python-version: ["3.11", "3.12", "3.13"]
 
@@ -21,11 +25,33 @@ jobs:
         with:
           python-version: ${{ matrix.python-version }}
 
-      - name: Install dependencies
+      - name: Install dev dependencies
        run: make install
 
      - name: Lint
        run: make lint
 
+      - name: Test
+        run: make test
+
+  # Full install + coverage gate on one Python version.
+  # Installs [crf,transformers,adapters] so the ML modules are
+  # exercised and the fail_under=90 check is meaningful.
+  coverage:
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python 3.12
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Install with all extras
+        run: |
+          make install
+          .venv/bin/pip install -e ".[crf,transformers,adapters]"
+
       - name: Test with coverage
         run: make test-cov
```
.gitignore

Lines changed: 10 additions & 0 deletions
```diff
@@ -108,3 +108,13 @@ media/courts/*
 logs/*
 
 workingdir
+
+# Trained CRF model (regenerate with `make train-crf`)
+src/refex/data/crf_model.pkl
+src/refex/data/crf_model.crfsuite
+
+# Transformer training artifacts
+models/
+data/hf_bio/
+wandb/
+
```

CHANGELOG.md

Lines changed: 160 additions & 0 deletions
```markdown
# Changelog

## 0.5.0 — Refactor 2026

Major refactoring of the extraction pipeline. Adds a typed API layer, multiple output adapters, short-form citation resolution, format-aware input handling (plain / HTML / Markdown), optional CRF and transformer inference engines, and a published German-legal-citation fine-tune on Hugging Face Hub.

Regex F1 is unchanged on the validation split; the transformer engine adds a measurable +4.9 pp span-overlap F1 over the regex baseline.

### New Features

- **Typed citation models** (Stream C): `LawCitation`, `CaseCitation`, `Span`, `CitationRelation`, `ExtractionResult` — frozen dataclasses with `__slots__`.
- **Strategy-based orchestrator** (Stream C): `CitationExtractor` with pluggable `Extractor` engines; default uses `RegexLawExtractor` + `RegexCaseExtractor`.
- **Output adapters** (Stream D):
  - `to_jsonl()` / `to_json()` — primary JSONL output matching the benchmark spec.
  - `to_spacy_doc()` — spaCy Doc-compatible dict (no spaCy dep required).
  - `to_hf_bio()` — HuggingFace BIO token-classification format.
  - `to_gliner()` — GLiNER span format.
  - `to_web_annotation()` — W3C Web Annotation Data Model.
  - `to_akn_ref()` — Akoma Ntoso / LegalDocML.de XML fragments.
- **Artikel / Grundgesetz support** (Stream E): `Art.` / `Artikel` patterns for German constitutional and EU law citations.
- **Short-form resolution** (Stream I): a bare `§ 5` inherits the book from a prior `§ 3 BGB`; reporter citations (BGHZ, BVerfGE, …) are linked to their prior full case citations.
- **Relation detection** (Stream I): `i.V.m.`, `vgl.`, `a.a.O.`, `ebenda`, `siehe dort` detected between adjacent citations.
- **Input format handling** (Stream J): `Document` model with format-aware normalization (plain / HTML / Markdown) and offset maps for span round-tripping.
- **Reporter citation extraction**: `BGHZ 132, 105`, `NJW 2003, 1234`, etc. — ~40 German legal reporter abbreviations recognized.
- **`STRUCTURE_KEYS`**: frozenset of 21 valid structure dict keys for `LawCitation.structure`.
- **CRF inference engine** (Stream F): `RegexCRFExtractor` with `sklearn-crfsuite` feature extractor + streaming trainer.
- **Transformer inference engine** (Stream G): `TransformerExtractor` with sliding-window tokenisation, first-token-of-word aggregation, CPU / CUDA / MPS inference, and batched `extract_batch(...)`.
- **Published default transformer model**: [`openlegaldata/legal-reference-extraction-base-de`](https://huggingface.co/openlegaldata/legal-reference-extraction-base-de) (CC BY-NC 4.0) — a fine-tune of `EuroBERT/EuroBERT-210m` for German legal law / case citation BIO tagging. `refex.engines.transformer.DEFAULT_MODEL` points at this repo, so `TransformerExtractor()` with no args loads it by default.
- **`default_unit` column** in `law_book_codes.txt`: optional tab-separated `<unit>` column (`article` / `paragraph`); when present, it overrides the text-prefix heuristic in `_law_markers_to_citations`. 23 high-confidence annotations curated (`GG` / `EUV` = article; `BGB` / `HGB` / `StGB` / `StPO` / `ZPO` / … = paragraph). New `get_unit_hint(code)` helper on the law extractor mixin.
- **Structure key-level accuracy metric** in `BenchmarkResult` (A2c): `field_accuracy['structure']` accumulates per-key `correct` / `incorrect` / `missing_pred` / `missing_gold` on exact-matched law pairs.
- **Relation-edge F1 metric** in `BenchmarkResult` (A2d): `relation_exact: PRF` scored as `(source_span, target_span, relation)` triples. The benchmark runner accepts an `extract_fn` returning either `list[Citation]` (legacy) or `(citations, relations)`.
- **`REFEX_PRECISE_BOOK_REGEX` env var**: toggles `use_precise_book_regex` at runtime for A/B measurement of the precise vs generic book-code regex. Default `True` (matches the exact-F1 optimization metric).

### Improvements

- **Precise law book regex** (B7): 1,948 law book codes loaded from the bundled data file, sorted longest-first with a generic fallback.
- **Pre-compiled regex patterns** (B5): all patterns compiled once at init instead of per call.
- **Fixed mutable class defaults** (B1, B6): `RefMarker.references` and `law_book_codes` are now instance-level.
- **Fixed `Ref.__eq__`** (B3): returns `NotImplemented` for foreign types.
- **Fixed `Ref.__hash__`** (B4): hashes the full field tuple, not `__repr__`.
- **Interval-based marker masking** in `law.py`: each extraction phase now collects match spans and applies a single O(len(content)) mask pass at the end of the phase instead of O(N × len) per-marker calls. +1 % throughput; F1 unchanged.

### Breaking

- **Removed `refex.compat.to_ref_marker_string`.** It emitted the legacy ``[ref=UUID]…[/ref]`` inline-marker string. Use `ExtractionResult.citations` directly and a serializer from `refex.serializers` (e.g. `to_jsonl`, `to_web_annotation`) for persistence / round-tripping.
- **Removed `RefMarker.replace_content`** and the `_MARKER_OPEN_FORMAT` / `_MARKER_CLOSE_FORMAT` constants — only `to_ref_marker_string` called them. `RefMarker.set_uuid` is still present for `citations_to_ref_markers`.
- **Removed dead model surface:** `BaseRef.sentence`, `RefMarker.get_length` / `get_start_position` / `get_end_position`, `Ref.get_law_repr` / `get_case_repr`, `@total_ordering` on `Ref` — none had external callers.
- **Deleted legacy `src/refex/extractors/law.py`** (410 LOC, pre-refactor) and **renamed `law_dnc.py` → `law.py`**; the divide-and-conquer extractor now lives at the canonical filename.

### Deprecations

- `RefExtractor.extract(is_html=True)` emits a `DeprecationWarning`. Use `CitationExtractor().extract(text, fmt="html")` instead.
- The `[ref=UUID]...[/ref]` marker format is deprecated. Use JSONL output via `to_jsonl()` for new integrations.

### Backward Compatibility

- `RefExtractor` is preserved and works as before.
- `RefExtractor.extract_citations()` bridges to the new typed API.
- `refex.compat.citations_to_ref_markers()` converts typed `ExtractionResult` → legacy `list[RefMarker]` for Open Legal Data's internal pipeline.
- All legacy tests remain green.

### Closed follow-ups (measured and rejected)

- **Aho–Corasick court-name index** — a pure-Python variant regressed throughput by 35.6 % because Python's C `re` engine beats a single-pass scan on typical docs; a C-backed AC dependency is out of scope.
- **Per-`(doc_id, fn_span)` court cache** — 16.9 % same-fn recurrence is real, but court resolution is position-dependent; cache-first regresses span F1, and fresh-first with cache fallback regresses court-field accuracy and throughput.

### Metrics (benchmark validation split, 821 docs)

| Engine                     | span F1 (exact) | span F1 (overlap) | Throughput  |
|----------------------------|----------------:|------------------:|------------:|
| Regex baseline (CPU)       | 0.734           | 0.815             | ~470 docs/s |
| Regex + CRF (CPU)          | 0.741           | 0.842             | ~90 docs/s  |
| Transformer EuroBERT (MPS) | 0.509           | **0.913**         | ~1.5 docs/s |
| Regex + Transformer (MPS)  | **0.743**       | 0.852             | ~1.5 docs/s |

### Tests & benchmarks

- **~347 tests** covering the new typed API, engines, adapters, document normalization, benchmark metrics (including the new structure and relation-edge metrics), law / case extractor edge cases, and the regex interval-masking / env-var / default-model internals.

## 0.4.2

- Previous release.
```

CLAUDE.md

Lines changed: 32 additions & 18 deletions
````diff
@@ -3,19 +3,28 @@
 ## Project layout
 
 - `src/refex/` — source package (src layout, installed via `pip install -e .`)
+  - `src/refex/orchestrator.py` — `CitationExtractor` (main entry point)
+  - `src/refex/citations.py` — typed citation models (`LawCitation`, `CaseCitation`, `Span`)
+  - `src/refex/document.py` — `Document` model, HTML/Markdown normalization, offset mapping
+  - `src/refex/engines/regex.py` — regex-based extraction engines
+  - `src/refex/extractors/` — internal regex engines: `law.py` (divide-and-conquer multi-ref matcher) and `case.py` (file-number + court heuristic)
+  - `src/refex/serializers.py` — output format adapters (JSONL, BIO, spaCy, etc.)
+  - `src/refex/resolver.py` — short-form citation resolution (a.a.O., ebenda, i.V.m.)
 - `src/refex/data/` — bundled data files (`law_book_codes.txt`, `file_number_codes.csv`)
-- `src/refex/extractors/` — law and case reference extractors
+- `benchmarks/` — benchmark runner, metrics, adapter, validator, fixtures
 - `tests/` — pytest test suite
-  - `tests/resources/` — test fixture files (German legal text snippets)
-  - `tests/conftest.py` — shared fixtures (`extractor`, `law_extractor`, `case_extractor`) and helpers (`assert_refs`, `get_book_codes_from_file`)
 
 ## Development commands
 
 ```
-make install  # create .venv, install editable + dev deps (auto-detects uv vs pip)
-make test     # pytest
-make lint     # ruff check + format check
-make format   # ruff auto-fix + format
+make install         # create .venv, install editable + dev deps (auto-detects uv vs pip)
+make test            # pytest
+make lint            # ruff check + format check
+make format          # ruff auto-fix + format
+make bench-ci        # benchmark against vendored CI fixtures
+make bench-dev       # benchmark against full validation split
+make bench-validate  # dataset integrity checks
+make diagnose        # error analysis on validation split
 ```
 
 ## Key conventions
@@ -25,25 +34,30 @@ make format   # ruff auto-fix + format
 - Data files accessed via `importlib.resources.files("refex") / "data"`, not `os.path`.
 - Regex strings use raw string literals (`r"..."`) to avoid escape sequence warnings.
 - Ruff rules: `E, F, I, UP, W`. Line length 120. E501 suppressed in tests (German legal text fixtures).
-- No runtime dependencies. Dev deps: `pytest`, `ruff`.
+- No runtime dependencies. Optional extras: `[adapters]` (spaCy), `[crf]` (sklearn-crfsuite), `[transformers]` (transformers + torch), `[training]` (wandb + seqeval + datasets + accelerate, for fine-tuning).
 
-## Architecture notes
+## Architecture
 
-- `RefExtractor` is the main entry point. It inherits from both `DivideAndConquerLawRefExtractorMixin` (law refs) and `CaseRefExtractorMixin` (case refs). Toggle via `do_law_refs` / `do_case_refs` bools.
-- `extract()` returns `(content_with_markers, list[RefMarker])`. Markers wrap the matched text with `[ref=UUID]...[/ref]` tags.
-- Law extraction uses a divide-and-conquer approach: first multi-refs (`§§`), then single-refs (`§`), masking matched regions to prevent double-matching.
+- **`CitationExtractor`** (orchestrator.py) is the public API. It runs multiple `Extractor` engines and merges results.
+- **`RegexLawExtractor`** + **`RegexCaseExtractor`** (engines/regex.py) wrap the internal extractors in `extractors/law.py` and `extractors/case.py` and emit typed `LawCitation`/`CaseCitation` objects. The default transformer engine (`engines/transformer.py`) loads `openlegaldata/legal-reference-extraction-base-de` (EuroBERT-210m fine-tune).
+- **`Document`** (document.py) wraps input with format metadata. Supports plain text, HTML, and Markdown. HTML/Markdown is normalized to plain text with character-level offset maps for span round-tripping.
+- **Legacy `RefExtractor`** (extractor.py) is deprecated but preserved for backward compatibility. It internally delegates to the mixin extractors, which produce `RefMarker`/`Ref` objects (internal types, not public API).
+- Law extraction uses divide-and-conquer: multi-refs (`§§`) first, then single-refs (`§`), masking matched regions.
 - Case extraction finds file numbers via regex, then heuristically searches surrounding text for court names.
-- `law_book_context` attribute enables within-book extraction (sections without explicit book codes).
+
+## Benchmark
+
+- Benchmark data lives in the sibling project `german-legal-references-benchmark` (HF Arrow dataset).
+- CI fixtures vendored in `benchmarks/fixtures/` (plain-text, HTML, and Markdown docs).
+- All optimization uses the **validation split only**. The test split is reserved for final evaluation.
 
 ## Testing
 
-- 42 tests, 4 skipped (known unsupported patterns marked with `@pytest.mark.skip`).
 - Tests use fixtures from `conftest.py`. The `assert_refs` helper extracts from content and compares sorted ref lists.
-- Test resource files in `tests/resources/law/` and `tests/resources/case/` contain German legal text snippets.
-
+- `test_format_fixtures.py` — integration tests for HTML and Markdown input with real court decision fixtures.
+- `test_document.py` — `Document` model, normalization, offset mapping, format detection.
 
 ## Git
 
-- Before commiting run "make lint" and "make test"
+- Before committing run `make lint` and `make test`
 - Use prefix branches: chore/, fix/, feat/
-
````
