
Commit aabad27

feat: Refactor 2026 — typed citation API, multi-engine orchestrator, format-aware I/O, published transformer model
Major refactoring of the extraction pipeline. The regex extraction logic is preserved; this change adds a typed API layer, multiple output adapters, short-form citation resolution, format-aware input handling, optional CRF and transformer inference engines, and a published German-legal-citation fine-tune on Hugging Face Hub. Regex F1 is unchanged on the validation split; the transformer engine adds +4.9 pp span-overlap F1 over the regex baseline.

## New public API

- **Typed citation models** (`refex.citations`): `LawCitation`, `CaseCitation`, `Span`, `CitationRelation`, `ExtractionResult` — frozen dataclasses with `__slots__`.
- **`CitationExtractor`** orchestrator with pluggable `Extractor` engines; default = regex for law + case.
- **`Document`** with format-aware normalization (plain / HTML / Markdown) and character-level offset maps for span round-tripping.
- **Output adapters** in `refex.serializers`: `to_jsonl` / `to_json` (primary), `to_spacy_doc`, `to_hf_bio`, `to_gliner`, `to_web_annotation`, `to_akn_ref` (Akoma Ntoso / LegalDocML.de).

## Extraction improvements

- **Artikel / Grundgesetz support**: `Art.` / `Artikel` patterns for constitutional and EU law citations.
- **Reporter citations**: `BGHZ 132, 105`, `NJW 2003, 1234`, etc. (~40 German reporter abbreviations).
- **Short-form resolution**: a bare `§ 5` inherits its book from a prior `§ 3 BGB`; reporter citations are linked to full case citations.
- **Relation detection**: `i.V.m.`, `vgl.`, `a.a.O.`, `ebenda`, `siehe dort` as citation edges.
- **Precise law book regex**: 1,948 codes loaded longest-first with a generic fallback; `REFEX_PRECISE_BOOK_REGEX` env var for A/B measurement.
- **`default_unit` hints** in `law_book_codes.txt` (optional TSV column): 23 well-known codes annotated as article/paragraph; overrides the text-prefix heuristic.
- **Interval-based marker masking** in the divide-and-conquer law extractor — one O(len) pass per phase instead of O(N × len).
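The interval-based masking optimization can be illustrated with a small self-contained sketch. This is a hypothetical helper mirroring the idea behind `_apply_mask_intervals`, not the package's actual code: instead of rewriting the string once per matched marker, each phase collects `(start, end)` spans and applies a single pass at the end.

```python
def apply_mask_intervals(content: str, intervals: list[tuple[int, int]], mask_char: str = "_") -> str:
    """Mask all matched spans in one O(len(content)) pass.

    Collecting spans during a phase and rewriting once replaces the
    O(N * len) cost of one str rewrite per marker.
    """
    if not intervals:
        return content
    # Sort and merge overlapping spans so each position is masked once.
    intervals = sorted(intervals)
    merged = [list(intervals[0])]
    for start, end in intervals[1:]:
        if start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Single pass: copy unmasked gaps, emit mask runs for merged spans.
    out, pos = [], 0
    for start, end in merged:
        out.append(content[pos:start])
        out.append(mask_char * (end - start))
        pos = end
    out.append(content[pos:])
    return "".join(out)
```

Masked regions can no longer match in later phases, which is what prevents the single-ref pass (`§`) from re-matching inside an already-consumed multi-ref (`§§`) span.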
## New inference engines (optional extras)

- **CRF engine** (`[crf]`): `sklearn-crfsuite` + streaming trainer.
- **Transformer engine** (`[transformers]`): HuggingFace token classification with sliding-window tokenisation, first-token-of-word aggregation, CPU/CUDA/MPS support, and batched `extract_batch(...)`.
- **Default transformer model published** to Hugging Face: `openlegaldata/legal-reference-extraction-base-de` (CC BY-NC 4.0) — a fine-tune of `EuroBERT/EuroBERT-210m`.
- **Training extras**: `[training]` adds `wandb` / `seqeval` / `datasets` / `accelerate`; `scripts/train_transformer.py` and `scripts/export_bio.py` for in-repo fine-tuning.

## Benchmark harness

- New `benchmarks/` package: adapter, metrics, runner, dataset validator, CI fixtures.
- Metrics: span F1 (exact + overlap), per-type F1, field accuracy (book, number, court, file_number, **structure** at key level), **relation-edge F1** via `(source_span, target_span, relation)` triples.
- `make bench-ci` (vendored fixtures) / `make bench-dev` (validation split) / `make bench-test` (final lock-in).

## Breaking

- Removed `refex.compat.to_ref_marker_string` and the `[ref=UUID]...[/ref]` inline-marker machinery (`RefMarker.replace_content`, `_MARKER_*_FORMAT`).
- Removed dead model fields and helpers: `BaseRef.sentence`, `RefMarker.get_length` / `get_start_position` / `get_end_position`, `Ref.get_law_repr` / `get_case_repr`, `@total_ordering` on `Ref`.
- Deleted the legacy `src/refex/extractors/law.py` (pre-refactor, 410 LOC); renamed `law_dnc.py` → `law.py`.
- Split the `[ml]` extra into `[crf]` + `[transformers]` (users pick one). CI matrix restructured: minimal install + tests on 3.11 / 3.12 / 3.13; full install + coverage on 3.12 only.

## Backward compatibility

- `RefExtractor` is preserved and fully working; `RefExtractor.extract_citations()` bridges to the typed API.
- `refex.compat.citations_to_ref_markers()` converts `ExtractionResult` → legacy `list[RefMarker]` for Open Legal Data's internal pipeline.
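The relation-edge F1 scoring in the benchmark harness reduces to set-intersection precision/recall/F1 over `(source_span, target_span, relation)` triples. A minimal sketch of that scoring (an illustration of the triple-based idea, not the `benchmarks/` package's actual implementation):

```python
def prf(pred: set, gold: set) -> tuple[float, float, float]:
    """Precision, recall, F1 over exact-match items in two sets."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Relation edges scored as (source_span, target_span, relation) triples;
# spans are (start, end) character offsets (example data, not real output).
pred = {((0, 11), (20, 28), "i.V.m."), ((30, 35), (40, 48), "vgl.")}
gold = {((0, 11), (20, 28), "i.V.m."), ((30, 35), (50, 58), "vgl.")}
p, r, f1 = prf(pred, gold)  # one of two predicted edges matches gold
```

Because an edge only counts when source span, target span, *and* relation label all match, a correct relation attached to a mis-resolved citation still scores as an error.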
## Metrics (validation split, 821 docs)

| Engine                     | span F1 exact | span F1 overlap | Throughput  |
|----------------------------|--------------:|----------------:|------------:|
| Regex (CPU)                | 0.734         | 0.815           | ~470 docs/s |
| Regex + CRF (CPU)          | 0.741         | 0.842           | ~90 docs/s  |
| Transformer EuroBERT (MPS) | 0.509         | **0.913**       | ~1.5 docs/s |
| Regex + Transformer (MPS)  | **0.743**     | 0.852           | ~1.5 docs/s |

## Tests

~347 tests cover the typed API, all engines, adapters, document normalization, benchmark metrics (structure + relations), law / case extractor edge cases, and internal helpers (`_apply_mask_intervals`, `REFEX_PRECISE_BOOK_REGEX`, `DEFAULT_MODEL`). Lint + format are clean; the CI coverage gate is ≥ 75 % on the full-install 3.12 job.

Refactor workspace docs (architecture review, implementation plan, optimization log, transformer training log, benchmark spec) are preserved on the `docs/refactor2026` branch and are not shipped on master.
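The exact vs overlap columns in the metrics table differ only in the span-matching predicate. A sketch of the distinction, assuming a greedy one-to-one matcher (an illustration, not the benchmark package's actual code — e.g. it says nothing about how ties are broken there):

```python
def span_f1(pred: list[tuple[int, int]], gold: list[tuple[int, int]], overlap: bool = False) -> float:
    """Span F1 with exact-match or any-overlap matching.

    Each gold span may be matched at most once (greedy, first come
    first served), so duplicated predictions are not double-counted.
    """
    matched, tp = set(), 0
    for ps, pe in pred:
        for i, (gs, ge) in enumerate(gold):
            if i in matched:
                continue
            hit = (ps < ge and gs < pe) if overlap else (ps, pe) == (gs, ge)
            if hit:
                matched.add(i)
                tp += 1
                break
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

This also explains the transformer row: a model that finds nearly every citation but trims or extends boundaries by a few characters scores low on exact F1 (0.509) while dominating on overlap F1 (0.913).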
1 parent 4473131 commit aabad27

61 files changed

Lines changed: 9973 additions & 1444 deletions


.github/workflows/bench.yml

Lines changed: 40 additions & 0 deletions
```yaml
name: Benchmark

on:
  pull_request:
    branches: [master]
    paths:
      - "src/**"
      - "benchmarks/**"
      - "pyproject.toml"

jobs:
  bench-ci:
    name: Benchmark (CI subset)
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python 3.12
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: make install

      - name: Validate fixtures
        run: make bench-validate BENCH_ARGS="-d benchmarks/fixtures"

      - name: Run benchmark on CI subset
        run: make bench-ci BENCH_ARGS="--json -o bench-result.json"

      - name: Show results
        run: cat bench-result.json | python -m json.tool

      - name: Upload benchmark result
        uses: actions/upload-artifact@v4
        with:
          name: bench-result
          path: bench-result.json
```
.github/workflows/ci.yml

Lines changed: 28 additions & 2 deletions
```diff
@@ -7,9 +7,13 @@ on:
     branches: [master]
 
 jobs:
-  check:
+  # Minimal install + test across all supported Python versions.
+  # No coverage gate — the minimal install skips the ML engines, so
+  # coverage on them would always be low here.
+  test:
     runs-on: ubuntu-latest
     strategy:
+      fail-fast: false
       matrix:
         python-version: ["3.11", "3.12", "3.13"]
 
@@ -21,11 +25,33 @@ jobs:
         with:
           python-version: ${{ matrix.python-version }}
 
-      - name: Install dependencies
+      - name: Install dev dependencies
        run: make install
 
      - name: Lint
        run: make lint
 
+      - name: Test
+        run: make test
+
+  # Full install + coverage gate on one Python version.
+  # Installs [crf,transformers,adapters] so the ML modules are
+  # exercised and the fail_under=90 check is meaningful.
+  coverage:
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python 3.12
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Install with all extras
+        run: |
+          make install
+          .venv/bin/pip install -e ".[crf,transformers,adapters]"
+
       - name: Test with coverage
         run: make test-cov
```
.gitignore

Lines changed: 10 additions & 0 deletions
```diff
@@ -108,3 +108,13 @@ media/courts/*
 logs/*
 
 workingdir
+
+# Trained CRF model (regenerate with `make train-crf`)
+src/refex/data/crf_model.pkl
+src/refex/data/crf_model.crfsuite
+
+# Transformer training artifacts
+models/
+data/hf_bio/
+wandb/
+
```

CHANGELOG.md

Lines changed: 160 additions & 0 deletions
```markdown
# Changelog

## 0.5.0 — Refactor 2026

Major refactoring of the extraction pipeline. Adds a typed API layer, multiple output adapters, short-form citation resolution, format-aware input handling (plain / HTML / Markdown), optional CRF and transformer inference engines, and a published German-legal-citation fine-tune on Hugging Face Hub.

Regex F1 is unchanged on the validation split; the transformer engine adds a measurable +4.9 pp span-overlap F1 over the regex baseline.

### New Features

- **Typed citation models** (Stream C): `LawCitation`, `CaseCitation`, `Span`, `CitationRelation`, `ExtractionResult` — frozen dataclasses with `__slots__`.
- **Strategy-based orchestrator** (Stream C): `CitationExtractor` with pluggable `Extractor` engines; default uses `RegexLawExtractor` + `RegexCaseExtractor`.
- **Output adapters** (Stream D):
  - `to_jsonl()` / `to_json()` — primary JSONL output matching the benchmark spec.
  - `to_spacy_doc()` — spaCy Doc-compatible dict (no spaCy dep required).
  - `to_hf_bio()` — HuggingFace BIO token-classification format.
  - `to_gliner()` — GLiNER span format.
  - `to_web_annotation()` — W3C Web Annotation Data Model.
  - `to_akn_ref()` — Akoma Ntoso / LegalDocML.de XML fragments.
- **Artikel / Grundgesetz support** (Stream E): `Art.` / `Artikel` patterns for German constitutional and EU law citations.
- **Short-form resolution** (Stream I): a bare `§ 5` inherits the book from a prior `§ 3 BGB`; reporter citations (BGHZ, BVerfGE, …) are linked to their prior full case citations.
- **Relation detection** (Stream I): `i.V.m.`, `vgl.`, `a.a.O.`, `ebenda`, `siehe dort` detected between adjacent citations.
- **Input format handling** (Stream J): `Document` model with format-aware normalization (plain / HTML / Markdown) and offset maps for span round-tripping.
- **Reporter citation extraction**: `BGHZ 132, 105`, `NJW 2003, 1234`, etc. — ~40 German legal reporter abbreviations recognized.
- **`STRUCTURE_KEYS`**: frozenset of 21 valid structure dict keys for `LawCitation.structure`.
- **CRF inference engine** (Stream F): `RegexCRFExtractor` with `sklearn-crfsuite` feature extractor + streaming trainer.
- **Transformer inference engine** (Stream G): `TransformerExtractor` with sliding-window tokenisation, first-token-of-word aggregation, CPU / CUDA / MPS inference, and batched `extract_batch(...)`.
- **Published default transformer model**: [`openlegaldata/legal-reference-extraction-base-de`](https://huggingface.co/openlegaldata/legal-reference-extraction-base-de) (CC BY-NC 4.0) — a fine-tune of `EuroBERT/EuroBERT-210m` for German legal law / case citation BIO tagging. `refex.engines.transformer.DEFAULT_MODEL` points at this repo, so `TransformerExtractor()` with no args loads it by default.
- **`default_unit` column** in `law_book_codes.txt`: optional tab-separated `<unit>` column (`article` / `paragraph`); when present, it overrides the text-prefix heuristic in `_law_markers_to_citations`. 23 high-confidence annotations curated (`GG` / `EUV` = article; `BGB` / `HGB` / `StGB` / `StPO` / `ZPO` / … = paragraph). New `get_unit_hint(code)` helper on the law extractor mixin.
- **Structure key-level accuracy metric** in `BenchmarkResult` (A2c): `field_accuracy['structure']` accumulates per-key `correct` / `incorrect` / `missing_pred` / `missing_gold` on exact-matched law pairs.
- **Relation-edge F1 metric** in `BenchmarkResult` (A2d): `relation_exact: PRF` scored as `(source_span, target_span, relation)` triples. The benchmark runner accepts an `extract_fn` returning either `list[Citation]` (legacy) or `(citations, relations)`.
- **`REFEX_PRECISE_BOOK_REGEX` env var**: toggles `use_precise_book_regex` at runtime for A/B measurement of the precise vs generic book-code regex. Default `True` (matches the exact-F1 optimization metric).

### Improvements

- **Precise law book regex** (B7): 1,948 law book codes loaded from the bundled data file, sorted longest-first with a generic fallback.
- **Pre-compiled regex patterns** (B5): all patterns compiled once at init instead of per call.
- **Fixed mutable class defaults** (B1, B6): `RefMarker.references` and `law_book_codes` are now instance-level.
- **Fixed `Ref.__eq__`** (B3): returns `NotImplemented` for foreign types.
- **Fixed `Ref.__hash__`** (B4): hashes the full field tuple, not `__repr__`.
- **Interval-based marker masking** in `law.py`: each extraction phase now collects match spans and applies a single O(len(content)) mask pass at the end of the phase instead of O(N × len) per-marker calls. +1 % throughput; F1 unchanged.

### Breaking

- **Removed `refex.compat.to_ref_marker_string`.** It emitted the legacy ``[ref=UUID]…[/ref]`` inline-marker string. Use `ExtractionResult.citations` directly and a serializer from `refex.serializers` (e.g. `to_jsonl`, `to_web_annotation`) for persistence / round-tripping.
- **Removed `RefMarker.replace_content`** and the `_MARKER_OPEN_FORMAT` / `_MARKER_CLOSE_FORMAT` constants — only `to_ref_marker_string` called them. `RefMarker.set_uuid` is still present for `citations_to_ref_markers`.
- **Removed dead model surface:** `BaseRef.sentence`, `RefMarker.get_length` / `get_start_position` / `get_end_position`, `Ref.get_law_repr` / `get_case_repr`, `@total_ordering` on `Ref` — none had external callers.
- **Deleted legacy `src/refex/extractors/law.py`** (410 LOC, pre-refactor) and **renamed `law_dnc.py` → `law.py`**; the divide-and-conquer extractor now lives at the canonical filename.

### Deprecations

- `RefExtractor.extract(is_html=True)` emits a `DeprecationWarning`. Use `CitationExtractor().extract(text, fmt="html")` instead.
- The `[ref=UUID]...[/ref]` marker format is deprecated. Use JSONL output via `to_jsonl()` for new integrations.

### Backward Compatibility

- `RefExtractor` is preserved and works as before.
- `RefExtractor.extract_citations()` bridges to the new typed API.
- `refex.compat.citations_to_ref_markers()` converts typed `ExtractionResult` → legacy `list[RefMarker]` for Open Legal Data's internal pipeline.
- All legacy tests remain green.

### Closed follow-ups (measured and rejected)

- **Aho–Corasick court-name index** — a pure-Python variant regressed throughput by 35.6 % because Python's C `re` engine beats a single-pass scan on typical docs; a C-backed AC dependency is out of scope.
- **Per-`(doc_id, fn_span)` court cache** — 16.9 % same-fn recurrence is real, but court resolution is position-dependent; cache-first regresses span F1, and fresh-first with cache fallback regresses court-field accuracy and throughput.

### Metrics (benchmark validation split, 821 docs)

| Engine                     | span F1 (exact) | span F1 (overlap) | Throughput  |
|----------------------------|----------------:|------------------:|------------:|
| Regex baseline (CPU)       | 0.734           | 0.815             | ~470 docs/s |
| Regex + CRF (CPU)          | 0.741           | 0.842             | ~90 docs/s  |
| Transformer EuroBERT (MPS) | 0.509           | **0.913**         | ~1.5 docs/s |
| Regex + Transformer (MPS)  | **0.743**       | 0.852             | ~1.5 docs/s |

### Tests & benchmarks

- **~347 tests** covering the new typed API, engines, adapters, document normalization, benchmark metrics (including the new structure and relation-edge metrics), law / case extractor edge cases, and the regex interval-masking / env-var / default-model internals.

## 0.4.2

- Previous release.
```

CLAUDE.md

Lines changed: 32 additions & 18 deletions
````diff
@@ -3,19 +3,28 @@
 ## Project layout
 
 - `src/refex/` — source package (src layout, installed via `pip install -e .`)
+  - `src/refex/orchestrator.py` — `CitationExtractor` (main entry point)
+  - `src/refex/citations.py` — typed citation models (`LawCitation`, `CaseCitation`, `Span`)
+  - `src/refex/document.py` — `Document` model, HTML/Markdown normalization, offset mapping
+  - `src/refex/engines/regex.py` — regex-based extraction engines
+  - `src/refex/extractors/` — internal regex engines: `law.py` (divide-and-conquer multi-ref matcher) and `case.py` (file-number + court heuristic)
+  - `src/refex/serializers.py` — output format adapters (JSONL, BIO, spaCy, etc.)
+  - `src/refex/resolver.py` — short-form citation resolution (a.a.O., ebenda, i.V.m.)
 - `src/refex/data/` — bundled data files (`law_book_codes.txt`, `file_number_codes.csv`)
-- `src/refex/extractors/` — law and case reference extractors
+- `benchmarks/` — benchmark runner, metrics, adapter, validator, fixtures
 - `tests/` — pytest test suite
-  - `tests/resources/` — test fixture files (German legal text snippets)
-  - `tests/conftest.py` — shared fixtures (`extractor`, `law_extractor`, `case_extractor`) and helpers (`assert_refs`, `get_book_codes_from_file`)
 
 ## Development commands
 
 ```
-make install  # create .venv, install editable + dev deps (auto-detects uv vs pip)
-make test     # pytest
-make lint     # ruff check + format check
-make format   # ruff auto-fix + format
+make install         # create .venv, install editable + dev deps (auto-detects uv vs pip)
+make test            # pytest
+make lint            # ruff check + format check
+make format          # ruff auto-fix + format
+make bench-ci        # benchmark against vendored CI fixtures
+make bench-dev       # benchmark against full validation split
+make bench-validate  # dataset integrity checks
+make diagnose        # error analysis on validation split
 ```
 
 ## Key conventions
@@ -25,25 +34,30 @@ make format   # ruff auto-fix + format
 - Data files accessed via `importlib.resources.files("refex") / "data"`, not `os.path`.
 - Regex strings use raw string literals (`r"..."`) to avoid escape sequence warnings.
 - Ruff rules: `E, F, I, UP, W`. Line length 120. E501 suppressed in tests (German legal text fixtures).
-- No runtime dependencies. Dev deps: `pytest`, `ruff`.
+- No runtime dependencies. Optional extras: `[adapters]` (spaCy), `[crf]` (sklearn-crfsuite), `[transformers]` (transformers + torch), `[training]` (wandb + seqeval + datasets + accelerate, for fine-tuning).
 
-## Architecture notes
+## Architecture
 
-- `RefExtractor` is the main entry point. It inherits from both `DivideAndConquerLawRefExtractorMixin` (law refs) and `CaseRefExtractorMixin` (case refs). Toggle via `do_law_refs` / `do_case_refs` bools.
-- `extract()` returns `(content_with_markers, list[RefMarker])`. Markers wrap the matched text with `[ref=UUID]...[/ref]` tags.
-- Law extraction uses a divide-and-conquer approach: first multi-refs (`§§`), then single-refs (`§`), masking matched regions to prevent double-matching.
+- **`CitationExtractor`** (orchestrator.py) is the public API. It runs multiple `Extractor` engines and merges results.
+- **`RegexLawExtractor`** + **`RegexCaseExtractor`** (engines/regex.py) wrap the internal extractors in `extractors/law.py` and `extractors/case.py` and emit typed `LawCitation`/`CaseCitation` objects. The default transformer engine (`engines/transformer.py`) loads `openlegaldata/legal-reference-extraction-base-de` (EuroBERT-210m fine-tune).
+- **`Document`** (document.py) wraps input with format metadata. Supports plain text, HTML, and Markdown. HTML/Markdown is normalized to plain text with character-level offset maps for span round-tripping.
+- **Legacy `RefExtractor`** (extractor.py) is deprecated but preserved for backward compatibility. It internally delegates to the mixin extractors, which produce `RefMarker`/`Ref` objects (internal types, not public API).
+- Law extraction uses divide-and-conquer: multi-refs (`§§`) first, then single-refs (`§`), masking matched regions.
 - Case extraction finds file numbers via regex, then heuristically searches surrounding text for court names.
-- `law_book_context` attribute enables within-book extraction (sections without explicit book codes).
+
+## Benchmark
+
+- Benchmark data lives in the sibling project `german-legal-references-benchmark` (HF Arrow dataset).
+- CI fixtures vendored in `benchmarks/fixtures/` (plain-text, HTML, and Markdown docs).
+- All optimization uses the **validation split only**. The test split is reserved for final evaluation.
 
 ## Testing
 
-- 42 tests, 4 skipped (known unsupported patterns marked with `@pytest.mark.skip`).
 - Tests use fixtures from `conftest.py`. The `assert_refs` helper extracts from content and compares sorted ref lists.
-- Test resource files in `tests/resources/law/` and `tests/resources/case/` contain German legal text snippets.
-
+- `test_format_fixtures.py` — integration tests for HTML and Markdown input with real court decision fixtures.
+- `test_document.py` — `Document` model, normalization, offset mapping, format detection.
 
 ## Git
 
-- Before commiting run "make lint" and "make test"
+- Before committing run `make lint` and `make test`
 - Use prefix branches: chore/, fix/, feat/
-
````
