Commit aabad27
feat: Refactor 2026 — typed citation API, multi-engine orchestrator, format-aware I/O, published transformer model
Major refactoring of the extraction pipeline. The regex extraction
logic is preserved; this change adds a typed API layer, multiple
output adapters, short-form citation resolution, format-aware input
handling, optional CRF and transformer inference engines, and a
published German-legal-citation fine-tune on Hugging Face Hub.
Regex F1 is unchanged on the validation split; the transformer
engine alone raises span-overlap F1 by 9.8 pp over the regex
baseline (0.815 to 0.913).
## New public API
- **Typed citation models** (`refex.citations`): `LawCitation`,
`CaseCitation`, `Span`, `CitationRelation`, `ExtractionResult` —
frozen dataclasses with `__slots__`.
- **`CitationExtractor`** orchestrator with pluggable `Extractor`
engines; default = regex for law + case.
- **`Document`** with format-aware normalization
(plain / HTML / Markdown) and character-level offset maps for
span round-tripping.
- **Output adapters** in `refex.serializers`:
`to_jsonl` / `to_json` (primary), `to_spacy_doc`, `to_hf_bio`,
`to_gliner`, `to_web_annotation`, `to_akn_ref` (Akoma Ntoso /
LegalDocML.de).
## Extraction improvements
- **Artikel / Grundgesetz support**: `Art.` / `Artikel` patterns
for constitutional and EU law citations.
- **Reporter citations**: `BGHZ 132, 105`, `NJW 2003, 1234`, etc.
(~40 German reporter abbreviations).
- **Short-form resolution**: bare `§ 5` inherits book from a prior
`§ 3 BGB`; reporter citations linked to full case citations.
- **Relation detection**: `i.V.m.`, `vgl.`, `a.a.O.`, `ebenda`,
`siehe dort` as citation edges.
- **Precise law book regex**: 1,948 codes loaded longest-first with
a generic fallback; `REFEX_PRECISE_BOOK_REGEX` env var for
A/B measurement.
- **`default_unit` hints** in `law_book_codes.txt` (optional TSV
column): 23 well-known codes annotated article/paragraph;
overrides the text-prefix heuristic.
- **Interval-based marker masking** in the divide-and-conquer law
extractor — one O(len) pass per phase instead of O(N × len).
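The longest-first book regex and the short-form resolution described above can be sketched together. This is an illustrative stand-in, not the code in this commit: `BOOK_CODES` is a four-entry placeholder for the 1,948-code list, and `SECTION_RE` and `extract_sections` are hypothetical names.

```python
import re

# Placeholder for the full code list; these four entries are illustrative.
BOOK_CODES = ["BGB", "SGB", "SGB V", "GG"]

# Longest-first ordering lets "SGB V" win over its prefix "SGB".
_books = "|".join(
    re.escape(code) for code in sorted(BOOK_CODES, key=len, reverse=True)
)
SECTION_RE = re.compile(
    rf"§\s*(?P<number>\d+[a-z]?)(?:\s+(?P<book>{_books})\b)?"
)


def extract_sections(text: str) -> list[dict]:
    """Return section citations in text order.

    A bare citation (no book) inherits the book of the closest preceding
    citation that named one; `inherited` records when that happened.
    """
    results = []
    last_book = None
    for match in SECTION_RE.finditer(text):
        book = match.group("book")
        inherited = book is None and last_book is not None
        if book is not None:
            last_book = book
        results.append(
            {
                "span": (match.start(), match.end()),
                "number": match.group("number"),
                "book": book if book is not None else last_book,
                "inherited": inherited,
            }
        )
    return results


text = "Nach § 3 BGB gilt dies; § 5 bleibt unberührt. Vgl. § 12 SGB V."
for citation in extract_sections(text):
    print(citation["number"], citation["book"], citation["inherited"])
# 3 BGB False
# 5 BGB True
# 12 SGB V False
```

With the alternation sorted the other way, `§ 12 SGB V` would resolve to `SGB`, because the shorter code already ends on a word boundary.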
## New inference engines (optional extras)
- **CRF engine** (`[crf]`): `sklearn-crfsuite` + streaming trainer.
- **Transformer engine** (`[transformers]`): HuggingFace token
classification with sliding-window tokenisation, first-token-of-
word aggregation, CPU/CUDA/MPS, and batched `extract_batch(...)`.
- **Default transformer model published** to Hugging Face:
`openlegaldata/legal-reference-extraction-base-de` (CC BY-NC 4.0)
— a fine-tune of `EuroBERT/EuroBERT-210m`.
- **Training split**: `[training]` extra adds `wandb` / `seqeval` /
`datasets` / `accelerate`; `scripts/train_transformer.py` and
`scripts/export_bio.py` for in-repo fine-tuning.
## Benchmark harness
- New `benchmarks/` package: adapter, metrics, runner, dataset
validator, CI fixtures.
- Metrics: span F1 (exact + overlap), per-type F1, field accuracy
(book, number, court, file_number, plus key-level **structure**),
and **relation-edge F1** over `(source_span, target_span, relation)`
triples.
- `make bench-ci` (vendored fixtures) / `make bench-dev`
(validation split) / `make bench-test` (final lock-in).
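For readers unfamiliar with the two span metrics, here is a minimal sketch of how exact and overlap span F1, and a triple-based edge F1, can be computed. It is not the code in `benchmarks/`; the greedy one-to-one matching and the requirement that types agree are assumptions of this example, and all names are hypothetical.

```python
Span = tuple[int, int, str]  # (start, end, type), end exclusive


def _f1(tp: int, n_pred: int, n_gold: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def _overlaps(a: Span, b: Span) -> bool:
    """Same type and at least one shared character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def span_f1(gold: list[Span], pred: list[Span], mode: str = "exact") -> float:
    """Span F1 with greedy one-to-one matching (an assumption of this sketch)."""
    unmatched = list(gold)
    tp = 0
    for p in pred:
        for g in unmatched:
            hit = p == g if mode == "exact" else _overlaps(p, g)
            if hit:
                unmatched.remove(g)  # each gold span is matched at most once
                tp += 1
                break
    return _f1(tp, len(pred), len(gold))


def edge_f1(gold: set[tuple], pred: set[tuple]) -> float:
    """F1 over (source_span, target_span, relation) triples, compared exactly."""
    return _f1(len(gold & pred), len(pred), len(gold))


gold = [(0, 7, "law"), (20, 35, "case"), (50, 58, "law")]
pred = [(0, 7, "law"), (22, 35, "case"), (70, 75, "law")]
print(round(span_f1(gold, pred, "exact"), 3))    # 0.333
print(round(span_f1(gold, pred, "overlap"), 3))  # 0.667
```

The second prediction misses the gold boundary by two characters, so it counts only under the overlap criterion; this is the kind of gap that separates the two F1 columns in the metrics table.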
## Breaking
- Removed `refex.compat.to_ref_marker_string` and the
`[ref=UUID]...[/ref]` inline-marker machinery
(`RefMarker.replace_content`, `_MARKER_*_FORMAT`).
- Removed dead model fields and helpers: `BaseRef.sentence`,
`RefMarker.get_length` / `get_start_position` / `get_end_position`,
`Ref.get_law_repr` / `get_case_repr`, `@total_ordering` on `Ref`.
- Deleted legacy `src/refex/extractors/law.py` (pre-refactor,
410 LOC); renamed `law_dnc.py` → `law.py`.
- Split `[ml]` extra into `[crf]` + `[transformers]` (users pick
one). CI matrix restructured: minimal install + tests on 3.11 /
3.12 / 3.13; full install + coverage on 3.12 only.
## Backward compatibility
- `RefExtractor` preserved and fully working;
`RefExtractor.extract_citations()` bridges to the typed API.
- `refex.compat.citations_to_ref_markers()` converts
`ExtractionResult` → legacy `list[RefMarker]` for Open Legal
Data's internal pipeline.
## Metrics (validation split, 821 docs)
| Engine | span F1 exact | span F1 overlap | Throughput |
|----------------------------|--------------:|----------------:|-----------:|
| Regex (CPU) | 0.734 | 0.815 | ~470 docs/s |
| Regex + CRF (CPU) | 0.741 | 0.842 | ~90 docs/s |
| Transformer EuroBERT (MPS) | 0.509 | **0.913** | ~1.5 docs/s |
| Regex + Transformer (MPS) | **0.743** | 0.852 | ~1.5 docs/s |
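The differences between engines can be read directly off the table. The short sketch below only redoes that arithmetic in percentage points relative to the regex row; the dictionary simply restates the table values and `deltas` is a name chosen for this example.

```python
# (span F1 exact, span F1 overlap) per engine, copied from the table above.
ROWS = {
    "Regex (CPU)": (0.734, 0.815),
    "Regex + CRF (CPU)": (0.741, 0.842),
    "Transformer EuroBERT (MPS)": (0.509, 0.913),
    "Regex + Transformer (MPS)": (0.743, 0.852),
}

base_exact, base_overlap = ROWS["Regex (CPU)"]

# Difference to the regex baseline in percentage points, rounded to 0.1 pp.
deltas = {
    name: (
        round((exact - base_exact) * 100, 1),
        round((overlap - base_overlap) * 100, 1),
    )
    for name, (exact, overlap) in ROWS.items()
    if name != "Regex (CPU)"
}

for name, (d_exact, d_overlap) in deltas.items():
    print(f"{name}: exact {d_exact:+.1f} pp, overlap {d_overlap:+.1f} pp")
# Regex + CRF (CPU): exact +0.7 pp, overlap +2.7 pp
# Transformer EuroBERT (MPS): exact -22.5 pp, overlap +9.8 pp
# Regex + Transformer (MPS): exact +0.9 pp, overlap +3.7 pp
```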
## Tests
~347 tests covering the typed API, all engines, adapters, document
normalization, benchmark metrics (structure + relations),
law / case extractor edge cases, and internal helpers
(`_apply_mask_intervals`, `REFEX_PRECISE_BOOK_REGEX`,
`DEFAULT_MODEL`). Lint + format clean; CI coverage gate ≥ 75 % on
the full-install 3.12 job.
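As background for the `_apply_mask_intervals` tests, interval-based masking can be sketched as follows. This is an assumed shape, not the helper's real signature: `mask_intervals` is a hypothetical function that blanks out already-extracted ranges in a single left-to-right pass while keeping the text length, so later offsets stay valid.

```python
def mask_intervals(text: str, intervals: list[tuple[int, int]], fill: str = "_") -> str:
    """Replace every [start, end) range with `fill`, preserving length.

    The text is copied once, piece by piece, instead of being rewritten once
    per interval; overlapping intervals are clipped against the cursor.
    """
    parts = []
    cursor = 0
    for start, end in sorted(intervals):
        start = max(start, cursor)  # skip the part already masked
        if start >= end:
            continue
        parts.append(text[cursor:start])     # untouched text before the range
        parts.append(fill * (end - start))   # same-length placeholder
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


text = "§ 3 BGB und § 5 StGB"
masked = mask_intervals(text, [(0, 7), (12, 20)])
print(masked)  # _______ und ________
```

Keeping the length constant is what lets a later extraction phase report spans against the original text without any offset correction.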
Refactor workspace docs (architecture review, implementation plan,
optimization log, transformer training log, benchmark spec) are
preserved on the `docs/refactor2026` branch and are not shipped on
master.
1 parent 4473131 commit aabad27
61 files changed
Lines changed: 9973 additions & 1444 deletions
File tree
- .github/workflows
- benchmarks
- fixtures
- schemas
- scripts
- src/refex
- data
- engines
- extractors
- tests