Extract citations from German legal documents — law references (§ 433 BGB)
and case references (BGH, VIII ZR 295/01).
Used by de.openlegaldata.io.
Supported Python versions: 3.11, 3.12, 3.13 (tested on every CI run).
pip install legal-reference-extraction
# or from git
pip install git+https://github.com/openlegaldata/legal-reference-extraction.git
# local dev
make install

from refex.orchestrator import CitationExtractor
extractor = CitationExtractor()
result = extractor.extract("Die Entscheidung beruht auf § 42 VwGO.")
for cit in result.citations:
    print(cit.type, cit.span.text)
# law § 42 VwGO

Plain text, HTML, and Markdown are supported. Format is auto-detected or can be set explicitly:
# HTML — tags are stripped, entities decoded, spans map to plain text
result = extractor.extract("<p>Gemäß § 433 BGB ist der Käufer verpflichtet.</p>", fmt="html")
# Markdown — formatting markers stripped
result = extractor.extract("Gemäß **§ 433 BGB** ist der Käufer verpflichtet.", fmt="markdown")
# Auto-detect (based on content sniffing)
result = extractor.extract(html_content)

For HTML and Markdown input, span offsets reference the canonical plain-text
projection. Use map_span_to_raw to recover positions in the original:
from refex.document import Document, map_span_to_raw
doc = Document(raw="<p>§ 433 BGB</p>", format="html")
result = extractor.extract(doc)
for cit in result.citations:
    raw_span = map_span_to_raw(cit.span, doc)
    print(f"{cit.span.text} → raw[{raw_span.start}:{raw_span.end}]")

from refex.serializers import to_jsonl, to_hf_bio, to_gliner, to_spacy_doc, to_web_annotation, to_akn_ref
to_jsonl(result, doc_id="example") # JSONL (primary format)
to_hf_bio(result, text) # HuggingFace BIO tags
to_gliner(result) # GLiNER span format
to_spacy_doc(result, text) # spaCy Doc dict
to_web_annotation(result) # W3C Web Annotation
to_akn_ref(result, text) # Akoma Ntoso XML

Law references (§ and §§ patterns with section numbers and law book codes):
result = extractor.extract(
    "Bar und bar §§ 1, 2 Abs. 2, 3, 10 Abs. 1 Nr. 1 BGB foo."
)
for cit in result.citations:
    print(cit.book, cit.number)
# bgb 1
# bgb 2
# bgb 3
# bgb 10

Cross-references (i.V.m., "in conjunction with", linking sections across law books):
result = extractor.extract(
    "Die vorläufige Vollstreckbarkeit folgt aus "
    "§ 167 VwGO i.V.m. §§ 708 Nr. 11, 711 ZPO."
)
for cit in result.citations:
    print(cit.book, cit.number)
# vwgo 167
# zpo 708
# zpo 711

Case references (court names and file numbers):
result = extractor.extract(
    "Das OVG Schleswig habe bereits in seinem Urteil vom 22.04.2010 "
    "(1 KN 19/09) entschieden."
)
for cit in result.citations:
    print(cit.court, cit.file_number)
# OVG Schleswig 1 KN 19/09

Artikel / Grundgesetz (Art. references are supported):
result = extractor.extract("Gemäß Art. 12 Abs. 1 GG besteht Berufsfreiheit.")

Law book context (extract bare § references within a specific law):
extractor = CitationExtractor()
# ... set law_book_context on the underlying engine if needed

The old RefExtractor API is still available but deprecated:
from refex.extractor import RefExtractor
extractor = RefExtractor()
content, markers = extractor.extract("Ein Satz mit § 3b AsylG.")
# Note: content no longer contains [ref=UUID] markers (deprecated in v0.7.0)

make install # create venv + install in editable mode with dev deps
make test # run pytest (271 tests)
make lint # ruff check + format check
make format # auto-fix lint + format

Run the extraction benchmark against gold-annotated German legal documents:
make bench-ci # vendored CI subset (15 docs, no external data needed)
make bench-dev # full validation split (821 docs)
make bench-quick # quick check (50 docs on validation)
make bench-validate # dataset integrity checks
make diagnose # error analysis

Current metrics (validation split, 821 docs):
| Metric | Value |
|---|---|
| Span F1 (exact) | 0.734 |
| Case F1 (exact) | 0.613 |
| Law F1 (exact) | 0.797 |
| Throughput | 418 docs/s |
See benchmarks/README.md for details.
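
For readers unfamiliar with the metric, exact-match span F1 can be sketched as follows. This is an illustration only: it assumes a predicted span counts as correct when its start offset, end offset, and type all match a gold span. The name span_f1 and the tuple layout are hypothetical and not taken from the benchmark harness, whose scoring rules may differ.

```python
def span_f1(gold, predicted):
    """Exact-match F1 over (start, end, label) tuples (hypothetical helper)."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)  # spans matching on all three fields
    if true_pos == 0:
        return 0.0  # also covers empty gold or empty predictions
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# One span matches exactly; the other is off by one character and counts as wrong.
gold = [(0, 9, "law"), (20, 30, "case")]
pred = [(0, 9, "law"), (21, 30, "case")]
score = span_f1(gold, pred)
```

Under exact matching, a boundary error of a single character costs both a false positive and a false negative, which is one reason exact scores tend to sit below overlap-based ones.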
The base install has zero runtime dependencies. Inference engines and format adapters live in opt-in extras — pick the ones you need:
pip install "legal-reference-extraction[adapters]" # spaCy adapter for to_spacy_doc
pip install "legal-reference-extraction[crf]" # CRF engine (~30 MB, sklearn-crfsuite)
pip install "legal-reference-extraction[transformers]" # transformer engine (~2 GB, transformers + torch)
pip install "legal-reference-extraction[training]" # fine-tuning utilities (wandb, seqeval, datasets, accelerate)

Most users pick exactly one inference engine ([crf] or
[transformers]). [training] is only needed when fine-tuning a
transformer via scripts/train_transformer.py.
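
To check which engine extras are present in an environment, something like the sketch below may help. It only probes whether the third-party packages named above (import names sklearn_crfsuite and transformers) are importable. The helper available_engines and the ENGINE_MODULES mapping are hypothetical and not part of this package.

```python
import importlib.util

# Extra name -> import name of the third-party package it is described as installing.
ENGINE_MODULES = {"crf": "sklearn_crfsuite", "transformers": "transformers"}


def available_engines(find_spec=importlib.util.find_spec):
    """Return the extras whose backing package can be found (hypothetical helper)."""
    return [
        extra
        for extra, module in ENGINE_MODULES.items()
        if find_spec(module) is not None  # find_spec returns None for a missing top-level module
    ]


print(available_engines())
```

The find_spec parameter is injectable only so the function can be exercised without installing either extra.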
The default transformer model is
openlegaldata/legal-reference-extraction-base-de
(a fine-tune of EuroBERT/EuroBERT-210m, CC BY-NC 4.0) — so
TransformerExtractor() with no arguments downloads and uses it
automatically. Override via TransformerExtractor(model="...").
The benchmark harness also honours REFEX_TRANSFORMER_MODEL /
REFEX_TRANSFORMER_DEVICE env vars for quick A/B runs.
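
A minimal sketch of how such overrides could be resolved is shown below. How the harness actually reads these variables is not shown here, so the fallback behaviour (an empty or unset variable means "use the default") and the helper name resolve_transformer_settings are assumptions. The example model name my-org/candidate-b is a placeholder.

```python
import os

DEFAULT_MODEL = "openlegaldata/legal-reference-extraction-base-de"


def resolve_transformer_settings(env=None):
    """Read model/device overrides from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: unset or empty variables fall back to the defaults.
    model = env.get("REFEX_TRANSFORMER_MODEL") or DEFAULT_MODEL
    device = env.get("REFEX_TRANSFORMER_DEVICE") or None  # None: leave the choice to the engine
    return model, device


# A mapping stands in for os.environ so the example does not depend on the shell.
model, device = resolve_transformer_settings(
    {"REFEX_TRANSFORMER_MODEL": "my-org/candidate-b", "REFEX_TRANSFORMER_DEVICE": "cpu"}
)
```

The resolved model name could then be passed on as TransformerExtractor(model=model). Whether the constructor also accepts a device argument is not stated here.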
License: MIT