Legal Reference Extraction

Extract citations from German legal documents — law references (§ 433 BGB) and case references (BGH, VIII ZR 295/01).

Used by de.openlegaldata.io.

Supported Python versions: 3.11, 3.12, 3.13 (tested on every CI run).

Install

pip install legal-reference-extraction

# or from git
pip install git+https://github.com/openlegaldata/legal-reference-extraction.git

# local dev
make install

Usage

from refex.orchestrator import CitationExtractor

extractor = CitationExtractor()
result = extractor.extract("Die Entscheidung beruht auf § 42 VwGO.")

for cit in result.citations:
    print(cit.type, cit.span.text)
# law § 42 VwGO

Input formats

Plain text, HTML, and Markdown are supported. The format is auto-detected by default; pass fmt to set it explicitly:

# HTML — tags are stripped, entities decoded, spans map to plain text
result = extractor.extract("<p>Gemäß &#167; 433 BGB ist der Käufer verpflichtet.</p>", fmt="html")

# Markdown — formatting markers stripped
result = extractor.extract("Gemäß **§ 433 BGB** ist der Käufer verpflichtet.", fmt="markdown")

# Auto-detect (based on content sniffing); html_content is a str holding your document
result = extractor.extract(html_content)

For HTML and Markdown input, span offsets refer to the canonical plain-text projection, not to the raw input. Use map_span_to_raw to recover positions in the original:

from refex.document import Document, map_span_to_raw

doc = Document(raw="<p>§ 433 BGB</p>", format="html")
result = extractor.extract(doc)
for cit in result.citations:
    raw_span = map_span_to_raw(cit.span, doc)
    print(f"{cit.span.text} → raw[{raw_span.start}:{raw_span.end}]")
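Why a mapping step is needed can be shown with a small standalone sketch. It strips tags while recording, for every plain-text character, its index in the raw string. This does not use refex and is not how map_span_to_raw is implemented; the names strip_tags_with_offsets and plain_span_to_raw are hypothetical, and the sketch handles tags only (no entity decoding).

```python
import re

TAG = re.compile(r"<[^>]+>")


def strip_tags_with_offsets(raw: str) -> tuple[str, list[int]]:
    """Return (plain_text, offsets); offsets[i] is the raw index of plain_text[i]."""
    chars: list[str] = []
    offsets: list[int] = []
    pos = 0
    for match in TAG.finditer(raw):
        # Copy the text between the previous tag and this one.
        for i in range(pos, match.start()):
            chars.append(raw[i])
            offsets.append(i)
        pos = match.end()
    # Copy whatever follows the last tag.
    for i in range(pos, len(raw)):
        chars.append(raw[i])
        offsets.append(i)
    return "".join(chars), offsets


def plain_span_to_raw(start: int, end: int, offsets: list[int]) -> tuple[int, int]:
    """Map a non-empty half-open plain-text span [start, end) to raw offsets."""
    return offsets[start], offsets[end - 1] + 1


raw = "<p>Gemäß <b>§ 433 BGB</b> ist der Käufer verpflichtet.</p>"
plain, offsets = strip_tags_with_offsets(raw)
start = plain.index("§ 433 BGB")
end = start + len("§ 433 BGB")
raw_start, raw_end = plain_span_to_raw(start, end, offsets)
print(raw[raw_start:raw_end])
# § 433 BGB
```

The citation sits at offset 6 in the plain text but at offset 12 in the raw HTML, which is the gap a span-mapping helper has to bridge.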

Output formats

from refex.serializers import to_jsonl, to_hf_bio, to_gliner, to_spacy_doc, to_web_annotation, to_akn_ref

to_jsonl(result, doc_id="example")      # JSONL (primary format)
to_hf_bio(result, text)                 # HuggingFace BIO tags
to_gliner(result)                       # GLiNER span format
to_spacy_doc(result, text)              # spaCy Doc dict
to_web_annotation(result)               # W3C Web Annotation
to_akn_ref(result, text)                # Akoma Ntoso XML
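For readers unfamiliar with BIO tagging, the following standalone sketch shows the general idea of turning character spans into token-level B-/I-/O labels. It is an illustration only: the function spans_to_bio, the LAW label, and the simple regex tokenizer are assumptions, and the actual output of to_hf_bio may be structured differently.

```python
import re

# Hypothetical tokenizer: runs of word characters, or single punctuation marks.
TOKEN = re.compile(r"\w+|[^\w\s]")


def spans_to_bio(text: str, spans: list[tuple[int, int, str]]) -> list[tuple[str, str]]:
    """Tag tokens with BIO labels.

    spans holds (start, end, label) character offsets, end exclusive.
    """
    tagged: list[tuple[str, str]] = []
    for match in TOKEN.finditer(text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span if it lies fully inside it.
            if match.start() >= start and match.end() <= end:
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{label}"
                break
        tagged.append((match.group(), tag))
    return tagged


text = "Die Entscheidung beruht auf § 42 VwGO."
start = text.index("§ 42 VwGO")
spans = [(start, start + len("§ 42 VwGO"), "LAW")]
tags = spans_to_bio(text, spans)
for token, tag in tags:
    print(token, tag)
```

The first token of a citation gets B-LAW, the following tokens inside it get I-LAW, and everything else (including the trailing full stop) gets O.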

Examples

Law references — § and §§ patterns with section numbers and law book codes:

result = extractor.extract(
    "Bar und bar §§ 1, 2 Abs. 2, 3, 10 Abs. 1 Nr. 1 BGB foo."
)
for cit in result.citations:
    print(cit.book, cit.number)
# bgb 1
# bgb 2
# bgb 3
# bgb 10

Cross-references — i.V.m. (in conjunction with) linking sections across law books:

result = extractor.extract(
    "Die vorläufige Vollstreckbarkeit folgt aus "
    "§ 167 VwGO i.V.m. §§ 708 Nr. 11, 711 ZPO."
)
for cit in result.citations:
    print(cit.book, cit.number)
# vwgo 167
# zpo 708
# zpo 711
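The two examples above both rely on expanding an enumeration after § or §§ into one citation per section. A rough standalone sketch of that idea follows. It is not the grammar refex uses: the pattern MULTI_REF and the function expand_sections are hypothetical, and the sketch assumes that every comma-separated item starts a new section number and that a law book code is a run of letters ending in an uppercase letter.

```python
import re

# Hypothetical pattern: "§" or "§§", an enumeration starting with a digit,
# then a law book code such as BGB, ZPO or VwGO.
MULTI_REF = re.compile(r"§§?\s*(?P<items>\d.*?)\s+(?P<book>[A-Z][A-Za-z]*[A-Z])\b")


def expand_sections(text: str) -> list[tuple[str, str]]:
    """Return (book, section) pairs for simple § / §§ enumerations."""
    refs: list[tuple[str, str]] = []
    for match in MULTI_REF.finditer(text):
        book = match.group("book").lower()
        for item in match.group("items").split(","):
            # The leading number of each item is taken as the section;
            # qualifiers such as "Abs. 2" or "Nr. 11" are ignored.
            number = re.match(r"\s*(\d+[a-z]?)", item)
            if number:
                refs.append((book, number.group(1)))
    return refs


simple = expand_sections("Bar und bar §§ 1, 2 Abs. 2, 3, 10 Abs. 1 Nr. 1 BGB foo.")
cross = expand_sections("§ 167 VwGO i.V.m. §§ 708 Nr. 11, 711 ZPO.")
print(simple)
print(cross)
```

Abbreviations like Abs. and Nr. are not mistaken for a law book here because they do not end in an uppercase letter; real-world text has many more edge cases than this heuristic covers.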

Case references — court names and file numbers:

result = extractor.extract(
    "Das OVG Schleswig habe bereits in seinem Urteil vom 22.04.2010 "
    "(1 KN 19/09) entschieden."
)
for cit in result.citations:
    print(cit.court, cit.file_number)
# OVG Schleswig 1 KN 19/09

Artikel / Grundgesetz: Art. (article) references, such as those to the Grundgesetz (GG, the German Basic Law), are supported:

result = extractor.extract("Gemäß Art. 12 Abs. 1 GG besteht Berufsfreiheit.")

Law book context — extract bare § references within a specific law:

extractor = CitationExtractor()
# ... set law_book_context on the underlying engine if needed

Legacy API

The old RefExtractor API is still available but deprecated:

from refex.extractor import RefExtractor

extractor = RefExtractor()
content, markers = extractor.extract("Ein Satz mit § 3b AsylG.")
# Note: content no longer contains [ref=UUID] markers (deprecated in v0.7.0)

Development

make install   # create venv + install in editable mode with dev deps
make test      # run pytest (271 tests)
make lint      # ruff check + format check
make format    # auto-fix lint + format

Benchmark

Run the extraction benchmark against gold-annotated German legal documents:

make bench-ci           # vendored CI subset (15 docs, no external data needed)
make bench-dev          # full validation split (821 docs)
make bench-quick        # quick check (50 docs on validation)
make bench-validate     # dataset integrity checks
make diagnose           # error analysis

Current metrics (validation split, 821 docs):

Metric             Value
Span F1 (exact)    0.734
Case F1 (exact)    0.613
Law F1 (exact)     0.797
Throughput         418 docs/s

See benchmarks/README.md for details.
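As a reading aid for the "exact" F1 figures, here is a minimal sketch of the usual exact-match formulation: a predicted span counts as correct only if its start and end offsets both equal those of a gold span. Whether the benchmark computes its scores exactly this way is an assumption; benchmarks/README.md is the authoritative description. The function span_f1 and the sample spans are made up for illustration.

```python
def span_f1(
    gold: set[tuple[int, int]], predicted: set[tuple[int, int]]
) -> tuple[float, float, float]:
    """Return (precision, recall, f1) for exact (start, end) span matches."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative spans: one exact hit, one near miss (wrong end), one gold span missed.
gold = {(12, 21), (40, 52), (60, 75)}
predicted = {(12, 21), (40, 50)}
precision, recall, f1 = span_f1(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Note that the near miss (40, 50) earns no partial credit, which is why exact-match scores tend to be lower than overlap-based ones.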

Optional extras

The base install has zero runtime dependencies. Inference engines and format adapters live in opt-in extras — pick the ones you need:

pip install "legal-reference-extraction[adapters]"     # spaCy adapter for to_spacy_doc
pip install "legal-reference-extraction[crf]"          # CRF engine  (~30 MB, sklearn-crfsuite)
pip install "legal-reference-extraction[transformers]" # transformer engine (~2 GB, transformers + torch)
pip install "legal-reference-extraction[training]"     # fine-tuning utilities (wandb, seqeval, datasets, accelerate)

Most users pick exactly one inference engine ([crf] or [transformers]). [training] is only needed when fine-tuning a transformer via scripts/train_transformer.py.

The default transformer model is openlegaldata/legal-reference-extraction-base-de (a fine-tune of EuroBERT/EuroBERT-210m, CC BY-NC 4.0), so TransformerExtractor() with no arguments downloads and uses it automatically. To use a different model, pass TransformerExtractor(model="..."). The benchmark harness also honours the REFEX_TRANSFORMER_MODEL and REFEX_TRANSFORMER_DEVICE environment variables for quick A/B runs.

License

MIT
