Information extraction for unstructured text: named entity recognition (NER), coreference resolution, entity linking, relation extraction, and zero-shot entity types.
Dual-licensed under MIT or Apache-2.0. MSRV: 1.85.
The crate is published as anno on crates.io.
```toml
[dependencies]
anno = "0.3.9"
```

The default feature (`onnx`) pulls in `ort` (ONNX Runtime C++ bindings). For builds without a C++ runtime, use `default-features = false`.
```rust
use anno::{Model, StackedNER};

let m = StackedNER::default();
let ents = m.extract_entities("Sophie Wilson designed the ARM processor.", None)?;
assert!(!ents.is_empty());
# Ok::<(), anno::Error>(())
```

`StackedNER::default()` selects the best available backend at runtime: GLiNER (if `onnx` is enabled and the model is cached), falling back to heuristic + pattern extraction. Set `ANNO_NO_DOWNLOADS=1` or `HF_HUB_OFFLINE=1` to force cached-only behavior.
Zero-shot custom types via `DynamicLabels` (GLiNER, GLiNER2, NuNER):

```rust
use anno::{DynamicLabels, GLiNEROnnx};

let m = GLiNEROnnx::new("onnx-community/gliner_small-v2.1")?;
let ents = m.extract_with_labels(
    "Aspirin treats headaches.",
    &["drug", "symptom"],
    None,
)?;
# Ok::<(), anno::Error>(())
```

| Feature | Default | Description |
|---|---|---|
| `onnx` | Yes | ONNX Runtime backends (GLiNER, NuNER, BERT, W2NER, FCoref) via `ort` |
| `candle` | No | Pure-Rust Candle backends (no C++ runtime needed) |
| `analysis` | No | Evaluation helpers, coref metrics, RAG rewriting |
| `eval` | No | Alias for `analysis` (kept for backward compatibility) |
| `graph` | No | Keywords, summarization, salience (TextRank/LexRank/YAKE) |
| `schema` | No | JSON Schema generation for output types via `schemars` |
| `llm` | No | LLM-based extraction via OpenRouter, Anthropic, Groq, Gemini, Ollama |
| `discourse` | No | Centering theory, abstract anaphora, shell nouns |
Named entity recognition. Given input text, identify spans (start, end, type, confidence) where each span denotes a named entity [1, 2]. Entity types follow standard taxonomies (PER, ORG, LOC, MISC for CoNLL-style [2]) or caller-defined labels for zero-shot extraction. Offsets are character offsets (Unicode scalar values), not byte offsets; see CONTRACT.md.
Coreference resolution. Identify mention spans and group them into clusters, where each cluster tracks a consistent discourse referent within the document [3, 4]. Coreferring mentions span proper names, definite descriptions, and pronouns: "Sophie Wilson", "the designer", and "She" form one cluster. Multiple backends: rule-based sieves (SimpleCorefResolver), neural (FCoref, 78.5 F1 on CoNLL-2012), and mention-ranking (MentionRankingCoref).
Relation extraction. Extract typed (head, relation, tail) triples from text. Available on RelationCapable backends (gliner2, tplinker); others fall back to co-occurrence edges for graph export.
Structured pattern extraction. Dates, monetary amounts, email addresses, URLs, phone numbers via deterministic regex grammars.
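The pattern backend's grammars are internal to the crate; as a rough illustration of the idea of deterministic extraction (not anno's actual grammar), here is a naive whitespace-scan email matcher in plain Rust that returns character-offset spans:

```rust
// Illustrative only: a naive deterministic email matcher, NOT anno's grammar.
// Returns (start_char, end_char, text) spans using character offsets.
fn find_emails(text: &str) -> Vec<(usize, usize, String)> {
    let mut out = Vec::new();
    let chars: Vec<char> = text.chars().collect();
    let mut start = 0usize; // char offset where the current token begins
    let mut i = 0;
    while i <= chars.len() {
        let at_end = i == chars.len();
        if at_end || chars[i].is_whitespace() {
            if i > start {
                let tok: String = chars[start..i].iter().collect();
                // Strip trailing sentence punctuation before checking.
                let trimmed = tok.trim_end_matches(|c: char| ".,;:!?".contains(c));
                // Crude shape check: non-empty local part, dot after the '@'.
                if let Some(at) = trimmed.find('@') {
                    if at > 0 && trimmed[at + 1..].contains('.') {
                        out.push((start, start + trimmed.chars().count(), trimmed.to_string()));
                    }
                }
            }
            start = i + 1;
        }
        i += 1;
    }
    out
}

fn main() {
    let spans = find_emails("Contact jobs@acme.com by March 15.");
    assert_eq!(spans, vec![(8, 21, "jobs@acme.com".to_string())]);
}
```

The real backend uses compiled regex grammars per type; the point here is only that deterministic extraction needs no model weights and yields the same span shape as the ML backends.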
All backends produce the same output type: variable-length spans with character offsets.
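A quick illustration in plain Rust (no anno types) of why character offsets differ from byte offsets on non-ASCII text, and how to map a char-offset span back to a string slice:

```rust
// Convert a (start, end) span in character offsets (Unicode scalar values)
// into a &str slice by mapping char offsets to byte offsets.
fn slice_by_chars(text: &str, start: usize, end: usize) -> &str {
    let b0 = text.char_indices().nth(start).map(|(b, _)| b).unwrap_or(text.len());
    let b1 = text.char_indices().nth(end).map(|(b, _)| b).unwrap_or(text.len());
    &text[b0..b1]
}

fn main() {
    let text = "Café Zoé — Paris";
    // "Paris" spans chars 11..16, but bytes 15..20 ('é' and '—' are multi-byte).
    assert_eq!(slice_by_chars(text, 11, 16), "Paris");
    assert_ne!(&text[11..16], "Paris"); // byte slicing hits the wrong span
}
```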
| Backend | Architecture | Labels | Zero-shot | Relations | Weights | Reference |
|---|---|---|---|---|---|---|
| `stacked` (default) | Selector/fallback | GLiNER > BERT > heuristic+pattern | -- | -- | HuggingFace (when ML enabled) | -- |
| `gliner` | Bi-encoder span classifier | Custom | Yes | -- | gliner_small-v2.1 | Zaratiana et al. [5] |
| `gliner2` | Multi-task span classifier | Custom | Yes | Heuristic | gliner-multitask-large-v0.5 | [11] |
| `nuner` | Token classifier (BIO) | Custom | Yes | -- | NuNerZero_onnx | Bogdanov et al. [6] |
| `bert-onnx` | Sequence labeling (BERT) | PER/ORG/LOC/MISC | No | -- | bert-base-NER-onnx | Devlin et al. [8] |
| `pattern` | Regex grammars | DATE/MONEY/EMAIL/URL/PHONE/PERCENT | N/A | -- | None | -- |
| `tplinker` | Joint entity-relation (heuristic) | Custom | -- | Heuristic | None | [10] |
| `universal-ner` | LLM prompt (any provider) | Custom | Yes | -- | None (API) | -- |
ML backends are feature-gated (onnx or candle). Weights download from HuggingFace on first use. See BACKENDS.md for the full list (including experimental backends) and feature-flag details.
```sh
cargo install --git https://github.com/arclabs561/anno --package anno-cli --bin anno --features "onnx eval"
```

```sh
anno extract --text "Lynn Conway worked at IBM and Xerox PARC in California."
```

```
PER:1 "Lynn Conway"
ORG:2 "IBM" "Xerox PARC"
LOC:1 "California"
```
JSON output (--format json):
```sh
anno extract --model pattern --format json --text "Contact jobs@acme.com by March 15 for the \$50K role."
```

Zero-shot custom entity types (via GLiNER [5]):

```sh
anno extract --model gliner --extract-types "DRUG,SYMPTOM" \
  --text "Aspirin can treat headaches and reduce fever."
```

```
drug:1 "Aspirin"
symptom:2 "headaches" "fever"
```
Coreference:
```sh
anno debug --coref -t "Sophie Wilson designed the ARM processor. She revolutionized mobile computing."
```

```
Coreference: "Sophie Wilson" → "She"
```
Three coref backends with different tradeoffs:
| Backend | Type | Quality | Speed | Weights |
|---|---|---|---|---|
| `SimpleCorefResolver` | Rule-based (9 sieves) | Low | Fast | None |
| `FCoref` | Neural (DistilRoBERTa) | 78.5 F1 | Medium | ONNX export |
| `MentionRankingCoref` | Mention-ranking | Medium | Medium | None |
FCoref (Otmazgin et al., AACL 2022) requires a one-time model export. The export script is in the repo at scripts/export_fcoref.py (not shipped in the crates.io package -- clone the repo to run it):
```sh
# From a repo clone:
uv run scripts/export_fcoref.py
```

```rust
let coref = FCoref::from_path("fcoref_onnx")?;
let clusters = coref.resolve("Marie Curie was born in Warsaw. She discovered radium.")?;
// clusters[0] = { mentions: ["Marie Curie", "She"], canonical: "Marie Curie" }
```

RAG preprocessing (`rag::resolve_for_rag()`, requires the `analysis` feature): rewrites pronouns with their antecedents so document chunks remain self-contained after splitting. Supports English, French, Spanish, German. Includes a pleonastic-"it" filter and an overlap guard.
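The core rewrite step behind this kind of preprocessing can be sketched in plain Rust (this is an illustration, not the crate's implementation): given pronoun spans and their cluster's canonical mention, splice replacements in right-to-left order so earlier character offsets stay valid:

```rust
// Sketch of coref-aware pronoun rewriting (illustrative, not anno's code).
// Edits are (start_char, end_char, replacement); applied right-to-left so
// that splicing one edit does not shift the offsets of the edits before it.
fn rewrite(text: &str, mut edits: Vec<(usize, usize, &str)>) -> String {
    edits.sort_by_key(|e| std::cmp::Reverse(e.0));
    let mut chars: Vec<char> = text.chars().collect();
    for (start, end, replacement) in edits {
        chars.splice(start..end, replacement.chars());
    }
    chars.into_iter().collect()
}

fn main() {
    let text = "Sophie Wilson designed the ARM processor. She revolutionized mobile computing.";
    // "She" occupies chars 42..45; its cluster's canonical mention replaces it.
    let out = rewrite(text, vec![(42, 45, "Sophie Wilson")]);
    assert_eq!(
        out,
        "Sophie Wilson designed the ARM processor. Sophie Wilson revolutionized mobile computing."
    );
}
```

The production path additionally filters pleonastic "it" ("it is raining") and guards against overlapping edits, as noted above.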
Filter and pipe JSON output:
```sh
# All person entities, text only
anno extract --format json --file article.txt \
  | jq '[.[] | select(.entity_type == "PER") | .text]'

# Unique organizations, sorted
anno extract --format json --file article.txt \
  | jq '[.[] | select(.entity_type == "ORG") | .text] | unique | sort[]'
```

Batch a directory (parallel, cached):
```sh
anno batch --dir docs/ --parallel 4 --cache --output results/

# Stream stdin JSONL: {"id":"…","text":"…"} per line
cat corpus.jsonl | anno batch --stdin --parallel 4 --cache --output results/
```

The following export commands require `--features graph` at install time (add `graph` to the features list in the `cargo install` command above).
`anno export` emits standard N-Triples or JSON-LD, loadable into any RDF store (Oxigraph, Jena, Blazegraph, etc.). `--base-uri` sets the IRI namespace:

```sh
# Own namespace (recommended for private corpora)
anno export --input docs/ --output /tmp/kg/ --format ntriples \
  --base-uri https://myproject.example.com/kg/

# Align to DBpedia (widely used LOD namespace)
anno export --input docs/ --output /tmp/kg/ --format ntriples \
  --base-uri https://dbpedia.org/resource/
```

The default (`urn:anno:`) produces stable URNs suitable for local use without a registered domain. The output is standard W3C N-Triples, so any SPARQL-capable store can query it:
```sh
# Example: load into a SPARQL store and query
cat /tmp/kg/*.nt | your-rdf-store load --format ntriples
your-rdf-store query 'SELECT ?label WHERE {
  ?e a <https://myproject.example.com/kg/vocab#PERType> ;
     <http://www.w3.org/2000/01/rdf-schema#label> ?label .
}'
```

Semantic relation triples: `RelationCapable` backends (`tplinker`, `gliner2`) produce typed (head, relation, tail) triples instead of co-occurrence edges. Use `--format graph-ntriples` (`--features graph`) to route them through the internal graph substrate:
```sh
anno export --input docs/ --output /tmp/kg/ \
  --format graph-ntriples --model gliner2 \
  --base-uri https://myproject.example.com/kg/
# → <entity/person/0_Lynn_Conway_0> <rel/works_for> <entity/org/1_IBM_13> .
```

Export node and edge tables as CSV for import into any property-graph database:
```sh
# Semantic edges (gliner2 or tplinker)
anno export --input docs/ --output /tmp/kg/ --format kuzu --model gliner2

# Co-occurrence edges (any other backend)
anno export --input docs/ --output /tmp/kg/ --format kuzu
```

Each file produces `{stem}-nodes.csv` + `{stem}-edges.csv` with columns:

- Nodes: `id,entity_type,text,start,end,source`
- Edges: `from,to,rel_type,confidence`
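As a quick consumer-side sketch (std-only Rust, column names from the schema above; assumes fields contain no quoted commas), an edges CSV can be folded into an adjacency list before import:

```rust
use std::collections::HashMap;

// Minimal sketch: fold an exported {stem}-edges.csv (from,to,rel_type,confidence)
// into an adjacency list keyed by source node id. Std-only; no quoted-comma handling.
fn adjacency(csv: &str) -> HashMap<String, Vec<(String, String, f64)>> {
    let mut adj: HashMap<String, Vec<(String, String, f64)>> = HashMap::new();
    for line in csv.lines().skip(1) { // skip the header row
        let cols: Vec<&str> = line.split(',').collect();
        if cols.len() != 4 {
            continue;
        }
        let conf: f64 = cols[3].parse().unwrap_or(0.0);
        adj.entry(cols[0].to_string())
            .or_default()
            .push((cols[1].to_string(), cols[2].to_string(), conf));
    }
    adj
}

fn main() {
    let csv = "from,to,rel_type,confidence\n\
               person_0,org_1,works_for,0.91\n\
               person_0,loc_2,located_in,0.40";
    let adj = adjacency(csv);
    assert_eq!(adj["person_0"].len(), 2);
    assert_eq!(adj["person_0"][0], ("org_1".to_string(), "works_for".to_string(), 0.91));
}
```

For real corpora, a proper CSV reader (e.g. the `csv` crate) is the safer choice, since entity text can contain commas.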
| Crate | Published | Purpose |
|---|---|---|
| `anno` (root) | Yes | Backends, `Model` trait, extraction pipeline, coref resolvers |
| `anno-core` | No (yanked) | Data model (`Entity`, `Relation`, `Mention`, `CorefChain`, `Signal`, `Track`); re-exported by `anno` |
| `anno-graph` | No | Graph/KG export adapters (N-Triples, JSON-LD, CSV) |
| `anno-eval` | No | Evaluation harnesses, dataset loaders, matrix sampling |
| `anno-cli` | No | CLI binary |
| `anno-metrics` | No | Shared evaluation primitives (CoNLL F1, encoders) |
anno-core exists so that crates needing only the data types (Entity, Relation, etc.) don't pull in ML dependencies. The root anno crate depends on anno-core and re-exports all its types at the crate root.
Pipeline: Text -> Extract -> Coalesce -> structured output. See ARCHITECTURE.md.
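The Coalesce step merges candidate spans coming from multiple backends. One plausible policy (an illustration, not necessarily anno's actual rule) is confidence-greedy overlap suppression:

```rust
// Illustrative coalesce policy: sort candidate spans by descending confidence
// and keep each one only if it does not overlap an already-kept span.
// Spans are (start_char, end_char, label, confidence).
type Span = (usize, usize, &'static str, f64);

fn coalesce(mut spans: Vec<Span>) -> Vec<Span> {
    spans.sort_by(|a, b| b.3.partial_cmp(&a.3).unwrap());
    let mut kept: Vec<Span> = Vec::new();
    for s in spans {
        // Half-open intervals [start, end) overlap unless one ends before the other starts.
        if kept.iter().all(|k| s.1 <= k.0 || s.0 >= k.1) {
            kept.push(s);
        }
    }
    kept.sort_by_key(|s| s.0); // restore document order
    kept
}

fn main() {
    let merged = coalesce(vec![
        (0, 13, "PER", 0.95),  // "Sophie Wilson"
        (7, 13, "PER", 0.60),  // "Wilson": overlaps, lower confidence, dropped
        (27, 30, "ORG", 0.80), // "ARM"
    ]);
    assert_eq!(merged.len(), 2);
    assert_eq!(merged[0].3, 0.95);
}
```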
anno-eval provides dataset loading, backend-vs-dataset compatibility gating, and CoNLL-style [2] span-level evaluation (precision, recall, F1) with label mapping between backend and dataset taxonomies.
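Span-level scoring in the CoNLL style counts a prediction as correct only on an exact (start, end, label) match. A minimal self-contained sketch of that metric:

```rust
use std::collections::HashSet;

// Exact-match span precision/recall/F1, CoNLL-style: a prediction is a true
// positive only if its (start, end, label) triple matches a gold span exactly.
fn prf1(gold: &[(usize, usize, &str)], pred: &[(usize, usize, &str)]) -> (f64, f64, f64) {
    let g: HashSet<_> = gold.iter().collect();
    let tp = pred.iter().filter(|p| g.contains(p)).count() as f64;
    let p = if pred.is_empty() { 0.0 } else { tp / pred.len() as f64 };
    let r = if gold.is_empty() { 0.0 } else { tp / gold.len() as f64 };
    let f1 = if p + r == 0.0 { 0.0 } else { 2.0 * p * r / (p + r) };
    (p, r, f1)
}

fn main() {
    let gold = [(0, 13, "PER"), (27, 30, "ORG")];
    let pred = [(0, 13, "PER"), (27, 30, "LOC")]; // label mismatch on the second
    let (p, r, f1) = prf1(&gold, &pred);
    assert_eq!((p, r), (0.5, 0.5));
    assert!((f1 - 0.5).abs() < 1e-12);
}
```

The harness additionally maps labels between backend and dataset taxonomies before matching, which this sketch omits.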
```sh
anno benchmark --model gliner --dataset conll2003
```

`anno sampler` (`--features eval`) provides a randomized matrix sampler with two modes: triage (worst-first) and measure (ML-only stable measurement).

```sh
# Deterministic offline decision-loop smoke example (also run in CI).
cargo run -p anno-eval --example muxer_decision_loop --features eval

# Run the CI sampler harness locally (uses ~/.anno_cache for history/cache).
just ci-matrix-local 42 ner
```

Inference-time extraction. Training pipelines are out of scope; use upstream frameworks (Hugging Face Transformers, Flair, etc.) and export ONNX weights for consumption. (The repo contains some experimental training code for box embeddings; this is research-only and not part of the public API.)
ONNX Runtime linking errors. The onnx feature pulls in ort, which needs the ONNX Runtime C++ library. If linking fails, use default-features = false for a minimal build without native deps. If you need ONNX backends, ensure ort can download the prebuilt runtime -- check proxy/firewall settings and the ORT_DYLIB_PATH env var.
Model download failures. First use of ML backends downloads weights from HuggingFace to ~/.cache/huggingface/hub/. Behind firewalls, pre-fetch models on a connected machine, then set HF_HUB_OFFLINE=1 (or ANNO_NO_DOWNLOADS=1) to force cached-only mode.
FCoref export script not found. The export script (scripts/export_fcoref.py) is only present in the git repository, not in the crates.io package. Clone the repo to access it: uv run scripts/export_fcoref.py.
"Feature not available" errors. Most ML backends are feature-gated behind onnx or candle. If a backend constructor returns an error about a missing feature, check which features are enabled. The feature flags table above lists what each feature unlocks.
Character offset mismatches. All spans use character offsets (Unicode scalar values), not byte offsets. If offsets look wrong, ensure you are indexing by chars() count, not by byte position. See CONTRACT.md.
- QUICKSTART
- CONTRACT — offset semantics, scope, feature gating
- BACKENDS — backend selection, architecture, feature flags
- ARCHITECTURE — crate layout, dependency flow
- REFERENCES — full bibliography (NER, coref, relation extraction, software)
- API docs
- Changelog
[1] Grishman & Sundheim, COLING 1996 (MUC-6 NER). [2] Tjong Kim Sang & De Meulder, CoNLL 2003. [3] Lee et al., EMNLP 2017 (end-to-end coref). [4] Jurafsky & Martin, SLP3 2024. [5] Zaratiana et al., NAACL 2024 (GLiNER). [6] Bogdanov et al., 2024 (NuNER). [7] Li et al., AAAI 2022 (W2NER). [8] Devlin et al., NAACL 2019 (BERT). [9] Lafferty et al., ICML 2001 (CRF). [10] Wang et al., COLING 2020 (TPLinker). [11] Zaratiana et al., 2025 (GLiNER2). [12] Rabiner, Proc. IEEE 1989 (HMM).
Full list with links: docs/REFERENCES.md. Citeable via CITATION.cff.
Dual-licensed under MIT or Apache-2.0.