A LinkML schema and annotation pipeline for representing ecological causal claims as an ontology-grounded graph. Combines ELMO-style entity decomposition with the Illari–Russo causal mosaic framework (CAMO), so the same curated claims can drive causal diagrams, evidence gap maps, fuzzy cognitive models, and practitioner summaries.
| File / folder | Purpose |
|---|---|
| `causal_mosaic_v0.4.0.yaml` | Versioned LinkML schema |
| `sample_data.yaml` | Full worked example (grassland restoration) |
| `sample_data_grounded.yaml` | Compact ontology-grounded version of the same example |
| `causal_mosaic_annotation_guide.md` | Complete annotator handbook (includes LLM pipeline guidance in Appendix A) |
| `schema_cheat_sheet_one_page.md` | One-page quick reference |
| `schema_guide_ecologists.md` | Audience guide for ecologists |
| `schema_guide_philosophers.md` | Audience guide for philosophers |
| `schema_guide_plain_language.md` | Plain-language guide |
| `schema_guide_semantic_engineers.md` | Audience guide for semantic engineers |
| `sentence_to_schema_infographic.md` | Visual walkthrough of schema decomposition |
| `annotator/` | LLM annotation pipeline scripts |
A Causal Mosaic annotation is a labeled property graph where every node is a change in an ecological variable (entity + attribute + direction) and every edge is a causal claim carrying four annotation layers: claim strength, philosophical account, fifteen causal features, and evidential basis.
The five questions the schema answers for each claim (a minimal YAML sketch follows the list):

- What changed? — `entity_term`, `variable_attribute`
- Which way? — `variable_direction`
- What did it affect? — edge `subject` → `object`
- How strong is the claim? — `claim_strength`, `philosophical_accounts`
- What evidence supports it? — `evidential_basis`
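To make the mapping concrete, here is a non-normative sketch of how those fields might sit together in one record. Only the slot names from the list above come from the schema; the container keys (`nodes:`, `edges:`), ids, placeholder CURIEs, and enum values are illustrative assumptions, and `causal_mosaic_v0.4.0.yaml` / `sample_data.yaml` define the real structure.

```yaml
# Illustrative sketch only; not a validated record.
nodes:
  - id: fire_frequency              # placeholder id
    entity_term: ENVO:XXXXXXX       # placeholder CURIE; resolve with lookup_curie.py
    variable_attribute: frequency   # the attribute that changes
    variable_direction: increase    # which way it changes
  - id: woody_cover
    entity_term: PATO:XXXXXXX       # placeholder CURIE
    variable_attribute: cover
    variable_direction: decrease
edges:
  - subject: fire_frequency         # the change doing the causing
    object: woody_cover             # the change being caused
    claim_strength: asserted        # placeholder enum value
    philosophical_accounts: [mechanistic]   # placeholder enum value
    evidential_basis: ECO:XXXXXXX   # placeholder evidence-type CURIE
```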
The pipeline implements the human-in-the-loop workflow from Annotation Guide Appendix A. An LLM produces a first-pass draft; a trained human reviews and corrects every field.
```
┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ 1. PREPARE   │──▶│ 2. EXTRACT   │──▶│ 3. VALIDATE  │──▶│ 4. REVIEW    │
│ Paper + DOI  │   │ LLM draft    │   │ Schema,      │   │ Human check  │
│ metadata     │   │ YAML output  │   │ CURIEs,      │   │ + correct    │
└──────────────┘   └──────────────┘   │ Spans        │   └──────────────┘
     Human              LLM           └──────────────┘   Human (critical)
```
```bash
cd annotator
pip install -r requirements.txt

# 1. Configure the LLM (edit this file before running anything).
#    Defaults to openai_compatible pointing at a local vLLM instance.
nano llm.yaml

# 2. Set your API key (if required by your endpoint)
export LLM_API_KEY="your-key-here"   # or "none" for unauthenticated vLLM

# 3. Run the full pipeline
python pipeline.py --text paper.pdf --doi "10.xxxx/yyyy" --output-dir outputs/

# Or run steps individually:
python extract.py --text paper.pdf --output draft.yaml
python validate_schema.py draft.yaml
python validate_curies.py draft.yaml --output curie_report.json
python verify_spans.py draft.yaml paper.pdf
```

All pipeline scripts read `annotator/llm.yaml` at startup. No model settings are hard-coded. Edit this file to point at any LLM endpoint.
### vLLM (default)

```yaml
provider: openai_compatible
base_url: "http://localhost:8000/v1"
model: "meta-llama/Llama-3.3-70B-Instruct"
api_key_env: "LLM_API_KEY"
max_tokens: 8192
temperature: 0.1
```

Start vLLM with, e.g.:

```bash
vllm serve meta-llama/Llama-3.3-70B-Instruct --port 8000
export LLM_API_KEY="none"
```

### Ollama (local)
```yaml
provider: openai_compatible
base_url: "http://localhost:11434/v1"
model: "llama3.3:70b"
api_key_env: "LLM_API_KEY"
```

```bash
ollama pull llama3.3:70b
export LLM_API_KEY="none"
```

### OpenAI
```yaml
provider: openai_compatible
base_url: "https://api.openai.com/v1"
model: "gpt-4o"
api_key_env: "OPENAI_API_KEY"
```

### Together AI / Groq
```yaml
provider: openai_compatible
base_url: "https://api.together.xyz/v1"   # or https://api.groq.com/openai/v1
model: "meta-llama/Llama-3.3-70B-Instruct-Turbo"
api_key_env: "TOGETHER_API_KEY"
```

### Anthropic Claude
```yaml
provider: anthropic
model: "claude-sonnet-4-6"
max_tokens: 8192
temperature: 0.1
```

```bash
export ANTHROPIC_API_KEY="your-key"
pip install anthropic
```

All scripts that accept a source document (`extract.py`, `verify_spans.py`, `pipeline.py`) accept both `.txt`/`.md` and `.pdf` files. PDF text extraction requires one additional package:
```bash
pip install pdfplumber   # recommended
# or
pip install pymupdf
```

| Script | Purpose |
|---|---|
| `extract.py` | Call the LLM with the Appendix A.3 extraction prompt; write draft YAML |
| `validate_schema.py` | Check required fields, enum values, node references, FCM weight sign consistency |
| `validate_curies.py` | Validate every ontology CURIE against NCBI E-utilities and OLS4 |
| `lookup_curie.py` | Find the correct CURIE for a label when the LLM guessed wrong |
| `verify_spans.py` | Confirm every `source_spans.text` is a verbatim quote from the paper |
| `pipeline.py` | Run all stages in sequence with a single command |
| `llm.yaml` | LLM provider, endpoint, model, and generation settings |
| `utils.py` | Shared text loading (.txt/.pdf) and LLM client construction |
```
python extract.py --text paper.pdf [options]

--text PATH          Source document (.txt, .md, or .pdf) [required]
--output PATH        Draft YAML output [default: draft.yaml]
--config PATH        llm.yaml path [default: annotator/llm.yaml]
--doi TEXT           DOI of the paper
--title TEXT         Paper title
--authors TEXT       Author name (repeat for multiple)
--year INT           Publication year
--journal TEXT       Journal name
--chunk              Split into ~3000-word chunks (for long papers)
--words-per-chunk    Words per chunk [default: 3000]
```
```
python validate_schema.py draft.yaml [--strict]
```

Exits 0 if no errors; 1 if errors are found. `--strict` treats warnings as errors.
```
python validate_curies.py draft.yaml [--output report.json] [--dry-run] [--verbose]
```

Set `NCBI_API_KEY` to raise the NCBI rate limit from 3 to 10 requests/second.
```
# Single lookup
python lookup_curie.py "Andropogon gerardii" NCBITaxon
python lookup_curie.py "temperate grassland" ENVO

# Batch (TSV: label<TAB>prefix)
python lookup_curie.py --batch labels.tsv --output results.tsv
```
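The two single lookups above, written as a batch file (`labels.tsv`), would be one tab-separated `label<TAB>prefix` pair per line:

```
Andropogon gerardii	NCBITaxon
temperate grassland	ENVO
```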
```
python verify_spans.py draft.yaml paper.pdf [options]

--threshold FLOAT    Fuzzy similarity cutoff [default: 0.90]
--fix                Auto-correct near-matches in the YAML
--output PATH        Write JSON report to file
--show-diff          Print diff for near-matches
```
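For orientation, the unit being checked looks roughly like this. Only `source_spans.text` is attested in the script table above, and the quoted sentence is invented purely for illustration:

```yaml
source_spans:
  - text: "Woody cover declined in burned plots."   # must be a verbatim quote;
                                                    # near-matches at or above
                                                    # --threshold can be repaired
                                                    # with --fix
```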
```
python pipeline.py --text paper.pdf [options]

--text PATH          Source document [required unless --no-extract]
--draft PATH         Existing draft YAML (use with --no-extract)
--output-dir DIR     Directory for all outputs [default: .]
--config PATH        llm.yaml path
--doi / --title / --authors / --year / --journal
                     Paper metadata passed to extract.py
--no-extract         Skip extraction; validate an existing draft
--skip-curies        Skip CURIE API validation (useful offline)
--skip-spans         Skip source span verification
--fix-spans          Auto-correct near-match spans
--strict             Treat schema warnings as errors
--chunk              Enable chunked extraction
```
After every LLM extraction, a human annotator must verify (from Annotation Guide §A.2.4):
- Source spans are verbatim quotes (use `verify_spans.py`, then Ctrl+F)
- Every ontology CURIE is correct (use `validate_curies.py` + an ontology browser)
- Claim strength matches the author's actual language, not the LLM's interpretation
- Philosophical accounts match the text framing (check linguistic cues in §8.2)
- All fifteen causal features are grounded in the text (`not_addressed` by default)
- `russo_williamson_satisfied: true` only if the paper itself provides both statistical and mechanistic evidence
- `bradford_hill_count` equals the length of `bradford_hill_viewpoints`
- FCM weight sign matches the predicate sign (see the sketch after this list)
- No hallucinated edges: every edge traces to a specific passage in the paper
- Competing or offsetting effects are captured (check the Discussion section)
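A hedged illustration of the sign rule from the checklist: a negatively signed predicate must carry a negative FCM weight, which is what the sign-consistency check in `validate_schema.py` enforces. The slot names `predicate` and `fcm_weight` are assumptions here, not confirmed schema slots.

```yaml
- subject: fire_frequency
  object: woody_cover
  predicate: decreases    # negatively signed predicate (hypothetical value)
  fcm_weight: -0.6        # consistent: negative weight for a negative predicate
  # fcm_weight: 0.6       # would be flagged by the sign-consistency check
```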
- Ecological causal claims extracted from papers, reports, or synthesis products
- Ontology terms from CAMO, ELMO, ENVO, GO, PATO, NCBITaxon, ECO, and related vocabularies
- Source-document provenance: quoted spans, study metadata, annotator judgments
- Versioned, validated YAML records conforming to `causal_mosaic_v0.4.0.yaml`
- Causal graphs suitable for Fuzzy Cognitive Maps, Evidence Gap Maps, and RAG pipelines
- Structured evidence for practitioner summaries and systematic reviews
- Which node categories should be grounded directly to stable external ontology terms rather than local placeholders?
- How strict should validation be for partially grounded claims when no exact ontology term exists?
- Should the project maintain one canonical sample dataset or both a compact grounded sample and a fuller narrative sample?
- What downstream renderers should be treated as first-class targets in the next iteration?