Turn Claude Code, Cursor, OpenAI Codex, Gemini CLI, or OpenCode into an expert proteomics SDRF annotator.
Pick a dataset → The agent fetches PRIDE + paper → You review a validated SDRF.
Structured skills that give AI assistants expert-level capabilities for annotating, validating, improving, and brainstorming proteomics metadata in the SDRF format.
SETUP PLAN ANNOTATE VALIDATE REFINE SHARE
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Conda │ │ Templates│ │ PXD │ │ Columns │ │ Score │ │ Convert │
│ Pip │────▶│ Strategy │────▶│ PRIDE │────▶│ OLS │────▶│ AutoFix │────▶│ PR │
│ Tools │ │ Layers │ │ Paper │ │ Rules │ │ Raw scan │ │ Pipeline │
└──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘
/sdrf:setup /sdrf:brainstorm /sdrf:annotate /sdrf:validate /sdrf:improve /sdrf:contribute
/sdrf:templates /sdrf:fix /sdrf:convert
/sdrf:review
/sdrf:techrefine
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Format │ │ Ontology │ │ Plain │ │ Batch │
│ Spec │ │ Lookup │ │ Lang │ │ Confound │
│ Rules │ │ Verify │ │ Concepts │ │ Replic. │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
/sdrf:knowledge /sdrf:terms /sdrf:explain /sdrf:design
Instead of an AI guessing at ontology terms or SDRF rules, these skills teach it exactly how to annotate proteomics datasets — using real tools (OLS, PRIDE, PubMed) guided by the methodology of experienced annotators.
The SDRF specification data (column definitions, templates) lives in a git submodule and is read at runtime — so the skills stay current when the spec evolves.
All 16 skills are under the sdrf: namespace. In Claude Code, type /sdrf: and autocomplete will show them all.
| Skill | What it does |
|---|---|
| /sdrf:setup | Install dependencies (parse_sdrf, techsdrf); guided conda or pip setup |
| /sdrf:knowledge | Ask about SDRF format, column rules, ontology mappings, reserved words |
| /sdrf:templates | Ask about templates, select templates, understand layers and selection rules |
| /sdrf:annotate | Full annotation workflow: PXD → PRIDE + paper → draft SDRF → validate |
| /sdrf:validate | Systematic validation against templates + ontology checking via OLS |
| /sdrf:improve | Quality analysis: specificity, completeness, consistency, score |
| /sdrf:fix | Auto-fix common errors (UNIMOD swaps, case, format, artifacts) |
| /sdrf:terms | Find and verify ontology terms for any SDRF column |
| /sdrf:brainstorm | Plan metadata strategy before creating an SDRF |
| /sdrf:review | Comprehensive quality review with cross-reference to paper + PRIDE |
| /sdrf:explain | Explain any column, error, or concept in plain language |
| /sdrf:convert | Choose and configure analysis pipelines from SDRF |
| /sdrf:design | Detect batch effects, confounders, replication issues |
| /sdrf:contribute | Contribute annotated SDRF back to sdrf-annotated-datasets via PR |
| /sdrf:techrefine | Verify/refine technical metadata from raw files via techsdrf |
| /sdrf:cellline | Translate Cellosaurus records into SDRF cell-line columns (organism, disease, sampling site, sex, ancestry) |
The SDRF specification data is included as a git submodule. You must initialize it:
# Clone with submodules:
git clone --recurse-submodules https://github.com/bigbio/sdrf-skills
# Or if already cloned without submodules:
cd sdrf-skills
git submodule update --init --recursive
To update the spec to the latest version:
git submodule update --remote --recursive
Install the deterministic helper tools used by the skills. Conda is recommended (includes thermorawfileparser for Thermo .raw files):
# Recommended (conda):
conda env create -f environment.yml
conda activate sdrf-skills
# Or pip:
pip install -r requirements.txt
For Thermo .raw files, thermorawfileparser is not on PyPI; use conda: conda install -c bioconda thermorawfileparser.
Claude Code (plugin)
After dependencies are installed (step 2 above):
- Install the plugin:
# From the official marketplace (when published):
/plugin install sdrf-skills
# Or from GitHub:
/plugin install github:bigbio/sdrf-skills
- Run guided dependency setup: /sdrf:setup
- Then use: /sdrf:annotate PXD###### and/or /sdrf:validate your_file.sdrf.tsv
Cursor
After dependencies are installed (step 2 above):
- Ensure you have .cursor/rules/sdrf-skills.mdc in your project.
- Cursor does not run Claude Code's SessionStart hook, so ask when needed:
  - "Install SDRF dependencies"
  - "Follow the sdrf setup workflow"
- The AI will use skills/sdrf-setup/SKILL.md to show the exact conda/pip commands.
Codex (OpenAI)
After dependencies are installed (step 2 above):
- Follow .codex/INSTALL.md to symlink skills/ and spec/ into your Codex agents skills path.
- When validation/ontology checks are needed, run parse_sdrf validate-sdrf (from sdrf-pipelines).
Gemini CLI
After dependencies are installed (step 2 above):
- Gemini CLI auto-loads GEMINI.md from the repo root.
- When generating SDRF, instruct Gemini to validate with: parse_sdrf validate-sdrf --sdrf_file ... --template ...
OpenCode
After dependencies are installed (step 2 above):
- Follow .opencode/AGENTS.md to wire the skills into your OpenCode agent.
- Ask it to consult skills/sdrf-setup/SKILL.md if parse_sdrf/techsdrf are missing.
For full SDRF annotation (PRIDE, OLS, literature), configure these MCP servers:
- OLS — Ontology Lookup Service (EBI)
- PRIDE MCP — Proteomics dataset repository
- PubMed — Biomedical literature
- bioRxiv — Preprint server
The SessionStart hook checks for parse_sdrf and recommends /sdrf:setup if dependencies are missing.
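The dependency check described above can be sketched in a few lines of Python. This is only an illustration: the repository's actual hook is the shell script hooks/check-deps.sh, and the function names `missing_tools` and `setup_hint` are hypothetical.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str] = ("parse_sdrf", "techsdrf")) -> List[str]:
    """Return the executables from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def setup_hint(missing: List[str]) -> str:
    """Build the message a session-start check could show to the user."""
    if not missing:
        return "All SDRF dependencies found."
    return f"Missing: {', '.join(missing)}. Run /sdrf:setup to install them."


if __name__ == "__main__":
    print(setup_hint(missing_tools()))
```

Using `shutil.which` keeps the check side-effect free: nothing is executed, the tool is only looked up on PATH.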
You: /sdrf:templates I have a DIA phosphoproteomics study on mouse brain
Claude: For your experiment, I recommend:
1. ms-proteomics (mass spectrometry)
2. vertebrates (Mus musculus)
3. dia-acquisition (DIA method)
This adds columns for: strain, developmental stage, DIA scan windows...
You: /sdrf:annotate PXD045678
Claude:
→ Fetches PRIDE metadata (organism, instrument, files)
→ Finds the paper (PMID from PRIDE → PubMed → full text from PMC)
→ Extracts sample info from methods section
→ Selects templates: ms-proteomics + human + clinical-metadata
→ Drafts SDRF with verified ontology terms from OLS
→ Validates the result
You: /sdrf:knowledge What format should modification parameters use?
Claude: The format for comment[modification parameters] is:
NT=<name>;AC=UNIMOD:<id>;TA=<amino acid>;MT=<Fixed|Variable>
Example: NT=Carbamidomethyl;AC=UNIMOD:4;TA=C;MT=Fixed
Warning: UNIMOD:1 = Acetyl, UNIMOD:21 = Phospho (most common swap!)
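As a rough illustration of the format quoted above, a modification string can be split and sanity-checked as follows. This is a sketch, not the repository's parser: it assumes all four keys are required, and the small `KNOWN_UNIMOD` table only covers the three accessions mentioned in this README.

```python
from typing import Dict

REQUIRED_KEYS = ("NT", "AC", "TA", "MT")  # keys shown in the format above
ALLOWED_MT = {"Fixed", "Variable"}

# Accession/name pairs taken from the example and warning above.
KNOWN_UNIMOD = {
    "UNIMOD:1": "Acetyl",
    "UNIMOD:4": "Carbamidomethyl",
    "UNIMOD:21": "Phospho",
}


def parse_modification(value: str) -> Dict[str, str]:
    """Split 'NT=...;AC=...;TA=...;MT=...' into a dict and run basic checks."""
    fields: Dict[str, str] = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue
        key, sep, val = part.partition("=")
        if not sep:
            raise ValueError(f"Expected key=value, got {part!r}")
        fields[key.strip()] = val.strip()

    missing = [key for key in REQUIRED_KEYS if key not in fields]
    if missing:
        raise ValueError(f"Missing keys: {', '.join(missing)}")
    if not fields["AC"].startswith("UNIMOD:"):
        raise ValueError(f"AC should be a UNIMOD accession, got {fields['AC']!r}")
    if fields["MT"] not in ALLOWED_MT:
        raise ValueError(f"MT must be Fixed or Variable, got {fields['MT']!r}")
    return fields


def name_matches_accession(fields: Dict[str, str]) -> bool:
    """Return False only when a known accession is paired with a different name."""
    expected = KNOWN_UNIMOD.get(fields["AC"])
    return expected is None or expected.lower() == fields["NT"].lower()
```

Calling `name_matches_accession(parse_modification("NT=Acetyl;AC=UNIMOD:21;TA=K;MT=Variable"))` returns `False`, which is the swap the warning describes.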
You: /sdrf:fix [paste SDRF content]
Claude:
→ Identifies: UNIMOD:21 used for Acetyl (should be UNIMOD:1)
→ Fixes case: "Male" → "male"
→ Fixes format: "58 years" → "58Y"
→ Fixes artifacts: "['breast cancer']" → "breast cancer"
→ Shows changelog → outputs corrected SDRF
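Three of the fixes listed above are simple enough to sketch as pure functions. This is a hedged illustration of the idea, not the code in tools/sdrf_fixer.py; the function names and the age regex are assumptions, and a real fixer would apply each rule only to the columns where it is valid.

```python
import ast
import re

# Matches values such as "58 years", "58 yrs" or "58Y" (assumed input shapes).
_AGE_YEARS = re.compile(r"^\s*(\d+)\s*(?:years?|yrs?|y)\s*$", re.IGNORECASE)


def fix_case(value: str) -> str:
    """Lowercase a value, e.g. 'Male' becomes 'male'."""
    return value.strip().lower()


def fix_age(value: str) -> str:
    """Rewrite '58 years' as '58Y'; leave unrecognised values untouched."""
    match = _AGE_YEARS.match(value)
    return f"{match.group(1)}Y" if match else value


def fix_list_artifact(value: str) -> str:
    """Unwrap a stringified one-element list such as "['breast cancer']"."""
    text = value.strip()
    if text.startswith("[") and text.endswith("]"):
        try:
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            return value
        if isinstance(parsed, list) and len(parsed) == 1 and isinstance(parsed[0], str):
            return parsed[0]
    return value
```

Each function returns its input unchanged when the pattern does not apply, so the fixes can be chained without corrupting values they do not understand.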
You: /sdrf:contribute PXD045678
Claude:
→ Checks if PXD045678 already exists in datasets/
→ Validates the SDRF file
→ Forks bigbio/sdrf-annotated-datasets
→ Creates branch annotation/PXD045678
→ Commits the SDRF file to datasets/PXD045678/
→ Opens a PR with dataset summary (organism, templates, row count)
You: /sdrf:techrefine PXD045678
Claude:
→ Checks techsdrf prerequisites
→ Presents refinement mode (PRIDE download vs local files)
→ Assembles command: techsdrf refine -p PXD045678 -s input.sdrf.tsv -n 5 -o refined.sdrf.tsv
→ Interprets results: instrument, tolerances, modifications, DDA/DIA
→ Shows diff: "Q Exactive" → "Q Exactive HF-X", tolerance 20ppm → 10ppm
→ Lets user approve/reject each change
python scripts/europepmc_fulltext.py PMC4047622 --format text
python scripts/europepmc_fulltext.py 24657495 --id-type pmid --section methods
python scripts/europepmc_fulltext.py 10.1016/j.jprot.2014.03.010 --id-type doi --format json
python scripts/europepmc_fulltext.py https://europepmc.org/articles/PMC10960138 --format toc
Use this when you need Europe PMC full text converted from noisy JATS/XML into:
- clean plain text with most citation/link clutter removed
- structured article metadata for downstream prompts
- explicit section splits like abstract, methods, results, discussion, and conclusion
- a machine-friendly JSON representation for future agent workflows
- canonical article links plus detected accessions such as PXD, MSV, PRJNA, GSE, and E-MTAB
It also supports canonical URL-style inputs like Europe PMC article URLs and doi:...
prefixes, plus --xml-file for offline parsing of previously downloaded Europe PMC XML.
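Accession detection of the kind mentioned above can be approximated with regular expressions. The patterns below are assumptions based on the usual shape of each identifier and are not taken from scripts/europepmc_fulltext.py; `find_accessions` is a hypothetical name.

```python
import re
from typing import Dict, List

# Assumed identifier shapes; adjust if a repository changes its numbering.
ACCESSION_PATTERNS = {
    "PXD": re.compile(r"\bPXD\d{6}\b"),
    "MSV": re.compile(r"\bMSV\d{9}\b"),
    "PRJNA": re.compile(r"\bPRJNA\d+\b"),
    "GSE": re.compile(r"\bGSE\d+\b"),
    "E-MTAB": re.compile(r"\bE-MTAB-\d+\b"),
}


def find_accessions(text: str) -> Dict[str, List[str]]:
    """Return unique accessions per prefix, in order of first appearance."""
    found: Dict[str, List[str]] = {}
    for prefix, pattern in ACCESSION_PATTERNS.items():
        # dict.fromkeys removes duplicates while keeping the original order.
        hits = list(dict.fromkeys(pattern.findall(text)))
        if hits:
            found[prefix] = hits
    return found
```

Prefixes with no match are omitted from the result, so an empty dict means no accession was detected in the text.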
You: /sdrf:terms disease "liver cancer"
Claude:
→ Searches EFO, MONDO, DOID via OLS
→ Recommends: hepatocellular carcinoma (EFO:0000182)
→ Shows alternatives: liver carcinoma, cholangiocarcinoma
→ Checks specificity: "liver cancer" too generic → use subtype
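For readers without an OLS MCP server, the first step of the lookup above can be approximated with a direct HTTP call. This sketch assumes the public OLS4 search endpoint at https://www.ebi.ac.uk/ols4/api/search with `q`, `ontology`, and `rows` parameters and a Solr-style `response.docs` payload; check the OLS documentation before relying on these details. It is not the code in tools/ols_client.py.

```python
import json
from typing import Dict, List, Optional, Sequence
from urllib.parse import urlencode
from urllib.request import urlopen

OLS_SEARCH = "https://www.ebi.ac.uk/ols4/api/search"  # assumed endpoint


def build_search_url(query: str, ontologies: Sequence[str], rows: int = 5) -> str:
    """Build a search URL restricted to the given ontologies."""
    params = {"q": query, "ontology": ",".join(ontologies), "rows": rows}
    return f"{OLS_SEARCH}?{urlencode(params)}"


def search_terms(
    query: str, ontologies: Sequence[str], rows: int = 5
) -> List[Dict[str, Optional[str]]]:
    """Query OLS and keep only the label, accession, and source ontology."""
    with urlopen(build_search_url(query, ontologies, rows), timeout=30) as response:
        payload = json.load(response)
    docs = payload.get("response", {}).get("docs", [])
    return [
        {
            "label": doc.get("label"),
            "obo_id": doc.get("obo_id"),
            "ontology": doc.get("ontology_name"),
        }
        for doc in docs
    ]


if __name__ == "__main__":
    for hit in search_terms("liver cancer", ["efo", "mondo", "doid"]):
        print(hit)
```

The candidates still need the specificity check described above; a search hit alone does not show that the term is the right level of detail for the sample.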
The repository is skills-first. New user-facing SDRF workflows should
normally be added as skills under skills/. The tools/ package is reserved
for deterministic helpers that a skill can call or that maintainers can run in
batch jobs.
Use this rule of thumb:
| If it does this... | Put it here |
|---|---|
| Guides an agent through evidence gathering, ontology choices, or review policy | skills/ |
| Parses TSV, validates values, scores completeness, applies deterministic fixes, or wraps external APIs | tools/ |
| Talks to multiple LLM providers to compare answers | skills/, not tools/ |
| Ships a large upstream mirror that can be queried live instead | prefer skills/ + upstream service, not a bundled dump |
The Python helpers that remain in this repository are the pieces that are clearly programmatic:
sdrf:validate -> tools/hallucination.py + tools/ols_client.py + tools/sdrf_parser.py
sdrf:improve -> tools/completeness.py
sdrf:fix -> tools/sdrf_fixer.py
sdrf:cellline -> tools/cellline_db.py (offline helper only; Cellosaurus stays authoritative)
maintainer use -> tools/benchmark.py
shared plumbing -> tools/services.py + tools/column_ontology_map.py + tools/cli.py
MassIVE fallback -> tools/massive_raw_files.py
# Detect hallucinated ontology terms and UNIMOD swaps
python -m tools check your_file.sdrf.tsv
# Score annotation quality (0-100 across 5 dimensions)
python -m tools score your_file.sdrf.tsv
# Auto-fix common errors (UNIMOD swaps, case, format, reserved words)
python -m tools fix your_file.sdrf.tsv -o fixed.sdrf.tsv
# Benchmark quality across multiple datasets
python -m tools benchmark PXD000001 PXD012345 local_file.sdrf.tsv
# Recover file names for MassIVE-hosted PXDs when PRIDE is empty
python -m tools massive-files PXD016117 --mode raw
python -m tools massive-files PXD016117 --mode acquisition --format tsv
# Verify a single ontology accession against OLS
python -m tools verify UNIMOD:1 --label Acetyl
# Cell line metadata lookup and SDRF enrichment
python -m tools cellline lookup HeLa
python -m tools cellline annotate file.sdrf.tsv -o enriched.tsv
python -m tools cellline stats
| Module | Purpose |
|---|---|
| tools/sdrf_parser.py | Lightweight TSV parser with column classification and value parsing |
| tools/ols_client.py | EBI OLS4 REST API client with caching and rate limiting |
| tools/hallucination.py | Ontology hallucination detector (UNIMOD swaps, label mismatches) |
| tools/completeness.py | 5-dimension quality scorer (completeness, specificity, consistency, standards, design) |
| tools/sdrf_fixer.py | Deterministic auto-fixer for 10 common error patterns |
| tools/cellline_db.py | Curated offline cell-line enrichment helper for batch SDRF cleanup |
| tools/services.py | REST clients for Cellosaurus, UniProt, BioSamples, PRIDE |
| tools/massive_raw_files.py | MassIVE fallback for recovering raw/acquisition file names from ProteomeCentral + FTP |
| tools/benchmark.py | Benchmark suite for quality analysis across datasets |
| tools/cli.py | Unified CLI entry point (python -m tools <command>) |
sdrf-skills/
├── .claude-plugin/plugin.json # Claude Code — plugin manifest
├── .cursor/rules/sdrf-skills.mdc # Cursor — rules file (auto-activates on *.sdrf.tsv)
├── .codex/INSTALL.md # Codex — installation instructions
├── .opencode/AGENTS.md # OpenCode — agent reference
├── environment.yml # Conda env (sdrf-pipelines, techsdrf, thermorawfileparser)
├── requirements.txt # Pip fallback
├── scripts/
│ └── europepmc_fulltext.py # Europe PMC full text cleaner: JATS/XML → LLM-friendly text/JSON
├── hooks/hooks.json # Claude Code — session init + dependency check
├── hooks/check-deps.sh # Checks parse_sdrf, recommends setup
├── spec/ # ← Git submodule: proteomics-metadata-standard
│ └── sdrf-proteomics/
│ ├── TERMS.tsv # Column definitions (read by skills at runtime)
│ └── sdrf-templates/ # ← Nested submodule: sdrf-templates
│ ├── templates.yaml # Template manifest (read by skills at runtime)
│ └── {name}/{ver}/ # Individual template YAMLs
├── tools/ # ← Python tools for programmatic analysis
│ ├── sdrf_parser.py # TSV parser with duplicate-column handling
│ ├── ols_client.py # OLS4 API client
│ ├── hallucination.py # Ontology hallucination detector
│ ├── completeness.py # 5-dimension quality scorer
│ ├── sdrf_fixer.py # Auto-fixer (10 error patterns)
│ ├── cellline_db.py # Curated offline cell-line enrichment helper
│ ├── services.py # External API clients
│ ├── massive_raw_files.py # MassIVE fallback for raw/acquisition file recovery
│ ├── benchmark.py # Dataset benchmark suite
│ ├── column_ontology_map.py # Column → ontology mappings
│ └── cli.py # Unified CLI
├── tests/ # ← pytest test suite (80+ tests)
├── examples/ # ← Sample SDRF files for testing
│ └── PXD_synthetic.sdrf.tsv # Synthetic example with deliberate errors
├── skills/ # ← Portable across ALL platforms
│ ├── sdrf-setup/SKILL.md # /sdrf:setup — guided dependency installation
│ ├── sdrf-knowledge/SKILL.md # /sdrf:knowledge — SDRF spec, columns, ontologies
│ ├── sdrf-templates/SKILL.md # /sdrf:templates — template system, layers, selection
│ ├── sdrf-annotate/SKILL.md # /sdrf:annotate — full annotation workflow
│ ├── sdrf-validate/SKILL.md # /sdrf:validate — validation + OLS checking
│ ├── sdrf-improve/SKILL.md # /sdrf:improve — quality analysis + scoring
│ ├── sdrf-fix/SKILL.md # /sdrf:fix — auto-fix common errors
│ ├── sdrf-terms/SKILL.md # /sdrf:terms — ontology term lookup
│ ├── sdrf-brainstorm/SKILL.md # /sdrf:brainstorm — metadata planning
│ ├── sdrf-review/SKILL.md # /sdrf:review — comprehensive review
│ ├── sdrf-explain/SKILL.md # /sdrf:explain — explain any concept
│ ├── sdrf-contribute/SKILL.md # /sdrf:contribute — PR to community repo
│ ├── sdrf-convert/SKILL.md # /sdrf:convert — pipeline guidance
│ ├── sdrf-design/SKILL.md # /sdrf:design — experimental design analysis
│ ├── sdrf-techrefine/SKILL.md # /sdrf:techrefine — techsdrf raw file refinement
│ └── sdrf-cellline/SKILL.md # /sdrf:cellline — Cellosaurus → SDRF translation
├── CLAUDE.md # Claude Code — project config
├── GEMINI.md # Gemini CLI — project config
├── BRAINSTORM.md # Design document
└── README.md # This file
The core of this plugin is the skills/ directory: 16 markdown files that encode
annotation methodology. These are platform-agnostic. Each platform just needs a
thin shim to discover and load them:
| Platform | Config File | How It Works |
|---|---|---|
| Claude Code | .claude-plugin/plugin.json + CLAUDE.md | Native plugin: skills auto-discovered, /sdrf:* commands |
| Cursor | .cursor/rules/sdrf-skills.mdc | Rules file with glob trigger on *.sdrf.tsv |
| Codex | .codex/INSTALL.md | Symlink skills to ~/.agents/skills/ |
| Gemini CLI | GEMINI.md | Project-level instructions, auto-loaded |
| OpenCode | .opencode/AGENTS.md | Agent reference file |
The skills themselves work on any AI assistant that can read markdown and call
external APIs (OLS, PRIDE, PubMed). The specification data in spec/ stays
current via the git submodule — no skills need updating when the spec changes.
The tools AI assistants need already exist as MCP servers (OLS, PRIDE, PubMed). What was missing was the expertise — knowing:
- Which ontology to search for which column
- How to read a paper and extract SDRF metadata
- What the most common annotation errors are and how to fix them
- How to select the right templates for an experiment
- What "good" SDRF annotation looks like
Skills encode this expertise as structured workflows that any AI assistant follows step by step. No custom code to build, deploy, or maintain — just markdown files that teach the AI the methodology of an experienced SDRF annotator.
The SDRF specification evolves independently. To pull the latest:
# Update to latest spec:
git submodule update --remote --recursive
# Commit the updated reference:
git add spec
git commit -m "Update SDRF spec to latest version"
Skills read spec/ files at runtime, so updating the submodule is all that's needed.
No SKILL.md files need to be modified when columns or templates change.
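To make "read at runtime" concrete, here is a minimal sketch of loading the column definitions from the submodule. The path comes from the repository layout in this README; the function name `load_terms` is hypothetical, and no assumption is made about which columns TERMS.tsv contains (rows are returned keyed by whatever header the file has).

```python
import csv
from pathlib import Path
from typing import Dict, List

TERMS_PATH = Path("spec/sdrf-proteomics/TERMS.tsv")  # path from the repo layout


def load_terms(path: Path = TERMS_PATH) -> List[Dict[str, str]]:
    """Read the column-definition table; fail loudly if the submodule is empty."""
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found. Run: git submodule update --init --recursive"
        )
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

Because the file is read on every call, pulling a newer spec changes the result without any change to the calling code.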
To add a new skill:
- Create skills/your-skill/SKILL.md with YAML frontmatter
- Write the workflow instructions in markdown
- Reference spec/ files for any specification data (never hardcode)
- Test with Claude Code: /your-skill [arguments]
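The first two steps above could be scaffolded with a short script. This is a sketch under assumptions: the frontmatter keys `name` and `description` are a guess at what the platform expects, and `scaffold_skill` is a hypothetical helper, so compare the output with an existing SKILL.md in skills/ before using it.

```python
from pathlib import Path


def scaffold_skill(root: Path, name: str, description: str) -> Path:
    """Create skills/<name>/SKILL.md with minimal YAML frontmatter."""
    skill_dir = root / "skills" / name
    skill_dir.mkdir(parents=True, exist_ok=True)
    skill_file = skill_dir / "SKILL.md"
    if skill_file.exists():
        # Refuse to overwrite an existing skill.
        raise FileExistsError(f"{skill_file} already exists")
    skill_file.write_text(
        "---\n"
        f"name: {name}\n"                # assumed frontmatter key
        f"description: {description}\n"  # assumed frontmatter key
        "---\n\n"
        "## Workflow\n\n"
        "1. Read the relevant files under spec/ (never hardcode spec data).\n",
        encoding="utf-8",
    )
    return skill_file
```

The generated file is only a starting point; the workflow instructions still have to be written by hand.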
Maintained by the BigBio team.
- Yasset Perez-Riverol (maintainer) — @ypriverol · ypriverol@gmail.com · @ypriverol
- Asier Larrea Sebal — @asierlarrea · EMBL-EBI
For questions about the SDRF specification itself, open an issue in bigbio/proteomics-metadata-standard.
MIT