feat: enrich RO-Crate output files with EDAM format IDs via tataki #47

Merged: suecharo merged 3 commits into main from feat/tataki-edam-enrichment on Mar 4, 2026

inutano (Collaborator) commented on Mar 3, 2026

Summary

  • Adds add_tataki_edam() to sapporo/ro_crate.py, called at the end of generate_ro_crate_metadata() before final JSON-LD serialisation
  • Runs tataki via Docker against all files in outputs/, parses the JSON result, and replaces each File entity's encodingFormat with a proper EDAM ontology entity
  • Enables tonkaz Level 1–3 file-content comparison on typical nf-core pipeline outputs

What changes

Before (encodingFormat was a mixed array):

"encodingFormat": ["text/tab-separated-values", {"@id": "https://www.iana.org/..."}]

After (proper EDAM entity, as tonkaz expects):

"encodingFormat": {"@id": "http://edamontology.org/format_3475"}

tataki detects formats by file content (not just extension), covering TSV, CSV, JSON, HTML, PDF, PNG, SVG in addition to the existing genomic formats (BAM, VCF, FASTQ, …).
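
The before/after change above can be sketched as a small pure function. This is a minimal illustration, not the code in sapporo/ro_crate.py: the function name `replace_encoding_formats` and the shape of the `detected` mapping (File `@id` to EDAM URI, or `None` when nothing was identified) are assumptions made for the example.

```python
import copy

EDAM_PREFIX = "http://edamontology.org/format_"


def replace_encoding_formats(crate, detected):
    """Return a copy of `crate` with EDAM-based encodingFormat values.

    `crate` is a parsed ro-crate-metadata.json (a dict with an "@graph" list).
    `detected` maps a File entity's "@id" to an EDAM format URI, or to None
    when no format was identified (hypothetical shape for this sketch).
    """
    updated = copy.deepcopy(crate)
    for entity in updated.get("@graph", []):
        types = entity.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if "File" not in types:
            continue
        edam_uri = detected.get(entity.get("@id"))
        if edam_uri and edam_uri.startswith(EDAM_PREFIX):
            # Replace the mixed array with a single EDAM reference.
            entity["encodingFormat"] = {"@id": edam_uri}
        # Otherwise the existing encodingFormat is left untouched.
    return updated


# Small demonstration with made-up file names.
crate = {
    "@graph": [
        {
            "@id": "outputs/counts.tsv",
            "@type": "File",
            "encodingFormat": [
                "text/tab-separated-values",
                {"@id": "https://www.iana.org/..."},
            ],
        },
        {
            "@id": "outputs/unknown.bin",
            "@type": "File",
            "encodingFormat": "application/octet-stream",
        },
    ]
}
detected = {
    "outputs/counts.tsv": "http://edamontology.org/format_3475",
    "outputs/unknown.bin": None,
}
updated = replace_encoding_formats(crate, detected)
```

In this sketch the TSV file ends up with the single EDAM reference, while the unidentified file keeps its original `encodingFormat`.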

Design

  • Separate images: tataki runs as its own Docker container; no changes to the sapporo image
  • Best-effort: if Docker is unavailable or tataki fails, a warning is logged and the crate is left unchanged — the run never fails because of enrichment
  • Replaces only when detected: files tataki cannot identify keep their existing encodingFormat
  • Single batch call: all output file paths are passed to tataki in one docker run invocation
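
The best-effort and single-batch-call points above might look roughly like the following. It is a sketch under stated assumptions rather than the merged implementation: `run_tataki_batch` is a hypothetical helper, the command prefix is passed in by the caller, and the example prefix in the comment (volume mount layout and the `-f json` flag) is an assumption that should be checked against the tataki documentation.

```python
import json
import logging
import subprocess

logger = logging.getLogger(__name__)

# Hypothetical prefix; the mount layout and the tataki flags are assumptions:
# ["docker", "run", "--rm", "-v", "/path/to/outputs:/path/to/outputs:ro",
#  "ghcr.io/sapporo-wes/tataki:latest", "-f", "json"]


def run_tataki_batch(command_prefix, file_paths, timeout=600):
    """Run one detection command over all files; return parsed JSON or None.

    All paths are appended to a single command line, so the container is
    started once. Any failure is logged as a warning and reported as None,
    so the caller can leave the crate unchanged.
    """
    if not file_paths:
        return None
    try:
        completed = subprocess.run(
            [*command_prefix, *file_paths],
            capture_output=True,
            text=True,
            check=True,
            timeout=timeout,
        )
        return json.loads(completed.stdout)
    except (OSError, subprocess.SubprocessError, ValueError) as exc:
        # OSError: executable missing (e.g. Docker not installed).
        # SubprocessError: non-zero exit status or timeout.
        # ValueError: output was not valid JSON.
        logger.warning("tataki enrichment skipped: %s", exc)
        return None
```

Catching the failure inside the helper keeps the calling code simple: a `None` result means "skip enrichment", and the run itself is never affected.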

Dependency

Requires ghcr.io/sapporo-wes/tataki:latest to be available on the Docker host. This image is built and published via sapporo-wes/tataki#16.

Test plan

  • Run nf-core/rnaseq via Sapporo and verify ro-crate-metadata.json contains encodingFormat: {"@id": "http://edamontology.org/format_XXXX"} for TSV, HTML, JSON output files
  • Verify tataki failure (image not present) does not break run completion or crate generation
  • Run tonkaz on two identical runs and confirm Level 3 score > 0
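
The first check in the test plan could be partly automated with a helper along these lines. It is only a sketch: `edam_format_by_file` is a hypothetical name, and the sample crate below uses made-up file names rather than real nf-core/rnaseq outputs.

```python
EDAM_PREFIX = "http://edamontology.org/format_"


def edam_format_by_file(crate):
    """Map each File entity's "@id" to its EDAM format URI, or None.

    A file counts as enriched only when encodingFormat is a single
    {"@id": ...} reference pointing at an EDAM format term.
    """
    result = {}
    for entity in crate.get("@graph", []):
        types = entity.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if "File" not in types:
            continue
        fmt = entity.get("encodingFormat")
        uri = fmt.get("@id") if isinstance(fmt, dict) else None
        is_edam = isinstance(uri, str) and uri.startswith(EDAM_PREFIX)
        result[entity.get("@id")] = uri if is_edam else None
    return result


# Made-up sample standing in for a parsed ro-crate-metadata.json.
sample = {
    "@graph": [
        {"@id": "./", "@type": "Dataset"},
        {
            "@id": "outputs/report.tsv",
            "@type": "File",
            "encodingFormat": {"@id": "http://edamontology.org/format_3475"},
        },
        {
            "@id": "outputs/notes.txt",
            "@type": "File",
            "encodingFormat": "text/plain",
        },
    ]
}
formats = edam_format_by_file(sample)
not_enriched = sorted(path for path, uri in formats.items() if uri is None)
```

For a real run, the crate would be loaded with `json.load` from ro-crate-metadata.json, and `not_enriched` would list the output files that still carry a non-EDAM `encodingFormat`.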

🤖 Generated with Claude Code

inutano and others added 3 commits March 3, 2026 14:23
Adds add_tataki_edam() to ro_crate.py, called at the end of
generate_ro_crate_metadata() before the final JSON-LD serialisation.

The function runs tataki (ghcr.io/sapporo-wes/tataki:latest) as a
Docker container against all files in the outputs/ directory, parses
the JSON result, and replaces each File entity's encodingFormat with
a proper EDAM ontology ContextEntity:

    "encodingFormat": {"@id": "http://edamontology.org/format_3475"}

This replaces the previous mixed-array format
    "encodingFormat": ["text/html", {"@id": "...iana..."}]
for files that tataki recognises, enabling tonkaz Level 1-3
file-content comparison on typical nf-core pipeline outputs
(TSV, CSV, JSON, HTML, PDF, PNG, SVG, BAM, VCF, …).

Enrichment is always best-effort: if Docker is unavailable or tataki
fails for any reason the function logs a warning and returns, leaving
the crate unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
suecharo merged commit 355930d into main on Mar 4, 2026 (10 checks passed)
suecharo deleted the feat/tataki-edam-enrichment branch on Mar 4, 2026 02:23