Ingestion changes #965

Open
@rcannood

At the end of each ingestion workflow, the output should already contain some QC metrics, and the structure of this information should be standardised across assay technologies.

Currently, workflows are in openpipelines-bio/openpipeline:

cellranger_multi:

  • cellranger_multi_component
  • from_cellranger_multi_to_h5mu

Proposed: workflows are in openpipelines-bio/single_cell_ingestion, using components from viash-hub/biobox.

pre_ingestion: Provide some QC on the fastq files

  • fastqc | other qc statistics depending on which assay
  • multiqc / or other report

cellranger_multi:

  • cellranger_multi_component
  • from_cellranger_multi_to_h5mu
  • add_sample_id / join sample metadata sheet
  • ingestion_qc_report

bd_rhapsody:

  • bd_rhapsody_mapping
  • from_cellranger_multi_to_h5mu
  • add_sample_id / join sample metadata sheet
  • ingestion_qc_report

cellranger for 10x multiome:

  • ...

processing pipeline smartseq2/3:

  • ...
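One way to make the shared tail of these workflows explicit is to write the step lists down as data and check them. This is a minimal sketch that only uses the component names listed above; the `WORKFLOWS` dict, `COMMON_TAIL` and the `missing_common_steps` helper are hypothetical and not part of openpipeline or viash.

```python
# Step lists copied from the proposal above; the dict itself is illustrative.
WORKFLOWS = {
    "cellranger_multi": [
        "cellranger_multi_component",
        "from_cellranger_multi_to_h5mu",
        "add_sample_id",
        "ingestion_qc_report",
    ],
    "bd_rhapsody": [
        "bd_rhapsody_mapping",
        "from_cellranger_multi_to_h5mu",
        "add_sample_id",
        "ingestion_qc_report",
    ],
}

# Steps that every ingestion workflow is proposed to end with (assumption
# drawn from the two fully specified workflows above).
COMMON_TAIL = ["add_sample_id", "ingestion_qc_report"]


def missing_common_steps(steps):
    """Return the common tail steps that are absent from the end of `steps`."""
    tail = steps[-len(COMMON_TAIL):]
    return [s for s in COMMON_TAIL if s not in tail]


for name, steps in WORKFLOWS.items():
    print(name, "missing:", missing_common_steps(steps))
```

A check like this could run in CI once the remaining workflows (10x multiome, smartseq2/3) have their steps filled in.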

At the end of this workflow, there should be one h5mu file per sample containing:

MuData (/TileDBSOMA?) containing n_obs (barcodes) × n_vars (genes and so on):

GEX:

  • layers
    • counts: Raw counts
  • obs:
    • sample_id: categorical
    • {prefix}_id
    • ... metrics by calculate_qc_metrics (but renamed?)...
  • var:
    • _index: structured id (e.g. ensembl for RNA)
    • feature_name: human readable
    • is_mitochondrial (RNA only)
    • is_ribosomal (RNA only)
    • is_hemoglobin? (RNA only)
    • ... other variables we fetch from the reference ...
    • ... metrics by calculate_qc_metrics (but renamed?)...
  • uns
    • obs_glossary: data frame - id, label, description
    • var_glossary: data frame - id, label, description
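The `is_mitochondrial`, `is_ribosomal` and `is_hemoglobin` flags in `var` above could be derived from `feature_name`. This is a minimal sketch, assuming human gene symbols and common prefix conventions (`MT-` for mitochondrial, `RPS`/`RPL` for ribosomal, `HB` followed by a non-`P` character for hemoglobin). The issue does not say how the flags should be computed, and `flag_features` is a hypothetical helper.

```python
import re

# Assumed prefix conventions for human gene symbols; adjust per species/reference.
MITO = re.compile(r"^MT-")
RIBO = re.compile(r"^RP[SL]")
HEMO = re.compile(r"^HB[^P]")  # excludes e.g. HBP1, which is not a hemoglobin gene


def flag_features(feature_names):
    """Return one dict of boolean QC flags per feature name."""
    return [
        {
            "feature_name": name,
            "is_mitochondrial": bool(MITO.match(name)),
            "is_ribosomal": bool(RIBO.match(name)),
            "is_hemoglobin": bool(HEMO.match(name)),
        }
        for name in feature_names
    ]


for row in flag_features(["MT-CO1", "RPS6", "HBB", "HBP1", "ACTB"]):
    print(row)
```

Prefix matching is only a heuristic (for instance, `^RP[SL]` also matches kinases such as RPS6KA1), so flags fetched from the reference annotation may be preferable where available.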

Global:

  • uns:
    • {prefix}_metadata: which also contains {prefix}_id
    • sample_metadata: a data frame that contains at least the 'sample_id' column, and the categories are the same as .obs["sample_id"]
      • sample_id: ...
      • ... sample sheet information provided by user (e.g. experiment)
      • ... mapping metrics ...
      • ... sample qc metrics ...
    • sample_metadata_glossary: data frame - id, label, description

TODO: Discuss whether to store certain values in obs vs obsm, or var vs varm

At least at the workflow level, components should indicate what the expected / required schema of the h5ad / h5mu files is. Example from src/labels_transfer/api/common_arguments.yaml:

file_format:
  type: h5mu
  mod:
    rna:
      description: "Modality in AnnData format containing RNA data."
      required: true
      slots:
        X:
          type: double
          name: features
          required: false
          description: |
            The expression data to use for the classifier's inference, if `--input_obsm_features` argument is not provided.
        obsm:
          - type: "double"
            name: "features"
            example: X_scvi
            required: false
            description: |
              The embedding to use for the classifier's inference. Override using the `--input_obsm_features` argument. If not provided, the `.X` slot will be used instead.
              Make sure that embedding was obtained in the same way as the reference embedding (e.g. by the same model or preprocessing).
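A declaration like this could be checked against a file's actual contents. This is a minimal sketch, assuming the YAML has already been parsed into a dict and the file's slots have been summarised as a nested dict of names. `SCHEMA` is a trimmed, illustrative copy of the fragment above, `check_modality` is a hypothetical helper, and only list-style slots such as `obsm` are handled.

```python
# Parsed form of a schema fragment like the one above (trimmed, illustrative).
SCHEMA = {
    "type": "h5mu",
    "mod": {
        "rna": {
            "required": True,
            "slots": {
                "obsm": [{"name": "features", "type": "double", "required": False}],
            },
        }
    },
}


def check_modality(schema, contents):
    """Return required modalities or slot entries that are missing from `contents`.

    contents maps modality name -> slot name -> list of entry names,
    e.g. {"rna": {"obsm": ["X_scvi"]}}.
    """
    missing = []
    for mod_name, mod_spec in schema.get("mod", {}).items():
        if mod_name not in contents:
            if mod_spec.get("required", False):
                missing.append(f"mod/{mod_name}")
            continue
        for slot, entries in mod_spec.get("slots", {}).items():
            if not isinstance(entries, list):
                continue  # single-entry slots such as X are not covered by this sketch
            present = contents[mod_name].get(slot, [])
            for entry in entries:
                if entry.get("required", False) and entry["name"] not in present:
                    missing.append(f"mod/{mod_name}/{slot}/{entry['name']}")
    return missing


print(check_modality(SCHEMA, {"rna": {"obsm": ["X_scvi"]}}))
print(check_modality(SCHEMA, {"prot": {}}))
```

Running the same check at the start and end of each workflow would turn the declared schema into something enforceable rather than purely descriptive.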

This is slightly related to what was written in #102


The updated ingestion workflows could be stored in a separate repository, e.g. openpipelines-bio/scrnaseq_ingestion, so that existing workflows are not broken and we can start implementing some separation of concerns.


Labels: enhancement (New feature or request)
