Description
At the end of each ingestion workflow, it would make sense for the output to already contain some QC metrics, with the structure of this information standardised across assay technologies.
Currently, workflows are in openpipelines-bio/openpipeline:
cellranger_multi:
- cellranger_multi_component
- from_cellranger_multi_to_h5mu
Proposed: workflows are in openpipelines-bio/single_cell_ingestion, using components from viash-hub/biobox.
pre_ingestion: Provide some QC on the fastq files
- fastqc | other qc statistics depending on which assay
- multiqc / or other report
cellranger_multi:
- cellranger_multi_component
- from_cellranger_multi_to_h5mu
- add_sample_id / join sample metadata sheet
- ingestion_qc_report
bd_rhapsody:
- bd_rhapsody_mapping
- from_cellranger_multi_to_h5mu
- add_sample_id / join sample metadata sheet
- ingestion_qc_report
cellranger for 10x multiome:
- ...
processing pipeline smartseq2/3:
- ...
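The "add_sample_id / join sample metadata sheet" step could look roughly like the sketch below. It uses plain Python lists and dicts as stand-ins for the real data structures; the function name `attach_sample`, the CSV layout and the example values are assumptions for illustration, not an existing component interface.

```python
import csv
import io


def attach_sample(barcodes, sample_id, sample_sheet_csv):
    """Return (obs_sample_id, sample_metadata) for one sample.

    obs_sample_id: one sample_id value per barcode (would go into .obs)
    sample_metadata: the single sample sheet row for this sample (would go into .uns)
    """
    rows = list(csv.DictReader(io.StringIO(sample_sheet_csv)))
    matches = [row for row in rows if row["sample_id"] == sample_id]
    # The sample sheet should describe each sample exactly once.
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one sheet row for {sample_id!r}, found {len(matches)}"
        )
    return [sample_id] * len(barcodes), matches


# Hypothetical sample sheet and barcodes.
sheet = "sample_id,experiment\ns1,exp_a\ns2,exp_b\n"
obs_sample_id, sample_metadata = attach_sample(["AAAC", "AAAG"], "s1", sheet)
```

Failing on a missing or duplicated sample id keeps the per-cell column and the sample-level table in sync from the start.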
At the end of this workflow, there should be an h5mu file per sample containing:
MuData (/TileDBSOMA?) containing n_obs (barcodes) × n_vars (genes and so on):
GEX:
- layers
  - counts: Raw counts
- obs:
  - sample_id: categorical
  - {prefix}_id
  - ... metrics by calculate_qc_metrics (but renamed?)...
- var:
  - _index: structured id (e.g. ensembl for RNA)
  - feature_name: human readable
  - is_mitochondrial (RNA only)
  - is_ribosomal (RNA only)
  - is_hemoglobin? (RNA only)
  - ... other variables we fetch from the reference ...
  - ... metrics by calculate_qc_metrics (but renamed?)...
- uns
  - obs_glossary: data frame - id, label, description
  - var_glossary: data frame - id, label, description
Global:
- uns:
  - {prefix}_metadata: which also contains {prefix}_id
  - sample_metadata: a data frame that contains at least the 'sample_id' column, and the categories are the same as .obs["sample_id"]
    - sample_id: ...
    - ... sample sheet information provided by user (e.g. experiment)
    - ... mapping metrics ...
    - ... sample qc metrics ...
  - sample_metadata_glossary: data frame - id, label, description
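The layout above implies a few invariants that a step such as ingestion_qc_report could check: every obs column has a glossary entry with id, label and description, and the ids in sample_metadata match the categories of .obs["sample_id"]. A minimal, library-agnostic sketch follows. Plain dicts and lists stand in for the MuData slots, and the function name `check_ingestion_output` and the example values are hypothetical.

```python
def check_ingestion_output(obs, obs_glossary, sample_metadata):
    """Return a list of problems found in a dict-based stand-in for one modality.

    obs: dict mapping column name -> list of per-barcode values
    obs_glossary: list of row dicts with 'id', 'label', 'description'
    sample_metadata: list of row dicts, each with at least 'sample_id'
    """
    problems = []

    # Every obs column should be documented in the glossary.
    documented = {row.get("id") for row in obs_glossary}
    for column in obs:
        if column not in documented:
            problems.append(f"obs column '{column}' has no glossary entry")

    # Glossary rows need id, label and description.
    for row in obs_glossary:
        missing = {"id", "label", "description"} - set(row)
        if missing:
            problems.append(f"glossary row {row.get('id')!r} lacks {sorted(missing)}")

    # sample_metadata must cover exactly the sample ids used in obs.
    obs_samples = set(obs.get("sample_id", []))
    meta_samples = {row["sample_id"] for row in sample_metadata}
    if obs_samples != meta_samples:
        problems.append(
            f"sample ids differ: obs={sorted(obs_samples)}, metadata={sorted(meta_samples)}"
        )
    return problems


# Hypothetical example: 'total_counts' is present in obs but undocumented.
obs = {"sample_id": ["s1", "s1", "s2"], "total_counts": [1200, 950, 3100]}
obs_glossary = [
    {"id": "sample_id", "label": "Sample", "description": "Sample identifier"},
]
sample_metadata = [{"sample_id": "s1"}, {"sample_id": "s2"}]
problems = check_ingestion_output(obs, obs_glossary, sample_metadata)
```

The same checks would apply to var and var_glossary; in a real component they would read from the h5mu file rather than from dicts.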
TODO: Discuss whether to store certain values in obs vs obsm, or var vs varm
At least at the workflow level, components should indicate what the expected / required schema of the h5ads / h5mus is. Example from src/labels_transfer/api/common_arguments.yaml (openpipeline, lines 11 to 31 in 7a90f3a).
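One possible way for a workflow to state its required schema in machine-readable form, and to check an input against it, is sketched below. The `REQUIRED_SCHEMA` layout, the `rna` modality key and the `missing_fields` helper are hypothetical; they do not reproduce the content of common_arguments.yaml.

```python
# Hypothetical declaration of the slots a workflow requires, per modality.
REQUIRED_SCHEMA = {
    "rna": {
        "layers": ["counts"],
        "obs": ["sample_id"],
        "var": ["feature_name", "is_mitochondrial"],
    },
}


def missing_fields(schema, available):
    """List 'modality/slot/key' entries required by schema but absent from available."""
    missing = []
    for modality, slots in schema.items():
        present_slots = available.get(modality, {})
        for slot, keys in slots.items():
            present = set(present_slots.get(slot, []))
            missing.extend(
                f"{modality}/{slot}/{key}" for key in keys if key not in present
            )
    return missing


# Stand-in for the keys found in an input file (assumed values).
available = {
    "rna": {"layers": ["counts"], "obs": ["sample_id"], "var": ["feature_name"]},
}
print(missing_fields(REQUIRED_SCHEMA, available))
```

Reporting every missing field at once, rather than stopping at the first, makes it easier to see how far an input is from what the workflow expects.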
This is slightly related to what was written in #102
The updated ingestion workflows could be stored in a separate repository, e.g. openpipelines-bio/scrnaseq_ingestion, to avoid breaking existing workflows and to start implementing some separation of concerns.