Ingestion changes

At the end of each ingestion workflow, the output should already contain some QC metrics, and the structure of this information should be standardised across assay technologies.

Currently, workflows are in `openpipelines-bio/openpipeline`:

cellranger_multi:

* cellranger_multi_component
* from_cellranger_multi_to_h5mu

-----

Proposed: workflows are in `openpipelines-bio/single_cell_ingestion`, using components from `viash-hub/biobox`.

pre_ingestion: Provide some QC on the fastq files

* fastqc or other QC statistics, depending on the assay
* multiqc or another report

cellranger_multi:

* cellranger_multi_component
* from_cellranger_multi_to_h5mu
* add_sample_id / join sample metadata sheet
* ingestion_qc_report

bd_rhapsody:

* bd_rhapsody_mapping
* from_cellranger_multi_to_h5mu
* add_sample_id / join sample metadata sheet
* ingestion_qc_report

cellranger for 10x multiome:

* ...

processing pipeline smartseq2/3:

* ...
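
The `add_sample_id / join sample metadata sheet` step in the workflows above could look roughly like the sketch below. This is only one reading of that step, written in plain Python rather than against MuData. The helper name `attach_sample`, the `barcode` key and the contents of the sample sheet are hypothetical; only the `sample_id` column and the `experiment` example come from this proposal.

```python
import csv
import io


def attach_sample(barcodes, sample_id, sample_sheet_csv):
    """Tag every barcode with `sample_id` and look up that sample's sheet row.

    Returns (obs_rows, sample_metadata_rows). Hypothetical helper, not an
    existing component.
    """
    rows = list(csv.DictReader(io.StringIO(sample_sheet_csv)))
    matches = [row for row in rows if row.get("sample_id") == sample_id]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one sample sheet row for {sample_id!r}, "
            f"found {len(matches)}"
        )
    # Stand-in for adding a `sample_id` column to `.obs`.
    obs_rows = [{"barcode": b, "sample_id": sample_id} for b in barcodes]
    # Stand-in for the data frame stored under `uns["sample_metadata"]`.
    return obs_rows, matches


# Made-up sample sheet; only the `sample_id` column is required by the proposal.
SHEET = "sample_id,experiment\ns1,exp_a\ns2,exp_b\n"
obs, sample_metadata = attach_sample(["AAAC", "AAAG"], "s1", SHEET)
```

Failing loudly when the sample is missing from the sheet (or listed twice) is a design choice of this sketch, not something the proposal specifies.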

------------

At the end of this workflow, there should be one h5mu file per sample containing:

MuData (/TileDBSOMA?) containing n_obs (barcodes) × n_vars (genes and so on):

GEX:

  * `layers`
    - `counts`: Raw counts
  * `obs`:
    - `sample_id`: categorical
    - `{prefix}_id`
    - ... metrics by calculate_qc_metrics (but renamed?)...
  * `var`: 
    - `_index`: structured id (e.g. ensembl for RNA)
    - `feature_name`: human readable
    - `is_mitochondrial` (RNA only)
    - `is_ribosomal` (RNA only)
    - `is_hemoglobin`? (RNA only)
    - ... other variables we fetch from the reference ...
    - ... metrics by calculate_qc_metrics (but renamed?)...
  * `uns`
    - `obs_glossary`: data frame - id, label, description
    - `var_glossary`: data frame - id, label, description

Global:

* `uns`:
  * `{prefix}_metadata`: which also contains `{prefix}_id`
  * `sample_metadata`: a data frame that contains at least the `sample_id` column; its categories are the same as those of `.obs["sample_id"]`
    * `sample_id`: ...
    * ... sample sheet information provided by user (e.g. experiment)
    * ... mapping metrics ...
    * ... sample qc metrics ...
  * `sample_metadata_glossary`: data frame - id, label, description
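
The consistency rule between `.obs["sample_id"]` and `uns["sample_metadata"]` described above could be checked along these lines. This is a minimal sketch in plain Python: lists and dicts stand in for the MuData slots, and the function name `check_sample_metadata` is hypothetical.

```python
def check_sample_metadata(obs_sample_ids, sample_metadata):
    """Return a list of problems; an empty list means the rule holds.

    obs_sample_ids: per-barcode sample ids (stand-in for .obs["sample_id"]).
    sample_metadata: list of dict rows (stand-in for uns["sample_metadata"]).
    """
    rows_without_id = [
        i for i, row in enumerate(sample_metadata) if "sample_id" not in row
    ]
    if rows_without_id:
        return [f"rows without a 'sample_id' column: {rows_without_id}"]

    problems = []
    obs_categories = set(obs_sample_ids)
    meta_categories = {row["sample_id"] for row in sample_metadata}
    only_in_obs = sorted(obs_categories - meta_categories)
    only_in_meta = sorted(meta_categories - obs_categories)
    if only_in_obs:
        problems.append(f"in obs but not in sample_metadata: {only_in_obs}")
    if only_in_meta:
        problems.append(f"in sample_metadata but not in obs: {only_in_meta}")
    return problems


# Two samples, both described in the metadata: no problems reported.
ok = check_sample_metadata(
    ["s1", "s1", "s2"],
    [{"sample_id": "s1"}, {"sample_id": "s2"}],
)
# "s3" has barcodes but no metadata row, "s2" has a row but no barcodes.
bad = check_sample_metadata(
    ["s1", "s3"],
    [{"sample_id": "s1"}, {"sample_id": "s2"}],
)
```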

TODO: Discuss whether to store certain values in `obs` vs `obsm`, or `var` vs `varm`

At least at the workflow level, components should indicate what the expected / required schema of the h5ads / h5mus is. Example from `src/labels_transfer/api/common_arguments.yaml`: https://github.com/openpipelines-bio/openpipeline/blob/7a90f3a83a8ad0e90db642cae761ddaa50faa04b/src/labels_transfer/api/common_arguments.yaml#L11-L31

This is slightly related to what was written in https://github.com/openpipelines-bio/openpipeline/issues/102

--------------------

The updated ingestion workflows could be stored in a separate repository, e.g. `openpipelines-bio/scrnaseq_ingestion`, so that existing workflows are not broken and some separation of concerns can be introduced.

Referenced snippet from `src/labels_transfer/api/common_arguments.yaml`:

    file_format:
      type: h5mu
      mod:
        rna:
          description: "Modality in AnnData format containing RNA data."
          required: true
          slots:
            X:
              type: double
              name: features
              required: false
              description: |
                The expression data to use for the classifier's inference, if `--input_obsm_features` argument is not provided.
            obsm:
              - type: "double"
                name: "features"
                example: X_scvi
                required: false
                description: |
                  The embedding to use for the classifier's inference. Override using the `--input_obsm_features` argument. If not provided, the `.X` slot will be used instead.
                  Make sure that embedding was obtained in the same way as the reference embedding (e.g. by the same model or preprocessing).
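
A schema in this `file_format` shape could in principle be checked mechanically. The sketch below is a rough illustration in plain Python, not an existing openpipeline or viash feature. The function `missing_required`, the `INGESTION_SCHEMA` dict and its `required` flags are assumptions made for the example; the slot names are taken from the proposed output structure in this issue.

```python
# Hypothetical schema for the ingestion output, mirroring the YAML layout.
# Which slots are marked required is an assumption for this example.
INGESTION_SCHEMA = {
    "type": "h5mu",
    "mod": {
        "rna": {
            "required": True,
            "slots": {
                "layers": [{"name": "counts", "required": True}],
                "obs": [{"name": "sample_id", "required": True}],
                "var": [
                    {"name": "feature_name", "required": True},
                    {"name": "is_mitochondrial", "required": False},
                ],
            },
        }
    },
}


def missing_required(file_format, available):
    """List required modalities and slot entries absent from `available`.

    available: {modality: {slot: set of keys}}; for `X`, use True or False.
    """
    missing = []
    for mod_name, mod_spec in file_format.get("mod", {}).items():
        if mod_name not in available:
            if mod_spec.get("required", False):
                missing.append(f"mod/{mod_name}")
            continue
        present = available[mod_name]
        for slot, spec in mod_spec.get("slots", {}).items():
            # `X` is described by one mapping, keyed slots by a list of entries.
            entries = [spec] if isinstance(spec, dict) else spec
            for entry in entries:
                if not entry.get("required", False):
                    continue
                if slot == "X":
                    if not present.get("X"):
                        missing.append(f"mod/{mod_name}/X")
                elif entry["name"] not in present.get(slot, set()):
                    missing.append(f"mod/{mod_name}/{slot}/{entry['name']}")
    return missing


complete = {"rna": {"layers": {"counts"}, "obs": {"sample_id"}, "var": {"feature_name"}}}
no_counts = {"rna": {"layers": set(), "obs": {"sample_id"}, "var": {"feature_name"}}}
```

Calling `missing_required(INGESTION_SCHEMA, complete)` returns an empty list, while `no_counts` is reported as missing `mod/rna/layers/counts`.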


Ingestion changes #965
