data_model/standards.md
Lines changed: 4 additions & 78 deletions
@@ -8,82 +8,8 @@ HTAN Centers submit assay data files and [metadata](../data_submission/metadata.
For **HTAN Phase 1**, [The HTAN Portal's Data Standards pages](https://humantumoratlas.org/standards) provide interactive, searchable and downloadable summaries of the metadata attributes, requirements and valid values expected for each data type.

!!! The following table provides links to HTAN Phase 1 data standards. Additional documentation for Phase 2 Data Standards is currently in development. Please see [Data Model Introduction](../data_model/overview.md) for more information regarding the Phase 2 Data Model.

For **HTAN Phase 2**, there are both metadata standards and specific file requirements.
- Interactive, searchable and downloadable summaries of metadata requirements are provided [here](https://htan2-data-model.readthedocs.io/en/latest/index.html).
- Specific file requirements for single cell RNA-seq h5ad files are modeled after [CELLxGENE's requirements](https://cellxgene.cziscience.com/docs/032__Contribute%20and%20Publish%20Data). Please see the [Phase 2 Single Cell RNA-seq page](../data_submission/scrnaseq_data_submission.md) for more information.

In HTAN Phase 2, the following files are submitted for single cell/single nuclei RNA-sequencing (sc/snRNA-seq) data:

| Level | Data Type | Example Files |
|---|-------------------|----------------------|
| 1 | raw sequence data | fastq, unaligned bam |
| 2 | aligned sequence data | bam |
| 3_4 | sample level summary information, e.g. cell annotations, t-SNE/UMAP coordinates, etc. | h5ad |

Metadata requirements are documented in the HTAN Data Model [readthedocs](https://htan2-data-model.readthedocs.io/en/latest/docs/scrna-seq.html) pages. This part of the manual describes **file requirements** for level 3_4 h5ad files.
## HTAN's h5ad Requirements
HTAN Centers are encouraged to reference the [sc/snRNA-seq RFC](https://docs.google.com/document/d/1XjDLWulYWhnfZrGCg-0_Jh93ytIp3p_01ZrTyymTjoU/edit?usp=sharing) for additional details. The HTAN h5ad (AnnData 0.10) requirements are modeled after CELLxGENE's requirements. They also include three attributes developed by the Human Cell Atlas (HCA). Please see the [Background](#background-h5ad-files-cellxgene-human-cell-atlas) section below for more information about h5ad (AnnData 0.10) files, CELLxGENE and the HCA.
### Required File Attributes
Similar to CELLxGENE's [Dataset Requirements](https://cellxgene.cziscience.com/docs/032__Contribute%20and%20Publish%20Data), level 3_4 sc/snRNA-seq h5ad files must contain the following attributes. Please see [HTAN_h5ad_exemplar_2025_03_03.h5ad](https://github.com/ncihtan/h5ad/blob/main/exemplars/HTAN_h5ad_exemplar_2025_03_03.h5ad) for an example file which meets these requirements.
| var.gene_is_filtered, raw.var.gene_id_filtered || No genes are filtered in the raw data. If a gene is filtered in the normalized data, its count is set to 0 and gene_is_filtered is set to 1. |
| obs.organism_ontology_term_id |[NCBITaxon](https://www.ncbi.nlm.nih.gov/taxonomy)| Set to NCBITaxon:9606 for human. |
| obs.donor_id || Set to the HTAN Participant ID, e.g. HTA201_1. |
| obs.sample_id || Set to the HTAN Biospecimen ID, e.g. HTA201_1_B. |
| obs.development_stage_ontology_term_id |[Human Development Stages (HsapDv)](https://www.ebi.ac.uk/ols4/ontologies/HsapDv)| Use [HCA recommended terms](https://docs.google.com/document/d/1SsHZweG_kqerCAPNbQF7gQHNBDRqOsNZWzaWXZIKwTE/edit?usp=sharing) (p.22). |
| obs.sex_ontology_term_id |[Phenotype and Trait Ontology (PATO)](https://www.ebi.ac.uk/ols4/ontologies/pato)| Use [CELLxGENE Requirements](https://cellxgene.cziscience.com/docs/032__Contribute%20and%20Publish%20Data#dataset-requirements): PATO:0000384 for male, PATO:0000383 for female, or unknown if unavailable. |
| obs.self_reported_ethnicity_term_id |[Human Ancestry Ontology (HANCESTRO)](https://www.ebi.ac.uk/ols4/ontologies/hancestro)| Use [CELLxGENE Requirements](https://cellxgene.cziscience.com/docs/032__Contribute%20and%20Publish%20Data#dataset-requirements). Multiple comma-separated HANCESTRO terms may be used if more than one ethnicity is reported. If information is unavailable, use unknown. Example: HANCESTRO_0568. Note that CELLxGENE specifically excludes certain HANCESTRO categories. See [full details](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.2.0/schema.md#self_reported_ethnicity_ontology_term_id). |
| obs.tissue_type | CELLxGENE | Use [CELLxGENE Requirements](https://cellxgene.cziscience.com/docs/032__Contribute%20and%20Publish%20Data#dataset-requirements). Permitted values are restricted to: tissue, organoid, or cell culture. |
| obs.assay_ontology_term_id |[Experimental Factor Ontology (EFO)](https://www.ebi.ac.uk/ols4/ontologies/efo)| Use [CELLxGENE Requirements](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.2.0/schema.md#assay_ontology_term_id). |
| obs.suspension_type | CELLxGENE | Use [CELLxGENE Requirements](https://cellxgene.cziscience.com/docs/032__Contribute%20and%20Publish%20Data#dataset-requirements). Permitted values are restricted to: cell, nucleus, na. |
| obs.is_primary_data | CELLxGENE. Used to indicate if this is the canonical data set (True), or data is being reused from another source (False). | Use [CELLxGENE Requirements](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.2.0/schema.md#is_primary_data). Permitted values are restricted to True or False. |
| obs.cell_enrichment | Human Cell Atlas: “Specifies the cell types targeted for enrichment or depletion beyond the selection of live cells.” | CL term, followed by + or -. If there was no enrichment, use CL:00000000. For example, enrichment for fibroblasts would be CL:0000057+. |
| obs.intron_inclusion | Human Cell Atlas: “Were introns included during read counting in the alignment process?” | Permitted values are: yes, no |
| obs.author_cell_type | Human Cell Atlas: “Encoding of author intuition of cellular annotation in the dataset.” | Free text |
|[obsm.X_(suffix)](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.2.0/schema.md#x_suffix)| CELLxGENE: embeddings of at least two dimensions, e.g. tSNE, UMAP, PCA, spatial coordinates | Use [CELLxGENE terms](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.2.0/schema.md#x_suffix) for suffix (e.g. umap, tsne, pca). |
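
Several rows in the table restrict an attribute to a short list of permitted values. As a rough illustration only, the sketch below checks one cell's `obs` values, held in a plain Python dict, against those lists. The function name `check_obs_row`, the dict layout, and the seven-digit CL pattern are assumptions made for this example; this is not the HTAN validator and does not cover every required attribute.

```python
import re

# Permitted values copied from the table above.
# Assumption: samples are human, so only NCBITaxon:9606 is accepted here.
PERMITTED = {
    "organism_ontology_term_id": ("NCBITaxon:9606",),
    "sex_ontology_term_id": ("PATO:0000384", "PATO:0000383", "unknown"),
    "tissue_type": ("tissue", "organoid", "cell culture"),
    "suspension_type": ("cell", "nucleus", "na"),
    "intron_inclusion": ("yes", "no"),
}

# Assumption: a CL term is "CL:" plus seven digits, followed by "+" or "-".
CELL_ENRICHMENT = re.compile(r"CL:\d{7}[+-]")
# The no-enrichment value is taken literally from the table.
NO_ENRICHMENT = "CL:00000000"


def check_obs_row(row):
    """Return a list of problems found in one obs row (a plain dict)."""
    problems = []
    for field, allowed in PERMITTED.items():
        if field not in row:
            problems.append(f"missing {field}")
        elif row[field] not in allowed:
            problems.append(f"{field}: unexpected value {row[field]!r}")

    # is_primary_data must be a real boolean, not 0/1 or a string.
    if not isinstance(row.get("is_primary_data"), bool):
        problems.append("is_primary_data: must be True or False")

    enrichment = row.get("cell_enrichment")
    if not isinstance(enrichment, str):
        problems.append("missing cell_enrichment")
    elif enrichment != NO_ENRICHMENT and not CELL_ENRICHMENT.fullmatch(enrichment):
        problems.append(f"cell_enrichment: unexpected value {enrichment!r}")
    return problems


example_row = {
    "organism_ontology_term_id": "NCBITaxon:9606",
    "sex_ontology_term_id": "PATO:0000383",
    "tissue_type": "tissue",
    "suspension_type": "nucleus",
    "intron_inclusion": "yes",
    "is_primary_data": True,
    "cell_enrichment": "CL:0000057+",
}
print(check_obs_row(example_row))  # → []
```

Returning a list of messages rather than raising on the first failure lets a Center see every problem in a row at once.
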
### HTAN h5ad File Validation
The HTAN Data Coordinating Center (DCC) has released a PyPI package called [HTAN-h5ad-validator](https://pypi.org/project/HTAN-h5ad-validate/) with which Centers can validate their sc/snRNA-seq h5ad files. Sage Bionetworks will run the validator on sc/snRNA-seq h5ad files submitted to Synapse.
## Background: h5ad files, CELLxGENE, Human Cell Atlas
### h5ad (AnnData 0.10) brief overview
Please see [AnnData’s documentation](https://anndata.readthedocs.io/en/latest/index.html) for a more detailed description of the AnnData object.
For HTAN’s purposes, the following parts of the AnnData object are of interest:

* .X - a matrix with counts where rows are cells and columns are genes.
* var - a data frame with gene information (e.g. gene name, gene_is_filtered).
* obs - a data frame with cell-level information.
* obsm - one or more numpy ndarrays with cell embeddings.
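
To make the relationship between these parts concrete, the sketch below models them with plain Python containers and checks that their sizes agree. `H5adLayout` and `shape_problems` are hypothetical names invented for this illustration; this is a stand-in for the layout described above, not the AnnData API.

```python
from dataclasses import dataclass, field


@dataclass
class H5adLayout:
    """Plain-Python stand-in for the parts listed above (not AnnData itself)."""
    X: list                                    # one row per cell, one count per gene
    obs: list                                  # one dict of cell-level values per cell
    var: list                                  # one dict of gene information per gene
    obsm: dict = field(default_factory=dict)   # embedding name -> one coordinate row per cell

    def shape_problems(self):
        """Return a list of size mismatches between X, obs, var and obsm."""
        n_cells, n_genes = len(self.obs), len(self.var)
        problems = []
        if len(self.X) != n_cells:
            problems.append("X must have one row per cell in obs")
        if any(len(row) != n_genes for row in self.X):
            problems.append("each X row must have one count per gene in var")
        for name, embedding in self.obsm.items():
            if len(embedding) != n_cells:
                problems.append(f"obsm[{name!r}] must have one row per cell")
            elif any(len(point) < 2 for point in embedding):
                problems.append(f"obsm[{name!r}] needs at least two dimensions")
        return problems


layout = H5adLayout(
    X=[[3, 0, 1], [0, 2, 5]],                                   # 2 cells x 3 genes
    obs=[{"donor_id": "HTA201_1"}, {"donor_id": "HTA201_1"}],
    var=[{"gene_is_filtered": 0}, {"gene_is_filtered": 0}, {"gene_is_filtered": 0}],
    obsm={"X_umap": [[0.1, 1.2], [0.4, -0.3]]},
)
print(layout.shape_problems())  # → []
```
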
CELLxGENE requires that raw data are submitted. Normalized data may also be submitted.
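
The `gene_is_filtered` rule in the Required File Attributes table describes how the raw and normalized matrices relate: raw counts keep every gene, while a gene filtered from the normalized data has its values set to 0 and its flag set to 1. A minimal sketch of that rule on nested lists follows; `mask_filtered_genes` is a hypothetical helper written for this example, not part of any HTAN or CELLxGENE tool.

```python
def mask_filtered_genes(normalized, filtered_genes):
    """Zero out filtered gene columns and build the gene_is_filtered flags.

    normalized: list of per-cell rows of normalized values.
    filtered_genes: set of column indices removed during filtering.
    Returns a new matrix and a 0/1 flag per gene; the input is not modified.
    """
    n_genes = len(normalized[0]) if normalized else 0
    flags = [1 if gene in filtered_genes else 0 for gene in range(n_genes)]
    masked = [
        [0 if flags[gene] else value for gene, value in enumerate(row)]
        for row in normalized
    ]
    return masked, flags


raw = [[3, 0, 1], [0, 2, 5]]                      # raw counts: no genes filtered
normalized = [[1.4, 0.0, 0.7], [0.0, 1.1, 1.8]]   # made-up normalized values
masked, gene_is_filtered = mask_filtered_genes(normalized, filtered_genes={2})
print(gene_is_filtered)  # → [0, 0, 1]
```
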
### CELLxGENE
The HTAN DCC submits sc/snRNA-seq data to [CELLxGENE](https://cellxgene.cziscience.com), a tool developed by the Chan Zuckerberg Initiative (CZI) to visualize and explore single cell and spatial data. The DCC submits data to CELLxGENE in h5ad (AnnData 0.10) format. CELLxGENE’s schema requires:

* use of Ensembl gene IDs.
* a specific genome reference and annotation version.
* specific h5ad (AnnData 0.10) attributes.
* use of specific ontologies for many of the required attributes (e.g. the Cell Ontology).

The HTAN requirements for h5ad files are modeled after CELLxGENE's [Dataset Requirements](https://cellxgene.cziscience.com/docs/032__Contribute%20and%20Publish%20Data).
### Human Cell Atlas (HCA)
The Human Cell Atlas (HCA) is a large repository of single cell data from healthy subjects. It provides [standards for single-cell data submission](https://docs.google.com/document/d/1SsHZweG_kqerCAPNbQF7gQHNBDRqOsNZWzaWXZIKwTE/edit?usp=sharing) which adopt most of the CELLxGENE schema, but also include additional fields. Aligning HTAN data with CELLxGENE will potentially facilitate data integration with other consortia such as the HCA. The HTAN requirements include three HCA attributes in addition to CELLxGENE required attributes.