Data releases for the SMaHT Data Portal are controlled by the ``status`` field on objects.
Users have access to metadata objects in the system based on their ``consortia`` and ``submission_center`` fields.

Metadata objects become viewable by different groups of users depending on their ``status`` value.
The ``status`` values most relevant to users are described below.

* ``public`` status denotes data that is accessible to all SMaHT Data Portal users.
* ``released`` status denotes data that is accessible only to registered SMaHT Consortia users.
* ``obsolete`` status denotes previously ``released`` data that has been superseded by new data; it is also viewable only by registered SMaHT Consortia members.
* ``restricted`` status denotes controlled-access data whose metadata is viewable by consortia users but which can be downloaded only by dbGaP-approved users. The set of approved users is managed internally by DAC.

Some additional statuses relevant for data submitters include:

* ``uploading`` status is specific to files and indicates that a submitted file is pending md5 computation by DAC. It is viewable only by the submitting center.
* ``uploaded`` status is specific to files and indicates that a submitted file has completed md5 computation by DAC. It is viewable only by the submitting center.
* ``in review`` status is for non-file metadata that is pending review prior to data release. It is viewable only by the submitting center.
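
The visibility rules above can be summarized as a small decision function. This is only an illustrative sketch, not the portal's actual permission code: the ``User`` class, its attribute names, and both function names are hypothetical. Treating download rights as equal to viewing rights for every status other than ``restricted`` is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Status groups taken from the descriptions above.
CONSORTIUM_STATUSES = {"released", "obsolete", "restricted"}
SUBMITTER_STATUSES = {"uploading", "uploaded", "in review"}


@dataclass
class User:
    """Hypothetical user model; the attribute names are assumptions."""
    registered_consortium_member: bool = False
    submission_center: Optional[str] = None
    dbgap_approved: bool = False


def can_view_metadata(status, user, object_center=None):
    """Return True if the user may view metadata with this status."""
    if status == "public":
        return True
    if status in CONSORTIUM_STATUSES:
        return user.registered_consortium_member
    if status in SUBMITTER_STATUSES:
        # Only the submitting center can see these objects.
        return object_center is not None and user.submission_center == object_center
    return False


def can_download(status, user, object_center=None):
    """Return True if the user may download data with this status."""
    if status == "restricted":
        # Controlled access: download requires dbGaP approval.
        return user.dbgap_approved
    # Assumption: for other statuses, download follows metadata visibility.
    return can_view_metadata(status, user, object_center)
```

For example, a registered consortium member without dbGaP approval can view ``restricted`` metadata in this sketch but cannot download the underlying files.
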

Please use the ``manifest`` file to download data from the SMaHT Data Portal.

It passes portal access credentials to the command provided in the ``smaht_manifest`` files downloaded from the portal.

In the ``manifest`` file, multiple values within a single field are separated by the pipe (``|``) character.
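
A minimal sketch of reading such multi-valued fields follows. It assumes the manifest is tab-separated with a single header row, which this page does not specify. The function names and the sample values are hypothetical.

```python
import csv
import io


def split_values(cell):
    """Split a manifest cell on '|' into a list of non-empty values."""
    return [value.strip() for value in cell.split("|") if value.strip()]


def read_manifest(handle, delimiter="\t"):
    """Yield one dict per row, with every cell split into a list of values.

    The tab delimiter is an assumption; adjust it to the actual file layout.
    """
    for row in csv.DictReader(handle, delimiter=delimiter):
        yield {column: split_values(cell or "") for column, cell in row.items()}


# Hypothetical two-column manifest used only to demonstrate the splitting.
example = io.StringIO(
    "File Accession\tSample Source\n"
    "EXAMPLE0001\tsource_a|source_b\n"
)
rows = list(read_manifest(example))
```

Keeping every cell as a list, even when it holds a single value, avoids special-casing multi-valued columns later on.
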

Below are the columns listed in the ``manifest`` files as of the May 2024 data release.

#. **File Download URL** - This URL calls an API that authorizes the user and redirects to a pre-signed URL from which the file is downloaded.

#. **File Accession** - This value is generated by the SMaHT Data Portal and is unique to each file, except that extra files associated with a file share its accession and differ only in file extension. For example, when a BAM file (``*.bam``) is selected for download, its index file (``*.bai``) is the associated extra file that is downloaded along with it.

#. **File Name** - This value is a file name that also serves as a unique identifier of the file. The file nomenclature schema is described `here <https://data.smaht.org/docs/additional-resources/sample-file-nomenclature>`_.

#. **Size** - File size in bytes.

#. **md5sum** - MD5 checksum of the file content.

#. **Data Type** - The type of data in the file, e.g. ``Aligned Reads``, ``Unaligned Reads`` or ``Variant Calls``.

#. **File Format** - Format of the file (e.g. bam, fastq.gz).

#. **Sample Name** - Sample identifier in SMaHT nomenclature. Please refer to the file nomenclature schema described `here <https://data.smaht.org/docs/additional-resources/sample-file-nomenclature>`_.

#. **Sample Studies** - Studies associated with this file: Benchmarking or Production.

#. **Sample Tissues** - Tissues used to generate this file, if applicable.

#. **Sample Donors** - Similarly, the donors from which the above tissues were obtained.

#. **Sample Source** - Sample name provided by a data submitter. If the file is generated from a mixture of samples (e.g. HapMap mix, COLO829-BLT), multiple sample sources will be found here, delimited by ``|``.

#. **Analytes** - Analytes used for analysis, e.g. ``RNA`` or ``DNA``.

#. **Sequencer** - Name of the sequencing platform used to generate the raw sequencing data, e.g. ``PacBio Revio``.

#. **Assay** - Experimental assay used to generate this file, e.g. ``WGS, PCR Free``.

#. **Software Name/Version** - Name and version of software used to generate this file, e.g. ``pbmm2 (1.13.0)``.

#. **Reference Genome** - Reference genome version used for the analysis, e.g. ``GRCh38 [GCA_000001405.15]``.

#. **File Group** - This field identifies a group of BAM files that can be merged: BAM files with an identical ``File Group`` value can be merged. Please see the dedicated section below for more information.
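
The **Size** and **md5sum** columns can be used to check a download. The sketch below is one way to do that with the Python standard library. The function names are hypothetical, and it assumes the manifest stores the checksum as a hexadecimal string.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_manifest(path, expected_size, expected_md5):
    """Return True if the file size and MD5 match the manifest values.

    Assumes ``expected_md5`` is a hex string, as in the md5sum column.
    """
    if os.path.getsize(path) != int(expected_size):
        return False  # Cheap check first; skips hashing truncated files.
    return md5_of_file(path) == expected_md5.strip().lower()
```

Reading in chunks keeps memory use constant, which matters for large BAM files.
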

----------------
File Merge Group
----------------

The ``File Group`` field is a special field that indicates which BAM files can be merged. To efficiently process and store large BAMs with high sequencing coverage, the alignment pipeline at DAC produces one BAM per library. To identify BAMs to merge, select the files whose file format is BAM and whose ``File Group`` values are identical.
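
The selection rule can be sketched as a simple grouping step. This assumes each manifest row is available as a dict keyed by the column names listed earlier, with one value per cell. The function name and the sample rows in the usage note are hypothetical.

```python
from collections import defaultdict


def mergeable_bam_groups(rows):
    """Map each File Group value to the BAM file names that share it.

    Only groups with at least two BAM files are returned, since a single
    file has nothing to be merged with.
    """
    groups = defaultdict(list)
    for row in rows:
        if row.get("File Format", "").lower() != "bam":
            continue  # Only BAM files are candidates for merging.
        group = row.get("File Group", "")
        if group:
            groups[group].append(row["File Name"])
    return {group: names for group, names in groups.items() if len(names) > 1}
```

Given two BAM rows with ``File Group`` set to ``g1`` and a third with ``g2``, the function returns only the ``g1`` entry with both file names.
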

Specifically, the ``File Group`` combines several pieces of information, including:

* The center that submitted the raw sequencing data
* Aggregated sample source information
* Aggregated sequencing platform information
* Aggregated experimental assay information

For example:

File Merge Group = ``bcm_gcc-WASHU_CELL-CULTURE-MIXTURE_SMAHT_CORIELL_POOL1-pacbio_revio_hifi-Single-end-17500-no-flow-cell-bulk_wgs_pcr_free``

* ``bcm_gcc`` = Submission center, indicating that ``BCM-GCC`` submitted the sequencing data.
* ``WASHU_CELL-CULTURE-MIXTURE_SMAHT_CORIELL_POOL1`` = Sample source, indicating that this file was generated from the SMAHT CORIELL POOL1 sample source, a name designated by the data submitter at BCM.
* ``pacbio_revio_hifi-Single-end-17500-no-flow-cell`` = Sequencing, indicating that this file was generated on a PacBio Revio sequencer with a target read length of 17500 and no flow cell information.
* ``bulk_wgs_pcr_free`` = Experimental assay.
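
Judging from this example, the four components appear to be joined with hyphens. The sketch below reproduces the example value on that assumption. How the portal actually builds the value is not documented here, and the class and function names are hypothetical.

```python
from typing import NamedTuple


class FileGroupParts(NamedTuple):
    """The four components described above (field names are assumptions)."""
    submission_center: str
    sample_source: str
    sequencing: str
    assay: str


def file_group_label(parts):
    """Join the components with hyphens, as the example value suggests."""
    return "-".join(parts)


example = FileGroupParts(
    submission_center="bcm_gcc",
    sample_source="WASHU_CELL-CULTURE-MIXTURE_SMAHT_CORIELL_POOL1",
    sequencing="pacbio_revio_hifi-Single-end-17500-no-flow-cell",
    assay="bulk_wgs_pcr_free",
)
label = file_group_label(example)
```

Because the components themselves contain hyphens, splitting a ``File Group`` value on ``-`` does not reliably recover them. Comparing whole values for equality is the safer way to find mergeable files.
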

*Please note that this functionality is provisional and subject to change. If you encounter issues with it, please report them to DAC.*
<p>The Mergeable Files grouping can help indicate whether certain files within file sets are candidates for merging. Data can always be searched by Submitted By; if additional facets for sample source, sequencing and assay are also available, there are file sets containing files that could potentially be merged with others. File sets whose values match across all four fields (Submitted By, Sample Source Tag, Sequencing Tag and Assay Tag) are candidates for merging.</p>