
Commit e1a1921

Merge pull request #147 from smaht-dac/bam_grouping
File Merge Group
2 parents 23afd7d + 0aa2d90 commit e1a1921

File tree: 20 files changed (+873, −113 lines)

CHANGELOG.rst

Lines changed: 10 additions & 0 deletions
@@ -7,6 +7,16 @@ smaht-portal
 Change Log
 ----------

+0.47.0
+======
+
+* Add calcprop `file_merge_group` as a tag on file sets to help determine which file sets contain files that are candidates for merging
+* Add additional fields to manifest files
+* Documentation on manifest files
+* Documentation on data release via status
+* Adjust access key expiration down to 30 days
+
+
 0.46.2
 ======

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
===================
Data Release Status
===================

Data releases for the SMaHT Data Portal are controlled by the ``status`` field on objects.
Users have access to metadata objects in the system based on their ``consortia`` and ``submission_center`` fields.

Metadata objects become viewable to users based on their ``status`` value.
A description of the ``status`` values most relevant to users is below.

* ``public`` status denotes data that is accessible to all SMaHT Data Portal users.
* ``released`` status denotes data that is accessible only to registered SMaHT Consortia users.
* ``obsolete`` status denotes previously ``released`` data that has been superseded by newer data; it is likewise viewable only by registered SMaHT Consortia members.
* ``restricted`` status denotes controlled-access data whose metadata is viewable by consortia users but which can only be downloaded by dbGaP-approved users. The set of approved users is managed internally by DAC.

Some additional statuses relevant to data submitters include:

* ``uploading`` status is specific to files and indicates a submitted file is pending md5 computation by DAC; it is viewable only by the submitting center.
* ``uploaded`` status is specific to files and indicates a submitted file has completed md5 computation by DAC; it is viewable only by the submitting center.
* ``in review`` status is for non-file metadata that is pending review prior to data release; it is viewable only by the submitting center.
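
As a sketch of how ``status`` surfaces through the API: the query below assumes the portal's standard search interface (``/search/`` with a ``status`` filter and an ``@graph`` result list) and assumes access keys act as HTTP Basic credentials, as in the portal's suggested download command::

    import requests

    # Sketch: list released files visible to a registered consortium member.
    # Assumes /search/ accepts a status filter and returns results in "@graph".
    resp = requests.get(
        "https://data.smaht.org/search/?type=File&status=released&format=json",
        auth=("<access_key_id>", "<access_key_secret>"),
    )
    resp.raise_for_status()
    for item in resp.json()["@graph"]:
        print(item["accession"], item["status"])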

docs/source/manifest.rst

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
==============================================
Understanding SMaHT Data Portal Manifest Files
==============================================

Please use the ``manifest`` file to download data from the SMaHT Data Portal.

To authorize downloads, pass your portal access credentials to the suggested command provided in the ``smaht_manifest`` files downloaded from the portal.

In the ``manifest`` file, multiple values in a field under a column are separated by the pipe (``|``) character.
11+
Below are the columns listed in the ``manifest`` files as of the May 2024 data release.
12+
13+
#. **File Download URL** - This URL calls an API that authorizes the user and redirects to a pre-signed URL to download the file.
14+
15+
#. **File Accession** - This value is generated by the SMaHT data portal, it is a unique value except for extra files associated with a actual file will have the same accession but different file extension. E.g. When a BAM file (*.bam) is selected to download, an index file (*.bai) is the associated extra file that is also downloaded along with it.
16+
17+
#. **File Name** - This value is a file name that also serves as a unique identifier of the file. The file nomenclature schema is described `here <https://data.smaht.org/docs/additional-resources/sample-file-nomenclature>`_.
18+
19+
#. **Size** - File size in bytes.
20+
21+
#. **md5sum** - md5 of the file content.
22+
23+
#. **Data Type** - This value tells you the file type, e.g. ``Aligned Reads``, ``Unaligned Reads`` or ``Variant Calls``.
24+
25+
#. **File Format** - Format of the file (e.g. bam, fastq.gz).
26+
27+
#. **Sample Name** - Sample identifier in SMaHT nomenclature. Please refer to the file nomenclature schema is described `here <https://data.smaht.org/docs/additional-resources/sample-file-nomenclature>`_.
28+
29+
#. **Sample Studies** - Studies associated with this file; Benchmarking or Production.
30+
31+
#. **Sample Tissues** - Tissues used to generate this file, if applicable.
32+
33+
#. **Sample Donors** - Similarly, the donors from which the above tissues were generated.
34+
35+
#. **Sample Source** - Sample name provided by a data submitter. If the file is generated from a mixture of samples (e.g. HapMap mix, COLO829-BLT), multiple sample sources will be found here, delimited by ``|``.
36+
37+
#. **Analytes** - Analytes used for analysis, e.g. one of ``RNA``, ``DNA``.
38+
39+
#. **Sequencer** - Name of the sequencing platform used to generate the raw sequencing data e.g. ``PacBio Revio``.
40+
41+
#. **Assay** - Experimental assay used to generate this file, e.g. ``WGS, PCR Free``.
42+
43+
#. **Software Name/Version** - Name and version of software used to generate this file, e.g. ``pbmm2 (1.13.0)``.
44+
45+
#. **Reference Genome** - Reference Genome version used for the analysis, e.g. ``GRCh38 [GCA_000001405.15]``.
46+
47+
#. **File Group** - This field indicates a group of BAM files that can be merged. BAM files with the identical file group value can be merged. Please see the dedicated section below for more information.
48+

----------------
File Merge Group
----------------

The ``File Group`` field is a special field that indicates which BAM files can be merged. To efficiently process and store large BAMs with high sequencing coverage, the alignment pipeline at DAC produces BAMs per library. To identify BAMs to merge, obtain the files whose file format is BAM and whose ``File Group`` values are identical; a sketch of this bucketing follows the example below.

Specifically, the ``File Group`` combines several pieces of information, including:

* The center that submitted the raw sequencing data
* Aggregated sample source information
* Aggregated sequencing platform information
* Aggregated experimental assay information

For example:

File Merge Group = ``bcm_gcc-WASHU_CELL-CULTURE-MIXTURE_SMAHT_CORIELL_POOL1-pacbio_revio_hifi-Single-end-17500-no-flow-cell-bulk_wgs_pcr_free``

* ``bcm_gcc`` = Submission center, which indicates that ``BCM-GCC`` submitted the sequencing data.
* ``WASHU_CELL-CULTURE-MIXTURE_SMAHT_CORIELL_POOL1`` = Sample source, which indicates this file was generated from the SMAHT CORIELL POOL1 sample source, a name designated by the data submitter at BCM.
* ``pacbio_revio_hifi-Single-end-17500-no-flow-cell`` = Sequencing, which indicates that this file was generated on a PacBio Revio sequencer with target read length 17500 and no flow cell information.
* ``bulk_wgs_pcr_free`` = Experimental assay.
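
A minimal sketch of the bucketing described above, using the TSV column names as emitted by the portal code (``FileName``, ``FileFormat``, ``FileGroup``); the manifest file name is illustrative::

    import csv
    from collections import defaultdict

    groups = defaultdict(list)  # File Group value -> list of BAM file names
    with open("smaht_manifest.tsv") as fh:  # illustrative file name
        rows = list(csv.reader(fh, delimiter="\t"))
        header = rows[2]  # the third header row carries the column names
        for row in rows[3:]:
            record = dict(zip(header, row))
            if record.get("FileFormat", "").lower() == "bam" and record.get("FileGroup"):
                groups[record["FileGroup"]].append(record["FileName"])

    for group, bams in groups.items():
        print(f"{group}: {len(bams)} BAM(s) are merge candidates")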
71+
72+
*Please note this functionality is provisional and subject to change. If you encounter issues with this functionality, please report it to DAC!*

poetry.lock

Lines changed: 252 additions & 67 deletions
Some generated files are not rendered by default.

pyproject.toml

Lines changed: 1 addition & 2 deletions
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "encoded"
-version = "0.46.2"
+version = "0.47.0"
 description = "SMaHT Data Analysis Portal"
 authors = ["4DN-DCIC Team <[email protected]>"]
 license = "MIT"
@@ -37,7 +37,6 @@ classifiers = [

 [tool.poetry.dependencies]
 python = ">=3.9.1,<3.12"
-awscli = ">=1.32.40"
 boto3 = "^1.34.40"
 botocore = "^1.34.40"
 certifi = ">=2021.5.30"
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
<p>The Mergeable Files grouping can help guide whether certain files within file sets are candidates for merging. The Submitted By data can always be searched on, but if additional facets for sample source, sequencing, and assay are available, this indicates there are file sets that contain files that could potentially be merged with others. File sets that match values across all 4 fields (Submitted By, Sample Source Tag, Sequencing Tag, and Assay Tag) are candidates for merging.</p>

src/encoded/metadata.py

Lines changed: 131 additions & 25 deletions
@@ -30,6 +30,10 @@ def includeme(config):
 FILE = 0


+# This field is special because it is a transformation applied from other fields
+FILE_GROUP = 'FileGroup'
+
+
 class MetadataArgs(NamedTuple):
     """ NamedTuple that holds all the args passed to the /metadata and /peek-metadata endpoints """
     accessions: List[str]
@@ -43,22 +47,35 @@ class MetadataArgs(NamedTuple):

 class TSVDescriptor:
     """ Dataclass that holds the structure """
-    def __init__(self, *, field_type, field_name, deduplicate=True):
+    def __init__(self, *, field_type: int, field_name: List[str],
+                 deduplicate: bool = True, use_base_metadata: bool = False):
+        """ field_type is an int enum; field_name is a list of possible
+            paths that, when searched, can retrieve the field value; deduplicate is unused;
+            use_base_metadata means to rely on the top-level object instead of the sub-object
+            (only used for extra files)
+        """
         self._field_type = field_type
         self._field_name = field_name
         self._deduplicate = deduplicate
+        self._use_base_metadata = use_base_metadata

-    def field_type(self):
+    def field_type(self) -> int:
+        """ Note this is an int enum """
         return self._field_type

-    def field_name(self):
+    def field_name(self) -> List[str]:
+        """ Field name in this case is a list of possible paths to search """
         return self._field_name

-    def deduplicate(self):
+    def deduplicate(self) -> bool:
         return self._deduplicate

+    def use_base_metadata(self) -> bool:
+        return self._use_base_metadata
+

 class DummyFileInterfaceImplementation(object):
+    """ This is used to simulate a file interface for streaming the TSV output """
     def __init__(self):
         self._line = None
     def write(self, line):
@@ -69,45 +86,106 @@ def read(self):

 # This dictionary is a key --> 3-tuple mapping that encodes options for the /metadata/ endpoint
 # given a field description. This also describes the order that fields show up in the TSV.
+# VERY IMPORTANT NOTE WHEN ADDING FIELDS - right now support for arrays generally is limited.
+# The limitations are: arrays of terminal values are fine, but arrays of dictionaries will only
+# traverse one additional level of depth, ie:
+#     item contains dictionary d1, where d1 has a property that is an array of objects
+#     --> d1.arr --> d1.array.dict --> d1.array.dict.value
 # TODO: move to another file or write in JSON
 TSV_MAPPING = {
     FILE: {
-        'File Download URL': TSVDescriptor(field_type=FILE,
-                                           field_name=['href']),
-        'File Accession': TSVDescriptor(field_type=FILE,
-                                        field_name=['accession']),
-        'File Name': TSVDescriptor(field_type=FILE,
-                                   field_name=['annotated_filename', 'display_title', 'filename']),
-        'Size (MB)': TSVDescriptor(field_type=FILE,
-                                   field_name=['file_size']),
+        'FileDownloadURL': TSVDescriptor(field_type=FILE,
+                                         field_name=['href']),
+        'FileAccession': TSVDescriptor(field_type=FILE,
+                                       field_name=['accession']),
+        'FileName': TSVDescriptor(field_type=FILE,
+                                  field_name=['annotated_filename', 'display_title', 'filename']),
+        'Size(B)': TSVDescriptor(field_type=FILE,
+                                 field_name=['file_size']),
         'md5sum': TSVDescriptor(field_type=FILE,
                                 field_name=['md5sum']),
-        'File Type': TSVDescriptor(field_type=FILE,
-                                   field_name=['file_type']),
-        'File Format': TSVDescriptor(field_type=FILE,
-                                     field_name=['file_format.display_title']),
+        'DataType': TSVDescriptor(field_type=FILE,
+                                  field_name=['data_type'],
+                                  use_base_metadata=True),  # do not traverse extra_files for this
+        'FileFormat': TSVDescriptor(field_type=FILE,
+                                    field_name=['file_format.display_title']),
+        'SampleName': TSVDescriptor(field_type=FILE,
+                                    field_name=['sample_summary.sample_names'],
+                                    use_base_metadata=True),  # do not traverse extra_files for this
+        'SampleStudies': TSVDescriptor(field_type=FILE,
+                                       field_name=['sample_summary.studies'],
+                                       use_base_metadata=True),  # do not traverse extra_files for this
+        'SampleTissues': TSVDescriptor(field_type=FILE,
+                                       field_name=['sample_summary.tissues'],
+                                       use_base_metadata=True),  # do not traverse extra_files for this
+        'SampleDonors': TSVDescriptor(field_type=FILE,
+                                      field_name=['sample_summary.donor_ids'],
+                                      use_base_metadata=True),  # do not traverse extra_files for this
+        'SampleSource': TSVDescriptor(field_type=FILE,
+                                      field_name=['sample_summary.sample_descriptions'],
+                                      use_base_metadata=True),  # do not traverse extra_files for this
+        'Analytes': TSVDescriptor(field_type=FILE,
+                                  field_name=['sample_summary.analytes'],
+                                  use_base_metadata=True),
+        'Sequencer': TSVDescriptor(field_type=FILE,
+                                   field_name=['sequencing.sequencer.display_title'],
+                                   use_base_metadata=True),
+        'Assay': TSVDescriptor(field_type=FILE,
+                               field_name=['assays.display_title'],
+                               use_base_metadata=True),
+        'SoftwareName/Version': TSVDescriptor(field_type=FILE,
+                                              field_name=['analysis_summary.software'],
+                                              use_base_metadata=True),
+        'ReferenceGenome': TSVDescriptor(field_type=FILE,
+                                         field_name=['analysis_summary.reference_genome'],
+                                         use_base_metadata=True),
+        FILE_GROUP: TSVDescriptor(field_type=FILE,
+                                  field_name=['file_sets.file_group'],
+                                  use_base_metadata=False)  # omit this field on extra files
     }
 }


 def generate_file_download_header(download_file_name: str):
-    """ Helper function that generates a suitable header for the File download """
-    header1 = ['###', 'Metadata TSV Download', '', '', '', '', '']
+    """ Helper function that generates a suitable header for the File download, generating 18 columns """
+    header1 = ['###', 'Metadata TSV Download', 'Column Count', '18'] + ([''] * 14)  # length 18
     header2 = ['Suggested command to download: ', '', '',
                "cut -f 1,3 ./{} | tail -n +4 | grep -v ^# | xargs -n 2 -L 1 sh -c 'curl -L "
-               "--user <access_key_id>:<access_key_secret> $0 --output $1'".format(download_file_name), '', '', '']
+               "--user <access_key_id>:<access_key_secret> $0 --output $1'".format(download_file_name)] + ([''] * 14)
     header3 = list(TSV_MAPPING[FILE].keys())
     return header1, header2, header3

+
+def extract_array(array: list, i: int, fields: list) -> str:
+    """ Extracts field_name values from an array of dicts, or the values themselves if a terminal field """
+    if isinstance(array[0], dict):
+        if isinstance(array[0][fields[i]], dict):  # go one level deeper
+            field1, field2 = fields[i], fields[i+1]
+            return '|'.join(sorted([ele[field1][field2] for ele in array]))
+        else:
+            return '|'.join(sorted(ele[fields[i]] for ele in array))
+    else:
+        return '|'.join(sorted(array))
+
+
 def descend_field(request, prop, field_names):
     """ Helper to grab field values if we reach a terminal field ie: not dict or list """
     for possible_field in field_names:
         current_prop = prop  # store a reference to the original object
         fields = possible_field.split('.')
-        for field in fields:
+        for i, field in enumerate(fields):
             current_prop = current_prop.get(field)
-            if current_prop is None or isinstance(current_prop, dict) or isinstance(current_prop, list):
+            if isinstance(current_prop, list) and possible_field != 'file_sets.file_group':
+                return extract_array(current_prop, i+1, fields)
+            elif current_prop and possible_field == 'file_sets.file_group':
+                return current_prop[0].get('file_group')
+            elif not current_prop:
+                break
+            # this hard code is necessary because in this select case we are processing an object field,
+            # and we want all other object fields to be ignored - Will 1 May 2024
+            if isinstance(current_prop, dict) and possible_field == 'file_sets.file_group':
+                return current_prop
+            elif current_prop is None or isinstance(current_prop, dict):
                 continue
             elif possible_field == 'href':
                 return f'{request.scheme}://{request.host}{current_prop}'
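
To make the array-handling limitation noted above concrete, two illustrative calls (invented values; the call shapes mirror how descend_field invokes extract_array):

    # Array of terminal values (the index argument is unused in this branch):
    extract_array(['Benchmarking'], 2, ['sample_summary', 'studies'])
    # -> 'Benchmarking'

    # Array of dicts, traversing one level into each element:
    extract_array([{'display_title': 'WGS'}, {'display_title': 'Fiber-seq'}],
                  1, ['assays', 'display_title'])
    # -> 'Fiber-seq|WGS'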
@@ -116,6 +194,17 @@ def descend_field(request, prop, field_names):
     return None


+def handle_file_group(field: dict) -> str:
+    """ Transforms the file_group into a single string """
+    if field:
+        sc_part = field['submission_center']
+        sample_source_part = field['sample_source']
+        sequencing_part = field['sequencing']
+        assay_part = field['assay']
+        return f'{sc_part}-{sample_source_part}-{sequencing_part}-{assay_part}'
+    return ''
+
+
 def generate_tsv(header: Tuple, data_lines: list):
     """ Helper function that actually generates the TSV """
     line = DummyFileInterfaceImplementation()
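
For illustration, the worked example from docs/source/manifest.rst round-trips through handle_file_group as follows (the dict keys come from the function above; the values come from the docs):

    # Keys per handle_file_group; values from the example in docs/source/manifest.rst.
    file_group = {
        'submission_center': 'bcm_gcc',
        'sample_source': 'WASHU_CELL-CULTURE-MIXTURE_SMAHT_CORIELL_POOL1',
        'sequencing': 'pacbio_revio_hifi-Single-end-17500-no-flow-cell',
        'assay': 'bulk_wgs_pcr_free',
    }
    assert handle_file_group(file_group) == (
        'bcm_gcc-WASHU_CELL-CULTURE-MIXTURE_SMAHT_CORIELL_POOL1-'
        'pacbio_revio_hifi-Single-end-17500-no-flow-cell-bulk_wgs_pcr_free'
    )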
@@ -228,16 +317,33 @@ def metadata_tsv(context, request):
     data_lines = []
     for file in search_iter:
         line = []
-        for _, tsv_descriptor in args.tsv_mapping.items():
-            field = descend_field(request, file, tsv_descriptor.field_name()) or ''
+        for field_name, tsv_descriptor in args.tsv_mapping.items():
+            traversal_path = tsv_descriptor.field_name()
+            if field_name == FILE_GROUP:
+                field = descend_field(request, file, traversal_path) or ''
+                if field:  # requires special care
+                    field = handle_file_group(field)
+            else:
+                field = descend_field(request, file, traversal_path) or ''
             line.append(field)
         data_lines += [line]
+
+        # Repeat the above process for extra files
+        # This requires extra care - most fields we take from extra_files directly,
+        # but some must be taken from the parent metadata, such as anything related
+        # to library/assay/sample or the file merge group
         if args.include_extra_files and 'extra_files' in file:
             efs = file.get('extra_files')
             for ef in efs:
                 ef_line = []
-                for _, tsv_descriptor in args.tsv_mapping.items():
-                    field = descend_field(request, ef, tsv_descriptor.field_name()) or ''
+                for field_name, tsv_descriptor in args.tsv_mapping.items():
+                    traversal_path = tsv_descriptor.field_name()
+                    if tsv_descriptor.use_base_metadata():
+                        field = descend_field(request, file, traversal_path) or ''
+                        if field_name == FILE_GROUP:  # requires special care
+                            field = handle_file_group(field)
+                    else:
+                        field = descend_field(request, ef, traversal_path) or ''
                     ef_line.append(field)
                 data_lines += [ef_line]

src/encoded/project/access_key.py

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
from snovault.project.access_key import SnovaultProjectAccessKey

class SMAHTProjectAccessKey(SnovaultProjectAccessKey):
    def access_key_has_expiration_date(self):
        return True
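
Per the changelog, this hook turns on key expiration so that access keys now expire after 30 days. A hypothetical sketch of the effect (the constant and helper below are illustrative assumptions, not the snovault implementation):

    from datetime import datetime, timedelta, timezone

    # Hypothetical illustration only: when access_key_has_expiration_date()
    # returns True, snovault attaches an expiration to newly created keys;
    # the 30-day window matches the 0.47.0 changelog entry.
    ACCESS_KEY_LIFETIME_DAYS = 30  # assumed name, not from snovault

    def expiration_for(created: datetime) -> datetime:
        return created + timedelta(days=ACCESS_KEY_LIFETIME_DAYS)

    print(expiration_for(datetime.now(timezone.utc)))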
