
Commit e1a1921

Merge pull request #147 from smaht-dac/bam_grouping
File Merge Group
2 parents 23afd7d + 0aa2d90 commit e1a1921

File tree: 20 files changed (+873, −113 lines)

CHANGELOG.rst

Lines changed: 10 additions & 0 deletions
@@ -7,6 +7,16 @@ smaht-portal
 Change Log
 ----------

+0.47.0
+======
+
+* Add calcprop `file_merge_group` as a tag on file sets to help determine which file sets contain files that are candidates for merging
+* Add additional fields to manifest files
+* Documentation on manifest files
+* Documentation on data release via status
+* Adjust access key expiration down to 30 days
+
+
 0.46.2
 ======

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
===================
Data Release Status
===================

Data releases for the SMaHT Data Portal are controlled by the ``status`` field on objects.
Users have access to metadata objects in the system based on their ``consortia`` and ``submission_center`` fields.

Metadata objects become viewable to users based on their ``status`` value.
A description of the ``status`` values most relevant to users is below.

* ``public`` status denotes data that is accessible to all SMaHT Data Portal users.
* ``released`` status denotes data that is accessible only to registered SMaHT Consortia users.
* ``obsolete`` status denotes previously ``released`` data that has been superseded by newer data; it is likewise viewable only by registered SMaHT Consortia members.
* ``restricted`` status denotes controlled-access data whose metadata is viewable by consortia users but which can only be downloaded by dbGaP-approved users. The set of approved users is managed internally by DAC.

Some additional statuses relevant to data submitters include:

* ``uploading`` status is specific to files and indicates a submitted file is pending md5 computation by DAC; it is viewable only by the submitting center.
* ``uploaded`` status is specific to files and indicates a submitted file has completed md5 computation by DAC; it is viewable only by the submitting center.
* ``in review`` status is for non-file metadata that is pending review prior to data release; it is viewable only by the submitting center.
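
As a sketch of how ``status`` surfaces through the API: the query below assumes the portal's standard search interface (``/search/`` with a ``status`` filter and an ``@graph`` result list) and assumes access keys act as HTTP Basic credentials, as in the portal's suggested download command::

    import requests

    # Sketch: list released files visible to a registered consortium member.
    # Assumes /search/ accepts a status filter and returns results in "@graph".
    resp = requests.get(
        "https://data.smaht.org/search/?type=File&status=released&format=json",
        auth=("<access_key_id>", "<access_key_secret>"),
    )
    resp.raise_for_status()
    for item in resp.json()["@graph"]:
        print(item["accession"], item["status"])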

docs/source/manifest.rst

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
==============================================
Understanding SMaHT Data Portal Manifest Files
==============================================

Please use the ``manifest`` file to download data from the SMaHT Data Portal.

To authorize downloads, pass your portal access credentials to the suggested command provided in the ``smaht_manifest`` files downloaded from the portal.

In the ``manifest`` file, multiple values in a field under a column are separated by the pipe (``|``) character.
11+
Below are the columns listed in the ``manifest`` files as of the May 2024 data release.
12+
13+
#. **File Download URL** - This URL calls an API that authorizes the user and redirects to a pre-signed URL to download the file.
14+
15+
#. **File Accession** - This value is generated by the SMaHT data portal, it is a unique value except for extra files associated with a actual file will have the same accession but different file extension. E.g. When a BAM file (*.bam) is selected to download, an index file (*.bai) is the associated extra file that is also downloaded along with it.
16+
17+
#. **File Name** - This value is a file name that also serves as a unique identifier of the file. The file nomenclature schema is described `here <https://data.smaht.org/docs/additional-resources/sample-file-nomenclature>`_.
18+
19+
#. **Size** - File size in bytes.
20+
21+
#. **md5sum** - md5 of the file content.
22+
23+
#. **Data Type** - This value tells you the file type, e.g. ``Aligned Reads``, ``Unaligned Reads`` or ``Variant Calls``.
24+
25+
#. **File Format** - Format of the file (e.g. bam, fastq.gz).
26+
27+
#. **Sample Name** - Sample identifier in SMaHT nomenclature. Please refer to the file nomenclature schema is described `here <https://data.smaht.org/docs/additional-resources/sample-file-nomenclature>`_.
28+
29+
#. **Sample Studies** - Studies associated with this file; Benchmarking or Production.
30+
31+
#. **Sample Tissues** - Tissues used to generate this file, if applicable.
32+
33+
#. **Sample Donors** - Similarly, the donors from which the above tissues were generated.
34+
35+
#. **Sample Source** - Sample name provided by a data submitter. If the file is generated from a mixture of samples (e.g. HapMap mix, COLO829-BLT), multiple sample sources will be found here, delimited by ``|``.
36+
37+
#. **Analytes** - Analytes used for analysis, e.g. one of ``RNA``, ``DNA``.
38+
39+
#. **Sequencer** - Name of the sequencing platform used to generate the raw sequencing data e.g. ``PacBio Revio``.
40+
41+
#. **Assay** - Experimental assay used to generate this file, e.g. ``WGS, PCR Free``.
42+
43+
#. **Software Name/Version** - Name and version of software used to generate this file, e.g. ``pbmm2 (1.13.0)``.
44+
45+
#. **Reference Genome** - Reference Genome version used for the analysis, e.g. ``GRCh38 [GCA_000001405.15]``.
46+
47+
#. **File Group** - This field indicates a group of BAM files that can be merged. BAM files with the identical file group value can be merged. Please see the dedicated section below for more information.
48+

----------------
File Merge Group
----------------

The ``File Group`` field is a special field that indicates which BAM files can be merged. To efficiently process and store large BAMs with high sequencing coverage, the alignment pipeline at DAC produces BAMs per library. To identify BAMs to merge, obtain the files whose file format is BAM and whose ``File Group`` values are identical; a sketch of this bucketing follows the example below.

Specifically, the ``File Group`` combines several pieces of information, including:

* The center that submitted the raw sequencing data
* Aggregated sample source information
* Aggregated sequencing platform information
* Aggregated experimental assay information

For example:

File Merge Group = ``bcm_gcc-WASHU_CELL-CULTURE-MIXTURE_SMAHT_CORIELL_POOL1-pacbio_revio_hifi-Single-end-17500-no-flow-cell-bulk_wgs_pcr_free``

* ``bcm_gcc`` = Submission center, which indicates that ``BCM-GCC`` submitted the sequencing data.
* ``WASHU_CELL-CULTURE-MIXTURE_SMAHT_CORIELL_POOL1`` = Sample source, which indicates this file was generated from the SMAHT CORIELL POOL1 sample source, a name designated by the data submitter at BCM.
* ``pacbio_revio_hifi-Single-end-17500-no-flow-cell`` = Sequencing, which indicates that this file was generated on a PacBio Revio sequencer with target read length 17500 and no flow cell information.
* ``bulk_wgs_pcr_free`` = Experimental assay.
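
A minimal sketch of the bucketing described above, using the TSV column names as emitted by the portal code (``FileName``, ``FileFormat``, ``FileGroup``); the manifest file name is illustrative::

    import csv
    from collections import defaultdict

    groups = defaultdict(list)  # File Group value -> list of BAM file names
    with open("smaht_manifest.tsv") as fh:  # illustrative file name
        rows = list(csv.reader(fh, delimiter="\t"))
        header = rows[2]  # the third header row carries the column names
        for row in rows[3:]:
            record = dict(zip(header, row))
            if record.get("FileFormat", "").lower() == "bam" and record.get("FileGroup"):
                groups[record["FileGroup"]].append(record["FileName"])

    for group, bams in groups.items():
        print(f"{group}: {len(bams)} BAM(s) are merge candidates")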
71+
72+
*Please note this functionality is provisional and subject to change. If you encounter issues with this functionality, please report it to DAC!*

poetry.lock

Lines changed: 252 additions & 67 deletions
Some generated files are not rendered by default.

pyproject.toml

Lines changed: 1 addition & 2 deletions
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "encoded"
-version = "0.46.2"
+version = "0.47.0"
 description = "SMaHT Data Analysis Portal"
 authors = ["4DN-DCIC Team <[email protected]>"]
 license = "MIT"
@@ -37,7 +37,6 @@ classifiers = [

 [tool.poetry.dependencies]
 python = ">=3.9.1,<3.12"
-awscli = ">=1.32.40"
 boto3 = "^1.34.40"
 botocore = "^1.34.40"
 certifi = ">=2021.5.30"
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
<p>The Mergeable Files grouping can help guide whether certain files within file sets are candidates for merging. The Submitted By data can always be searched on, but if additional facets for sample source, sequencing, and assay are available, this indicates there are file sets that contain files that could potentially be merged with others. File sets that match values across all 4 fields (Submitted By, Sample Source Tag, Sequencing Tag, and Assay Tag) are candidates for merging.</p>

src/encoded/metadata.py

Lines changed: 131 additions & 25 deletions
@@ -30,6 +30,10 @@ def includeme(config):
 FILE = 0


+# This field is special because it is a transformation applied from other fields
+FILE_GROUP = 'FileGroup'
+
+
 class MetadataArgs(NamedTuple):
     """ NamedTuple that holds all the args passed to the /metadata and /peek-metadata endpoints """
     accessions: List[str]
@@ -43,22 +47,35 @@ class MetadataArgs(NamedTuple):

 class TSVDescriptor:
     """ Dataclass that holds the structure """
-    def __init__(self, *, field_type, field_name, deduplicate=True):
+    def __init__(self, *, field_type: int, field_name: List[str],
+                 deduplicate: bool = True, use_base_metadata: bool = False):
+        """ field_type is an int enum; field_name is a list of possible
+            paths that, when searched, can retrieve the field value; deduplicate is unused;
+            use_base_metadata means to rely on the top-level object instead of the sub-object
+            (only used for extra files)
+        """
         self._field_type = field_type
         self._field_name = field_name
         self._deduplicate = deduplicate
+        self._use_base_metadata = use_base_metadata

-    def field_type(self):
+    def field_type(self) -> int:
+        """ Note this is an int enum """
         return self._field_type

-    def field_name(self):
+    def field_name(self) -> List[str]:
+        """ Field name in this case is a list of possible paths to search """
         return self._field_name

-    def deduplicate(self):
+    def deduplicate(self) -> bool:
         return self._deduplicate

+    def use_base_metadata(self) -> bool:
+        return self._use_base_metadata
+

 class DummyFileInterfaceImplementation(object):
+    """ This is used to simulate a file interface for streaming the TSV output """
     def __init__(self):
         self._line = None
     def write(self, line):
@@ -69,45 +86,106 @@ def read(self):

 # This dictionary is a key --> 3-tuple mapping that encodes options for the /metadata/ endpoint
 # given a field description. This also describes the order that fields show up in the TSV.
+# VERY IMPORTANT NOTE WHEN ADDING FIELDS - right now support for arrays generally is limited.
+# The limitations are: arrays of terminal values are fine, but arrays of dictionaries will only
+# traverse one additional level of depth, ie:
+#     item contains dictionary d1, where d1 has a property that is an array of objects
+#     --> d1.arr --> d1.array.dict --> d1.array.dict.value
 # TODO: move to another file or write in JSON
 TSV_MAPPING = {
     FILE: {
-        'File Download URL': TSVDescriptor(field_type=FILE,
-                                           field_name=['href']),
-        'File Accession': TSVDescriptor(field_type=FILE,
-                                        field_name=['accession']),
-        'File Name': TSVDescriptor(field_type=FILE,
-                                   field_name=['annotated_filename', 'display_title', 'filename']),
-        'Size (MB)': TSVDescriptor(field_type=FILE,
-                                   field_name=['file_size']),
+        'FileDownloadURL': TSVDescriptor(field_type=FILE,
+                                         field_name=['href']),
+        'FileAccession': TSVDescriptor(field_type=FILE,
+                                       field_name=['accession']),
+        'FileName': TSVDescriptor(field_type=FILE,
+                                  field_name=['annotated_filename', 'display_title', 'filename']),
+        'Size(B)': TSVDescriptor(field_type=FILE,
+                                 field_name=['file_size']),
         'md5sum': TSVDescriptor(field_type=FILE,
                                 field_name=['md5sum']),
-        'File Type': TSVDescriptor(field_type=FILE,
-                                   field_name=['file_type']),
-        'File Format': TSVDescriptor(field_type=FILE,
-                                     field_name=['file_format.display_title']),
+        'DataType': TSVDescriptor(field_type=FILE,
+                                  field_name=['data_type'],
+                                  use_base_metadata=True),  # do not traverse extra_files for this
+        'FileFormat': TSVDescriptor(field_type=FILE,
+                                    field_name=['file_format.display_title']),
+        'SampleName': TSVDescriptor(field_type=FILE,
+                                    field_name=['sample_summary.sample_names'],
+                                    use_base_metadata=True),  # do not traverse extra_files for this
+        'SampleStudies': TSVDescriptor(field_type=FILE,
+                                       field_name=['sample_summary.studies'],
+                                       use_base_metadata=True),  # do not traverse extra_files for this
+        'SampleTissues': TSVDescriptor(field_type=FILE,
+                                       field_name=['sample_summary.tissues'],
+                                       use_base_metadata=True),  # do not traverse extra_files for this
+        'SampleDonors': TSVDescriptor(field_type=FILE,
+                                      field_name=['sample_summary.donor_ids'],
+                                      use_base_metadata=True),  # do not traverse extra_files for this
+        'SampleSource': TSVDescriptor(field_type=FILE,
+                                      field_name=['sample_summary.sample_descriptions'],
+                                      use_base_metadata=True),  # do not traverse extra_files for this
+        'Analytes': TSVDescriptor(field_type=FILE,
+                                  field_name=['sample_summary.analytes'],
+                                  use_base_metadata=True),
+        'Sequencer': TSVDescriptor(field_type=FILE,
+                                   field_name=['sequencing.sequencer.display_title'],
+                                   use_base_metadata=True),
+        'Assay': TSVDescriptor(field_type=FILE,
+                               field_name=['assays.display_title'],
+                               use_base_metadata=True),
+        'SoftwareName/Version': TSVDescriptor(field_type=FILE,
+                                              field_name=['analysis_summary.software'],
+                                              use_base_metadata=True),
+        'ReferenceGenome': TSVDescriptor(field_type=FILE,
+                                         field_name=['analysis_summary.reference_genome'],
+                                         use_base_metadata=True),
+        FILE_GROUP: TSVDescriptor(field_type=FILE,
+                                  field_name=['file_sets.file_group'],
+                                  use_base_metadata=False)  # omit this field on extra files
     }
 }


 def generate_file_download_header(download_file_name: str):
-    """ Helper function that generates a suitable header for the File download """
-    header1 = ['###', 'Metadata TSV Download', '', '', '', '', '']
+    """ Helper function that generates a suitable header for the File download, generating 18 columns """
+    header1 = ['###', 'Metadata TSV Download', 'Column Count', '18'] + ([''] * 14)  # length 18
     header2 = ['Suggested command to download: ', '', '',
                "cut -f 1,3 ./{} | tail -n +4 | grep -v ^# | xargs -n 2 -L 1 sh -c 'curl -L "
-               "--user <access_key_id>:<access_key_secret> $0 --output $1'".format(download_file_name), '', '', '']
+               "--user <access_key_id>:<access_key_secret> $0 --output $1'".format(download_file_name)] + ([''] * 14)
     header3 = list(TSV_MAPPING[FILE].keys())
     return header1, header2, header3

+
+def extract_array(array: list, i: int, fields: list) -> str:
+    """ Extracts field_name values from an array of dicts, or the values themselves if a terminal field """
+    if isinstance(array[0], dict):
+        if isinstance(array[0][fields[i]], dict):  # go one level deeper
+            field1, field2 = fields[i], fields[i+1]
+            return '|'.join(sorted([ele[field1][field2] for ele in array]))
+        else:
+            return '|'.join(sorted(ele[fields[i]] for ele in array))
+    else:
+        return '|'.join(sorted(array))
+
+
 def descend_field(request, prop, field_names):
     """ Helper to grab field values if we reach a terminal field ie: not dict or list """
     for possible_field in field_names:
         current_prop = prop  # store a reference to the original object
         fields = possible_field.split('.')
-        for field in fields:
+        for i, field in enumerate(fields):
             current_prop = current_prop.get(field)
-            if current_prop is None or isinstance(current_prop, dict) or isinstance(current_prop, list):
+            if isinstance(current_prop, list) and possible_field != 'file_sets.file_group':
+                return extract_array(current_prop, i+1, fields)
+            elif current_prop and possible_field == 'file_sets.file_group':
+                return current_prop[0].get('file_group')
+            elif not current_prop:
+                break
+            # this hard code is necessary because in this select case we are processing an object field,
+            # and we want all other object fields to be ignored - Will 1 May 2024
+            if isinstance(current_prop, dict) and possible_field == 'file_sets.file_group':
+                return current_prop
+            elif current_prop is None or isinstance(current_prop, dict):
                 continue
             elif possible_field == 'href':
                 return f'{request.scheme}://{request.host}{current_prop}'
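
To make the array-handling limitation noted above concrete, two illustrative calls (invented values; the call shapes mirror how descend_field invokes extract_array):

    # Array of terminal values (the index argument is unused in this branch):
    extract_array(['Benchmarking'], 2, ['sample_summary', 'studies'])
    # -> 'Benchmarking'

    # Array of dicts, traversing one level into each element:
    extract_array([{'display_title': 'WGS'}, {'display_title': 'Fiber-seq'}],
                  1, ['assays', 'display_title'])
    # -> 'Fiber-seq|WGS'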
@@ -116,6 +194,17 @@ def descend_field(request, prop, field_names):
     return None


+def handle_file_group(field: dict) -> str:
+    """ Transforms the file_group into a single string """
+    if field:
+        sc_part = field['submission_center']
+        sample_source_part = field['sample_source']
+        sequencing_part = field['sequencing']
+        assay_part = field['assay']
+        return f'{sc_part}-{sample_source_part}-{sequencing_part}-{assay_part}'
+    return ''
+
+
 def generate_tsv(header: Tuple, data_lines: list):
     """ Helper function that actually generates the TSV """
     line = DummyFileInterfaceImplementation()
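
For illustration, the worked example from docs/source/manifest.rst round-trips through handle_file_group as follows (the dict keys come from the function above; the values come from the docs):

    # Keys per handle_file_group; values from the example in docs/source/manifest.rst.
    file_group = {
        'submission_center': 'bcm_gcc',
        'sample_source': 'WASHU_CELL-CULTURE-MIXTURE_SMAHT_CORIELL_POOL1',
        'sequencing': 'pacbio_revio_hifi-Single-end-17500-no-flow-cell',
        'assay': 'bulk_wgs_pcr_free',
    }
    assert handle_file_group(file_group) == (
        'bcm_gcc-WASHU_CELL-CULTURE-MIXTURE_SMAHT_CORIELL_POOL1-'
        'pacbio_revio_hifi-Single-end-17500-no-flow-cell-bulk_wgs_pcr_free'
    )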
@@ -228,16 +317,33 @@ def metadata_tsv(context, request):
     data_lines = []
     for file in search_iter:
         line = []
-        for _, tsv_descriptor in args.tsv_mapping.items():
-            field = descend_field(request, file, tsv_descriptor.field_name()) or ''
+        for field_name, tsv_descriptor in args.tsv_mapping.items():
+            traversal_path = tsv_descriptor.field_name()
+            if field_name == FILE_GROUP:
+                field = descend_field(request, file, traversal_path) or ''
+                if field:  # requires special care
+                    field = handle_file_group(field)
+            else:
+                field = descend_field(request, file, traversal_path) or ''
             line.append(field)
         data_lines += [line]
+
+        # Repeat the above process for extra files
+        # This requires extra care - most fields we take from extra_files directly,
+        # but some must be taken from the parent metadata, such as anything related
+        # to library/assay/sample or the file merge group
         if args.include_extra_files and 'extra_files' in file:
             efs = file.get('extra_files')
             for ef in efs:
                 ef_line = []
-                for _, tsv_descriptor in args.tsv_mapping.items():
-                    field = descend_field(request, ef, tsv_descriptor.field_name()) or ''
+                for field_name, tsv_descriptor in args.tsv_mapping.items():
+                    traversal_path = tsv_descriptor.field_name()
+                    if tsv_descriptor.use_base_metadata():
+                        field = descend_field(request, file, traversal_path) or ''
+                        if field_name == FILE_GROUP:  # requires special care
+                            field = handle_file_group(field)
+                    else:
+                        field = descend_field(request, ef, traversal_path) or ''
                     ef_line.append(field)
                 data_lines += [ef_line]

src/encoded/project/access_key.py

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
from snovault.project.access_key import SnovaultProjectAccessKey

class SMAHTProjectAccessKey(SnovaultProjectAccessKey):
    def access_key_has_expiration_date(self):
        return True
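
Per the changelog, this hook turns on key expiration so that access keys now expire after 30 days. A hypothetical sketch of the effect (the constant and helper below are illustrative assumptions, not the snovault implementation):

    from datetime import datetime, timedelta, timezone

    # Hypothetical illustration only: when access_key_has_expiration_date()
    # returns True, snovault attaches an expiration to newly created keys;
    # the 30-day window matches the 0.47.0 changelog entry.
    ACCESS_KEY_LIFETIME_DAYS = 30  # assumed name, not from snovault

    def expiration_for(created: datetime) -> datetime:
        return created + timedelta(days=ACCESS_KEY_LIFETIME_DAYS)

    print(expiration_for(datetime.now(timezone.utc)))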
