enasequence · jas-mckin · Jul 3, 2024 · Jul 3, 2024 · Sep 4, 2024 · Sep 26, 2024
diff --git a/faq/cohort-subs-guide.rst b/faq/cohort-subs-guide.rst
@@ -0,0 +1,151 @@
+How to Submit Multi-Omic Cohort Datasets
+========================================
+
+This guide includes information about how to submit a multi-omic dataset to be displayed as an entry in the `Pathogens
+Portal Cohort browser <https://www.pathogensportal.org/cohorts?activeTab=Browser>`_.
+If you have a multi-omic dataset you wish to archive, but it is not linked to Pathogens, similar principles will apply,
+but the dataset will not be displayed in the Pathogens Portal.
+
+Introduction
+````````````
+Infectious disease plays out as an intricate set of molecular interactions between the systems of both pathogen and infected host.
+In cases of vector-borne disease, such as malaria, or diseases with intermediate hosts, such as tapeworm, interactions with further
+species are involved. Studying these interconnected biologies, such as to understand infection mechanisms and patient response,
+develop clinical and public health interventions and predict outcomes of the circulation of new pathogen variants, requires the use
+of combined data sets which span the two or more organisms involved in the infection.
+
+Regardless of which technical platform is used for their generation, biological data can be organised around the concept of sample.
+A biological sample, such as a blood sample from a patient, can be represented as a digital record with an identifier. When the
+sample is subjected to different assays, such as genomic sequencing or serology analysis, each of the resultant data sets can
+reference the identifier of the sample from which they were derived. In many workflows, samples are divided, such as when a
+wastewater sample is size-filtered to yield a bacterial subsample and a viral subsample. Records for each of these new samples
+can be created and given their own identifiers, and reference can be made back to the sample from which they were derived by using
+its top-level sample identifier.
+
+For example, in the diagram below, the top-level sample (#1) is linked to various child samples which hold information
+for data in multiple databases:
+
+.. image:: images/linked_samples.png
+   :width: 600
+   :alt: diagram showing BioSample relationships and data types
+   :align: center
+
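The sample hierarchy described above can be pictured as records that each point back to their parent. The following is a minimal Python sketch of that idea only: the `SampleRecord` class, the `top_level_accession` helper and the `SAMFAKE...` accessions are hypothetical illustrations and are not part of BioSamples or any ENA tooling.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SampleRecord:
    """Hypothetical stand-in for a sample record; not a BioSamples type."""
    accession: str                      # placeholder identifier
    description: str
    derived_from: Optional[str] = None  # accession of the parent sample, if any


def top_level_accession(accession: str, records: Dict[str, SampleRecord]) -> str:
    """Follow derived_from links until a sample with no parent is reached."""
    seen = set()
    current = records[accession]
    while current.derived_from is not None:
        if current.accession in seen:
            raise ValueError("cycle in derived_from links")
        seen.add(current.accession)
        current = records[current.derived_from]
    return current.accession


# A wastewater sample split into a bacterial and a viral subsample.
records = {
    r.accession: r
    for r in [
        SampleRecord("SAMFAKE0000001", "wastewater sample"),
        SampleRecord("SAMFAKE0000002", "bacterial subsample", derived_from="SAMFAKE0000001"),
        SampleRecord("SAMFAKE0000003", "viral subsample", derived_from="SAMFAKE0000001"),
    ]
}

print(top_level_accession("SAMFAKE0000003", records))
```

Each child record stores only its parent's identifier, so any data set that references a child sample can be traced back to the top-level sample.
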
+Steps
+`````
+The steps below provide an overview of creating a multi-omic dataset. If you are planning to submit a linked cohort
+dataset, we strongly advise you to contact us at [email protected] before starting a submission, including some
+details about your study, so that we can give guidance on your sample structure and on how to complete the data submissions.
+
+1. Create the top-level BioSample
+'''''''''''''''''''''''''''''''''
+
+The first step is to create top-level Samples using the `BioSamples Archive <https://www.ebi.ac.uk/biosamples/>`_.
+Each of these Samples represents one case or patient in the study, and corresponds to Sample #1 in the diagram.
+If the sample is human, the record can contain minimal, non-identifying metadata about the patient (e.g. gender,
+organism, disease). See an example `here <https://www.ebi.ac.uk/biosamples/samples/SAMEA12928716>`_.
+
+Top-level Sample records can be created in BioSamples using the `BioSamples uploader tool <https://www.ebi.ac.uk/biosamples/docs/cookbook/upload_files>`_.
+
+.. note ::
+    The ENA and the `EGA (European Genome Phenome Archive) <https://ega-archive.org/>`_ are the only archives which integrate
+    BioSample records into their :doc:`database structure <../submit/general-guide/metadata>`. For data deposited at other
+    archives, additional BioSample records may need to be created (in BioSamples) to represent those data.
+
+2. Submit Pathogen Sequence data to the ENA
+'''''''''''''''''''''''''''''''''''''''''''
+
+The next step is to submit your nucleotide records (raw reads or assembly data) to the ENA.
+The :doc:`Pathogen Submissions Guide <pathogen-subs-guide>` provides a quick introduction to the ENA and tips for
+Pathogen data submissions.
+Otherwise, please refer to the :doc:`ENA General Submissions Guide <../submit/general-guide>`.
+
+
+3. Submit other data types to appropriate database resources
+''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+The next step is to create your datasets in the correct database for the data type. The `EBI submissions wizard
+<https://www.ebi.ac.uk/submission/>`_ can help direct you to a resource to deposit your data.
+We can recommend the following database resources for common data types:
+
+- For sensitive human nucleotide records and human clinical epidemiological data which requires controlled access, please
+  contact the `EGA (European Genome Phenome Archive) <https://ega-archive.org/>`_ to start a submission.
+- For expression data, or uncategorised datasets, please use `ArrayExpress/BioStudies <https://www.ebi.ac.uk/biostudies/arrayexpress>`_.
+
+4. Create the child BioSamples for linking
+''''''''''''''''''''''''''''''''''''''''''
+
+After the datasets have been submitted to the appropriate databases, the child Samples required for linking can be created.
+Each child Sample records its relationship to the top-level Sample. Different samples can be used for different
+data types **and** for different time points. Please contact us if you have any doubts about setting up your sample structure.
+
+
+5. Link together the samples using BioSamples
+''''''''''''''''''''''''''''''''''''''''''''''
+
+Link your samples created from other EBI resources to the top-level sample using a
+`BioSamples ‘derived from’ curation <https://www.ebi.ac.uk/biosamples/docs/references/api/submit#_submit_curation_object>`_.
+The ‘derived from’ relationship is used as follows, where the Source is the child Sample and the Target is
+the top-level Sample:
+
+**Source sample** - *derived from* - **Target sample**
+
+**Child sample accession** - *derived from* - **Parent sample accession**
+
+For example, in the first linked dataset, the `Erasmus Medical Center (EMC) study <https://www.infectious-diseases-toolkit.org/showcase/linked-cohort-data>`_,
+the BioSamples relationship is as follows:
+
+**[T/B-Cell/Antibody profile/ENA viral sample accession]** - *derived from* - **[Top level patient sample accession]**
+
+A JSON curation object (see example below) containing the relationship attribute with the source and target sample
+accessions can be created and submitted via curl to the `BioSamples API <https://www.ebi.ac.uk/biosamples/docs/references/api/submit#_submit_curation_object>`_.
+
+JSON curation:
+
+.. code-block:: JSON
+
+   {
+     "curation" : {
+       "relationshipsPre" : [ ],
+       "relationshipsPost" : [ {
+         "source" : "SAMFAKE123456",
+         "type" : "DERIVED_FROM",
+         "target" : "SAMFAKE7654321"
+       } ],
+       "hash" : "09a5a9cddbea9f5bb6302b86b922c408abc92b8b10c78f0662ac7e41fd44e91f"
+     },
+     "domain" : null,
+     "webinSubmissionAccountId" : "WEBIN-12345",
+     "created" : "2023-07-17T12:19:33.056356Z",
+     "hash" : "d1f611ec2c8caf3d9f58fa40227ea60ebb5fc00eda06338fb81db7d987a6fa63"
+   }
+
+..
+
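When many child samples need linking, it may help to generate the relationship part of the curation object programmatically. The sketch below is a minimal Python illustration using only the field names shown in the JSON example above; the `build_derived_from_curation` function is hypothetical. It leaves out the `hash`, `created` and `domain` fields on the assumption that these are filled in by the service, which this guide does not state, so please check the BioSamples API documentation for the exact request body and endpoint before submitting.

```python
import json


def build_derived_from_curation(child: str, parent: str, webin_id: str) -> dict:
    """Build a 'derived from' curation: child (source) -> parent (target).

    Hypothetical helper; field names are copied from the JSON example in this
    guide. 'hash', 'created' and 'domain' are deliberately omitted (assumption:
    they are populated by the service).
    """
    if child == parent:
        raise ValueError("a sample cannot be derived from itself")
    return {
        "curation": {
            "relationshipsPre": [],
            "relationshipsPost": [
                {"source": child, "type": "DERIVED_FROM", "target": parent}
            ],
        },
        "webinSubmissionAccountId": webin_id,
    }


# Placeholder accessions and Webin account, as in the example above.
payload = build_derived_from_curation("SAMFAKE123456", "SAMFAKE7654321", "WEBIN-12345")
print(json.dumps(payload, indent=2))
```

The printed JSON can be saved to a file and adapted for the curl submission described above.
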
+Please contact [email protected] for technical support with any questions related to sample
+linking using BioSamples.
+
+6. Submit the cohort metadata
+'''''''''''''''''''''''''''''
+
+While the BioSamples database is key to capturing the linking of data types at the participant level, the
+`Cohort Browser <https://www.pathogensportal.org/cohorts>`_ presents a range of study-level information about each cohort.
+This metadata is an integral part of the Pathogens Portal: it enhances the findability of a cohort dataset and serves
+as the primary entry point into the dataset. The data types included in the dataset are shown in the
+'Type of data' column of the Cohort Browser.
+
+For your cohort to be displayed in the Cohort Browser, please contact us to check which metadata will be needed for your dataset.
+As a guide, the following information will be needed to describe the cohort:
+
+- Cohort acronym/link to webpage
+- Cohort title
+- Cohort/study description
+- Institution
+- Number of participants
+- Territory/country
+- Enrollment period
+
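Before getting in touch, it can be useful to check a draft of the cohort metadata against the list above. The following Python sketch shows one way to do that; the key names in `REQUIRED_FIELDS` and the `missing_cohort_fields` function are hypothetical labels chosen for this illustration and do not correspond to an official Pathogens Portal schema.

```python
# Hypothetical key names mirroring the bullet list above; not an official schema.
REQUIRED_FIELDS = [
    "acronym_or_link",
    "title",
    "description",
    "institution",
    "number_of_participants",
    "territory_or_country",
    "enrollment_period",
]


def missing_cohort_fields(metadata: dict) -> list:
    """Return the required fields that are absent or empty, in list order."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# An incomplete draft with made-up values.
draft = {
    "title": "Example cohort",
    "institution": "Example Institute",
    "number_of_participants": 120,
}
print(missing_cohort_fields(draft))
```
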
+Please find the form `here <https://docs.google.com/spreadsheets/d/1LuyPhv1J5t2FU7JE2XjW9n__PjGTxeBoA38PXpN8sG8/edit#gid=0>`_
+for a more complete version of the suggested metadata. Please get in touch with us using [email protected] if you
+would like to add your cohort metadata to the Pathogens Portal Cohort Browser.
diff --git a/faq/images/linked_samples.png b/faq/images/linked_samples.png
diff --git a/faq/pathogen-subs-guide.rst b/faq/pathogen-subs-guide.rst
@@ -2,8 +2,9 @@ General Pathogens Submissions Guide
 ==================================
 
 .. image:: images/pathogens_logo_1.png
- :width: 400
- :align: center
+  :width: 400
+  :alt: Pathogens Portal logo
+  :align: center
 
 
 
@@ -15,19 +16,22 @@ Introduction
 ~~~~~~~~~~~~
 
 
-This guide provides general information and help for submitting pathogen sequence data to the `European Nucleotide Archive (ENA) <https://www.ebi.ac.uk/ena/browser/home>`_
-. All public `INSDC <https://www.insdc.org/>`_ pathogen data will be made available to browse using the `Pathogens Portal <https://www.ebi.ac.uk/ena/pathogens/v2/>`_.
+This guide provides general information and help for submitting pathogen sequence data to the
+`European Nucleotide Archive (ENA) <https://www.ebi.ac.uk/ena/browser/home>`_. The ENA is a partner of the `INSDC
+<https://www.insdc.org/>`_ (International Nucleotide Sequence Database Collaboration), and provides an entry point for
+INSDC data. All public pathogen data is made available by the ENA to explore and browse via
+the `Pathogens Portal <https://www.pathogensportal.org/>`_.
 
-Please see below for a specific guide for submitting pathogen related data. The guide frequently refers to the
-`ENA Training Modules <https://ena-docs.readthedocs.io/en/latest/index.html>`_,
-our general ENA submissions guide. If you have any queries or require assistance with your submission please contact
-us at [email protected].
+This is a walk-through guide for submitting pathogen-related raw read files and assembled 'clone or isolate' genomes.
+The guide frequently refers to our `ENA Data Submission <https://ena-docs.readthedocs.io/en/latest/index.html>`_ pages.
+If your pathogen dataset is not raw reads or a genome, or you have any other queries about archiving your data at the
+ENA, you can also contact us at [email protected].
 
 .. tip::
 
-  **Looking for something else?**
+  **Are you submitting SARS-CoV-2 or Monkeypox virus data?**
 
-  For pathogen-specific submissions guidance, please refer to these guides:
+  We have tailored support for SARS-CoV-2 and Monkeypox virus data submissions here:
 
   - `ENA SARS-CoV-2 submissions guide <https://ena-covid19-docs.readthedocs.io/en/latest/index.html>`_
   - `Monkeypox virus ENA submissions Guidance <https://docs.google.com/viewer?url=https://github.com/enasequence/ena-content-dataflow/raw/master/docs/Monkeypox%20virus%20ENA%20Submission%20Guidance.pdf>`_
@@ -53,15 +57,6 @@ Before submitting data to ENA, it is important to familiarise yourself with the
 and what parts of your research project can be represented by which metadata objects. This will determine what you need to submit.
 
 
-.. raw:: html
-
-
-    <embed>
-        <blockquote class="twitter-tweet"><p lang="en" dir="ltr">1/8<br><br>The ENA would like to introduce you to our very first TWEETORIAL! For this <a href="https://twitter.com/hashtag/tweetorial?src=hash&amp;ref_src=twsrc%5Etfw">#tweetorial</a>, we will be explaining the ENA Metadata Model. When submitting data to the ENA, you need to register additional metadata so your submission is in accordance with FAIR data principles. <a href="https://t.co/m45ENIrlIM">pic.twitter.com/m45ENIrlIM</a></p>&mdash; European Nucleotide Archive (ENA) (@ENASequence) <a href="https://twitter.com/ENASequence/status/1514229572425994245?ref_src=twsrc%5Etfw">April 13, 2022</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
-    </embed>
-
-
-
 ENA Submission routes
 `````````````````````
 ENA allows submissions via three routes, each of which is appropriate for a
@@ -173,7 +168,7 @@ in the checklists. If you specify the strain with both, this will make your stra
 The `ENA taxonomy API <https://www.ebi.ac.uk/ena/taxonomy/rest/>`_ interface may also be used.
 
 
-Sample host
+Sample Host
 '''''''''''
 
 Every pathogen checklist includes host attribute fields which can be used to describe the host. Some guidance on filling in the host fields is provided here.
@@ -235,17 +230,18 @@ Prepare files
 Assembly file
 '''''''''''''
 
-The accepted format for unannotated genome assembly is **fasta** OR for annotated genome assembly, the accepted format is **embl flat file**
-Please refer to the `Accepted genome assembly data formats guide <https://ena-docs.readthedocs.io/en/latest/submit/fileprep/assembly.html#accepted-genome-assembly-data-formats>`_
+The accepted format for unannotated genome assembly is **fasta**. For annotated genome assemblies, the accepted format
+is **embl flat file**. Please refer to `Accepted Genome Assembly Data Formats
+<https://ena-docs.readthedocs.io/en/latest/submit/fileprep/assembly.html#accepted-genome-assembly-data-formats>`_
 for information about preparing these files.
 
 
 Manifest file
 '''''''''''''
 
 The manifest file is a tab-separated .txt file for Webin-CLI assembly submission. It specifies metadata about the
-assembly, including the study and sample it is linked to.
-Please refer to the `assembly manifest file guide <https://ena-docs.readthedocs.io/en/latest/submit/assembly/genome.html#manifest-files>`_
+assembly, including the Study and Sample it is linked to.
+Please refer to the `Clone or isolate genome manifest file guide <https://ena-docs.readthedocs.io/en/latest/submit/assembly/genome.html#manifest-files>`_
 for permitted values.
 
 For example, the following manifest file represents a genome assembly consisting of contigs provided in one fasta file:
@@ -270,7 +266,9 @@ For example, the following manifest file represents a genome assembly consisting
 Chromosome list file
 ''''''''''''''''''''
 
-The **chromosome list file** must be provided when the submission contains assembled chromosomes. This is a tab separated file up to four columns. Each row describes each replicon unit within the assembly. Please refer to the `chromosome list file guide <https://ena-docs.readthedocs.io/en/latest/submit/fileprep/assembly.html#chromosome-list-file>`_
+The **chromosome list file** must be provided when the submission contains assembled chromosomes. This is a
+tab-separated file of up to four columns. Each row describes one replicon unit within the assembly. Please refer to
+`Accepted Genome Assembly Data Formats <https://ena-docs.readthedocs.io/en/latest/submit/fileprep/assembly.html#chromosome-list-file>`_
 for permitted values.
 
 .. tabs::

diff --git a/submit/annotation/clearinghouse_for_ENA_users.md b/submit/annotation/clearinghouse_for_ENA_users.md
@@ -127,5 +127,5 @@ It is important to differentiate between the curations submitted via the ELIXIR
 
 
 ## Appendix:
-### 1. [A template bash script for submission](clearinghouse_submission_template.sh)
+### 1. {doc}`A template bash script for submission </submit/annotation/clearinghouse_submission_template>`