Add cohorts submissions guide #193

Open
wants to merge 4 commits into
base: master
151 changes: 151 additions & 0 deletions faq/cohort-subs-guide.rst
@@ -0,0 +1,151 @@
How to Submit Multi-Omic Cohort Datasets
========================================

This guide describes how to submit a multi-omic dataset so that it is displayed as an entry in the `Pathogens
Portal Cohort browser <https://www.pathogensportal.org/cohorts?activeTab=Browser>`_.
If you wish to archive a multi-omic dataset that is not pathogen-related, similar principles apply,
but the dataset will not be displayed in the Pathogens Portal.

Introduction
````````````
Infectious disease plays out as an intricate set of molecular interactions between the systems of both pathogen and infected host.
In cases of vector-borne disease, such as malaria, or diseases with intermediate hosts, such as tapeworm, interactions with further
species are involved. Studying these interconnected biologies, for example to understand infection mechanisms and patient response,
develop clinical and public health interventions, and predict the outcomes of circulating new pathogen variants, requires
combined data sets which span the two or more organisms involved in the infection.

Regardless of which technical platform is used to generate them, biological data can be organised around the concept of a sample.
A biological sample, such as a blood sample from a patient, can be represented as a digital record with an identifier. When the
sample is subjected to different assays, such as genomic sequencing or serology analysis, each of the resultant data sets can
reference the identifier of the sample from which they were derived. In many workflows, samples are divided, such as when a
wastewater sample is size-filtered to yield a bacterial subsample and a viral subsample. Records for each of these new samples
can be created and given their own identifiers, and reference can be made back to the sample from which they were derived by using
its top-level sample identifier.
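
The parent and child relationship described above can be pictured as a small data structure. The following is a
minimal Python sketch, not part of any ENA or BioSamples tooling: the ``SampleRecord`` class, the ``top_level_of``
helper and the ``SAMFAKE...`` accessions are all hypothetical and exist only for illustration.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Dict, Optional


    @dataclass(frozen=True)
    class SampleRecord:
        """A digital sample record; the accessions used below are placeholders."""
        accession: str
        description: str
        derived_from: Optional[str] = None  # accession of the parent sample, if any


    def top_level_of(accession: str, records: Dict[str, SampleRecord]) -> str:
        """Follow derived_from links upwards until a sample with no parent is reached."""
        current = records[accession]
        while current.derived_from is not None:
            current = records[current.derived_from]
        return current.accession


    # A wastewater sample that is size-filtered into two subsamples.
    records = {
        r.accession: r
        for r in [
            SampleRecord("SAMFAKE0001", "wastewater sample"),
            SampleRecord("SAMFAKE0002", "bacterial subsample", derived_from="SAMFAKE0001"),
            SampleRecord("SAMFAKE0003", "viral subsample", derived_from="SAMFAKE0001"),
        ]
    }

    print(top_level_of("SAMFAKE0003", records))

Because every child record stores only the identifier of its parent, any data set that references a child sample can
be traced back to the top-level sample without duplicating the top-level metadata.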

For example, in the diagram below, the top-level sample (#1) is linked to various child samples, which hold information
for data in multiple databases:

.. image:: images/linked_samples.png
   :width: 600
   :alt: diagram showing BioSample relationships and data types
   :align: center

Steps
`````
The steps below provide an overview of creating a multi-omic dataset. If you are planning to submit a linked cohort
dataset, we strongly advise you to contact us at [email protected] before starting, including some details about your
study, so that we can give guidance on your sample structure and on how to complete the data submissions.

1. Create the top-level BioSample
'''''''''''''''''''''''''''''''''

The first step is to create the top-level Samples using the `BioSamples Archive <https://www.ebi.ac.uk/biosamples/>`_.
Each top-level Sample represents one case or patient in the study, shown as Sample #1 in the diagram.
If the sample is human, the record can contain minimal, non-identifying metadata about the patient (e.g. gender,
organism, disease). See an example `here <https://www.ebi.ac.uk/biosamples/samples/SAMEA12928716>`_.

Top-level Sample records can be created in BioSamples using the `BioSamples uploader tool <https://www.ebi.ac.uk/biosamples/docs/cookbook/upload_files>`_.
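
To illustrate how little metadata a top-level record needs, the sketch below assembles a sample document in Python.
It rests on assumptions: the layout (``name``, ``release``, and ``characteristics`` holding lists of ``text`` values)
reflects our understanding of the BioSamples JSON format and should be checked against the BioSamples documentation,
while the ``build_top_level_sample`` helper, the sample name and the attribute values are hypothetical.

.. code-block:: python

    import json
    from typing import Dict


    def build_top_level_sample(name: str, release: str, attributes: Dict[str, str]) -> dict:
        """Assemble a minimal, non-identifying sample document (assumed layout)."""
        return {
            "name": name,
            "release": release,
            # Each attribute is stored as a list of {"text": value} entries.
            "characteristics": {
                key: [{"text": value}] for key, value in attributes.items()
            },
        }


    sample = build_top_level_sample(
        name="cohort-x-patient-001",  # placeholder name, not a real record
        release="2024-01-01T00:00:00Z",
        attributes={"organism": "Homo sapiens", "gender": "female", "disease": "COVID-19"},
    )
    print(json.dumps(sample, indent=2))

The sketch only builds the document locally. Submission itself is done through the uploader tool linked above.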

.. note::
   The ENA and the `EGA (European Genome Phenome Archive) <https://ega-archive.org/>`_ are the only archives which integrate
   BioSample records into their :doc:`database structure <../submit/general-guide/metadata>`. For data deposited at other
   archives, additional BioSample records may need to be created (in BioSamples) to represent those data.

2. Submit Pathogen Sequence data to the ENA
'''''''''''''''''''''''''''''''''''''''''''

The next step is to submit your nucleotide records (raw reads or assembly data) to the ENA.
The :doc:`Pathogen Submissions Guide <pathogen-subs-guide>` provides a quick introduction to the ENA and tips for
pathogen data submissions.
For full details, please refer to the :doc:`ENA General Submissions Guide <../submit/general-guide>`.


3. Submit other data types to appropriate database resources
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

The next step is to submit your other datasets to the correct database for each data type. The `EBI submissions wizard
<https://www.ebi.ac.uk/submission/>`_ can help direct you to a resource to deposit your data.
We recommend the following database resources for common data types:

- For sensitive human nucleotide records and human clinical epidemiological data that require controlled access, please
  contact the `EGA (European Genome Phenome Archive) <https://ega-archive.org/>`_ to start a submission.
- For expression data, or uncategorised datasets, please use `ArrayExpress/BioStudies <https://www.ebi.ac.uk/biostudies/arrayexpress>`_.

4. Create the child BioSamples for linking
''''''''''''''''''''''''''''''''''''''''''

After the datasets have been submitted to the appropriate databases, the child Samples required for linking can be created.
Each child Sample records its relationship to the top-level Sample. Different samples can be used for different
data types **and** for different time points. Please contact us if you have any doubts about setting up your sample structure.


5. Link together the samples using BioSamples
''''''''''''''''''''''''''''''''''''''''''''''

Link the samples created in other EBI resources to the top-level Sample using a
`BioSamples 'derived from' curation <https://www.ebi.ac.uk/biosamples/docs/references/api/submit#_submit_curation_object>`_.
In the derived from relationship, the Source is the child Sample and the Target is
the top-level Sample:

**Source sample** - *derived from* - **Target sample**

**Child sample accession** - *derived from* - **Parent sample accession**

For example, in the first linked dataset, the `Erasmus Medical Center (EMC) study <https://www.infectious-diseases-toolkit.org/showcase/linked-cohort-data>`_,
the BioSamples relationship is as follows:

**[T/B-Cell/Antibody profile/ENA viral sample accession]** - *derived from* - **[Top level patient sample accession]**

A JSON curation object (see the example below) containing the relationship with its source and target samples
can be created and submitted via curl to the `BioSamples API <https://www.ebi.ac.uk/biosamples/docs/references/api/submit#_submit_curation_object>`_.

JSON curation:

.. code-block:: JSON

    {
      "curation" : {
        "relationshipsPre" : [ ],
        "relationshipsPost" : [ {
          "source" : "SAMFAKE123456",
          "type" : "DERIVED_FROM",
          "target" : "SAMFAKE7654321"
        } ],
        "hash" : "09a5a9cddbea9f5bb6302b86b922c408abc92b8b10c78f0662ac7e41fd44e91f"
      },
      "domain" : null,
      "webinSubmissionAccountId" : "WEBIN-12345",
      "created" : "2023-07-17T12:19:33.056356Z",
      "hash" : "d1f611ec2c8caf3d9f58fa40227ea60ebb5fc00eda06338fb81db7d987a6fa63"
    }

..

Please contact [email protected] for technical support with any questions related to sample
linking using BioSamples.
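
When many child samples need linking, the relationship part of the curation object shown above can be generated rather
than written by hand. The Python sketch below is illustrative only and makes several assumptions: the helper names are
hypothetical, the ``/curationlinks`` path is our reading of the BioSamples API documentation and should be confirmed
there, and fields such as ``hash`` and ``created`` are left out on the assumption that the service fills them in. It
builds the payload locally and makes no network request.

.. code-block:: python

    import json

    # Assumed base URL and path pattern; confirm against the BioSamples API documentation.
    BIOSAMPLES_URL = "https://www.ebi.ac.uk/biosamples/samples"


    def build_derived_from_curation(child: str, parent: str) -> dict:
        """Build a curation whose only change is adding 'child DERIVED_FROM parent'."""
        return {
            "curation": {
                "relationshipsPre": [],
                "relationshipsPost": [
                    # Source is the child Sample, Target is the top-level Sample.
                    {"source": child, "type": "DERIVED_FROM", "target": parent}
                ],
            }
        }


    def curation_url(child: str) -> str:
        """URL the curation would be posted to (hypothetical helper, assumed path)."""
        return f"{BIOSAMPLES_URL}/{child}/curationlinks"


    payload = build_derived_from_curation("SAMFAKE123456", "SAMFAKE7654321")
    print(curation_url("SAMFAKE123456"))
    print(json.dumps(payload, indent=2))

Keeping the direction of the relationship inside one function makes it harder to swap source and target by accident,
which is the most likely mistake when linking a large cohort.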

6. Submit the cohort metadata
'''''''''''''''''''''''''''''

While the BioSamples database is key to capturing the links between data types at participant level, the
`Cohort Browser <https://www.pathogensportal.org/cohorts>`_ presents a range of study-level information about each cohort.
This metadata is an integral part of the Pathogens Portal: it enhances the findability of a cohort dataset and serves
as the primary entry point into the dataset. The data types included in the dataset are shown in the
'Type of data' column of the Cohort Browser.

For your cohort to be displayed in the Cohort Browser, please contact us to check which metadata will be needed for your dataset.
As a guide, the following information will be needed to describe the cohort:

- Cohort acronym/link to webpage
- Cohort title
- Cohort/study description
- Institution
- Number of participants
- Territory/country
- Enrollment period
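
Before getting in touch, it can help to check that a draft description covers every item in the list above. The
following Python sketch shows one way to do that. The field names are illustrative labels chosen for this example and
are not the column names of the actual form, and the ``missing_fields`` helper is hypothetical.

.. code-block:: python

    from typing import Dict, List

    # Illustrative labels for the items listed above (not the real form columns).
    REQUIRED_FIELDS = [
        "acronym_or_webpage",
        "title",
        "description",
        "institution",
        "number_of_participants",
        "territory_or_country",
        "enrollment_period",
    ]


    def missing_fields(cohort: Dict[str, str]) -> List[str]:
        """Return the required fields that are absent or blank, in list order."""
        return [f for f in REQUIRED_FIELDS if not str(cohort.get(f, "")).strip()]


    draft = {
        "acronym_or_webpage": "EXAMPLE-COHORT",
        "title": "Example linked cohort",
        "institution": "Example Institute",
        "number_of_participants": "120",
    }
    print(missing_fields(draft))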

A more complete version of the suggested metadata is available in
`this form <https://docs.google.com/spreadsheets/d/1LuyPhv1J5t2FU7JE2XjW9n__PjGTxeBoA38PXpN8sG8/edit#gid=0>`_.
Please get in touch with us at [email protected] if you
would like to add your cohort metadata to the Pathogens Portal Cohort Browser.
Binary file added faq/images/linked_samples.png
48 changes: 23 additions & 25 deletions faq/pathogen-subs-guide.rst
@@ -2,8 +2,9 @@ General Pathogens Submissions Guide
==================================

.. image:: images/pathogens_logo_1.png
:width: 400
:align: center
:width: 400
:alt: Pathogens Portal logo
:align: center



@@ -15,19 +16,22 @@ Introduction
~~~~~~~~~~~~


This guide provides general information and help for submitting pathogen sequence data to the `European Nucleotide Archive (ENA) <https://www.ebi.ac.uk/ena/browser/home>`_
. All public `INSDC <https://www.insdc.org/>`_ pathogen data will be made available to browse using the `Pathogens Portal <https://www.ebi.ac.uk/ena/pathogens/v2/>`_.
This guide provides general information and help for submitting pathogen sequence data to the
`European Nucleotide Archive (ENA) <https://www.ebi.ac.uk/ena/browser/home>`_. The ENA is a partner of the `INSDC
<https://www.insdc.org/>`_ (International Nucleotide Sequence Database Collaboration), and provides an entry point for
INSDC data. All public pathogen data is made available by the ENA to explore and browse via
the `Pathogens Portal <https://www.pathogensportal.org/>`_.

Please see below for a specific guide for submitting pathogen related data. The guide frequently refers to the
`ENA Training Modules <https://ena-docs.readthedocs.io/en/latest/index.html>`_,
our general ENA submissions guide. If you have any queries or require assistance with your submission please contact
us at [email protected].
This is a walk-through guide for submitting pathogen-related raw read files and assembled 'clone or isolate' genomes.
The guide frequently refers to our `ENA Data Submission <https://ena-docs.readthedocs.io/en/latest/index.html>`_ pages.
If your pathogen dataset is not raw reads or a genome, or you have any other queries about archiving your data at the
ENA, you can also contact us at [email protected].

.. tip::

**Looking for something else?**
**Are you submitting SARS-CoV-2 or Monkeypox virus data?**

For pathogen-specific submissions guidance, please refer to these guides:
We have tailored support for SARS-CoV-2 and Monkeypox virus data submissions here:

- `ENA SARS-CoV-2 submissions guide <https://ena-covid19-docs.readthedocs.io/en/latest/index.html>`_
- `Monkeypox virus ENA submissions Guidance <https://docs.google.com/viewer?url=https://github.com/enasequence/ena-content-dataflow/raw/master/docs/Monkeypox%20virus%20ENA%20Submission%20Guidance.pdf>`_
@@ -53,15 +57,6 @@ Before submitting data to ENA, it is important to familiarise yourself with the
and what parts of your research project can be represented by which metadata objects. This will determine what you need to submit.


.. raw:: html


<embed>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">1/8<br><br>The ENA would like to introduce you to our very first TWEETORIAL! For this <a href="https://twitter.com/hashtag/tweetorial?src=hash&amp;ref_src=twsrc%5Etfw">#tweetorial</a>, we will be explaining the ENA Metadata Model. When submitting data to the ENA, you need to register additional metadata so your submission is in accordance with FAIR data principles. <a href="https://t.co/m45ENIrlIM">pic.twitter.com/m45ENIrlIM</a></p>&mdash; European Nucleotide Archive (ENA) (@ENASequence) <a href="https://twitter.com/ENASequence/status/1514229572425994245?ref_src=twsrc%5Etfw">April 13, 2022</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</embed>



ENA Submission routes
`````````````````````
ENA allows submissions via three routes, each of which is appropriate for a
@@ -173,7 +168,7 @@ in the checklists. If you specify the strain with both, this will make your stra
The `ENA taxonomy API <https://www.ebi.ac.uk/ena/taxonomy/rest/>`_ interface may also be used.


Sample host
Sample Host
'''''''''''

Every pathogen checklist includes host attribute fields which can be used to describe the host. Some guidance on filling in the host fields is provided here.
@@ -235,17 +230,18 @@ Prepare files
Assembly file
'''''''''''''

The accepted format for unannotated genome assembly is **fasta** OR for annotated genome assembly, the accepted format is **embl flat file**
Please refer to the `Accepted genome assembly data formats guide <https://ena-docs.readthedocs.io/en/latest/submit/fileprep/assembly.html#accepted-genome-assembly-data-formats>`_
The accepted format for unannotated genome assembly is **fasta**. For annotated genome assemblies, the accepted format
is **embl flat file**. Please refer to `Accepted Genome Assembly Data Formats
<https://ena-docs.readthedocs.io/en/latest/submit/fileprep/assembly.html#accepted-genome-assembly-data-formats>`_
for information about preparing these files.


Manifest file
'''''''''''''

The manifest file is a tab-separated .txt file for Webin-CLI assembly submission. It specifies metadata about the
assembly, including the study and sample it is linked to.
Please refer to the `assembly manifest file guide <https://ena-docs.readthedocs.io/en/latest/submit/assembly/genome.html#manifest-files>`_
assembly, including the Study and Sample it is linked to.
Please refer to the `Clone or isolate genome manifest file guide <https://ena-docs.readthedocs.io/en/latest/submit/assembly/genome.html#manifest-files>`_
for permitted values.

For example, the following manifest file represents a genome assembly consisting of contigs provided in one fasta file:
@@ -270,7 +266,9 @@ For example, the following manifest file represents a genome assembly consisting
Chromosome list file
''''''''''''''''''''

The **chromosome list file** must be provided when the submission contains assembled chromosomes. This is a tab separated file up to four columns. Each row describes each replicon unit within the assembly. Please refer to the `chromosome list file guide <https://ena-docs.readthedocs.io/en/latest/submit/fileprep/assembly.html#chromosome-list-file>`_
The **chromosome list file** must be provided when the submission contains assembled chromosomes. This is a
tab-separated file with up to four columns. Each row describes one replicon unit within the assembly. Please refer to
`Accepted Genome Assembly Data Formats <https://ena-docs.readthedocs.io/en/latest/submit/fileprep/assembly.html#chromosome-list-file>`_
for permitted values.

.. tabs::
2 changes: 1 addition & 1 deletion submit/annotation/clearinghouse_for_ENA_users.md
@@ -127,5 +127,5 @@ It is important to differentiate between the curations submitted via the ELIXIR


## Appendix:
### 1. [A template bash script for submission](clearinghouse_submission_template.sh)
### 1. {doc}`A template bash script for submission </submit/annotation/clearinghouse_submission_template>`