General Pathogens Submissions Guide

Introduction
Getting Started
Register Metadata
- Register Study
- Register Samples
Submit Runs
Submit Assembled Sequences
- Prepare files
- Webin-CLI submission
Data Release and Citing

Introduction

This guide provides general information and help for submitting pathogen sequence data to the European Nucleotide Archive (ENA). The ENA is a partner of the INSDC (International Nucleotide Sequence Database Collaboration), and provides an entry point for INSDC data. All public pathogen data is made available by the ENA to explore and browse via the Pathogens Portal.

This is a walk-through guide for submitting pathogen-related raw read files and assembled 'clone or isolate' genomes. The guide frequently refers to our ENA Data Submission pages. If your pathogen dataset is not raw reads or a genome, or you have any other queries about archiving your data at the ENA, you can also contact us at ena-path-collabs@ebi.ac.uk.

Tip

Are you submitting SARS-CoV-2 or Monkeypox virus data?

We have tailored support for SARS-CoV-2 and Monkeypox virus data submissions here:

ENA SARS-CoV-2 submissions guide
Monkeypox virus ENA submissions Guidance

For small-scale SARS-CoV-2 viral data submissions, with no prior knowledge of ENA submission routes, we have developed a drag and drop submissions tool. Please complete the form if you would like to submit your data using this route.

Getting Started

Register a submission account

Before you can submit data to the ENA you must register a Webin submission account.

Please navigate to the Webin Portal, click the ‘Register’ button, and complete the registration form.

The ENA Metadata Model

Before submitting data to ENA, it is important to familiarise yourself with the ENA metadata model and what parts of your research project can be represented by which metadata objects. This will determine what you need to submit.

ENA Submission routes

ENA allows submissions via three routes, each of which is appropriate for a different set of submission types. You may be required to use more than one in the process of submitting your data:

Interactive Submissions are completed by filling out web forms directly in your browser and downloading template spreadsheets that can be completed off-line and uploaded to ENA. This is often the most accessible submission route.
Command Line Submissions use our bespoke Webin-CLI program. This validates your submissions entirely before you complete them, allowing you maximum control of the process.
Programmatic Submissions are completed by preparing your submissions as XML documents and either sending them to ENA using a program such as cURL or using the Webin Portal.

The table below outlines what can be submitted through each submission route.

	Interactive	Webin-CLI	Programmatic
Study	Y	N	Y
Sample	Y	N	Y
Read data	Y	Y	Y
Genome Assembly	N	Y	N
Transcriptome Assembly	N	Y	N
Template Sequence	N	Y	N
Other Analyses	N	N	Y
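As a quick illustration of how the table can be used when planning a submission, the sketch below encodes it as a lookup. The names ``ROUTES``, ``routes_for`` and ``common_routes`` are hypothetical helpers introduced here for illustration only; they are not part of any ENA tool.

```python
# Submission routes per data type, transcribed from the table above.
ROUTES = {
    "Study": {"Interactive", "Programmatic"},
    "Sample": {"Interactive", "Programmatic"},
    "Read data": {"Interactive", "Webin-CLI", "Programmatic"},
    "Genome Assembly": {"Webin-CLI"},
    "Transcriptome Assembly": {"Webin-CLI"},
    "Template Sequence": {"Webin-CLI"},
    "Other Analyses": {"Programmatic"},
}


def routes_for(data_type):
    """Return the sorted list of routes that accept the given data type."""
    return sorted(ROUTES[data_type])


def common_routes(data_types):
    """Return the routes that accept every listed data type (may be empty)."""
    sets = [ROUTES[d] for d in data_types]
    return sorted(set.intersection(*sets)) if sets else []
```

For example, ``common_routes(["Study", "Sample", "Genome Assembly"])`` returns an empty list, which is why a genome submission typically combines two routes: one to register the study and samples, and Webin-CLI for the assembly itself.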

Register Metadata

Register Study

Data submissions to the ENA require that you register a study to contextualise and group your data. Details of how to do this can be found in our Study Registration Guide. Please ensure you describe your study adequately, as well as provide an informative title.

Your studies can now be claimed using your ORCID ID and/or assigned a DOI. Please see here and here for more information on these options.

Register Samples

Having registered a study, please proceed to register your samples. These are metadata objects that describe the source biological material of your experiments. Following this, the sequence data can be registered (as described in later sections).

Instructions for sample registration can be found in our Sample Registration Guide. As part of this process, you must select a sample checklist to describe metadata. If you require any support regarding sample metadata, please contact ena-path-collabs@ebi.ac.uk.

For interactive submission, download the sample checklist template from the Webin Portal and, once you have completed it, upload it in .tsv format on the Webin Portal to register your samples. If you are submitting samples programmatically, see programmatic sample submission.

Sample checklists

The following Sample checklists contain mandatory, recommended and optional metadata fields (<SAMPLE_ATTRIBUTE>), with a description for each field, to help with sample metadata completion. The checklists were agreed by the Genomic Standards Consortium (GSC). In addition to the core checklist for each life domain, the GSC also provides checklist extensions which may have the metadata field you are looking for.

You can use the Sample checklists portal to browse all ENA checklists. The pathogen specific checklists are provided below.

Checklist accession	Checklist name
ERC000028	ENA prokaryotic pathogen minimal sample checklist
ERC000029	ENA Global Microbial Identifier reporting standard checklist GMI_MDM:1.1
ERC000032	ENA Influenza virus reporting standard checklist
ERC000033	ENA virus pathogen reporting standard checklist
ERC000039	ENA parasite sample checklist
ERC000041	ENA Global Microbial Identifier Proficiency Test (GMI PT) checklist

Sample taxonomy

Our Tips for Sample Taxonomy page provides a helpful guide for choosing the right taxonomy for your pathogen submission.

You can search for suitable taxon IDs and find more information about a taxon ID using the taxonomy API endpoints:

https://www.ebi.ac.uk/ena/taxonomy/rest/suggest-for-submission/
https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/
https://www.ebi.ac.uk/ena/taxonomy/rest/any-name/
https://www.ebi.ac.uk/ena/taxonomy/rest/tax-id/
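A minimal sketch of building lookup URLs for these endpoints is shown below. It assumes that the search term is appended to the endpoint as a URL-encoded path segment and that the service responds with JSON; both are assumptions to check against the taxonomy service documentation. The helper names ``taxonomy_url`` and ``lookup`` are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE = "https://www.ebi.ac.uk/ena/taxonomy/rest"
ENDPOINTS = ("suggest-for-submission", "scientific-name", "any-name", "tax-id")


def taxonomy_url(endpoint, query):
    """Build a lookup URL by appending the URL-encoded query to an endpoint."""
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint}")
    # Assumption: the query is passed as the final path segment.
    return f"{BASE}/{endpoint}/{quote(str(query), safe='')}"


def lookup(endpoint, query):
    """Fetch a lookup result (needs network access; assumes a JSON response)."""
    with urlopen(taxonomy_url(endpoint, query)) as response:
        return json.load(response)
```

For instance, ``taxonomy_url("tax-id", 9606)`` builds the URL for the human host taxon used as an example in the host fields below.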

The strain of a pathogen may be specified through the taxonomy, through the strain field in the checklists, or both. Specifying the strain in both places makes it easier to find.

The ENA taxonomy API interface may also be used.

Sample Host

Every pathogen checklist includes host attribute fields which can be used to describe the host. Some guidance on completing the host fields is provided below. If you have any questions or concerns about pathogen sample metadata, please contact the helpdesk.

Pathogen checklists host fields:

host taxid:	NCBI taxon ID of the host, e.g. 9606
host health state:	Health status of the host at the time of sample collection.
host scientific name:	Scientific name of the natural (as opposed to laboratory) host of the organism from which the sample was obtained.
lab_host:	Scientific name of the laboratory host used to propagate the source organism from which the sample was obtained. The EBI cell line ontology may be used to find the name of the host cell line.

Submit Runs

After registering your study and samples, you can submit your read files along with experimental (library-related) metadata. See our Read Submission Guide for detailed instructions on submitting reads.

We encourage submissions to include information on specific protocols used for the experiment. This should be provided in the library description. This can be, for example, the name and/or URL to a specific protocol. View our listing of the available full experimental metadata dictionaries.

Note

Reads submitted to ENA should not contain human-identifiable sequences. Please filter out human reads prior to submission; if required, here is a tool which can be used.

Submit Assembled Sequences

The instructions below provide a quick guide to submitting a completed isolate pathogen genome assembly. This type of submission is classed as 'clone or isolate' ASSEMBLY_TYPE for the ENA submissions services. For submission of other types of nucleotide assembly data, please see the submission options here. For submission of targeted sequences, please refer to the targeted sequence submissions guide.

Genome assemblies must be submitted using Webin-CLI (command line interface). The guide for downloading and using Webin-CLI is here.

A note on assembly levels

This guide includes chromosome list file examples which are used for a chromosome level assembly. Note that ‘chromosome’ should here be understood as a general term for a range of complete replicons, including chromosomes of eukaryotes, prokaryotes, and viruses, as well as organellar chromosomes and plasmids. All of these may be submitted within the same chromosome-level assembly.

If your assembly is not complete, you can submit a contig or scaffold level assembly. Please refer to the explainer about assembly levels here.

Prepare files

Assembly file

The accepted format for unannotated genome assembly is fasta. For annotated genome assemblies, the accepted format is embl flat file. Please refer to Accepted Genome Assembly Data Formats for information about preparing these files.

Manifest file

The manifest file is a tab-separated .txt file for Webin-CLI assembly submission. It specifies metadata about the assembly, including the Study and Sample it is linked to. Please refer to the Clone or isolate genome manifest file guide for permitted values.

For example, the following manifest file represents a genome assembly consisting of contigs provided in one fasta file:

STUDY   TODO
SAMPLE   TODO
ASSEMBLYNAME   TODO
ASSEMBLY_TYPE clone or isolate
COVERAGE   TODO
PROGRAM   TODO
PLATFORM   TODO
MINGAPLENGTH   optional
MOLECULETYPE   genomic DNA
DESCRIPTION optional
RUN_REF optional
FASTA   genome.fasta.gz
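Because the manifest is a plain tab-separated file, it can also be generated from a script. The following is a minimal sketch: the function names ``manifest_text`` and ``write_manifest`` are hypothetical, and every ``TODO`` value is a placeholder carried over from the example above, not a real accession or measurement.

```python
def manifest_text(fields):
    """Render manifest fields as tab-separated NAME<TAB>VALUE lines."""
    lines = []
    for name, value in fields.items():
        if value is None:  # skip optional fields that were left unset
            continue
        lines.append(f"{name}\t{value}")
    return "\n".join(lines) + "\n"


def write_manifest(path, fields):
    """Write the rendered manifest to a text file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(manifest_text(fields))


# Placeholder values only; replace each TODO before submitting.
example = {
    "STUDY": "TODO",
    "SAMPLE": "TODO",
    "ASSEMBLYNAME": "TODO",
    "ASSEMBLY_TYPE": "clone or isolate",
    "COVERAGE": "TODO",
    "PROGRAM": "TODO",
    "PLATFORM": "TODO",
    "MINGAPLENGTH": None,  # optional
    "MOLECULETYPE": "genomic DNA",
    "DESCRIPTION": None,  # optional
    "RUN_REF": None,  # optional
    "FASTA": "genome.fasta.gz",
}
```

Calling ``write_manifest("manifest.txt", example)`` would produce a file with one field per line, omitting the optional fields that were set to ``None``.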

Chromosome list file

The chromosome list file must be provided when the submission contains assembled chromosomes. This is a tab-separated file with up to four columns. Each row describes one replicon unit within the assembly. Please refer to Accepted Genome Assembly Data Formats for permitted values.

.. tabs::

   .. tab:: Viruses

      By default, the chromosome **TOPOLOGY** is assumed to be linear. If the topology is circular, it must be specified.

      .. code:: none

         chr01   1 Monopartite

      .. code:: none

         chr01   1 circular-Monopartite viroid

      .. code:: none

         chr01   1 Multipartite
         chr02   2 Multipartite

   .. tab:: Bacteria

      By default, prokaryotic chromosomes and plasmids are assumed to reside in the cytoplasm; however, the 'plasmid'
      **CHROMOSOME_LOCATION** may be specified.
      By default, the **TOPOLOGY** is assumed to be linear, so in this example the circular topology is specified.

      .. code:: none

         chr01   1 circular-Chromosome
         chr02   2 circular-Chromosome plasmid
         chr03   3 circular-Chromosome plasmid

   .. tab:: Eukaryota

      By default eukaryotic chromosomes will be assumed to reside in the nucleus. By default the chromosome **TOPOLOGY**
      will be assumed to be linear, but it may also be specified.

      .. code:: none

         chr01   1 Linear-Chromosome
         chr02   2 Linear-Chromosome
         chr03   3 Linear-Chromosome
         chr04   4 Linear-Chromosome
         chrMi   MIT Linear-Chromosome Mitochondrion

If there are sequences that are associated with a specific chromosome but whose order and orientation are unknown, you can also add an unlocalised list file to the submission. Alternatively, an AGP file may be submitted to define unplaced sequences.
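Returning to the chromosome list file itself, its rows can be written from a script in the same tab-separated layout as the examples above. This is a sketch under stated assumptions: the function name ``chromosome_list_text`` is hypothetical, and the minimum of three columns per row is inferred from the examples in this guide rather than taken from the format specification.

```python
def chromosome_list_text(rows):
    """Render rows as tab-separated lines with up to four columns.

    Each row holds an object name, a chromosome name and a chromosome
    type, plus an optional fourth item for the chromosome location.
    """
    lines = []
    for row in rows:
        # Assumption: three columns are required, the fourth is optional.
        if not 3 <= len(row) <= 4:
            raise ValueError(f"expected 3 or 4 columns, got {len(row)}: {row!r}")
        lines.append("\t".join(str(col) for col in row))
    return "\n".join(lines) + "\n"


# Rows copied from the Bacteria example in this guide.
bacteria = [
    ("chr01", 1, "circular-Chromosome"),
    ("chr02", 2, "circular-Chromosome", "plasmid"),
    ("chr03", 3, "circular-Chromosome", "plasmid"),
]
```

The returned text can be written to a file and referenced from the manifest; see Accepted Genome Assembly Data Formats for the permitted values in each column.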

Webin-CLI submission

When you have prepared your files, including the assembly, the manifest file and any additional files for higher-level assemblies, you can validate and test your submission using the Webin-CLI -validate flag. When you are ready to submit the assembly, use the -submit flag instead.

Webin-CLI validate command:

java -jar webin-cli-6.4.0.jar -userName Webin-XXXX -password XXXX -context genome -manifest manifest.txt -validate
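If you run Webin-CLI from a script, the same command can be assembled as an argument list. The sketch below uses only the flags shown in this guide. The helper names ``webin_cli_args`` and ``run_webin_cli`` are hypothetical, and the jar file name, user name and password are placeholders.

```python
import subprocess


def webin_cli_args(jar, user, password, manifest, submit=False):
    """Build the argument list for validating or submitting a genome."""
    return [
        "java", "-jar", jar,
        "-userName", user,
        "-password", password,
        "-context", "genome",
        "-manifest", manifest,
        # Validate by default; switch to -submit only when ready.
        "-submit" if submit else "-validate",
    ]


def run_webin_cli(args):
    """Run Webin-CLI and return its exit code (needs Java and the jar file)."""
    return subprocess.run(args, check=False).returncode
```

Keeping validation as the default means that an actual submission has to be requested explicitly with ``submit=True``.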

Data Release and Citing

Once the data is submitted, it will take some time to be processed and archived. If your data is set to public, it will be made public and accessible from the Pathogens Portal.

For more information about data release, please see the following pages:

Data Release Policies
Accession numbers
Citing and Orcid data claiming
