This guide provides general information and help for submitting pathogen sequence data to the European Nucleotide Archive (ENA). The ENA is a partner of the INSDC (International Nucleotide Sequence Database Collaboration), and provides an entry point for INSDC data. All public pathogen data is made available by the ENA to explore and browse via the Pathogens Portal.
This is a walk-through guide for submitting pathogen-related raw read files and assembled 'clone or isolate' genomes. The guide frequently refers to our ENA Data Submission pages. If your pathogen dataset is not raw reads or a genome, or you have any other queries about archiving your data at the ENA, you can also contact us at [email protected].
Tip
Are you submitting SARS-CoV-2 or Monkeypox virus data?
We have tailored support for SARS-CoV-2 and Monkeypox virus data submissions here:
For small-scale SARS-CoV-2 viral data submissions, with no prior knowledge of ENA submission routes, we have developed a drag and drop submissions tool. Please complete the form if you would like to submit your data using this route.
Before you can submit data to the ENA you must register a Webin submission account.
Please navigate to the Webin Portal, click the ‘Register’ button, and complete the registration form.
Before submitting data to ENA, it is important to familiarise yourself with the ENA metadata model and what parts of your research project can be represented by which metadata objects. This will determine what you need to submit.
ENA allows submissions via three routes, each of which is appropriate for a different set of submission types. You may be required to use more than one in the process of submitting your data:
- Interactive Submissions are completed by filling out web forms directly in your browser and downloading template spreadsheets that can be completed off-line and uploaded to ENA. This is often the most accessible submission route.
- Command Line Submissions use our bespoke Webin-CLI program. This validates your submissions entirely before you complete them, allowing you maximum control of the process.
- Programmatic Submissions are completed by preparing your submissions as XML documents and either sending them to ENA using a program such as cURL or using the Webin Portal.
The table below outlines what can be submitted through each submission route.
Data type | Interactive | Webin-CLI | Programmatic |
Study | Y | N | Y |
Sample | Y | N | Y |
Read data | Y | Y | Y |
Genome Assembly | N | Y | N |
Transcriptome Assembly | N | Y | N |
Template Sequence | N | Y | N |
Other Analyses | N | N | Y |
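The table can also be read as a lookup structure. The sketch below transcribes it into a Python dictionary and finds which routes, if any, cover a whole set of data types. The helper name `routes_for` is hypothetical and is not part of any ENA tool.

```python
# Submission routes per data type, transcribed from the table above.
ROUTES = {
    "Study": {"Interactive", "Programmatic"},
    "Sample": {"Interactive", "Programmatic"},
    "Read data": {"Interactive", "Webin-CLI", "Programmatic"},
    "Genome Assembly": {"Webin-CLI"},
    "Transcriptome Assembly": {"Webin-CLI"},
    "Template Sequence": {"Webin-CLI"},
    "Other Analyses": {"Programmatic"},
}


def routes_for(data_types):
    """Return the sorted routes that can handle every listed data type."""
    common = {"Interactive", "Webin-CLI", "Programmatic"}
    for data_type in data_types:
        # Keep only the routes that also accept this data type.
        common &= ROUTES[data_type]
    return sorted(common)
```

For instance, `routes_for(["Study", "Sample", "Genome Assembly"])` returns an empty list: no single route covers all three, which is why a genome submission typically combines Webin-CLI with one of the other routes.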
Data submissions to the ENA require that you register a study to contextualise and group your data. Details of how to do this can be found in our Study Registration Guide. Please ensure you describe your study adequately, as well as provide an informative title.
Your studies can now be claimed using your ORCID ID and/or assigned a DOI. Please see here and here for more information on these options.
Having registered a study, please proceed to register your samples. These are metadata objects that describe the source biological material of your experiments. Following this, the sequence data can be registered (as described in later sections).
Instructions for sample registration can be found in our Sample Registration Guide. As part of this process, you must select a sample checklist to describe metadata. If you require any support regarding sample metadata, please contact [email protected].
For interactive submission, download the sample checklist template from the Webin Portal and, once completed, submit it in .tsv format on the Webin Portal to register your samples. See programmatic sample submission if you are submitting samples programmatically.
The following sample checklists contain mandatory, recommended and optional metadata fields (<SAMPLE_ATTRIBUTE>), with a description for each field, to help with sample metadata completion.
The checklists were agreed by the Genomic Standards Consortium (GSC). In addition to the core checklist for each life domain, the GSC also provides checklist extensions which may have the metadata field you are looking for.
You can use the Sample checklists portal to browse all ENA checklists. The pathogen specific checklists are provided below.
Accession | Checklist name |
ERC000028 | ENA prokaryotic pathogen minimal sample checklist |
ERC000029 | ENA Global Microbial Identifier reporting standard checklist GMI_MDM:1.1 |
ERC000032 | ENA Influenza virus reporting standard checklist |
ERC000033 | ENA virus pathogen reporting standard checklist |
ERC000039 | ENA parasite sample checklist |
ERC000041 | ENA Global Microbial Identifier Proficiency Test (GMI PT) checklist |
Our Tips for Sample Taxonomy page provides a helpful guide for choosing the right taxonomy for your pathogen submission.
You can search for suitable taxon IDs and find more information about a taxon ID using the taxonomy API endpoints:
- https://www.ebi.ac.uk/ena/taxonomy/rest/suggest-for-submission/
- https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/
- https://www.ebi.ac.uk/ena/taxonomy/rest/any-name/
- https://www.ebi.ac.uk/ena/taxonomy/rest/tax-id/
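As a rough illustration, these endpoints can be queried from a script. The sketch below assumes that the search term is appended to the endpoint URL after the trailing slash and that the service responds with JSON. Both are assumptions to verify against the taxonomy service documentation, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

BASE = "https://www.ebi.ac.uk/ena/taxonomy/rest"
ENDPOINTS = {"suggest-for-submission", "scientific-name", "any-name", "tax-id"}


def taxonomy_url(endpoint, query):
    """Build a lookup URL, percent-encoding the query (e.g. spaces in names)."""
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint}")
    # Assumption: the query term follows the endpoint path directly.
    return f"{BASE}/{endpoint}/{urllib.parse.quote(str(query), safe='')}"


def fetch(endpoint, query, timeout=30):
    """Fetch and decode a response; assumes the service returns JSON."""
    with urllib.request.urlopen(taxonomy_url(endpoint, query), timeout=timeout) as resp:
        return json.load(resp)
```

A call such as `fetch("scientific-name", "Influenza A virus")` would then request the record for that name, provided network access is available.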
The strain of a pathogen may be specified using the taxonomy, using the strain field in the checklists, or both. Specifying the strain in both places makes it easier to find.
The ENA taxonomy API may also be used.
Every pathogen checklist includes host attribute fields which can be used to describe the host. Some guidance on filling in the host fields is provided below. If you have any questions or concerns about pathogen sample metadata, please contact the helpdesk.
Pathogen checklists host fields:
Field | Description |
---|---|
host taxid | NCBI taxon id of the host, e.g. 9606 |
host health state | Health status of the host at the time of sample collection |
host scientific name | Scientific name of the natural (as opposed to laboratory) host to the organism from which the sample was obtained |
lab_host | Scientific name of the laboratory host used to propagate the source organism from which the sample was obtained. The EBI cell line ontology may be used to find the name for the host cell line |
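Before filling in a checklist spreadsheet, it can help to sanity-check the host values in a script. The following is a minimal sketch, not an ENA validator: the helper `check_host_fields` is hypothetical, and whether a field is mandatory depends on the checklist you choose, so empty fields are only reported, not rejected.

```python
# Host field names as listed in the table above.
HOST_FIELDS = ("host taxid", "host health state", "host scientific name", "lab_host")


def check_host_fields(sample):
    """Return a list of human-readable problems found in the host fields."""
    problems = []
    taxid = str(sample.get("host taxid", "")).strip()
    # A taxon id such as 9606 should consist of digits only.
    if taxid and not taxid.isdigit():
        problems.append(f"host taxid should be a numeric taxon id, got {taxid!r}")
    for field in HOST_FIELDS:
        if not str(sample.get(field, "")).strip():
            problems.append(f"{field} is empty")
    return problems
```

An empty result means no obvious problems were spotted; the checklist itself remains the authority on permitted values.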
After registering your study and samples, you can submit your read files along with experimental (library-related) metadata. See our Read Submission Guide for detailed instructions on submitting reads.
We encourage submissions to include information on specific protocols used for the experiment. This should be provided in the library description, for example as the name and/or URL of a specific protocol. View our listing of the available full experimental metadata dictionaries.
Note
Reads submitted to ENA should not contain identifiable human reads. Please filter out human reads prior to submission. If required, here is a tool which can be used.
The instructions below provide a quick guide to submitting a completed isolate pathogen genome assembly. This type of submission is classed as 'clone or isolate' ASSEMBLY_TYPE for the ENA submissions services. For submission of other types of nucleotide assembly data, please see the submission options here. For submission of targeted sequences, please refer to the targeted sequence submissions guide.
Genome assemblies must be submitted using Webin-CLI (command line interface). The guide for downloading and using Webin-CLI is here.
A note on assembly levels
This guide includes chromosome list file examples which are used for a chromosome level assembly. Note that ‘chromosome’ should here be understood as a general term for a range of complete replicons, including chromosomes of eukaryotes, prokaryotes, and viruses, as well as organellar chromosomes and plasmids. All of these may be submitted within the same chromosome-level assembly.
If your assembly is not complete, you can submit a contig or scaffold level assembly. Please refer to the explainer about assembly levels here.
The accepted format for unannotated genome assemblies is FASTA. For annotated genome assemblies, the accepted format is the EMBL flat file. Please refer to Accepted Genome Assembly Data Formats for information about preparing these files.
The manifest file is a tab-separated .txt file for Webin-CLI assembly submission. It specifies metadata about the assembly, including the Study and Sample it is linked to. Please refer to the Clone or isolate genome manifest file guide for permitted values.
For example, the following manifest file represents a genome assembly consisting of contigs provided in one fasta file:
    STUDY           TODO
    SAMPLE          TODO
    ASSEMBLYNAME    TODO
    ASSEMBLY_TYPE   clone or isolate
    COVERAGE        TODO
    PROGRAM         TODO
    PLATFORM        TODO
    MINGAPLENGTH    optional
    MOLECULETYPE    genomic DNA
    DESCRIPTION     optional
    RUN_REF         optional
    FASTA           genome.fasta.gz
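If you prepare many assemblies, the manifest can be generated rather than typed by hand. The sketch below builds the tab-separated text from a dictionary and skips optional fields left empty. The helper name `manifest_text` is hypothetical, and the `TODO` values are placeholders to replace with your own.

```python
def manifest_text(fields):
    """Return manifest content: one 'FIELD<TAB>value' line per non-empty entry."""
    lines = [
        f"{name}\t{value}"
        for name, value in fields.items()
        if value not in (None, "")  # leave out optional fields with no value
    ]
    return "\n".join(lines) + "\n"


manifest = manifest_text({
    "STUDY": "TODO",            # placeholder: your study accession
    "SAMPLE": "TODO",           # placeholder: your sample accession
    "ASSEMBLYNAME": "TODO",
    "ASSEMBLY_TYPE": "clone or isolate",
    "COVERAGE": "TODO",
    "PROGRAM": "TODO",
    "PLATFORM": "TODO",
    "MINGAPLENGTH": None,       # optional, omitted here
    "MOLECULETYPE": "genomic DNA",
    "DESCRIPTION": None,        # optional, omitted here
    "RUN_REF": None,            # optional, omitted here
    "FASTA": "genome.fasta.gz",
})
```

The resulting string can be written to `manifest.txt` with, for example, `pathlib.Path("manifest.txt").write_text(manifest)`.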
The chromosome list file must be provided when the submission contains assembled chromosomes. This is a tab-separated file with up to four columns. Each row describes one replicon unit within the assembly. Please refer to Accepted Genome Assembly Data Formats for permitted values.
.. tabs::

   .. tab:: Viruses

      By default the chromosome **TOPOLOGY** will be assumed to be linear, therefore if the topology is circular, it must be specified.

      .. code:: none

         chr01    1    Monopartite

      .. code:: none

         chr01    1    circular-Monopartite    viroid

      .. code:: none

         chr01    1    Multipartite
         chr02    2    Multipartite

   .. tab:: Bacteria

      By default prokaryotic chromosomes and plasmids will be assumed to reside in the cytoplasm, however, the 'plasmid' **CHROMOSOME_LOCATION** may be specified. By default the **TOPOLOGY** will be assumed to be linear, so in this example the circular topology was specified.

      .. code:: none

         chr01    1    circular-Chromosome
         chr02    2    circular-Chromosome    plasmid
         chr03    3    circular-Chromosome    plasmid

   .. tab:: Eukaryota

      By default eukaryotic chromosomes will be assumed to reside in the nucleus. By default the chromosome **TOPOLOGY** will be assumed to be linear, but it may also be specified.

      .. code:: none

         chr01    1      Linear-Chromosome
         chr02    2      Linear-Chromosome
         chr03    3      Linear-Chromosome
         chr04    4      Linear-Chromosome
         chrMi    MIT    Linear-Chromosome    Mitochondrion
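A chromosome list like the examples above can be produced from a script. The sketch below joins rows with tabs and rejects rows that do not have three or four columns, which is an assumption based on the examples shown here. The helper name `chromosome_list_text` is hypothetical.

```python
def chromosome_list_text(rows):
    """Join rows into tab-separated lines; each row needs 3 or 4 columns."""
    lines = []
    for row in rows:
        # Assumption: three columns are required, the fourth (location) is optional.
        if len(row) not in (3, 4):
            raise ValueError(f"expected 3 or 4 columns, got {len(row)}: {row!r}")
        lines.append("\t".join(str(col) for col in row))
    return "\n".join(lines) + "\n"


# The bacterial example: one chromosome and two plasmids, all circular.
bacteria = chromosome_list_text([
    ("chr01", 1, "circular-Chromosome"),
    ("chr02", 2, "circular-Chromosome", "plasmid"),
    ("chr03", 3, "circular-Chromosome", "plasmid"),
])
```

Check the Accepted Genome Assembly Data Formats page for the permitted values and for how the file should be named and compressed before referencing it from the manifest.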
If there are sequences that are associated with a specific chromosome, but their order and orientation are unknown, you can also add an unlocalised list file to the submission. Alternatively, an AGP file may be submitted to define unplaced sequences.
When you have prepared your files, including the assembly, the manifest file and any additional files for higher-level assemblies, you can validate and test your submission using the Webin-CLI -validate flag. When you are ready to submit the assembly, you can use the -submit flag.
Webin-CLI validate command:
java -jar webin-cli-6.4.0.jar -userName Webin-XXXX -password XXXX -context genome -manifest manifest.txt -validate
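To run the same command from a script, for example when validating several manifests in turn, the arguments can be assembled in Python. This is a sketch under assumptions: the helper names are hypothetical, the jar file name and credentials are the placeholders from the command above, and only the -validate and -submit flags mentioned in this guide are used.

```python
import subprocess


def webin_cli_command(jar, user, password, manifest, action="validate", context="genome"):
    """Build the Webin-CLI argument list for a validate or submit run."""
    if action not in ("validate", "submit"):
        raise ValueError(f"action must be 'validate' or 'submit', got {action!r}")
    return [
        "java", "-jar", jar,
        "-userName", user,
        "-password", password,
        "-context", context,
        "-manifest", manifest,
        f"-{action}",
    ]


def run_webin_cli(**kwargs):
    """Run the command and return its exit code (requires Java and the jar file)."""
    return subprocess.run(webin_cli_command(**kwargs)).returncode
```

Passing the arguments as a list avoids shell quoting problems with unusual file names. Running with `action="validate"` first and switching to `action="submit"` only once validation succeeds mirrors the two-step process described above.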
Once the data is submitted, it will take some time to be processed and archived. If your data is set to public, it will be made public and accessible from the Pathogens Portal.
For more information about data release, please see the following pages: