
Final output files

Pieter Verschaffelt edited this page Apr 10, 2025 · 2 revisions

This page lists all output files generated by the Unipept database construction script. Each file is a compressed TSV file that represents a single table of the Unipept database and can therefore be loaded directly into an SQL database.

ec_numbers.tsv.gz

Every line in this file corresponds to a single EC-number.

Example

1	1.-.-.-	Oxidoreductases
2	1.1.-.-	Acting on the CH-OH group of donors
3	1.1.1.-	With NAD(+) or NADP(+) as acceptor
4	1.1.2.-	With a cytochrome as acceptor
5	1.1.3.-	With oxygen as acceptor
6	1.1.4.-	With a disulfide as acceptor
7	1.1.5.-	With a quinone or similar compound as acceptor
8	1.1.7.-	With an iron-sulfur protein as acceptor
9	1.1.9.-	With a copper protein as acceptor

Columns

  1. id: Internal identifier of this EC-number, as used by Unipept (these do not correspond to identifiers used by external organizations).
  2. ec_number_code: The official EC-number identifier, as provided by the Enzyme Commission.
  3. name: Name of this EC-number.
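
As an illustration of loading one of these files into an SQL database, the following minimal sketch reads a gzip-compressed TSV stream and inserts its rows into an SQLite table. The helper names (`read_tsv_gz`, `load_ec_numbers`), the table definition and the choice of SQLite are assumptions made for this example and are not taken from the Unipept code base. The demo compresses two rows from the example above in memory, so no file on disk is needed.

```python
import csv
import gzip
import io
import sqlite3


def read_tsv_gz(source):
    """Yield rows (lists of strings) from a gzip-compressed TSV path or binary file object."""
    with gzip.open(source, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat quote characters as ordinary data.
        yield from csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


def load_ec_numbers(connection, source):
    """Create a (hypothetical) ec_numbers table and fill it from the TSV stream."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS ec_numbers ("
        "id INTEGER PRIMARY KEY, ec_number_code TEXT, name TEXT)"
    )
    connection.executemany(
        "INSERT INTO ec_numbers VALUES (?, ?, ?)",
        ((int(row[0]), row[1], row[2]) for row in read_tsv_gz(source)),
    )
    connection.commit()


# Demo: two rows copied from the example above, compressed in memory.
sample = gzip.compress(
    "1\t1.-.-.-\tOxidoreductases\n"
    "2\t1.1.-.-\tActing on the CH-OH group of donors\n".encode("utf-8")
)
connection = sqlite3.connect(":memory:")
load_ec_numbers(connection, io.BytesIO(sample))
ec_rows = connection.execute(
    "SELECT id, ec_number_code, name FROM ec_numbers ORDER BY id"
).fetchall()
```

For a real file, a path such as `"ec_numbers.tsv.gz"` could be passed instead of the in-memory buffer.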

go_terms.tsv.gz

Every line in this file corresponds to a single GO-term.

Example

3	GO:0000003	biological process	reproduction
4	GO:0019952	biological process	reproduction
5	GO:0050876	biological process	reproduction
6	GO:0000005	molecular function	obsolete ribosomal chaperone activity
7	GO:0000006	molecular function	high-affinity zinc transmembrane transporter activity

Columns

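  1. id: Internal identifier of this GO-term, as used by Unipept (these do not correspond to identifiers used by external organizations).
  2. go_term_code: The official GO-term identifier, as provided by the Gene Ontology.
  3. namespace: The Gene Ontology namespace that this GO-term belongs to (for example biological process or molecular function).
  4. name: Name of this GO-term.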

interpro_entries.tsv.gz

Every line in this file corresponds to a single InterPro-entry.

Example

1	IPR000126	Active_site	Serine proteases, V8 family, serine active site
2	IPR000138	Active_site	Hydroxymethylglutaryl-CoA lyase, active site
3	IPR000169	Active_site	Cysteine peptidase, cysteine active site
4	IPR000180	Active_site	Membrane dipeptidase, active site
5	IPR000189	Active_site	Prokaryotic transglycosylase, active site

Columns

  1. id: Internal identifier of this InterPro-entry, as used by Unipept (these do not correspond to identifiers used by external organizations).
  2. interpro_entry_code: The official InterPro-entry identifier, as provided by the InterPro organization.
  3. category: The InterPro entry type (for example Active_site).
  4. name: Name of this InterPro-entry.

lineages.tsv.gz

A file containing all NCBI taxon IDs and, for every taxonomic rank supported by Unipept, a link to the ancestor of that taxon at that rank in the NCBI taxonomy.

Example

1       \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N       \N      \N      \N      \N      \N      \N      \N      \N
2       2       \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N      \N       \N      \N      \N      \N      \N      \N      \N      \N
6       2       \N      \N      \N      1224    \N      \N      28211   \N      \N      356     \N      \N      \N      335928  \N      \N      \N      6       \N      \N      \N      \N      \N      \N      \N      \N

Columns

See this page for all taxonomic ranks that are supported by Unipept.

  • taxon_id: NCBI identifier of this taxon.
  • superkingdom: Parent of this taxon on the superkingdom rank.
  • ...
  • forma: Parent of this taxon on the forma rank.
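
A lineage row can be read by splitting on tabs and treating `\N` as a missing value. The sketch below assumes that `\N` marks a rank for which the taxon has no ancestor (the same marker MySQL uses for NULL in text imports); the function name `parse_lineage_line` is hypothetical. The sample row is a shortened version of the third example row above and keeps only the first five rank columns.

```python
from typing import List, Optional, Tuple

NULL_MARKER = r"\N"  # assumed to denote "no ancestor on this rank"


def parse_lineage_line(line: str) -> Tuple[int, List[Optional[int]]]:
    """Split one lineages row into the taxon ID and one optional ancestor ID per rank."""
    fields = line.rstrip("\n").split("\t")
    taxon_id = int(fields[0])
    ranks = [None if field == NULL_MARKER else int(field) for field in fields[1:]]
    return taxon_id, ranks


# Shortened sample row: taxon 6, followed by the first five rank columns only.
sample_line = "\t".join(["6", "2", NULL_MARKER, NULL_MARKER, NULL_MARKER, "1224"])
taxon_id, ranks = parse_lineage_line(sample_line)
```

The position of each value in `ranks` corresponds to the order of the rank columns listed above, so the first entry is the ancestor on the superkingdom rank.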

sequences.tsv.gz

A file containing all tryptic peptide sequences that were generated by an in-silico tryptic digest of the UniProtKB proteins. Both the sequences in which the amino acids I and L are considered equal and the sequences in which they are kept distinct are generated and stored in this file.
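
To make the two notions above concrete, here is a minimal sketch of an in-silico tryptic digest and of I/L equalization. It assumes the commonly used trypsin rule (cleave after K or R unless the next residue is P) and replaces every I by L to equate the two residues. The exact cleavage rules and any peptide length limits applied by the Unipept construction script are not described on this page, so this may differ from the real pipeline. The function names are hypothetical.

```python
import re


def tryptic_digest(protein: str) -> list:
    """Split a protein after every K or R that is not followed by P (assumed rule)."""
    peptides = re.split(r"(?<=[KR])(?!P)", protein)
    # A cleavage site at the very end of the protein yields an empty trailing piece.
    return [peptide for peptide in peptides if peptide]


def equate_il(peptide: str) -> str:
    """Return the peptide with I and L treated as the same residue (I becomes L)."""
    return peptide.replace("I", "L")


peptides = tryptic_digest("AAKPGRLLIK")
equated = [equate_il(peptide) for peptide in peptides]
```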

Example

2	AAAAAA	87882	87882	{"num":{"all":2,"EC":2,"GO":2,"IPR":2},"data":{"GO:0004477":2,"GO:0004488":2,"GO:0000105":2,"GO:0009086":2,"GO:0006164":2,"GO:0035999":2,"EC:1.5.1.5":2,"EC:3.5.4.9":2,"IPR:IPR046346":2,"IPR:IPR036291":2,"IPR:IPR000672":2,"IPR:IPR020630":2,"IPR:IPR020867":2,"IPR:IPR020631":2}}	{"num":{"all":2,"EC":2,"GO":2,"IPR":2},"data":{"GO:0004477":2,"GO:0004488":2,"GO:0000105":2,"GO:0009086":2,"GO:0006164":2,"GO:0035999":2,"EC:1.5.1.5":2,"EC:3.5.4.9":2,"IPR:IPR046346":2,"IPR:IPR036291":2,"IPR:IPR000672":2,"IPR:IPR020630":2,"IPR:IPR020867":2,"IPR:IPR020631":2}}
3	AAAAAAAAA	272568	272568	{"num":{"all":1,"EC":1,"GO":1,"IPR":1},"data":{"GO:0005737":1,"GO:1990904":1,"GO:0005840":1,"GO:0003735":1,"GO:0006412":1,"EC:":1,"IPR:IPR000307":1,"IPR:IPR020592":1,"IPR:IPR023803":1}}	{"num":{"all":1,"EC":1,"GO":1,"IPR":1},"data":{"GO:0005737":1,"GO:1990904":1,"GO:0005840":1,"GO:0003735":1,"GO:0006412":1,"EC:":1,"IPR:IPR000307":1,"IPR:IPR020592":1,"IPR:IPR023803":1}}
4	AAAAAAAAAAAAAAAAAAAAQAQATSSYPSAISPGSK	7227	\N	{"num":{"all":1,"EC":1,"GO":1,"IPR":1},"data":{"GO:0000785":1,"GO:0005634":1,"GO:0035517":1,"GO:0003682":1,"GO:0035800":1,"GO:0003677":1,"GO:0046872":1,"GO:0009887":1,"GO:0007469":1,"GO:0009948":1,"GO:0007475":1,"GO:0001709":1,"GO:0006325":1,"GO:0035522":1,"GO:0045892":1,"GO:0045893":1,"GO:0045944":1,"GO:0010468":1,"GO:0006357":1,"GO:0045498":1,"GO:0035186":1,"EC:":1,"IPR:IPR026905":1,"IPR:IPR024811":1,"IPR:IPR028020":1,"IPR:IPR044867":1}}	\N

Columns

  1. id: Internal identifier of this sequence. This identifier is used by peptides.tsv.gz to refer to sequences in this file.
  2. sequence: String-representation of the sequence of amino acids that this peptide consists of.
  3. lca: Lowest common ancestor of the taxa associated with all proteins that contain this peptide sequence (in the case that I and L are not considered equal).
  4. lca_il: Lowest common ancestor of the taxa associated with all proteins that contain this peptide sequence (in the case that I and L are considered equal).
  5. fa: JSON-object containing the functional annotations of this sequence (in the case that I and L are not considered equal).
  6. fa_il: JSON-object containing the functional annotations of this sequence (in the case that I and L are considered equal).
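
The fa and fa_il columns can be decoded with any JSON parser. The sketch below groups the keys of the `data` object by their prefix (`GO`, `EC` or `IPR`), based only on the structure visible in the example rows; the function name `group_annotations` is hypothetical, and treating `\N` as "no annotations" is an assumption. The sample value is a shortened version of the fa column of the second example row.

```python
import json
from collections import defaultdict

NULL_MARKER = r"\N"  # assumed to denote a missing fa / fa_il value


def group_annotations(fa_field: str):
    """Group the annotation counts in an fa / fa_il value by key prefix (GO, EC, IPR)."""
    if fa_field == NULL_MARKER:
        return None
    fa = json.loads(fa_field)
    grouped = defaultdict(dict)
    for key, count in fa["data"].items():
        prefix = key.split(":", 1)[0]
        grouped[prefix][key] = count  # keep the full key, including its prefix
    return dict(grouped)


# Shortened sample taken from the example rows above.
sample_fa = (
    '{"num":{"all":1,"EC":1,"GO":1,"IPR":1},'
    '"data":{"GO:0005737":1,"GO:1990904":1,"EC:":1,"IPR:IPR000307":1}}'
)
grouped = group_annotations(sample_fa)
```

Keys such as `EC:` (a prefix without an identifier) occur in the example rows and are passed through unchanged here, since this page does not state how they should be interpreted.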

taxons.tsv.gz

A file containing all NCBI taxon IDs, each mapped onto the associated name, rank and parent node in the NCBI taxonomy.

Example

1       root    no rank 1
2       Bacteria        superkingdom    131567
6       Azorhizobium    genus   335928
7       Azorhizobium caulinodans        species 6
9       Buchnera aphidicola     species 32199

Columns

  1. id: The NCBI ID for this taxon.
  2. name: Official name of this taxon, as defined by NCBI.
  3. rank: NCBI rank for this taxon.
  4. parent_id: NCBI ID of the parent taxon in the NCBI taxonomy.
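
Because every row stores the ID of its parent, the file can be used to walk from any taxon towards the root. The following sketch builds a child-to-parent map and follows it until it reaches a taxon that is its own parent (as the root is in the example above) or a taxon that is missing from the map. The function names are hypothetical, and the sample contains only four of the example rows, so the walk stops early at a parent that is not included.

```python
def build_parent_map(lines):
    """Map each taxon ID to its parent ID, reading tab-separated taxons rows."""
    parents = {}
    for line in lines:
        taxon_id, _name, _rank, parent_id = line.rstrip("\n").split("\t")
        parents[int(taxon_id)] = int(parent_id)
    return parents


def path_to_root(parents, taxon_id):
    """Follow parent links upwards; stop at a self-parent or an unknown taxon."""
    path = [taxon_id]
    seen = {taxon_id}
    while True:
        parent = parents.get(path[-1])
        if parent is None or parent in seen:
            break
        path.append(parent)
        seen.add(parent)
    return path


sample_rows = [
    "1\troot\tno rank\t1",
    "2\tBacteria\tsuperkingdom\t131567",
    "6\tAzorhizobium\tgenus\t335928",
    "7\tAzorhizobium caulinodans\tspecies\t6",
]
parents = build_parent_map(sample_rows)
path = path_to_root(parents, 7)
```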

uniprot_entries.tsv.gz

A file containing one protein per line, including all of its functional and taxonomic annotations. Each of these proteins comes from the UniProtKB resource.

Example

1	C0HLH3	7	2546662	swissprot	Collagen alpha-1(I) chain	GGISVPGPMGPSGPRGLPGPPGPGPQGFQGPPGEPGEPGSSGPMGPRGPPGPPGKNGDDGEAGKPGRPGERGPPGPQGARGLPGTAGLPGMKGHRGFSGLDGAKGDAGPAGPKGEPGSPGENGAPGQMGPRGPGERGRPGASGPAGARGNDGATGAAGPPGPTGPAGPPGFPGAVGAKGEAGPQGARGSEGPQGVRGEPGPPGPAGAAGPAGNPGADGQPGAKGANGAPGIAGAPGFPGARGPSGPQGPSGPPGPKGNSGEPGAPGGEPGPTGIQGPPGPAGEEGKRGARGEPGPTGLPGPPGERGGPGSRGFPGADGVAGPKGSPGEAGRPGEAGLPGAKGLTGSPGSPGPDGKTGPPGPAGQDGRPGPPGPPGARGQAGVMGFPGPKGAAGEPGKAGERGVPGPPGAVGPAGKDGEAGAQGPPGPAGPAGERGEQGPAGPGFQGLPGPAGPPGEAGKPGEQGVPGDLGAPGPSGARGERGFPGERGVQGPPGPAGPRGSSQGAPGLQGMPGERGAAGLPGPKGDRGDAGPKGADGAPGKDGVRGLTGPIGPPGPAGAPGDKGESGPSGPAGPTGARGAPGDRGEPGPPGPAGFAGPPGADGQPGAKGEPGDAGAKGDAGPPGPAGPTGAPGPIGNLGAPGPKGARGSAGPPGATGFPGAAGRVGPPGPSGNAGPPGPPGPVGKEGGKGPRGETGPAGEVGPPGPPGPSGEKGSPGADGPAGAPGTPGPQGISGQRGVVGLPGQRGERGFPGLPGPSGEPGKQGPSGSSGERGPPGPMGPPGLAGPPGESGREGPGAEGSPGRDGSPGPKGDRGEGPPGAPGAPGAPGPVGPAGKSGDRGETGPGPAGPAGPAGARGPAGPQGPRGDKGETGEQGDRGIKGHRGFSGLQGPAGPPGSPGEQGPSGASGPAGPRGPPGSAGSPGKDGLNGLPGPIGPPGPRGRTGDAGPVGPPGPPGPPGPPGPP
2	C0HLH1	6	2546656	swissprot	Collagen alpha-1(I) chain	SAGGISVPGPMGPSGPRGLPGPPGAPGPQGFQGPPGEPGEPGSGPMGPRGPPGPPGKNGDDGEAGKPGRPGERGPPGPQGARGLPGTAGLPGMKGHRGFSGLDGAKGDAGPAGPKGAPGQMGPRGPGERGRPGASGPAGARGNDGATGAAGPPGPTGPAGPPGFPGAVGAKGEAGPQGARGSEGPQGVRGEPGPPGPAGAAGPAGNPGADGQPGAKGANGAPGIAGAPGFPGARGPSGPQGPSGPPGPKGNSGEPGAPGSKAKGEPGPTGIQGPPGPAGEEGKRGARGEPGPTGLPGPGERGGPGSRGFPGADGVAGPKGPAGERGSPGPAGPKGSPGEAGRPGEAGLPGAKGLTGSPGSPGPDGKTGPPGPAGQDGRPGPPGPPGARGQAGVMGFPGPKGAAGEPGKAGERGVPGPPGAVGPAGKDGEAGAQGPPGPAGPAGERGEQGPAGPGFQGLPGPAGPPGEAGKPGEQGVPGDLGAPGPSGARGERGFPGERGVQGPPGPAGPRGSSQGAPGLQGMPGERGAAGLPGPKGDRGDAGPKGADGAPGKDGVRGLTGPIGPPGPAGAPGDKGESGPSGPAGPTGARGAPGDRGEPGPPGPAGFAGPPGADGQPGAKGEPGDAGAKGDAGPPGPAGPTGAPGPIGNLGAPGPKGARGSAGPPGATGFPGAAGRVGPPGPSGNAGPPGPPGPVGKEGGKGPRGETGPAGEVGPPGPPGPGEKGSPGADGPAGAPGTPGPQGISGQRGVVGLPGQRGERGFPGLPGPSGEPGKQGPSGSSGERGPPGPMGPPGLAGPPGESGREGPGAEGSPGRDGSPGPKGDRGETGPGPPGAPGAPGAPGPVGPAGKSGDRGETGPGPAGPAGPAGARGPAGPQGPRGDKGETGEQGDRGIKRGFSGLQGPAGPPGSPGEQGPSGASGPAGPRGPPGSAGSPGKDGLNGLPGPIGPPGPRGRTGDAGPVGPPGPPGPPGPPGPP

Columns

  1. id: Internal identifier for this entry.
  2. uniprot_accession_number: Official UniProt ID (accession number) for this protein.
  3. version: Version of this protein as provided by UniProt.
  4. taxon_id: NCBI taxonomy ID for the organism that is associated with this protein.
  5. type: The database source from which this protein originates (either SwissProt or TrEMBL).
  6. name: Official name of this protein (as provided by UniProt).
  7. protein: Full protein sequence, without prior adjustments.
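
The seven columns above map naturally onto a small record type. The sketch below parses one row into a dataclass whose field names follow the column names listed above; the class and function names are hypothetical. The protein sequence in the sample row is truncated to its first 15 residues for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UniprotEntry:
    id: int
    uniprot_accession_number: str
    version: int
    taxon_id: int
    type: str
    name: str
    protein: str


def parse_uniprot_entry(line: str) -> UniprotEntry:
    """Parse one tab-separated uniprot_entries row into a UniprotEntry."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, found {len(fields)}")
    entry_id, accession, version, taxon_id, db_type, name, protein = fields
    return UniprotEntry(
        int(entry_id), accession, int(version), int(taxon_id), db_type, name, protein
    )


# First example row, with the protein sequence truncated for brevity.
sample_line = "1\tC0HLH3\t7\t2546662\tswissprot\tCollagen alpha-1(I) chain\tGGISVPGPMGPSGPR"
entry = parse_uniprot_entry(sample_line)
```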
