Final output files
This page lists all output files generated by the Unipept database construction script. Each file is a compressed TSV file that represents a single table of the Unipept database and can therefore be loaded directly into an SQL database.
Every line in this file corresponds to a single EC-number.
1 1.-.-.- Oxidoreductases
2 1.1.-.- Acting on the CH-OH group of donors
3 1.1.1.- With NAD(+) or NADP(+) as acceptor
4 1.1.2.- With a cytochrome as acceptor
5 1.1.3.- With oxygen as acceptor
6 1.1.4.- With a disulfide as acceptor
7 1.1.5.- With a quinone or similar compound as acceptor
8 1.1.7.- With an iron-sulfur protein as acceptor
9 1.1.9.- With a copper protein as acceptor
- id: Internal identifier of this EC-number, as used by Unipept (these do not correspond to identifiers used by external organizations).
- ec_number_code: The official EC-number identifier, as provided by the Enzyme Commission.
- name: Name of this EC-number.
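As a minimal sketch of what loading such a file into an SQL database can look like, the snippet below reads a gzip-compressed, tab-separated EC-number file and inserts its rows into SQLite. The column names follow the field list above; the function name `load_ec_numbers`, the table name `ec_numbers` and the choice of SQLite are assumptions made for illustration, not part of the construction script.

```python
import csv
import gzip
import sqlite3


def load_ec_numbers(tsv_gz_path, connection):
    """Load a gzip-compressed, tab-separated EC-number file into SQLite.

    The table name `ec_numbers` is hypothetical; the columns mirror the
    fields documented above (id, ec_number_code, name).
    """
    connection.execute(
        "CREATE TABLE IF NOT EXISTS ec_numbers ("
        "id INTEGER PRIMARY KEY, ec_number_code TEXT, name TEXT)"
    )
    # "rt" decompresses to text; newline="" leaves line handling to csv.
    with gzip.open(tsv_gz_path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat quote characters as ordinary text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        rows = [(int(r[0]), r[1], r[2]) for r in reader if r]
    connection.executemany("INSERT INTO ec_numbers VALUES (?, ?, ?)", rows)
    connection.commit()
    return len(rows)
```

The same pattern should carry over to the other files on this page, with the table definition adjusted to the fields of each file.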
Every line in this file corresponds to a single GO-term.
3 GO:0000003 biological process reproduction
4 GO:0019952 biological process reproduction
5 GO:0050876 biological process reproduction
6 GO:0000005 molecular function obsolete ribosomal chaperone activity
7 GO:0000006 molecular function high-affinity zinc transmembrane transporter activity
- id: Internal identifier of this GO-term, as used by Unipept (these do not correspond to identifiers used by external organizations).
- go_term_code: The official GO-term identifier, as provided by the Gene Ontology.
- name: Name of this GO-term.
Every line in this file corresponds to a single InterPro-entry.
1 IPR000126 Active_site Serine proteases, V8 family, serine active site
2 IPR000138 Active_site Hydroxymethylglutaryl-CoA lyase, active site
3 IPR000169 Active_site Cysteine peptidase, cysteine active site
4 IPR000180 Active_site Membrane dipeptidase, active site
5 IPR000189 Active_site Prokaryotic transglycosylase, active site
- id: Internal identifier of this InterPro-entry, as used by Unipept (these do not correspond to identifiers used by external organizations).
- interpro_entry_code: The official InterPro-entry identifier, as provided by the InterPro organization.
- name: Name of this InterPro-entry.
A file containing all NCBI taxon IDs and, for each taxon, a link to its parent node on every supported rank of the NCBI taxonomy.
1 \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N
2 2 \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N \N
6 2 \N \N \N 1224 \N \N 28211 \N \N 356 \N \N \N 335928 \N \N \N 6 \N \N \N \N \N \N \N \N
See this page for all taxonomic ranks that are supported by Unipept.
- taxon_id: NCBI identifier of this taxon.
- superkingdom: Parent of this taxon on the superkingdom rank.
- ...
- forma: Parent of this taxon on the forma rank.
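The example rows above use `\N` where a taxon has no parent on a given rank. A small sketch of how one such row could be parsed is shown below; it assumes that fields are tab-separated and that `\N` is a NULL marker, and the function names are hypothetical.

```python
def parse_lineage_row(line):
    """Split one tab-separated lineage row into (taxon_id, ancestors).

    `ancestors` has one entry per rank column, in file order.
    """
    fields = line.rstrip("\n").split("\t")
    taxon_id = int(fields[0])
    # Assumption: \N marks "no parent on this rank" and maps to None.
    ancestors = [None if f == r"\N" else int(f) for f in fields[1:]]
    return taxon_id, ancestors


def known_ancestors(ancestors):
    """Return only the rank columns that are filled in, in file order."""
    return [a for a in ancestors if a is not None]
```

Applied to the third example row (taxon 6), `known_ancestors` keeps the six filled-in rank columns and drops the rest.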
A file containing all tryptic peptide sequences that were generated by an in-silico tryptic digest of the UniProtKB proteins. This file holds both the variant in which the amino acids I and L are considered equal and the variant in which they are not.
2 AAAAAA 87882 87882 {"num":{"all":2,"EC":2,"GO":2,"IPR":2},"data":{"GO:0004477":2,"GO:0004488":2,"GO:0000105":2,"GO:0009086":2,"GO:0006164":2,"GO:0035999":2,"EC:1.5.1.5":2,"EC:3.5.4.9":2,"IPR:IPR046346":2,"IPR:IPR036291":2,"IPR:IPR000672":2,"IPR:IPR020630":2,"IPR:IPR020867":2,"IPR:IPR020631":2}} {"num":{"all":2,"EC":2,"GO":2,"IPR":2},"data":{"GO:0004477":2,"GO:0004488":2,"GO:0000105":2,"GO:0009086":2,"GO:0006164":2,"GO:0035999":2,"EC:1.5.1.5":2,"EC:3.5.4.9":2,"IPR:IPR046346":2,"IPR:IPR036291":2,"IPR:IPR000672":2,"IPR:IPR020630":2,"IPR:IPR020867":2,"IPR:IPR020631":2}}
3 AAAAAAAAA 272568 272568 {"num":{"all":1,"EC":1,"GO":1,"IPR":1},"data":{"GO:0005737":1,"GO:1990904":1,"GO:0005840":1,"GO:0003735":1,"GO:0006412":1,"EC:":1,"IPR:IPR000307":1,"IPR:IPR020592":1,"IPR:IPR023803":1}} {"num":{"all":1,"EC":1,"GO":1,"IPR":1},"data":{"GO:0005737":1,"GO:1990904":1,"GO:0005840":1,"GO:0003735":1,"GO:0006412":1,"EC:":1,"IPR:IPR000307":1,"IPR:IPR020592":1,"IPR:IPR023803":1}}
4 AAAAAAAAAAAAAAAAAAAAQAQATSSYPSAISPGSK 7227 \N {"num":{"all":1,"EC":1,"GO":1,"IPR":1},"data":{"GO:0000785":1,"GO:0005634":1,"GO:0035517":1,"GO:0003682":1,"GO:0035800":1,"GO:0003677":1,"GO:0046872":1,"GO:0009887":1,"GO:0007469":1,"GO:0009948":1,"GO:0007475":1,"GO:0001709":1,"GO:0006325":1,"GO:0035522":1,"GO:0045892":1,"GO:0045893":1,"GO:0045944":1,"GO:0010468":1,"GO:0006357":1,"GO:0045498":1,"GO:0035186":1,"EC:":1,"IPR:IPR026905":1,"IPR:IPR024811":1,"IPR:IPR028020":1,"IPR:IPR044867":1}} \N
- id: Internal identifier of this sequence. This identifier is used by peptides.tsv.gz to refer to sequences in this file.
- sequence: String-representation of the sequence of amino acids that this peptide consists of.
- lca: Lowest common ancestor of the taxa associated with all proteins that contain this peptide sequence (in the case that I and L are not considered equal).
- lca_il: Lowest common ancestor of the taxa associated with all proteins that contain this peptide sequence (in the case that I and L are considered equal).
- fa: JSON-object containing the functional annotations of this sequence (in the case that I and L are not considered equal).
- fa_il: JSON-object containing the functional annotations of this sequence (in the case that I and L are considered equal).
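To illustrate how the fields above fit together, here is a hedged sketch that parses one row of this file and groups the keys of the `fa` JSON object by their prefix (`GO`, `EC`, `IPR`). It assumes tab-separated fields and `\N` as a NULL marker, as suggested by the example rows; the function names are hypothetical.

```python
import json


def parse_sequence_row(line):
    """Parse one tab-separated row into a dict keyed by the documented fields."""
    fields = line.rstrip("\n").split("\t")

    def nullable(value, convert):
        # Assumption: \N marks a missing value.
        return None if value == r"\N" else convert(value)

    return {
        "id": int(fields[0]),
        "sequence": fields[1],
        "lca": nullable(fields[2], int),
        "lca_il": nullable(fields[3], int),
        "fa": nullable(fields[4], json.loads),
        "fa_il": nullable(fields[5], json.loads),
    }


def annotations_by_type(fa):
    """Group the entries of fa["data"] by the prefix before the first colon."""
    grouped = {}
    for key, count in fa["data"].items():
        prefix = key.split(":", 1)[0]
        grouped.setdefault(prefix, {})[key] = count
    return grouped
```

Keys are kept exactly as they appear in the file, including entries such as `EC:` that have nothing after the colon, since this page does not state how those should be interpreted.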
A file containing all NCBI taxon IDs, mapped onto the associated name, rank and parent node in the NCBI taxonomy.
1 root no rank 1
2 Bacteria superkingdom 131567
6 Azorhizobium genus 335928
7 Azorhizobium caulinodans species 6
9 Buchnera aphidicola species 32199
- id: The NCBI ID for this taxon.
- name: Official name of this taxon, as defined by NCBI.
- rank: NCBI rank for this taxon.
- parent_id: NCBI ID of the parent taxon in the NCBI taxonomy.
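Because every row points to its parent through `parent_id`, the file can be walked upwards towards the root. The sketch below assumes tab-separated fields and relies on the root row listing itself as its own parent, as in the first example row; the function names are hypothetical.

```python
def parse_taxon_row(line):
    """Parse one tab-separated row into (id, record)."""
    taxon_id, name, rank, parent_id = line.rstrip("\n").split("\t")
    return int(taxon_id), {"name": name, "rank": rank, "parent_id": int(parent_id)}


def path_to_root(taxon_id, taxa):
    """Follow parent_id links upwards, starting at taxon_id.

    Stops at a taxon that is its own parent (the root), at a parent that
    is not present in `taxa`, or if a cycle is detected.
    """
    path = []
    current = taxon_id
    while current in taxa and current not in path:
        path.append(current)
        parent = taxa[current]["parent_id"]
        if parent == current:
            break
        current = parent
    return path
```

With only the five example rows loaded, the walk from taxon 7 stops after taxon 6, because its parent 335928 is not among those rows.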
A file containing one protein per line, including all of its functional and taxonomic annotations. Each of these proteins comes from the UniProtKB resource.
1 C0HLH3 7 2546662 swissprot Collagen alpha-1(I) chain GGISVPGPMGPSGPRGLPGPPGPGPQGFQGPPGEPGEPGSSGPMGPRGPPGPPGKNGDDGEAGKPGRPGERGPPGPQGARGLPGTAGLPGMKGHRGFSGLDGAKGDAGPAGPKGEPGSPGENGAPGQMGPRGPGERGRPGASGPAGARGNDGATGAAGPPGPTGPAGPPGFPGAVGAKGEAGPQGARGSEGPQGVRGEPGPPGPAGAAGPAGNPGADGQPGAKGANGAPGIAGAPGFPGARGPSGPQGPSGPPGPKGNSGEPGAPGGEPGPTGIQGPPGPAGEEGKRGARGEPGPTGLPGPPGERGGPGSRGFPGADGVAGPKGSPGEAGRPGEAGLPGAKGLTGSPGSPGPDGKTGPPGPAGQDGRPGPPGPPGARGQAGVMGFPGPKGAAGEPGKAGERGVPGPPGAVGPAGKDGEAGAQGPPGPAGPAGERGEQGPAGPGFQGLPGPAGPPGEAGKPGEQGVPGDLGAPGPSGARGERGFPGERGVQGPPGPAGPRGSSQGAPGLQGMPGERGAAGLPGPKGDRGDAGPKGADGAPGKDGVRGLTGPIGPPGPAGAPGDKGESGPSGPAGPTGARGAPGDRGEPGPPGPAGFAGPPGADGQPGAKGEPGDAGAKGDAGPPGPAGPTGAPGPIGNLGAPGPKGARGSAGPPGATGFPGAAGRVGPPGPSGNAGPPGPPGPVGKEGGKGPRGETGPAGEVGPPGPPGPSGEKGSPGADGPAGAPGTPGPQGISGQRGVVGLPGQRGERGFPGLPGPSGEPGKQGPSGSSGERGPPGPMGPPGLAGPPGESGREGPGAEGSPGRDGSPGPKGDRGEGPPGAPGAPGAPGPVGPAGKSGDRGETGPGPAGPAGPAGARGPAGPQGPRGDKGETGEQGDRGIKGHRGFSGLQGPAGPPGSPGEQGPSGASGPAGPRGPPGSAGSPGKDGLNGLPGPIGPPGPRGRTGDAGPVGPPGPPGPPGPPGPP
2 C0HLH1 6 2546656 swissprot Collagen alpha-1(I) chain SAGGISVPGPMGPSGPRGLPGPPGAPGPQGFQGPPGEPGEPGSGPMGPRGPPGPPGKNGDDGEAGKPGRPGERGPPGPQGARGLPGTAGLPGMKGHRGFSGLDGAKGDAGPAGPKGAPGQMGPRGPGERGRPGASGPAGARGNDGATGAAGPPGPTGPAGPPGFPGAVGAKGEAGPQGARGSEGPQGVRGEPGPPGPAGAAGPAGNPGADGQPGAKGANGAPGIAGAPGFPGARGPSGPQGPSGPPGPKGNSGEPGAPGSKAKGEPGPTGIQGPPGPAGEEGKRGARGEPGPTGLPGPGERGGPGSRGFPGADGVAGPKGPAGERGSPGPAGPKGSPGEAGRPGEAGLPGAKGLTGSPGSPGPDGKTGPPGPAGQDGRPGPPGPPGARGQAGVMGFPGPKGAAGEPGKAGERGVPGPPGAVGPAGKDGEAGAQGPPGPAGPAGERGEQGPAGPGFQGLPGPAGPPGEAGKPGEQGVPGDLGAPGPSGARGERGFPGERGVQGPPGPAGPRGSSQGAPGLQGMPGERGAAGLPGPKGDRGDAGPKGADGAPGKDGVRGLTGPIGPPGPAGAPGDKGESGPSGPAGPTGARGAPGDRGEPGPPGPAGFAGPPGADGQPGAKGEPGDAGAKGDAGPPGPAGPTGAPGPIGNLGAPGPKGARGSAGPPGATGFPGAAGRVGPPGPSGNAGPPGPPGPVGKEGGKGPRGETGPAGEVGPPGPPGPGEKGSPGADGPAGAPGTPGPQGISGQRGVVGLPGQRGERGFPGLPGPSGEPGKQGPSGSSGERGPPGPMGPPGLAGPPGESGREGPGAEGSPGRDGSPGPKGDRGETGPGPPGAPGAPGAPGPVGPAGKSGDRGETGPGPAGPAGPAGARGPAGPQGPRGDKGETGEQGDRGIKRGFSGLQGPAGPPGSPGEQGPSGASGPAGPRGPPGSAGSPGKDGLNGLPGPIGPPGPRGRTGDAGPVGPPGPPGPPGPPGPP
- id: Internal identifier for this entry.
- uniprot_accession_number: Official UniProt ID (accession number) for this protein.
- version: Version of this protein as provided by UniProt.
- taxon_id: NCBI taxonomy ID for the organism that is associated with this protein.
- type: The database source from which this protein originates (either SwissProt or TrEMBL).
- name: Official name of this protein (as provided by UniProt).
- protein: Full protein sequence, without prior adjustments.
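The tryptic peptide file described earlier on this page is derived from the `protein` field by an in-silico tryptic digest. This page does not state the exact rules used by the construction script, so the sketch below assumes the common convention of cleaving after K or R unless the next residue is P. The length bounds are illustrative parameters, and both function names are hypothetical.

```python
def tryptic_digest(protein, min_length=5, max_length=50):
    """Split a protein after every K or R that is not followed by P.

    Assumption: this is the conventional trypsin rule, not necessarily
    the one the construction script applies. The length bounds are
    illustrative defaults.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Keep the C-terminal remainder that does not end in K or R.
        peptides.append(protein[start:])
    return [p for p in peptides if min_length <= len(p) <= max_length]


def equate_il(peptide):
    """Return the variant of a peptide in which I and L are considered equal."""
    return peptide.replace("I", "L")
```

For example, `tryptic_digest("AAAAAKGGGGGRPLLLLKTT")` yields `["AAAAAK", "GGGGGRPLLLLK"]`: the R is followed by P and is therefore not a cleavage site, and the two-residue tail is dropped by the length filter.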