UniProt importer: use abbrev, not id field#1850

Draft
ialarmedalien wants to merge 1 commit into
biopragmatics:mainfrom
ialarmedalien:uniprot_use_abbreviation_not_db_id

@ialarmedalien

Contributor

The UniProt importer currently uses the id field from the UniProt database download list, but the id field is just an internal UniProt identifier and is not used in any of the public-facing data (e.g. in the mappings to other databases).

This PR edits the parser so that it takes the abbrev field instead, which is where the db prefixes are held.

Added tests for the UniProt parser at the same time.
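
A minimal sketch of the idea, assuming each record in the database list carries an `id` (e.g. `DB-0022`) and an `abbrev` (e.g. `EMBL`). The record shape and the function name `index_by_abbrev` are illustrative assumptions, not the actual importer code:

```python
# Hypothetical records shaped like entries in the UniProt database list.
# Only the "id" and "abbrev" field names come from the PR description;
# everything else here is an assumption for illustration.
RECORDS = [
    {"id": "DB-0022", "abbrev": "EMBL"},
    {"id": "DB-0117", "abbrev": "RefSeq"},
    {"id": "DB-9999"},  # made-up record with no abbreviation; it is skipped
]


def index_by_abbrev(records):
    """Key each record by its abbreviation, keeping the DB-xxxx ID as a field."""
    index = {}
    for record in records:
        abbrev = record.get("abbrev")
        if not abbrev:
            continue
        index[abbrev] = {"database_id": record["id"]}
    return index


INDEX = index_by_abbrev(RECORDS)
```

Keeping the `DB-xxxx` value as a field rather than discarding it means either identifier can still be recovered from the other.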

@ialarmedalien ialarmedalien marked this pull request as draft March 17, 2026 15:31
@codecov

codecov Bot commented Mar 17, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 45.35%. Comparing base (8950e70) to head (9316a57).
⚠️ Report is 930 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1850      +/-   ##
==========================================
+ Coverage   42.51%   45.35%   +2.84%     
==========================================
  Files         117      136      +19     
  Lines        8327    10168    +1841     
  Branches     1963     1753     -210     
==========================================
+ Hits         3540     4612    +1072     
- Misses       4582     5223     +641     
- Partials      205      333     +128     

@cthoyt

cthoyt commented Mar 17, 2026

Member

I'm not sure I agree with this. UniProt has a consistent approach towards the way it assigns identifiers in its various vocabularies that all look like DB-0174, such as:

Bioregistry also takes this approach for several other resources that have "well-formed" identifier spaces but also keep track of "prefixes" or "short forms"; the alignment then uses the short forms (e.g., Wikidata, FAIRsharing, IntegBio).

Can you please elaborate on a concrete scenario where you think that this change would be helpful?

@ialarmedalien ialarmedalien force-pushed the uniprot_use_abbreviation_not_db_id branch from 3c808b9 to 9316a57 Compare March 17, 2026 15:58
@ialarmedalien

ialarmedalien commented Mar 17, 2026

Contributor Author

It would probably be more helpful for you to answer my Q about how to make Bioregistry work for my use case.

I want to use the bioregistry to remap prefixes from UniProt data downloads to the recommended BR prefix. In an ideal world, I would be able to run normalize_prefix on all the identifiers in a UniProt dataset and the prefixes would be remapped to the BR preferred prefix.

I can't do that at the moment because the uniprot values in the mappings section of each prefix have IDs of the form DB-xxxx. Those IDs only appear in the UniProt database list; they are not used in any of the UniProt data downloads or in the UI.

(There is also the problem that values in the mappings section are not accessed by normalize_prefix, but one thing at a time.)

I was hoping to use BR as much as possible and avoid having to write my own parsers for different dbxref systems (I'm also using data from GO, NCBI, various others), especially if it recapitulates what's already in BR, either in terms of data or functionality.

The question:

Can I use bioregistry to remap prefixes from those used in the UniProt datasets to those recommended by BR? (I know how to do it without BR, so no need to elaborate on that.)

ETA: I have only looked at the curie/prefix-related functions in BR as those are most directly applicable to my desired usage.

@cthoyt

cthoyt commented Mar 17, 2026

Member

Can you please give an explicit example of a UniProt dataset where you're trying to do this?

An alternate solution might be to rewrite the bioregistry.normalize_prefix() function to index the abbreviations from UniProt, which I think would make more sense
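
As a rough sketch of what indexing the abbreviations could look like, here is a toy normalizer. It does not use or reproduce the real `bioregistry.normalize_prefix()` internals; the registry layout, `build_index`, and `toy_normalize_prefix` are all hypothetical, and the sample values are the ones quoted in this thread:

```python
# Toy registry: one record per Bioregistry-style prefix, each holding its
# UniProt database ID under "mappings". The layout is a simplification.
TOY_REGISTRY = {
    "ena.embl": {"mappings": {"uniprot": "DB-0022"}},
    "refseq": {"mappings": {"uniprot": "DB-0117"}},
}

# UniProt database ID -> abbreviation used in the UniProt data downloads.
UNIPROT_ABBREVS = {"DB-0022": "EMBL", "DB-0117": "RefSeq"}


def build_index(registry, abbrevs):
    """Index canonical prefixes plus UniProt abbreviations, case-insensitively."""
    index = {}
    for prefix, record in registry.items():
        index[prefix.casefold()] = prefix
        database_id = record.get("mappings", {}).get("uniprot")
        abbrev = abbrevs.get(database_id)
        if abbrev:
            # setdefault keeps the canonical prefix if the keys collide
            index.setdefault(abbrev.casefold(), prefix)
    return index


def toy_normalize_prefix(prefix, index):
    """Return the canonical prefix, or None if the input is not indexed."""
    return index.get(prefix.casefold())


INDEX = build_index(TOY_REGISTRY, UNIPROT_ABBREVS)
```

With this index, `toy_normalize_prefix("EMBL", INDEX)` resolves to `ena.embl`, while the raw `DB-0022` ID is deliberately not indexed as a prefix.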

@ialarmedalien

Contributor Author

Certainly. Here's a snippet of the UniProt XML data download, available from their FTP site.

<?xml version="1.0" encoding="UTF-8"  standalone="no" ?>
<uniprot xmlns="https://uniprot.org/uniprot" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/docs/uniprot.xsd">
  <entry dataset="Swiss-Prot" created="2011-07-27" modified="2025-04-09" version="54" xmlns="https://uniprot.org/uniprot">
    <dbReference type="EC" id="3.4.22.10"/>
    <dbReference type="EMBL" id="L26146">
      <property type="protein sequence ID" value="AAA27012.1"/>
      <property type="molecule type" value="Genomic_DNA"/>
    </dbReference>
    <dbReference type="EMBL" id="AE014074">
      <property type="protein sequence ID" value="AAM80349.1"/>
      <property type="molecule type" value="Genomic_DNA"/>
    </dbReference>
    <dbReference type="RefSeq" id="WP_002991253.1">
      <property type="nucleotide sequence ID" value="NC_004070.1"/>
    </dbReference>
    <dbReference type="AlphaFoldDB" id="P0DD38"/>
    <dbReference type="BMRB" id="P0DD38"/>
    <dbReference type="SMR" id="P0DD38"/>
    <dbReference type="MEROPS" id="C10.001"/>
    <dbReference type="KEGG" id="spg:SpyM3_1742"/>
    <dbReference type="HOGENOM" id="CLU_716727_0_0_9"/>
    <dbReference type="Proteomes" id="UP000000564">
      <property type="component" value="Chromosome"/>
 <!-- etc. -->

They provide a load of cross-references for each UniProt entry, with the type of each dbReference being one of the abbrevs from the database list. There is no mention of any DB-nnnn values anywhere in this file or in the other data file downloads that I have looked at (which is all their core protein-related data and their mappings to other database systems).

My workflow involves loading these dbxrefs and then normalising the type (aka prefix) to align with what's in Bioregistry.
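
A simplified sketch of that workflow, using only the standard library. The XML is a trimmed, well-formed version of the snippet above, and `ABBREV_TO_PREFIX` is a hard-coded stand-in for whatever lookup ends up doing the normalisation; both it and the function names are assumptions for illustration:

```python
import xml.etree.ElementTree as ET

# Trimmed, closed version of the UniProt XML snippet quoted above.
SNIPPET = """
<uniprot xmlns="https://uniprot.org/uniprot">
  <entry dataset="Swiss-Prot">
    <dbReference type="EMBL" id="L26146">
      <property type="molecule type" value="Genomic_DNA"/>
    </dbReference>
    <dbReference type="RefSeq" id="WP_002991253.1"/>
    <dbReference type="MEROPS" id="C10.001"/>
  </entry>
</uniprot>
""".strip()

# Hypothetical stand-in for a real abbreviation -> Bioregistry prefix lookup.
ABBREV_TO_PREFIX = {"EMBL": "ena.embl", "RefSeq": "refseq", "MEROPS": "merops.entry"}


def iter_dbxrefs(xml_text):
    """Yield (type, id) for every dbReference element, ignoring the namespace."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Tags look like "{namespace}dbReference"; compare the local name only
        if elem.tag.rsplit("}", 1)[-1] == "dbReference":
            yield elem.get("type"), elem.get("id")


def normalise_dbxrefs(xml_text, lookup):
    """Swap each dbReference type for its mapped prefix (None if unmapped)."""
    return [(lookup.get(db_type), db_id) for db_type, db_id in iter_dbxrefs(xml_text)]


PAIRS = normalise_dbxrefs(SNIPPET, ABBREV_TO_PREFIX)
```

Matching on the local tag name sidesteps the question of which namespace URI a given file declares.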

I am not sure what the difference between a BR entry "synonym" and a BR "mapping" is -- in the case of UniProt, the DB-nnnn ID refers to a database, not to a prefix, and for prefix normalisation it would be useful to add the abbrev field from the UniProt db list to the synonyms for the relevant prefix.

@ialarmedalien

Contributor Author

(Just to add: I use normalize_prefix to convert prefixes from various Gene Ontology project files, and it works beautifully, so thanks for that. It's a big time saver!)

@cthoyt

cthoyt commented Mar 18, 2026

Member

here's a demo of reusing some of the code I added today/yesterday that partially addresses your use case:

import click
import requests
from tabulate import tabulate

import bioregistry
from bioregistry.external.uniprot import get_uniprot


def main(use_direct: bool = True) -> None:
    # get a mapping from short names in the UniProt database, i.e., abbreviations,
    # to UniProt Database IDs
    if use_direct:
        abbreviation_to_database_id = _get_uniprot_short_name_to_prefix_direct()
    else:
        # this implementation doesn't currently take into account providers nor
        # mapped ones, like AlphaFoldDB, which don't have their own corresponding
        # bioregistry prefix, but are in the SSSOM file added in
        # https://github.com/biopragmatics/bioregistry/pull/1851
        abbreviation_to_database_id = bioregistry.get_registry_short_name_to_prefix("uniprot")

    # Get a mapping from UniProt Database IDs to Bioregistry prefixes
    # This now contains manually curated providers like AlphaFoldDB
    database_id_to_bioregistry = bioregistry.get_registry_invmap("uniprot")

    data = requests.get("https://rest.uniprot.org/uniprotkb/P0DD38.json", timeout=5).json()
    rows = []
    for xref in data["uniProtKBCrossReferences"]:
        abbreviation = xref["database"]
        database_id = abbreviation_to_database_id.get(abbreviation)
        luid = xref["id"]

        bioregistry_prefix = database_id_to_bioregistry.get(database_id)
        if bioregistry_prefix:
            # fix bananas, like ones in GO identifiers
            luid = bioregistry.standardize_identifier(bioregistry_prefix, luid)

        rows.append((bioregistry_prefix, database_id, abbreviation, luid))

    click.echo(
        tabulate(
            rows,
            tablefmt="github",
            headers=["bioregistry", "uniprot-database", "uniprot-abbrev", "identifier"],
        )
    )


def _get_uniprot_short_name_to_prefix_direct() -> dict[str, str]:
    return {
        short_name: uniprot_database_id
        for uniprot_database_id, record in get_uniprot().items()
        for short_name in record.get("short_names", [])
    }



if __name__ == "__main__":
    main()

which results in

| bioregistry      | uniprot-database   | uniprot-abbrev   | identifier       |
|------------------|--------------------|------------------|------------------|
| ena.embl         | DB-0022            | EMBL             | L26146           |
| ena.embl         | DB-0022            | EMBL             | AE014074         |
| refseq           | DB-0117            | RefSeq           | WP_002991253.1   |
| uniprot          | DB-0262            | AlphaFoldDB      | P0DD38           |
| bmrb.entry       | DB-0256            | BMRB             | P0DD38           |
| uniprot          | DB-0098            | SMR              | P0DD38           |
| merops.entry     | DB-0059            | MEROPS           | C10.001          |
| kegg             | DB-0053            | KEGG             | spg:SpyM3_1742   |
| hogenom          | DB-0044            | HOGENOM          | CLU_716727_0_0_9 |
| uniprot.proteome | DB-0191            | Proteomes        | UP000000564      |
| go               | DB-0037            | GO               | 0005576          |
| go               | DB-0037            | GO               | 0044164          |
| go               | DB-0037            | GO               | 0043655          |
| go               | DB-0037            | GO               | 0004197          |
| go               | DB-0037            | GO               | 0090729          |
| go               | DB-0037            | GO               | 0006508          |
| go               | DB-0037            | GO               | 0034050          |
| go               | DB-0037            | GO               | 0042783          |
| go               | DB-0037            | GO               | 0140321          |
| cath.superfamily | DB-0029            | Gene3D           | 3.90.70.50       |
| interpro         | DB-0052            | InterPro         | IPR038765        |
| interpro         | DB-0052            | InterPro         | IPR000200        |
| interpro         | DB-0052            | InterPro         | IPR025896        |
| interpro         | DB-0052            | InterPro         | IPR044934        |
| pfam             | DB-0073            | Pfam             | PF13734          |
| pfam             | DB-0073            | Pfam             | PF01640          |
| prints           | DB-0082            | PRINTS           | PR00797          |
| supfam           | DB-0155            | SUPFAM           | SSF54001         |

Interestingly, the bmrb.entry mapping looks problematic: that entry in Bioregistry appears to have its own dedicated identifier scheme that UniProt isn't using, since the BMRB cross-reference here carries the UniProt accession P0DD38 as its identifier.
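
One way to surface mismatches like this automatically is to check each identifier against the pattern of the prefix it was mapped to. The sketch below is only illustrative: the numeric pattern is a hypothetical stand-in for whatever pattern the real record defines, and `matches_pattern` is not a Bioregistry function:

```python
import re

# Hypothetical pattern for illustration only; look up the real registry
# record before relying on it.
ASSUMED_PATTERNS = {
    "bmrb.entry": re.compile(r"[0-9]+"),
}


def matches_pattern(prefix, identifier):
    """Return True/False if a pattern is known for the prefix, else None."""
    pattern = ASSUMED_PATTERNS.get(prefix)
    if pattern is None:
        return None
    return pattern.fullmatch(identifier) is not None


# Rows taken from the table above: (mapped prefix, identifier)
ROWS = [("bmrb.entry", "P0DD38"), ("pfam", "PF13734")]
SUSPECT = [row for row in ROWS if matches_pattern(*row) is False]
```

Under that assumed pattern, only the BMRB row is flagged; rows whose prefix has no known pattern are left alone rather than treated as failures.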
