UniProt importer: use `abbrev`, not `id` field by ialarmedalien · Pull Request #1850 · biopragmatics/bioregistry

ialarmedalien · 2026-03-17T15:23:58Z

The UniProt importer currently uses the id field from the UniProt database download list, but the id field is just an internal UniProt identifier and not used in any of the public-facing data (i.e. in the mappings to other databases).

This PR edits the parser so that it takes the abbrev field instead, which is where the db prefixes are held.

Added tests for the UniProt parser at the same time.
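
To make the difference between the two fields concrete, here is a minimal sketch. It is not the importer code from this PR: the record layout (a dict with `id` and `abbrev` keys) and the helper `index_records` are assumptions made for illustration, and the two sample pairs are taken from the demo output later in this thread.

```python
# Hypothetical parsed rows from the UniProt database list. The real parser's
# record structure may differ; only the id/abbrev pairs come from this thread.
RECORDS = [
    {"id": "DB-0022", "abbrev": "EMBL"},
    {"id": "DB-0117", "abbrev": "RefSeq"},
]


def index_records(
    records: list[dict[str, str]], key_field: str
) -> dict[str, dict[str, str]]:
    """Key each record by the chosen field (``id`` or ``abbrev``)."""
    return {record[key_field]: record for record in records}


by_id = index_records(RECORDS, "id")
by_abbrev = index_records(RECORDS, "abbrev")

# A dbReference "type" value as it appears in a UniProt data download
xref_type = "EMBL"
print(xref_type in by_id)      # False: the download never mentions DB-0022
print(xref_type in by_abbrev)  # True: the abbreviation is what the data uses
```

Keying by `abbrev` is what lets a value read straight from a data file be looked up without an intermediate translation step.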

codecov · 2026-03-17T15:34:11Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 45.35%. Comparing base (8950e70) to head (9316a57).
⚠️ Report is 930 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1850      +/-   ##
==========================================
+ Coverage   42.51%   45.35%   +2.84%     
==========================================
  Files         117      136      +19     
  Lines        8327    10168    +1841     
  Branches     1963     1753     -210     
==========================================
+ Hits         3540     4612    +1072     
- Misses       4582     5223     +641     
- Partials      205      333     +128

cthoyt · 2026-03-17T15:35:18Z

I'm not sure I agree with this. UniProt has a consistent approach towards the way it assigns identifiers in its various vocabularies that all look like DB-0174, such as:

Bioregistry also takes the approach for several other resources that have "well-formed" identifier spaces, but also keep track of "prefixes" or "short forms", then the alignment uses the short forms (e.g., Wikidata, FAIRsharing, IntegBio)

Can you please elaborate on a concrete scenario where you think that this change would be helpful?


ialarmedalien · 2026-03-17T16:16:14Z

It would probably be more helpful for you to answer my Q about how to make Bioregistry work for my use case.

I want to use the bioregistry to remap prefixes from UniProt data downloads to the recommended BR prefix. In an ideal world, I would be able to run normalize_prefix on all the identifiers in a UniProt dataset and the prefixes would be remapped to the BR preferred prefix.

I can't do that at the moment because the uniprot values in the mappings section of each prefix have IDs of the form DB-xxxx. Those IDs only appear in the UniProt database list; they are not used in any of the UniProt data downloads or in the UI.

(There is also the problem that values in the mappings section are not accessed by normalize_prefix, but one thing at a time.)

I was hoping to use BR as much as possible and avoid having to write my own parsers for different dbxref systems (I'm also using data from GO, NCBI, various others), especially if it recapitulates what's already in BR, either in terms of data or functionality.

The question:

Can I use bioregistry to remap prefixes from those used in the UniProt datasets to those recommended by BR? (I know how to do it without BR, so no need to elaborate on that.)

ETA: I have only looked at the curie/prefix-related functions in BR as those are most directly applicable to my desired usage.

cthoyt · 2026-03-17T17:11:20Z

Can you please give an explicit example of a UniProt dataset where you're trying to do this?

An alternate solution might be to rewrite the bioregistry.normalize_prefix() function to index the abbreviations from UniProt, which I think would make more sense

ialarmedalien · 2026-03-17T18:00:45Z

Certainly. Here's a snippet of the UniProt XML data download, available from their FTP site.

<?xml version="1.0" encoding="UTF-8"  standalone="no" ?>
<uniprot xmlns="https://uniprot.org/uniprot" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/docs/uniprot.xsd">
  <entry dataset="Swiss-Prot" created="2011-07-27" modified="2025-04-09" version="54" xmlns="https://uniprot.org/uniprot">
    <dbReference type="EC" id="3.4.22.10"/>
    <dbReference type="EMBL" id="L26146">
      <property type="protein sequence ID" value="AAA27012.1"/>
      <property type="molecule type" value="Genomic_DNA"/>
    </dbReference>
    <dbReference type="EMBL" id="AE014074">
      <property type="protein sequence ID" value="AAM80349.1"/>
      <property type="molecule type" value="Genomic_DNA"/>
    </dbReference>
    <dbReference type="RefSeq" id="WP_002991253.1">
      <property type="nucleotide sequence ID" value="NC_004070.1"/>
    </dbReference>
    <dbReference type="AlphaFoldDB" id="P0DD38"/>
    <dbReference type="BMRB" id="P0DD38"/>
    <dbReference type="SMR" id="P0DD38"/>
    <dbReference type="MEROPS" id="C10.001"/>
    <dbReference type="KEGG" id="spg:SpyM3_1742"/>
    <dbReference type="HOGENOM" id="CLU_716727_0_0_9"/>
    <dbReference type="Proteomes" id="UP000000564">
      <property type="component" value="Chromosome"/>
    </dbReference>
    <!-- etc. -->
  </entry>
</uniprot>

They provide a load of cross-references for each UniProt entry, with the type of each dbReference being one of the abbrevs from the database list. There is no mention of any DB-nnnn values anywhere in this file or in the other data file downloads that I have looked at (which is all their core protein-related data and their mappings to other database systems).

My workflow involves loading these dbxrefs and then normalising the type (aka prefix) to align with what's in Bioregistry.
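
As a rough illustration of the loading step, the following sketch collects the `type` and `id` of each `dbReference` using the standard-library `xml.etree.ElementTree`. The embedded document is a shortened, closed-off version of the snippet above, and the helper name `iter_dbxrefs` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections.abc import Iterator

# Shortened version of the UniProt XML excerpt, closed so that it parses
SNIPPET = """\
<uniprot xmlns="https://uniprot.org/uniprot">
  <entry dataset="Swiss-Prot">
    <dbReference type="EMBL" id="L26146">
      <property type="molecule type" value="Genomic_DNA"/>
    </dbReference>
    <dbReference type="RefSeq" id="WP_002991253.1"/>
    <dbReference type="MEROPS" id="C10.001"/>
  </entry>
</uniprot>
"""


def iter_dbxrefs(xml_text: str) -> Iterator[tuple[str, str]]:
    """Yield (type, id) pairs for every dbReference element."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Tags look like "{namespace}dbReference"; compare the local name only
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "dbReference":
            yield element.attrib["type"], element.attrib["id"]


xrefs = list(iter_dbxrefs(SNIPPET))
for db_type, identifier in xrefs:
    print(db_type, identifier)
```

Each `db_type` here is the value that would then need normalising to a Bioregistry prefix.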

I am not sure what the difference between a BR entry "synonym" and a BR "mapping" is -- in the case of UniProt, the DB-nnnn ID refers to a database, not to a prefix, and for prefix normalisation it would be useful to add the abbrev field from the UniProt db list to the synonyms for the relevant prefix.

ialarmedalien · 2026-03-18T16:44:28Z

(Just to add -- I use normalize_prefix to convert prefixes from various Gene Ontology project files, and it works beautifully, so thanks for that -- it's a big time saver!)

cthoyt · 2026-03-18T20:11:47Z

here's a demo of reusing some of the code I added today/yesterday that partially addresses your use case:

import click
import requests
from tabulate import tabulate

import bioregistry
from bioregistry.external.uniprot import get_uniprot


def main(use_direct: bool = True) -> None:
    # get a mapping from short names in the UniProt database, i.e., abbreviations,
    # to UniProt Database IDs
    if use_direct:
        abbreviation_to_database_id = _get_uniprot_short_name_to_prefix_direct()
    else:
        # this implementation doesn't currently take into account providers nor
        # mapped ones, like AlphaFoldDB, which don't have their own corresponding
        # bioregistry prefix, but are in the SSSOM file added in
        # https://github.com/biopragmatics/bioregistry/pull/1851
        abbreviation_to_database_id = bioregistry.get_registry_short_name_to_prefix("uniprot")

    # Get a mapping from UniProt Database IDs to Bioregistry prefixes
    # This now contains manually curated providers like AlphaFoldDB
    database_id_to_bioregistry = bioregistry.get_registry_invmap("uniprot")

    res = requests.get("https://rest.uniprot.org/uniprotkb/P0DD38.json", timeout=5)
    res.raise_for_status()  # fail early on HTTP errors instead of parsing an error body
    data = res.json()
    rows = []
    for xref in data["uniProtKBCrossReferences"]:
        abbreviation = xref["database"]
        database_id = abbreviation_to_database_id.get(abbreviation)
        luid = xref["id"]

        bioregistry_prefix = database_id_to_bioregistry.get(database_id)
        if bioregistry_prefix:
            # fix bananas, like ones in GO identifiers
            luid = bioregistry.standardize_identifier(bioregistry_prefix, luid)

        rows.append((bioregistry_prefix, database_id, abbreviation, luid))

    click.echo(
        tabulate(
            rows,
            tablefmt="github",
            headers=["bioregistry", "uniprot-database", "uniprot-abbrev", "identifier"],
        )
    )


def _get_uniprot_short_name_to_prefix_direct() -> dict[str, str]:
    return {
        short_name: uniprot_database_id
        for uniprot_database_id, record in get_uniprot().items()
        for short_name in record.get("short_names", [])
    }



if __name__ == "__main__":
    main()

which results in

| bioregistry | uniprot-database | uniprot-abbrev | identifier |
|---|---|---|---|
| ena.embl | DB-0022 | EMBL | L26146 |
| ena.embl | DB-0022 | EMBL | AE014074 |
| refseq | DB-0117 | RefSeq | WP_002991253.1 |
| uniprot | DB-0262 | AlphaFoldDB | P0DD38 |
| bmrb.entry | DB-0256 | BMRB | P0DD38 |
| uniprot | DB-0098 | SMR | P0DD38 |
| merops.entry | DB-0059 | MEROPS | C10.001 |
| kegg | DB-0053 | KEGG | spg:SpyM3_1742 |
| hogenom | DB-0044 | HOGENOM | CLU_716727_0_0_9 |
| uniprot.proteome | DB-0191 | Proteomes | UP000000564 |
| go | DB-0037 | GO | 0005576 |
| go | DB-0037 | GO | 0044164 |
| go | DB-0037 | GO | 0043655 |
| go | DB-0037 | GO | 0004197 |
| go | DB-0037 | GO | 0090729 |
| go | DB-0037 | GO | 0006508 |
| go | DB-0037 | GO | 0034050 |
| go | DB-0037 | GO | 0042783 |
| go | DB-0037 | GO | 0140321 |
| cath.superfamily | DB-0029 | Gene3D | 3.90.70.50 |
| interpro | DB-0052 | InterPro | IPR038765 |
| interpro | DB-0052 | InterPro | IPR000200 |
| interpro | DB-0052 | InterPro | IPR025896 |
| interpro | DB-0052 | InterPro | IPR044934 |
| pfam | DB-0073 | Pfam | PF13734 |
| pfam | DB-0073 | Pfam | PF01640 |
| prints | DB-0082 | PRINTS | PR00797 |
| supfam | DB-0155 | SUPFAM | SSF54001 |

interestingly, the bmrb.entry looks problematic, since that entry in Bioregistry appears to have its own dedicated identifier scheme that UniProt isn't using
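
A check along the following lines could flag mismatches of this kind automatically. It is only a sketch: both patterns are placeholder assumptions rather than the patterns Bioregistry actually records for these prefixes, and `find_mismatches` is a hypothetical helper.

```python
import re

# Placeholder patterns keyed by Bioregistry prefix. These are assumptions for
# illustration only and are NOT copied from Bioregistry.
PATTERNS = {
    "bmrb.entry": re.compile(r"\d+"),
    "pfam": re.compile(r"PF\d{5}"),
}


def find_mismatches(rows: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return the (prefix, identifier) rows whose identifier fails the pattern."""
    mismatches = []
    for prefix, identifier in rows:
        pattern = PATTERNS.get(prefix)
        # Rows without a known pattern are skipped rather than flagged
        if pattern is not None and not pattern.fullmatch(identifier):
            mismatches.append((prefix, identifier))
    return mismatches


# Two rows taken from the table above
rows = [("bmrb.entry", "P0DD38"), ("pfam", "PF13734")]
print(find_mismatches(rows))  # [('bmrb.entry', 'P0DD38')]
```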

ialarmedalien marked this pull request as draft March 17, 2026 15:31

Updating uniprot importer to use the abbrev field for prefixes; adding tests to validate importer functionality (9316a57)

ialarmedalien force-pushed the uniprot_use_abbreviation_not_db_id branch from 3c808b9 to 9316a57 Compare March 17, 2026 15:58

ialarmedalien wants to merge 1 commit into biopragmatics:main from ialarmedalien:uniprot_use_abbreviation_not_db_id
