UniProt importer: use abbrev, not id field #1850
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #1850 +/- ##
==========================================
+ Coverage 42.51% 45.35% +2.84%
==========================================
Files 117 136 +19
Lines 8327 10168 +1841
Branches 1963 1753 -210
==========================================
+ Hits 3540 4612 +1072
- Misses 4582 5223 +641
- Partials 205 333 +128
|
I'm not sure I agree with this. UniProt has a consistent approach towards the way it assigns identifiers in its various vocabularies that all look like DB-0174, such as:
Bioregistry also takes this approach for several other resources that have "well-formed" identifier spaces but also keep track of "prefixes" or "short forms"; the alignment then uses the short forms (e.g., Wikidata, FAIRsharing, IntegBio).

Can you please elaborate on a concrete scenario where you think that this change would be helpful?
…g tests to validate importer functionality
3c808b9 to 9316a57
|
It would probably be more helpful for you to answer my Q about how to make Bioregistry work for my use case.

I want to use the bioregistry to remap prefixes from UniProt data downloads to the recommended BR prefix. In an ideal world, I would be able to run I can't do that at the moment because the (There is also the problem that values in the

I was hoping to use BR as much as possible and avoid having to write my own parsers for different dbxref systems (I'm also using data from GO, NCBI, various others), especially if it recapitulates what's already in BR, either in terms of data or functionality.

The question: Can I use bioregistry to remap prefixes from those used in the UniProt datasets to those recommended by BR? (I know how to do it without BR, so no need to elaborate on that.)

ETA: I have only looked at the curie/prefix-related functions in BR as those are most directly applicable to my desired usage.
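To make the desired behaviour concrete, here is a minimal sketch of the remapping step in plain Python. It does not use any Bioregistry function; the `remap_curie` helper and the `UNIPROT_TO_PREFERRED` table are hypothetical, and the target prefixes in the table are hand-written placeholders rather than values checked against the registry.

```python
from typing import Optional

# Placeholder mapping, written by hand for illustration only. In practice it
# would be derived from the registry instead of being hard-coded, and the
# right-hand values here are assumptions, not verified registry prefixes.
UNIPROT_TO_PREFERRED = {
    "EMBL": "ena.embl",
    "RefSeq": "refseq",
    "KEGG": "kegg",
}


def remap_curie(curie: str, mapping: dict[str, str]) -> Optional[str]:
    """Swap the source prefix of a CURIE for the preferred one.

    Returns None if the string has no prefix or the prefix is unmapped.
    """
    # Split on the first colon only, so local IDs that themselves contain
    # colons (e.g. KEGG's "spg:SpyM3_1742") are kept intact.
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        return None
    preferred = mapping.get(prefix)
    if preferred is None:
        return None
    return f"{preferred}:{local_id}"
```

Returning `None` for unmapped prefixes, rather than passing the input through, keeps unmapped databases visible so they can be reported or curated later.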
|
Can you please give an explicit example of a UniProt dataset where you're trying to do this? An alternate solution might be to rewrite the |
|
Certainly. Here's a snippet of the UniProt XML data download, available from their FTP site.

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<uniprot xmlns="https://uniprot.org/uniprot" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/docs/uniprot.xsd">
<entry dataset="Swiss-Prot" created="2011-07-27" modified="2025-04-09" version="54" xmlns="https://uniprot.org/uniprot">
<dbReference type="EC" id="3.4.22.10"/>
<dbReference type="EMBL" id="L26146">
<property type="protein sequence ID" value="AAA27012.1"/>
<property type="molecule type" value="Genomic_DNA"/>
</dbReference>
<dbReference type="EMBL" id="AE014074">
<property type="protein sequence ID" value="AAM80349.1"/>
<property type="molecule type" value="Genomic_DNA"/>
</dbReference>
<dbReference type="RefSeq" id="WP_002991253.1">
<property type="nucleotide sequence ID" value="NC_004070.1"/>
</dbReference>
<dbReference type="AlphaFoldDB" id="P0DD38"/>
<dbReference type="BMRB" id="P0DD38"/>
<dbReference type="SMR" id="P0DD38"/>
<dbReference type="MEROPS" id="C10.001"/>
<dbReference type="KEGG" id="spg:SpyM3_1742"/>
<dbReference type="HOGENOM" id="CLU_716727_0_0_9"/>
<dbReference type="Proteomes" id="UP000000564">
<property type="component" value="Chromosome"/>
<!-- etc. -->

They provide a load of cross-references for each UniProt entry, with the My workflow involves loading these dbxrefs and then normalising the I am not sure what the difference between a BR entry "synonym" and a BR "mapping" is -- in the case of UniProt, the
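For reference, loading the dbxrefs from XML like the snippet above could look roughly like the following sketch, which uses only the standard library. The `iter_dbxrefs` helper and the trimmed-down `SAMPLE` document are illustrative; the namespace URI is the one declared on the `entry` element in the snippet, and a real download would need streaming parsing rather than `fromstring`.

```python
import xml.etree.ElementTree as ET
from collections.abc import Iterator

# Namespace taken from the xmlns attribute in the snippet above.
NS = {"up": "https://uniprot.org/uniprot"}

# A trimmed-down, well-formed version of the snippet, for illustration only.
SAMPLE = """
<uniprot xmlns="https://uniprot.org/uniprot">
  <entry dataset="Swiss-Prot">
    <dbReference type="EC" id="3.4.22.10"/>
    <dbReference type="EMBL" id="L26146">
      <property type="protein sequence ID" value="AAA27012.1"/>
    </dbReference>
    <dbReference type="KEGG" id="spg:SpyM3_1742"/>
  </entry>
</uniprot>
"""


def iter_dbxrefs(xml_text: str) -> Iterator[tuple[str, str]]:
    """Yield (database abbreviation, identifier) pairs in document order."""
    root = ET.fromstring(xml_text)
    for ref in root.iterfind(".//up:dbReference", NS):
        # The "type" attribute holds the database abbreviation and "id"
        # holds the identifier within that database.
        yield ref.get("type"), ref.get("id")


xrefs = list(iter_dbxrefs(SAMPLE))
```

Each pair's first element is the value that needs normalising to a registry prefix.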
|
(Just to add -- I use |
|
Here's a demo of reusing some of the code I added today/yesterday that partially addresses your use case:

import click
import requests
from tabulate import tabulate

import bioregistry
from bioregistry.external.uniprot import get_uniprot


def main(use_direct: bool = True) -> None:
    # get a mapping from short names in the UniProt database, i.e., abbreviations,
    # to UniProt Database IDs
    if use_direct:
        abbreviation_to_database_id = _get_uniprot_short_name_to_prefix_direct()
    else:
        # this implementation doesn't currently take into account providers nor
        # mapped ones, like AlphaFoldDB, which don't have their own corresponding
        # bioregistry prefix, but are in the SSSOM file added in
        # https://github.com/biopragmatics/bioregistry/pull/1851
        abbreviation_to_database_id = bioregistry.get_registry_short_name_to_prefix("uniprot")

    # Get a mapping from UniProt Database IDs to Bioregistry prefixes
    # This now contains manually curated providers like AlphaFoldDB
    database_id_to_bioregistry = bioregistry.get_registry_invmap("uniprot")

    data = requests.get("https://rest.uniprot.org/uniprotkb/P0DD38.json", timeout=5).json()

    rows = []
    for xref in data["uniProtKBCrossReferences"]:
        abbreviation = xref["database"]
        database_id = abbreviation_to_database_id.get(abbreviation)
        luid = xref["id"]
        bioregistry_prefix = database_id_to_bioregistry.get(database_id)
        if bioregistry_prefix:
            # fix bananas, like ones in GO identifiers
            luid = bioregistry.standardize_identifier(bioregistry_prefix, luid)
        rows.append((bioregistry_prefix, database_id, abbreviation, luid))

    click.echo(
        tabulate(
            rows,
            tablefmt="github",
            headers=["bioregistry", "uniprot-database", "uniprot-abbrev", "identifier"],
        )
    )


def _get_uniprot_short_name_to_prefix_direct() -> dict[str, str]:
    return {
        short_name: uniprot_database_id
        for uniprot_database_id, record in get_uniprot().items()
        for short_name in record.get("short_names", [])
    }


if __name__ == "__main__":
    main()

which results in
interestingly, the |
The UniProt importer currently uses the id field from the UniProt database download list, but the id field is just an internal UniProt identifier and not used in any of the public-facing data (i.e. in the mappings to other databases).

This PR edits the parser so that it takes the abbrev field instead, which is where the db prefixes are held.

Added tests for the UniProt parser at the same time.
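As a rough illustration of the change, the sketch below keys parsed records by abbrev rather than id. It is not the importer's actual code: the `index_by_abbrev` helper, the record shape, and the sample values (made-up IDs and abbreviations) are all assumptions chosen to show the idea.

```python
from typing import Any


def index_by_abbrev(records: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Key each database record by its abbrev field instead of its id field.

    Records without an abbreviation are skipped, since they cannot be
    matched against the prefixes used in the public-facing data.
    """
    index: dict[str, dict[str, Any]] = {}
    for record in records:
        abbrev = record.get("abbrev")
        if not abbrev:
            continue
        index[abbrev] = record
    return index


# Made-up records for illustration; real ones come from the UniProt
# database download list and may carry different or additional fields.
RECORDS = [
    {"id": "DB-0001", "abbrev": "ExampleDB", "name": "Example database"},
    {"id": "DB-0002", "abbrev": "OtherDB", "name": "Another database"},
    {"id": "DB-0003", "name": "Record without an abbreviation"},
]

index = index_by_abbrev(RECORDS)
```

The internal id is still retained inside each record, so nothing is lost by switching the key.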