Commit 9316511

Add converter with configurable pre-processing (#171)
This PR adds an extension to the `curies.Converter` that allows for pre-configuring string processing. This is necessary in many places where simple contraction and expansion aren't enough to parse CURIEs, URIs, or other strings that appear where CURIEs or URIs are expected. The idea and draft code for this PR already existed in PyOBO; this PR generalizes them and makes them fully reusable.
1 parent 3835d56 commit 9316511

7 files changed

Lines changed: 783 additions & 12 deletions

File tree

docs/source/api.rst

Lines changed: 0 additions & 1 deletion
@@ -2,5 +2,4 @@ API Reference
 =============

 .. automodapi:: curies
-    :no-inheritance-diagram:
     :no-heading:

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -68,3 +68,4 @@ The most recent code and data can be installed directly from GitHub with:
     services/index
     typing
     w3c
+    preprocessing

docs/source/preprocessing.rst

Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
+Converter with Preprocessing
+============================
+
+When simple expansion and contraction aren't enough, and you want to inject global or
+context-specific rewrite rules, you can wrap a :class:`curies.Converter` and
+preprocessing rules encoded in an instance of :class:`curies.PreprocessingRules` inside
+a :class:`curies.PreprocessingConverter`.
+
+Rewrites
+--------
+
+For example, you always want to fix legacy references to the ``OBO_REL`` namespace:
+
+.. code-block:: python
+
+    import curies
+    from curies import PreprocessingRules, PreprocessingConverter, PreprocessingRewrites
+
+    rules = PreprocessingRules(
+        rewrites=PreprocessingRewrites(
+            full={"OBO_REL:is_a": "rdfs:subClassOf"},
+        ),
+    )
+
+    converter = curies.get_obo_converter()
+    converter = PreprocessingConverter.from_converter(
+        converter, rules=rules,
+    )
+
+    >>> converter.parse_curie("OBO_REL:is_a")
+    ReferenceTuple('rdfs', 'subClassOf')
+
+Similarly, there may be a whole class of references that need to be fixed based on their
+prefix, such as the ``APOLLO:SV_`` references that are mangled by the OWLAPI due to the
+OBO Foundry's PURL rules:
+
+.. code-block:: python
+
+    import curies
+    from curies import PreprocessingRules, PreprocessingConverter, PreprocessingRewrites
+
+    rules = PreprocessingRules(
+        rewrites=PreprocessingRewrites(
+            prefix={"APOLLO:SV_": "APOLLO_SV:"},
+        )
+    )
+
+    converter = curies.get_obo_converter()
+    converter = PreprocessingConverter.from_converter(
+        converter, rules=rules,
+    )
+
+    >>> converter.parse_curie("APOLLO:SV_1234567")
+    ReferenceTuple('APOLLO_SV', '1234567')
+
+The CURIE and URI rewrites are unified. Therefore, you can also use a URI as a rewrite,
+such as handling Creative Commons license URLs, which unfortunately aren't themselves
+part of a semantic space for licenses. Luckily, SPDX is, and we can remap to that.
+
+.. code-block:: python
+
+    import curies
+    from curies import PreprocessingRules, PreprocessingConverter, PreprocessingRewrites
+
+    rules = PreprocessingRules(
+        rewrites=PreprocessingRewrites(
+            full={"http://creativecommons.org/licenses/by/3.0/": "spdx:CC-BY-3.0",},
+        )
+    )
+
+    converter = curies.get_obo_converter()
+    converter.add_prefix("spdx", "https://spdx.org/licenses/")
+    converter = PreprocessingConverter.from_converter(
+        converter, rules=rules,
+    )
+
+    >>> converter.parse_uri("http://creativecommons.org/licenses/by/3.0/")
+    ReferenceTuple('spdx', 'CC-BY-3.0')
+
+Some rewrite rules only apply to a specific resource, because of its own quirks in
+curation or encoding. For example, CHMO encodes OrangeBook entries with ``orange`` as a
+prefix, which is not typically specific enough to warrant curating ``orange`` as a
+prefix, e.g., in the Bioregistry:
+
+.. code-block:: python
+
+    import curies
+    from curies import PreprocessingRules, PreprocessingConverter, PreprocessingRewrites
+
+    rules = PreprocessingRules(
+        rewrites=PreprocessingRewrites(
+            resource_prefix={
+                "CHMO": {"orange:": "orangebook:"},
+            },
+        ),
+    )
+
+    converter = curies.get_obo_converter()
+    converter.add_prefix("orangebook", "https://bioregistry.io/orangebook:")
+    converter = PreprocessingConverter.from_converter(
+        converter, rules=rules,
+    )
+
+    >>> converter.parse_curie("orange:10.2.1.1.3")
+    ReferenceTuple('orangebook', '10.2.1.1.3')
+
+Similarly, this can be used to inject knowledge about resources that improperly import
+EDAM sub-trees such as MCRO, which uses ``format`` as a prefix where it means
+``edam.format``.
+
+Blocks
+------
+
+Some references are *never* informative, and can be configured to be thrown away, such
+as ``Bgee:curators``, ``BioGRID:curators``, ``GROUP:OBI``, and similar group curation
+flags.
+
+.. code-block:: python
+
+    import curies
+    from curies import PreprocessingRules, PreprocessingConverter, PreprocessingBlocklists
+
+    rules = PreprocessingRules(
+        blocklists=PreprocessingBlocklists(
+            full=["Bgee:curators", "BioGRID:curators", "GROUP:OBI"],
+        ),
+    )
+
+    converter = curies.get_obo_converter()
+    converter = PreprocessingConverter.from_converter(
+        converter, rules=rules,
+    )
+
+    # raises a BlocklistError
+    >>> converter.parse_curie("GROUP:OBI")
+
+A blocked reference raises an exception that downstream code can handle, for example by
+returning ``None``. Raising makes it possible to distinguish a reference that was
+actively blocked from one that simply failed to parse (which yields ``None``). This can
+be toggled with the ``block_action`` argument.
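
To make the rule semantics documented in this file concrete, here is a minimal standalone sketch that uses only the standard library. It is not the curies implementation: `ToyRules`, `preprocess`, `BlockedError`, the `context` keyword, the `"raise"`/`"pass"` values for `block_action`, and the order in which the rules are applied are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class BlockedError(ValueError):
    """Hypothetical stand-in for the BlocklistError mentioned in the docs."""


@dataclass
class ToyRules:
    """Hypothetical, simplified stand-in for PreprocessingRules."""

    full_rewrites: dict[str, str] = field(default_factory=dict)
    prefix_rewrites: dict[str, str] = field(default_factory=dict)
    resource_prefix_rewrites: dict[str, dict[str, str]] = field(default_factory=dict)
    full_blocklist: set[str] = field(default_factory=set)


def preprocess(
    text: str,
    rules: ToyRules,
    *,
    context: str | None = None,  # assumed name for the resource being processed
    block_action: str = "raise",  # assumed values: "raise" or "pass"
) -> str | None:
    """Return the rewritten string, or None if it is blocked and not raising."""
    # 1. Blocked strings are never rewritten (ordering is an assumption).
    if text in rules.full_blocklist:
        if block_action == "raise":
            raise BlockedError(text)
        return None
    # 2. Resource-specific prefix rewrites apply only within their resource.
    if context is not None:
        for old, new in rules.resource_prefix_rewrites.get(context, {}).items():
            if text.startswith(old):
                return new + text[len(old):]
    # 3. Full rewrites replace the whole string.
    if text in rules.full_rewrites:
        return rules.full_rewrites[text]
    # 4. Global prefix rewrites replace only the leading part.
    for old, new in rules.prefix_rewrites.items():
        if text.startswith(old):
            return new + text[len(old):]
    return text


rules = ToyRules(
    full_rewrites={"OBO_REL:is_a": "rdfs:subClassOf"},
    prefix_rewrites={"APOLLO:SV_": "APOLLO_SV:"},
    resource_prefix_rewrites={"CHMO": {"orange:": "orangebook:"}},
    full_blocklist={"GROUP:OBI"},
)

print(preprocess("OBO_REL:is_a", rules))  # rdfs:subClassOf
print(preprocess("APOLLO:SV_1234567", rules))  # APOLLO_SV:1234567
print(preprocess("orange:10.2.1.1.3", rules, context="CHMO"))  # orangebook:10.2.1.1.3
print(preprocess("GROUP:OBI", rules, block_action="pass"))  # None
```

In this sketch the rewritten string would then be handed to an ordinary converter for parsing, which is the wrapping idea the documentation describes. Note that the resource-specific rule fires only when the resource is supplied, which is why the third call passes `context`.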

src/curies/__init__.py

Lines changed: 10 additions & 0 deletions
@@ -25,6 +25,12 @@
     write_tsv,
 )
 from .discovery import discover, discover_from_rdf
+from .preprocessing import (
+    PreprocessingBlocklists,
+    PreprocessingConverter,
+    PreprocessingRewrites,
+    PreprocessingRules,
+)
 from .reconciliation import remap_curie_prefixes, remap_uri_prefixes, rewire
 from .sources import (
     get_bioregistry_converter,
@@ -45,6 +51,10 @@
     "NamedReference",
     "Prefix",
     "PrefixMap",
+    "PreprocessingBlocklists",
+    "PreprocessingConverter",
+    "PreprocessingRewrites",
+    "PreprocessingRules",
     "Record",
     "Records",
     "Reference",

src/curies/api.py

Lines changed: 15 additions & 11 deletions
@@ -1500,26 +1500,30 @@ def compress_or_standardize(

     # docstr-coverage:excused `overload`
     @overload
-    def parse(self, uri_or_curie: str, *, strict: Literal[True]) -> ReferenceTuple: ...
+    def parse(
+        self, str_or_uri_or_curie: str, *, strict: Literal[True] = True
+    ) -> ReferenceTuple: ...

     # docstr-coverage:excused `overload`
     @overload
-    def parse(self, uri_or_curie: str, *, strict: Literal[False]) -> ReferenceTuple | None: ...
+    def parse(
+        self, str_or_uri_or_curie: str, *, strict: Literal[False] = False
+    ) -> ReferenceTuple | None: ...

-    def parse(self, uri_or_curie: str, *, strict: bool) -> ReferenceTuple | None:
-        """Parse a URI or CURIE."""
-        if self.is_uri(uri_or_curie):
+    def parse(self, str_or_uri_or_curie: str, *, strict: bool = False) -> ReferenceTuple | None:
+        """Parse a string, URI, or CURIE."""
+        if self.is_uri(str_or_uri_or_curie):
             if strict:
-                return self.parse_uri(uri_or_curie, strict=True, return_none=True)
+                return self.parse_uri(str_or_uri_or_curie, strict=True, return_none=True)
             else:
-                return self.parse_uri(uri_or_curie, strict=False, return_none=True)
-        if self.is_curie(uri_or_curie):
+                return self.parse_uri(str_or_uri_or_curie, strict=False, return_none=True)
+        if self.is_curie(str_or_uri_or_curie):
             if strict:
-                return self.parse_curie(uri_or_curie, strict=True)
+                return self.parse_curie(str_or_uri_or_curie, strict=True)
             else:
-                return self.parse_curie(uri_or_curie, strict=False)
+                return self.parse_curie(str_or_uri_or_curie, strict=False)
         if strict:
-            raise CompressionError(uri_or_curie)
+            raise CompressionError(str_or_uri_or_curie)
         return None

     def compress_strict(self, uri: str) -> str:
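
The `strict` overloads in this hunk follow a common typing pattern: the `Literal` value of a keyword argument selects the return type, and the implementation tries the URI branch before the CURIE branch. The sketch below illustrates that pattern in isolation with the standard library only. `ToyConverter`, its prefix map, the local `ReferenceTuple`, and the use of `ValueError` are hypothetical stand-ins, not the curies API. In this sketch only the `Literal[False]` overload carries a default, so that a call without `strict` is typed as possibly returning `None`, matching the runtime default; that placement is a choice of the sketch, not something the diff does.

```python
from __future__ import annotations

from typing import Literal, NamedTuple, overload


class ReferenceTuple(NamedTuple):
    """Hypothetical stand-in for curies.ReferenceTuple."""

    prefix: str
    identifier: str


class ToyConverter:
    """Hypothetical converter backed by a plain prefix -> URI prefix map."""

    def __init__(self, prefix_map: dict[str, str]) -> None:
        self.prefix_map = prefix_map

    @overload
    def parse(self, text: str, *, strict: Literal[True]) -> ReferenceTuple: ...

    @overload
    def parse(self, text: str, *, strict: Literal[False] = False) -> ReferenceTuple | None: ...

    def parse(self, text: str, *, strict: bool = False) -> ReferenceTuple | None:
        """Parse a URI or CURIE, raising on failure only when strict."""
        # URI branch: match against the known URI prefixes first.
        for prefix, uri_prefix in self.prefix_map.items():
            if text.startswith(uri_prefix):
                return ReferenceTuple(prefix, text[len(uri_prefix):])
        # CURIE branch: split on the first colon and check the prefix.
        prefix, sep, identifier = text.partition(":")
        if sep and prefix in self.prefix_map:
            return ReferenceTuple(prefix, identifier)
        if strict:
            raise ValueError(f"could not parse: {text}")
        return None


conv = ToyConverter({"spdx": "https://spdx.org/licenses/"})
print(conv.parse("https://spdx.org/licenses/CC-BY-3.0"))
print(conv.parse("spdx:CC-BY-3.0", strict=True))
print(conv.parse("not a reference"))  # None
```

With this arrangement a type checker narrows `conv.parse(x, strict=True)` to `ReferenceTuple`, while `conv.parse(x)` keeps the optional return type.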
