22 changes: 11 additions & 11 deletions README.md
@@ -38,8 +38,8 @@ provides:
4. A confidence model granular at the curator-level, mapping set-level, and
community feedback-level

We also provide the SeMRA Raw Mappings Database, a set of pre-assembled semantic
mappings from hundreds of ontologies and databases, on Zenodo at
We also provide the SeMRA Raw Semantic Mappings Database, a set of pre-assembled
semantic mappings from hundreds of ontologies and databases, on Zenodo at
https://doi.org/10.5281/zenodo.11082038 that can be rebuilt with `semra build`.
More information [here](https://semra.readthedocs.io/en/latest/artifacts.html).

@@ -71,29 +71,29 @@ mapping = Mapping(

### Assembly

Mappings can be assembled from many source formats using functions in the
`semra.io` submodule:
Mappings can be assembled from many source formats using I/O functions exposed
through the top-level `semra` submodule:

```python
import semra.io
import semra

# load mappings from any standardized SSSOM file as a file path or URL, via `pandas.read_csv`
sssom_url = "https://w3id.org/biopragmatics/biomappings/sssom/biomappings.sssom.tsv"
mappings = semra.io.from_sssom(
mappings = semra.from_sssom(
sssom_url, license="spdx:CC0-1.0", mapping_set_title="biomappings",
)

# alternatively, metadata can be passed via a file/URL
mappings_alt = semra.io.from_sssom(
mappings_alt = semra.from_sssom(
sssom_url,
metadata="https://w3id.org/biopragmatics/biomappings/sssom/biomappings.sssom.yml"
)

# load mappings from the Gene Ontology (via OBO format)
go_mappings = semra.io.from_pyobo("go")
go_mappings = semra.from_pyobo("go")

# load mappings from the Uber Anatomy Ontology (via OWL format)
uberon_mappings = semra.io.from_bioontologies("uberon")
uberon_mappings = semra.from_bioontologies("uberon")
```

SeMRA also implements custom importers in the `semra.sources` submodule. It's
@@ -281,7 +281,7 @@ these references can be standardized in a deterministic and principled way.

```python
import chembl_downloader
import semra.io
import semra
from semra.api import prioritize_df

# A dataframe of indication-disease pairs, where the
@@ -291,7 +291,7 @@ df = chembl_downloader.query("SELECT DISTINCT drugind_id, efo_id FROM DRUG_INDIC
# a pre-calculated prioritization of diseases and phenotypes from MONDO, DOID,
# HPO, ICD, GARD, and more.
url = "https://zenodo.org/records/15164180/files/priority.sssom.tsv?download=1"
mappings = semra.io.from_sssom(url)
mappings = semra.from_sssom(url)

# the dataframe will now have a new column with standardized references
prioritize_df(mappings, df, column="efo_id", target_column="priority_indication_curie")
4 changes: 2 additions & 2 deletions docs/source/artifacts.rst
@@ -1,5 +1,5 @@
Raw Mapping Database
====================
SeMRA Raw Semantic Mappings Database
====================================

.. automodapi:: semra.database
:no-heading:
1 change: 1 addition & 0 deletions docs/source/img/architecture.svg
1 change: 1 addition & 0 deletions docs/source/img/datastruct.svg
1 change: 1 addition & 0 deletions docs/source/img/pipeline.svg
33 changes: 20 additions & 13 deletions docs/source/index.rst
@@ -31,12 +31,13 @@ the digital humanities. Get started by loading external mappings:

.. code-block:: python

import semra.io
import semra

# load mappings from any standardized SSSOM file as a file path or URL, via `pandas.read_csv`
sssom_url = "https://w3id.org/biopragmatics/biomappings/sssom/biomappings.sssom.tsv"
mappings = semra.io.from_sssom(
sssom_url, license="spdx:CC0-1.0", mapping_set_title="biomappings",
mappings = semra.from_sssom(
# load mappings from any standardized SSSOM file as a file path or URL
"https://w3id.org/biopragmatics/biomappings/sssom/biomappings.sssom.tsv",
license="spdx:CC0-1.0",
mapping_set_title="biomappings",
)

Or by creating your own mappings:
@@ -77,6 +78,10 @@ Features
4. A confidence model granular at the curator-level, mapping set-level, and community
feedback-level

Here's a conceptual diagram of SeMRA's architecture:

.. image:: img/architecture.svg

What SeMRA Isn't
----------------
SeMRA isn't a tool for predicting semantic mappings like
@@ -93,15 +98,15 @@ web application for your use-case specific mapping database.
SeMRA isn't itself a curation tool, but it has the option to integrate :mod:`biomappings`
in deployments of its local web application for curation purposes.

SeMRA isn't a tool for merging ontologies like `CoMerger <https://arxiv.org/abs/2005.02659>`_,
but it outputs detailed and comprehensive semantic mappings that are critical
as input for such tools.
SeMRA isn't a tool for merging ontologies like `CoMerger <https://arxiv.org/abs/2005.02659>`_
or `OntoMerger <https://arxiv.org/abs/2206.02238>`_, but it outputs detailed
and comprehensive semantic mappings that are critical as input for such tools.

Artifacts Overview
------------------

SeMRA was used to produce the `SeMRA Raw Mappings Database <https://doi.org/10.5281/zenodo.11082038>`_,
a comprehensive raw mappings database, and five domain-specific
SeMRA was used to produce the `SeMRA Raw Semantic Mappings Database <https://doi.org/10.5281/zenodo.11082038>`_,
a comprehensive raw semantic mappings database, and five domain-specific
mapping databases (each with a landscape analysis). The results of the
domain-specific landscape analyses can be found on the SeMRA `GitHub
repository <https://github.com/biopragmatics/semra/tree/main/notebooks/landscape>`_.
@@ -145,11 +150,13 @@ Table of Contents
:name: start

installation
io
usage
cli
pipeline
artifacts
tutorial
struct
io
reference
cli

Indices and Tables
------------------
1 change: 1 addition & 0 deletions docs/source/io.rst
@@ -2,3 +2,4 @@ Getting and Writing Semantic Mappings
=====================================

.. automodapi:: semra.io
:no-heading:
6 changes: 6 additions & 0 deletions docs/source/pipeline.rst
@@ -0,0 +1,6 @@
Mapping Assembly Pipeline
=========================

.. automodapi:: semra.pipeline
:no-heading:
:no-inheritance-diagram:
11 changes: 4 additions & 7 deletions docs/source/reference.rst
@@ -1,16 +1,13 @@
Reference
=========

.. automodapi:: semra.struct
:skip: Reference

.. automodapi:: semra.pipeline
This contains several SeMRA submodules with low-level functionality. You can use these
to build your own mapping processing workflows and I/O.

.. automodapi:: semra.api
:no-inheritance-diagram:


Constants
---------
.. automodapi:: semra.inference

.. automodapi:: semra.rules
:include-all-objects:
6 changes: 6 additions & 0 deletions docs/source/struct.rst
@@ -0,0 +1,6 @@
Data Structure
==============

.. automodapi:: semra.struct
:skip: Reference,Triple
:no-heading:
30 changes: 30 additions & 0 deletions docs/source/tutorial.rst
@@ -0,0 +1,30 @@
Prioritizing CURIEs in a Dataframe
==================================

SeMRA provides tools for data scientists to standardize references using semantic
mappings.

For example, the drug indications table in ChEMBL contains a variety of references to
EFO, MONDO, DOID, and other controlled vocabularies (described in detail in `this blog
post <https://cthoyt.com/2025/04/17/chembl-indications-efo-exploration.html>`_). Using
SeMRA's pre-constructed `disease and phenotype prioritization mapping
<https://doi.org/10.5281/zenodo.11091885>`_, these references can be standardized in a
deterministic and principled way.

.. code-block:: python

import chembl_downloader
import semra.io
from semra.api import prioritize_df

# A dataframe of indication-disease pairs, where the
# "efo_id" column is actually an arbitrary disease or phenotype query
df = chembl_downloader.query("SELECT DISTINCT drugind_id, efo_id FROM DRUG_INDICATION")

# a pre-calculated prioritization of diseases and phenotypes from MONDO, DOID,
# HPO, ICD, GARD, and more.
url = "https://zenodo.org/records/15164180/files/priority.sssom.tsv?download=1"
mappings = semra.io.from_sssom(url)

# the dataframe will now have a new column with standardized references
prioritize_df(mappings, df, column="efo_id", target_column="priority_indication_curie")
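
As a rough illustration of what the prioritization step in this tutorial accomplishes, the sketch below uses plain Python rather than SeMRA. The function `prioritize_column`, the CURIEs, and the choice to fill unmapped values with `None` are all assumptions made for illustration; they are not taken from SeMRA's implementation of `prioritize_df`, which may behave differently.

```python
# Hypothetical, simplified illustration of prioritization; NOT SeMRA's code.


def prioritize_column(rows, priority_map, column, target_column):
    """Return copies of the rows with an added column of prioritized CURIEs.

    ``priority_map`` maps each source CURIE to its preferred (priority) CURIE.
    Values with no known mapping get ``None`` (an assumption of this sketch).
    """
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not mutated
        new_row[target_column] = priority_map.get(row[column])
        result.append(new_row)
    return result


# Made-up CURIEs for illustration only
priority_map = {
    "efo:0000001": "mondo:0000001",
    "doid:0000002": "mondo:0000001",
}
rows = [
    {"drugind_id": 1, "efo_id": "efo:0000001"},
    {"drugind_id": 2, "efo_id": "doid:0000002"},
    {"drugind_id": 3, "efo_id": "hp:0000003"},
]
prioritized = prioritize_column(
    rows, priority_map, column="efo_id", target_column="priority_indication_curie"
)
```

In this sketch, the first two rows resolve to the same priority CURIE even though they came from different vocabularies, which is the point of standardizing references through a prioritization mapping.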
6 changes: 0 additions & 6 deletions docs/source/usage.rst

This file was deleted.

20 changes: 11 additions & 9 deletions scripts/gilda_reprocess.py → notebooks/gilda_reprocess.py
@@ -5,18 +5,18 @@
import pystow
from gilda import Grounder
from gilda.grounder import load_entries_from_terms_file
from gilda.resources import get_grounding_terms, resource_dir
from gilda.resources import get_grounding_terms

from semra.gilda_utils import (
GILDA_TO_BIOREGISTRY,
print_scored_matches,
standardize_terms,
update_terms,
standardize_gilda_terms,
update_gilda_terms,
)
from semra.pipeline import Configuration, Input, Mutation, get_priority_mappings_from_config
from semra.pipeline import AssembleReturnType, Configuration, Input, Mutation, assemble

MODULE = pystow.module("semra", "gilda-demo")
PROCESSED_GILDA_TERMS_PATH = resource_dir.joinpath("grounding_terms_standardized.tsv.gz")
PROCESSED_GILDA_TERMS_PATH = MODULE.join(name="grounding_terms_standardized.tsv.gz")

PRIORITY = [
"HP",
@@ -42,6 +42,8 @@
PRIORITY = [GILDA_TO_BIOREGISTRY[p] for p in PRIORITY]

CONFIGURATION = Configuration(
key="gilda",
name="Gilda Reprocessing",
inputs=[
Input(source="biomappings"),
Input(source="gilda"),
@@ -72,14 +74,14 @@ def _get_terms() -> list[gilda.Term]:
from gilda.generate_terms import dump_terms

terms: list[gilda.Term] = list(load_entries_from_terms_file(get_grounding_terms()))
terms = standardize_terms(terms)
terms = standardize_gilda_terms(terms)
dump_terms(terms, PROCESSED_GILDA_TERMS_PATH)
return terms


def main():
def main() -> None:
"""Reprocess the gilda default lexical index."""
mappings = get_priority_mappings_from_config(CONFIGURATION)
mappings = assemble(CONFIGURATION, return_type=AssembleReturnType.priority)
if not mappings:
raise ValueError("Bad mapping priority definition resulted in no mappings")

@@ -91,7 +93,7 @@ def main():
if missing:
raise ValueError(f"Missing: {sorted(missing)}")

terms = update_terms(terms, mappings)
terms = update_gilda_terms(terms, mappings)

grounder = Grounder(terms)
s = "Pelvic lipomatosis"
18 changes: 18 additions & 0 deletions notebooks/landscape/anatomy/README.md
@@ -11,6 +11,24 @@ Charles Tapley Hoyt (orcid:0000-0003-4423-4370)
</li>
</ul>

## Reproduction

The SeMRA Anatomy Mappings Database can be rebuilt with the following commands:

```console
$ git clone https://github.com/biopragmatics/semra.git
$ cd semra
$ uv pip install .[landscape]
$ python -m semra.landscape.anatomy
```

Note that downloading raw data resources can take on the order of hours to tens
of hours depending on your internet connection and the reliability of the
resources' respective servers.

Processing and analysis can be run overnight on commodity hardware (e.g., a 2023
MacBook Pro with 36GB RAM).

## Resource Summary

The following resources are represented in processed mappings generated. They
19 changes: 19 additions & 0 deletions notebooks/landscape/cell/README.md
@@ -12,6 +12,25 @@ Charles Tapley Hoyt (orcid:0000-0003-4423-4370)
</li>
</ul>

## Reproduction

The SeMRA Cell and Cell Line Mappings Database can be rebuilt with the following
commands:

```console
$ git clone https://github.com/biopragmatics/semra.git
$ cd semra
$ uv pip install .[landscape]
$ python -m semra.landscape.cell
```

Note that downloading raw data resources can take on the order of hours to tens
of hours depending on your internet connection and the reliability of the
resources' respective servers.

Processing and analysis can be run overnight on commodity hardware (e.g., a 2023
MacBook Pro with 36GB RAM).

## Resource Summary

The following resources are represented in processed mappings generated. They