biopragmatics · cthoyt · Jul 4, 2025 · Jul 3, 2025 · Jul 3, 2025 · Jul 3, 2025
diff --git a/README.md b/README.md
@@ -38,8 +38,8 @@ provides:
 4. A confidence model granular at the curator-level, mapping set-level, and
    community feedback-level
 
-We also provide the SeMRA Raw Mappings Database, a set of pre-assembled semantic
-mappings from hundreds of ontologies and databases, on Zenodo at
+We also provide the SeMRA Raw Semantic Mappings Database, a set of pre-assembled
+semantic mappings from hundreds of ontologies and databases, on Zenodo at
 https://doi.org/10.5281/zenodo.11082038 that can be rebuilt with `semra build`.
 More information [here](https://semra.readthedocs.io/en/latest/artifacts.html).
 
@@ -71,29 +71,29 @@ mapping = Mapping(
 
 ### Assembly
 
-Mappings can be assembled from many source formats using functions in the
-`semra.io` submodule:
+Mappings can be assembled from many source formats using I/O functions exposed
+through the top-level `semra` submodule:
 
 ```python
-import semra.io
+import semra
 
 # load mappings from any standardized SSSOM file as a file path or URL, via `pandas.read_csv`
 sssom_url = "https://w3id.org/biopragmatics/biomappings/sssom/biomappings.sssom.tsv"
-mappings = semra.io.from_sssom(
+mappings = semra.from_sssom(
     sssom_url, license="spdx:CC0-1.0", mapping_set_title="biomappings",
 )
 
 # alternatively, metadata can be passed via a file/URL
-mappings_alt = semra.io.from_sssom(
+mappings_alt = semra.from_sssom(
     sssom_url,
     metadata="https://w3id.org/biopragmatics/biomappings/sssom/biomappings.sssom.yml"
 )
 
 # load mappings from the Gene Ontology (via OBO format)
-go_mappings = semra.io.from_pyobo("go")
+go_mappings = semra.from_pyobo("go")
 
 # load mappings from the Uber Anatomy Ontology (via OWL format)
-uberon_mappings = semra.io.from_bioontologies("uberon")
+uberon_mappings = semra.from_bioontologies("uberon")
 ```
 
 SeMRA also implements custom importers in the `semra.sources` submodule. It's
@@ -281,7 +281,7 @@ these references can be standardized in a deterministic and principled way.
 
 ```python
 import chembl_downloader
-import semra.io
+import semra
 from semra.api import prioritize_df
 
 # A dataframe of indication-disease pairs, where the
@@ -291,7 +291,7 @@ df = chembl_downloader.query("SELECT DISTINCT drugind_id, efo_id FROM DRUG_INDIC
 # a pre-calculated prioritization of diseases and phenotypes from MONDO, DOID,
 # HPO, ICD, GARD, and more.
 url = "https://zenodo.org/records/15164180/files/priority.sssom.tsv?download=1"
-mappings = semra.io.from_sssom(url)
+mappings = semra.from_sssom(url)
 
 # the dataframe will now have a new column with standardized references
 prioritize_df(mappings, df, column="efo_id", target_column="priority_indication_curie")
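
Review note: for readers unfamiliar with what prioritization does to the dataframe in the README example above, the idea can be sketched in plain Python. This is not SeMRA's implementation of `prioritize_df`. It is a minimal illustration that assumes the priority mappings have already been reduced to a dictionary from each CURIE to its preferred equivalent. The function name `add_priority_column`, the choice to leave unmapped values as `None`, and the example CURIEs are all hypothetical.

```python
def add_priority_column(rows, priority, column, target_column):
    """Return copies of ``rows`` with ``target_column`` looked up from ``priority``.

    ``rows`` is a list of dicts standing in for dataframe records, and
    ``priority`` maps a CURIE to its preferred equivalent CURIE.
    """
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        # Assumption of this sketch: CURIEs without a mapping become None.
        new_row[target_column] = priority.get(row[column])
        result.append(new_row)
    return result


# Placeholder CURIEs, made up for illustration only
priority = {"efo:0000001": "mondo:0000001"}
rows = [
    {"drugind_id": 1, "efo_id": "efo:0000001"},
    {"drugind_id": 2, "efo_id": "doid:0000002"},
]
standardized = add_priority_column(
    rows, priority, column="efo_id", target_column="priority_indication_curie"
)
```

The real function works on a `pandas` dataframe and on SeMRA mapping objects, so its handling of unmapped or already-canonical references may differ from this sketch.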

diff --git a/docs/source/artifacts.rst b/docs/source/artifacts.rst
@@ -1,5 +1,5 @@
-Raw Mapping Database
-====================
+SeMRA Raw Semantic Mappings Database
+====================================
 
 .. automodapi:: semra.database
     :no-heading:

diff --git a/docs/source/img/architecture.svg b/docs/source/img/architecture.svg
diff --git a/docs/source/img/datastruct.svg b/docs/source/img/datastruct.svg
diff --git a/docs/source/img/pipeline.svg b/docs/source/img/pipeline.svg
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -31,12 +31,13 @@ the digital humanities. Get started by loading external mappings:
 
 .. code-block:: python
 
-    import semra.io
+    import semra
 
-    # load mappings from any standardized SSSOM file as a file path or URL, via `pandas.read_csv`
-    sssom_url = "https://w3id.org/biopragmatics/biomappings/sssom/biomappings.sssom.tsv"
-    mappings = semra.io.from_sssom(
-        sssom_url, license="spdx:CC0-1.0", mapping_set_title="biomappings",
+    mappings = semra.from_sssom(
+        # load mappings from any standardized SSSOM file as a file path or URL
+        "https://w3id.org/biopragmatics/biomappings/sssom/biomappings.sssom.tsv",
+        license="spdx:CC0-1.0",
+        mapping_set_title="biomappings",
     )
 
 Or by creating your own mappings:
@@ -77,6 +78,10 @@ Features
 4. A confidence model granular at the curator-level, mapping set-level, and community
    feedback-level
 
+Here's a conceptual diagram of SeMRA's architecture:
+
+.. image:: img/architecture.svg
+
 What SeMRA Isn't
 ----------------
 SeMRA isn't a tool for predicting semantic mappings like
@@ -93,15 +98,15 @@ web application for your use-case specific mapping database.
 SeMRA isn't itself a curation tool, but it has the option to integrate :mod:`biomappings`
 in deployments of its local web application for curation purposes.
 
-SeMRA isn't an tool for merging ontologies like `CoMerger <https://arxiv.org/abs/2005.02659>`_,
-but it outputs detailed and comprehensive semantic mappings that are critical
-as input for such tools.
+SeMRA isn't a tool for merging ontologies like `CoMerger <https://arxiv.org/abs/2005.02659>`_
+or `OntoMerger <https://arxiv.org/abs/2206.02238>`_, but it outputs detailed
+and comprehensive semantic mappings that are critical as input for such tools.
 
 Artifacts Overview
 ------------------
 
-SeMRA was used to produce the `SeMRA Raw Mappings Database <https://doi.org/10.5281/zenodo.11082038>`_,
-a comprehensive raw mappings database, and five domain-specific
+SeMRA was used to produce the `SeMRA Raw Semantic Mappings Database <https://doi.org/10.5281/zenodo.11082038>`_,
+a comprehensive raw semantic mappings database, and five domain-specific
 mapping databases (each with a landscape analysis). The results of the
 domain-specific landscape analyses can be found on the SeMRA `GitHub
 repository <https://github.com/biopragmatics/semra/tree/main/notebooks/landscape>`_.
@@ -145,11 +150,13 @@ Table of Contents
     :name: start
 
     installation
-    io
-    usage
-    cli
+    pipeline
     artifacts
+    tutorial
+    struct
+    io
     reference
+    cli
 
 Indices and Tables
 ------------------

diff --git a/docs/source/io.rst b/docs/source/io.rst
@@ -2,3 +2,4 @@ Getting and Writing Semantic Mappings
 =====================================
 
 .. automodapi:: semra.io
+    :no-heading:
diff --git a/docs/source/pipeline.rst b/docs/source/pipeline.rst
@@ -0,0 +1,6 @@
+Mapping Assembly Pipeline
+=========================
+
+.. automodapi:: semra.pipeline
+    :no-heading:
+    :no-inheritance-diagram:
diff --git a/docs/source/reference.rst b/docs/source/reference.rst
@@ -1,16 +1,13 @@
 Reference
 =========
 
-.. automodapi:: semra.struct
-    :skip: Reference
-
-.. automodapi:: semra.pipeline
+This contains several SeMRA submodules with low-level functionality. You can use these
+to build your own mapping processing workflows and I/O.
 
 .. automodapi:: semra.api
+    :no-inheritance-diagram:
 
-
-Constants
----------
+.. automodapi:: semra.inference
 
 .. automodapi:: semra.rules
     :include-all-objects:

diff --git a/docs/source/struct.rst b/docs/source/struct.rst
@@ -0,0 +1,6 @@
+Data Structure
+==============
+
+.. automodapi:: semra.struct
+    :skip: Reference,Triple
+    :no-heading:
diff --git a/docs/source/tutorial.rst b/docs/source/tutorial.rst
@@ -0,0 +1,30 @@
+Prioritizing CURIEs in a Dataframe
+==================================
+
+SeMRA provides tools for data scientists to standardize references using semantic
+mappings.
+
+For example, the drug indications table in ChEMBL contains a variety of references to
+EFO, MONDO, DOID, and other controlled vocabularies (described in detail in `this blog
+post <https://cthoyt.com/2025/04/17/chembl-indications-efo-exploration.html>`_). Using
+SeMRA's pre-constructed `disease and phenotype prioritization mapping
+<https://doi.org/10.5281/zenodo.11091885>`_, these references can be standardized in a
+deterministic and principled way.
+
+.. code-block:: python
+
+    import chembl_downloader
+    import semra
+    from semra.api import prioritize_df
+
+    # A dataframe of indication-disease pairs, where the
+    # "efo_id" column is actually an arbitrary disease or phenotype query
+    df = chembl_downloader.query("SELECT DISTINCT drugind_id, efo_id FROM DRUG_INDICATION")
+
+    # a pre-calculated prioritization of diseases and phenotypes from MONDO, DOID,
+    # HPO, ICD, GARD, and more.
+    url = "https://zenodo.org/records/15164180/files/priority.sssom.tsv?download=1"
+    mappings = semra.from_sssom(url)
+
+    # the dataframe will now have a new column with standardized references
+    prioritize_df(mappings, df, column="efo_id", target_column="priority_indication_curie")
diff --git a/docs/source/usage.rst b/docs/source/usage.rst
diff --git a/scripts/gilda_reprocess.py → notebooks/gilda_reprocess.py b/scripts/gilda_reprocess.py → notebooks/gilda_reprocess.py
@@ -5,18 +5,18 @@
 import pystow
 from gilda import Grounder
 from gilda.grounder import load_entries_from_terms_file
-from gilda.resources import get_grounding_terms, resource_dir
+from gilda.resources import get_grounding_terms
 
 from semra.gilda_utils import (
     GILDA_TO_BIOREGISTRY,
     print_scored_matches,
-    standardize_terms,
-    update_terms,
+    standardize_gilda_terms,
+    update_gilda_terms,
 )
-from semra.pipeline import Configuration, Input, Mutation, get_priority_mappings_from_config
+from semra.pipeline import AssembleReturnType, Configuration, Input, Mutation, assemble
 
 MODULE = pystow.module("semra", "gilda-demo")
-PROCESSED_GILDA_TERMS_PATH = resource_dir.joinpath("grounding_terms_standardized.tsv.gz")
+PROCESSED_GILDA_TERMS_PATH = MODULE.join(name="grounding_terms_standardized.tsv.gz")
 
 PRIORITY = [
     "HP",
@@ -42,6 +42,8 @@
 PRIORITY = [GILDA_TO_BIOREGISTRY[p] for p in PRIORITY]
 
 CONFIGURATION = Configuration(
+    key="gilda",
+    name="Gilda Reprocessing",
     inputs=[
         Input(source="biomappings"),
         Input(source="gilda"),
@@ -72,14 +74,14 @@ def _get_terms() -> list[gilda.Term]:
     from gilda.generate_terms import dump_terms
 
     terms: list[gilda.Term] = list(load_entries_from_terms_file(get_grounding_terms()))
-    terms = standardize_terms(terms)
+    terms = standardize_gilda_terms(terms)
     dump_terms(terms, PROCESSED_GILDA_TERMS_PATH)
     return terms
 
 
-def main():
+def main() -> None:
     """Reprocess the gilda default lexical index."""
-    mappings = get_priority_mappings_from_config(CONFIGURATION)
+    mappings = assemble(CONFIGURATION, return_type=AssembleReturnType.priority)
     if not mappings:
         raise ValueError("Bad mapping priority definition resulted in no mappings")
 
@@ -91,7 +93,7 @@ def main():
     if missing:
         raise ValueError(f"Missing: {sorted(missing)}")
 
-    terms = update_terms(terms, mappings)
+    terms = update_gilda_terms(terms, mappings)
 
     grounder = Grounder(terms)
     s = "Pelvic lipomatosis"
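
Review note: the `PRIORITY` list in this script ranks prefixes, and the priority mappings returned by `assemble` are then used to rewrite Gilda terms. As a rough sketch of the selection idea only, the snippet below assumes the equivalence groups have already been computed and picks the member whose prefix ranks highest. It is not how SeMRA's pipeline is implemented, which also covers inference and filtering. The names `pick_canonical` and `priority_mapping`, the tie-breaking rule, and the CURIEs are hypothetical.

```python
def pick_canonical(group, priority):
    """Return the CURIE in ``group`` whose prefix comes first in ``priority``."""
    rank = {prefix: index for index, prefix in enumerate(priority)}
    candidates = [curie for curie in group if curie.split(":", 1)[0] in rank]
    if not candidates:
        return None  # no member of the group uses a prioritized prefix
    # Ties within the same prefix are broken alphabetically (an assumption).
    return min(candidates, key=lambda curie: (rank[curie.split(":", 1)[0]], curie))


def priority_mapping(groups, priority):
    """Map every non-canonical CURIE to the canonical CURIE of its group."""
    result = {}
    for group in groups:
        canonical = pick_canonical(group, priority)
        if canonical is None:
            continue
        for curie in group:
            if curie != canonical:
                result[curie] = canonical
    return result


# Placeholder prefixes and identifiers, made up for illustration only
priority = ["hp", "mesh"]
groups = [{"mesh:D000001", "hp:0000001", "umls:C0000001"}]
mapping = priority_mapping(groups, priority)
```

An empty result from a step like this is the situation the script guards against with its "Bad mapping priority definition" error.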

diff --git a/notebooks/landscape/anatomy/README.md b/notebooks/landscape/anatomy/README.md
@@ -11,6 +11,24 @@ Charles Tapley Hoyt (orcid:0000-0003-4423-4370)
 </li>
 </ul>
 
+## Reproduction
+
+The SeMRA Anatomy Mappings Database can be rebuilt with the following commands:
+
+```console
+$ git clone https://github.com/biopragmatics/semra.git
+$ cd semra
+$ uv pip install .[landscape]
+$ python -m semra.landscape.anatomy
+```
+
+Note that downloading raw data resources can take on the order of hours to tens
+of hours depending on your internet connection and the reliability of the
+resources' respective servers.
+
+Processing and analysis can be run overnight on commodity hardware (e.g., a 2023
+MacBook Pro with 36GB RAM).
+
 ## Resource Summary
 
 The following resources are represented in processed mappings generated. They

diff --git a/notebooks/landscape/cell/README.md b/notebooks/landscape/cell/README.md
@@ -12,6 +12,25 @@ Charles Tapley Hoyt (orcid:0000-0003-4423-4370)
 </li>
 </ul>
 
+## Reproduction
+
+The SeMRA Cell and Cell Line Mappings Database can be rebuilt with the following
+commands:
+
+```console
+$ git clone https://github.com/biopragmatics/semra.git
+$ cd semra
+$ uv pip install .[landscape]
+$ python -m semra.landscape.cell
+```
+
+Note that downloading raw data resources can take on the order of hours to tens
+of hours depending on your internet connection and the reliability of the
+resources' respective servers.
+
+Processing and analysis can be run overnight on commodity hardware (e.g., a 2023
+MacBook Pro with 36GB RAM).
+
 ## Resource Summary
 
 The following resources are represented in processed mappings generated. They