diff --git a/docs/source/cypher.rst b/docs/source/cypher.rst new file mode 100644 index 00000000..2fa06850 --- /dev/null +++ b/docs/source/cypher.rst @@ -0,0 +1,185 @@ +Querying with Cypher +==================== + +SeMRA constructs data artifacts and docker configuration for locally deploying a Neo4j +graph databases and a web application via :func:`semra.io.write_neo4j` (for example +outputs, see :mod:`semra.database` or :mod:`semra.landscape`). The resulting graph +database can be queried directly with the `Cypher query language +`_ in one of the following +ways: + +1. By connecting with a client via the ``bolt`` protocol on port 7687, which is exposed + in the Dockerfile +2. By navigating to http://localhost:7474 in the web browser to use Neo4j's builtin + graphical front-end, where you can type in Cypher queries and interact with the + results. + +The contents of the grpah database have the following schema: + +.. image:: img/graph-schema.svg + +Below, some example Cypher queries are given to show what is possible by direct querying +of the database. + +Lookup by CURIE +--------------- + +The following Cypher queries allow for looking up concepts, mappings, evidences, and +mapping sets. + +Look up a concept (e.g., a cell line) by its CURIE: + +.. code-block:: cypher + + MATCH (n:concept) + WHERE n.curie = "cellosaurus:0440" + RETURN n + +The same is possible for mappings, evidences, and mapping sets. Each of these three +types of entities has SeMRA-specific CURIE generation. For a mapping: + +.. code-block:: cypher + + MATCH (m:mapping) + WHERE m.curie = "..." + RETURN m + +For an evidence: + +.. code-block:: cypher + + MATCH (e:evidence) + WHERE e.curie = "..." + RETURN e + +For a mapping set: + +.. code-block:: cypher + + MATCH (s:mappingset) + WHERE s.curie = "..." + RETURN s + +Cypher also lets you return certain parts from each record. The list of what fields are +available can be found in the following documentation: + +=========== =============================================== +Concept :data:`semra.io.neo4j_io.CONCEPT_NODES_HEADER` +Mapping :data:`semra.io.neo4j_io.MAPPING_NODES_HEADER` +Evidence :data:`semra.io.neo4j_io.EVIDENCE_NODES_HEADER` +Mapping Set :data:`semra.io.neo4j_io.MAPPING_NODES_HEADER` +=========== =============================================== + +For example, you can look up a concept by its CURIE and return specific parts, such as +the name: + +.. code-block:: cypher + + MATCH (n:concept) + WHERE n.curie = "cellosaurus:0440" + RETURN n.name + +Traversing Mappings +------------------- + +Get all targets for exact match mappings where ``cellosaurus:0440`` is the source: + +.. code-block:: cypher + + MATCH + (source:concept)-[:`skos:exactMatch`]->(target:concept) + WHERE source.curie = "cellosaurus:0440" + RETURN target + +The same query can be reified using ``owl:annotatedSource``, ``owl:annotatedTarget``, +and the ``mapping`` node type: + +.. code-block:: cypher + + MATCH + (m:mapping)-[:`owl:annotatedSource`]->(source:concept) , + (m)-[:`owl:annotatedSource`]->(target:concept) + WHERE source.curie = "cellosaurus:0440" and m.predicate == "skos:exactMatch" + RETURN target + +After reifying, you can extend the query to return evidences. In the interactive view, +returning multiple elements will also automatically show edges between them + +.. code-block:: cypher + + MATCH + (m:mapping)-[:`owl:annotatedSource`]->(source:concept) , + (m)-[:`owl:annotatedSource`]->(target:concept) + (m)-[:hasEvidence]->(e:evidence) + WHERE source.curie = "cellosaurus:0440" and m.predicate == "skos:exactMatch" + RETURN source, target, m, e + +Reification is useful for doing complex filters, e.g., on mapping justification. The +following query returns exact matches to ``cellosaurus:0440`` that have manual mapping +justification + +.. code-block:: cypher + + MATCH + (m:mapping)-[:`owl:annotatedSource`]->(source:concept) , + (m)-[:`owl:annotatedSource`]->(target:concept) + (m)-[:hasEvidence]->(e:evidence) + WHERE + source.curie = "cellosaurus:0440" + and m.predicate == "skos:exactMatch" + and e.mapping_justification == "semapv:ManualMappingCuration" + RETURN target + +The previous query can be reformulated to filter for minimum confidence: + +.. code-block:: cypher + + MATCH + (m:mapping)-[:`owl:annotatedSource`]->(source:concept) , + (m)-[:`owl:annotatedSource`]->(target:concept) + (m)-[:hasEvidence]->(e:evidence) + WHERE + source.curie = "cellosaurus:0440" + and m.predicate == "skos:exactMatch" + and e.confidence > 0.3 + RETURN target + +It can also be extended to return the authors of the evidences: + +.. code-block:: cypher + + MATCH + (m:mapping)-[:`owl:annotatedSource`]->(source:concept) , + (m)-[:`owl:annotatedSource`]->(target:concept) + (m)-[:hasEvidence]->(e:evidence) + (e)-[:hasAuthor]->(author:concept) + WHERE + source.curie = "cellosaurus:0440" + and m.predicate == "skos:exactMatch" + and e.mapping_justification == "semapv:ManualMappingCuration" + RETURN target, author + +The following query gets all mappings (with associated evidences, mapping sets, and +authors) where ``cellosaurus:0440`` is the source, with optional matches for mapping +sets and authors: + +.. code-block:: cypher + + MATCH + (m:mapping)-[:`owl:annotatedSource`]->(source:concept) , + (m:mapping)-[:`owl:annotatedTarget`]->(target:concept) , + (m)-[:hasEvidence]->(e:evidence) + WHERE source.curie = "cellosaurus:0440" + OPTIONAL MATCH + (e)-[:fromSet]->(mset:mappingset) + OPTIONAL MATCH + (e)-[:hasAuthor]->(author:concept) + RETURN source, target, m, e, mset, author + +Neo4j Output Reference +---------------------- + +.. automodapi:: semra.io.neo4j_io + :skip: write_neo4j + :include-all-objects: + :no-heading: diff --git a/docs/source/img/graph-schema.svg b/docs/source/img/graph-schema.svg new file mode 100644 index 00000000..7a2ee15d --- /dev/null +++ b/docs/source/img/graph-schema.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/source/index.rst b/docs/source/index.rst index 96287302..1f70b286 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -157,6 +157,7 @@ Table of Contents io reference cli + cypher Indices and Tables ------------------ diff --git a/src/semra/io/neo4j_io.py b/src/semra/io/neo4j_io.py index 775549bb..decbbc7d 100644 --- a/src/semra/io/neo4j_io.py +++ b/src/semra/io/neo4j_io.py @@ -26,6 +26,16 @@ from ..utils import gzip_path __all__ = [ + "CONCEPT_NODES_HEADER", + "DERIVED_PREDICATE", + "EDGES_HEADER", + "EDGES_SUPPLEMENT_HEADER", + "EVIDENCE_NODES_HEADER", + "FROM_SET_PREDICATE", + "HAS_AUTHOR_PREDICATE", + "HAS_EVIDENCE_PREDICATE", + "MAPPING_NODES_HEADER", + "MAPPING_SET_NODES_HEADER", "write_neo4j", ] @@ -39,7 +49,9 @@ PYTHON = "python3.13" +#: The column headers for the concept nodes in the SeMRA Neo4j graph database export CONCEPT_NODES_HEADER = ["curie:ID", "prefix", "name", "priority:boolean"] +#: The column headers for the mapping nodes in the SeMRA Neo4j graph database export MAPPING_NODES_HEADER = [ "curie:ID", "prefix", @@ -49,6 +61,7 @@ "secondary:boolean", "tertiary:boolean", ] +#: The column headers for evidence nodes in the SeMRA Neo4j graph database export EVIDENCE_NODES_HEADER = [ "curie:ID", "prefix", @@ -65,6 +78,8 @@ "version", "confidence:float", ] + +#: The column headers for properties attached to simple mappings EDGES_HEADER = [ ":START_ID", ":TYPE", @@ -75,7 +90,10 @@ "tertiary:boolean", "mapping_sets:string[]", ] -# for extra edges that aren't mapping edges +#: for extra edges that aren't mapping edges, such as +#: those with :data:`HAS_EVIDENCE_PREDICATE`, +#: :data:`FROM_SET_PREDICATE`, :data:`DERIVED_PREDICATE`, +#: and :data:`HAS_AUTHOR_PREDICATE` EDGES_SUPPLEMENT_HEADER = [ ":START_ID", ":TYPE",