| 1 | +--- |
| 2 | +title: Graph & Semantic Data |
| 3 | +slug: /concepts/graph-and-semantic-data |
| 4 | +--- |
| 5 | + |
| 6 | +# Graph & Semantic Data |
| 7 | + |
| 8 | +Open Pulse stores its data in two complementary layers, each tuned for a |
| 9 | +different kind of question: |
| 10 | + |
| 11 | +- **Neo4j property graph** — fast traversal of the collaboration |
| 12 | + network. Users, repositories and organizations connected by |
| 13 | + contribution, ownership, membership and fork links. |
| 14 | +- **SPARQL store** (`sparql_store` — typically Oxigraph, but any |
| 15 | + SPARQL 1.1 + Graph Store HTTP Protocol backend works) — semantically |
| 16 | + rich metadata about each repository, person and organization, modelled |
| 17 | + with [the Open Pulse vocabulary](metadata-and-ontology.md). |
| 18 | + |
| 19 | +Both layers are produced by the same pipeline and are served side by
| 20 | +side, so each kind of question can go to the backend that suits it.
| 21 | + |
| 22 | +```mermaid |
| 23 | +flowchart LR |
| 24 | + C[Open Pulse Crawler] --> N[(Neo4j<br/>property graph)] |
| 25 | + N --> M[git-metadata-extractor] |
| 26 | + M -->|JSON-LD| Q[Quest step:<br/>sparql_upload] |
| 27 | + Q --> S[(SPARQL store<br/>sparql_store)] |
| 28 | + N -. fast traversal .-> A1[Network analysis<br/>centrality, communities] |
| 29 | + S -. semantic query .-> A2[Metadata queries<br/>license, discipline, FAIR] |
| 30 | +``` |
| 31 | + |
| 32 | +## Neo4j: the community network |
| 33 | + |
| 34 | +### Schema |
| 35 | + |
| 36 | +```mermaid |
| 37 | +graph LR |
| 38 | + U((User)) -- CONTRIBUTES_TO --> R((Repo)) |
| 39 | + O((Org)) -- OWNS --> R |
| 40 | + U -- MEMBER_OF --> O |
| 41 | + R -- FORK_OF --> R2((Repo)) |
| 42 | +``` |
| 43 | + |
| 44 | +The schema has three node labels and four relationship types. The
| 45 | +nodes carry the following property keys:
| 46 | + |
| 47 | +| Label | Properties | |
| 48 | +| ------ | ---------------------------------------------------------------------------- | |
| 49 | +| `Repo` | `id`, `name`, `full_name`, `owner`, `is_explored`, `exploration_timestamp` | |
| 50 | +| `User` | `id`, `login`, `name`, `type`, `is_explored`, `exploration_timestamp` | |
| 51 | +| `Org` | `id`, `login`, `name`, `type`, `is_explored`, `exploration_timestamp` | |
| 52 | + |
| 53 | +### Cypher examples |
| 54 | + |
| 55 | +Each snippet below runs against the live Neo4j instance |
| 56 | +(`bolt://localhost:7504` from the host, `bolt://neo4j:7687` inside the |
| 57 | +stack). |
| 58 | + |
| 59 | +**Top contributors by repository breadth.** |
| 60 | + |
| 61 | +```cypher |
| 62 | +MATCH (u:User)-[:CONTRIBUTES_TO]->(r:Repo) |
| 63 | +RETURN u.login AS user, count(r) AS repos |
| 64 | +ORDER BY repos DESC |
| 65 | +LIMIT 10 |
| 66 | +``` |
| 67 | + |
| 68 | +**All repositories an organization owns.** |
| 69 | + |
| 70 | +```cypher |
| 71 | +MATCH (o:Org {login: "sdsc-ordes"})-[:OWNS]->(r:Repo) |
| 72 | +RETURN r.full_name AS repo |
| 73 | +ORDER BY repo |
| 74 | +``` |
| 75 | + |
| 76 | +**Other repositories that contributors to a given repo also work on.**
| 77 | + |
| 78 | +```cypher |
| 79 | +MATCH (u:User)-[:CONTRIBUTES_TO]->(r1:Repo {full_name: "sdsc-ordes/gimie"}), |
| 80 | + (u)-[:CONTRIBUTES_TO]->(r2:Repo) |
| 81 | +WHERE r2.full_name <> r1.full_name |
| 82 | +RETURN u.login AS user, collect(DISTINCT r2.full_name) AS also_contributes_to |
| 83 | +ORDER BY size(also_contributes_to) DESC |
| 84 | +LIMIT 10 |
| 85 | +``` |
| 86 | + |
| 87 | +**Repositories with the most forks in the graph.**
| 88 | + |
| 89 | +```cypher |
| 90 | +MATCH (fork:Repo)-[:FORK_OF]->(parent:Repo) |
| 91 | +RETURN parent.full_name AS repo, count(fork) AS forks |
| 92 | +ORDER BY forks DESC |
| 93 | +LIMIT 10 |
| 94 | +``` |
| 95 | + |
| 96 | +**Shortest collaboration path between two users.** |
| 97 | + |
| 98 | +```cypher |
| 99 | +MATCH p = shortestPath( |
| 100 | +  (a:User {login: "caviri"})-[:CONTRIBUTES_TO|MEMBER_OF*..6]-(b:User {login: "cmdoret"})
| 101 | +) |
| 102 | +RETURN [n IN nodes(p) | coalesce(n.login, n.full_name)] AS hops |
| 103 | +``` |
| 104 | + |
| 105 | +### Neo4j Browser |
| 106 | + |
| 107 | +A graphical Cypher console is available at |
| 108 | +[http://localhost:7503](http://localhost:7503). Authentication uses the |
| 109 | +`NEO4J_AUTH` credentials from `infra/.env` (default user: `neo4j`). |
| 110 | + |
| 111 | +## SPARQL store: semantic queries |
| 112 | + |
| 113 | +The same entities exist in the SPARQL store as RDF resources, modelled |
| 114 | +with a small custom vocabulary plus schema.org and the W3C Organization |
| 115 | +and Time ontologies. See |
| 116 | +[Metadata & Ontology](metadata-and-ontology.md) for the vocabulary |
| 117 | +reference and SPARQL examples. |
| 118 | + |
| 119 | +### When to use which layer |
| 120 | + |
| 121 | +| Question shape | Best layer | |
| 122 | +| ------------------------------------------------------- | --------------------- | |
| 123 | +| "Shortest path between two contributors" | Neo4j (graph algos) | |
| 124 | +| "Centrality / community detection / PageRank" | Neo4j + GDS plugin | |
| 125 | +| "Which repos are MIT-licensed and written in Python?" | SPARQL store | |
| 126 | +| "All people whose membership in `sdsc-ordes` is still open" | SPARQL store | |
| 127 | +| "Repositories enriched with linked external IDs (ORCID, …)" | SPARQL store | |
| 128 | +| "Aggregate contribution counts per discipline" | SPARQL store | |
| 129 | + |
| 130 | +The same repository appears in both layers: a `Repo` node in Neo4j |
| 131 | +(identified by `full_name`) maps to a `schema:SoftwareSourceCode` |
| 132 | +resource in the SPARQL store (identified by |
| 133 | +`op:githubRepositoryHandle`). |
| 134 | + |
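The mapping above can be sketched as a small helper that turns Neo4j `full_name` values into a SPARQL `VALUES` clause. This is a minimal sketch, not part of Open Pulse: the function names `sparql_string_literal` and `handles_values_clause` are hypothetical, and it assumes (as the notebook example on this page does) that `op:githubRepositoryHandle` is stored as a plain string literal equal to `full_name`.

```python
def sparql_string_literal(value: str) -> str:
    """Quote a Python string as a SPARQL string literal.

    Hypothetical helper: escapes backslashes, double quotes and line
    breaks so the value cannot terminate the literal early.
    """
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def handles_values_clause(full_names, variable="handle"):
    """Build a VALUES clause binding ?variable to each Neo4j full_name.

    Assumption: the handle in the SPARQL store is a plain literal that
    equals the Repo node's full_name.
    """
    # Deduplicate and sort so the generated query text is stable.
    literals = " ".join(sparql_string_literal(n) for n in sorted(set(full_names)))
    return f"VALUES ?{variable} {{ {literals} }}"
```

Escaping the values before interpolating them keeps a stray quote in a repository name from breaking the query.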
| 135 | +## Cross-layer joins from Python |
| 136 | + |
| 137 | +Pipeline steps and notebooks talk to both layers through the |
| 138 | +[Services](../services/index.md) container. Outside the pipeline, the |
| 139 | +two endpoints can be queried side-by-side from a notebook: |
| 140 | + |
| 141 | +```python |
| 142 | +from neo4j import GraphDatabase |
| 143 | +from SPARQLWrapper import SPARQLWrapper, JSON |
| 144 | + |
| 145 | +neo = GraphDatabase.driver("bolt://localhost:7504", auth=("neo4j", "<password>")) |
| 146 | +sparql = SPARQLWrapper("http://localhost:7502/query") |
| 147 | +sparql.setReturnFormat(JSON) |
| 148 | + |
| 149 | +# 1. Graph traversal in Neo4j |
| 150 | +with neo.session() as s: |
| 151 | + repos = [r["full_name"] for r in s.run( |
| 152 | + "MATCH (:Org {login: 'sdsc-ordes'})-[:OWNS]->(r:Repo) RETURN r.full_name AS full_name" |
| 153 | + )] |
| 154 | + |
| 155 | +# 2. Enrich with semantic metadata from the SPARQL store |
| 156 | +values = " ".join(f'"{h}"' for h in repos) |
| 157 | +sparql.setQuery(f""" |
| 158 | + PREFIX op: <https://open-pulse.epfl.ch/ontology#> |
| 159 | + PREFIX schema: <http://schema.org/> |
| 160 | + SELECT ?handle ?license ?language WHERE {{ |
| 161 | + VALUES ?handle {{ {values} }} |
| 162 | + ?repo op:githubRepositoryHandle ?handle . |
| 163 | + OPTIONAL {{ ?repo schema:license ?license }} |
| 164 | + OPTIONAL {{ ?repo schema:programmingLanguage ?language }} |
| 165 | + }} |
| 166 | +""") |
| 167 | +rows = sparql.query().convert()["results"]["bindings"] |
| 168 | +``` |
| 169 | + |
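Because the query uses two `OPTIONAL` patterns, a repository with several languages or licenses comes back as several rows, and unbound variables are simply missing from a row. One way to fold the bindings into a single record per handle is sketched below. The helper name `group_bindings` and the output shape are assumptions for illustration. The input shape follows the standard SPARQL 1.1 JSON results format.

```python
from collections import defaultdict


def group_bindings(rows):
    """Collapse SPARQL JSON bindings into one record per repository handle.

    Hypothetical helper: `rows` is the list found under
    ["results"]["bindings"], where each bound variable maps to a dict
    with at least a "value" key.
    """
    grouped = defaultdict(lambda: {"licenses": set(), "languages": set()})
    for row in rows:
        record = grouped[row["handle"]["value"]]
        # OPTIONAL variables are absent from a binding when unbound.
        if "license" in row:
            record["licenses"].add(row["license"]["value"])
        if "language" in row:
            record["languages"].add(row["language"]["value"])
    return dict(grouped)
```

Calling `group_bindings(rows)` on the result above would give a dictionary keyed by handle, which can be joined back to the `repos` list from Neo4j.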
| 170 | +## Where each backend runs |
| 171 | + |
| 172 | +| Service | Inside the stack | From the host | |
| 173 | +| -------------- | --------------------------------------- | ---------------------------------- | |
| 174 | +| Neo4j Bolt | `bolt://neo4j:7687` | `bolt://localhost:7504` | |
| 175 | +| Neo4j Browser | `http://neo4j:7474` | `http://localhost:7503` | |
| 176 | +| SPARQL endpoint| `http://sparql-proxy:7878/query` | `http://localhost:7502/query` | |
| 177 | + |
| 178 | +Host ports can shift if you customise `infra/.env` — `op deploy ps` |
| 179 | +shows the live mapping. |