| 1 | +--- |
| 2 | +title: Metrics & CHAOSS |
| 3 | +slug: /concepts/metrics-and-chaoss |
| 4 | +--- |
| 5 | + |
| 6 | +# Metrics & CHAOSS |
| 7 | + |
| 8 | +Open Pulse turns raw git and GitHub activity into a stream of |
| 9 | +[CHAOSS](https://chaoss.community/) metrics — community health |
| 10 | +indicators that go beyond stars and citations and capture how a project |
| 11 | +actually behaves over time. The full |
| 12 | +[GrimoireLab](https://chaoss.github.io/grimoirelab/) stack ships with the |
| 13 | +deployment as an opt-in overlay; this page is a tour of what is wired up and |
| 14 | +how to query it. |
| 15 | + |
| 16 | +## What is CHAOSS? |
| 17 | + |
| 18 | +CHAOSS (Community Health Analytics in Open Source Software) is a Linux |
| 19 | +Foundation project that publishes a catalogue of metrics, metrics models |
| 20 | +and reference implementations for measuring open-source community |
| 21 | +health. The metrics relevant here fall into four broad areas: |
| 22 | + |
| 23 | +- **Project activity** — commit frequency, PR velocity, issue activity, |
| 24 | + release cadence. |
| 25 | +- **Responsiveness** — issue resolution time, PR review time, first |
| 26 | + response time. |
| 27 | +- **Diversity & inclusion** — contributor count, organisational |
| 28 | + diversity, new-contributor rate, bus factor. |
| 29 | +- **Risk & sustainability** — dependency health, license compliance, |
| 30 | + project age and stability. |
| 31 | + |
| 32 | +GrimoireLab is one of the CHAOSS project's own toolsets (alongside |
| 33 | +Augur), and Open Pulse runs it in full. |
| 34 | + |
| 35 | +## The Open Pulse metrics stack |
| 36 | + |
| 37 | +```mermaid |
| 38 | +flowchart LR |
| 39 | + subgraph collect [Collect] |
| 40 | + GH[GitHub API] |
| 41 | + GIT[Git repos] |
| 42 | + end |
| 43 | + |
| 44 | + subgraph orchestrate [Orchestrate] |
| 45 | + M[Mordred<br/>open-pulse-mordred] |
| 46 | + P[Perceval<br/>collectors] |
| 47 | + end |
| 48 | + |
| 49 | + subgraph store [Store] |
| 50 | + OS[(OpenSearch<br/>raw + enriched)] |
| 51 | + SH[(SortingHat<br/>identities)] |
| 52 | + end |
| 53 | + |
| 54 | + subgraph apply [Configure] |
| 55 | + A[Projects applier<br/>localhost:1235] |
| 56 | + end |
| 57 | +
|
| 58 | + subgraph view [View] |
| 59 | + D[OpenSearch Dashboards<br/>localhost:5601 / 7508] |
| 60 | + PY[Python notebooks] |
| 61 | + end |
| 62 | +
|
| 63 | + GH --> P |
| 64 | + GIT --> P |
| 65 | + M --> P |
| 66 | + P --> OS |
| 67 | + P --> SH |
| 68 | + A -. projects.json .-> M |
| 69 | + OS --> D |
| 70 | + OS --> PY |
| 71 | +``` |
| 72 | + |
| 73 | +The service boxes in that diagram are backed by real containers from |
| 74 | +`docker-compose.grimoirelab.yml` (the opt-in overlay enabled by |
| 75 | +`op deploy up --with-grimoire`); the sources and notebooks are external. |
| 76 | + |
| 77 | +### Components |
| 78 | + |
| 79 | +| Component | Role | Endpoint (host) | |
| 80 | +| --------------------- | --------------------------------------------------------------------------------------------- | ------------------------------------------ | |
| 81 | +| `open-pulse-mordred` | Orchestrator. Reads a `projects.json`, drives Perceval collectors on a schedule, ships docs to OpenSearch. | (no HTTP surface) | |
| 82 | +| `opensearch-node1` | Storage for raw + enriched documents. | `https://localhost:9200` | |
| 83 | +| `opensearch-dashboards` | Browse + dashboard editor. | `http://localhost:5601` | |
| 84 | +| `nginx` | TLS-terminating reverse proxy for the dashboards. | `http://localhost:7508` | |
| 85 | +| `sortinghat` + `sortinghat_worker` | Author identity unification (one person, many emails). | (internal, REST + worker) | |
| 86 | +| `mariadb` | Backing store for SortingHat + Mordred state. | (internal) | |
| 87 | +| `valkey` | Queue between SortingHat and its worker. | (internal) | |
| 88 | +| `projects-applier` | Receives a fresh `projects.json` over HTTP and atomically swaps it into Mordred. | `http://localhost:1235` | |
| 89 | + |
| 90 | +### Today's data |
| 91 | + |
| 92 | +A live deployment after a few crawl cycles typically contains: |
| 93 | + |
| 94 | +| Index | Docs | What it holds | |
| 95 | +| ----------------------------- | ---------:| ---------------------------------------------- | |
| 96 | +| `git_demo_raw` | 1.7M | Raw git log events from every tracked repo | |
| 97 | +| `git_demo_enriched` | 44K+ | Per-commit enriched docs with 181 CHAOSS fields| |
| 98 | +| `git-aoc_demo_enriched` | 572K | Areas of code (touched-file aggregation) | |
| 99 | +| `git-onion_demo_enriched_*` | 15K | "Onion model" tiers (core / regular / casual) | |
| 100 | + |
| 101 | +The enriched index covers ~193 distinct repositories across 9 |
| 102 | +GrimoireLab projects (one per organisation in |
| 103 | +[the graph](graph-and-semantic-data.md)). |
| 104 | + |
| 105 | +## Feeding the pipeline |
| 106 | + |
| 107 | +Open Pulse keeps the GrimoireLab project configuration in sync with the |
| 108 | +SPARQL store, so the repositories indexed for CHAOSS metrics are always the |
| 109 | +same set that appears in the graph and the RDF store. Three |
| 110 | +CLI commands manage this: |
| 111 | + |
| 112 | +```bash |
| 113 | +# Build a fresh projects.json from the SPARQL store, write it locally. |
| 114 | +open-pulse services grimoire prepare-config |
| 115 | + |
| 116 | +# Same, then POST it to the projects-applier so Mordred picks it up. |
| 117 | +open-pulse services grimoire apply |
| 118 | + |
| 119 | +# Install a cron job that watches a git repo and re-applies on change. |
| 120 | +open-pulse services grimoire install-watcher |
| 121 | +``` |
| 122 | + |
| 123 | +The applier exposes three HTTP endpoints if you need to drive it from |
| 124 | +outside the CLI: |
| 125 | + |
| 126 | +- `GET /healthz` — liveness probe. |
| 127 | +- `GET /current` — the projects.json currently in effect. |
| 128 | +- `POST /apply` — submit a new projects.json (authenticated). |
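
A minimal sketch of driving the three endpoints above from Python, using only the standard library. The page does not say how `POST /apply` authenticates, so the bearer-token header is an assumption. The helper names `fetch` and `build_apply_request` are hypothetical, and the sample payload follows the usual GrimoireLab `projects.json` layout (project name, then data source, then a list of URLs).

```python
import json
import urllib.request

APPLIER_URL = "http://localhost:1235"  # host endpoint from the Components table


def fetch(path: str, base_url: str = APPLIER_URL, timeout: float = 5.0):
    """GET a path on the applier and return (HTTP status, body text)."""
    with urllib.request.urlopen(f"{base_url}{path}", timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")


def build_apply_request(projects: dict, token: str,
                        base_url: str = APPLIER_URL) -> urllib.request.Request:
    """Build, but do not send, the POST /apply request for a projects.json."""
    return urllib.request.Request(
        url=f"{base_url}/apply",
        data=json.dumps(projects).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth; check the applier's real scheme.
            "Authorization": f"Bearer {token}",
        },
    )


# Made-up sample payload in the GrimoireLab projects.json layout.
sample_projects = {"demo": {"git": ["https://example.org/demo/repo.git"]}}
request = build_apply_request(sample_projects, token="changeme")

# Against a running stack:
# print(fetch("/healthz"))
# print(fetch("/current"))
# urllib.request.urlopen(request, timeout=10)
```

Building the request separately from sending it makes the payload and headers easy to inspect before anything reaches Mordred.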
| 129 | + |
| 130 | +## CHAOSS document shape |
| 131 | + |
| 132 | +Each commit lands in `git_demo_enriched` as a document with about 181 fields. |
| 133 | +The field names most useful when writing queries or notebooks: |
| 134 | + |
| 135 | +- **Identity (post-SortingHat unification):** `Author_uuid`, |
| 136 | + `Author_user_name`, `Author_org_name`, `Author_multi_org_names`, |
| 137 | + `Author_bot`, `Author_gender`. The same family of fields exists with the |
| 138 | + `Commit_` prefix for the committer. |
| 139 | +- **Raw author/committer:** lowercase `author_name`, `author_domain`, |
| 140 | + `committer_name`, etc. |
| 141 | +- **Dates:** `author_date`, `author_date_hour`, `author_date_weekday`, |
| 142 | + `commit_date`. |
| 143 | +- **Repo:** `origin` (git URL), `project` (Open Pulse project slug), |
| 144 | + `repo_name`. |
| 145 | +- **Change shape:** `files`, `lines_added`, `lines_removed`, |
| 146 | + `lines_changed`, file-level child docs. |
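
As an illustration of how these fields combine, here is a sketch of a request body that counts commits and distinct authors per organisation for one project, with bots excluded. It assumes `project`, `Author_org_name` and `Author_uuid` are mapped as keyword fields and `Author_bot` as a boolean, which is worth checking against the live index mapping. The function name `commits_by_org_query` is hypothetical.

```python
def commits_by_org_query(project: str, since: str = "now-12M/M",
                         top_n: int = 10) -> dict:
    """Build a search body: commits and distinct authors per organisation."""
    return {
        "size": 0,  # aggregations only, no individual hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"project": project}},
                    {"range": {"author_date": {"gte": since}}},
                ],
                # Assumption: Author_bot is a boolean set by SortingHat.
                "must_not": [{"term": {"Author_bot": True}}],
            }
        },
        "aggs": {
            "by_org": {
                "terms": {"field": "Author_org_name", "size": top_n},
                "aggs": {
                    # Approximate count of unified identities per organisation.
                    "authors": {"cardinality": {"field": "Author_uuid"}}
                },
            }
        },
    }


body = commits_by_org_query("demo")
# With a connected opensearch-py client:
# result = os_client.search(index="git_demo_enriched", body=body)
```

Counting `Author_uuid` rather than `author_name` means one person with several email addresses is counted once.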
| 147 | + |
| 148 | +## Accessing metrics |
| 149 | + |
| 150 | +### OpenSearch Dashboards |
| 151 | + |
| 152 | +The browse UI lives at |
| 153 | +[http://localhost:5601](http://localhost:5601) (direct) or |
| 154 | +[http://localhost:7508](http://localhost:7508) (via the nginx |
| 155 | +reverse-proxy). Both surfaces present the same dashboards. Auth uses the |
| 156 | +`OPENSEARCH_INITIAL_ADMIN_PASSWORD` set in `infra/.env`. |
| 157 | + |
| 158 | +### Python |
| 159 | + |
| 160 | +OpenSearch exposes an Elasticsearch-compatible REST API; `opensearch-py` is |
| 161 | +the most straightforward client: |
| 162 | + |
| 163 | +```python |
| 164 | +from opensearchpy import OpenSearch |
| 165 | +import os |
| 166 | + |
| 167 | +os_client = OpenSearch( |
| 168 | + hosts=[{"host": "localhost", "port": 9200, "scheme": "https"}], |
| 169 | + http_auth=("admin", os.environ["OPENSEARCH_INITIAL_ADMIN_PASSWORD"]), |
| 170 | + verify_certs=False,  # skips TLS verification; acceptable only for local use |
| 171 | +) |
| 172 | + |
| 173 | +# Commits per month across all tracked repos |
| 174 | +body = { |
| 175 | + "size": 0, |
| 176 | + "aggs": { |
| 177 | + "monthly": { |
| 178 | + "date_histogram": { |
| 179 | + "field": "author_date", |
| 180 | + "calendar_interval": "month", |
| 181 | + } |
| 182 | + } |
| 183 | + }, |
| 184 | +} |
| 185 | +result = os_client.search(index="git_demo_enriched", body=body) |
| 186 | +for bucket in result["aggregations"]["monthly"]["buckets"]: |
| 187 | + print(bucket["key_as_string"], bucket["doc_count"]) |
| 188 | +``` |
| 189 | + |
| 190 | +For dataframes: |
| 191 | + |
| 192 | +```python |
| 193 | +import pandas as pd |
| 194 | + |
| 195 | +buckets = result["aggregations"]["monthly"]["buckets"] |
| 196 | +df = pd.DataFrame(buckets) |
| 197 | +df["date"] = pd.to_datetime(df["key_as_string"]) |
| 198 | +df.set_index("date")["doc_count"].plot(title="Commits per month") |
| 199 | +``` |
| 200 | + |
| 201 | +### Worked queries |
| 202 | + |
| 203 | +**Top 10 most active repositories by commit count (last 12 months).** |
| 204 | + |
| 205 | +```json |
| 206 | +GET /git_demo_enriched/_search |
| 207 | +{ |
| 208 | + "size": 0, |
| 209 | + "query": { |
| 210 | + "range": { "author_date": { "gte": "now-12M/M" } } |
| 211 | + }, |
| 212 | + "aggs": { |
| 213 | + "by_repo": { |
| 214 | + "terms": { "field": "repo_name", "size": 10 } |
| 215 | + } |
| 216 | + } |
| 217 | +} |
| 218 | +``` |
| 219 | + |
| 220 | +**Bus factor proxy: number of authors who together produced 80% of commits.** |
| 221 | + |
| 222 | +CHAOSS defines bus factor as the smallest number of contributors who make |
| 223 | +50% of contributions; the 80% cut here is a variant of the same idea. Take |
| 224 | +per-author commit counts from a terms aggregation on `Author_uuid`, then walk |
| 225 | +the cumulative share; `Author_multi_org_names` gives the organisation view. |
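
The cumulative walk can be sketched in a few lines of Python. This is an illustration, not Open Pulse's own implementation: the helper name `contributor_absence_factor` is hypothetical, the sample buckets are made up, and the aggregation assumes `Author_uuid` is a keyword field and `Author_bot` a boolean.

```python
from typing import Iterable


def contributor_absence_factor(commit_counts: Iterable[int],
                               threshold: float = 0.5) -> int:
    """Smallest number of authors whose commits reach `threshold` of the total."""
    counts = sorted((c for c in commit_counts if c > 0), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0
    running = 0
    for n_authors, count in enumerate(counts, start=1):
        running += count
        if running >= threshold * total:
            return n_authors
    return len(counts)


# One bucket per unified identity, bots excluded. `size` caps how many
# authors come back, so raise it for very large projects.
body = {
    "size": 0,
    "query": {"bool": {"must_not": [{"term": {"Author_bot": True}}]}},
    "aggs": {"by_author": {"terms": {"field": "Author_uuid", "size": 10000}}},
}

# With a connected opensearch-py client:
# result = os_client.search(index="git_demo_enriched", body=body)
# buckets = result["aggregations"]["by_author"]["buckets"]
buckets = [  # made-up sample data standing in for a real response
    {"key": "a", "doc_count": 50},
    {"key": "b", "doc_count": 30},
    {"key": "c", "doc_count": 15},
    {"key": "d", "doc_count": 5},
]
counts = [b["doc_count"] for b in buckets]
print(contributor_absence_factor(counts, threshold=0.8))  # prints 2
```

With the sample data, the top two authors account for 80 of 100 commits, so the 80% cut returns 2; the default 50% threshold returns 1.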
| 226 | + |
| 227 | +**License compliance.** Pair OpenSearch counts with the SPARQL store's |
| 228 | +`schema:license` property; see |
| 229 | +[Metadata & Ontology](metadata-and-ontology.md) for the join. |
| 230 | + |
| 231 | +## Cross-layer questions |
| 232 | + |
| 233 | +Some CHAOSS-style questions naturally cross layers: |
| 234 | + |
| 235 | +| Question | Layers | |
| 236 | +| ---------------------------------------------------------- | ---------------------------- | |
| 237 | +| "Which permissive-licensed repos have the highest contributor diversity?" | SPARQL (license) + OpenSearch (authors) | |
| 238 | +| "Show contributor flow between two collaborating orgs over time" | Neo4j (org links) + OpenSearch (dates) | |
| 239 | +| "Rank repos by activity normalised by repo age" | OpenSearch + SPARQL (`schema:dateCreated`) | |
| 240 | + |
| 241 | +Pipeline steps use the |
| 242 | +[Services](../services/index.md) container to call the relevant clients; |
| 243 | +notebooks can compose queries directly as shown above. |
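
For the third question in the table above, the join itself is small once both layers have been queried. The sketch below assumes you already hold per-repo commit counts (from an OpenSearch terms aggregation) and per-repo creation dates (from `schema:dateCreated` in the SPARQL store), both keyed by the same repository identifier. The function name and sample values are hypothetical.

```python
from datetime import date


def commits_per_year_of_age(commit_counts: dict, created: dict,
                            today: date) -> list:
    """Rank repos by total commits divided by repo age in years."""
    ranked = []
    for repo, commits in commit_counts.items():
        if repo not in created:
            continue  # no creation date in the SPARQL layer; skip the repo
        age_days = (today - created[repo]).days
        if age_days <= 0:
            continue  # guards against future or same-day creation dates
        ranked.append((repo, commits / (age_days / 365.25)))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Made-up sample inputs standing in for the two query results.
commit_counts = {"repo-a": 100, "repo-b": 100}
created = {"repo-a": date(2020, 1, 1), "repo-b": date(2023, 1, 1)}

for repo, rate in commits_per_year_of_age(commit_counts, created, date(2024, 1, 1)):
    print(f"{repo}: {rate:.1f} commits per year")
```

Both sample repos have the same commit count, so the younger `repo-b` ranks first. In practice the join key is whatever both layers share, most plausibly the git URL stored in `origin`.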