Commit 2eadab1

authored
Merge pull request #47 from sdsc-ordes/feat/docs-content-pr-c-metrics-usecases
docs(content): PR C — metrics-and-chaoss + use-cases
2 parents b0dc797 + 2e15e9a commit 2eadab1

3 files changed

Lines changed: 454 additions & 1 deletion

Lines changed: 243 additions & 0 deletions
@@ -0,0 +1,243 @@
---
title: Metrics & CHAOSS
slug: /concepts/metrics-and-chaoss
---

# Metrics & CHAOSS

Open Pulse turns raw git and GitHub activity into a stream of
[CHAOSS](https://chaoss.community/) metrics: community health
indicators that go beyond stars and citations and capture how a project
actually behaves over time. The full
[GrimoireLab](https://chaoss.github.io/grimoirelab/) stack runs as part
of the default deployment; this page is a tour of what is wired up and
how to query it.

## What is CHAOSS?

CHAOSS (Community Health Analytics in Open Source Software) is a Linux
Foundation project that publishes a catalogue of metrics, metrics models
and reference implementations for measuring open-source community
health. The catalogue covers four focus areas:

- **Project activity:** commit frequency, PR velocity, issue activity,
  release cadence.
- **Responsiveness:** issue resolution time, PR review time, first
  response time.
- **Diversity & inclusion:** contributor count, organisational
  diversity, new-contributor rate, bus factor.
- **Risk & sustainability:** dependency health, license compliance,
  project age and stability.

The reference implementation is GrimoireLab, which Open Pulse runs in
full.

## The Open Pulse metrics stack

```mermaid
flowchart LR
    subgraph collect [Collect]
        GH[GitHub API]
        GIT[Git repos]
    end

    subgraph orchestrate [Orchestrate]
        M[Mordred<br/>open-pulse-mordred]
        P[Perceval<br/>collectors]
    end

    subgraph store [Store]
        OS[(OpenSearch<br/>raw + enriched)]
        SH[(SortingHat<br/>identities)]
    end

    subgraph apply [Configure]
        A[Projects applier<br/>localhost:1235]
    end

    subgraph view [View]
        D[OpenSearch Dashboards<br/>localhost:5601 / 7508]
        PY[Python notebooks]
    end

    GH --> P
    GIT --> P
    M --> P
    P --> OS
    P --> SH
    A -. projects.json .-> M
    OS --> D
    OS --> PY
```

Every box in that diagram is a real container shipped by
`docker-compose.grimoirelab.yml` (the opt-in overlay enabled by
`op deploy up --with-grimoire`).

### Components

| Component | Role | Endpoint (host) |
| --- | --- | --- |
| `open-pulse-mordred` | Orchestrator. Reads a `projects.json`, drives Perceval collectors on a schedule, ships docs to OpenSearch. | (no HTTP surface) |
| `opensearch-node1` | Storage for raw + enriched documents. | `https://localhost:9200` |
| `opensearch-dashboards` | Browse + dashboard editor. | `http://localhost:5601` |
| `nginx` | TLS-terminating reverse proxy for the dashboards. | `http://localhost:7508` |
| `sortinghat` + `sortinghat_worker` | Author identity unification (one person, many emails). | (internal, REST + worker) |
| `mariadb` | Backing store for SortingHat + Mordred state. | (internal) |
| `valkey` | Queue between SortingHat and its worker. | (internal) |
| `projects-applier` | Receives a fresh `projects.json` over HTTP and atomically swaps it into Mordred. | `http://localhost:1235` |

### Today's data

A live deployment after a few crawl cycles typically contains:

| Index | Docs | What it holds |
| --- | ---: | --- |
| `git_demo_raw` | 1.7M | Raw git log events from every tracked repo |
| `git_demo_enriched` | 44K+ | Per-commit enriched docs with 181 CHAOSS fields |
| `git-aoc_demo_enriched` | 572K | Areas of code (touched-file aggregation) |
| `git-onion_demo_enriched_*` | 15K | "Onion model" tiers (core / regular / casual) |

The enriched index covers ~193 distinct repositories across 9
GrimoireLab projects (one per organisation in
[the graph](graph-and-semantic-data.md)).

## Feeding the pipeline

Open Pulse keeps the GrimoireLab project configuration in sync with the
SPARQL store, so the repositories indexed for CHAOSS metrics are always
the same set that appears in the graph and the RDF store. Three CLI
commands manage this:

```bash
# Build a fresh projects.json from the SPARQL store, write it locally.
open-pulse services grimoire prepare-config

# Same, then POST it to the projects-applier so Mordred picks it up.
open-pulse services grimoire apply

# Install a cron job that watches a git repo and re-applies on change.
open-pulse services grimoire install-watcher
```

The applier exposes three HTTP endpoints if you need to drive it from
outside the CLI:

- `GET /healthz`: liveness probe.
- `GET /current`: the `projects.json` currently in effect.
- `POST /apply`: submit a new `projects.json` (authenticated).
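
As a rough illustration of driving the applier without the CLI, the sketch below builds a `projects.json` payload and the matching `POST /apply` request using only the standard library. The payload shape (project name, then data source, then a list of repository URLs) follows the upstream GrimoireLab convention and may differ from what `prepare-config` emits. The bearer-token header and both helper functions are assumptions, since this page does not specify the authentication scheme.

```python
import json
import urllib.request

APPLIER_URL = "http://localhost:1235"  # host endpoint from the components table


def build_projects(repos_by_project):
    """Map {project: [git URLs]} to a GrimoireLab-style projects.json dict."""
    return {
        project: {"git": sorted(urls)}
        for project, urls in repos_by_project.items()
    }


def build_apply_request(projects, token, base_url=APPLIER_URL):
    """Build (but do not send) a POST /apply request.

    ASSUMPTION: the applier accepts a JSON body and a bearer token.
    """
    return urllib.request.Request(
        url=f"{base_url}/apply",
        data=json.dumps(projects).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Hypothetical project and repository, for illustration only.
projects = build_projects(
    {"example-org": ["https://github.com/example-org/example-repo.git"]}
)
request = build_apply_request(projects, token="hypothetical-token")
```

Nothing is sent until `urllib.request.urlopen(request)` is called, so the request can be inspected first, or compared against `GET /current`.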

## CHAOSS document shape

Each commit lands in `git_demo_enriched` as a ~181-field document.
Field names worth knowing when writing queries or notebooks:

- **Identity (post-SortingHat unification):** `Author_uuid`,
  `Author_user_name`, `Author_org_name`, `Author_multi_org_names`,
  `Author_bot`, `Author_gender`. The same family of fields exists with
  the `Commit_` prefix.
- **Raw author/committer:** lowercase `author_name`, `author_domain`,
  `committer_name`, etc.
- **Dates:** `author_date`, `author_date_hour`, `author_date_weekday`,
  `commit_date`.
- **Repo:** `origin` (git URL), `project` (Open Pulse project slug),
  `repo_name`.
- **Change shape:** `files`, `lines_added`, `lines_removed`,
  `lines_changed`, file-level child docs.
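
To make the field families concrete, here is a minimal sketch of a query body that counts human-authored commits per organisation. It assumes `Author_bot` is indexed as a boolean and `Author_org_name` is aggregatable (keyword-mapped); neither is confirmed by this page, and the helper name `commits_per_org_query` is hypothetical.

```python
def commits_per_org_query(since="now-12M/M", top_n=10):
    """Return a search body: commits per unified author organisation."""
    return {
        "size": 0,  # aggregation only, no hits
        "query": {
            "bool": {
                "filter": [{"range": {"author_date": {"gte": since}}}],
                # ASSUMPTION: Author_bot is a boolean flag on each document.
                "must_not": [{"term": {"Author_bot": True}}],
            }
        },
        "aggs": {
            "by_org": {"terms": {"field": "Author_org_name", "size": top_n}}
        },
    }


body = commits_per_org_query()
```

The resulting `body` can be passed to an `opensearch-py` client as `search(index="git_demo_enriched", body=body)`.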

## Accessing metrics

### OpenSearch Dashboards

The browse UI lives at
[http://localhost:5601](http://localhost:5601) (direct) or
[http://localhost:7508](http://localhost:7508) (via the nginx
reverse proxy). Both surfaces present the same dashboards.
Authentication uses the `OPENSEARCH_INITIAL_ADMIN_PASSWORD` value set in
`infra/.env`.

### Python

OpenSearch exposes an Elasticsearch-compatible REST API; `opensearch-py`
is the most straightforward client:

```python
import os

from opensearchpy import OpenSearch

# Connect to the local node; the admin password is read from the environment.
os_client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200, "scheme": "https"}],
    http_auth=("admin", os.environ["OPENSEARCH_INITIAL_ADMIN_PASSWORD"]),
    verify_certs=False,  # certificate verification is disabled here
)

# Commits per month across all tracked repos
body = {
    "size": 0,  # aggregation only, no hits
    "aggs": {
        "monthly": {
            "date_histogram": {
                "field": "author_date",
                "calendar_interval": "month",
            }
        }
    },
}
result = os_client.search(index="git_demo_enriched", body=body)
for bucket in result["aggregations"]["monthly"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])
```

For dataframes:

```python
import pandas as pd

# Reuses `result` from the previous block; plotting requires matplotlib.
buckets = result["aggregations"]["monthly"]["buckets"]
df = pd.DataFrame(buckets)
df["date"] = pd.to_datetime(df["key_as_string"])
df.set_index("date")["doc_count"].plot(title="Commits per month")
```

### Worked queries

**Top 10 most active repositories by commit count (last 12 months).**

```json
GET /git_demo_enriched/_search
{
  "size": 0,
  "query": {
    "range": { "author_date": { "gte": "now-12M/M" } }
  },
  "aggs": {
    "by_repo": {
      "terms": { "field": "repo_name", "size": 10 }
    }
  }
}
```

**Bus factor proxy: number of authors who together produced 80% of commits.**

The CHAOSS "bus factor" metric typically reduces to a percentile cut.
You can approximate it from OpenSearch with a `terms` aggregation over
authors, whose buckets come back sorted by commit count, and then walk
the cumulative distribution until it reaches 80%. `Author_uuid`
identifies the unified author; the `Author_multi_org_names` field gives
you the org-resolved identity.
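
The cumulative walk described above can be sketched as a small pure function. This is one possible reading of the proxy, not a reference CHAOSS implementation; the function name and the sample counts are hypothetical, and the input is assumed to be the per-author `doc_count` values of a terms aggregation.

```python
def bus_factor(commit_counts, threshold=0.8):
    """Smallest number of authors whose commits reach `threshold` of the total."""
    total = sum(commit_counts)
    if total == 0:
        return 0
    covered = 0
    # Walk authors from most to least active, accumulating their commits.
    for rank, count in enumerate(sorted(commit_counts, reverse=True), start=1):
        covered += count
        if covered >= threshold * total:
            return rank
    return len(commit_counts)


# Hypothetical per-author commit counts.
sample = [50, 25, 10, 10, 5]
print(bus_factor(sample))  # -> 3 (50 + 25 + 10 = 85 of 100 commits)
```

A terms aggregation only returns its top `size` buckets, so the result is a lower bound unless `size` is large enough to cover every author.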

**License compliance.** Pair OpenSearch counts with the SPARQL store's
`schema:license` field; see
[Metadata & Ontology](metadata-and-ontology.md) for the join.

## Cross-layer questions

Some CHAOSS-style questions naturally cross layers:

| Question | Layers |
| --- | --- |
| "Which permissive-licensed repos have the highest contributor diversity?" | SPARQL (license) + OpenSearch (authors) |
| "Show contributor flow between two collaborating orgs over time" | Neo4j (org links) + OpenSearch (dates) |
| "Rank repos by activity normalised by repo age" | OpenSearch + SPARQL (`schema:dateCreated`) |

Pipeline steps use the
[Services](../services/index.md) container to call the relevant clients;
notebooks can compose queries directly as shown above.
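
As an illustration of the third question in the table, the sketch below combines per-repo commit counts with creation dates once both layers have been queried. The function name and all sample values are hypothetical; it assumes the counts come from a terms aggregation on `repo_name` and the dates from `schema:dateCreated`, keyed by the same repository name.

```python
from datetime import date


def rank_by_activity_per_year(commit_counts, created, today):
    """Rank repos by commits per year of age, highest first."""
    ranked = []
    for repo, commits in commit_counts.items():
        if repo not in created:
            continue  # no creation date available for this repo; skip it
        # Floor the age at one day to avoid dividing by zero.
        age_years = max((today - created[repo]).days, 1) / 365.25
        ranked.append((repo, commits / age_years))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical inputs from the two layers.
commit_counts = {"repo-a": 400, "repo-b": 300}
created = {"repo-a": date(2016, 1, 1), "repo-b": date(2023, 1, 1)}
ranking = rank_by_activity_per_year(commit_counts, created, date(2025, 1, 1))
```

Here `repo-b` ranks first despite having fewer commits, because it accumulated them over about two years rather than nine.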
