Commit 19a038a

feat: restructure Open Pulse stack and environment configuration
- Consolidated the entire Open Pulse stack under `infra/open-pulse-stack/`, including main compose files, CLI orchestrator overlay, and GrimoireLab assets.
- Introduced a single-file environment model with `infra/.env` as the authoritative deployment environment, while `<repo>/.env` is designated for the open-pulse CLI when interacting with external infrastructure.
- Updated `.gitignore` to recursively ignore `data/` directories to prevent accidental pollution of the working tree.
- Added `infra/.env.example` as a comprehensive template for deployment configuration.
- Simplified default authentication credentials across the stack to `openpulse` / `replace-me`, with stronger placeholders where necessary.
- Enhanced documentation to reflect the new structure and configuration processes, including a new `.env` wizard for easier setup.
1 parent 27c839e commit 19a038a

71 files changed

Lines changed: 1981 additions & 536 deletions

.env.example

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+# ── Open Pulse — tool / client config ─────────────────────────────────────
+# Copy to `<repo>/.env` and fill in.
+#
+# This file is consumed by the open-pulse Python CLI / hub when running on
+# the host as a CLIENT against EXTERNAL infrastructure (Neo4j / SPARQL /
+# crawler living elsewhere). Compose never loads it — when you launch the
+# stack from `infra/`, all deployment env lives in `infra/.env`.
+#
+# So:
+# - Bringing local infra up? → fill `infra/.env`. Ignore this file.
+# - Pointing the CLI at remote? → fill this file. `infra/.env` is unused.
+
+
+# ── Service endpoints ────────────────────────────────────────────────────
+# Override these to point at the remote services you want to talk to. When
+# left commented out, the CLI uses the in-network defaults (`neo4j:7687`,
+# `sparql-proxy:7878`, …) — those only resolve from inside the compose
+# network, so leaving them commented is appropriate for in-stack hub usage
+# but not for host-side CLI talking to remote infra.
+# NEO4J_BOLT_ENDPOINT=bolt://neo4j.example.com:7687
+# NEO4J_HTTP_ENDPOINT=http://neo4j.example.com:7474
+# SPARQL_ENDPOINT=https://sparql.example.com
+# CRAWLER_ENDPOINT=https://crawler.example.com
+# GIT_METADATA_EXTRACTOR_ENDPOINT=https://gme.example.com
+
+
+# ── Auth for the endpoints above ─────────────────────────────────────────
+# Whatever credentials the remote services expect. `replace-me` placeholders
+# match the open-pulse local-default convention; rotate before any remote
+# use.
+# NEO4J_AUTH=neo4j/replace-me
+# CRAWLER_API_TOKEN=replace-me
+# SPARQL_AUTH=replace-me
+# APPLIER_AUTH=replace-me
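
The template comments describe a fallback rule: an endpoint variable that is left commented out falls back to an in-network default. The Python sketch below illustrates that rule only; it is not the open-pulse CLI's actual code. The helper names `resolve_endpoint` and `split_neo4j_auth`, the `bolt://` and `http://` schemes on the defaults, and the assumption that `NEO4J_AUTH` is split at the first slash are all illustrative.

```python
import os

# In-network defaults quoted in the template comments (`neo4j:7687`,
# `sparql-proxy:7878`). The URL schemes are assumptions for illustration.
IN_NETWORK_DEFAULTS = {
    "NEO4J_BOLT_ENDPOINT": "bolt://neo4j:7687",
    "SPARQL_ENDPOINT": "http://sparql-proxy:7878",
}


def resolve_endpoint(name, env=None):
    """Return the override for `name` if set and non-empty, else the default."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    return value or IN_NETWORK_DEFAULTS[name]


def split_neo4j_auth(value):
    """Split a `user/password` string at the first slash."""
    user, sep, password = value.partition("/")
    if not sep or not user or not password:
        raise ValueError("expected NEO4J_AUTH in the form user/password")
    return user, password
```

With an empty environment, `resolve_endpoint("NEO4J_BOLT_ENDPOINT", {})` yields the in-network address, which only resolves from inside the compose network. A host-side client talking to remote infrastructure would therefore need the override set.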

.github/dependabot.yml

Lines changed: 1 addition & 2 deletions
@@ -66,9 +66,8 @@ updates:
 # Compose-managed images for the local services stack.
 - package-ecosystem: docker-compose
 directories:
-- "/infra/compose"
+- "/infra/open-pulse-stack"
 - "/infra/services/oxigraph"
-- "/infra/services/grimoirelab"
 - "/infra/services/sparql-proxy"
 - "/infra/services/neo4j"
 - "/infra/services/portainer"

.github/workflows/docker-publish.yml

Lines changed: 2 additions & 1 deletion
@@ -4,6 +4,7 @@ on:
 push:
 branches:
 - main
+- develop
 paths:
 - "src/**"
 - "pyproject.toml"
@@ -28,7 +29,7 @@ permissions:
 
 env:
 REGISTRY: ghcr.io
-# Path matches the OPEN_PULSE_IMAGE default in infra/compose/docker-compose.yml.
+# Path matches the OPEN_PULSE_IMAGE default in infra/open-pulse-stack/docker-compose.yml.
 # Org is sdsc-ordes; package_name within the org is "open-pulse".
 IMAGE_NAME: sdsc-ordes/open-pulse
 PACKAGE_NAME: open-pulse

.github/workflows/docker-validate.yml

Lines changed: 5 additions & 4 deletions
@@ -3,14 +3,14 @@ name: Compose Validate
 on:
 pull_request:
 paths:
-- "infra/compose/**"
+- "infra/open-pulse-stack/**"
 - ".github/workflows/docker-validate.yml"
 push:
 branches:
 - main
 - "release/**"
 paths:
-- "infra/compose/**"
+- "infra/open-pulse-stack/**"
 - ".github/workflows/docker-validate.yml"
 
 permissions:
@@ -33,8 +33,9 @@ jobs:
 fail-fast: false
 matrix:
 compose_files:
-- "-f infra/compose/docker-compose.yml"
-- "-f infra/compose/docker-compose.yml -f infra/compose/docker-compose.cli.yml"
+- "-f infra/open-pulse-stack/docker-compose.yml"
+- "-f infra/open-pulse-stack/docker-compose.yml -f infra/open-pulse-stack/docker-compose.cli.yml"
+- "-f infra/open-pulse-stack/docker-compose.yml -f infra/open-pulse-stack/docker-compose.grimoirelab.yml"
 steps:
 - name: Checkout repository
 uses: actions/checkout@v5
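
Each of the three matrix entries above is the base compose file plus at most one overlay. The Python sketch below shows one way such `-f` argument lists could be assembled. The function `compose_file_args` and its `with_cli` / `with_grimoire` parameters are hypothetical names for illustration; the real logic in the project's `deploy.py` is not shown in this diff and may differ.

```python
from pathlib import PurePosixPath

# Directory introduced by this commit; file names are taken from the matrix.
STACK_DIR = PurePosixPath("infra/open-pulse-stack")


def compose_file_args(with_cli=False, with_grimoire=False):
    """Build the `-f <file>` argument list, base file first, overlays after."""
    files = [STACK_DIR / "docker-compose.yml"]
    if with_cli:
        files.append(STACK_DIR / "docker-compose.cli.yml")
    if with_grimoire:
        files.append(STACK_DIR / "docker-compose.grimoirelab.yml")
    args = []
    for compose_file in files:
        args += ["-f", str(compose_file)]
    return args
```

Joining the result with spaces reproduces the matrix strings, for example `" ".join(compose_file_args(with_cli=True))`. Order matters because Compose merges later files over earlier ones.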

.gitignore

Lines changed: 5 additions & 5 deletions
@@ -17,12 +17,12 @@ infra/services/airflow/src/env_variables.json
 **/airflow/plugins/
 **/airflow/logs/
 
-# Docker/database runtime data
-/data/
-infra/services/graphdb/data/
+# Docker/database runtime data — ignore any `data/` dir at any level so an
+# accidental relative-path resolution (e.g. `infra/open-pulse-stack/data/`)
+# can't pollute the working tree. Per-service subdirs under the configured
+# OPEN_PULSE_DATA_DIR are also covered by this single rule.
+data/
 infra/services/graphdb/graphdb-data/
-infra/services/neo4j/data/
-infra/services/portainer/data/
 
 # Tentris local data and licenses
 infra/services/tentris-server/data/
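
The new `data/` rule has no leading slash, so Git matches a directory named `data` at any depth, whereas the removed `/data/` rule was anchored to the repository root. The Python sketch below approximates that single rule for file paths. It is a simplification with a hypothetical helper name (`under_data_dir`), not a gitignore implementation: it ignores negation patterns, wildcards, and nested `.gitignore` files.

```python
from pathlib import PurePosixPath


def under_data_dir(path):
    """True if any parent directory of the file `path` is named exactly `data`."""
    parts = PurePosixPath(path).parts
    # Only directory components count, so drop the final (file) component.
    return "data" in parts[:-1]
```

Under this approximation, a file below `infra/open-pulse-stack/data/` is covered, while one below `infra/services/graphdb/graphdb-data/` is not. That is consistent with the diff keeping a separate `graphdb-data/` entry.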

AGENTS.md

Lines changed: 23 additions & 20 deletions
@@ -129,19 +129,20 @@ overrides:
 identity-mapped, so nested `docker compose`
 resolves bind paths the same on both sides)
 
-infra/compose/docker-compose.yml ┐
-├── neo4j oxigraph sparql-proxy │ image-only refs; OPEN_PULSE_IMAGE
-├── crawler extractor selenium │ pulls from GHCR by default
-├── hub (--profile hub) │ HUB_AUTH gate
-└── grimoirelab-db portainer … ─┘
+infra/open-pulse-stack/docker-compose.yml ┐
+├── neo4j oxigraph sparql-proxy │ image-only refs;
+├── crawler extractor selenium │ OPEN_PULSE_IMAGE pulls
+├── hub (--profile hub) │ from GHCR by default;
+└── grimoirelab-db portainer … ─┘ HUB_AUTH gates the hub
 
-infra/compose/docker-compose.cli.yml
+infra/open-pulse-stack/docker-compose.cli.yml
 └── open-pulse-cli (overlay; auto-included when CLI runs inside it)
 
-infra/services/grimoirelab/docker-compose.yml
+infra/open-pulse-stack/docker-compose.grimoirelab.yml
 └── full GrimoireLab stack (mariadb, valkey, opensearch, mordred,
 sortinghat, nginx, projects-applier sidecar) — opt in via
-`open-pulse deploy up --with-grimoire`
+`open-pulse deploy up --with-grimoire`. Supporting assets at
+`infra/open-pulse-stack/grimoirelab/`.
 
 data/ ← single root for all bind mounts
 ├── neo4j/ oxigraph/ … ← main stack writes here
@@ -158,15 +159,17 @@ uv run pytest -q
 
 # Build the unified image (one image for CLI / orchestrator / hub)
 docker build -f tools/images/Dockerfile-open-pulse -t open-pulse:local .
-echo "OPEN_PULSE_IMAGE=open-pulse:local" >> .env # otherwise pulls GHCR
+echo "OPEN_PULSE_IMAGE=open-pulse:local" >> infra/.env # otherwise pulls GHCR
 
 # Bring up the stack (interactive profile picker if no --profile flags)
 ./scripts/op deploy up --profile crawler --profile extractor --profile sparql --profile hub
-# or directly via the host's docker compose:
-docker compose -f infra/compose/docker-compose.yml -f infra/compose/docker-compose.cli.yml \
---env-file .env --profile hub up -d
+# or directly via the host's docker compose (only infra/.env is loaded):
+docker compose -f infra/open-pulse-stack/docker-compose.yml \
+-f infra/open-pulse-stack/docker-compose.cli.yml \
+--env-file infra/.env \
+--profile hub up -d
 
-# Hub at http://localhost:9090 — log in with admin / $HUB_AUTH
+# Hub at http://localhost:9090 — log in with openpulse / $HUB_AUTH
 
 # Talk to the cli orchestrator from the host (handles git-bash path mangling):
 ./scripts/op deploy ps
@@ -181,13 +184,12 @@ docker compose -f infra/compose/docker-compose.yml -f infra/compose/docker-compo
 
 | Sub-command | Description |
 | --- | --- |
-| `open-pulse deploy up` | Deploy services via Docker Compose. Without `--profile` flags, opens an interactive selector. Creates `.env` from `infra/env/.env.example` if absent. |
+| `open-pulse deploy up` | Deploy services via Docker Compose. Without `--profile` flags, opens an interactive selector. Compose loads only `<repo>/infra/.env` (auto-seeded from `infra/.env.example` if missing). The tool/client `<repo>/.env` is for the open-pulse Python CLI / hub against external infra and is not a compose input. |
 | `open-pulse deploy down` | Tear down deployed services. `--volumes` / `-v` also removes named volumes. |
 | `open-pulse deploy ps` | Show the status of deployed containers. |
 
 **Profiles:**
 - `default` — Core services only (Neo4j)
-- `analysis` — Core + analysis notebook
 - `crawler` — Open Pulse Crawler API
 - `extractor` — GME extractor + Selenium
 - `sparql` — Oxigraph + sparql-proxy
@@ -196,10 +198,11 @@ docker compose -f infra/compose/docker-compose.yml -f infra/compose/docker-compo
 - `orchestration` — Portainer
 
 **Flags applicable to `up` / `down` / `ps`:**
-- `--with-cli` — Include `infra/compose/docker-compose.cli.yml` (auto-included
-when the CLI itself runs inside the cli container, via the
-`OPEN_PULSE_RUNNING_IN_CLI_CONTAINER=1` marker).
-- `--with-grimoire` — Also include `infra/services/grimoirelab/docker-compose.yml`.
+- `--with-cli` — Include `infra/open-pulse-stack/docker-compose.cli.yml`
+(auto-included when the CLI itself runs inside the cli container, via
+the `OPEN_PULSE_RUNNING_IN_CLI_CONTAINER=1` marker).
+- `--with-grimoire` — Also include
+`infra/open-pulse-stack/docker-compose.grimoirelab.yml`.
 
 **Project root resolution** (`_find_project_root`):
 1. `$OPEN_PULSE_PROJECT_ROOT` if set
@@ -310,7 +313,7 @@ read-only at `/data` inside the hub so DuckDB can read other services' files.
 - **CI triggers** (path-scoped):
 - `src/**`, `tests/**`, `pyproject.toml`, `uv.lock` → Python CI
 - `tools/images/**`, `.devcontainer/**`, `pyproject.toml`, `uv.lock`,
-`infra/compose/docker-compose*.yml` → Docker validation
+`infra/open-pulse-stack/docker-compose*.yml` → Docker validation
 - `docs-site/**` → docs build
 - **Pre-commit**: `.pre-commit-config.yaml`, Ruff scoped to `src/`

CHANGELOG.md

Lines changed: 12 additions & 1 deletion
@@ -9,6 +9,11 @@ and this project follows [Semantic Versioning](https://semver.org/spec/v2.0.0.ht
 
 ### Changed
 
+- Consolidated the entire Open Pulse stack under `infra/open-pulse-stack/`. The main compose, the CLI orchestrator overlay, and the full GrimoireLab compose now live alongside each other as `docker-compose.yml`, `docker-compose.cli.yml`, and `docker-compose.grimoirelab.yml`; GrimoireLab supporting assets (applier source, config templates, sigils, one-shot scripts) moved to `infra/open-pulse-stack/grimoirelab/`. Updated `deploy.py` path constants, `health.py` `_COMPOSE_FILE`, `justfile` `regen-grimoire-config`, `.github/dependabot.yml`, the docker-validate workflow (with a third matrix entry exercising the grimoirelab compose), and every doc that referenced the old paths.
+- Split `.env` along the principle "when launching from `infra/`, all env lives in `infra/`; otherwise `<repo>/.env` is just for the open-pulse tool acting as a client". `<repo>/infra/.env` is the AUTHORITATIVE deployment env: image refs, ports, resource limits, storage paths, ALL container-side credentials, per-service knobs (HUB_AUTH, CRAWLER_*, EXTRACTOR_*, GrimoireLab block, V2/RAG flags). Compose loads only this file (`--env-file infra/.env`); every service additionally `env_file:`-pulls it so any key set there reaches the container without an explicit `environment:` mapping. `<repo>/.env` is the tool/client env, consumed only by the open-pulse Python CLI / hub when running on the host against EXTERNAL infrastructure — compose never reads it. Both files are gitignored; `infra/.env` auto-seeds from `infra/.env.example` on first `op deploy up`, while `<repo>/.env` is a manual copy from `<repo>/.env.example`. `deploy up` / `down` / `ps` and the cli + grimoirelab compose env_file directives all dropped the second `--env-file` / `env_file:` entry.
+- Simplified default auth across the stack to `openpulse` / `replace-me`. Where the underlying service forces a specific username (Neo4j → `neo4j`, OpenSearch admin → `admin`, MariaDB init → `root`) only the password changes; everywhere else (GrimoireLab Postgres user, SortingHat superuser, hub login UI) the default username is `openpulse`. Compose-baked fallbacks (`${NEO4J_AUTH:-...}`, `${GRIMOIRELAB_DB_USER:-...}`, `${GRIMOIRELAB_DB_PASSWORD:-...}`) and the runtime fallback in `gui/hub/routes/stats.py` now match. `replace-me` is a placeholder; rotate before any non-local deployment.
+- Pinned `OPEN_PULSE_DATA_DIR` to an absolute path during `.env` seeding. The previous default `./data` resolved differently between `op deploy …` (relative to repo root via `--project-directory`) and raw `docker compose -f infra/open-pulse-stack/…` (relative to the compose file), silently splitting state into two locations. `_ensure_env_files` now substitutes the line with the absolute repo-root data path when seeding `infra/.env` from the template; `OPEN_PULSE_HOST_PATH` is filled the same way. `GRIMOIRE_DATA_DIR` derives from `${OPEN_PULSE_DATA_DIR}` so GrimoireLab data lands alongside Neo4j et al. under a single root.
+- Hardened `.gitignore` to ignore `data/` recursively (any nesting level) so an accidental relative-path resolution can't pollute the working tree. Removed the per-path `infra/services/{neo4j,portainer}/data/` entries that the recursive rule now covers.
 - Added explicit `mem_limit`, `cpus`, and `restart` policies to every service in `infra/compose/docker-compose.yml` and `infra/services/grimoirelab/docker-compose.yml`. Each cap reads from a per-service env var (e.g. `OPENSEARCH_MEM_LIMIT`, `NEO4J_MEM_LIMIT`, `MORDRED_MEM_LIMIT`, `EXTRACTOR_MEM_LIMIT`) so production deploys override the dev defaults without editing the compose. Neo4j heap and page cache are now explicit (`NEO4J_HEAP`, `NEO4J_PAGECACHE`) and sized to fit under the new cap. Mordred default cap dropped from 4g to 2g (it idled at 1.78 GiB; the previous 4g was generous and contributed to the host-OS pressure that OOM-killed opensearch). See `dev/advise/2026-05-05-resource-caps-and-oom-diagnosis.md` for the full diagnosis, budget math, and how to apply the new caps to running services without downtime.
 - Added healthchecks to `opensearch-node1` (`_cluster/health?wait_for_status=yellow` with auth — `yellow` is the success state on a single-node deploy because replicas can't be allocated) and `opensearch-dashboards` (liveness via `GET /`, *not* `/api/status`, which requires auth in v3 and would always report unhealthy).
 - Simplified Compose topology to a two-file model under `infra/compose/`: `docker-compose.yml` for infra services and `docker-compose.cli.yml` as an optional CLI overlay.
@@ -28,8 +33,14 @@ and this project follows [Semantic Versioning](https://semver.org/spec/v2.0.0.ht
 - Split monolithic `tests/test_cli.py` into per-domain test modules: `test_cli.py` (entry point), `test_deploy.py`, `test_quest.py`, `test_grimoire.py`, `test_health.py`, and `test_orchestrator.py`. Added `conftest.py` with shared fixtures.
 - Added new test cases: `deploy down --volumes` flag pass-through, `deploy down`/`ps` Docker-unavailable guards, `deploy up --file` compose override, `quest start --resume` flag forwarding, `quest start --config` custom path, pipeline failure propagation, grimoire `install-watcher --clone-dir`, mixed-state container health check, orchestrator checkpoint persistence on success and failure, and empty task list handling.
 
+### Removed
+
+- Removed `infra/env/` and the old `infra/services/grimoirelab/` directories. Their contents either moved to the new structure (`infra/services/grimoirelab/docker-compose.yml` → `infra/open-pulse-stack/docker-compose.grimoirelab.yml`; the `applier/`, `config/`, `python-scripts/`, `scripts/`, and `README.md` siblings → `infra/open-pulse-stack/grimoirelab/`) or were superseded (`infra/env/.env.example` → `<repo>/.env.example` + `infra/.env.example`; `infra/services/grimoirelab/.env.dist` → unified `infra/.env.example`).
+
 ### Added
 
+- Added `infra/.env.example` (deployment template — comprehensive: every container-side knob the local stack needs) and `<repo>/.env.example` (tool/client template — slim: endpoint overrides + auth for talking to remote infra). The CLI auto-substitutes `OPEN_PULSE_DATA_DIR` and `OPEN_PULSE_HOST_PATH` to absolute paths when seeding `infra/.env`.
+- Added `env_file: ../.env` directives to the crawler, extractor, and hub services in `infra/open-pulse-stack/docker-compose.yml` so any per-service knob set in `infra/.env` reaches the container automatically. This is what makes the extractor's V2 / RAG / agent-runtime knobs (V2_*, RCP_TOKEN, OPENALEX_MAILTO, HF_TOKEN, EPFL_GRAPH_*, ORCID_*, INDEX_QDRANT_URL, GRIMOIRE_GITHUB_TOKEN) reach the container without per-key `environment:` plumbing.
 - Added new command groups `services` and `gui` with grimoire subcommands split by domain (`services grimoire prepare-config`, `services grimoire install-watcher`, `gui grimoire`).
 - Added `open_pulse.utils.grimoire` package (`sparql_config.py`, `cronjob.py`) and `open_pulse.gui.grimoire_streamlit`.
 - Added new `open_pulse.services` modules:
@@ -135,6 +146,6 @@ and this project follows [Semantic Versioning](https://semver.org/spec/v2.0.0.ht
 - Updated `.gitignore` with docs tooling artifacts (`node_modules/`, `docs-site/build/`).
 - Rewrote root `README.md` for onboarding with project purpose, architecture overview, DB stack quick start, `uv`-based analysis quick start, documentation navigation links, and release/contribution references.
 - Refactored root `docker-compose.yml` into a profile-aware topology with default Neo4j plus opt-in `analysis`, `grimoirelab`, and `orchestration` services.
-- Added healthchecks and dependency readiness gates for key profile services (`neo4j`, `analysis-notebook`, and `grimoirelab-db`).
+- Added healthchecks and dependency readiness gates for key profile services (`neo4j` and `grimoirelab-db`).
 - Expanded `analysis/README.md` with sequential orchestration usage and checkpoint resume guidance.
 - Expanded `analysis/README.md` with container build/smoke/non-root checks and devcontainer setup guidance.
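
The changelog entry on pinning `OPEN_PULSE_DATA_DIR` describes two things: the same relative `./data` resolving to different directories depending on the base, and the seeding step rewriting that line to an absolute path. The Python sketch below illustrates both under stated assumptions. The repository location `/srv/open-pulse` and the helper names `resolve_relative` and `pin_data_dir` are hypothetical, and the project's actual `_ensure_env_files` may work differently.

```python
import posixpath


def resolve_relative(base_dir, relative):
    """Resolve `relative` against `base_dir` using POSIX path rules."""
    return posixpath.normpath(posixpath.join(base_dir, relative))


def pin_data_dir(template_text, repo_root):
    """Rewrite the OPEN_PULSE_DATA_DIR line to an absolute repo-root data path."""
    absolute = posixpath.join(repo_root, "data")
    lines = []
    for line in template_text.splitlines():
        if line.startswith("OPEN_PULSE_DATA_DIR="):
            line = f"OPEN_PULSE_DATA_DIR={absolute}"
        lines.append(line)
    return "\n".join(lines) + "\n"


# Hypothetical checkout location, used only for illustration.
REPO_ROOT = "/srv/open-pulse"

# The same "./data" lands in two places depending on the base directory.
from_repo_root = resolve_relative(REPO_ROOT, "./data")
from_compose_dir = resolve_relative(REPO_ROOT + "/infra/open-pulse-stack", "./data")
```

Here `from_repo_root` and `from_compose_dir` differ, which is the state split the entry describes. Once the line is rewritten with `pin_data_dir`, the value no longer depends on which directory the command resolves paths from.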
