Commit b388ec3

baijum and claude committed
feat: add optional resource metrics (Prometheus, cAdvisor, node-exporter)
Add 3 metrics services behind Docker Compose profiles (off by default). Enable with ENABLE_METRICS=true on bootstrap or COMPOSE_PROFILES=metrics in .env. Includes Prometheus config, Grafana datasource, and a 12-panel Resource Metrics dashboard. Configs are written unconditionally so enabling later doesn't require re-bootstrap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 2947045 commit b388ec3

3 files changed

Lines changed: 504 additions & 9 deletions

docs/server-contract.md

Lines changed: 29 additions & 4 deletions
@@ -6,8 +6,8 @@ This document defines the contract between the platform infrastructure (bootstra
 
 ```
 /opt/platform/ # Platform root (created by bootstrap-server.sh)
-docker-compose.yml # Platform services (postgres, redis, minio, caddy, loki, promtail, grafana)
-.env # Platform credentials (POSTGRES_PASSWORD, MINIO_ROOT_*, GRAFANA_ADMIN_PASSWORD, ACME_EMAIL, OPS_DOMAIN, ALERT_REPO)
+docker-compose.yml # Platform services (7 core + 3 optional metrics)
+.env # Platform credentials (POSTGRES_PASSWORD, MINIO_ROOT_*, GRAFANA_ADMIN_PASSWORD, ACME_EMAIL, OPS_DOMAIN, ALERT_REPO, COMPOSE_PROFILES)
 .bootstrapped # Timestamp marker from last bootstrap run
 Caddyfile # Global Caddyfile — imports /etc/caddy/apps/*.caddy
 caddy-apps/ # Per-app Caddyfile fragments (written by deploy/preview workflows)
@@ -25,6 +25,7 @@ This document defines the contract between the platform infrastructure (bootstra
 usage-report.sh
 loki-config.yml # Loki configuration
 promtail-config.yml # Promtail configuration
+prometheus.yml # Prometheus scrape config (always created)
 grafana/ # Grafana provisioning files and dashboards
 provisioning/datasources/
 provisioning/dashboards/
@@ -45,12 +46,13 @@ This document defines the contract between the platform infrastructure (bootstra
 caddy/config/ # Caddy config state
 loki/ # Loki log storage
 grafana/ # Grafana state
+prometheus/ # Prometheus data (created always, used when metrics enabled)
 backups/postgres/ # pg_dump backup files (7-day retention)
 ```
 
 ## Bootstrap to Deploy Lifecycle
 
-1. **Bootstrap the server** — Run `sudo bash infrastructure/bootstrap-server.sh` on a fresh Debian machine. This creates the directory layout above, installs Docker, creates the `deploy` user, generates platform credentials, starts the 7 platform services, copies infrastructure scripts, and installs cron jobs.
+1. **Bootstrap the server** — Run `sudo bash infrastructure/bootstrap-server.sh` on a fresh Debian machine. This creates the directory layout above, installs Docker, creates the `deploy` user, generates platform credentials, starts the 7 core platform services (plus 3 optional metrics services if enabled), copies infrastructure scripts, and installs cron jobs.
 
 2. **Configure DNS** — Point app domains and `*.preview.<domain>` to the server IP.
 
@@ -115,7 +117,7 @@ The bootstrap script applies several security measures automatically. Self-hoste
 
 **Credential isolation** — The platform `.env` file is mode 600 (readable only by its owner). Per-app credentials are generated by `create-app-credentials.sh` and stored in separate files under `/opt/platform/credentials/`, each also mode 600.
 
-**Container resource limits** — Every platform service and every app container has explicit CPU and memory limits in its Docker Compose file. This prevents any single container from exhausting server resources.
+**Container resource limits** — Every platform service and every app container has explicit CPU and memory limits in its Docker Compose file. This prevents any single container from exhausting server resources. The 7 core services use ~2.66G / 3.25 CPU. The 3 optional metrics services add 448M / 1.00 CPU when enabled.
 
 **Mandatory Access Control (AppArmor)** — Debian 12 ships with AppArmor enabled by default. Docker automatically applies the `docker-default` AppArmor profile to all containers, which restricts capabilities like writing to `/proc` and `/sys`, mounting filesystems, and accessing raw sockets. No configuration is needed — this works out of the box.
 
@@ -194,3 +196,26 @@ fi
 ```
 
 If no credentials file exists, the workflow falls back to whatever is already in `deploy/.env` and logs a warning.
+
+## Resource Metrics (Optional)
+
+Three additional services provide real-time resource visibility in Grafana. They are **off by default** and use Docker Compose profiles to control startup.
+
+| Service | Image | Memory Limit | CPU Limit | Purpose |
+|---|---|---|---|---|
+| prometheus | prom/prometheus:v2.53.0 | 256M | 0.50 | Metrics storage and queries |
+| cadvisor | gcr.io/cadvisor/cadvisor:v0.49.1 | 128M | 0.25 | Container metrics |
+| node-exporter | prom/node-exporter:v1.8.1 | 64M | 0.25 | Host metrics |
+
+**To enable** — add `COMPOSE_PROFILES=metrics` to `/opt/platform/.env`, then run `docker compose up -d` in `/opt/platform/`. Or bootstrap with `sudo ENABLE_METRICS=true bash bootstrap-server.sh`.
+
+**To disable** — remove the `COMPOSE_PROFILES=metrics` line from `.env`, then stop the metrics services with `docker compose --profile metrics down`.
+
+The Prometheus config (`/opt/platform/prometheus.yml`), Grafana datasource, and dashboard JSON are always created during bootstrap — they are harmless without running services and avoid needing to re-bootstrap to enable metrics later. Only the service startup is conditional.
+
+The "Resource Metrics" dashboard in Grafana includes:
+
+- **Host overview**: CPU %, memory %, disk %, uptime (stat panels)
+- **Host time series**: CPU, memory, disk I/O, network I/O over time
+- **Container overview**: table of all containers with CPU, memory, network
+- **Per-container detail**: filterable CPU and memory time series per container
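
The enable and disable steps documented in this diff can be scripted. The following is a minimal shell sketch, not code from the commit: `enable_metrics_profile` and `disable_metrics_profile` are hypothetical helper names, and the sketch assumes the `.env` file holds at most one `COMPOSE_PROFILES` line whose value is exactly `metrics`.

```shell
#!/usr/bin/env bash
# Sketch only: toggle the metrics profile line in a Compose .env file.
# Both helper names are hypothetical and do not exist in the repository.

enable_metrics_profile() {
  local env_file="$1"
  # Idempotent: do nothing if the exact line is already present.
  if grep -qx 'COMPOSE_PROFILES=metrics' "$env_file" 2>/dev/null; then
    return 0
  fi
  # If the file is non-empty and lacks a trailing newline, add one first
  # so the new entry does not get glued onto the last existing line.
  if [ -s "$env_file" ] && [ -n "$(tail -c1 "$env_file")" ]; then
    printf '\n' >> "$env_file"
  fi
  printf 'COMPOSE_PROFILES=metrics\n' >> "$env_file"
}

disable_metrics_profile() {
  local env_file="$1"
  local tmp
  tmp="$(mktemp)"
  # Keep every line except the exact profile line.
  # grep -v exits non-zero when nothing is left, hence "|| true".
  grep -vx 'COMPOSE_PROFILES=metrics' "$env_file" > "$tmp" || true
  # Write back through the existing file so its mode and owner are kept.
  cat "$tmp" > "$env_file"
  rm -f "$tmp"
}
```

Under those assumptions, enabling would be `enable_metrics_profile /opt/platform/.env` followed by `docker compose up -d` in `/opt/platform/`. For disabling, Docker's Compose profiles documentation indicates that `down` combined with `--profile` also acts on services that have no profile, so it may be safer to stop the three services by name after running `disable_metrics_profile /opt/platform/.env`, for example with `docker compose --profile metrics stop prometheus cadvisor node-exporter`. The service names here are taken from the table in the diff and are assumed to match the names in `docker-compose.yml`.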
