This document describes the CI/CD pipeline and monitoring setup for the Nextcloud OCI Terraform project.
The project uses a staged CI/CD pipeline built on GitHub Actions: fast validation checks run first, and the slower security and Docker checks run in parallel only after validation passes.
┌─────────────────────────────────────────────────────────┐
│ CI PIPELINE (ci.yml)                                    │
│ Runs on: PR + Push                                      │
├─────────────────────────────────────────────────────────┤
│ Stage 1: VALIDATION (< 2 min)                           │
│ ├─ Terraform fmt + validate                             │
│ ├─ YAML lint                                            │
│ ├─ Markdown lint                                        │
│ ├─ Docker Compose validate                              │
│ └─ ShellCheck                                           │
├─────────────────────────────────────────────────────────┤
│ Stage 2: SECURITY (parallel after validation)           │
│ ├─ tfsec (Terraform security)                           │
│ ├─ Trivy (IaC vulnerabilities)                          │
│ └─ Gitleaks (secret detection)                          │
├─────────────────────────────────────────────────────────┤
│ Stage 3: DOCKER (parallel with security)                │
│ ├─ Trivy Docker Compose scan                            │
│ ├─ Check privileged containers                          │
│ ├─ Check Docker socket permissions                      │
│ └─ Check hardcoded secrets                              │
├─────────────────────────────────────────────────────────┤
│ Stage 4: PR AUTOMATION (parallel, PR only)              │
│ ├─ Display PR info                                      │
│ ├─ Auto-label by files changed                          │
│ ├─ Size labeling (XS/S/M/L/XL)                          │
│ └─ Conventional commits check (informational)           │
├─────────────────────────────────────────────────────────┤
│ Stage 5: SUMMARY                                        │
│ └─ Aggregate results and report status                  │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ DEEP SECURITY SCAN (security-deep.yml)                  │
│ Runs: Weekly (Mon 9:00 UTC) + Manual                    │
├─────────────────────────────────────────────────────────┤
│ • tfsec with SARIF upload                               │
│ • Trivy full IaC scan                                   │
│ • ShellCheck (all scripts)                              │
│ • Gitleaks (full history)                               │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ DOCKER IMAGE SCAN (docker-image-scan.yml)               │
│ Runs: Weekly (Wed 3:00 UTC) + Manual                    │
├─────────────────────────────────────────────────────────┤
│ • Scan nextcloud/all-in-one:latest                      │
│ • Scan caddy:latest                                     │
│ • Check for image updates                               │
│ • Test Docker Compose pull                              │
└─────────────────────────────────────────────────────────┘
Purpose: Fast feedback for developers on every PR and push to main.
Triggers:
- Pull requests (opened, synchronized, reopened)
- Push to main branch
Stages:
- Validation (< 2 min) - Fast checks that fail immediately if code is malformed
- Security (2-5 min) - Security scans after validation passes
- Docker (1-2 min) - Docker-specific checks, parallel with security
- PR Automation (< 1 min) - Auto-labeling and commit checks, parallel
- Summary (< 1 min) - Aggregates all results
Status: ✅ Required - PRs cannot merge if this fails
Design Principles:
- Fail fast: Validation runs first, fails quickly on formatting errors
- Efficient parallelization: Security and Docker checks run in parallel after validation
- Clear feedback: Summary stage shows exactly what failed
- Non-blocking commit checks: Conventional commits are informational only
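The informational commit check can be sketched as a small shell function. This is only an illustration: the function names and the list of accepted types are assumptions, not taken from the project's workflow. The pattern follows the Conventional Commits header shape `type(scope)!: description`.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 when a commit subject looks like a
# Conventional Commits header, e.g. "feat(ci): add summary stage".
# The list of types is an assumption; adjust it to the project's rules.
is_conventional() {
  printf '%s\n' "$1" |
    grep -Eq '^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)(\([^)]+\))?!?: .+'
}

# Informational use: print a verdict but never fail the job.
check_subject() {
  if is_conventional "$1"; then
    echo "ok: $1"
  else
    echo "warning: not a conventional commit: $1"
  fi
}
```

Because `check_subject` always returns success, a non-conforming subject shows up in the log without blocking the merge, which matches the non-blocking intent described above.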
Purpose: Comprehensive weekly security audit.
Triggers:
- Schedule: Every Monday at 9:00 UTC
- Manual: workflow_dispatch
Checks:
- tfsec: Terraform security best practices
- Trivy: Full IaC vulnerability scanning (config + filesystem)
- ShellCheck: All shell scripts linting
- Gitleaks: Secret detection across full git history
Results: Uploaded to GitHub Security tab (SARIF format)
Status: ℹ️ Informational - Does not block PRs
Purpose: Monitor vulnerabilities in upstream Docker images.
Triggers:
- Schedule: Every Wednesday at 3:00 UTC
- Manual: workflow_dispatch
Checks:
- Vulnerability scan of nextcloud/all-in-one:latest
- Vulnerability scan of caddy:latest
- Check for image updates
- Validate Docker Compose configuration
Results: Uploaded to GitHub Security tab (SARIF format)
Status: ℹ️ Informational - Does not block PRs
Note: We use :latest tags for simplicity. For production at scale, consider digest pinning:
image: nextcloud/all-in-one@sha256:abc123...
Prevent CI failures before committing!
Pre-commit hooks automatically format and check your code locally:
./scripts/setup-precommit.sh
- Terraform fmt: Auto-format Terraform files
- Markdown fix: Auto-fix markdown formatting issues
- ShellCheck: Lint shell scripts
- YAML validation: Check YAML syntax
- Gitleaks: Detect hardcoded secrets
- File checks: Fix trailing whitespace, line endings, etc.
# Run on all files
pre-commit run --all-files
# Run on staged files only
pre-commit run
# Skip hooks for a single commit (use sparingly)
git commit --no-verify
All security scan results are uploaded to the GitHub Security tab in SARIF format:
- Navigate to: Security → Code scanning alerts
- View: Vulnerabilities, security issues, and secret leaks
- Filter by: Tool (tfsec, Trivy, Gitleaks), severity, status
Recommended settings for main branch:
Require status checks to pass:
✅ CI Pipeline / validation
✅ CI Pipeline / security
✅ CI Pipeline / docker
✅ CI Pipeline / summary
Require branches to be up to date: ✅
Require linear history: ✅ (optional)
Not required (informational only):
- Deep Security Scan (scheduled)
- Docker Image Scan (scheduled)
- Conventional commits check
The validation stage runs first and fails fast if basic checks don't pass:
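# Terraform checks (validate needs an initialized working directory)
terraform fmt -check -recursive terraform/
terraform -chdir=terraform init -backend=false
terraform -chdir=terraform validate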
# YAML lint
yamllint docker/docker-compose.yml terraform/cloud-init.yaml .github/workflows/*.yml
# Docker Compose validation
docker compose config --quiet
# Markdown lint
markdownlint-cli2 "**/*.md"
# ShellCheck
shellcheck scripts/*.sh
After validation passes, three stages run in parallel:
Security Stage:
- tfsec → Terraform security scanning
- Trivy → IaC vulnerability detection
- Gitleaks → Secret detection
Docker Stage:
- Trivy Docker Compose scan
- Custom security checks (privileged containers, socket permissions, hardcoded secrets)
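The custom checks for privileged containers and the Docker socket could look roughly like the following. This is a sketch under stated assumptions: the function names are hypothetical, and both checks are plain-text heuristics over the Compose file rather than a real YAML parse, so long-syntax volume entries (with `read_only: true`) are not recognized.

```shell
#!/bin/sh
# Hypothetical checks against a Compose file. A match is a prompt for
# manual review, not proof of a misconfiguration.

# Exit 0 when any service sets "privileged: true".
has_privileged() {
  grep -Eq '^[[:space:]]*privileged:[[:space:]]*true' "$1"
}

# Exit 0 when a short-syntax Docker socket mount lacks the ":ro" suffix.
has_rw_docker_socket() {
  grep -E '/var/run/docker\.sock' "$1" | grep -Evq ':ro"?[[:space:]]*$'
}
```

A CI step might call both functions on docker/docker-compose.yml and fail the stage when either returns success.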
PR Automation (PRs only):
- Display PR information
- Auto-label by changed files
- Size labeling (XS/S/M/L/XL based on lines changed)
- Conventional commits validation (informational)
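Size labeling comes down to mapping a changed-line count onto a label. The sketch below shows one way to do that; the thresholds and the function name `size_label` are assumptions for illustration, since the actual cut-offs used by the workflow are not stated here.

```shell
#!/bin/sh
# Hypothetical mapping from total changed lines (additions + deletions)
# to a size label. The cut-offs are assumptions, not the project's values.
size_label() {
  lines=$1
  if [ "$lines" -lt 10 ]; then echo "XS"
  elif [ "$lines" -lt 50 ]; then echo "S"
  elif [ "$lines" -lt 200 ]; then echo "M"
  elif [ "$lines" -lt 500 ]; then echo "L"
  else echo "XL"
  fi
}
```

For example, `size_label 42` prints `S`. The line count itself could be derived from `git diff --numstat` by summing the first two columns.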
Aggregates results from all stages and provides a clear pass/fail status.
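As a rough illustration of the aggregation step, the function below takes `stage=result` pairs and derives an overall verdict. The result strings mirror the values GitHub Actions exposes as `needs.<job>.result` (success, failure, cancelled, skipped). The function name and the decision to treat `skipped` as acceptable (for example, PR automation on a push) are assumptions.

```shell
#!/bin/sh
# Hypothetical aggregator: prints one line per stage plus an overall
# verdict, and returns non-zero if any stage neither succeeded nor
# was skipped.
summarize() {
  status=0
  for pair in "$@"; do
    stage=${pair%%=*}
    result=${pair#*=}
    case $result in
      success|skipped) echo "PASS $stage ($result)" ;;
      *) echo "FAIL $stage ($result)"; status=1 ;;
    esac
  done
  if [ "$status" -eq 0 ]; then
    echo "Overall: passed"
  else
    echo "Overall: failed"
  fi
  return "$status"
}
```

A summary job could call it as `summarize validation=success security=failure docker=success` and use the exit code as the job status.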
✅ Monitoring stack is fully implemented and ready to deploy!
The monitoring stack uses Prometheus + Grafana with exporters for comprehensive observability:
┌───────────────────────────────────────────────────┐
│                 MONITORING STACK                  │
├───────────────────────────────────────────────────┤
│                                                   │
│  ┌──────────────┐      ┌──────────────┐           │
│  │ Node Exporter│─────▶│              │           │
│  │   (System)   │      │  Prometheus  │           │
│  └──────────────┘      │    :9090     │           │
│                        │              │           │
│  ┌──────────────┐      │              │           │
│  │   cAdvisor   │─────▶│              │           │
│  │ (Containers) │      └──────┬───────┘           │
│  └──────────────┘             │                   │
│                               ▼                   │
│                        ┌──────────────┐           │
│                        │   Grafana    │           │
│                        │    :3000     │           │
│                        └──────┬───────┘           │
│                               │                   │
│                        ┌──────▼───────┐           │
│                        │    Caddy     │           │
│                        │   Reverse    │           │
│                        │    Proxy     │           │
│                        └──────────────┘           │
│                               │                   │
└───────────────────────────────┼───────────────────┘
                                ▼
      https://tailscale-hostname:3000 (via Tailscale Serve)
Core Services:
- Prometheus - Metrics collection and storage
  - 15-second scrape interval
  - 30-day data retention
  - Localhost access only (security)
- Grafana - Visualization and dashboards
  - Pre-configured Prometheus datasource
  - HTTPS access via Tailscale Serve
  - URL: https://tailscale-hostname:3000 (via Tailscale Serve)
Exporters:
- Node Exporter - System metrics
  - CPU usage, load average
  - Memory (free, cached, buffers)
  - Disk I/O and space
  - Network traffic
- cAdvisor - Container metrics
  - Per-container CPU/memory usage
  - Container network I/O
  - Container health and restarts
Add a DNS A record for the monitoring subdomain:
Subdomain: monitoring.yourdomain
IP Address: (same as main domain)
Edit .env file:
# Generate secure password
openssl rand -base64 32
# Add to .env
GRAFANA_ADMIN_PASSWORD=your-generated-password-here
On your OCI server:
# Navigate to docker directory
cd /opt/nextcloud/docker
# Make sure .env has GRAFANA_ADMIN_PASSWORD set
# Start all services (including monitoring)
docker compose up -d
# Check monitoring containers
docker ps | grep -E "prometheus|grafana|node-exporter|cadvisor"
# View logs
docker compose logs -f prometheus grafana
URL: https://tailscale-hostname:3000 (via Tailscale Serve)
Username: admin
Password: (from GRAFANA_ADMIN_PASSWORD in .env)
Important: This is self-hosted Grafana, not Grafana Cloud. No cloud account needed!
After logging into Grafana, import pre-built dashboards:
- Go to Dashboards → Import
- Enter dashboard ID: 1860
- Select Prometheus datasource
- Click Import
Metrics included:
- CPU usage (per core and total)
- Memory usage (RAM, swap, cache)
- Disk space and I/O
- Network traffic (RX/TX)
- System load and uptime
- Go to Dashboards → Import
- Enter dashboard ID: 179
- Select Prometheus datasource
- Click Import
Metrics included:
- Per-container CPU usage
- Per-container memory usage
- Container network I/O
- Container restarts
- Container health status
System Metrics (Node Exporter):
# CPU Usage Percentage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory Usage Percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk Usage Percentage
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100
# Network Traffic (MB/s)
rate(node_network_transmit_bytes_total[5m]) / 1024 / 1024
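To sanity-check the memory panel against the host, the same expression can be recomputed from the kernel's meminfo values, which is where Node Exporter reads MemAvailable and MemTotal. The sketch below is illustrative: the function name and the optional path argument are assumptions, and it defaults to /proc/meminfo on Linux.

```shell
#!/bin/sh
# Recomputes (1 - MemAvailable / MemTotal) * 100 from a meminfo-style file.
# Pass a path to use a fixture; with no argument it reads /proc/meminfo.
mem_used_percent() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0) {
        printf "%.1f\n", (1 - avail / total) * 100
      } else {
        exit 1
      }
    }
  ' "${1:-/proc/meminfo}"
}
```

With MemTotal at 1000 kB and MemAvailable at 250 kB, the function prints `75.0`, which is what the PromQL expression yields for the same inputs.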
Container Metrics (cAdvisor):
# Container CPU Usage (%)
rate(container_cpu_usage_seconds_total{name=~"nextcloud.*"}[5m]) * 100
# Container Memory Usage (MB)
container_memory_usage_bytes{name=~"nextcloud.*"} / 1024 / 1024
# Container Network RX (MB/s)
rate(container_network_receive_bytes_total{name=~"nextcloud.*"}[5m]) / 1024 / 1024
Estimated monitoring stack resource consumption:
| Service | CPU (cores) | RAM | Disk (30 days) |
|---|---|---|---|
| Prometheus | 0.1-0.3 | 500 MB - 1 GB | 2-5 GB |
| Grafana | 0.05-0.1 | 100-200 MB | 100 MB |
| Node Exporter | 0.01 | 10 MB | - |
| cAdvisor | 0.1 | 50-100 MB | - |
| Total | ~0.3-0.5 | ~0.7-1.3 GB | ~2-5 GB |
Well within OCI free tier (4 vCPU, 24 GB RAM).
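The Prometheus disk figure can be cross-checked with the capacity-planning rule of thumb from the Prometheus storage documentation: disk is roughly retention time in seconds, times ingested samples per second, times bytes per sample (about 1-2 bytes). The sketch below applies it with integer arithmetic. The function name is hypothetical and the series count in the example is an assumption; `prometheus_tsdb_head_series` on a running instance gives the real figure.

```shell
#!/bin/sh
# Rough TSDB size estimate:
#   bytes ~ retention_seconds * (active_series / scrape_interval) * bytes_per_sample
# Arguments: retention days, scrape interval (s), active series, bytes per sample.
estimate_tsdb_bytes() {
  retention_days=$1
  scrape_interval_s=$2
  active_series=$3
  bytes_per_sample=$4
  echo $(( retention_days * 86400 * active_series * bytes_per_sample / scrape_interval_s ))
}
```

With 30-day retention, a 15-second scrape interval, an assumed 10,000 active series, and 2 bytes per sample, `estimate_tsdb_bytes 30 15 10000 2` prints `3456000000`, about 3.5 GB, which falls inside the 2-5 GB range in the table.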
Port Bindings:
All monitoring services bind to localhost only:
- Prometheus: 127.0.0.1:9090
- Grafana: 127.0.0.1:3000
- Node Exporter: 127.0.0.1:9100
- cAdvisor: 127.0.0.1:8081
Only Grafana is exposed via HTTPS through Caddy reverse proxy.
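A localhost-only binding is visible in the Compose short syntax as a `127.0.0.1:` prefix on the port mapping. The helper below is a minimal sketch of that check; the function name is hypothetical, and it only understands the `HOST_IP:HOST_PORT:CONTAINER_PORT` form.

```shell
#!/bin/sh
# Hypothetical check: "127.0.0.1:9090:9090" is loopback-only, while a bare
# "9090:9090" publishes the port on all interfaces.
is_loopback_binding() {
  case $1 in
    127.0.0.1:*:*) return 0 ;;
    *) return 1 ;;
  esac
}
```

On the server itself, `ss -tln` lists the listening sockets, so the bound addresses can be confirmed after `docker compose up -d`.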
Network Isolation:
Monitoring services run on dedicated monitoring network, isolated from nextcloud-aio network.
Authentication:
- Grafana requires username/password
- User sign-up disabled
- Strong password enforcement
Prometheus not scraping metrics:
# Check targets status
# Go to: http://localhost:9090/targets
# All should be "UP"
# Or check from CLI
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
Grafana can't connect to Prometheus:
# Test connectivity
docker exec grafana wget -qO- http://prometheus:9090/-/healthy
# Should return: "Prometheus is Healthy."
cAdvisor not showing container metrics:
# Check if cAdvisor is running
docker ps | grep cadvisor
# View cAdvisor logs
docker compose logs cadvisor
# Test metrics endpoint
curl http://localhost:8081/metrics | grep container_
Create docker/monitoring/alerts.yml for alerting:
groups:
  - name: system
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage above 80% for 10 minutes"
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage above 90%"
      - alert: DiskSpaceLow
        expr: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Disk usage above 85%"
Then add Alertmanager to docker-compose.yml for notifications.
Complete monitoring documentation available at:
- Setup Guide: docker/monitoring/README.md
- Grafana Dashboards: Import IDs 1860 and 179
- PromQL Queries: See monitoring README for examples
Validation stage fails:
- Run pre-commit run --all-files locally to fix formatting
- Check terraform fmt -check -recursive terraform/
- Verify YAML syntax with yamllint
Security stage fails:
- Review tfsec findings: usually Terraform security best practices
- Check Trivy output for IaC misconfigurations
- Gitleaks failure means secrets detected - remove and rotate them immediately
Docker stage fails:
- Check for privileged containers or read-write Docker socket
- Verify no hardcoded secrets in docker-compose.yml
- Review Trivy Docker Compose scan results
Hooks not running:
# Reinstall hooks
pre-commit uninstall
pre-commit install
Hook fails:
# Update hook dependencies
pre-commit autoupdate
# Clear cache and retry
pre-commit clean
pre-commit run --all-files
If you see "Resource not accessible by integration":
- Ensure workflow has security-events: write permission
- Check that GitHub Advanced Security is enabled (free for public repos)
- Always use pre-commit hooks - Catch issues before CI runs
- Fix validation errors first - They're fast and easy to fix
- Don't ignore security warnings - Review and address or document exceptions
- Keep dependencies updated - Monitor scheduled scan results
- Use conventional commits - Helps with changelog generation
- Test locally before pushing - Run docker compose config and terraform validate
Before pushing, test locally:
# 1. Pre-commit checks
pre-commit run --all-files
# 2. Terraform validation
cd terraform
terraform fmt -recursive
terraform init -backend=false
terraform validate
cd ..
# 3. Docker Compose validation
cd docker
docker compose config --quiet
cd ..
# 4. ShellCheck
shellcheck scripts/*.sh
# 5. Markdown lint
npx markdownlint-cli2 "**/*.md"
- GitHub Actions Documentation
- Terraform Security Best Practices
- Docker Security
- Conventional Commits
- Pre-commit Framework
Next Steps: Implement Prometheus + Grafana monitoring (see ROADMAP.md Phase 3)