Commit 9f00f16

baijum and claude committed
docs: add runbooks, troubleshooting, onboarding, ADRs, and Mermaid diagrams
Add operational runbooks (restart, add app, rotate credentials, restore backup, debug deploy), symptom-based troubleshooting guide, onboarding checklist, and 5 architecture decision records. Replace ASCII art in architecture.md with Mermaid diagrams for network topology, CI/CD flow, backup/restore, and preview environments. Update CONTRIBUTING.md with validator and infrastructure sections. Add new pages to mkdocs nav.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent e73ff53 commit 9f00f16

16 files changed

Lines changed: 1035 additions & 23 deletions

CONTRIBUTING.md

Lines changed: 31 additions & 0 deletions
@@ -38,6 +38,37 @@ mkdocs serve

Then open [http://127.0.0.1:8000](http://127.0.0.1:8000) in your browser. The site auto-reloads on file changes.

## Spec Validator

The platform includes a validator that checks app repositories against the [application specification](docs/spec.md). Run it against any app directory:

```bash
python validator/validate.py /path/to/your-app
```

The validator checks three tiers: file structure, configuration, and runtime compliance. All tiers should pass before deploying.
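
As a rough illustration of a tiered check, the sketch below runs tiers in order and stops at the first one that reports problems. It is not the code in `validator/validate.py`: the function names, the `REQUIRED_FILES` list, and the stop-on-first-failure behavior are all assumptions made for this example, and the runtime tier is left out.

```python
from pathlib import Path
from typing import Callable

# Hypothetical file list for illustration only; docs/spec.md defines the real contract.
REQUIRED_FILES = ["Dockerfile", "docker-compose.yml"]


def check_structure(app_dir: Path) -> list[str]:
    """Tier 1: report required files that are missing from the app directory."""
    return [
        f"missing file: {name}"
        for name in REQUIRED_FILES
        if not (app_dir / name).is_file()
    ]


def check_configuration(app_dir: Path) -> list[str]:
    """Tier 2 placeholder: a real check would inspect configuration files."""
    return []


def run_tiers(app_dir: Path) -> dict[str, list[str]]:
    """Run tiers in order and stop at the first tier that reports problems."""
    tiers: list[tuple[str, Callable[[Path], list[str]]]] = [
        ("structure", check_structure),
        ("configuration", check_configuration),
    ]
    results: dict[str, list[str]] = {}
    for name, check in tiers:
        results[name] = check(app_dir)
        if results[name]:
            break  # assumed behavior: later tiers rely on earlier ones passing
    return results
```

An empty list for every tier means the (assumed) checks passed; any non-empty list names what to fix before deploying.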

## Infrastructure Scripts

Scripts in the `infrastructure/` directory follow these conventions:

- **ShellCheck clean** — all scripts pass `shellcheck` with no warnings
- **Idempotent** — every section is guarded with existence checks; safe to re-run
- **No interactive prompts** in automated scripts (cron jobs, CI steps)
- **Documented** — see `docs/server-contract.md` for the full scripts reference table

When modifying infrastructure scripts, test on a fresh Debian 12 server or verify with `infrastructure/verify-server.sh`.
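
The idempotency convention comes down to checking state before changing it. The platform's scripts are shell; the sketch below shows the same guard pattern in Python for illustration, and `ensure_line` is a hypothetical helper that does not exist in the repository.

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Append `line` to `path` only if it is not already present.

    Returns True when the file was changed and False when it was already
    in the desired state, so repeated runs are harmless.
    """
    text = path.read_text() if path.exists() else ""
    if line in text.splitlines():
        return False  # guard: nothing to do on a re-run
    # Start on a fresh line if the existing file lacks a trailing newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    path.write_text(text + prefix + line + "\n")
    return True
```

Calling the helper twice with the same arguments changes the file once; the second call only performs the existence check.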

## Operational Tasks

For day-to-day server operations, see the [runbooks](docs/runbooks/):

- [Restart an app](docs/runbooks/restart-app.md)
- [Add a new app](docs/runbooks/add-new-app.md)
- [Rotate credentials](docs/runbooks/rotate-credentials.md)
- [Restore a backup](docs/runbooks/restore-backup.md)
- [Debug a failed deploy](docs/runbooks/debug-failed-deploy.md)

## Code of Conduct

This project follows the [Contributor Covenant v2.1](CODE_OF_CONDUCT.md). By participating, you agree to uphold its standards.

README.md

Lines changed: 4 additions & 0 deletions
@@ -80,6 +80,10 @@ Full documentation is available at **[towlion.github.io/platform](https://towlio
- [Architecture](https://towlion.github.io/platform/architecture/) — platform design and diagrams
- [App Specification](https://towlion.github.io/platform/spec/) — application contract (ports, endpoints, env vars)
- [Self-Hosting](https://towlion.github.io/platform/self-hosting/) — fork model, server requirements, bootstrap
- [Troubleshooting](https://towlion.github.io/platform/troubleshooting/) — symptom-based debugging guide
- [Onboarding](https://towlion.github.io/platform/onboarding/) — new contributor checklist
- [Runbooks](https://towlion.github.io/platform/runbooks/restart-app/) — operational procedures
- [Architecture Decisions](https://towlion.github.io/platform/decisions/001-github-as-control-plane/) — ADRs

## License

docs/architecture.md

Lines changed: 96 additions & 22 deletions
@@ -4,26 +4,36 @@

The Towlion platform runs on a single Debian server. Applications run as Docker containers and share a set of core infrastructure services.

```mermaid
graph TB
    User[User Browser] -->|HTTPS| Caddy

    subgraph Server["Debian 12 Server"]
        subgraph Docker["Docker / towlion network"]
            Caddy["Caddy :80/:443"]

            Caddy --> App1["App 1 :8000"]
            Caddy --> App2["App 2 :8000"]
            Caddy --> App3["App 3 :8000"]
            Caddy --> Grafana["Grafana :3000"]

            App1 --> Postgres[("PostgreSQL :5432")]
            App2 --> Postgres
            App3 --> Postgres
            App1 --> Redis[("Redis :6379")]
            App1 --> MinIO["MinIO :9000"]

            Promtail["Promtail"] --> Loki["Loki :3100"]
            Loki --> Grafana

            Prometheus["Prometheus :9090"] -.->|optional| Grafana
            cAdvisor["cAdvisor :8080"] -.->|optional| Prometheus
            NodeExp["Node Exporter :9100"] -.->|optional| Prometheus
        end
    end
```
[Removed: ASCII-art diagram showing Internet, DNS, Reverse Proxy (Caddy), App 1 / App 2 / App 3, and Shared Services (PostgreSQL, Redis, MinIO, Workers)]

Dashed lines indicate optional metrics services (enabled via `COMPOSE_PROFILES=metrics`).

## Technology Stack

@@ -214,6 +224,71 @@ pg_dump → /data/backups

Backups can be synced to remote storage using `rclone`.

## CI/CD Flow

```mermaid
graph LR
    Push["git push to main"] --> Actions["GitHub Actions"]

    subgraph Actions["GitHub Actions"]
        Test["Test Job"] --> Deploy["Deploy Job"]
    end

    Deploy -->|SSH| Server["Server"]

    subgraph Server["Server Operations"]
        Pull["git pull"] --> Build["docker compose up -d --build"]
        Build --> Trivy["Trivy image scan"]
        Build --> Migrate["Alembic migrate"]
        Migrate --> CaddyWrite["Write Caddyfile"]
        CaddyWrite --> Reload["Caddy reload"]
    end
```
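
One way to read the server-side part of this diagram is as a dependency graph. The sketch below copies the edges from the diagram and derives a valid execution order with Python's standard `graphlib` module. `DEPENDS_ON` and `execution_order` are illustrative names for this example and are not part of the deploy workflow.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (edges taken from the diagram).
DEPENDS_ON: dict[str, set[str]] = {
    "git pull": set(),
    "docker compose up -d --build": {"git pull"},
    "Trivy image scan": {"docker compose up -d --build"},
    "Alembic migrate": {"docker compose up -d --build"},
    "Write Caddyfile": {"Alembic migrate"},
    "Caddy reload": {"Write Caddyfile"},
}


def execution_order() -> list[str]:
    """Return one ordering of the steps that respects every dependency."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())
```

In the diagram as drawn, the Trivy scan branches off the build step and nothing depends on it, so any position after the build is a valid place for it in the order.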

## Backup and Restore Flow

```mermaid
graph LR
    subgraph Backup["Daily Backup (cron 02:00)"]
        Cron["cron"] --> Script["backup-postgres.sh"]
        Script --> PgDump["pg_dump per database"]
        PgDump --> Compress["gzip"]
        Compress --> Store["/data/backups/postgres/"]
        Store --> Prune["Prune backups > 7 days"]
    end

    subgraph Restore["Manual Restore"]
        List["List backups"] --> Choose["Choose backup file"]
        Choose --> RestoreScript["restore-postgres.sh"]
        RestoreScript --> Drop["Drop + recreate DB"]
        Drop --> Import["Import from backup"]
        Import --> Verify["Verify data"]
    end
```
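
The pruning step in the backup flow can be sketched as a pure selection function. This is an illustration only, not the logic in `backup-postgres.sh`: the `.sql.gz` suffix, the use of file modification time, and the name `backups_to_prune` are assumptions.

```python
from datetime import datetime, timedelta
from pathlib import Path

RETENTION = timedelta(days=7)  # from the diagram: "Prune backups > 7 days"


def backups_to_prune(backup_dir: Path, now: datetime) -> list[Path]:
    """Return backup files whose modification time is older than RETENTION."""
    cutoff = now - RETENTION
    return sorted(
        path
        for path in backup_dir.glob("*.sql.gz")  # assumed naming convention
        if datetime.fromtimestamp(path.stat().st_mtime) < cutoff
    )
```

Keeping selection separate from deletion makes the rule easy to test; a caller would then call `unlink()` on each returned path.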

## Preview Environment Flow

```mermaid
graph TB
    subgraph Create["PR Opened / Updated"]
        PR["Pull Request"] --> Actions["GitHub Actions"]
        Actions -->|SSH| Clone["Clone to /opt/apps/app-pr-N/"]
        Clone --> Schema["Create PR-specific DB schema"]
        Schema --> BuildPR["docker compose up -d --build"]
        BuildPR --> CaddyPR["Write pr-N.preview.domain.caddy"]
        CaddyPR --> ReloadPR["Caddy reload"]
        ReloadPR --> Comment["Comment preview URL on PR"]
    end

    subgraph Cleanup["PR Closed / Merged"]
        Closed["PR closed"] --> StopContainers["Stop + remove containers"]
        StopContainers --> DropSchema["Drop PR schema"]
        DropSchema --> RemoveCaddy["Remove .caddy file"]
        RemoveCaddy --> ReloadClean["Caddy reload"]
        ReloadClean --> RemoveDir["Remove /opt/apps/app-pr-N/"]
    end
```
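
The per-PR names in this diagram follow a pattern that can be derived from the PR number. The sketch below shows one reading of that pattern. It assumes `app` in `/opt/apps/app-pr-N/` stands for the app name and `domain` for the operator's base domain; `PreviewNames` and `preview_names` are hypothetical helpers, not part of the workflows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreviewNames:
    app_dir: str      # checkout directory on the server
    hostname: str     # public preview hostname
    caddy_file: str   # per-preview Caddy config fragment


def preview_names(app: str, pr_number: int, domain: str) -> PreviewNames:
    """Derive the preview paths and hostname for one pull request."""
    if pr_number <= 0:
        raise ValueError("pr_number must be a positive integer")
    hostname = f"pr-{pr_number}.preview.{domain}"
    return PreviewNames(
        app_dir=f"/opt/apps/{app}-pr-{pr_number}/",
        hostname=hostname,
        caddy_file=f"{hostname}.caddy",
    )
```

Deriving every name from the same inputs means the cleanup path can recompute exactly what the create path wrote.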

## Docker Compose Services

Each application lives in its own GitHub repository under the `towlion` organization. The server runs two layers of Compose services:
@@ -249,9 +324,8 @@ services:
    env_file: .env
  frontend:
    build: ./frontend
-  celery-worker:
-    build: ./app
-    command: celery -A app.tasks worker
```

Celery workers are opt-in. To add background task processing, add a `celery-worker` service to your compose file. See the [app-template README](https://github.com/towlion/app-template#background-tasks) for instructions.

For single-app self-hosting (fork scenario), a repository may bundle platform services in its own Compose file so it can run standalone.
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
# ADR 001: GitHub as the Control Plane

**Date:** 2025-12-01
**Status:** Accepted

## Context

Traditional PaaS platforms (Heroku, Render, Fly.io) require a dedicated control plane — a custom API, dashboard, or CLI that manages deployments, configuration, and access control. Building and maintaining a control plane is a significant engineering effort and introduces another system to secure and operate.

Towlion targets indie developers and small projects. The overhead of a custom control plane would outweigh its benefits for this audience.

## Decision

Use GitHub as the control plane for all platform operations:

- **CI/CD**: GitHub Actions workflows handle building, testing, and deploying
- **Configuration**: GitHub Secrets store deployment credentials
- **Access control**: GitHub repository permissions manage who can deploy
- **Workflow orchestration**: Actions workflows coordinate multi-step deployments
- **Issue tracking**: GitHub Issues for alerts (created by `check-alerts.sh`)

No custom dashboard, API server, or CLI is needed. The deployment flow is:

```
GitHub repository -> GitHub Actions -> SSH -> Docker runtime
```

## Consequences

**Benefits:**

- Zero infrastructure to build or maintain for the control plane
- Familiar interface — developers already use GitHub daily
- Built-in audit trail via commit history and Actions logs
- Free for public repos; generous free tier for private repos
- Access control, 2FA, and SSO handled by GitHub

**Trade-offs:**

- Vendor dependency on GitHub (Actions, Secrets, API)
- Limited to what GitHub Actions can express (no custom UI for deployments)
- Secrets management is per-repository, not centralized
- No real-time deployment dashboard — must check Actions tab for status
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
# ADR 002: Caddy Over Nginx for Reverse Proxy

**Date:** 2025-12-01
**Status:** Accepted

## Context

The platform needs a reverse proxy to route traffic to application containers and handle TLS certificate provisioning. The two main contenders are Nginx (with certbot/Let's Encrypt) and Caddy.

## Decision

Use Caddy as the reverse proxy.

## Consequences

**Benefits:**

- **Automatic TLS** — Caddy provisions and renews Let's Encrypt certificates with zero configuration. No certbot cron jobs, no manual renewal scripts.
- **Simple configuration** — A Caddyfile is shorter and more readable than equivalent Nginx config. Adding a new app route is a 3-line file.
- **Per-app config fragments** — The `import /etc/caddy/apps/*.caddy` pattern allows deploy workflows to write individual `.caddy` files per app without editing a monolithic config.
- **Hot reload** — `caddy reload` applies config changes without dropping connections. No `nginx -s reload` dance.
- **Single binary** — No module system, no package dependencies. The official Docker image works out of the box.
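
To make the "3-line file" point concrete, the sketch below renders a minimal per-app fragment of the kind a deploy workflow might write into `/etc/caddy/apps/`. It is an assumption about the fragment's shape, not the workflow's actual template: `render_app_fragment` is a hypothetical helper, and the upstream name `my-app:8000` is a placeholder (port 8000 follows the architecture diagram). `reverse_proxy` is Caddy's standard proxy directive.

```python
def render_app_fragment(hostname: str, upstream: str) -> str:
    """Render a three-line Caddyfile site block that proxies to one upstream."""
    return (
        f"{hostname} {{\n"
        f"    reverse_proxy {upstream}\n"
        f"}}\n"
    )


if __name__ == "__main__":
    # Placeholder hostname and upstream, for illustration only.
    print(render_app_fragment("app.example.com", "my-app:8000"), end="")
```

Because each app gets its own fragment, removing an app is a matter of deleting one file and reloading Caddy.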

**Trade-offs:**

- Nginx has broader community knowledge and more Stack Overflow answers
- Nginx supports more advanced configurations (TCP/UDP proxying, complex rewrites)
- Caddy is less battle-tested at high scale (not a concern for single-server deployments)
- Some hosting tutorials assume Nginx, requiring translation to Caddy equivalents
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
# ADR 003: Single-Server Architecture

**Date:** 2025-12-01
**Status:** Accepted

## Context

Modern deployment platforms typically use container orchestration (Kubernetes, Docker Swarm, Nomad) to distribute workloads across multiple servers. This provides high availability, auto-scaling, and fault tolerance — but at significant operational complexity.

Towlion targets indie developers, small SaaS products, and hobby projects. These applications typically serve a handful of users and run comfortably on a single server.

## Decision

Limit the platform to a single Debian server. All applications, databases, and infrastructure services run on one machine using Docker Compose.

## Consequences

**Benefits:**

- **Simplicity** — no cluster management, no service mesh, no distributed consensus
- **Low cost** — a single $12-24/month VPS runs the entire platform
- **Easy debugging** — everything is on one machine; `docker logs` and `docker exec` are sufficient
- **Fast deploys** — no image registry push/pull; images build locally from source
- **Predictable** — no network partitions, no node failures, no split-brain scenarios

**Trade-offs:**

- Single point of failure — if the server goes down, all apps go down
- No auto-scaling — resource limits are static
- Brief downtime during deploys (container rebuild takes a few seconds)
- Vertical scaling only — must move to a bigger server when capacity is reached
- Not suitable for high-traffic applications or strict uptime requirements

These are intentional design boundaries, not missing features. For workloads that need HA or auto-scaling, use Kubernetes, Fly.io, or a cloud-native platform instead.
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
# ADR 004: Fork-Based Self-Hosting

**Date:** 2025-12-01
**Status:** Accepted

## Context

Self-hosted software typically uses one of these distribution models:

1. **Installer script** — download and run a setup script (e.g., `curl | bash`)
2. **Docker image** — pull a pre-built image from a registry
3. **Helm chart / Terraform module** — declarative infrastructure definition
4. **Fork** — fork the repository and deploy from your own copy

Each model has different trade-offs for isolation, customization, and update management.

## Decision

Use the fork model for self-hosting. Each operator forks the application repository, configures their own GitHub Secrets, and deploys from their fork.

```
Original repo -> fork by Alice -> alice.example.com
              -> fork by Bob   -> bob.example.com
              -> fork by Carol -> carol.example.com
```

## Consequences

**Benefits:**

- **Strong isolation** — each fork is completely independent; no shared tenancy
- **Full customization** — operators can modify any code, workflow, or configuration
- **No installer to maintain** — the repository is the installer
- **GitHub handles updates** — operators can sync upstream changes via GitHub's fork sync
- **Built-in CI/CD** — GitHub Actions workflows come with the fork
- **Transparent** — all deployment logic is visible in the repository

**Trade-offs:**

- Operators must manage their own GitHub Secrets and server
- Upstream updates require manual sync (GitHub's "Sync fork" button)
- No centralized management across forks — each is fully independent
- Requires GitHub account and understanding of fork workflows
- Private forks require a GitHub paid plan
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
# ADR 005: AppArmor Over SELinux

**Date:** 2026-03-01
**Status:** Accepted

## Context

Mandatory Access Control (MAC) adds a security layer beyond standard Unix permissions. The two main MAC systems on Linux are AppArmor and SELinux. The platform runs on Debian 12, which ships with AppArmor enabled by default.

## Decision

Use AppArmor (Debian's native MAC system) instead of SELinux.

Docker automatically applies the `docker-default` AppArmor profile to all containers, which restricts capabilities like writing to `/proc` and `/sys`, mounting filesystems, and accessing raw sockets. No additional configuration is needed.

## Consequences

**Benefits:**

- **Zero configuration** — AppArmor is already active on Debian 12 out of the box
- **Docker integration** — Docker applies the `docker-default` profile automatically to every container
- **Debian-native** — maintained by the Debian security team, well-tested with Debian packages
- **Simple profile model** — AppArmor profiles are path-based and easier to read than SELinux policies

**Trade-offs:**

- AppArmor is less granular than SELinux (path-based vs. label-based)
- SELinux is the standard on RHEL/Fedora ecosystems; cross-distro documentation often assumes SELinux

**Why not SELinux on Debian:**

- SELinux policies on Debian are incomplete — the `selinux-policy-default` package lags far behind RHEL equivalents
- Docker + SELinux on Debian causes bind-mount labeling issues (`:z`/`:Z` volume flags) with no community support
- Enabling SELinux requires disabling AppArmor, losing Docker's automatic profile enforcement
- The Debian security team does not maintain SELinux policies to the same standard as AppArmor
