Open
17 commits
3113178
fix(p2p-gpu): move toolkit to dream-server/installers/p2p-gpu/
Arifuzzamanjoy May 23, 2026
ad9f381
fix(p2p-gpu): fix 5 runtime bugs from live Vast.ai deployment
Arifuzzamanjoy May 23, 2026
0779cf4
fix(p2p-gpu): replace all || : with || warn per reviewer contract
Arifuzzamanjoy May 24, 2026
d861fb7
fix(p2p-gpu): guard dpkg lock waits
Arifuzzamanjoy May 24, 2026
8127fbb
fix(p2p-gpu): fix SOUL.md mount error, host agent set -e crash, healt…
Arifuzzamanjoy May 24, 2026
f460d90
fix(p2p-gpu): fix SOUL.md directory cleanup, model verification
Arifuzzamanjoy May 24, 2026
4f0deb9
fix(p2p-gpu): add VRAM-aware context size capping to prevent OOM crashes
Arifuzzamanjoy May 25, 2026
a46a447
fix(p2p-gpu): host agent binding
Arifuzzamanjoy May 25, 2026
5e03529
fix(p2p-gpu): overlay multigpu envs + set LLM_MODEL + total VRAM cap
Arifuzzamanjoy May 26, 2026
18a1e5b
Fix multi-GPU llama env handling in p2p-gpu
Arifuzzamanjoy May 26, 2026
640596a
fix(p2p-gpu): initialize kernel_version and lib_version to prevent un…
Arifuzzamanjoy May 26, 2026
c24175c
fix(p2p-gpu): curl --fail, cd guards, clarify logging deviation comment
Arifuzzamanjoy May 27, 2026
359d2f7
ci(p2p-gpu): re-add workflow at new installers path (syntax + nvml re…
Arifuzzamanjoy May 31, 2026
a345da0
feat(p2p-gpu): isolate missing-image services instead of collapsing t…
Arifuzzamanjoy May 31, 2026
14b702f
Fix multi-GPU OOM issues
Arifuzzamanjoy Jun 4, 2026
0cf8ffb
Revise setup resources in README.md
Arifuzzamanjoy Jun 5, 2026
8d150e9
docs(p2p-gpu): align README with code
Arifuzzamanjoy Jun 6, 2026
37 changes: 37 additions & 0 deletions .github/workflows/p2p-gpu.yml
@@ -0,0 +1,37 @@
name: P2P GPU checks

on:
  push:
    branches: [main]
    paths:
      - "dream-server/installers/p2p-gpu/**"
      - ".github/workflows/p2p-gpu.yml"
  pull_request:
    branches: [main]
    paths:
      - "dream-server/installers/p2p-gpu/**"
      - ".github/workflows/p2p-gpu.yml"

permissions:
  contents: read

jobs:
  p2p-gpu:
    name: P2P GPU syntax + regression
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6

      - name: Bash syntax check (p2p-gpu)
        run: |
          shfiles=$(find dream-server/installers/p2p-gpu -name '*.sh' -type f)
          if [ -z "$shfiles" ]; then
            echo "No .sh files found under dream-server/installers/p2p-gpu"
            exit 0
          fi
          # bash -n parses only its first file argument (the rest become
          # positional parameters), so invoke it once per file.
          echo "$shfiles" | xargs -n 1 bash -n

      - name: NVML mismatch regression
        run: |
          # Live Vast.ai + GPU validation is performed manually outside CI.
          bash dream-server/installers/p2p-gpu/tests/test-nvml-mismatch.sh
196 changes: 196 additions & 0 deletions dream-server/installers/p2p-gpu/README.md
@@ -0,0 +1,196 @@
# P2P GPU Deploy — DreamServer on Peer-to-Peer GPU Marketplaces

Production-hardened deployment of the full DreamServer AI stack on rented GPU instances from peer-to-peer compute marketplaces (Vast.ai tested; architecture is provider-agnostic).

**One command. All bundled services. Any NVIDIA/AMD GPU or CPU-only instance.**

Automatically handles 28 known P2P GPU environment issues: root user rejection, Docker socket permissions, CPU limit overflow, /tmp permissions, NVIDIA toolkit setup, NVML driver/library mismatch, multi-GPU support, SSH tunneling, package manager locks, and more. Includes built-in recovery commands, health checks, and model auto-swap capabilities.

## What It Solves

**The Problem:** Deploying DreamServer on rented GPU instances is fragile. Root-only environments, non-standard filesystem permissions, held package locks, missing GPU drivers, and provider-specific quirks cause silent failures during setup.

**The Solution:** `setup.sh` is a battle-tested orchestrator that detects and fixes the known issues automatically. It handles permission escalation, creates a non-root `dream` user, manages Docker group access, installs missing NVIDIA/AMD toolkits, applies POSIX ACLs for multi-container file sharing, and starts all bundled services (discovered from extension manifests) with health checks. If setup partially completes, recovery commands bring the stack back online without reinstall.

## Quick Start

```bash
# On your GPU instance (as root):
bash setup.sh # Full install (~10 min)
bash setup.sh --status # Health check
bash setup.sh --info # Show connection URLs and SSH tunnel commands
bash setup.sh --teardown # Stop all services
```

## Setup Guide

- [Setup tutorial video](https://drive.google.com/file/d/12CY9-KTyCsqRGtyaauqmvsupoh3jocBL/view?usp=sharing)
- [Setup presentation slides](https://docs.google.com/presentation/d/1XbVNV1n04JiOyAIkA6bU5r5A9T7uBnLr/edit?usp=sharing)

## Quick Recovery (If Phase 9 Fails)

If setup reached "Starting services" but URLs are unreachable:

```bash
bash setup.sh --fix
bash setup.sh --status
bash setup.sh --info
```

This re-applies CPU caps, permissions, and network fixes, restarts compose,
and prints fresh access commands.

On Windows, use the all-port tunnel from `--info` (it uses a safe local alias
`58080 -> dashboard` plus direct localhost forwards for service ports).

`--fix` regenerates reconnect scripts:
- `connect-tunnel.sh` (Linux/macOS/WSL)
- `connect-tunnel.ps1` (Windows PowerShell)

## What It Does

The setup script handles 28 known issues with P2P GPU environments:

| # | Issue | Fix |
|---|-------|-----|
| 01 | Root user rejection | Creates non-root `dream` user |
| 02 | Docker socket denied | Adds dream to docker group |
| 03 | /tmp broken | Fixes permissions to 1777 |
| 04 | CPU limit overflow | Auto-caps to actual core count |
| 05 | n8n uid mismatch | Dynamic UID from compose.yaml |
| 06 | dashboard-api write | ACL-based permission system |
| 07 | comfyui models write | AMD/NVIDIA layout detection |
| 08 | WEBUI_SECRET missing | Auto-generated secrets |
| 09 | Dual directory confusion | Smart directory discovery |
| 10 | Dashboard stuck Created | Auto-nudge on startup |
| 11 | HuggingFace throttle | aria2c multi-threaded download |
| 12 | NVIDIA toolkit missing | Auto-installs + configures |
| 13 | Disk space insufficient | Pre-flight validation |
| 14 | Compose v1 syntax | Auto-detects v1 vs v2 |
| 15 | .env duplicates | Idempotent env_set() |
| 16 | Port conflicts | Dynamic port discovery |
| 17 | DNS resolution failure | Google/Cloudflare DNS fallback |
| 18 | /dev/shm too small | Remount /dev/shm to 4GB |
| 19 | Bootstrap model missing | Auto-downloads Qwen3-0.6B |
| 20 | llama-server infinite hang | 45s diagnosis + OOM recovery |
| 21 | No systemd | Host-agent background start |
| 22 | OpenCode crash-loop | Auto-disable non-essential |
| 23 | CUDA OOM on large models | Swap to smallest model |
| 24 | ComfyUI infinite hang | Background download, don't block |
| 25 | Installer hang | 10-minute cap on the installer run |
| 26 | AMD GPU support | ROCm detection + compose overlay |
| 27 | CPU-only fallback | Works without any GPU |
| 28 | NVML driver/library mismatch | Detect + targeted repair (regression-tested) |
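
As an illustration of fix 15, here is a minimal sketch of what an idempotent `.env` setter could look like. It is not the `env_set()` shipped in `lib/environment.sh`; the argument order and the temp-file approach are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an idempotent .env setter (not the toolkit's own code).
# Usage: env_set <file> <KEY> <value>
env_set() {
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp)" || return 1
  # Copy every line except existing assignments of this key.
  # grep exits 1 when nothing is left to print, which is fine here.
  if [ -f "$file" ]; then
    grep -v "^${key}=" "$file" > "$tmp" || true
  fi
  # Append exactly one assignment for the key.
  printf '%s=%s\n' "$key" "$value" >> "$tmp"
  # Write back in place so an existing file keeps its owner and mode.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

Filtering with `grep -v` instead of substituting with `sed` means values containing `|`, `&`, or `/` need no escaping, and repeated calls always leave a single line per key. The sketch assumes keys contain only letters, digits, and underscores.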

## Architecture

```
p2p-gpu/
├── setup.sh                    # Orchestrator: sources libs, runs phases
├── config/
│   └── service-hints.yaml      # p2p-gpu-only manifest overrides (proxy_mode, startup_behavior)
├── lib/                        # Pure function libraries (no side effects)
│   ├── constants.sh            # Paths, versions, colors, thresholds
│   ├── logging.sh              # log/warn/err/step, cleanup trap, flock, dpkg-lock release
│   ├── environment.sh          # .env management, GPU detection, HTTP polling
│   ├── permissions.sh          # POSIX ACLs, setgid, UID-specific fixes
│   ├── services.sh             # Manifest discovery, compose, startup
│   ├── networking.sh           # Caddy proxy, SSH tunnel, Cloudflare
│   ├── models.sh               # Model download, URL resolution, swap watcher
│   ├── gpu-topology.sh         # Per-GPU enumeration, NVLink/PCIe topology, GPU↔service assignment
│   └── compatibility.sh        # Whisper/TTS/ComfyUI/OpenClaw fixes
├── phases/                     # Sequential install steps
│   ├── 00-preflight.sh         # GPU/disk/Docker/DNS validation
│   ├── 01-dependencies.sh      # System package installation
│   ├── 02-user-setup.sh        # Create dream user + groups
│   ├── 03-repository.sh        # Clone DreamServer repo
│   ├── 04-installer.sh         # Run DreamServer installer (with timeout)
│   ├── 05-post-install.sh      # Apply fixes, locate working directory
│   ├── 06-bootstrap-model.sh   # Ensure usable GGUF model exists
│   ├── 07-model-optimize.sh    # Resume/restart downloads with aria2c
│   ├── 08-vastai-quirks.sh     # Provider-specific environment fixes
│   ├── 09-services.sh          # Start containers + health monitoring
│   ├── 10-voice-stack.sh       # TTS/STT model readiness gates
│   ├── 11-access-layer.sh      # Caddy proxy + Cloudflare tunnel + SSH
│   └── 12-summary.sh           # Print access info
├── subcommands/                # Alternative entry points
│   ├── teardown.sh             # Stop all services
│   ├── status.sh               # Health check dashboard
│   ├── resume.sh               # Quick restart after SSH drop
│   ├── fix.sh                  # Apply fixes without reinstall
│   └── info.sh                 # Show connection URLs
└── tests/
    └── test-nvml-mismatch.sh   # NVML mismatch repair-path regression (run in CI)
```

## Design Principles

Aligned with DreamServer's [CLAUDE.md](../../../CLAUDE.md):

- **Let It Crash** — `set -euo pipefail` throughout; errors are fatal unless a failure is explicitly tolerated with `|| warn`. Non-essential services degrade independently, so a working dashboard with a degraded ComfyUI beats a dead stack on an instance you're paying for.
- **KISS** — readable over clever; one function, one job.
- **Functional core, imperative shell** — `lib/` holds pure helpers; `phases/` is the imperative shell that runs on source.
- **Manifest-driven** — services are discovered from extension manifests, never a hardcoded list.
- **PID-file process tracking** — background jobs (model downloads, swap watcher, tunnels) are tracked by PID file under `/var/run/dreamserver-p2p-gpu/` and stopped by PID.
- **ACL-primary permissions** — shared-data directories use setgid + POSIX ACLs as their only sharing mechanism. Failures on those paths abort the install (`exit 1`) rather than degrading to world-writable permissions; per-extension ACLs are applied independently so one extension's failure doesn't block the rest.
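
The PID-file principle above can be sketched as follows. This is a hypothetical illustration, not the toolkit's implementation: `track_start`, `track_stop`, and the temp-directory default for `PID_DIR` are assumptions (the toolkit itself uses `/var/run/dreamserver-p2p-gpu/`).

```shell
#!/usr/bin/env bash
# Hypothetical sketch of PID-file process tracking (names are illustrative).
# PID_DIR defaults to a temp directory so the sketch runs without root.
PID_DIR="${PID_DIR:-$(mktemp -d)}"

# Start a command in the background and record its PID under a name.
track_start() {
  local name="$1"; shift
  "$@" &
  echo "$!" > "${PID_DIR}/${name}.pid"
}

# Stop a tracked job by PID, then remove its PID file.
track_stop() {
  local name="$1" pidfile pid
  pidfile="${PID_DIR}/${name}.pid"
  [ -f "$pidfile" ] || return 0
  pid="$(cat "$pidfile")"
  if kill -0 "$pid" 2>/dev/null; then
    kill "$pid" 2>/dev/null || true
    # Reap the job when it is a child of this shell; ignored otherwise.
    wait "$pid" 2>/dev/null || true
  fi
  rm -f "$pidfile"
}
```

For example, `track_start download sleep 300` followed later by `track_stop download` stops exactly that job. Stopping by recorded PID avoids pattern-matching on process names, which could hit unrelated processes on a shared host.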

## Commands

| Command | Purpose |
|---------|---------|
| `bash setup.sh` | Full install (first time or re-install) |
| `bash setup.sh --resume` | Quick restart — re-apply fixes + start services |
| `bash setup.sh --status` | Health check — GPU, containers, ports |
| `bash setup.sh --info` | Show connection URLs and SSH tunnel commands |
| `bash setup.sh --fix` | Apply latest fixes without full reinstall |
| `bash setup.sh --teardown` | Stop all services |
| `bash setup.sh --dry-run` | Preview what would happen without making changes |

## Model Download and Auto-Swap

- Setup starts quickly on a small model, downloads the GPU-tier model in background, then auto-swaps when ready.
- Swap updates both `GGUF_FILE` and `LLM_MODEL`, then restarts dependent services.
- Dashboard model downloads (`/models` page) require the Dream host agent; setup auto-starts it during service startup.
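
A swap watcher needs some way to tell a finished download from a partial one. The sketch below shows one plausible check, assuming downloads run through aria2c, which keeps a `<file>.aria2` control file next to the target until the download completes. The function name is hypothetical and the toolkit's actual watcher in `lib/models.sh` may use a different signal.

```shell
#!/usr/bin/env bash
# Hypothetical readiness check for a background model download.
# Assumption: aria2c removes "<file>.aria2" once the download has finished.
model_ready() {
  local path="$1"
  # Ready when the file exists, is non-empty, and has no control file.
  [ -s "$path" ] && [ ! -e "${path}.aria2" ]
}
```

A watcher loop could poll `model_ready "$DS_DIR/data/models/$MODEL"` every few seconds and trigger the swap on the first success; the path in that call is illustrative only.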

```bash
# Manual swap: point .env at a downloaded model, then restart dependent services.
MODEL="Qwen3-30B-A3B-Q4_K_M.gguf"
DS_DIR="${DS_DIR:-/home/dream/dream-server}"
# Derive LLM_MODEL: drop the .gguf extension and the quantization suffix, then lowercase.
LLM_MODEL="$(echo "$MODEL" | sed -E 's/\.(gguf|GGUF)$//' | sed -E 's/-Q[0-9]+([._][A-Za-z0-9]+)*$//' | tr '[:upper:]' '[:lower:]')"
cd "$DS_DIR" \
  && sed -i "s|^GGUF_FILE=.*|GGUF_FILE=${MODEL}|" .env \
  && { grep -q '^LLM_MODEL=' .env && sed -i "s|^LLM_MODEL=.*|LLM_MODEL=${LLM_MODEL}|" .env || echo "LLM_MODEL=${LLM_MODEL}" >> .env; } \
  && docker compose $(cat .compose-flags 2>/dev/null) up -d llama-server \
  && for c in dream-dreamforge dream-openclaw dream-dashboard-api dream-webui; do
       # Restart only containers that are running; warn only on a real restart failure.
       if docker ps --format '{{.Names}}' | grep -qx "$c"; then
         docker restart "$c" >/dev/null || echo "[warn] ${c} restart failed (non-fatal)" >&2
       fi
     done
```

```bash
# Follow the background model download log
tail -f /home/dream/dream-server/logs/aria2c-download.log
```

```bash
# If Dashboard shows "Failed to start download"
su - dream -c 'cd /home/dream/dream-server && DREAM_HOME=/home/dream/dream-server ./dream-cli agent start'
```

## Provider Support

Currently tested on **Vast.ai**. The architecture is provider-agnostic:
- GPU detection works for any NVIDIA/AMD/CPU-only instance
- Docker + compose requirements are standard
- Provider-specific quirks isolated in `phases/08-vastai-quirks.sh`

The active provider is selected by `PROVIDER_NAME` (override with `P2P_GPU_PROVIDER`
before running). To add a new provider, create `phases/08-<provider>-quirks.sh` with
provider-specific fixes.
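
One way the provider lookup could work is sketched below. It is a hypothetical illustration of the naming convention only: the function name, the `vastai` default, and the warning text are assumptions, not code taken from `setup.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: resolve the quirks phase file for the active provider.
# Assumption: the provider name defaults to "vastai" unless P2P_GPU_PROVIDER is set.
quirks_file_for() {
  local phases_dir="$1"
  local provider="${P2P_GPU_PROVIDER:-vastai}"
  local candidate="${phases_dir}/08-${provider}-quirks.sh"
  if [ -f "$candidate" ]; then
    printf '%s\n' "$candidate"
  else
    echo "[warn] no quirks file for provider '${provider}'" >&2
    return 1
  fi
}
```

With this convention, adding a provider is a single new file, for example `phases/08-myprovider-quirks.sh`, selected by running `P2P_GPU_PROVIDER=myprovider bash setup.sh`. The provider name here is a placeholder.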

## Security

- `.env` files created with `0660` mode, owned `dream:dream` — readable by the `dream` group the containers run under, never world-readable
- SSH private keys forced to `0600`
- Background process PIDs tracked in `/var/run/dreamserver-p2p-gpu/`
- Cloudflare tokens passed via environment variables (not CLI args)
- `cloudflared` binary verified against the upstream SHA256 when the checksum file is reachable; on mismatch the tunnel is skipped
- POSIX ACLs required; world-writable permissions are never used
- Multi-UID directories documented with reasons for broader access
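
The checksum gate for `cloudflared` can be pictured with a sketch like the one below. It assumes GNU `sha256sum` is available; the function name and arguments are hypothetical and do not reflect the toolkit's actual code.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a SHA256 gate for a downloaded binary.
# Usage: verify_sha256 <file> <expected-hex-digest>
verify_sha256() {
  local file="$1" expected="$2" actual
  # sha256sum prints "<digest>  <filename>"; keep only the digest.
  actual="$(sha256sum "$file" | awk '{print $1}')" || return 1
  [ "$actual" = "$expected" ]
}
```

A caller would skip the tunnel on mismatch, for example `verify_sha256 ./cloudflared "$expected" || echo "[warn] checksum mismatch, skipping tunnel" >&2`, where `$expected` comes from the upstream checksum file.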

## Related

- [`../../../README.md`](../../../README.md) — DreamServer project overview
- [`../../../CLAUDE.md`](../../../CLAUDE.md) — design philosophy and error-handling rules
- [`../../docs/INSTALLER-ARCHITECTURE.md`](../../docs/INSTALLER-ARCHITECTURE.md) — installer module map and header convention
- [`../../docs/EXTENSIONS.md`](../../docs/EXTENSIONS.md) — service/extension manifest model
- [`../../CONTRIBUTING.md`](../../CONTRIBUTING.md) — contribution and validation guide
- [`../../../SECURITY.md`](../../../SECURITY.md) — security policy and disclosure
23 changes: 23 additions & 0 deletions dream-server/installers/p2p-gpu/config/service-hints.yaml
@@ -0,0 +1,23 @@
# P2P GPU deployment hints — service-specific overrides for the setup script.
# These supplement manifest.yaml defaults ONLY within the p2p-gpu context.
# When upstream adopts proxy_mode/startup_behavior as first-class manifest
# fields, delete this file and remove the hints merge in lib/services.sh.

comfyui:
  proxy_mode: root
  startup_behavior: heavy

dashboard:
  proxy_mode: root

open-webui:
  proxy_mode: root

perplexica:
  startup_behavior: heavy

tts:
  startup_behavior: heavy

whisper:
  startup_behavior: heavy