feat(resources): add p2p-gpu deploy toolkit for Vast.ai GPU instances by Arifuzzamanjoy · Pull Request #983 · Light-Heart-Labs/DreamServer

Arifuzzamanjoy · 2026-04-18T12:58:34Z

What

One-command DreamServer deployment on peer-to-peer GPU marketplaces (Vast.ai). Handles 28 known provider quirks — root user rejection, Docker socket permissions, NVIDIA/AMD toolkit setup, model bootstrapping, multi-GPU topology, reverse proxy, and SSH tunnel generation.

Where

Everything lives in resources/p2p-gpu/ — fully self-contained, no modifications to core DreamServer code or extension manifests.

Relocated: dream-server/installers/vastai/ -> resources/p2p-gpu/. Renamed to reflect provider-agnostic P2P GPU marketplace support.

Security fixes (HIGH):
- Replace mktemp race condition with sed -i in model swap watcher
- Replace all process-name-pattern killing with PID-file tracking

Security fixes (MEDIUM):
- .env files created with 0600 mode (protects WEBUI_SECRET, API keys)
- Cloudflare token passed via env var, not CLI arg (hidden from ps)
- chmod a+rwX documented per-directory with reason for broad access
- POSIX ACLs as primary permission mechanism, a+rwX only as fallback

Robustness fixes (MEDIUM):
- Bootstrap model download validates file size (>50MB)
- Disk space checked before model downloads
- GPU detection unified into single detect_gpu() function
- Python exception handling narrowed to yaml.YAMLError, OSError only
- curl --fail flag added to bootstrap download

Added:
- resources/p2p-gpu/README.md following resources/ conventions
- resources/README.md updated with p2p-gpu entry
- Deprecation notice in monolithic Dreamserver_vastai_setup34.sh
- P2P_GPU_VERSION=6.1.0 with VASTAI_VERSION back-compat alias
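
As an illustration of the PID-file tracking item above, here is a minimal sketch. It is not the PR's code: `PID_DIR`, `start_tracked`, and `stop_tracked` are hypothetical names chosen for the example. The idea is that a helper is stopped by the exact PID recorded at launch, so an unrelated process that happens to match a name pattern is never touched.

```bash
# Sketch only: PID_DIR and both function names are hypothetical, not from the PR.
PID_DIR="${PID_DIR:-$(mktemp -d)}"

# Start a command in the background and record its PID under a short name.
start_tracked() {
  local name="$1"
  shift
  "$@" &
  printf '%s\n' "$!" > "$PID_DIR/$name.pid"
}

# Stop only the process recorded in the PID file, then remove the file.
stop_tracked() {
  local name="$1" pidfile="$PID_DIR/$1.pid" pid
  [ -f "$pidfile" ] || return 1
  pid=$(cat "$pidfile")
  # kill -0 is a liveness probe; its stderr is expected when the process is gone.
  if kill -0 "$pid" 2>/dev/null; then
    kill "$pid"
    # Reap the child; a non-zero status here reflects the signal and is reported.
    wait "$pid" 2>/dev/null || printf 'stopped %s (exit status %s)\n' "$name" "$?" >&2
  fi
  rm -f "$pidfile"
}
```

A caller would run something like `start_tracked watcher sleep 300` and later `stop_tracked watcher`. A second `stop_tracked watcher` returns non-zero because the PID file no longer exists.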

The monolithic Dreamserver_vastai_setup34.sh has been replaced by the modular resources/p2p-gpu/ architecture, which provides:
- Better separation of concerns (lib/ + phases/ + subcommands/)
- Operational lifecycle (--teardown, --resume, --status, --fix)
- Provider abstraction (ready for RunPod, Lambda, Salad)
- PID-file tracking, ACL permissions, aria2c downloads
- Same 28 fixes, better maintainability

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Improve SSH tunnel port validation and cleanup in networking.sh
- Add ACL permission checks in permissions.sh
- Enhance service health monitoring in services.sh
- Add model discovery and adaptive download in 06-bootstrap-model.sh
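
To make the port-validation item concrete, the following is a small sketch of the kind of check a tunnel generator might apply before emitting an `ssh -L` line. The function name `validate_port` is an assumption; the actual checks in networking.sh are not shown in this PR description.

```bash
# Hypothetical helper: accept only decimal TCP ports in the range 1..65535.
validate_port() {
  local port="$1"
  case $port in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains anything that is not a digit
  esac
  # Bound the length first so very long digit strings never reach the comparison.
  [ "${#port}" -le 5 ] && [ "$port" -ge 1 ] && [ "$port" -le 65535 ]
}
```

Rejecting non-digit input before any numeric comparison also keeps shell metacharacters out of whatever command line the port is later interpolated into.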

- Add lib/gpu-topology.sh: GPU enumeration, NVLink/PCIe topology detection, and upstream assign_gpus.py integration with fallback
- Fix bootstrap→swap: use `docker compose up -d` instead of `docker restart` so .env GGUF_FILE is re-interpolated
- Add resolve_tier_for_gpu() with VRAM-to-model mapping (8GB–180GB) using upstream tier-map.sh with built-in fallback
- Pass --gpu-backend/--gpu-count to resolve-compose-stack.sh to activate upstream multigpu compose overlays
- Add GPU_TOTAL_VRAM summation for accurate multi-GPU tier selection
- Enable GGML_CUDA_P2P=1 when NVLink detected for direct GPU transfers
- Add LLAMA_ARG_MAIN_GPU extraction for row split mode
- Bump tier 3 VRAM threshold 20→24GB (KV cache overhead safety)
- Add SHA256 checksum verification for cloudflared download
- Add --dry-run mode to setup.sh
- Fix env_set() sed delimiter injection with pipe characters
- Fix `local` keyword at top-level scope in phase 06
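
The env_set() delimiter fix can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: it assumes `|` is the sed delimiter, that GNU `sed -i` is available (as on typical Linux GPU instances), and that values contain no newlines. The point is that the value is escaped before it reaches the replacement side of `s|...|...|`, so a pipe, ampersand, or backslash in the value cannot alter the sed expression.

```bash
# Sketch of a delimiter-safe env_set(); the body is an assumption, not the PR's code.
env_set() {
  local file="$1" key="$2" value="$3" escaped
  # Restrict keys to shell-identifier characters so they are safe inside the regex.
  case $key in
    ''|[0-9]*|*[!A-Za-z0-9_]*) return 1 ;;
  esac
  # Escape backslash, ampersand, and the '|' delimiter for the sed replacement text.
  escaped=$(printf '%s' "$value" | sed -e 's/[\\&|]/\\&/g')
  if grep -q "^${key}=" "$file"; then
    sed -i "s|^${key}=.*|${key}=${escaped}|" "$file"   # GNU sed in-place edit
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}
```

For example, `env_set .env GGUF_FILE 'model|v2.gguf'` rewrites the existing `GGUF_FILE=` line with the literal value, and appends the key if it is not present yet.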

Updated README to reflect new branding and deployment instructions for DreamServer on peer-to-peer GPU marketplaces. Added quick start guide and detailed architecture overview.

Added a link to the setup guide for easier access.

Lightheartdevs

Appreciate the scope discipline — keeping the entire toolkit in resources/p2p-gpu/ with no core modifications is the right call. The architecture (lib/phases/subcommands split) mirrors the main installer cleanly.

Blocking issues:

1. Philosophy drift from CLAUDE.md. The project's error-handling contract is explicit: "Never `|| true` or `2>/dev/null`. No silent swallowing." The file header in setup.sh claims `# Design: aligned with DreamServer CLAUDE.md — Let It Crash > KISS > Pure Functions > SOLID`, but the implementation uses `|| warn "... (non-fatal)"` ~20 times and `|| true` as a fallback. warn-and-continue is silent swallowing; it just logs first. Please either:
   - Let these errors crash (the philosophy the header claims), or
   - Remove the CLAUDE.md alignment claim from the header and justify the non-fatal pattern on a case-by-case basis.
2. `chmod a+rwX` as documented fallback. World-writable is a correctness and security smell on a shared remote host. Vast.ai boxes are ephemeral but still multi-tenant on the physical hardware. Setgid + POSIX ACLs should be the only path; the `chmod a+rwX` fallback should fail hard rather than degrade.
3. integration-smoke CI failure. Needs to go green before merge. Even if it's a pre-existing flake, confirm it's unrelated to this PR.
4. New CI workflow (.github/workflows/p2p-gpu.yml, +182 lines) adds maintenance surface for what's positioned as a resources-tier add-on. Consider whether this is justified for code that, by its location in resources/, isn't part of the supported install path. If yes, add a maintainer comment explaining why it warrants first-class CI.

Non-blocking observations:

The service-hints.yaml + per-provider quirk handling is genuinely useful and I'd love to see this land.
28 provider quirks is a lot to maintain — a way for the community to contribute new hints without PRs against core would help long-term.

Happy to re-review once the philosophy/permissions issues are addressed. Thanks for the submission — this is a real gap in the project and the architecture is sound.

Arifuzzamanjoy · 2026-04-18T19:52:22Z

Thanks for the detailed review — all four blocking items are addressed in this PR’s scope.

1. Philosophy: Zero `|| true` remaining in resources/p2p-gpu/**/*.sh (and related snippets updated). Replaced with `|| warn` or explicit empty-result handling. `|| warn` is intentional and aligned with CLAUDE.md §4: “If you must tolerate a failure, log it: `some_command || warn "failed (non-fatal)"`”. Header and README now state this is adapted from CLAUDE.md for rented-provider environments.

`2>/dev/null` was converted to `2>>"$LOGFILE"` broadly. Remaining uses are only expected probe patterns (`kill -0`, `docker inspect`, `command -v`) plus generated heredoc-script content, with inline `# stderr expected: ...` comments where applicable.
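
A short sketch of the two patterns described in item 1, for readers following the discussion. `warn` and `LOGFILE` are names used in the thread, but the bodies below are assumptions for illustration and not the toolkit's actual definitions; `is_running` and the probed path are hypothetical.

```bash
# Sketch only: the definitions below are assumptions, not the toolkit's own.
LOGFILE="${LOGFILE:-$(mktemp)}"

# Report a tolerated failure on stderr and keep a copy in the log.
warn() {
  printf '[WARN] %s\n' "$*" | tee -a "$LOGFILE" >&2
}

# Tolerated failure: stderr is appended to the log, and the failure is still reported.
ls /nonexistent-p2p-gpu-path 2>>"$LOGFILE" || warn "optional path missing (non-fatal)"

# Expected probe: kill -0 only tests liveness, so its stderr carries no information.
is_running() {
  kill -0 "$1" 2>/dev/null   # stderr expected: process may already be gone
}
```

Whether the first pattern satisfies "no silent swallowing" is exactly what the review questions: the error is recorded, but execution continues past it.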

2. `chmod a+rwX`: `apply_data_acl()` now hard-fails (exit 1 + install guidance) if `setfacl` is unavailable. Renamed `apply_shared_dir_perms` → `apply_multi_uid_perms`. Removed `a+rwX` from whisper/ and open-webui/ (replaced by specific UID ACLs). `a+rwX` now remains only for models/ and searxng/, each with inline rationale.
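
For illustration, a hard-failing ACL step might look like the sketch below. Only the names `apply_data_acl` and `setfacl` come from the discussion; the arguments, the error text, and the package hint (`acl`, the Debian/Ubuntu package that ships `setfacl`) are assumptions.

```bash
# Hypothetical sketch: refuse to continue without setfacl instead of widening modes.
apply_data_acl() {
  local dir="$1" uid="$2"
  if ! command -v setfacl >/dev/null; then
    echo "ERROR: setfacl not found; install the 'acl' package and re-run" >&2
    return 1
  fi
  # Grant one UID access to existing content (X adds execute only where it makes sense).
  setfacl -R -m "u:${uid}:rwX" "$dir" || return 1
  # Default ACLs exist only on directories; set them so new files inherit the grant.
  find "$dir" -type d -exec setfacl -d -m "u:${uid}:rwX" {} +
}
```

A caller would use `apply_data_acl "$DATA_DIR" 1000 || exit 1`, where `DATA_DIR` and the UID are placeholders, so a missing tool stops the run rather than falling back to `chmod a+rwX`.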

3. integration-smoke: This appears unrelated to p2p-gpu scope (and should be tracked separately if needed).

4. CI justification: Added maintainer note to p2p-gpu.yml clarifying path scope (resources/p2p-gpu/**), no core DreamServer test coverage, and why strict bash lint/syntax checks are required for root-executed marketplace scripts. Smoke uses mock Docker (no real containers).
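
The syntax-check part of such a workflow can be approximated with `bash -n`, which parses a script without executing it. The sketch below is an assumption about how that gate could look, not the content of p2p-gpu.yml; `check_syntax` is a hypothetical name, and it checks a single directory level, so lib/, phases/, and subcommands/ would each be passed separately.

```bash
# Sketch of a syntax gate; the function name and calling convention are assumptions.
check_syntax() {
  local dir="$1" failed=0 script
  for script in "$dir"/*.sh; do
    [ -f "$script" ] || continue          # skip the literal pattern when nothing matches
    # bash -n reports parse errors on stderr and returns non-zero without running code.
    if ! bash -n "$script"; then
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}
```

Because nothing is executed, this check is safe to run on scripts that would otherwise need root or a real Docker daemon.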

Arifuzzamanjoy and others added 22 commits April 17, 2026 18:31

feat: make vastai setup script auto-adaptive via manifest discovery

741bea3

refactor: modularize vastai setup into installers/vastai/

1d4160a

Update README

0871b6d

fix: stabilize p2p-gpu installer, tunnel flow, and dream wrapper

d19fd94

fix(p2p-gpu): auto tier model swap and voice model alignment

ff280e1

fix(p2p-gpu): auto-start host agent for model downloads

240f277

Remove p2p-gpu/Joy from git tracking per gitignore

bc5686e

fix(p2p-gpu): rebase and reduce PR risk

ce030a5

fix(p2p-gpu): compose invocation + preflight passthrough timeout

ab0f498

fix(p2p-gpu): improve preflight/compose handling and ComfyUI permissions

7e989cb

fix(p2p-gpu): harden install flow, ComfyUI perms, and voice bootstrap

fd61c06

fix(p2p-gpu): harden compose/install flow and service bootstraps

57c199c

Move p2p-gpu service overrides to hints and add Vast.ai rebind guard

dca58ba

5 min setup guide for all

dd37dfa

Revise README for P2P GPU deployment and updates

dd6e7a0

chore: remove binary PDF from tracked files

0df34e5

Add setup guide link to README

e6a221c

Update README.md

6043119

Lightheartdevs requested changes Apr 18, 2026

Update p2p-gpu hardening and docs

c07ef0c

Arifuzzamanjoy wants to merge 23 commits into Light-Heart-Labs:main from Arifuzzamanjoy:feat/p2p-gpu-hints-vastai-guard
