feat(resources): add p2p-gpu deploy toolkit for Vast.ai GPU instances #983
Arifuzzamanjoy wants to merge 23 commits into Light-Heart-Labs:main from
Conversation
Relocated: dream-server/installers/vastai/ -> resources/p2p-gpu/
Renamed to reflect provider-agnostic P2P GPU marketplace support.

Security fixes (HIGH):
- Replace mktemp race condition with sed -i in model swap watcher
- Replace all process-name-pattern killing with PID-file tracking

Security fixes (MEDIUM):
- .env files created with 0600 mode (protects WEBUI_SECRET, API keys)
- Cloudflare token passed via env var, not CLI arg (hidden from ps)
- chmod a+rwX documented per-directory with reason for broad access
- POSIX ACLs as primary permission mechanism, a+rwX only as fallback

Robustness fixes (MEDIUM):
- Bootstrap model download validates file size (>50MB)
- Disk space checked before model downloads
- GPU detection unified into single detect_gpu() function
- Python exception handling narrowed to yaml.YAMLError and OSError only
- curl --fail flag added to bootstrap download

Added:
- resources/p2p-gpu/README.md following resources/ conventions
- resources/README.md updated with p2p-gpu entry
- Deprecation notice in monolithic Dreamserver_vastai_setup34.sh
- P2P_GPU_VERSION=6.1.0 with VASTAI_VERSION back-compat alias
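The PID-file change above can be sketched as follows. This is a minimal illustration, not the toolkit's code: the PID-file path and the `sleep` stand-in for the watcher process are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical path; the toolkit's actual PID-file location may differ.
PID_FILE="${TMPDIR:-/tmp}/model-swap-watcher.pid"

start_watcher() {
  sleep 60 &                       # stand-in for the real watcher process
  echo "$!" > "$PID_FILE"          # record the exact PID we started
}

stop_watcher() {
  [ -f "$PID_FILE" ] || return 0   # nothing recorded, nothing to stop
  local pid
  pid="$(cat "$PID_FILE")"
  # Signal only the recorded PID: a name-pattern kill (pkill -f ...) could
  # match an unrelated process on shared multi-tenant hardware.
  if kill -0 "$pid" 2>/dev/null; then
    kill "$pid"
  fi
  rm -f "$PID_FILE"
}
```

The point of the pattern: teardown can only ever touch a process the script itself started, which is what makes it safe on a box shared with other tenants.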
The monolithic Dreamserver_vastai_setup34.sh has been replaced by the modular resources/p2p-gpu/ architecture, which provides:
- Better separation of concerns (lib/ + phases/ + subcommands/)
- Operational lifecycle (--teardown, --resume, --status, --fix)
- Provider abstraction (ready for RunPod, Lambda, Salad)
- PID-file tracking, ACL permissions, aria2c downloads
- Same 28 fixes, better maintainability

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Improve SSH tunnel port validation and cleanup in networking.sh
- Add ACL permission checks in permissions.sh
- Enhance service health monitoring in services.sh
- Add model discovery and adaptive download in 06-bootstrap-model.sh
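As one illustration of the tunnel-port validation mentioned above, a check along these lines rejects non-numeric input and privileged ports before a tunnel is opened. This is a sketch; `networking.sh`'s actual logic may differ.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical validator sketch -- not the PR's actual networking.sh code.
# Accepts only an all-digit TCP port in the unprivileged range 1024-65535.
valid_port() {
  local p="$1"
  [[ "$p" =~ ^[0-9]+$ ]] || return 1   # digits only: rejects "", "-1", "8080x"
  (( p >= 1024 && p <= 65535 ))        # unprivileged, within TCP port range
}
```

Validating before use also makes cleanup safer: teardown only has to consider ports the script itself accepted.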
- Add lib/gpu-topology.sh: GPU enumeration, NVLink/PCIe topology detection, and upstream assign_gpus.py integration with fallback
- Fix bootstrap→swap: use `docker compose up -d` instead of `docker restart` so .env GGUF_FILE is re-interpolated
- Add resolve_tier_for_gpu() with VRAM-to-model mapping (8GB–180GB) using upstream tier-map.sh with built-in fallback
- Pass --gpu-backend/--gpu-count to resolve-compose-stack.sh to activate upstream multigpu compose overlays
- Add GPU_TOTAL_VRAM summation for accurate multi-GPU tier selection
- Enable GGML_CUDA_P2P=1 when NVLink detected for direct GPU transfers
- Add LLAMA_ARG_MAIN_GPU extraction for row split mode
- Bump tier 3 VRAM threshold 20→24GB (KV cache overhead safety)
- Add SHA256 checksum verification for cloudflared download
- Add --dry-run mode to setup.sh
- Fix env_set() sed delimiter injection with pipe characters
- Fix `local` keyword at top-level scope in phase 06
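The `env_set()` fix above matters because a naive `sed "s|KEY=.*|KEY=$value|"` breaks as soon as the value contains the delimiter. One delimiter-free approach is to rewrite the file instead of substituting in place; this is a sketch under assumed semantics, not the toolkit's real `env_set()`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sketch of a delimiter-safe env_set(): rebuild the file rather than using
# sed, so values containing '|', '/', '&', or '\' cannot corrupt the result.
env_set() {
  local file="$1" key="$2" value="$3" kept
  # keep every line that is not an existing KEY=... assignment
  # (awk's index() does a literal prefix match -- no regex metacharacters)
  kept="$(awk -v k="$key" 'index($0, k "=") != 1' "$file")"
  if [ -n "$kept" ]; then
    printf '%s\n' "$kept" > "$file"
  else
    : > "$file"
  fi
  printf '%s=%s\n' "$key" "$value" >> "$file"
}
```

Because no command ever interprets the value as a pattern, there is no escaping to get wrong.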
Updated README to reflect new branding and deployment instructions for DreamServer on peer-to-peer GPU marketplaces. Added quick start guide and detailed architecture overview.
Added a link to the setup guide for easier access.
Lightheartdevs left a comment
Appreciate the scope discipline — keeping the entire toolkit in resources/p2p-gpu/ with no core modifications is the right call. The architecture (lib/phases/subcommands split) mirrors the main installer cleanly.
Blocking issues:

- Philosophy drift from CLAUDE.md. The project's error-handling contract is explicit: "Never `|| true` or `2>/dev/null`. No silent swallowing." The file header in `setup.sh` claims `# Design: aligned with DreamServer CLAUDE.md — Let It Crash > KISS > Pure Functions > SOLID`, but the implementation uses `|| warn "... (non-fatal)"` ~20 times and `|| true` as a fallback. warn-and-continue is silent swallowing — it just logs first. Please either:
  - Let these errors crash (the philosophy the header claims), or
  - Remove the CLAUDE.md alignment claim from the header and justify the non-fatal pattern on a case-by-case basis.
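For what the let-it-crash option can look like in practice: under `set -euo pipefail`, an `ERR` trap gives the operator a located failure message without swallowing anything. A sketch, not code from this PR:

```shell
#!/usr/bin/env bash
# Sketch: crash loudly instead of '|| warn "... (non-fatal)"'. The ERR trap
# reports where the failure happened, then the script exits nonzero.
set -euo pipefail

trap 'echo "FATAL: line ${LINENO}: command failed: ${BASH_COMMAND}" >&2' ERR

# Every unguarded failing command now stops the run with a located message;
# for example, uncommenting the next line would print a "FATAL: line ...:
# command failed: false" diagnostic and exit nonzero.
#false
```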
- `chmod a+rwX` as documented fallback. World-writable is a correctness and security smell on a shared remote host. Vast.ai boxes are ephemeral but still multi-tenant on the physical hardware. Setgid + POSIX ACLs should be the only path; the `chmod a+rwX` fallback should fail hard rather than degrade.
- `integration-smoke` CI failure. Needs to go green before merge — even if it's a pre-existing flake, confirm it's unrelated to this PR.
- New CI workflow (`.github/workflows/p2p-gpu.yml`, +182 lines) adds maintenance surface for what's positioned as a resources-tier add-on. Consider whether this is justified for code that, by its location in `resources/`, isn't part of the supported install path. If yes, add a maintainer comment explaining why it warrants first-class CI.
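On the permissions point: a fail-hard ACL path might look like the following sketch. The helper name is hypothetical, and whether default ACLs are honored depends on the filesystem backing the directory.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: grant one service user rwX on a directory via POSIX
# ACLs (including a default ACL so new files inherit it) and abort rather
# than fall back to a world-writable 'chmod a+rwX'.
grant_dir_access() {
  local dir="$1" user="$2"
  if ! setfacl -m "u:${user}:rwX" -m "d:u:${user}:rwX" "$dir"; then
    echo "FATAL: POSIX ACLs unavailable on ${dir}; refusing world-writable fallback" >&2
    return 1
  fi
}
```

Failing here surfaces a misconfigured host immediately instead of silently widening permissions for every tenant on the box.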
Non-blocking observations:

- The `service-hints.yaml` + per-provider quirk handling is genuinely useful and I'd love to see this land.
- 28 provider quirks is a lot to maintain — a way for the community to contribute new hints without PRs against core would help long-term.
Happy to re-review once the philosophy/permissions issues are addressed. Thanks for the submission — this is a real gap in the project and the architecture is sound.
What
One-command DreamServer deployment on peer-to-peer GPU marketplaces (Vast.ai). Handles 28 known provider quirks — root user rejection, Docker socket permissions, NVIDIA/AMD toolkit setup, model bootstrapping, multi-GPU topology, reverse proxy, and SSH tunnel generation.
Where
Everything lives in `resources/p2p-gpu/` — fully self-contained, no modifications to core DreamServer code or extension manifests.