feat(resources): add p2p-gpu deploy toolkit for Vast.ai GPU instances#983

Open
Arifuzzamanjoy wants to merge 23 commits into Light-Heart-Labs:main from Arifuzzamanjoy:feat/p2p-gpu-hints-vastai-guard

@Arifuzzamanjoy
Contributor

What

One-command DreamServer deployment on peer-to-peer GPU marketplaces (Vast.ai). Handles 28 known provider quirks — root user rejection, Docker socket permissions, NVIDIA/AMD toolkit setup, model bootstrapping, multi-GPU topology, reverse proxy, and SSH tunnel generation.

Where

Everything lives in resources/p2p-gpu/ — fully self-contained, no modifications to core DreamServer code or extension manifests.

Arifuzzamanjoy and others added 22 commits April 17, 2026 18:31
Relocated: dream-server/installers/vastai/ -> resources/p2p-gpu/
Renamed to reflect provider-agnostic P2P GPU marketplace support.

Security fixes (HIGH):
- Replace mktemp race condition with sed -i in model swap watcher
- Replace all process-name-pattern killing with PID-file tracking
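
A minimal sketch of what PID-file tracking can look like, compared with killing by process-name pattern. The names PID_DIR, start_tracked, and stop_tracked are hypothetical and not taken from the PR; the point is only that the script stops the exact process it started rather than anything whose name happens to match.

```shell
# Hypothetical helpers illustrating PID-file tracking (names are assumptions).
PID_DIR="${PID_DIR:-$(mktemp -d)}"

# Start a command in the background and record its PID under a logical name.
start_tracked() {
  local name="$1"
  shift
  "$@" &
  echo "$!" > "$PID_DIR/$name.pid"
}

# Stop only the process recorded for that name; return 1 if nothing is tracked.
stop_tracked() {
  local name="$1" pidfile="$PID_DIR/$1.pid" pid status=0
  [[ -f "$pidfile" ]] || return 1
  pid="$(<"$pidfile")"
  # kill -0 only probes for liveness; stderr is expected when the PID is gone.
  if kill -0 "$pid" 2>/dev/null; then
    kill "$pid"
    # Reap the child. A non-zero status is normal after SIGTERM, and wait
    # reports an error if the PID was started by a different shell.
    wait "$pid" || status=$?
  fi
  rm -f "$pidfile"
}
```

Usage would be along the lines of `start_tracked watcher ./some-watcher.sh` followed later by `stop_tracked watcher`, where the watcher script name is a placeholder.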

Security fixes (MEDIUM):
- .env files created with 0600 mode (protects WEBUI_SECRET, API keys)
- Cloudflare token passed via env var, not CLI arg (hidden from ps)
- chmod a+rwX documented per-directory with reason for broad access
- POSIX ACLs as primary permission mechanism, a+rwX only as fallback
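
The first two items above could be sketched roughly as follows. Both helper names are hypothetical, and the PR's actual implementation may differ; this only shows one way to create an owner-only .env file and to hand a secret to a child process through its environment instead of its argument list.

```shell
# Hypothetical helpers; the names are not taken from the PR.

# Create (or tighten) an env file so only the owner can read or write it.
create_private_env() {
  local path="$1"
  # The subshell keeps the restrictive umask from leaking into the caller.
  # Appending nothing creates the file without truncating an existing one.
  ( umask 077 && : >> "$path" ) || return 1
  chmod 600 "$path"   # also fixes a pre-existing file with wider permissions
}

# Run a command with a secret in its environment rather than its argv,
# so the value does not appear in `ps` output.
run_with_secret_env() {
  local var="$1" value="$2"
  shift 2
  ( export "$var=$value"; exec "$@" )
}
```

Note that an environment variable is still readable by the same user and by root through /proc/<pid>/environ on Linux, so this hides the token from casual process listings rather than from a privileged observer.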

Robustness fixes (MEDIUM):
- Bootstrap model download validates file size (>50MB)
- Disk space checked before model downloads
- GPU detection unified into single detect_gpu() function
- Python exception handling narrowed to yaml.YAMLError, OSError only
- curl --fail flag added to bootstrap download
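
As an illustration of the download checks listed above, here is a sketch under stated assumptions: the function names are hypothetical, and the 50MB floor is interpreted as 50 MiB, which the commit message does not specify.

```shell
# Hypothetical helpers; names and the MiB interpretation are assumptions.
MIN_MODEL_BYTES=$((50 * 1024 * 1024))

# Succeed only if the filesystem holding "dir" has at least need_bytes free.
has_free_space() {
  local dir="$1" need_bytes="$2" avail_kb
  avail_kb="$(df -Pk "$dir" | awk 'NR==2 {print $4}')"
  [[ "$avail_kb" =~ ^[0-9]+$ ]] || return 1
  (( avail_kb * 1024 >= need_bytes ))
}

# Succeed only if "file" exists and is larger than min_bytes.
validate_min_size() {
  local file="$1" min_bytes="${2:-$MIN_MODEL_BYTES}" size
  [[ -f "$file" ]] || return 1
  size="$(wc -c < "$file")"
  (( size > min_bytes ))
}

# Check disk space, download, then reject a suspiciously small result.
fetch_model() {
  local url="$1" dest="$2" need_bytes="$3"
  has_free_space "$(dirname "$dest")" "$need_bytes" \
    || { echo "not enough disk space for $dest" >&2; return 1; }
  # --fail makes curl exit non-zero on HTTP errors instead of saving the error page.
  curl --fail --location --output "$dest" "$url" || return 1
  validate_min_size "$dest" \
    || { echo "download too small: $dest" >&2; rm -f "$dest"; return 1; }
}
```

The size check catches the case where a server returns a short HTML error body with a success status, which `--fail` alone would not detect.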

Added:
- resources/p2p-gpu/README.md following resources/ conventions
- resources/README.md updated with p2p-gpu entry
- Deprecation notice in monolithic Dreamserver_vastai_setup34.sh
- P2P_GPU_VERSION=6.1.0 with VASTAI_VERSION back-compat alias

The monolithic Dreamserver_vastai_setup34.sh has been replaced by the
modular resources/p2p-gpu/ architecture which provides:
- Better separation of concerns (lib/ + phases/ + subcommands/)
- Operational lifecycle (--teardown, --resume, --status, --fix)
- Provider abstraction (ready for RunPod, Lambda, Salad)
- PID-file tracking, ACL permissions, aria2c downloads
- Same 28 fixes, better maintainability

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Improve SSH tunnel port validation and cleanup in networking.sh
- Add ACL permission checks in permissions.sh
- Enhance service health monitoring in services.sh
- Add model discovery and adaptive download in 06-bootstrap-model.sh
- Add lib/gpu-topology.sh: GPU enumeration, NVLink/PCIe topology
  detection, and upstream assign_gpus.py integration with fallback
- Fix bootstrap→swap: use `docker compose up -d` instead of
  `docker restart` so .env GGUF_FILE is re-interpolated
- Add resolve_tier_for_gpu() with VRAM-to-model mapping (8GB–180GB)
  using upstream tier-map.sh with built-in fallback
- Pass --gpu-backend/--gpu-count to resolve-compose-stack.sh to
  activate upstream multigpu compose overlays
- Add GPU_TOTAL_VRAM summation for accurate multi-GPU tier selection
- Enable GGML_CUDA_P2P=1 when NVLink detected for direct GPU transfers
- Add LLAMA_ARG_MAIN_GPU extraction for row split mode
- Bump tier 3 VRAM threshold 20→24GB (KV cache overhead safety)
- Add SHA256 checksum verification for cloudflared download
- Add --dry-run mode to setup.sh
- Fix env_set() sed delimiter injection with pipe characters
- Fix `local` keyword at top-level scope in phase 06
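
For the env_set() delimiter fix mentioned above, one possible approach is to escape the characters that are special in a sed replacement before substituting. This is a sketch, not the PR's code: the signature (file, key, value) is an assumption, it relies on GNU `sed -i`, and it does not handle values containing newlines.

```shell
# Hypothetical env_set: update KEY=value in a file, or append it if missing.
env_set() {
  local file="$1" key="$2" value="$3" escaped
  # Refuse keys that are not plain identifiers, since the key goes into a regex.
  [[ "$key" =~ ^[A-Za-z_][A-Za-z0-9_]*$ ]] || return 1
  # Escape backslash, ampersand, and the | delimiter for the sed replacement.
  escaped="$(printf '%s' "$value" | sed -e 's/[\\&|]/\\&/g')"
  if grep -q "^${key}=" "$file"; then
    sed -i "s|^${key}=.*|${key}=${escaped}|" "$file"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}
```

With this, a call such as `env_set .env GGUF_FILE 'a|b&c.gguf'` writes the value literally instead of letting `|` terminate the sed expression or `&` expand to the matched text.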
Updated README to reflect new branding and deployment instructions for DreamServer on peer-to-peer GPU marketplaces. Added quick start guide and detailed architecture overview.
Added a link to the setup guide for easier access.
Collaborator

@Lightheartdevs Lightheartdevs left a comment

Appreciate the scope discipline — keeping the entire toolkit in resources/p2p-gpu/ with no core modifications is the right call. The architecture (lib/phases/subcommands split) mirrors the main installer cleanly.

Blocking issues:

  1. Philosophy drift from CLAUDE.md. The project's error-handling contract is explicit: "Never `|| true` or `2>/dev/null`. No silent swallowing." The file header in setup.sh claims `# Design: aligned with DreamServer CLAUDE.md — Let It Crash > KISS > Pure Functions > SOLID`, but the implementation uses `|| warn "... (non-fatal)"` ~20 times and `|| true` as a fallback. Warn-and-continue is still silent swallowing; it just logs first. Please either:

    • Let these errors crash (the philosophy the header claims), or
    • Remove the CLAUDE.md alignment claim from the header and justify the non-fatal pattern on a case-by-case basis.
  2. chmod a+rwX as documented fallback. World-writable is a correctness and security smell on a shared remote host. Vast.ai boxes are ephemeral but still multi-tenant on the physical hardware. Setgid + POSIX ACLs should be the only path; the chmod a+rwX fallback should fail hard rather than degrade.

  3. integration-smoke CI failure. Needs to go green before merge — even if it's a pre-existing flake, confirm it's unrelated to this PR.

  4. New CI workflow (.github/workflows/p2p-gpu.yml, +182 lines) adds maintenance surface for what's positioned as a resources-tier add-on. Consider whether this is justified for code that, by its location in resources/, isn't part of the supported install path. If yes, add a maintainer comment explaining why it warrants first-class CI.
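
To make the distinction in item 1 concrete, here is a rough sketch of the two patterns side by side. The helpers warn, die, try_optional, and must are hypothetical and are not confirmed to match anything in setup.sh.

```shell
# Hypothetical helpers contrasting warn-and-continue with fail-fast.
warn() { printf 'WARN: %s\n' "$*" >&2; }
die()  { printf 'FATAL: %s\n' "$*" >&2; exit 1; }

# Tolerated failure: the error is logged, but the caller keeps running
# and cannot tell from the return status that anything went wrong.
try_optional() {
  "$@" || warn "command failed (non-fatal): $*"
}

# Required step: a failure terminates the script immediately.
must() {
  "$@" || die "command failed: $*"
}
```

With `try_optional`, a script continues past the failed command and any later step that depends on it runs against a half-configured system. With `must`, the run stops at the first failure, which is the behaviour the "Let It Crash" header implies.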

Non-blocking observations:

  • The service-hints.yaml + per-provider quirk handling is genuinely useful and I'd love to see this land.
  • 28 provider quirks is a lot to maintain — a way for the community to contribute new hints without PRs against core would help long-term.

Happy to re-review once the philosophy/permissions issues are addressed. Thanks for the submission — this is a real gap in the project and the architecture is sound.

@Arifuzzamanjoy
Contributor Author

Arifuzzamanjoy commented Apr 18, 2026

Thanks for the detailed review — all four blocking items are addressed in this PR’s scope.

1. Philosophy: Zero `|| true` remaining in resources/p2p-gpu/**/*.sh (related snippets updated as well). Each was replaced with `|| warn` or explicit empty-result handling. `|| warn` is intentional and aligned with CLAUDE.md §4: “If you must tolerate a failure, log it: `some_command || warn "failed (non-fatal)"`.” The header and README now state that this is adapted from CLAUDE.md for rented-provider environments.

`2>/dev/null` was converted to `2>>"$LOGFILE"` broadly. The remaining uses are only expected probe patterns (`kill -0`, `docker inspect`, `command -v`) plus generated heredoc-script content, with inline `# stderr expected: ...` comments where applicable.

2. chmod a+rwX: `apply_data_acl()` now hard-fails (exit 1 + install guidance) if setfacl is unavailable. Renamed `apply_shared_dir_perms` → `apply_multi_uid_perms`. Removed a+rwX from whisper/ and open-webui/ (replaced by specific UID ACLs). a+rwX now remains only for models/ and searxng/, each with inline rationale.

3. integration-smoke: This appears unrelated to p2p-gpu scope (and should be tracked separately if needed).

4. CI justification: Added maintainer note to p2p-gpu.yml clarifying path scope (resources/p2p-gpu/**), no core DreamServer test coverage, and why strict bash lint/syntax checks are required for root-executed marketplace scripts. Smoke uses mock Docker (no real containers).
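
The hard-fail behaviour described in item 2 might look roughly like the sketch below. This is an assumption-laden illustration rather than the PR's code: the argument list (directory, UID) and the install hint are invented here, and it uses `return 1` instead of `exit 1` so the function can be called from a larger script or a test.

```shell
# Hypothetical shape of an ACL helper that refuses to degrade to chmod a+rwX.
apply_data_acl() {
  local dir="$1" uid="$2"
  if ! command -v setfacl >/dev/null; then
    echo "setfacl not found; install the 'acl' package (e.g. apt-get install acl)" >&2
    return 1
  fi
  # Setgid keeps new files in the directory's group.
  chmod g+s "$dir" || return 1
  # Grant the service UID access to existing content...
  setfacl -R -m "u:${uid}:rwX" "$dir" || return 1
  # ...and set a default ACL so newly created files inherit it.
  setfacl -R -d -m "u:${uid}:rwX" "$dir"
}
```

The capital `X` grants execute only on directories and on files that are already executable, which avoids marking every data file as executable.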
