
fix(extensions-library): wait for healthy databases in paperless-ngx depends_on#537

Closed
yasinBursali wants to merge 44 commits into Light-Heart-Labs:resources/dev from yasinBursali:ext/fix-paperless-depends-on

Conversation

@yasinBursali
Contributor

What

Change depends_on from array form to map form with condition: service_healthy for both postgres and redis.

Why

Array-form depends_on only waits for containers to start, not for them to be healthy. Paperless-ngx can crash-loop on slow hardware when postgres hasn't finished initializing.

How

# Before
depends_on:
  - postgres
  - redis

# After
depends_on:
  postgres:
    condition: service_healthy
  redis:
    condition: service_healthy

Both postgres and redis already have proper healthchecks defined in the same compose file.

Scope

All changes within resources/dev/extensions-library/services/paperless-ngx/compose.yaml.

Merging Order

Merge after PR #526 (paperless secret key) — same file, different lines, no textual conflict.

Testing

  • YAML validation: passed
  • Pattern matches librechat's depends_on convention
  • Critique Guardian: APPROVED

Review

Critique Guardian verdict: APPROVED — correct pattern, healthchecks verified, low regression risk.

Originally reported in yasinBursali#88

Lightheartdevs and others added 30 commits March 20, 2026 08:11
…validation

Five changes to eliminate the support pain we experienced with real
users on Strix Halo:

1. Symlink `dream` to /usr/local/bin during install
   Users had no idea dream-cli existed at ~/dream-server/dream-cli.
   Now `dream status`, `dream restart perplexica` etc. work immediately.

2. Save compose flags at install time (.compose-flags)
   Users were manually chaining 5+ compose files to restart a single
   service. Now dream-cli reads saved flags — no compose knowledge needed.

3. Add `dream repair <service>` command
   Stops container, nukes volume, recreates, and re-seeds config.
   Includes Perplexica repair script that sets API key, base URL,
   model, and marks setupComplete via HTTP API.

4. Post-install validation in phase 13
   - Re-runs Perplexica config seed if phase 12 failed silently
   - Warns AMD users if not in render/video groups (ComfyUI won't work)

5. Dashboard GPU detection — AMD-aware messages
   PreFlightChecks now uses backend-specific error from API instead of
   hardcoded "Install NVIDIA drivers." TroubleshootingAssistant includes
   AMD ROCm solutions alongside NVIDIA.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Invalidate .compose-flags cache in cmd_enable/cmd_disable so extension
  changes take effect immediately instead of using stale cached flags
- Validate .compose-flags content on read (must start with '-f ') and
  remove corrupt/stale files to fall through to dynamic resolution
- Add [y/N] confirmation prompt to dream repair before destroying
  service volumes (matches existing rollback/preset-restore pattern)
- Replace || true with || warn in cmd_repair for visible error reporting
- Tighten volume grep from substring match to anchored pattern to
  prevent matching unrelated services (e.g. dashboard matching dashboard-api)
- Add set -euo pipefail to repair-perplexica.sh
- Fix shell injection: use os.environ in Python instead of shell variable
  interpolation inside heredoc (single-quoted delimiter prevents expansion)
- Use lib/python-cmd.sh for Python detection matching phase 12 pattern
- Guard .compose-flags write in 11-services.sh against full-disk failure
- Redirect stderr to LOG_FILE in 13-summary.sh Perplexica validation
  instead of /dev/null so failures are diagnosable
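
The heredoc hardening above can be illustrated with a minimal sketch (variable names are illustrative, not the repair script's actual code). A single-quoted delimiter ('PY') stops bash from expanding anything inside the heredoc, so untrusted values cannot be spliced into Python source text; the value travels through the environment instead:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hostile input that would break out of a naively interpolated string:
api_key='x"; __import__("os").system("echo pwned")'

seen="$(API_KEY="$api_key" python3 - <<'PY'
import os
# Read from the environment -- never interpolated into the script body,
# because the 'PY' delimiter is quoted.
print(os.environ["API_KEY"])
PY
)"
echo "$seen"
```

The value round-trips byte-for-byte and nothing inside it is ever parsed by the shell or executed by Python.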

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ainers

On fresh Ubuntu installs with Strix Halo, /dev/kfd may not exist if
the amdkfd kernel module isn't loaded. The installer detected the GPU
via sysfs and configured GTT memory, but never verified the compute
devices existed. Containers then fail with a cryptic Docker error:
"error gathering device information while adding custom device"

Fix: after group setup, attempt modprobe amdkfd if /dev/kfd is missing.
Also verify /dev/dri and renderD128 exist. Clear warnings tell the user
what to do instead of a silent Docker failure.

Found during real Strix Halo user install on Ubuntu 24.04 Desktop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Docker 29.3.0 fails with "error gathering device information while
adding custom device /dev/dri: no such file or directory" on AMD GPUs,
even when /dev/dri and /dev/kfd both exist. This blocks llama-server
and comfyui from starting on Strix Halo.

Confirmed: Docker 29.2.1 on the same kernel (6.17.0-19) and same
hardware works perfectly. Docker 29.3.0 does not.

Fix: after Docker install/detection, check version. If 29.3.x and
AMD backend, automatically downgrade to 29.2.1 with clear messaging.
Supports apt (Ubuntu/Debian) and dnf (Fedora/RHEL).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: v2.2 UX — dream CLI, repair command, AMD GPU fixes, Docker 29.3 pin
Bump version in manifest.json, constants.sh, README install URL,
and get-dream-server.sh bootstrap comment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Thin wrapper around install.sh that forces --tier 0 (Qwen 3.5 2B,
~1.5GB download) and --non-interactive mode. All other installer
behavior is identical. Pass additional flags as needed:

  ./test-install.sh                # Minimal test install
  ./test-install.sh --all          # All services, tiny model
  ./test-install.sh --dry-run      # Preview without changes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The 600s (10 min) timeout was too aggressive for large image pulls
like the ~10GB CUDA llama-server image on slower connections. Bumps
to 3600s (60 min) to prevent false timeouts during legitimate downloads.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the installer is run via `curl ... | bash`, stdin is the piped
script content, so `read` gets EOF and `set -e` kills the process.
All 16 interactive read commands now explicitly read from /dev/tty,
which is the user's terminal regardless of how stdin is wired.

Affected files:
- installers/lib/ui.sh (install menu)
- installers/lib/detection.sh (reboot prompt)
- installers/phases/02-detection.sh (reboot prompt)
- installers/phases/03-features.sh (feature toggles)
- installers/phases/04-requirements.sh (ollama + continue prompts)
- installers/phases/05-docker.sh (sudo docker prompt)
- installers/macos/install-macos.sh (all interactive prompts)
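
The failure mode is easy to reproduce without a terminal. When a script arrives on stdin (`curl ... | bash`), `read` consumes the piped stream rather than keyboard input — a sketch:

```shell
#!/usr/bin/env bash
# Simulate `curl ... | bash`: the "user answer" read here is actually
# the next line of piped data, not anything typed at a keyboard.
piped_output="$(printf 'fake-script-line-1\nfake-script-line-2\n' | {
  read -r answer   # intended as an interactive prompt
  echo "prompt saw: $answer"
})"
echo "$piped_output"
# The installer's fix is to bypass stdin and read the terminal directly:
#   read -r answer < /dev/tty
```

Once the piped data runs out, `read` hits EOF, returns non-zero, and `set -e` kills the process — which is exactly what the 16 converted call sites avoid.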

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Services exit immediately on success, so longer timeouts only affect
failure cases. Previous limits (1-5 min) were too aggressive for slow
hardware, large model loads (FLUX, whisper-large), and first-boot
scenarios where models download on startup.

All services now get 150 attempts with adaptive backoff (2s→8s cap),
giving ~20 minutes before the installer gives up. Zero cost on fast
machines — the check returns instantly once the service responds.
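
The loop described above can be sketched as follows (the real helper's name and exact shape are unknown; attempts, initial delay, and cap are parameters here so the doubling-with-cap behavior is visible):

```shell
#!/usr/bin/env bash
# Retry a check with adaptive backoff: delay doubles each round, capped.
wait_for_service() {
  local max_attempts="$1" delay="$2" cap="$3"; shift 3
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then return 0; fi   # instant success costs nothing
    sleep "$delay"
    delay=$(( delay * 2 ))
    [ "$delay" -gt "$cap" ] && delay="$cap"
    attempt=$(( attempt + 1 ))
  done
  return 1
}

# A healthy service returns immediately on the first attempt:
wait_for_service 150 2 8 true && echo "fast path: ok"
```

With 150 attempts and a 2s→8s delay, total worst-case wait is roughly 20 minutes, matching the figure above.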

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bumps wget --timeout from 300s/600s to 3600s for GGUF and FLUX model
downloads. This is the network stall timeout (no data received), not
a total time cap. Prevents false failures on slow or intermittent
connections without affecting fast downloads.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- STT model download (whisper-large ~1.5GB): 120s → 3600s
- Offline embedding model download: 600s → 3600s
- Background task wait (bootstrap model): 300s → 1200s (20 min)

Previous limits assumed fast connections. These are all no-cost on
fast hardware — they only prevent premature failures on slow links.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
curl is consistently faster than wget for HuggingFace downloads due
to better HTTP/2 support and connection reuse. Also eliminates wget
as an installer dependency — curl is already required everywhere.

Flags: -fSL (fail on HTTP errors, show errors, follow redirects), -C - (resume
partial downloads), --connect-timeout 30, --max-time 3600.

Updates test-network-timeouts.sh assertions to match (wget -> curl,
--timeout -> --max-time).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
For Tier 1+ installs, the installer now downloads a tiny 2B bootstrap
model (~1.5GB, ~1 min) first and starts services immediately. The full
tier-appropriate model downloads in the background and auto hot-swaps
via bootstrap-upgrade.sh when ready (~30s interruption).

This eliminates 80%+ of install wait time. Users can start chatting
within 2-3 minutes instead of waiting 10-30 min for large model
downloads.

New files:
- installers/lib/bootstrap-model.sh: constants + bootstrap_needed()
- scripts/bootstrap-upgrade.sh: background download + auto hot-swap

Modified files:
- installers/phases/11-services.sh: bootstrap flow before compose up
- install-core.sh: --no-bootstrap flag
- installers/phases/13-summary.sh: bootstrap status in summary

Behavior:
- Tier 0: no change (full model IS the bootstrap model)
- Tier 1+: bootstrap → background download → auto-swap
- --no-bootstrap: opt out, download full model in foreground
- --offline/--cloud: bootstrap skipped automatically
- Re-install with model on disk: bootstrap skipped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update all version references for the v2.3.0 release:
- constants.sh, 06-directories.sh fallback
- get-dream-server.sh curl URL
- Both READMEs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The .env file is generated in phase 06 with the full model's GGUF_FILE
and LLM_MODEL values. When bootstrap mode is active, phase 11 swaps
these variables for the download and models.ini, but docker compose
reads from .env — so llama-server tried to load a model file that
doesn't exist yet (the full model is still downloading in background).

Now phase 11 patches GGUF_FILE, LLM_MODEL, and MAX_CONTEXT in .env
to match the bootstrap model before running compose up. The background
upgrade script (bootstrap-upgrade.sh) already updates .env back to
the full model values when the download completes.
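
A sketch of the phase-11 patch (variable names are from the commit; the model filenames here are placeholders, and GNU `sed -i` is assumed, as on the Linux installer):

```shell
#!/usr/bin/env bash
# Swap the three model variables in .env to the bootstrap model before
# `compose up`; bootstrap-upgrade.sh later swaps them back.
patch_env_for_bootstrap() {
  local env_file="$1" gguf="$2" model="$3" ctx="$4"
  sed -i \
    -e "s|^GGUF_FILE=.*|GGUF_FILE=${gguf}|" \
    -e "s|^LLM_MODEL=.*|LLM_MODEL=${model}|" \
    -e "s|^MAX_CONTEXT=.*|MAX_CONTEXT=${ctx}|" \
    "$env_file"
}

env_file="$(mktemp)"
printf 'GGUF_FILE=full-model.gguf\nLLM_MODEL=full\nMAX_CONTEXT=32768\n' > "$env_file"
patch_env_for_bootstrap "$env_file" bootstrap-2b.gguf bootstrap 8192
grep '^GGUF_FILE=' "$env_file"
```

Because docker compose reads `.env` directly, patching the file (not just the shell environment) is what makes llama-server load the model that actually exists on disk.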

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The 6 sequential docker compose builds ran silently with output
suppressed to the log file. The terminal went dead for several
minutes, making it look like the installer had exited.

Each build now runs in background with spin_task, showing:
  [1/6] Building dashboard
  [2/6] Building dashboard-api
  ...
with ✓/⚠ status on completion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two fixes:

1. check_service() used if/then around curl, which consumed the exit
   code — $? was always 0 after the if block, so the timeout (124)
   vs connection-refused (7) distinction never worked. Switched to
   cmd && { success } pattern so $? reflects the actual curl exit.

2. Container build loop now shows [1/6] Building <service> spinner
   instead of going silent for several minutes.
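
The exit-code mechanics behind fix 1 can be shown with stand-ins (the real code wraps curl; `( exit 7 )` fakes curl's "connection refused" status here):

```shell
#!/usr/bin/env bash
# An if statement whose condition fails (and has no else) exits with 0,
# so $? after the block no longer reflects the tested command:
check_broken() {
  if ( exit 7 ); then echo up; fi
  echo "broken sees: $?"     # always 0
}
# An AND-list leaves $? as the failing command's real status:
check_fixed() {
  ( exit 7 ) && { echo up; }
  echo "fixed sees: $?"      # 7
}
check_broken
check_fixed
```

That is why the timeout (124) vs connection-refused (7) distinction could never trigger in the old form.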

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ght-Heart-Labs#511)

The nohup and bg_task_start calls had trailing spaces but no
backslashes, so bash treated each line as a separate command.
The nohup ran bootstrap-upgrade.sh with no arguments, then
"$INSTALL_DIR" was executed as a command, crashing the installer
under set -euo pipefail.
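
The bug class reproduces in two lines: a backslash followed by a trailing space escapes the space, not the newline, so the next line runs as its own command (here, a nonexistent one, standing in for the bare "$INSTALL_DIR"):

```shell
#!/usr/bin/env bash
ok="$(mktemp)"; broken="$(mktemp)"
printf 'echo one \\\ntwo\n'  > "$ok"      # backslash-newline: continuation
printf 'echo one \\ \ntwo\n' > "$broken"  # backslash-space: NOT a continuation

bash "$ok"                                   # prints: one two
bash "$broken" >/dev/null 2>&1 || echo "broken script failed"
```

Under set -euo pipefail, the second form aborts the whole installer at the stray "command", exactly as described above.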

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Default MODELS_DIR was $DREAM_DIR/models but the installer stores
models in $DREAM_DIR/data/models. Script was completely non-functional
as a standalone tool.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…-Labs#512)

Release b5570 no longer has the Vulkan Windows asset on GitHub —
downloads return a 9-byte 404 body saved as a "zip", causing
"Central Directory corrupt" on extraction.

Updates to b8248 which matches the Linux Docker image tag and has
a confirmed Vulkan binary. Also syncs DS_VERSION to 2.3.0.

Fixes #209

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…-Heart-Labs#513)

ComfyUI and its FLUX models were installed unconditionally — even
"Custom" installs had no way to skip image generation. This adds
ENABLE_COMFYUI flag following the same pattern as ENABLE_VOICE,
ENABLE_WORKFLOWS, ENABLE_RAG, and ENABLE_OPENCLAW.

When disabled:
- Skips 34GB FLUX model download
- Skips ComfyUI Docker image pull + build
- Skips ComfyUI health check
- Sets ENABLE_IMAGE_GENERATION=false in .env so Open WebUI
  hides the image generation button entirely

Also fixes "Core Only" menu option which previously didn't disable
any optional services (all ENABLE flags defaulted to true).

New CLI flag: --comfyui (included in --all)
Custom install prompt: "Enable image generation (ComfyUI + FLUX, ~34GB)? [Y/n]"

Fixes #196

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…bs#514)

pro.json was binding to 0.0.0.0 (all interfaces) with
dangerouslyDisableDeviceAuth enabled, exposing the OpenClaw
gateway to the network without authentication.

Changes host from 0.0.0.0 to 127.0.0.1, matching the other
two configs (openclaw.json, openclaw-strix-halo.json). With
localhost-only binding, the disabled device auth is safe —
only local processes (Docker containers via internal network)
can reach the gateway.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…art-Labs#516)

The optional ComfyUI feature (Light-Heart-Labs#513) added this env var to the .env
generator but not to the schema, causing schema validation to fail
during install.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ight-Heart-Labs#517)

Docker Compose v5+ errors when an overlay references a service not
defined in any included compose file. The tier0 overlay referenced
qdrant, n8n, whisper, tts, openclaw, embeddings, etc. — all optional
services that may not be in the stack.

The old comment "Docker Compose ignores overrides for services not
defined in base" was true for Compose v2 but is false for v5.

Now only overrides the 4 base services: llama-server, dashboard,
dashboard-api, open-webui. Optional services use their own compose
files' default limits.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
)

* fix: resolve .env port overrides for health checks

SERVICE_PORTS reads from manifest defaults (e.g. 8080 for llama-server)
but .env may override them (e.g. OLLAMA_PORT=11434 on Strix Halo).
Health checks were hitting the wrong port, timing out for 20 minutes,
then reporting failure even though the service was running fine.

Phase 12 now reads port vars from .env after sr_load and updates
SERVICE_PORTS via SERVICE_PORT_ENVS (indirect variable expansion).
Also fixes bootstrap-upgrade.sh which runs via nohup and doesn't
inherit env vars from the parent shell.
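
A sketch of the indirect-expansion resolution (the real tables hold many services; one entry shown, and the implementation here is assumed, not copied from service-registry.sh):

```shell
#!/usr/bin/env bash
# Manifest defaults, and the .env variable that may override each one:
declare -A SERVICE_PORTS=(     [llama-server]=8080 )
declare -A SERVICE_PORT_ENVS=( [llama-server]=OLLAMA_PORT )

resolve_ports() {
  local svc var
  for svc in "${!SERVICE_PORTS[@]}"; do
    var="${SERVICE_PORT_ENVS[$svc]:-}"
    # ${!var} is bash indirect expansion: the value of the variable
    # whose name is stored in $var.
    if [ -n "$var" ] && [ -n "${!var:-}" ]; then
      SERVICE_PORTS[$svc]="${!var}"
    fi
  done
}

OLLAMA_PORT=11434   # as if loaded from .env on a Strix Halo system
resolve_ports
echo "${SERVICE_PORTS[llama-server]}"
```

Health checks then hit the port the service is actually bound to instead of the manifest default.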

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: also resolve .env port overrides in phase 13 summary

Phase 13 uses SERVICE_PORTS for Perplexica auto-config URLs and the
final "YOUR DREAM SERVER IS LIVE" display. Without resolving .env
overrides, users see wrong URLs (e.g. localhost:3000 when WebUI is
actually on a different port).

Same resolution pattern as phase 12.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: centralize port resolution in sr_resolve_ports()

Extracts the SERVICE_PORTS override logic into a shared function in
service-registry.sh instead of inline code in each consumer. All 9
post-install scripts and both installer phases now call sr_resolve_ports()
after loading .env, ensuring SERVICE_PORTS reflects actual port config
everywhere (not just manifest defaults).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Light-Heart-Labs#519)

* fix: dashboard health checks on Docker Desktop (Windows/WSL2)

Root cause: Docker Desktop's embedded DNS takes ~4 seconds to return
NXDOMAIN for non-running containers. With 19 services checked
concurrently via asyncio.gather, the slow DNS lookups blocked running
services from being checked in time, causing everything to show as
"degraded" on the dashboard.

Fix (three-part):

1. Fresh session per poll cycle — eliminates stale connection pool
   issues. The global aiohttp session accumulated dead connections
   from non-running services, poisoning subsequent polls. Now each
   cycle creates a fresh session with force_close=True and
   use_dns_cache=False, then closes it.

2. Not-deployed cache with TTL — services that fail DNS get cached
   for 15 seconds. Subsequent polls skip them entirely, so the slow
   4-second DNS lookups only happen once per service.

3. Two-phase polling — Phase 1 returns cached not_deployed results
   instantly. Phase 2 checks remaining services with a semaphore
   (limit=4) to prevent DNS contention. Total timeout raised to 30s
   so the first poll (which has no cache) can complete even with
   slow DNS.

Net effect: first poll takes ~4-5 seconds (DNS for non-deployed
services), subsequent polls complete in <50ms. All running services
show healthy with 1-5ms response times. No behavior change on native
Linux Docker where DNS failures are instant.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: background polling for dashboard health checks

Replaces request-triggered health checks with a background polling
loop. API endpoints return cached results instantly (<1ms) instead
of running live checks on every request (8-16s on Docker Desktop).

Architecture:
- Background task polls get_all_services() every 10 seconds
- Results stored in module-level cache
- All endpoints read from cache, falling back to live check
  only on first request before the poll completes

helpers.py changes (reverted from previous PR, minimal diff):
- Restored original shared aiohttp session pattern
- Increased total timeout from 5s to 30s (no user impact since
  it only runs in the background poll)
- Added asyncio.TimeoutError handling in _check_host_service_health
  (bug fix: was raising unhandled NameError)
- Added get_cached_services() / set_services_cache() for the
  background poll to write and endpoints to read

main.py changes:
- Added _poll_service_health() background task (started on app startup)
- Added _get_services() async helper for cache-or-live fallback
- Updated /services, /status, _build_api_status() to read from cache

routers/features.py:
- Updated /api/features to read cached services instead of live check

Tested on:
- Windows Docker Desktop (RTX 5090): 11 healthy, 0 degraded, <350ms
- Linux native Docker (Strix Halo): 18/18 healthy (no regression)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tall (Light-Heart-Labs#520)

* feat(windows): add bootstrap fast-start for instant chat during install

Ports the Linux bootstrap pattern to the Windows installer. For Tier 1+
installs, downloads a tiny 2B model (~1.5GB, ~1 min) first so users can
chat immediately. The full tier model downloads in the background via
bootstrap-upgrade.sh (which already works on Windows via Git Bash) and
auto-swaps when ready.

Changes:
- tier-map.ps1: Add bootstrap constants, Get-TierRank, Should-UseBootstrap
- install-windows.ps1 phase 8: Bootstrap check, variable swap, .env patch,
  background upgrade launch via Start-Process

Before: Tier 3 install blocked for 30+ min downloading 18GB model.
After: Chat available in ~2 min. Full model downloads invisibly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(macos): add bootstrap fast-start for instant chat during install

Same pattern as Windows PR — ports Linux bootstrap to macOS installer.
Tier 1+ installs download the tiny 2B model first, then the full model
downloads in the background via nohup + bootstrap-upgrade.sh.

Uses macOS sed -i '' syntax for .env patching (BSD sed, not GNU).
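
The BSD/GNU `sed -i` split is a classic portability trap: BSD sed requires a suffix argument after -i (empty '' for "no backup"), while GNU sed must not receive that empty string as a separate argument. A common shim (assumed, not the installer's actual code):

```shell
#!/usr/bin/env bash
sed_inplace() {
  if sed --version >/dev/null 2>&1; then
    sed -i "$@"        # GNU sed (Linux)
  else
    sed -i '' "$@"     # BSD sed (macOS)
  fi
}

f="$(mktemp)"
echo 'GGUF_FILE=full-model.gguf' > "$f"
sed_inplace 's/full-model/bootstrap/' "$f"
cat "$f"
```

`sed --version` succeeds only on GNU sed, which makes it a cheap dialect probe.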

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Lightheartdevs and others added 3 commits March 21, 2026 00:57
…Labs#521)

Start-Process with -ArgumentList "-c", $bashArgs passes them as
separate arguments. bash -c needs the command as a single string.
Changed to single-quoted args inside the -c string so bash receives
all 6 arguments correctly.

Tested: bootstrap-upgrade.sh starts and receives install_dir, gguf_file,
gguf_url, sha256, llm_model, max_context — begins downloading.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ng (Light-Heart-Labs#522)

PowerShell Start-Process cannot reliably pass empty arguments (like
SHA256 for NV_ULTRA/SH_LARGE tiers) through the Windows command line
to bash. Empty strings get collapsed during command-line parsing,
shifting all subsequent arguments.

Fix: write a temp wrapper script (logs/bootstrap-run.sh) with the
arguments embedded as bash double-quoted strings. Empty arguments
become "" which bash preserves correctly. No command-line quoting
involved — Start-Process just runs the wrapper script.

Tested with empty SHA256: all 6 arguments arrive in the correct
positions. Script starts downloading successfully.
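
A bash-side sketch of the wrapper approach (the real generator runs in PowerShell; quote-escaping of embedded values is elided, which is fine for the paths and hashes involved here):

```shell
#!/usr/bin/env bash
# Write a wrapper that embeds each argument as a double-quoted string,
# so empty arguments survive as "" instead of collapsing.
gen_wrapper() {
  local out="$1" target="$2"; shift 2
  {
    printf '#!/usr/bin/env bash\nexec %q' "$target"
    local a
    for a in "$@"; do printf ' "%s"' "$a"; done
    printf '\n'
  } > "$out"
  chmod +x "$out"
}

argc="$(mktemp)"; wrapper="$(mktemp)"
printf '#!/usr/bin/env bash\necho $#\n' > "$argc"; chmod +x "$argc"
# Six args, with an empty sha256 in position 4 (as for NV_ULTRA tiers):
gen_wrapper "$wrapper" "$argc" install_dir file.gguf url "" model 8192
"$wrapper"
```

No command-line quoting layer ever sees the arguments, so nothing shifts.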

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update all version references across Linux, Windows, and macOS
installers. Also syncs macOS DS_VERSION from 2.0.0-strix-halo.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Lightheartdevs and others added 4 commits March 21, 2026 17:10
…y for Arch/Void/Alpine (Light-Heart-Labs#551)

The installer unconditionally used get.docker.com for Docker installation,
which only supports Debian/Ubuntu/Fedora/RHEL/SLES. This broke installation
on Arch Linux and other distros using pacman, xbps, or apk. Similarly, the
service registry's Python manifest parser requires PyYAML, which is
pre-installed on Debian/Fedora but not on Arch/Void/Alpine.

Docker install (fixes Light-Heart-Labs#546):
- Add case dispatch on $PKG_MANAGER in 05-docker.sh
- apt/dnf/zypper: unchanged get.docker.com path (zero regression)
- pacman: pkg_install docker + systemctl enable
- xbps: pkg_install docker + runit service link
- apk: pkg_install docker + OpenRC enable/start
- Unknown: get.docker.com fallback with improved error mentioning --skip-docker

PyYAML dependency (fixes Light-Heart-Labs#545):
- Add python3-pyyaml canonical name to pkg_resolve() for all 6 package managers
- sr_load() now checks for `import yaml` before running the manifest parser
- Auto-installs the distro-appropriate PyYAML package when packaging functions
  are available (installer context) with declare -f guards for safety in
  dream-cli context
- Fix silent heredoc failure: wrap Python heredoc in `if !` to capture exit
  code instead of silently continuing with empty SERVICE_* arrays
- Use _SR_FAILED flag for failure signaling (return 0 always) so dream-cli
  doesn't crash under set -e while installer can detect and retry
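
The failure-signaling pattern can be sketched like this (a minimal reconstruction, not the real sr_load, which also parses the manifest):

```shell
#!/usr/bin/env bash
sr_load() {
  _SR_FAILED=""
  if ! python3 -c 'import yaml' 2>/dev/null; then
    # Auto-install only when the installer's packaging functions exist;
    # in dream-cli context, declare -f fails and we fall through.
    if declare -f pkg_install >/dev/null; then
      pkg_install python3-pyyaml || true
    fi
  fi
  python3 -c 'import yaml' 2>/dev/null || _SR_FAILED=1
  return 0   # never non-zero: safe for callers running under set -e
}

set -e
sr_load      # does not kill the shell even if PyYAML is missing
echo "survived; _SR_FAILED='${_SR_FAILED}'"
```

Callers that can retry (the installer) inspect _SR_FAILED; callers that cannot (dream-cli under set -e) simply keep running.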

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ht-Heart-Labs#552)

* fix(installer): Windows port 8080 conflict + ComfyUI tier-aware gating

Two issues reported by beta tester on Windows 11 + WSL2 (7GB RAM,
NVIDIA 940MX 2GB VRAM, Tier 0):

1. Port 8080 conflict: Windows env generator hardcoded OLLAMA_PORT=8080,
   but wslrelay occupies port 8080 on every WSL2 system. Changed default
   to 11434, matching the Linux default in .env.example. The Docker
   internal port stays 8080 — only the host-facing port changes.

2. ComfyUI crashes low-RAM systems: Full Stack silently enabled ComfyUI,
   which requests shm_size 8GB + memory limit 24GB. On a 7GB system,
   Docker can't allocate shared memory, causing a network bridge failure
   that kills the entire compose-up. Added tier-aware auto-disable for
   Tier 0 and Tier 1 in Full Stack mode, with a warning and re-prompt
   in Custom mode.

Changes:
- env-generator.ps1: OLLAMA_PORT 8080 → 11434
- ui.sh: Full Stack auto-disables ComfyUI on Tier 0/1 with user message
- install-core.sh: add --no-comfyui / --comfyui flags + usage docs
- 03-features.sh: Custom mode warns Tier 0/1 users and flips default to N

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(windows): add ComfyUI tier-aware gating to Windows installer

PR Light-Heart-Labs#552 added ComfyUI tier gating for the Linux bash installer but the
Windows PowerShell installer has its own parallel code paths that were
untouched. ComfyUI was always included in the compose stack on Windows
because the service skip switch had no "comfyui" case.

This was the actual root cause of the beta tester's fatal crash on
Windows 11 + WSL2 (7GB RAM, Tier 0) — ComfyUI's shm_size: 8g
exceeded available memory, crashing Docker's network bridge creation.

Changes:
- install.ps1: add -Comfyui and -NoComfyui switch parameters
- install-windows.ps1: add params, context vars, and "comfyui" case
  to the service skip switch (the critical fix)
- 03-features.ps1: add $enableComfyui variable, Full Stack auto-disables
  on Tier 0/1 with user message, Custom mode adds ComfyUI prompt with
  tier warning, Core Only disables ComfyUI

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: strip spurious UTF-8 BOMs from PowerShell files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(installer): add ComfyUI tier safety net for non-interactive mode

The tier-aware ComfyUI gating only ran inside the interactive menu block.
Non-interactive installs (--non-interactive / -NonInteractive) skipped
the menu entirely, leaving ENABLE_COMFYUI=true on Tier 0/1 systems
where ComfyUI's shm_size 8GB exceeds available RAM.

Add a safety net after the interactive block on both Linux and Windows
that unconditionally disables ComfyUI on Tier 0/1. In interactive mode
this is a no-op (menu already handled it). In non-interactive mode this
prevents the crash.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(installer): scope ComfyUI safety net to non-interactive mode only

The safety net unconditionally overrode ENABLE_COMFYUI on Tier 0/1,
which would silently undo an explicit user confirmation in Custom mode
(user says Y to ComfyUI, Y to the tier warning, then safety net
disables it anyway).

Guard with ! $INTERACTIVE (Linux) / $nonInteractive (Windows) so it
only fires in headless mode where the user was never prompted.
Interactive mode already has its own tier checks in the menu.
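
The final guard semantics, as a sketch (variable names per the commit; tier numbering assumed):

```shell
#!/usr/bin/env bash
# The safety net fires only when the user was never prompted.
apply_comfyui_safety_net() {
  if [ "$INTERACTIVE" != true ] && [ "$TIER" -le 1 ]; then
    ENABLE_COMFYUI=false
  fi
}

INTERACTIVE=false; TIER=0; ENABLE_COMFYUI=true
apply_comfyui_safety_net
headless_result="$ENABLE_COMFYUI"       # disabled: crash prevented

INTERACTIVE=true; TIER=0; ENABLE_COMFYUI=true   # user said Y twice
apply_comfyui_safety_net
interactive_result="$ENABLE_COMFYUI"    # preserved: explicit choice wins

echo "$headless_result $interactive_result"
```

Headless Tier 0/1 installs get the auto-disable; an interactive user who confirmed ComfyUI past the tier warning keeps it.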

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(windows): remove incorrect nonInteractive guard from Custom mode warning

The previous commit accidentally applied the $nonInteractive guard to
the Custom mode tier warning prompt inside the interactive menu block.
Since the menu block itself is gated by -not $nonInteractive, the
condition was always false and the warning never fired.

The guard should only be on the safety net after the menu block.
The Custom mode warning is interactive by definition.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…art-Labs#553)

On NVIDIA systems, embeddings (TEI) and Open WebUI report unhealthy
during first install because model download and startup exceed the
original grace periods. Bump start_period and reduce excessive
installer polling to eliminate false-positive health failures.

- embeddings: start_period 60s→120s, retries 3→5 (first-run model download)
- Open WebUI: start_period 30s→60s (depends on llama-server VRAM loading)
- Phase 12: Open WebUI max_attempts 60→45 (90s budget vs ~70s needed)

Tested on NVIDIA (192.168.0.143) and AMD (192.168.0.213). Config-only
changes — fast systems unaffected (healthy on first /health 200).

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Includes fixes since v2.3.2:
- Light-Heart-Labs#551: Arch/Void/Alpine distro-native Docker install + PyYAML dependency
- Light-Heart-Labs#552: Windows port 8080 conflict + ComfyUI tier gating
- Light-Heart-Labs#553: Healthcheck timing tuning for first-run startups

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Lightheartdevs and others added 2 commits March 22, 2026 11:49
…rt check (Light-Heart-Labs#562)

On systems where Docker cannot allocate NVIDIA GPU devices (exit 125
from nvidia-smi smoke test), the installer now falls back to the
existing docker-compose.cpu.yml overlay instead of crashing at
docker compose up with exit code 1.

Changes:
- Phase 05: track GPU passthrough result in $script:gpuPassthroughFailed
- Phase 08: use docker-compose.cpu.yml when flag is set, skip extension
  NVIDIA overlays (whisper, comfyui) that also require GPU reservation
- Phase 04: fix port conflict check from 8080 → 11434 to match
  OLLAMA_PORT default (eliminates false wslrelay warning on WSL2)

Reported by tester on GTX 1050 Ti / Windows 11 / WSL2 where GPU
passthrough consistently fails. Systems with working GPU passthrough
are completely unaffected.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Lightheartdevs previously approved these changes Mar 22, 2026
Collaborator

@Lightheartdevs left a comment


Reviewed: changes scoped to resources/dev/extensions-library only, no installer interaction. LGTM.

@Lightheartdevs
Collaborator

Approved but has merge conflicts from other PRs that just landed. Please rebase against main and we'll merge. Thanks for the solid work on these! 🙏

Lightheartdevs and others added 5 commits March 22, 2026 20:25
…ight-Heart-Labs#571)

Docker Compose requires an image on every service definition even when
replicas is 0. The Windows AMD overlay disables llama-server (runs
natively via Vulkan) but the base compose stub has no image, causing
compose validation to fail with "has neither an image nor a build
context specified."

Add hello-world:latest as a placeholder — the container never runs.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…bs#572)

* feat: upgrade tier models to Qwen 3.5 and GPT-OSS-20B

Upgrade default models across all tiers and platforms:
- Tier 1 (Entry) + Tier 2 (Prosumer): qwen3-8b → qwen3.5-9b (5.68GB)
- Tier 3 (Pro): qwen3-14b → gpt-oss-20b (11.6GB)
- Intel ARC: qwen3-8b → qwen3.5-9b
- Intel ARC_LITE: qwen3-4b → qwen3.5-4b
- macOS Tier 1: qwen3-4b → qwen3.5-9b (was overly conservative)

Updated across all three platforms (Linux, Windows, macOS):
- Tier map configs (resolve + tier_to_model functions)
- Compose file GGUF defaults (base, cpu, arc, intel)
- CLI fallback defaults (dream.ps1, dream-macos.sh)
- Agent templates (5 templates + README)
- Repair scripts and installer summary phase
- Disk size estimation patterns
- All test assertions

All download URLs verified (HTTP 200). SHA256 hashes sourced from
Hugging Face. No stale model references remain.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: update model references in README, SECURITY, and docs to match tier upgrade

Follow-up to tier model upgrade — update documentation references:
- README.md: Apple Silicon tier table (qwen3-4b/8b → qwen3.5-9b, gpt-oss-20b)
- SECURITY.md: model verification examples (Qwen3-8B → Qwen3.5-9B)
- KNOWN-GOOD-VERSIONS.md: macOS known-good model (Qwen3-4B → Qwen3.5-9B)
- llama-server README: GGUF_FILE default (Qwen3-8B → Qwen3.5-9B)
- .claude/commands/tdd.md: example tier config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: swap Tier 3 model from GPT-OSS-20B to Qwen3.5-27B

GPT-OSS-20B uses special tokens (<|start|>, <|channel|>, <|constrain|>)
for structured output that are incompatible with llama.cpp's JSON
grammar mode. This causes Perplexica (which uses generateObject) to
fail with HTTP 500 on every query. Pure chat inference worked fine
but structured output / tool calling was broken.

Qwen3.5-27B (16.7GB Q4_K_M) is the same model family as Tier 1-2
(Qwen 3.5), proven compatible with llama.cpp structured output,
and fits in 20-39GB VRAM tier.

Updated across all platforms, tests, agent templates, docs, and
disk estimation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: correct GGUF_FILE case in Tier 3 test assertion

Qwen3.5-27B-Q4_K_M.gguf not qwen3.5-27b-Q4_K_M.gguf — sed
lowercased it during the bulk replace.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Heart-Labs#574)

Follow-up to Light-Heart-Labs#573 — docs still referenced old Qwen3 8B/4B/14B models.
Updated to match current tier map:
- T1/T2/ARC: Qwen3.5 9B
- T3: Qwen3.5 27B
- ARC_LITE: Qwen3.5 4B

Files: root README, FAQ, INTEL-ARC-GUIDE, MACOS-QUICKSTART, SUPPORT-MATRIX

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…depends_on

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
yasinBursali force-pushed the ext/fix-paperless-depends-on branch from 2512ed6 to 5b62a43 on March 24, 2026 10:36
yasinBursali changed the base branch from resources/dev to main on March 24, 2026 10:38
yasinBursali dismissed Lightheartdevs’s stale review on March 24, 2026 10:38

The base branch was changed.

yasinBursali changed the base branch from main to resources/dev on March 24, 2026 10:42
@yasinBursali
Contributor Author

Closing: the depends_on healthy condition already exists on the resources/dev branch (sidecars renamed to paperless-postgres/paperless-redis with condition: service_healthy). This PR is redundant.

