bug(embeddings): start_period 120s insufficient for slow-connection first-install model download — raise to 600s #476

@yasinBursali

Severity: Low
Category: Docker Config
Platform: All
Confidence: Confirmed

Description

The embeddings extension sets start_period: 120s in its compose healthcheck. On first install, the TEI (Text Embeddings Inference) image downloads the embedding model (~420 MB for BAAI/bge-base-en-v1.5). On slow connections (<5 Mbps residential), the download takes longer than 120s. Once start_period expires, Docker starts counting healthcheck failures; the 5 allowed retries are exhausted while the download is still running, so the container is marked unhealthy and may be restart-looped. Subsequent starts work normally because the model is cached.
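A back-of-envelope estimate shows how far the first-install download can overrun the 120s window. The sketch below assumes the ~420 MB figure means decimal megabytes and that the link sustains its nominal rate; it ignores protocol overhead and model load time. `download_seconds` is a hypothetical helper name, not something from the repository.

```python
# Rough estimate of first-install download time for the embedding model.
# Assumptions (not from the repository): size is in decimal megabytes,
# the link sustains its nominal rate, and protocol overhead is ignored.

MODEL_SIZE_MB = 420    # approximate size quoted in this issue
START_PERIOD_S = 120   # current start_period in compose.yaml


def download_seconds(size_mb: float, link_mbps: float) -> float:
    """Return the ideal transfer time in seconds for size_mb megabytes."""
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    return size_mb * 8 / link_mbps  # megabytes -> megabits, then divide by rate


if __name__ == "__main__":
    for mbps in (2, 5, 10, 28, 100):
        seconds = download_seconds(MODEL_SIZE_MB, mbps)
        verdict = "exceeds" if seconds > START_PERIOD_S else "fits within"
        print(f"{mbps:>3} Mbps: {seconds:6.0f} s ({verdict} start_period)")
```

Under these assumptions the transfer only fits within 120s at roughly 28 Mbps or faster, and a 5 Mbps link needs about 672 s for the transfer alone.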

Affected File(s)

  • dream-server/extensions/services/embeddings/compose.yaml:24 (start_period: 120s)

Root Cause

start_period: 120s is tuned for warm starts (cached model) rather than the cold first-install path, where the model still needs to download. The healthcheck makes no distinction between the first run and subsequent runs.

Platform Analysis

  • macOS / Linux / Windows-WSL2: identical. Compose healthcheck semantics are platform-neutral.

Reproduction

Perform a fresh install on a connection limited to <5 Mbps to huggingface.co and enable embeddings. Observe the container restart loop during the ~3–10 minute download window; the service reaches healthy once the download completes.

Impact

First-install UX on slow connections: intermittent failures, dashboard flips between "installing" and "unhealthy", operator confused about whether the install is progressing.

Suggested Approach

  • Raise start_period to 600s (10 minutes) to accommodate slow-connection model downloads. Healthcheck behavior post-warmup unchanged.
  • Alternative: pre-download the model in a setup hook so the container starts with it already cached. More invasive; probably overkill.
  • Alternative: use a two-phase healthcheck — a model-download progress probe during the first N seconds, then the regular endpoint probe. Compose-level complexity for a moderate gain.

Recommendation: raise start_period to 600s. This is a one-line change with no regression on fast connections, because a probe that passes during start_period marks the container healthy immediately.
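To sanity-check the 600s figure, the time Docker tolerates before marking the container unhealthy can be approximated as start_period plus retries times interval. The sketch below takes retries: 5 from this issue but assumes interval: 30s (Docker's default; the value actually set in compose.yaml is not quoted here). The helper names are hypothetical.

```python
# Approximate how long a container may stay unready before Docker marks it
# unhealthy, and the slowest link that still finishes the download in time.
# Assumption: interval is 30 s (Docker's default); the real compose.yaml
# value is not quoted in this issue. Probe timeouts are ignored.

MODEL_SIZE_MB = 420   # approximate size quoted in this issue
RETRIES = 5           # retries value quoted in this issue
INTERVAL_S = 30       # assumed, see note above


def grace_seconds(start_period_s: float, retries: int, interval_s: float) -> float:
    """Failures only count after start_period, so reaching unhealthy takes
    `retries` consecutive failed probes after it ends."""
    return start_period_s + retries * interval_s


def min_link_mbps(size_mb: float, grace_s: float) -> float:
    """Slowest sustained link rate that moves size_mb within grace_s."""
    return size_mb * 8 / grace_s


if __name__ == "__main__":
    for start_period in (120, 600):
        grace = grace_seconds(start_period, RETRIES, INTERVAL_S)
        print(f"start_period={start_period}s -> ~{grace:.0f}s grace, "
              f"needs >= {min_link_mbps(MODEL_SIZE_MB, grace):.1f} Mbps")
```

Under these assumptions, 600s tolerates links down to about 4.5 Mbps, compared with about 12.4 Mbps at the current 120s. Noticeably slower connections would still trip the healthcheck and would need one of the alternatives listed above.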

Labels

bug, docker-config, embeddings, healthcheck, first-install, all-platforms
