This is now being tracked as an open performance-planning discussion rather than a release-blocking bug. The original broad release-blocker tracker is closed; ongoing optimization and observability work should continue here and in the narrower follow-up issues.
Landed remediation already split out of the tracker:
Still-open follow-ups from the tracker:
Problem
Multiple PRs have independently worked around slow networking by bumping timeouts, adding retries, and inserting sleep delays. Each fix addresses a symptom, but the cumulative effect is a fragile onboard experience that breaks on slower hardware and adds minutes of wall-clock time on every platform.
Examples:
ARM64 gateway health polls: raised from 5 polls × 2s (10s) to 30 polls × 10s (5 minutes) because k3s init takes longer
WSL2 sandbox ready wait: 30 polls × 2s (60s) polling loop because pod init is slower under Docker Desktop
Ollama model pull: 10-minute timeout
Gateway start: exponential backoff with p-retry (10s min, factor 3)
30+ explicit sleep() calls scattered across onboard, recovery, and validation paths
As @brandonpelfrey noted in #1998: "Across multiple PRs from different folks I'm seeing things like add 10 seconds here and there so $thing doesn't time out. Want to make sure we're questioning why networking seems to be so fiddly." These timeout bumps are fragile — they work on the machines that were tested but will break on slower hardware.
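To put rough numbers on the examples above: the worst-case wait of a fixed polling loop is polls × interval, and an exponential backoff produces a geometric sequence of delays. Below is a minimal TypeScript sketch of that arithmetic. The type and function names are hypothetical and not taken from the project's code, and the retry count passed to the backoff helper is an assumption, since only the 10s minimum and factor 3 are stated.

```typescript
// Hypothetical helper types for illustration; not taken from the project's code.
interface PollLoop {
  name: string;
  polls: number;
  intervalSec: number;
}

// Worst case for a fixed-interval polling loop: every poll is used up.
function worstCaseSec(loop: PollLoop): number {
  return loop.polls * loop.intervalSec;
}

// Delays produced by exponential backoff: min, min*factor, min*factor^2, ...
function backoffDelaysSec(minSec: number, factor: number, retries: number): number[] {
  const delays: number[] = [];
  let delay = minSec;
  for (let i = 0; i < retries; i++) {
    delays.push(delay);
    delay *= factor;
  }
  return delays;
}

const arm64Before: PollLoop = { name: "ARM64 gateway health (before)", polls: 5, intervalSec: 2 };
const arm64After: PollLoop = { name: "ARM64 gateway health (after)", polls: 30, intervalSec: 10 };
const wsl2Ready: PollLoop = { name: "WSL2 sandbox ready", polls: 30, intervalSec: 2 };

console.log(worstCaseSec(arm64Before)); // 10
console.log(worstCaseSec(arm64After)); // 300
console.log(worstCaseSec(wsl2Ready)); // 60

// Retry count of 3 is assumed here purely for illustration.
console.log(backoffDelaysSec(10, 3, 3));
```

With the assumed three retries, the backoff alone waits 10s + 30s + 90s before giving up, which shows how quickly these settings compound.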
Scope
This is a diagnostic and optimization effort, not a single bug fix. The goal is to understand why networking is slow and fix root causes rather than continuing to widen timeouts.
Phase 1: Diagnose
Profile the onboard network path end-to-end on native Linux, macOS, and WSL2: where is time actually spent?
Measure DNS resolution latency inside the k3s container vs. on the host — is CoreDNS adding overhead?
Measure TLS handshake time for inference provider endpoints from inside the sandbox vs. from the host
Determine whether the L7 proxy (gateway) adds measurable latency to inference validation probes
Check whether host.openshell.internal resolution is slow on WSL2 (does it go through Windows DNS?)
Profile the sleep() calls — which are covering real async settling vs. papering over race conditions?
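One way to answer "where is time actually spent?" is to wrap each onboard phase in a small timer and sort the totals. The sketch below is only an illustration: the `PhaseTimer` class, the phase names, and the injectable clock are all assumptions, not existing project code. The clock is injectable so the timer can be exercised without real waits. A monotonic clock would likely be a better default than `Date.now()` in practice.

```typescript
// Hypothetical profiling helper; names and phases are assumptions for illustration.
type Clock = () => number; // returns milliseconds

class PhaseTimer {
  private readonly starts = new Map<string, number>();
  private readonly totals = new Map<string, number>();
  private readonly now: Clock;

  constructor(now?: Clock) {
    this.now = now !== undefined ? now : () => Date.now();
  }

  // Mark the beginning of a named phase.
  start(phase: string): void {
    this.starts.set(phase, this.now());
  }

  // Mark the end of a phase and add the elapsed time to its running total.
  stop(phase: string): void {
    const startedAt = this.starts.get(phase);
    if (startedAt === undefined) {
      throw new Error(`phase "${phase}" was never started`);
    }
    this.starts.delete(phase);
    const elapsed = this.now() - startedAt;
    const previous = this.totals.get(phase);
    this.totals.set(phase, (previous !== undefined ? previous : 0) + elapsed);
  }

  // Phases with their total milliseconds, slowest first.
  report(): Array<[string, number]> {
    const entries: Array<[string, number]> = [];
    this.totals.forEach((ms, phase) => {
      entries.push([phase, ms]);
    });
    return entries.sort((a, b) => b[1] - a[1]);
  }
}
```

Calling `timer.start("dns-resolve")` and `timer.stop("dns-resolve")` around each step, then printing `timer.report()`, would give a per-phase breakdown that can be compared across native Linux, macOS, and WSL2.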
Phase 2: Optimize
Based on diagnosis, potential fixes (non-exhaustive):
DNS caching/preflight: Pre-resolve and cache provider DNS during onboard before validation probes hit the gateway
Connection reuse: Validation probes currently spawn a new curl per attempt — a persistent connection (or at least keepalive) would skip repeated TCP+TLS handshakes
Parallel health checks: Some sequential polling loops could overlap (e.g., gateway health + sandbox ready + dashboard ready)
Reduce gateway round-trips: Validation probes go host → gateway → provider. If the gateway adds overhead, consider a direct probe option for validation only
Replace sleeps with event-driven waits: Many sleep(2) calls are waiting for a process or pod state — replace with kubectl wait, readiness probes, or file watches where possible
Platform-aware defaults: Instead of doubling timeouts for WSL2 as a special case, consider adaptive timeouts that measure the first probe latency and scale subsequent timeouts accordingly
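As a rough illustration of the adaptive-timeout idea in the last item, the sketch below scales a timeout from the first measured probe latency and clamps it between a floor and a ceiling. The function name, option names, and the example numbers are all hypothetical; suitable values would have to come out of the Phase 1 measurements.

```typescript
// Hypothetical options; the multiplier and bounds are illustrative assumptions.
interface AdaptiveTimeoutOptions {
  multiplier: number; // how many times the first probe latency to allow
  minMs: number; // floor, so fast machines still tolerate jitter
  maxMs: number; // ceiling, so slow machines do not wait forever
}

// Derive the timeout for subsequent probes from the first probe's latency.
function adaptiveTimeoutMs(firstProbeLatencyMs: number, opts: AdaptiveTimeoutOptions): number {
  // A missing or invalid measurement falls back to the ceiling.
  if (!isFinite(firstProbeLatencyMs) || firstProbeLatencyMs < 0) {
    return opts.maxMs;
  }
  const scaled = firstProbeLatencyMs * opts.multiplier;
  return Math.min(opts.maxMs, Math.max(opts.minMs, scaled));
}

const exampleOpts: AdaptiveTimeoutOptions = { multiplier: 10, minMs: 2000, maxMs: 60000 };

console.log(adaptiveTimeoutMs(50, exampleOpts)); // 2000 (clamped to the floor)
console.log(adaptiveTimeoutMs(1500, exampleOpts)); // 15000
console.log(adaptiveTimeoutMs(20000, exampleOpts)); // 60000 (clamped to the ceiling)
```

The point of the clamp is that a single code path serves both fast and slow hosts, rather than keeping a WSL2-specific constant.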
Phase 3: Harden
Add onboard timing telemetry (opt-in) so we can see real-world latency distributions
Set a performance budget: onboard on a warm system with cached images should complete in < N seconds
Add CI timing regression tests that fail if onboard wall-clock exceeds the budget
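A CI timing regression check could be as simple as summing per-phase timings and comparing the total against the budget. The sketch below assumes a hypothetical `TimingSample` shape and leaves the budget as a parameter, since the actual value of N has not been decided. The numbers in the example are made up.

```typescript
// Hypothetical shapes for illustration; the real telemetry format is undecided.
interface TimingSample {
  phase: string;
  seconds: number;
}

interface BudgetResult {
  totalSec: number;
  budgetSec: number;
  withinBudget: boolean;
  slowestPhase: string | null;
}

// Sum the phase timings and compare against the budget (strictly less than, as in "< N seconds").
function checkBudget(samples: TimingSample[], budgetSec: number): BudgetResult {
  let total = 0;
  let slowest: TimingSample | null = null;
  for (const sample of samples) {
    total += sample.seconds;
    if (slowest === null || sample.seconds > slowest.seconds) {
      slowest = sample;
    }
  }
  return {
    totalSec: total,
    budgetSec: budgetSec,
    withinBudget: total < budgetSec,
    slowestPhase: slowest === null ? null : slowest.phase,
  };
}

// Made-up timings and budget, for illustration only.
const exampleRun: TimingSample[] = [
  { phase: "gateway-health", seconds: 12 },
  { phase: "sandbox-ready", seconds: 20 },
  { phase: "inference-validation", seconds: 5 },
];

const result = checkBudget(exampleRun, 60);
console.log(result.totalSec, result.withinBudget, result.slowestPhase);
```

Reporting the slowest phase alongside the pass/fail verdict would make a failing CI run point at where to look first.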
Converted from issue #2001.
Current timeout inventory
Files holding the timeouts: onboard.ts, http-probe.ts, local-inference.ts, nemoclaw.ts