perf: investigate and reduce networking latency during onboard and validation #4093

ericksoa · 2026-05-22T21:17:47Z

ericksoa
May 22, 2026
Maintainer

Converted from issue #2001: #2001

This is now being tracked as an open performance-planning discussion rather than a release-blocking bug. The original broad release-blocker tracker is closed; ongoing optimization and observability work should continue here and in the narrower follow-up issues.

Landed remediation already split out of the tracker:

refactor(cli): replace spawnSync("sleep") with native wait utility #2027 replaced subprocess sleep with a native wait utility.
refactor(cli): replace race-condition sleeps with readiness probes #2442 replaced several race-condition sleeps with readiness probes.
fix(onboard): retry inference validation on transient upstream failures #3104 added retries for transient inference validation failures.

Still-open follow-ups from the tracker:

perf(onboard): replace remaining fixed readiness polls with deadline-based waits #3768 deadline-based readiness waits
perf(onboard): operationalize profiling traces for latency diagnosis #3769 profiling traces
perf(onboard): add adaptive timeout calibration for provider validation #3770 adaptive timeout calibration
perf(onboard): reduce repeated DNS and TLS overhead in provider validation probes #3771 reduce DNS/TLS overhead
perf(onboard): parallelize safe onboard waits with cancellation handling #3775 parallelize safe onboard waits
perf(onboard): define onboard performance budget and CI regression signal #3776 performance budget and CI regression signal

Problem

Multiple PRs have independently worked around slow networking by bumping timeouts, adding retries, and inserting sleep delays. Each fix addresses a symptom, but the cumulative effect is a fragile onboard experience that breaks on slower hardware and adds minutes of wall-clock time on every platform.

Examples:

fix(onboard): widen inference verification timeouts on WSL2 (Fixes #987) #1998 / [Window x86 WSL2][Regression] NemoClaw/OpenShell inference verification reports false connectivity failure for NVIDIA Endpoints during onboarding #987: WSL2 inference validation timeouts doubled from 10s/15s → 20s/30s because DNS + TLS through the virtualized network stack is slow
ARM64 gateway health polls: 5 polls × 2s → 30 polls × 10s (5 minutes) because k3s init takes longer
WSL2 sandbox ready wait: 30 × 2s = 60s polling loop because pod init is slower under Docker Desktop
Ollama model pull: 10-minute timeout
Gateway start: exponential backoff with p-retry (10s min, factor 3)
30+ explicit sleep() calls scattered across onboard, recovery, and validation paths

As @brandonpelfrey noted in #1998: "Across multiple PRs from different folks I'm seeing things like add 10 seconds here and there so $thing doesn't time out. Want to make sure we're questioning why networking seems to be so fiddly." These timeout bumps are fragile — they work on the machines that were tested but will break on slower hardware.

Scope

This is a diagnostic and optimization effort, not a single bug fix. The goal is to understand why networking is slow and fix root causes rather than continuing to widen timeouts.

Phase 1: Diagnose

Profile the onboard network path end-to-end on native Linux, macOS, and WSL2: where is time actually spent?
Measure DNS resolution latency inside the k3s container vs. on the host — is CoreDNS adding overhead?
Measure TLS handshake time for inference provider endpoints from inside the sandbox vs. from the host
Determine whether the L7 proxy (gateway) adds measurable latency to inference validation probes
Check if host.openshell.internal resolution is slow on WSL2 (goes through Windows DNS?)
Profile the sleep() calls — which are covering real async settling vs. papering over race conditions?

Phase 2: Optimize

Based on diagnosis, potential fixes (non-exhaustive):

DNS caching/preflight: Pre-resolve and cache provider DNS during onboard before validation probes hit the gateway
Connection reuse: Validation probes currently spawn a new curl per attempt — a persistent connection (or at least keepalive) would skip repeated TCP+TLS handshakes
Parallel health checks: Some sequential polling loops could overlap (e.g., gateway health + sandbox ready + dashboard ready)
Reduce gateway round-trips: Validation probes go host → gateway → provider. If the gateway adds overhead, consider a direct probe option for validation only
Replace sleeps with event-driven waits: Many sleep(2) calls are waiting for a process or pod state — replace with kubectl wait, readiness probes, or file watches where possible
Platform-aware defaults: Instead of doubling timeouts for WSL2 as a special case, consider adaptive timeouts that measure the first probe latency and scale subsequent timeouts accordingly

Phase 3: Harden

Add onboard timing telemetry (opt-in) so we can see real-world latency distributions
Set a performance budget: onboard on a warm system with cached images should complete in < N seconds
Add CI timing regression tests that fail if onboard wall-clock exceeds the budget

Current timeout inventory

Location	Current value	Why
Validation probe (`onboard.ts`)	10s connect / 15s total (20/30 on WSL2)	Inference provider reachability
HTTP probe default (`http-probe.ts`)	10s connect / 60s total	Streaming inference responses
Local provider health (`local-inference.ts`)	3s connect / 5s total	Ollama/vLLM liveness
Gateway liveness (`nemoclaw.ts`)	3s max	Quick gateway up/down check
ARM64 health poll	30 × 10s = 300s	k3s slow init on ARM
WSL2 sandbox ready	30 × 2s = 60s	Pod init under Docker Desktop
Ollama model pull	600s (10 min)	Large model download
OpenShell install	300s (5 min)	Binary download + extract

References

fix(onboard): widen inference verification timeouts on WSL2 (Fixes #987) #1998 — WSL2 timeout widening (trigger for this investigation)
[Window x86 WSL2][Regression] NemoClaw/OpenShell inference verification reports false connectivity failure for NVIDIA Endpoints during onboarding #987 — Original WSL2 inference verification timeout issue

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: investigate and reduce networking latency during onboard and validation #4093

Uh oh!

{{title}}

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

perf: investigate and reduce networking latency during onboard and validation #4093

Uh oh!

ericksoa May 22, 2026 Maintainer

Problem

Scope

Phase 1: Diagnose

Phase 2: Optimize

Phase 3: Harden

Current timeout inventory

References

Replies: 0 comments

ericksoa
May 22, 2026
Maintainer