
Commit a5e7e63

fix(connect): recover docker-driver inference route without the cluster DNS repair (#3403) (#4551)
## Summary

The docker-driver gateway now recovers a broken inference.local route through `openshell inference set` on connect, instead of running the k3s-only CoreDNS cluster repair that can never find its container under the docker driver. Reported on a Docker-driver host in #3403.

## Related Issue

Closes #3403

## Problem

`shouldUseLegacyDnsProxyRepair` in `src/lib/actions/sandbox/connect.ts` returned true for every driver except `"vm"`, so a `"docker"` sandbox took the legacy repair path. That path (`runSetupDnsProxy`) patches CoreDNS inside an `openshell-cluster-<name>` container, which only the k3s/kubernetes gateway runs. The docker driver runs the gateway as `nemoclaw-openshell-gateway` with host networking and has no such container, so `runSetupDnsProxy` aborted with `WARNING: Could not find gateway container for '<name>'. DNS proxy not installed.` and inference.local stayed unreachable. After `nemoclaw <name> connect`, `openclaw tui` then failed with `LLM request failed: network connection error`.

The contract elsewhere already excludes docker from this step: `usesGatewayMetadataProbe` (`snapshot.ts`) treats `"docker"` and `"vm"` as cluster-less drivers, and the snapshot DNS-proxy step is guarded by `openshellDriver !== "docker"`.

## Changes

- Excluded `"docker"` from `shouldUseLegacyDnsProxyRepair` so the docker driver takes the non-legacy branch, which recovers the route via `openshell inference set` (the same `reapplyVmInferenceRoute` step the vm driver uses) and reports an accurate `inference.local is unavailable ... Reapplying OpenShell inference route` message instead of the misleading cluster-container warning.
- Left the kubernetes driver on the legacy CoreDNS repair path, where the `openshell-cluster-<name>` container exists.
- Added a docker-driver case in `test/sandbox-connect-inference.test.ts`: a broken inference.local probe now triggers the `inference set` reapply (not the legacy cluster repair), asserts no `get service kube-dns` call, and checks for the `Reapplying OpenShell inference route` / `inference.local route repaired` output.
- Repointed the two existing tests that exercise the CoreDNS cluster repair and the managed-route reset from `openshellDriver: "docker"` to `"kubernetes"`, since the `openshell-cluster-<name>` container only exists for the k3s driver.

## Type of Change

- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification

- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [ ] `npm run docs` builds without warnings (doc changes only)
- [ ] Doc pages follow the style guide (doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

Ran: full `test/sandbox-connect-inference.test.ts` suite passes (18/18); `npm run typecheck:cli` and `npm run build:cli` clean.

---

Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>

## Summary by CodeRabbit

* **Bug Fixes**
  * Refined DNS proxy repair logic to ensure docker sandboxes use the correct route recovery mechanism instead of legacy cluster DNS repair.
* **Tests**
  * Updated DNS proxy repair tests to reflect Kubernetes-specific behavior.
  * Added regression test for docker sandbox route recovery verification.

Co-authored-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
1 parent 2d0a78b commit a5e7e63

2 files changed

Lines changed: 65 additions & 4 deletions


src/lib/actions/sandbox/connect.ts

Lines changed: 9 additions & 1 deletion
@@ -263,7 +263,15 @@ function probeSandboxInferenceRoute(
 }
 
 function shouldUseLegacyDnsProxyRepair(sb: SandboxEntry | null): boolean {
-  return sb?.openshellDriver !== "vm";
+  // The legacy repair patches CoreDNS inside an `openshell-cluster-<name>`
+  // container, which only the k3s/kubernetes gateway runs. The docker driver
+  // runs the gateway as `nemoclaw-openshell-gateway` with host networking, and
+  // the vm driver has no cluster container either, so both recover the route via
+  // `openshell inference set` instead of the cluster CoreDNS patch. Mirrors
+  // usesGatewayMetadataProbe (snapshot.ts) and the `!== "docker"` guard on the
+  // snapshot DNS-proxy step. (#3403)
+  const driver = sb?.openshellDriver;
+  return driver !== "vm" && driver !== "docker";
 }
 
 function buildInferenceSetArgs(provider: string, model: string): string[] {
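
The effect of the predicate change above can be checked in isolation. The following is a minimal sketch, not the project's code: `SandboxEntryLike`, `OpenshellDriver`, `legacyRepairBefore`, and `legacyRepairAfter` are hypothetical stand-ins introduced here, and the driver list is assumed from the three values named in this commit. It compares the old and new return values per driver.

```typescript
// Hypothetical, simplified stand-ins for the real types in connect.ts.
type OpenshellDriver = "docker" | "vm" | "kubernetes";

interface SandboxEntryLike {
  openshellDriver?: OpenshellDriver;
}

// Predicate as it read before this commit: only "vm" skipped the legacy repair.
function legacyRepairBefore(sb: SandboxEntryLike | null): boolean {
  return sb?.openshellDriver !== "vm";
}

// Predicate as it reads after this commit: "vm" and "docker" both skip it.
function legacyRepairAfter(sb: SandboxEntryLike | null): boolean {
  const driver = sb?.openshellDriver;
  return driver !== "vm" && driver !== "docker";
}

// Compare both versions for each driver named in the commit message.
const drivers: OpenshellDriver[] = ["docker", "vm", "kubernetes"];
const comparison = drivers.map((driver) => ({
  driver,
  before: legacyRepairBefore({ openshellDriver: driver }),
  after: legacyRepairAfter({ openshellDriver: driver }),
}));

console.table(comparison);
```

Only the `"docker"` row differs between the two columns. Note that a `null` sandbox or an unset driver still yields `true` in both versions, so entries without a recorded driver stay on the legacy path.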

test/sandbox-connect-inference.test.ts

Lines changed: 56 additions & 3 deletions
@@ -479,7 +479,7 @@ describe("sandbox connect inference route swap (#1248)", () => {
   );
 
   it(
-    "repairs the sandbox DNS proxy when inference.local returns 503",
+    "repairs the kubernetes sandbox DNS proxy when inference.local returns 503",
     testTimeoutOptions(20_000),
     () => {
       const { tmpDir, stateFile, sandboxName } = setupFixture(
@@ -488,7 +488,7 @@ describe("sandbox connect inference route swap (#1248)", () => {
           model: "nvidia/nemotron-3-super-120b-a12b",
           provider: "nvidia-prod",
           gpuEnabled: false,
-          openshellDriver: "docker",
+          openshellDriver: "kubernetes",
           policies: [],
         },
         "nvidia-prod",
@@ -531,6 +531,59 @@ describe("sandbox connect inference route swap (#1248)", () => {
     },
   );
 
+  it(
+    "recovers the route via inference set for docker sandboxes without the legacy cluster repair (#3403)",
+    testTimeoutOptions(20_000),
+    () => {
+      const { tmpDir, stateFile, sandboxName } = setupFixture(
+        {
+          name: "docker-route-sandbox",
+          model: "nvidia/nemotron-3-super-120b-a12b",
+          provider: "nvidia-prod",
+          gpuEnabled: false,
+          openshellDriver: "docker",
+          policies: [],
+        },
+        "nvidia-prod",
+        "nvidia/nemotron-3-super-120b-a12b",
+        {
+          inferenceProbeResponses: [
+            'BROKEN 503 {"error":"inference service unavailable"}',
+            "OK 200",
+          ],
+        },
+      );
+
+      const result = runConnect(tmpDir, sandboxName);
+      expect(result.status).toBe(0);
+
+      const state = JSON.parse(fs.readFileSync(stateFile, "utf-8"));
+      const dockerCalls = state.dockerCalls as string[][];
+      // The docker driver has no openshell-cluster container (the gateway runs
+      // as nemoclaw-openshell-gateway with host networking), so it must NOT take
+      // the legacy CoreDNS cluster repair; it recovers via `inference set`. (#3403)
+      expect(state.inferenceSetCalls).toEqual([
+        [
+          "--provider",
+          "nvidia-prod",
+          "--model",
+          "nvidia/nemotron-3-super-120b-a12b",
+          "--no-verify",
+        ],
+      ]);
+      expect(
+        dockerCalls.some((call) =>
+          call.join(" ").includes("get service kube-dns"),
+        ),
+      ).toBe(false);
+
+      const combined = (result.stdout || "") + (result.stderr || "");
+      expect(combined).toContain("Reapplying OpenShell inference route");
+      expect(combined).toContain("inference.local route repaired");
+      expect(combined).not.toContain("Could not find gateway container");
+    },
+  );
+
   it(
     "does not run legacy DNS proxy repair for VM sandboxes",
     testTimeoutOptions(20_000),
@@ -933,7 +986,7 @@ describe("sandbox connect inference route swap (#1248)", () => {
           model: "nvidia/nemotron-3-super-120b-a12b",
           provider: "nvidia-prod",
           gpuEnabled: false,
-          openshellDriver: "docker",
+          openshellDriver: "kubernetes",
           policies: [],
         },
         "nvidia-prod",
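
The docker-driver regression test asserts the absence of the legacy repair by scanning recorded `docker` invocations for `get service kube-dns`. A minimal sketch of that check follows, assuming each recorded call is an argv array as the `string[][]` cast in the test suggests. The helper name `anyCallMentions` and both sample recordings are hypothetical and made up for illustration; they are not taken from the test harness.

```typescript
// Assumed shape: one recorded call is the argv passed to the stubbed binary.
type RecordedCall = string[];

// Hypothetical helper: true when any recorded call, joined with spaces,
// contains the given substring. Same logic as the inline check in the test.
function anyCallMentions(calls: RecordedCall[], needle: string): boolean {
  return calls.some((call) => call.join(" ").includes(needle));
}

// Made-up recordings, for illustration only.
const dockerDriverCalls: RecordedCall[] = [["ps", "--format", "{{.Names}}"]];
const kubernetesDriverCalls: RecordedCall[] = [
  ["exec", "openshell-cluster-demo", "kubectl", "get", "service", "kube-dns"],
];

console.log(anyCallMentions(dockerDriverCalls, "get service kube-dns")); // false
console.log(anyCallMentions(kubernetesDriverCalls, "get service kube-dns")); // true
```

Joining the argv before matching lets the substring span several arguments, which a per-argument comparison would miss.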
