Commit f6120bb

fix(cli): add --json flag to per-sandbox status command (#4323)
## Summary

`nemoclaw <name> status` now accepts `--json` and emits a structured per-sandbox report, mirroring what the global `nemoclaw status --json` (added in #2790 / #2822) already exposes. Automation can read `hostGpuDetected`, `sandboxGpuEnabled`, `sandboxGpuMode`, `sandboxGpuDevice`, `openshellDriver`, and `openshellVersion` for a specific sandbox without scraping the text renderer.

## Related Issue

Fixes #4310. Enhances #2790 (resolved by #2822) by extending the JSON renderer to the per-sandbox status variant.

## Changes

- `src/commands/sandbox/status.ts` opts into `static enableJsonFlag = true`, updates the usage/examples to advertise `--json`, and branches on `this.jsonEnabled()` so the existing text path is unchanged.
- New `getSandboxStatusReport(sandboxName)` in `src/lib/actions/sandbox/status.ts` returns a `SandboxStatusReport`: `schemaVersion`, `name`, `found`, `model`, `provider`, `phase`, `gatewayState`, `inferenceHealth`, `hostGpuDetected`, `sandboxGpuEnabled`, `sandboxGpuMode`, `sandboxGpuDevice`, `openshellDriver`, `openshellVersion`, `policies`. `openshellDriver` and `openshellVersion` are normalised to the string `"unknown"` when missing so consumers can rely on `typeof` checks.
- The JSON path sets `process.exitCode = 1` when the sandbox is missing locally or the gateway state is not `present`, so scripts can distinguish "ready" from "drift/unreachable" without parsing the body.
- `test/cli.test.ts` covers the populated JSON shape, the `"unknown"` string fallback for `openshellDriver` / `openshellVersion`, the protobuf-mismatch path (asserts exit code `1`, `rpcIssue = { kind: "protobuf_mismatch" }`, `inferenceHealth = null`, and `model` / `provider` = `"unknown"`), and the `--help` advertising `--json`.
- `collectSandboxStatusSnapshot` is shared by `showSandboxStatus` and `getSandboxStatusReport`; it fail-closes on `detectOpenShellStateRpcResultIssue` so the JSON path no longer emits inferred `model` / `provider` / `inferenceHealth` alongside an `rpcIssue`. The host-side gateway-chain subprobe is appended in the collector so both text and JSON paths surface it.
- `docs/reference/commands.mdx` documents the `--json` flag in the per-sandbox status section (fields, fallback, exit-code contract) so the CLI/docs parity check stays green.

## Type of Change

- [ ] Code change (feature, bug fix, or refactor)
- [X] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification

- [X] `npx prek run --all-files` passes
- [X] `npm test` passes
- [X] Tests added or updated for new or changed behavior
- [X] No secrets, API keys, or credentials committed
- [X] Docs updated for user-facing behavior changes
- [ ] `npm run docs` builds without warnings (doc changes only)
- [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

## Summary by CodeRabbit

* **New Features**
  * `sandbox status` gains a `--json` mode that emits a structured per-sandbox JSON report (gateway state, inference health, GPU details, policies, model/provider, OpenShell driver/version).
* **Behavior**
  * `openshellDriver`/`openshellVersion` default to `"unknown"` when unset.
  * `--json` returns structured data and sets non-zero exit codes for missing sandbox, non-present gateway, or RPC/schema issues; text output unchanged without `--json`.
* **Tests**
  * Added coverage for JSON output, defaults, RPC issue handling, exit codes, and help text.
* **Documentation**
  * Command docs updated with `--json` behavior, fields, exit conditions, and examples.

---------

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Co-authored-by: Julie Yaunches <jyaunches@nvidia.com>
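
The fail-closed rule described in the Changes list can be sketched in isolation. This is a simplified illustration, not the code from the commit: `resolveInferenceFields` and the `InferenceSource` type are hypothetical names, and the real collector works with richer gateway and registry objects. It only shows that once an RPC issue is detected, `model` and `provider` are reported as `"unknown"` rather than inferred from the registry.

```typescript
// Hypothetical, reduced types for illustration only.
type RpcIssue = { kind: "image_drift" | "protobuf_mismatch" };

interface InferenceSource {
  model?: string;
  provider?: string;
}

interface InferenceFields {
  model: string;
  provider: string;
  rpcIssue: RpcIssue | null;
}

// Fail closed: when the gateway reply is flagged as an RPC issue, do not
// fall back to registry values, since they may no longer match reality.
function resolveInferenceFields(
  live: InferenceSource | null,
  registryEntry: InferenceSource | null,
  rpcIssue: RpcIssue | null,
): InferenceFields {
  if (rpcIssue) {
    return { model: "unknown", provider: "unknown", rpcIssue };
  }
  return {
    // Prefer the live gateway answer, then the registry, then "unknown".
    model: live?.model || registryEntry?.model || "unknown",
    provider: live?.provider || registryEntry?.provider || "unknown",
    rpcIssue: null,
  };
}

const registryEntry: InferenceSource = { model: "example-model", provider: "ollama-local" };
const healthy = resolveInferenceFields(null, registryEntry, null);
const mismatched = resolveInferenceFields(null, registryEntry, { kind: "protobuf_mismatch" });
console.log(healthy.model, mismatched.model);
```

The model name `example-model` is a placeholder; only the provider id `ollama-local` appears in the diff.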
1 parent eae6f8a commit f6120bb

4 files changed

Lines changed: 431 additions & 65 deletions

docs/reference/commands.mdx

Lines changed: 12 additions & 0 deletions
@@ -391,6 +391,18 @@ $ nemoclaw my-assistant recover
 
 Show sandbox status, health, and inference configuration.
 
+Pass `--json` to emit a structured per-sandbox report instead of the text renderer.
+The JSON output includes at least `schemaVersion`, `name`, `found`, `model`, `provider`, `phase`, `gatewayState`, `inferenceHealth`, `rpcIssue`, `hostGpuDetected`, `sandboxGpuEnabled`, `sandboxGpuMode`, `sandboxGpuDevice`, `openshellDriver`, `openshellVersion`, and `policies`.
+`openshellDriver` and `openshellVersion` are always strings (falling back to `"unknown"` when the registry has no value), so consumers can rely on `typeof` checks.
+The command exits non-zero when the sandbox is missing locally, the gateway state is not `present`, or the gateway reports a schema/protobuf mismatch (mirrored as `rpcIssue`).
+The alias form `nemoclaw <name> status --json` requires the sandbox to be registered locally; the canonical form `nemoclaw sandbox status <name> --json` is the one to use from automation that may run against an unknown sandbox name, since it still emits a JSON document with `found: false` instead of a text error.
+
+```console
+$ nemoclaw my-assistant status
+$ nemoclaw my-assistant status --json
+$ nemoclaw sandbox status my-assistant --json
+```
+
 The command probes every inference provider and reports one of three states on the `Inference` line:
 
 | State | Meaning |

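From the automation side, the documented contract could be consumed roughly as follows. This is a minimal sketch under stated assumptions: `interpretStatus`, `StatusReportSubset`, and `Verdict` are hypothetical names, the field subset is taken from the documented list, and actually spawning the CLI is left out so the snippet stays self-contained. The sample payload is illustrative, not captured output.

```typescript
// Subset of the documented report fields that a readiness check needs.
interface StatusReportSubset {
  schemaVersion: number;
  name: string;
  found: boolean;
  gatewayState: string;
  rpcIssue: { kind: string } | null;
  openshellDriver: string;
  openshellVersion: string;
}

type Verdict = "ready" | "not-found" | "rpc-issue" | "gateway-not-present";

// Hypothetical helper: classifies the stdout captured from
// `nemoclaw sandbox status <name> --json`.
function interpretStatus(stdout: string): { verdict: Verdict; report: StatusReportSubset } {
  const report = JSON.parse(stdout) as StatusReportSubset;
  // The docs state both fields are always strings, so typeof is sufficient.
  if (typeof report.openshellDriver !== "string" || typeof report.openshellVersion !== "string") {
    throw new Error("unexpected report shape");
  }
  let verdict: Verdict = "ready";
  if (!report.found) verdict = "not-found";
  else if (report.rpcIssue) verdict = "rpc-issue";
  else if (report.gatewayState !== "present") verdict = "gateway-not-present";
  return { verdict, report };
}

// Illustrative payload for an unregistered name queried through the canonical form.
const unknownSandbox = JSON.stringify({
  schemaVersion: 1,
  name: "no-such-sandbox",
  found: false,
  gatewayState: "gateway_error",
  rpcIssue: null,
  openshellDriver: "unknown",
  openshellVersion: "unknown",
});

const result = interpretStatus(unknownSandbox);
console.log(result.verdict);
```

A script that only needs a yes/no answer can rely on the exit code alone; parsing the body is useful when it has to tell the failure cases apart.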
src/commands/sandbox/status.ts

Lines changed: 20 additions & 4 deletions
@@ -3,24 +3,40 @@
 
 import { NemoClawCommand } from "../../lib/cli/nemoclaw-oclif-command";
 
-import { showSandboxStatus } from "../../lib/actions/sandbox/status";
+import { getSandboxStatusReport, showSandboxStatus } from "../../lib/actions/sandbox/status";
 import { sandboxNameArg } from "../../lib/sandbox/command-support";
+import { redactForLog } from "../../lib/security/redact";
 
 export default class SandboxStatusCommand extends NemoClawCommand {
   static id = "sandbox:status";
   static strict = true;
+  static enableJsonFlag = true;
   static summary = "Sandbox health and NIM status";
   static description = "Show sandbox health, OpenShell gateway state, and local NIM status.";
-  static usage = ["<name>"];
-  static examples = ["<%= config.bin %> sandbox status alpha"];
+  static usage = ["<name> [--json]"];
+  static examples = [
+    "<%= config.bin %> sandbox status alpha",
+    "<%= config.bin %> sandbox status alpha --json",
+  ];
   static args = {
     sandboxName: sandboxNameArg,
   };
   static flags = {
   };
 
-  public async run(): Promise<void> {
+  public async run(): Promise<unknown> {
     const { args } = await this.parse(SandboxStatusCommand);
+    if (this.jsonEnabled()) {
+      const report = await getSandboxStatusReport(args.sandboxName);
+      if (!report.found || report.gatewayState !== "present" || report.rpcIssue) {
+        process.exitCode = 1;
+      }
+      // #4310: route the machine-readable report through the centralized
+      // redactForLog source of truth so health diagnostics (inferenceHealth
+      // endpoint/detail/subprobes) cannot leak token-shaped values into
+      // automation that persists CLI JSON output.
+      return redactForLog(report);
+    }
     await showSandboxStatus(args.sandboxName);
   }
 }

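The exit-code condition in the JSON branch can be read as a pure predicate over three report fields. The sketch below restates it outside the command class; `exitCodeFor` and `ExitRelevantFields` are hypothetical names introduced for illustration, and the condition itself is copied from the diff.

```typescript
// Only the fields the exit-code decision looks at.
interface ExitRelevantFields {
  found: boolean;
  gatewayState: string;
  rpcIssue: { kind: "image_drift" | "protobuf_mismatch" } | null;
}

// Hypothetical helper mirroring the `if` in the JSON branch of run().
function exitCodeFor(report: ExitRelevantFields): 0 | 1 {
  const degraded =
    !report.found || report.gatewayState !== "present" || report.rpcIssue !== null;
  return degraded ? 1 : 0;
}

const cases: Array<[string, ExitRelevantFields]> = [
  ["ready", { found: true, gatewayState: "present", rpcIssue: null }],
  ["missing locally", { found: false, gatewayState: "present", rpcIssue: null }],
  ["gateway error", { found: true, gatewayState: "gateway_error", rpcIssue: null }],
  ["rpc issue", { found: true, gatewayState: "present", rpcIssue: { kind: "image_drift" } }],
];

const codes = cases.map(([label, report]) => `${label}: ${exitCodeFor(report)}`);
console.log(codes.join("\n"));
```

Note that the command assigns `process.exitCode` rather than calling `process.exit(1)`, so `run()` can still return the report; presumably the base command then serializes the returned value when `--json` is set.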
src/lib/actions/sandbox/status.ts

Lines changed: 136 additions & 61 deletions
@@ -63,6 +63,134 @@ export function getSandboxStatusInferenceHealth(
   });
 }
 
+export interface SandboxStatusReport {
+  schemaVersion: 1;
+  name: string;
+  found: boolean;
+  model: string;
+  provider: string;
+  phase: string | null;
+  gatewayState: string;
+  inferenceHealth: ProviderHealthStatus | null;
+  rpcIssue: { kind: "image_drift" | "protobuf_mismatch" } | null;
+  hostGpuDetected: boolean;
+  sandboxGpuEnabled: boolean;
+  sandboxGpuMode: string | null;
+  sandboxGpuDevice: string | null;
+  openshellDriver: string;
+  openshellVersion: string;
+  policies: string[];
+}
+
+interface SandboxStatusSnapshot {
+  sb: registry.SandboxEntry | null;
+  lookup: SandboxGatewayState;
+  rpcIssue: ReturnType<typeof detectOpenShellStateRpcResultIssue>;
+  currentModel: string;
+  currentProvider: string;
+  inferenceHealth: ProviderHealthStatus | null;
+}
+
+async function collectSandboxStatusSnapshot(
+  sandboxName: string,
+): Promise<SandboxStatusSnapshot> {
+  const sb = registry.getSandbox(sandboxName);
+  let lookup: SandboxGatewayState;
+  try {
+    lookup = await getReconciledSandboxGatewayState(sandboxName, {
+      getState: getSandboxGatewayStateForStatus,
+    });
+  } catch (err) {
+    const message = err instanceof Error ? err.message : String(err);
+    lookup = {
+      state: "gateway_error",
+      output: ` Could not probe live gateway state: ${message}`,
+    };
+  }
+  let liveResult: Awaited<ReturnType<typeof captureOpenshellForStatus>> | null = null;
+  if (lookup.state === "present") {
+    try {
+      liveResult = await captureOpenshellForStatus(["inference", "get"]);
+    } catch {
+      liveResult = null;
+    }
+  }
+  const rpcIssue = liveResult ? detectOpenShellStateRpcResultIssue(liveResult) : null;
+  if (rpcIssue) {
+    return {
+      sb,
+      lookup,
+      rpcIssue,
+      currentModel: "unknown",
+      currentProvider: "unknown",
+      inferenceHealth: null,
+    };
+  }
+  const live =
+    liveResult && !isCommandTimeout(liveResult) ? parseGatewayInference(liveResult.output) : null;
+  const currentModel = (live && live.model) || (sb && sb.model) || "unknown";
+  const currentProvider = (live && live.provider) || (sb && sb.provider) || "unknown";
+  const inferenceHealth = getSandboxStatusInferenceHealth(
+    lookup.state === "present",
+    currentProvider,
+    currentModel,
+  );
+  if (
+    inferenceHealth &&
+    lookup.state === "present" &&
+    (currentProvider === "ollama-local" || currentProvider === "vllm-local")
+  ) {
+    const gatewayChain = await probeSandboxInferenceGatewayHealth(sandboxName);
+    if (gatewayChain) {
+      const gatewaySubprobe: ProviderHealthStatus = {
+        ok: gatewayChain.ok,
+        probed: true,
+        providerLabel: "Inference gateway chain",
+        endpoint: gatewayChain.endpoint,
+        detail: gatewayChain.detail,
+        probeLabel: "gateway",
+        ...(gatewayChain.ok ? {} : { failureLabel: "unreachable" as const }),
+      };
+      inferenceHealth.subprobes = [...(inferenceHealth.subprobes ?? []), gatewaySubprobe];
+    }
+  }
+  return { sb, lookup, rpcIssue, currentModel, currentProvider, inferenceHealth };
+}
+
+export async function getSandboxStatusReport(
+  sandboxName: string,
+): Promise<SandboxStatusReport> {
+  const snapshot = await collectSandboxStatusSnapshot(sandboxName);
+  const { sb, lookup, rpcIssue, currentModel, currentProvider, inferenceHealth } = snapshot;
+  const phase =
+    lookup.state === "present" ? parseSandboxPhase(lookup.output || "") : null;
+  const sandboxGpuEnabled = sb
+    ? (sb.sandboxGpuEnabled ?? (sb.gpuEnabled === true))
+    : false;
+  const policies =
+    sb && Array.isArray(sb.policies)
+      ? sb.policies.filter((policy): policy is string => typeof policy === "string")
+      : [];
+  return {
+    schemaVersion: 1,
+    name: sandboxName,
+    found: !!sb,
+    model: currentModel,
+    provider: currentProvider,
+    phase,
+    gatewayState: lookup.state,
+    inferenceHealth,
+    rpcIssue: rpcIssue ? { kind: rpcIssue.kind } : null,
+    hostGpuDetected: !!(sb && sb.hostGpuDetected),
+    sandboxGpuEnabled,
+    sandboxGpuMode: (sb && sb.sandboxGpuMode) || null,
+    sandboxGpuDevice: (sb && sb.sandboxGpuDevice) || null,
+    openshellDriver: (sb && sb.openshellDriver) || "unknown",
+    openshellVersion: (sb && sb.openshellVersion) || "unknown",
+    policies,
+  };
+}
+
 /**
  * Render one Inference status line. The main probe and each subprobe go
  * through this helper so multi-hop providers (e.g. ollama-local backend +
@@ -113,73 +241,20 @@ async function printGatewayFailureLayerHeader(sandboxName: string): Promise<void
 
 // eslint-disable-next-line complexity
 export async function showSandboxStatus(sandboxName: string): Promise<void> {
-  const sb = registry.getSandbox(sandboxName);
-  maybeEnsureHermesToolGatewayBroker(sb);
   // #2666: never let an unexpected throw from the gateway probe (e.g. openshell
   // hanging when its container is stopped and the published port is held by a
   // foreign listener) suppress the sandbox header. The downstream switch
   // handles `gateway_error` by printing an actionable block + exit(1), so a
   // synthesized fallback keeps the user-visible contract intact.
-  let lookup: SandboxGatewayState;
-  try {
-    lookup = await getReconciledSandboxGatewayState(sandboxName, {
-      getState: getSandboxGatewayStateForStatus,
+  const snapshot = await collectSandboxStatusSnapshot(sandboxName);
+  const { sb, lookup, rpcIssue, currentModel, currentProvider, inferenceHealth } = snapshot;
+  maybeEnsureHermesToolGatewayBroker(sb);
+  if (rpcIssue) {
+    printOpenShellStateRpcIssue(rpcIssue, {
+      action: `checking inference status for sandbox '${sandboxName}'`,
+      command: `${CLI_NAME} ${sandboxName} status`,
     });
-  } catch (err) {
-    const message = err instanceof Error ? err.message : String(err);
-    lookup = {
-      state: "gateway_error",
-      output: ` Could not probe live gateway state: ${message}`,
-    };
-  }
-  let liveResult: Awaited<ReturnType<typeof captureOpenshellForStatus>> | null = null;
-  if (lookup.state === "present") {
-    try {
-      liveResult = await captureOpenshellForStatus(["inference", "get"]);
-    } catch {
-      liveResult = null;
-    }
-  }
-  if (liveResult) {
-    const inferenceIssue = detectOpenShellStateRpcResultIssue(liveResult);
-    if (inferenceIssue) {
-      printOpenShellStateRpcIssue(inferenceIssue, {
-        action: `checking inference status for sandbox '${sandboxName}'`,
-        command: `${CLI_NAME} ${sandboxName} status`,
-      });
-      process.exit(1);
-    }
-  }
-  const live =
-    liveResult && !isCommandTimeout(liveResult) ? parseGatewayInference(liveResult.output) : null;
-  const currentModel = (live && live.model) || (sb && sb.model) || "unknown";
-  const currentProvider = (live && live.provider) || (sb && sb.provider) || "unknown";
-  const inferenceHealth = getSandboxStatusInferenceHealth(
-    lookup.state === "present",
-    currentProvider,
-    currentModel,
-  );
-  // #3265 optional 3rd line: probe the full inference chain (openclaw gateway
-  // → auth proxy → backend) from inside the sandbox so a broken hop the
-  // host-side probes can't see still surfaces in `status`.
-  if (
-    inferenceHealth &&
-    lookup.state === "present" &&
-    (currentProvider === "ollama-local" || currentProvider === "vllm-local")
-  ) {
-    const gatewayChain = await probeSandboxInferenceGatewayHealth(sandboxName);
-    if (gatewayChain) {
-      const gatewaySubprobe: ProviderHealthStatus = {
-        ok: gatewayChain.ok,
-        probed: true,
-        providerLabel: "Inference gateway chain",
-        endpoint: gatewayChain.endpoint,
-        detail: gatewayChain.detail,
-        probeLabel: "gateway",
-        ...(gatewayChain.ok ? {} : { failureLabel: "unreachable" as const }),
-      };
-      inferenceHealth.subprobes = [...(inferenceHealth.subprobes ?? []), gatewaySubprobe];
-    }
+    process.exit(1);
   }
   if (sb) {
     console.log("");

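The registry-derived fields of the report follow a few small fallback rules that are easy to miss in the diff. The sketch below isolates them, assuming a reduced stand-in for `registry.SandboxEntry`: `EntryLike` and `registryFields` are hypothetical names, and only the expressions themselves are taken from `getSandboxStatusReport`.

```typescript
// Hypothetical stand-in for registry.SandboxEntry, reduced to the fields used here.
interface EntryLike {
  gpuEnabled?: boolean;
  sandboxGpuEnabled?: boolean;
  sandboxGpuMode?: string | null;
  openshellDriver?: string | null;
  openshellVersion?: string | null;
  policies?: unknown;
}

function registryFields(sb: EntryLike | null) {
  const policies: string[] =
    sb && Array.isArray(sb.policies)
      ? sb.policies.filter((policy): policy is string => typeof policy === "string")
      : [];
  return {
    found: sb !== null,
    // Prefer the newer flag; fall back to the legacy one only when it is unset.
    sandboxGpuEnabled: sb ? (sb.sandboxGpuEnabled ?? (sb.gpuEnabled === true)) : false,
    sandboxGpuMode: sb?.sandboxGpuMode || null,
    // `||` also maps an empty string to "unknown", so the field is never blank.
    openshellDriver: sb?.openshellDriver || "unknown",
    openshellVersion: sb?.openshellVersion || "unknown",
    policies,
  };
}

const missing = registryFields(null);
const legacy = registryFields({ gpuEnabled: true, openshellDriver: "", policies: ["net", 42] });
const explicitOff = registryFields({ sandboxGpuEnabled: false, gpuEnabled: true });
console.log(missing, legacy, explicitOff);
```

Because `??` only falls through on `null` or `undefined`, an explicit `sandboxGpuEnabled: false` wins over a legacy `gpuEnabled: true`. The policy name `net` is a placeholder.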