fix(platform): wait for instance health before ready #529
StanGirard wants to merge 15 commits into main
…sion restart
When a Codex session is relaunched, thread/resume can fail with "no rollout found" if the previous thread's rollout state is stale or missing. Previously this caused the entire init to fail, leaving the session stuck in "exited" state with no way to restart without manual intervention. Now the adapter catches non-transient resume errors and automatically falls back to thread/start, creating a fresh thread. Transient "Transport closed" errors are still retried via the existing backoff mechanism.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address Greptile review feedback: after falling back from thread/resume to thread/start, update options.threadId with the new thread ID so that subsequent resetForReconnect calls resume the correct (live) thread instead of repeatedly trying the stale one.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Greptile I pushed follow-up fixes on commit 52788ba for the latest PR head. CI is already running on this commit. Please re-review once the current run finishes.
Greptile Summary
This PR migrates instance provisioning from Fly.io to Hetzner Cloud and adds two unrelated improvements: a
Key changes:
Issues found:
Confidence Score: 2/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant UI as Browser/UI
participant Platform as Platform Routes
participant Prov as Provisioner
participant HC as HetznerCloudClient
participant H as Hetzner API
participant CI as Companion Instance
UI->>Platform: POST /instances/create-stream
Platform->>Prov: provision(input)
loop For each serverType x location candidate
Prov->>HC: createVolume(name, location, size)
HC->>H: POST /volumes
H-->>HC: volume id
Prov->>HC: createServer(serverType, location, user_data)
HC->>H: POST /servers
H-->>HC: server id + action id
Prov->>HC: waitForAction(actionId)
HC->>H: GET /actions/:id (poll)
H-->>HC: action status=success
Prov->>HC: waitForServerStatus(serverId, running)
HC->>H: GET /servers/:id (poll)
H-->>HC: server status=running
Prov->>CI: GET http://ipv4/health (poll every 5s, up to 4 min)
CI-->>Prov: HTTP 200 OK
Note over Prov: Health confirmed, break loop
end
Prov-->>Platform: providerMachineId, providerVolumeId, hostname
Platform->>Platform: INSERT into instances table
Platform-->>UI: SSE done event with instance data
UI->>Platform: GET /instances/:id/embed
Platform->>Platform: resolveAuthMode -> static_token
Platform-->>UI: 302 redirect to http://ipv4 with token param
UI->>CI: GET http://ipv4 with token param
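The health-check step in the diagram (poll every 5s, up to 4 min) could be sketched roughly as below. This is a minimal sketch, not the PR's `waitForCompanionHealth`: the names `waitUntilHealthySketch`, `ProbeFn`, `SleepFn`, and `PollOptionsSketch` are hypothetical, and the probe and sleep functions are injected so the loop can be exercised without a network. Only the 5-second interval and the 4-minute budget come from the diagram.

```typescript
// Hypothetical sketch of a bounded health poll; names are illustrative, not from the PR.
type ProbeFn = () => Promise<boolean>;
type SleepFn = (ms: number) => Promise<void>;

interface PollOptionsSketch {
  intervalMs?: number; // diagram: poll every 5s
  timeoutMs?: number; // diagram: up to 4 min
  sleep?: SleepFn; // injectable so tests do not really wait
}

const defaultSleep: SleepFn = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Resolves with the number of attempts used, or rejects once the budget is spent.
async function waitUntilHealthySketch(
  probe: ProbeFn,
  opts: PollOptionsSketch = {},
): Promise<number> {
  const intervalMs = opts.intervalMs ?? 5000;
  const timeoutMs = opts.timeoutMs ?? 240000;
  const sleep = opts.sleep ?? defaultSleep;
  const maxAttempts = Math.max(1, Math.ceil(timeoutMs / intervalMs));

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // A probe that throws (e.g. connection refused while the server boots)
    // is treated as "not ready yet" rather than as a fatal error.
    const healthy = await probe().catch(() => false);
    if (healthy) return attempt;
    if (attempt < maxAttempts) await sleep(intervalMs);
  }
  throw new Error(`Instance did not become healthy within ${timeoutMs} ms`);
}
```

In real use the probe would presumably wrap an HTTP call to the instance's `/health` endpoint and return whether the response was OK; how the PR treats non-200 responses is not shown in this conversation.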
4 issues found across 27 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="platform/src/components/CreateInstanceModal.tsx">
<violation number="1" location="platform/src/components/CreateInstanceModal.tsx:38">
P2: Do not unconditionally reset `region` after loading region options; it can overwrite a user’s in-progress selection.</violation>
</file>
<file name="platform/server/services/provisioner.ts">
<violation number="1" location="platform/server/services/provisioner.ts:110">
P2: Unknown regions are silently remapped to EU locations instead of being rejected, which can provision instances in the wrong geography.</violation>
</file>
<file name="platform/server/routes/instances.ts">
<violation number="1" location="platform/server/routes/instances.ts:427">
P1: Scaling older instances explicitly assigns `authMode: "static_token"`, which breaks their original implicit `"managed_jwt"` fallback authentication.</violation>
</file>
<file name="platform/server/services/provisioner.test.ts">
<violation number="1" location="platform/server/services/provisioner.test.ts:251">
P1: Using `opts.method` throws a `TypeError` when `fetch` is called with just a URL (as done in `waitForCompanionHealth`). Use optional chaining (`opts?.method`) instead. Note: Please fix all 6 occurrences of this across the fallback tests.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
+ .update(instancesTable)
+ .set({
+ machineStatus: "started",
+ config: { ...currentConfig, plan, provider: "hetzner", authMode: currentConfig.authMode || "static_token" },
P1: Scaling older instances explicitly assigns authMode: "static_token", which breaks their original implicit "managed_jwt" fallback authentication.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At platform/server/routes/instances.ts, line 427:
<comment>Scaling older instances explicitly assigns `authMode: "static_token"`, which breaks their original implicit `"managed_jwt"` fallback authentication.</comment>
<file context>
@@ -498,6 +399,38 @@ instances.post("/:id/restart", async (c) => {
+ .update(instancesTable)
+ .set({
+ machineStatus: "started",
+ config: { ...currentConfig, plan, provider: "hetzner", authMode: currentConfig.authMode || "static_token" },
+ })
+ .where(eq(instancesTable.id, id));
</file context>
Suggested change:
- config: { ...currentConfig, plan, provider: "hetzner", authMode: currentConfig.authMode || "static_token" },
+ config: { ...currentConfig, plan, provider: "hetzner", authMode: resolveAuthMode(row) },
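To illustrate why the fallback value matters here, the following is a hedged sketch of the behavior the reviewer describes. The real `resolveAuthMode(row)` is not shown in this conversation, so `resolveAuthModeSketch`, `scaleConfigSketch`, and the `InstanceConfigSketch` shape are assumptions made for illustration. The only facts taken from the review are that older instances have no stored `authMode` and implicitly rely on `"managed_jwt"`.

```typescript
// Hypothetical types and helpers; the PR's actual resolveAuthMode is not shown.
type AuthModeSketch = "static_token" | "managed_jwt";

interface InstanceConfigSketch {
  plan?: string;
  provider?: string;
  authMode?: AuthModeSketch;
}

// Assumption: a config without an explicit authMode belongs to an older
// instance that implicitly uses managed_jwt.
function resolveAuthModeSketch(config: InstanceConfigSketch): AuthModeSketch {
  return config.authMode ?? "managed_jwt";
}

// Builds the config written during scaling without changing how an
// existing instance authenticates.
function scaleConfigSketch(
  current: InstanceConfigSketch,
  plan: string,
): InstanceConfigSketch {
  return {
    ...current,
    plan,
    provider: "hetzner",
    authMode: resolveAuthModeSketch(current),
  };
}
```

With `currentConfig.authMode || "static_token"`, a legacy config would be rewritten to `"static_token"`; routing the decision through a single resolver keeps the write path and the read path in agreement.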
- setupProvisionFetchMock({ machineState: "stopped" });
+ it("falls back to next location when primary location is unavailable", async () => {
+ fetchMock.mockImplementation((url: string, opts: RequestInit) => {
+ const method = opts.method ?? "GET";
P1: Using opts.method throws a TypeError when fetch is called with just a URL (as done in waitForCompanionHealth). Use optional chaining (opts?.method) instead. Note: Please fix all 6 occurrences of this across the fallback tests.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At platform/server/services/provisioner.test.ts, line 251:
<comment>Using `opts.method` throws a `TypeError` when `fetch` is called with just a URL (as done in `waitForCompanionHealth`). Use optional chaining (`opts?.method`) instead. Note: Please fix all 6 occurrences of this across the fallback tests.</comment>
<file context>
@@ -46,329 +22,613 @@ function okResponse(data: unknown): Response {
- setupProvisionFetchMock({ machineState: "stopped" });
+ it("falls back to next location when primary location is unavailable", async () => {
+ fetchMock.mockImplementation((url: string, opts: RequestInit) => {
+ const method = opts.method ?? "GET";
+ if (url === "http://1.2.3.4/health" && method === "GET") {
+ return Promise.resolve(okResponse({ ok: true }));
</file context>
Suggested change:
- const method = opts.method ?? "GET";
+ const method = opts?.method ?? "GET";
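The failure mode is easy to reproduce outside the test suite. The sketch below uses a hypothetical `routeMockRequestSketch` function and a minimal `InitLikeSketch` type in place of the real `fetchMock` and `RequestInit`. It shows a handler that tolerates being called with only a URL, which is how the review says `waitForCompanionHealth` calls `fetch`.

```typescript
// Hypothetical stand-in for RequestInit; only the field used here is modelled.
interface InitLikeSketch {
  method?: string;
}

// The second parameter is optional, matching fetch(url) calls with no init.
// With `opts.method` (no optional chaining) a URL-only call would throw a
// TypeError at runtime because opts is undefined.
function routeMockRequestSketch(url: string, opts?: InitLikeSketch): string {
  const method = opts?.method ?? "GET";
  if (url === "http://1.2.3.4/health" && method === "GET") {
    return "health-ok";
  }
  return `${method} ${url}`;
}
```

Declaring the parameter as optional in the mock's signature also lets the type checker flag an unguarded `opts.method`, which the original `opts: RequestInit` annotation hides.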
+ ? status.provisioning.regions
+ : DEFAULT_REGIONS;
+ setRegionOptions(regions);
+ setRegion(regions[0].value);
P2: Do not unconditionally reset region after loading region options; it can overwrite a user’s in-progress selection.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At platform/src/components/CreateInstanceModal.tsx, line 38:
<comment>Do not unconditionally reset `region` after loading region options; it can overwrite a user’s in-progress selection.</comment>
<file context>
@@ -9,16 +9,43 @@ interface CreateInstanceModalProps {
+ ? status.provisioning.regions
+ : DEFAULT_REGIONS;
+ setRegionOptions(regions);
+ setRegion(regions[0].value);
+ })
+ .catch(() => {
</file context>
Suggested change:
- setRegion(regions[0].value);
+ setRegion((prev) => (regions.some((r) => r.value === prev) ? prev : regions[0].value));
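The selection rule in the suggested change can be pulled out into a pure function, which makes it testable without rendering the modal. This is a sketch only: `keepSelectionSketch` and `RegionOptionSketch` are hypothetical names, the labels in the usage note are invented, and the empty-list guard is an addition that the suggestion itself does not include.

```typescript
// Hypothetical shape for a region option; the real type is not shown in the PR.
interface RegionOptionSketch {
  value: string;
  label: string;
}

// Keeps the current selection when it is still offered; otherwise falls back
// to the first option. If the list is empty, the previous value is returned
// unchanged (an added guard, not part of the reviewer's suggestion).
function keepSelectionSketch(
  prev: string,
  regions: RegionOptionSketch[],
): string {
  if (regions.some((r) => r.value === prev)) return prev;
  return regions[0]?.value ?? prev;
}
```

Inside the component this would presumably be used as `setRegion((prev) => keepSelectionSketch(prev, regions))`, so a region the user picked while the options were loading survives the update.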
+ if (normalized === "cdg") return ["nbg1", "hel1", "fsn1"];
+ if (normalized === "fra") return ["nbg1", "hel1", "fsn1"];
+ if (normalized === "ams") return ["nbg1", "hel1", "fsn1"];
+ return ["nbg1", "hel1", "fsn1"];
P2: Unknown regions are silently remapped to EU locations instead of being rejected, which can provision instances in the wrong geography.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At platform/server/services/provisioner.ts, line 110:
<comment>Unknown regions are silently remapped to EU locations instead of being rejected, which can provision instances in the wrong geography.</comment>
<file context>
@@ -26,171 +21,387 @@ interface ProvisionInput {
+ if (normalized === "cdg") return ["nbg1", "hel1", "fsn1"];
+ if (normalized === "fra") return ["nbg1", "hel1", "fsn1"];
+ if (normalized === "ams") return ["nbg1", "hel1", "fsn1"];
+ return ["nbg1", "hel1", "fsn1"];
+ }
+
</file context>
Suggested change:
- return ["nbg1", "hel1", "fsn1"];
+ throw new Error(`Unsupported region: ${region}`);
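One way to express the stricter mapping is a lookup against an explicit allow-list. The sketch below is an illustration under assumptions: `candidateLocationsSketch` is a hypothetical name, and the trim/lowercase normalization is a guess, since the diff context shows a `normalized` variable but not how it is computed. The region codes and the Hetzner location list are copied from the diff.

```typescript
// Region codes and location order are taken from the diff context above.
const SUPPORTED_REGIONS_SKETCH = new Set<string>(["cdg", "fra", "ams"]);
const EU_LOCATIONS_SKETCH: readonly string[] = ["nbg1", "hel1", "fsn1"];

// Returns the ordered Hetzner locations to try for a region, or throws for a
// region that is not explicitly supported instead of silently choosing EU.
function candidateLocationsSketch(region: string): string[] {
  // Assumption: normalization is trim + lowercase; the PR's rule is not shown.
  const normalized = region.trim().toLowerCase();
  if (SUPPORTED_REGIONS_SKETCH.has(normalized)) {
    return [...EU_LOCATIONS_SKETCH];
  }
  throw new Error(`Unsupported region: ${region}`);
}
```

Because all three known regions currently map to the same list, a set membership check avoids repeating the array, and the function returns a copy so callers cannot mutate the shared list.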
Summary
Testing
Review provenance