Skip to content

fix(platform): wait for instance health before ready#529

Open
StanGirard wants to merge 15 commits intomainfrom
feat/hetzner-provider-scaling
Open

fix(platform): wait for instance health before ready#529
StanGirard wants to merge 15 commits intomainfrom
feat/hetzner-provider-scaling

Conversation

@StanGirard
Copy link
Copy Markdown
Contributor

Summary

  • wait for Companion /health before marking Hetzner instances ready
  • surface Codex resume fallback context reset to the browser
  • add platform/web tests for the new behavior

Testing

  • cd platform && bun run test --run server/routes/instances.test.ts
  • cd platform && bun run typecheck
  • cd web && bun run test --run server/codex-adapter.test.ts
  • cd web && bun run typecheck

Review provenance

  • Implemented by AI agent
  • Human review: no

Codex-Agent and others added 14 commits March 13, 2026 15:08
…sion restart

When a Codex session is relaunched, thread/resume can fail with "no rollout
found" if the previous thread's rollout state is stale or missing. Previously
this caused the entire init to fail, leaving the session stuck in "exited"
state with no way to restart without manual intervention.

Now the adapter catches non-transient resume errors and automatically falls
back to thread/start, creating a fresh thread. Transient "Transport closed"
errors are still retried via the existing backoff mechanism.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address Greptile review feedback: after falling back from thread/resume to
thread/start, update options.threadId with the new thread ID so that
subsequent resetForReconnect calls resume the correct (live) thread instead
of repeatedly trying the stale one.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@vercel
Copy link
Copy Markdown

vercel Bot commented Mar 13, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
companion Ready Ready Preview, Comment Mar 13, 2026 8:56pm

Request Review

@StanGirard
Copy link
Copy Markdown
Contributor Author

@Greptile I pushed follow-up fixes on commit 52788ba for the latest PR head. CI is already running on this commit. Please re-review once the current run finishes.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 13, 2026

Greptile Summary

This PR migrates instance provisioning from Fly.io to Hetzner Cloud and adds two unrelated improvements: a thread/resume fallback in the Codex adapter and conditional Secure cookie behavior for plain-HTTP IPv4 instances.

Key changes:

  • Hetzner provisioner: A full rewrite of Provisioner using a new HetznerCloudClient. Provisioning now deploys a Docker container via cloud-init user data, with location/server-type retry logic and a new waitForCompanionHealth step that polls GET /health before marking instances ready.
  • Auth mode: New static_token auth mode returns the raw authSecret directly instead of wrapping it in a JWT. Embed redirects use http:// for raw IPv4 hostnames.
  • DB schema: flyMachineId/flyVolumeId columns are re-exposed as providerMachineId/providerVolumeId at the application layer while preserving the legacy SQL column names.
  • /scale endpoint: New route powers off a server, changes its type, and powers it back on.
  • Codex resume fallback: On non-transient thread/resume failure, the adapter falls back to thread/start, updates its internal threadId, and sends a "context was reset" browser notification.
  • Managed-auth Secure flag: The Secure cookie attribute is now omitted for direct HTTP requests (raw IPv4 access), enabling auth cookies to work on plain-HTTP Hetzner instances.

Issues found:

  • Potential auth bypass (provisioner.ts lines 674–677): Cloud-init disables the JWT managed-auth middleware and stores the auth credential under a different env var name than the one the middleware reads. If no separate auth layer in the companion server handles this key when JWT auth is off, instances will be publicly accessible. This needs verification before merging.
  • Misleading deprovision comment (provisioner.ts line 331): The "may already be stopped" comment is inside the throw branch, not the 404-swallow branch.
  • Silent 4-minute wait on empty hostname (provisioner.ts lines 147–162): If no public IPv4 is available, waitForCompanionHealth polls an invalid URL for 240 s with no fast-fail.
  • Deprecated type loops all locations (provisioner.ts lines 283–308): When a deprecated server type error fires, continue advances the inner location loop instead of breaking to the next type, causing unnecessary API calls.

Confidence Score: 2/5

  • Not safe to merge until the auth env var discrepancy in the cloud-init user data is verified against the companion runtime.
  • The Hetzner provisioner and routing changes are generally well-structured with good test coverage. However, the cloud-init disables the JWT managed-auth middleware while storing the auth credential under a different env var than the one the middleware reads. If the companion server has no separate code path validating that key when JWT auth is off, every provisioned Hetzner instance will be publicly accessible without authentication. This single issue is a potential critical security regression that blocks a safe merge.
  • platform/server/services/provisioner.ts — specifically the buildHetznerUserData method's auth env var configuration around lines 670–680.

Important Files Changed

Filename Overview
platform/server/services/provisioner.ts Rewrites Provisioner from Fly.io to Hetzner Cloud with health-check wait, location/type retry logic, and a new resize operation. Contains a potential auth bypass (disabled JWT middleware with mismatched env var), a misleading comment in deprovision, silent 4-minute polling on empty hostname, and suboptimal retry loop for deprecated server types.
platform/server/services/hetzner-cloud.ts New Hetzner Cloud API client with CRUD for servers/volumes, power lifecycle, type-change, action polling, and server-status polling. Clean implementation with proper error propagation.
platform/server/routes/instances.ts Migrates instance routes from Fly.io to Hetzner, adds /scale endpoint, adds static_token auth mode (returns raw authSecret instead of JWT), uses HTTP for IPv4 embed redirects. DB field references updated to provider-neutral names.
platform/server/db/schema.ts Renames flyMachineId/flyVolumeId properties to providerMachineId/providerVolumeId while keeping the legacy SQL column names for backward compatibility. Safe migration approach.
web/server/codex-adapter.ts Adds thread/resume fallback: on non-transient failure, falls back to thread/start, updates internal threadId, and emits a "context was reset" browser message. Transient-error detection uses exact string match which may be fragile.
web/server/middleware/managed-auth.ts Conditionally omits the Secure cookie flag for direct HTTP (raw IPv4) access, enabling auth cookies to function on plain-HTTP Hetzner instances without breaking HTTPS instances.
platform/src/lib/api.ts Adds ControlPlaneStatus type, fixes SSE parser for CRLF line endings, and resolves the createInstanceStream promise early on the done event (canceling the reader) rather than waiting for stream close.
platform/src/components/CreateInstanceModal.tsx Fetches available regions from /api/status on mount and renders them dynamically instead of hardcoding Fly.io regions. Falls back to defaults on error.
platform/server/services/provisioner.test.ts Comprehensive test suite for Hetzner provisioner covering health-check polling, location/type fallback chains, cloud-init sanitization, resize lifecycle, and geographic isolation. The wait() method is spied out to avoid real delays.
platform/server/routes/instances.test.ts Updated test suite for Hetzner routes covering provisioning, streaming, delete, scale, token issuance, and embed redirect. Several detailed edge-case tests from the Fly era (404 deprovision still deletes DB row, ensurePublicIps failure) were removed without equivalent replacements.

Sequence Diagram

sequenceDiagram
    participant UI as Browser/UI
    participant Platform as Platform Routes
    participant Prov as Provisioner
    participant HC as HetznerCloudClient
    participant H as Hetzner API
    participant CI as Companion Instance

    UI->>Platform: POST /instances/create-stream
    Platform->>Prov: provision(input)
    loop For each serverType x location candidate
        Prov->>HC: createVolume(name, location, size)
        HC->>H: POST /volumes
        H-->>HC: volume id
        Prov->>HC: createServer(serverType, location, user_data)
        HC->>H: POST /servers
        H-->>HC: server id + action id
        Prov->>HC: waitForAction(actionId)
        HC->>H: GET /actions/:id (poll)
        H-->>HC: action status=success
        Prov->>HC: waitForServerStatus(serverId, running)
        HC->>H: GET /servers/:id (poll)
        H-->>HC: server status=running
        Prov->>CI: GET http://ipv4/health (poll every 5s, up to 4 min)
        CI-->>Prov: HTTP 200 OK
        Note over Prov: Health confirmed, break loop
    end
    Prov-->>Platform: providerMachineId, providerVolumeId, hostname
    Platform->>Platform: INSERT into instances table
    Platform-->>UI: SSE done event with instance data

    UI->>Platform: GET /instances/:id/embed
    Platform->>Platform: resolveAuthMode -> static_token
    Platform-->>UI: 302 redirect to http://ipv4 with token param
    UI->>CI: GET http://ipv4 with token param
Loading

Comments Outside Diff (1)

  1. platform/server/services/provisioner.ts, line 674-677 (link)

    Potential auth bypass: cloud-init disables managed-auth middleware

    The cloud-init user data sets the auth-enabled flag to 0, disabling the companion's JWT middleware. The provisioned auth credential is stored under a different env var key than the one the managed-auth middleware reads.

    With the middleware disabled, if no separate auth layer in the companion server validates the credential when the flag is 0, Hetzner instances will accept unauthenticated connections. The token appended to the embed redirect URL will reach the instance but never be enforced.

    Please confirm a separate validation path exists in the companion runtime for this token-based mode before merging. If none exists, the flag should be set to 1 and the credential written under the env var key that the managed-auth middleware actually reads.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: platform/server/services/provisioner.ts
Line: 329-336

Comment:
**Misleading comment placed in the wrong branch**

The comment `// Instance may already be stopped or removed.` describes the case where the server is already gone (404), but it sits inside the `if (!this.isNotFoundError(err))` branch — which is the **throw** path, not the swallow path. The comment should be just outside or below the `if` block (on the implicit 404-swallow path).

```suggestion
    } catch (err) {
      if (!this.isNotFoundError(err)) {
        throw err;
      }
      // Instance may already be stopped or removed — safe to continue.
    }
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: platform/server/services/provisioner.ts
Line: 147-162

Comment:
**Silent 4-minute wait when server has no public IPv4**

If `server.public_net?.ipv4?.ip` is empty (e.g., IPv6-only server, or Hetzner API anomaly), `serverIpv4` will be `""`. The call site passes `serverIpv4 || resolvedHostname`, but `resolvedHostname` may also be an IP. If both are empty, this method polls `http:///health` every 5 seconds for **240 seconds** and then fails with `Companion instance at  did not become healthy within 240000ms` — which gives no actionable information.

Consider adding a fast-fail guard:

```typescript
private async waitForCompanionHealth(hostname: string, timeoutMs = 240_000): Promise<void> {
  if (!hostname) throw new Error("Cannot poll companion health: no hostname/IP available");
  const url = `http://${hostname}/health`;
  ...
}
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: platform/server/services/provisioner.ts
Line: 283-308

Comment:
**Deprecated server type loops through all locations needlessly**

When `isDeprecatedServerTypeError(err)` fires, `continue` advances the **inner** `for (const location of candidateLocations)` loop. This means every remaining location is tried with the same deprecated type before the outer loop advances to the next candidate type. This causes unnecessary API calls (Hetzner charges for failed server-create actions) and wasted round-trips.

A `break` on the inner loop (with `continue` on the outer one) would fix this. For example:

```typescript
if (
  this.isInvalidLocationError(err) ||
  this.isUnsupportedServerLocationError(err) ||
  this.isServerLocationDisabledError(err)
) {
  lastError = err;
  continue; // try next location, same server type
}
if (this.isDeprecatedServerTypeError(err)) {
  lastError = err;
  break; // all locations will fail for deprecated type — skip to next type
}
throw err;
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: web/server/codex-adapter.ts
Line: 801-807

Comment:
**Exact-match transient-error check may be brittle**

The fallback guard uses `resumeErr.message === "Transport closed"` with a strict equality check. If the transport layer ever uses a slightly different message (e.g., `"transport closed"`, `"Transport Closed"`, or wraps the underlying error), the condition will silently miss the transient case, triggering a full thread reset and surfacing a "context was reset" banner to the user unnecessarily.

Consider using `includes` or extracting the sentinel string as a constant:

```suggestion
              const isTransport = resumeErr instanceof Error && resumeErr.message.includes("Transport closed");
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: platform/server/services/provisioner.ts
Line: 674-677

Comment:
**Potential auth bypass: cloud-init disables managed-auth middleware**

The cloud-init user data sets the auth-enabled flag to `0`, disabling the companion's JWT middleware. The provisioned auth credential is stored under a different env var key than the one the managed-auth middleware reads.

With the middleware disabled, if no separate auth layer in the companion server validates the credential when the flag is `0`, Hetzner instances will accept unauthenticated connections. The token appended to the embed redirect URL will reach the instance but never be enforced.

Please confirm a separate validation path exists in the companion runtime for this token-based mode before merging. If none exists, the flag should be set to `1` and the credential written under the env var key that the managed-auth middleware actually reads.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 9cb13ba

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 issues found across 27 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="platform/src/components/CreateInstanceModal.tsx">

<violation number="1" location="platform/src/components/CreateInstanceModal.tsx:38">
P2: Do not unconditionally reset `region` after loading region options; it can overwrite a user’s in-progress selection.</violation>
</file>

<file name="platform/server/services/provisioner.ts">

<violation number="1" location="platform/server/services/provisioner.ts:110">
P2: Unknown regions are silently remapped to EU locations instead of being rejected, which can provision instances in the wrong geography.</violation>
</file>

<file name="platform/server/routes/instances.ts">

<violation number="1" location="platform/server/routes/instances.ts:427">
P1: Scaling older instances explicitly assigns `authMode: "static_token"`, which breaks their original implicit `"managed_jwt"` fallback authentication.</violation>
</file>

<file name="platform/server/services/provisioner.test.ts">

<violation number="1" location="platform/server/services/provisioner.test.ts:251">
P1: Using `opts.method` throws a `TypeError` when `fetch` is called with just a URL (as done in `waitForCompanionHealth`). Use optional chaining (`opts?.method`) instead. Note: Please fix all 6 occurrences of this across the fallback tests.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

.update(instancesTable)
.set({
machineStatus: "started",
config: { ...currentConfig, plan, provider: "hetzner", authMode: currentConfig.authMode || "static_token" },
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Scaling older instances explicitly assigns authMode: "static_token", which breaks their original implicit "managed_jwt" fallback authentication.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At platform/server/routes/instances.ts, line 427:

<comment>Scaling older instances explicitly assigns `authMode: "static_token"`, which breaks their original implicit `"managed_jwt"` fallback authentication.</comment>

<file context>
@@ -498,6 +399,38 @@ instances.post("/:id/restart", async (c) => {
+    .update(instancesTable)
+    .set({
+      machineStatus: "started",
+      config: { ...currentConfig, plan, provider: "hetzner", authMode: currentConfig.authMode || "static_token" },
+    })
+    .where(eq(instancesTable.id, id));
</file context>
Suggested change
config: { ...currentConfig, plan, provider: "hetzner", authMode: currentConfig.authMode || "static_token" },
config: { ...currentConfig, plan, provider: "hetzner", authMode: resolveAuthMode(row) },
Fix with Cubic

setupProvisionFetchMock({ machineState: "stopped" });
it("falls back to next location when primary location is unavailable", async () => {
fetchMock.mockImplementation((url: string, opts: RequestInit) => {
const method = opts.method ?? "GET";
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Using opts.method throws a TypeError when fetch is called with just a URL (as done in waitForCompanionHealth). Use optional chaining (opts?.method) instead. Note: Please fix all 6 occurrences of this across the fallback tests.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At platform/server/services/provisioner.test.ts, line 251:

<comment>Using `opts.method` throws a `TypeError` when `fetch` is called with just a URL (as done in `waitForCompanionHealth`). Use optional chaining (`opts?.method`) instead. Note: Please fix all 6 occurrences of this across the fallback tests.</comment>

<file context>
@@ -46,329 +22,613 @@ function okResponse(data: unknown): Response {
-      setupProvisionFetchMock({ machineState: "stopped" });
+  it("falls back to next location when primary location is unavailable", async () => {
+    fetchMock.mockImplementation((url: string, opts: RequestInit) => {
+      const method = opts.method ?? "GET";
+      if (url === "http://1.2.3.4/health" && method === "GET") {
+        return Promise.resolve(okResponse({ ok: true }));
</file context>
Suggested change
const method = opts.method ?? "GET";
const method = opts?.method ?? "GET";
Fix with Cubic

? status.provisioning.regions
: DEFAULT_REGIONS;
setRegionOptions(regions);
setRegion(regions[0].value);
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Do not unconditionally reset region after loading region options; it can overwrite a user’s in-progress selection.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At platform/src/components/CreateInstanceModal.tsx, line 38:

<comment>Do not unconditionally reset `region` after loading region options; it can overwrite a user’s in-progress selection.</comment>

<file context>
@@ -9,16 +9,43 @@ interface CreateInstanceModalProps {
+          ? status.provisioning.regions
+          : DEFAULT_REGIONS;
+        setRegionOptions(regions);
+        setRegion(regions[0].value);
+      })
+      .catch(() => {
</file context>
Suggested change
setRegion(regions[0].value);
setRegion((prev) => (regions.some((r) => r.value === prev) ? prev : regions[0].value));
Fix with Cubic

if (normalized === "cdg") return ["nbg1", "hel1", "fsn1"];
if (normalized === "fra") return ["nbg1", "hel1", "fsn1"];
if (normalized === "ams") return ["nbg1", "hel1", "fsn1"];
return ["nbg1", "hel1", "fsn1"];
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Unknown regions are silently remapped to EU locations instead of being rejected, which can provision instances in the wrong geography.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At platform/server/services/provisioner.ts, line 110:

<comment>Unknown regions are silently remapped to EU locations instead of being rejected, which can provision instances in the wrong geography.</comment>

<file context>
@@ -26,171 +21,387 @@ interface ProvisionInput {
+    if (normalized === "cdg") return ["nbg1", "hel1", "fsn1"];
+    if (normalized === "fra") return ["nbg1", "hel1", "fsn1"];
+    if (normalized === "ams") return ["nbg1", "hel1", "fsn1"];
+    return ["nbg1", "hel1", "fsn1"];
+  }
+
</file context>
Suggested change
return ["nbg1", "hel1", "fsn1"];
throw new Error(`Unsupported region: ${region}`);
Fix with Cubic

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants