fix(platform): wait for instance health before ready by StanGirard · Pull Request #529 · The-Vibe-Company/companion

StanGirard · 2026-03-13T20:53:30Z

Summary

wait for Companion /health before marking Hetzner instances ready
surface Codex resume fallback context reset to the browser
add platform/web tests for the new behavior

Testing

cd platform && bun run test --run server/routes/instances.test.ts
cd platform && bun run typecheck
cd web && bun run test --run server/codex-adapter.test.ts
cd web && bun run typecheck

Review provenance

Implemented by AI agent
Human review: no

…sion restart When a Codex session is relaunched, thread/resume can fail with "no rollout found" if the previous thread's rollout state is stale or missing. Previously this caused the entire init to fail, leaving the session stuck in "exited" state with no way to restart without manual intervention. Now the adapter catches non-transient resume errors and automatically falls back to thread/start, creating a fresh thread. Transient "Transport closed" errors are still retried via the existing backoff mechanism. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Address Greptile review feedback: after falling back from thread/resume to thread/start, update options.threadId with the new thread ID so that subsequent resetForReconnect calls resume the correct (live) thread instead of repeatedly trying the stale one. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…lity

…location

vercel · 2026-03-13T20:53:35Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
companion	Ready	Preview, Comment	Mar 13, 2026 8:56pm

StanGirard · 2026-03-13T20:53:51Z

@Greptile I pushed follow-up fixes on commit 52788ba for the latest PR head. CI is already running on this commit. Please re-review once the current run finishes.

greptile-apps · 2026-03-13T21:01:07Z

Greptile Summary

This PR migrates instance provisioning from Fly.io to Hetzner Cloud and adds two unrelated improvements: a thread/resume fallback in the Codex adapter and conditional Secure cookie behavior for plain-HTTP IPv4 instances.

Key changes:

Hetzner provisioner: A full rewrite of Provisioner using a new HetznerCloudClient. Provisioning now deploys a Docker container via cloud-init user data, with location/server-type retry logic and a new waitForCompanionHealth step that polls GET /health before marking instances ready.
Auth mode: New static_token auth mode returns the raw authSecret directly instead of wrapping it in a JWT. Embed redirects use http:// for raw IPv4 hostnames.
DB schema: flyMachineId/flyVolumeId columns are re-exposed as providerMachineId/providerVolumeId at the application layer while preserving the legacy SQL column names.
/scale endpoint: New route powers off a server, changes its type, and powers it back on.
Codex resume fallback: On non-transient thread/resume failure, the adapter falls back to thread/start, updates its internal threadId, and sends a "context was reset" browser notification.
Managed-auth Secure flag: The Secure cookie attribute is now omitted for direct HTTP requests (raw IPv4 access), enabling auth cookies to work on plain-HTTP Hetzner instances.

Issues found:

Potential auth bypass (provisioner.ts lines 674–677): Cloud-init disables the JWT managed-auth middleware and stores the auth credential under a different env var name than the one the middleware reads. If no separate auth layer in the companion server handles this key when JWT auth is off, instances will be publicly accessible. This needs verification before merging.
Misleading deprovision comment (provisioner.ts line 331): The "may already be stopped" comment is inside the throw branch, not the 404-swallow branch.
Silent 4-minute wait on empty hostname (provisioner.ts lines 147–162): If no public IPv4 is available, waitForCompanionHealth polls an invalid URL for 240 s with no fast-fail.
Deprecated type loops all locations (provisioner.ts lines 283–308): When a deprecated server type error fires, continue advances the inner location loop instead of breaking to the next type, causing unnecessary API calls.

Confidence Score: 2/5

Not safe to merge until the auth env var discrepancy in the cloud-init user data is verified against the companion runtime.
The Hetzner provisioner and routing changes are generally well-structured with good test coverage. However, the cloud-init disables the JWT managed-auth middleware while storing the auth credential under a different env var than the one the middleware reads. If the companion server has no separate code path validating that key when JWT auth is off, every provisioned Hetzner instance will be publicly accessible without authentication. This single issue is a potential critical security regression that blocks a safe merge.
platform/server/services/provisioner.ts — specifically the buildHetznerUserData method's auth env var configuration around lines 670–680.

Important Files Changed

Filename	Overview
platform/server/services/provisioner.ts	Rewrites Provisioner from Fly.io to Hetzner Cloud with health-check wait, location/type retry logic, and a new resize operation. Contains a potential auth bypass (disabled JWT middleware with mismatched env var), a misleading comment in deprovision, silent 4-minute polling on empty hostname, and suboptimal retry loop for deprecated server types.
platform/server/services/hetzner-cloud.ts	New Hetzner Cloud API client with CRUD for servers/volumes, power lifecycle, type-change, action polling, and server-status polling. Clean implementation with proper error propagation.
platform/server/routes/instances.ts	Migrates instance routes from Fly.io to Hetzner, adds /scale endpoint, adds static_token auth mode (returns raw authSecret instead of JWT), uses HTTP for IPv4 embed redirects. DB field references updated to provider-neutral names.
platform/server/db/schema.ts	Renames flyMachineId/flyVolumeId properties to providerMachineId/providerVolumeId while keeping the legacy SQL column names for backward compatibility. Safe migration approach.
web/server/codex-adapter.ts	Adds thread/resume fallback: on non-transient failure, falls back to thread/start, updates internal threadId, and emits a "context was reset" browser message. Transient-error detection uses exact string match which may be fragile.
web/server/middleware/managed-auth.ts	Conditionally omits the Secure cookie flag for direct HTTP (raw IPv4) access, enabling auth cookies to function on plain-HTTP Hetzner instances without breaking HTTPS instances.
platform/src/lib/api.ts	Adds ControlPlaneStatus type, fixes SSE parser for CRLF line endings, and resolves the createInstanceStream promise early on the done event (canceling the reader) rather than waiting for stream close.
platform/src/components/CreateInstanceModal.tsx	Fetches available regions from /api/status on mount and renders them dynamically instead of hardcoding Fly.io regions. Falls back to defaults on error.
platform/server/services/provisioner.test.ts	Comprehensive test suite for Hetzner provisioner covering health-check polling, location/type fallback chains, cloud-init sanitization, resize lifecycle, and geographic isolation. The wait() method is spied out to avoid real delays.
platform/server/routes/instances.test.ts	Updated test suite for Hetzner routes covering provisioning, streaming, delete, scale, token issuance, and embed redirect. Several detailed edge-case tests from the Fly era (404 deprovision still deletes DB row, ensurePublicIps failure) were removed without equivalent replacements.

Sequence Diagram

sequenceDiagram
    participant UI as Browser/UI
    participant Platform as Platform Routes
    participant Prov as Provisioner
    participant HC as HetznerCloudClient
    participant H as Hetzner API
    participant CI as Companion Instance

    UI->>Platform: POST /instances/create-stream
    Platform->>Prov: provision(input)
    loop For each serverType x location candidate
        Prov->>HC: createVolume(name, location, size)
        HC->>H: POST /volumes
        H-->>HC: volume id
        Prov->>HC: createServer(serverType, location, user_data)
        HC->>H: POST /servers
        H-->>HC: server id + action id
        Prov->>HC: waitForAction(actionId)
        HC->>H: GET /actions/:id (poll)
        H-->>HC: action status=success
        Prov->>HC: waitForServerStatus(serverId, running)
        HC->>H: GET /servers/:id (poll)
        H-->>HC: server status=running
        Prov->>CI: GET http://ipv4/health (poll every 5s, up to 4 min)
        CI-->>Prov: HTTP 200 OK
        Note over Prov: Health confirmed, break loop
    end
    Prov-->>Platform: providerMachineId, providerVolumeId, hostname
    Platform->>Platform: INSERT into instances table
    Platform-->>UI: SSE done event with instance data

    UI->>Platform: GET /instances/:id/embed
    Platform->>Platform: resolveAuthMode -> static_token
    Platform-->>UI: 302 redirect to http://ipv4 with token param
    UI->>CI: GET http://ipv4 with token param

Comments Outside Diff (1)

platform/server/services/provisioner.ts, line 674-677 (link)

Potential auth bypass: cloud-init disables managed-auth middleware

The cloud-init user data sets the auth-enabled flag to 0, disabling the companion's JWT middleware. The provisioned auth credential is stored under a different env var key than the one the managed-auth middleware reads.

With the middleware disabled, if no separate auth layer in the companion server validates the credential when the flag is 0, Hetzner instances will accept unauthenticated connections. The token appended to the embed redirect URL will reach the instance but never be enforced.

Please confirm a separate validation path exists in the companion runtime for this token-based mode before merging. If none exists, the flag should be set to 1 and the credential written under the env var key that the managed-auth middleware actually reads.

Prompt To Fix All With AI

This is a comment left during a code review.
Path: platform/server/services/provisioner.ts
Line: 329-336

Comment:
**Misleading comment placed in the wrong branch**

The comment `// Instance may already be stopped or removed.` describes the case where the server is already gone (404), but it sits inside the `if (!this.isNotFoundError(err))` branch — which is the **throw** path, not the swallow path. The comment should be just outside or below the `if` block (on the implicit 404-swallow path).

```suggestion
    } catch (err) {
      if (!this.isNotFoundError(err)) {
        throw err;
      }
      // Instance may already be stopped or removed — safe to continue.
    }
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: platform/server/services/provisioner.ts
Line: 147-162

Comment:
**Silent 4-minute wait when server has no public IPv4**

If `server.public_net?.ipv4?.ip` is empty (e.g., IPv6-only server, or Hetzner API anomaly), `serverIpv4` will be `""`. The call site passes `serverIpv4 || resolvedHostname`, but `resolvedHostname` may also be an IP. If both are empty, this method polls `http:///health` every 5 seconds for **240 seconds** and then fails with `Companion instance at  did not become healthy within 240000ms` — which gives no actionable information.

Consider adding a fast-fail guard:

```typescript
private async waitForCompanionHealth(hostname: string, timeoutMs = 240_000): Promise<void> {
  if (!hostname) throw new Error("Cannot poll companion health: no hostname/IP available");
  const url = `http://${hostname}/health`;
  ...
}
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: platform/server/services/provisioner.ts
Line: 283-308

Comment:
**Deprecated server type loops through all locations needlessly**

When `isDeprecatedServerTypeError(err)` fires, `continue` advances the **inner** `for (const location of candidateLocations)` loop. This means every remaining location is tried with the same deprecated type before the outer loop advances to the next candidate type. This causes unnecessary API calls (Hetzner charges for failed server-create actions) and wasted round-trips.

A `break` on the inner loop (with `continue` on the outer one) would fix this. For example:

```typescript
if (
  this.isInvalidLocationError(err) ||
  this.isUnsupportedServerLocationError(err) ||
  this.isServerLocationDisabledError(err)
) {
  lastError = err;
  continue; // try next location, same server type
}
if (this.isDeprecatedServerTypeError(err)) {
  lastError = err;
  break; // all locations will fail for deprecated type — skip to next type
}
throw err;
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: web/server/codex-adapter.ts
Line: 801-807

Comment:
**Exact-match transient-error check may be brittle**

The fallback guard uses `resumeErr.message === "Transport closed"` with a strict equality check. If the transport layer ever uses a slightly different message (e.g., `"transport closed"`, `"Transport Closed"`, or wraps the underlying error), the condition will silently miss the transient case, triggering a full thread reset and surfacing a "context was reset" banner to the user unnecessarily.

Consider using `includes` or extracting the sentinel string as a constant:

```suggestion
              const isTransport = resumeErr instanceof Error && resumeErr.message.includes("Transport closed");
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: platform/server/services/provisioner.ts
Line: 674-677

Comment:
**Potential auth bypass: cloud-init disables managed-auth middleware**

The cloud-init user data sets the auth-enabled flag to `0`, disabling the companion's JWT middleware. The provisioned auth credential is stored under a different env var key than the one the managed-auth middleware reads.

With the middleware disabled, if no separate auth layer in the companion server validates the credential when the flag is `0`, Hetzner instances will accept unauthenticated connections. The token appended to the embed redirect URL will reach the instance but never be enforced.

Please confirm a separate validation path exists in the companion runtime for this token-based mode before merging. If none exists, the flag should be set to `1` and the credential written under the env var key that the managed-auth middleware actually reads.

How can I resolve this? If you propose a fix, please make it concise.

_{Last reviewed commit: 9cb13ba}

cubic-dev-ai

4 issues found across 27 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="platform/src/components/CreateInstanceModal.tsx">

<violation number="1" location="platform/src/components/CreateInstanceModal.tsx:38">
P2: Do not unconditionally reset `region` after loading region options; it can overwrite a user’s in-progress selection.</violation>
</file>

<file name="platform/server/services/provisioner.ts">

<violation number="1" location="platform/server/services/provisioner.ts:110">
P2: Unknown regions are silently remapped to EU locations instead of being rejected, which can provision instances in the wrong geography.</violation>
</file>

<file name="platform/server/routes/instances.ts">

<violation number="1" location="platform/server/routes/instances.ts:427">
P1: Scaling older instances explicitly assigns `authMode: "static_token"`, which breaks their original implicit `"managed_jwt"` fallback authentication.</violation>
</file>

<file name="platform/server/services/provisioner.test.ts">

<violation number="1" location="platform/server/services/provisioner.test.ts:251">
P1: Using `opts.method` throws a `TypeError` when `fetch` is called with just a URL (as done in `waitForCompanionHealth`). Use optional chaining (`opts?.method`) instead. Note: Please fix all 6 occurrences of this across the fallback tests.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

cubic-dev-ai · 2026-03-13T21:05:01Z

+    .update(instancesTable)
+    .set({
+      machineStatus: "started",
+      config: { ...currentConfig, plan, provider: "hetzner", authMode: currentConfig.authMode || "static_token" },


P1: Scaling older instances explicitly assigns authMode: "static_token", which breaks their original implicit "managed_jwt" fallback authentication.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At platform/server/routes/instances.ts, line 427: <comment>Scaling older instances explicitly assigns `authMode: "static_token"`, which breaks their original implicit `"managed_jwt"` fallback authentication.</comment> <file context> @@ -498,6 +399,38 @@ instances.post("/:id/restart", async (c) => { + .update(instancesTable) + .set({ + machineStatus: "started", + config: { ...currentConfig, plan, provider: "hetzner", authMode: currentConfig.authMode || "static_token" }, + }) + .where(eq(instancesTable.id, id)); </file context>

Suggested change

config: { ...currentConfig, plan, provider: "hetzner", authMode: currentConfig.authMode || "static_token" },

config: { ...currentConfig, plan, provider: "hetzner", authMode: resolveAuthMode(row) },

cubic-dev-ai · 2026-03-13T21:05:01Z

-      setupProvisionFetchMock({ machineState: "stopped" });
+  it("falls back to next location when primary location is unavailable", async () => {
+    fetchMock.mockImplementation((url: string, opts: RequestInit) => {
+      const method = opts.method ?? "GET";


P1: Using opts.method throws a TypeError when fetch is called with just a URL (as done in waitForCompanionHealth). Use optional chaining (opts?.method) instead. Note: Please fix all 6 occurrences of this across the fallback tests.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At platform/server/services/provisioner.test.ts, line 251: <comment>Using `opts.method` throws a `TypeError` when `fetch` is called with just a URL (as done in `waitForCompanionHealth`). Use optional chaining (`opts?.method`) instead. Note: Please fix all 6 occurrences of this across the fallback tests.</comment> <file context> @@ -46,329 +22,613 @@ function okResponse(data: unknown): Response { - setupProvisionFetchMock({ machineState: "stopped" }); + it("falls back to next location when primary location is unavailable", async () => { + fetchMock.mockImplementation((url: string, opts: RequestInit) => { + const method = opts.method ?? "GET"; + if (url === "http://1.2.3.4/health" && method === "GET") { + return Promise.resolve(okResponse({ ok: true })); </file context>

Suggested change

const method = opts.method ?? "GET";

const method = opts?.method ?? "GET";

cubic-dev-ai · 2026-03-13T21:05:01Z

+          ? status.provisioning.regions
+          : DEFAULT_REGIONS;
+        setRegionOptions(regions);
+        setRegion(regions[0].value);


P2: Do not unconditionally reset region after loading region options; it can overwrite a user’s in-progress selection.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At platform/src/components/CreateInstanceModal.tsx, line 38: <comment>Do not unconditionally reset `region` after loading region options; it can overwrite a user’s in-progress selection.</comment> <file context> @@ -9,16 +9,43 @@ interface CreateInstanceModalProps { + ? status.provisioning.regions + : DEFAULT_REGIONS; + setRegionOptions(regions); + setRegion(regions[0].value); + }) + .catch(() => { </file context>

Suggested change

setRegion(regions[0].value);

setRegion((prev) => (regions.some((r) => r.value === prev) ? prev : regions[0].value));

cubic-dev-ai · 2026-03-13T21:05:01Z

+    if (normalized === "cdg") return ["nbg1", "hel1", "fsn1"];
+    if (normalized === "fra") return ["nbg1", "hel1", "fsn1"];
+    if (normalized === "ams") return ["nbg1", "hel1", "fsn1"];
+    return ["nbg1", "hel1", "fsn1"];


P2: Unknown regions are silently remapped to EU locations instead of being rejected, which can provision instances in the wrong geography.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At platform/server/services/provisioner.ts, line 110: <comment>Unknown regions are silently remapped to EU locations instead of being rejected, which can provision instances in the wrong geography.</comment> <file context> @@ -26,171 +21,387 @@ interface ProvisionInput { + if (normalized === "cdg") return ["nbg1", "hel1", "fsn1"]; + if (normalized === "fra") return ["nbg1", "hel1", "fsn1"]; + if (normalized === "ams") return ["nbg1", "hel1", "fsn1"]; + return ["nbg1", "hel1", "fsn1"]; + } + </file context>

Suggested change

return ["nbg1", "hel1", "fsn1"];

throw new Error(`Unsupported region: ${region}`);

Codex-Agent and others added 14 commits March 13, 2026 15:08

feat(platform): add hetzner provider and instance scaling

8db1317

test(platform): raise hetzner provisioning coverage

121fce5

fix(platform): address hetzner provisioning and scaling review findings

55c8f39

refactor(platform): remove fly provider and standardize on hetzner

298eee7

fix(platform): add hetzner location fallback and restore db compatibi…

768b881

…lity

fix(platform): retry provisioning when server type is unavailable in …

400befe

…location

fix(platform): keep hetzner fallback within selected geography

3ed26ad

fix(platform): harden hetzner provisioning lifecycle

7fc69b5

fix(platform): retry hetzner server types by region

d27b3c5

fix(platform): handle hetzner regional capacity fallback

cfa373c

fix(platform): align instance auth mode handling

4a57025

fix(platform): wait for instance health before ready

52788ba

chore(ci): retrigger pr checks

9cb13ba

vercel Bot deployed to Preview March 13, 2026 20:56 View deployment

cubic-dev-ai Bot reviewed Mar 13, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(platform): wait for instance health before ready#529

fix(platform): wait for instance health before ready#529
StanGirard wants to merge 15 commits intomainfrom
feat/hetzner-provider-scaling

StanGirard commented Mar 13, 2026

Uh oh!

vercel Bot commented Mar 13, 2026 •

edited

Loading

Uh oh!

StanGirard commented Mar 13, 2026

Uh oh!

greptile-apps Bot commented Mar 13, 2026 •

edited

Loading

Comments Outside Diff (1)

Uh oh!

cubic-dev-ai Bot left a comment

Uh oh!

cubic-dev-ai Bot Mar 13, 2026 •

edited

Loading

Uh oh!

cubic-dev-ai Bot Mar 13, 2026 •

edited

Loading

Uh oh!

cubic-dev-ai Bot Mar 13, 2026 •

edited

Loading

Uh oh!

cubic-dev-ai Bot Mar 13, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	config: { ...currentConfig, plan, provider: "hetzner", authMode: currentConfig.authMode \|\| "static_token" },
	config: { ...currentConfig, plan, provider: "hetzner", authMode: resolveAuthMode(row) },

	const method = opts.method ?? "GET";
	const method = opts?.method ?? "GET";

	setRegion(regions[0].value);
	setRegion((prev) => (regions.some((r) => r.value === prev) ? prev : regions[0].value));

	return ["nbg1", "hel1", "fsn1"];
	throw new Error(`Unsupported region: ${region}`);

Conversation

StanGirard commented Mar 13, 2026

Summary

Testing

Review provenance

Uh oh!

vercel Bot commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

StanGirard commented Mar 13, 2026

Uh oh!

greptile-apps Bot commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 2/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (1)

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vercel Bot commented Mar 13, 2026 •

edited

Loading

greptile-apps Bot commented Mar 13, 2026 •

edited

Loading

cubic-dev-ai Bot Mar 13, 2026 •

edited

Loading

cubic-dev-ai Bot Mar 13, 2026 •

edited

Loading

cubic-dev-ai Bot Mar 13, 2026 •

edited

Loading

cubic-dev-ai Bot Mar 13, 2026 •

edited

Loading