Commit 3316560

lalalune and claude committed
fix(hetzner-e2e): classify 403 limit_reached as quota_exceeded, not missing_token
Run 26077765355 surfaced a misleading "missing_token" error on a Hetzner project that actually hit its server cap. The API returned HTTP 403 with body `{ error: { code: "limit_reached", message: "server limit reached" } }`, but `mapStatusToCode` short-circuited on status 401/403 before reading the apiCode, collapsing quota exhaustion and missing-token into one bucket.

Layer 1: reorder `mapStatusToCode` so the explicit `limit_reached` / `resource_limit_exceeded` apiCode wins over the status-only auth fallback. This is the only correct mapping: status 403 with `limit_reached` is a project quota issue, not an auth problem, and operators should not be told to refresh a token that is working fine.

Provision script: extend `isRetryableCombo` to also retry the next fallback combo on `quota_exceeded` (and the literal "server limit reached" / "limit_reached" / "resource_limit_exceeded" message strings, so we stay correct if the error is rewrapped). The fallback ladder is finite, so a genuinely exhausted project still surfaces a single, clear error after the ladder is exhausted, with an operator-facing hint ("delete leaked CI servers in https://console.hetzner.cloud/ or rotate HCLOUD_TOKEN_CI to a project with capacity"). A non-retryable 401/403 also now prints a hint pointing at the GitHub `ci-hetzner-e2e` environment.

Layer 2: workflow now captures the provision step's combined output to a log and, on failure, writes a categorized diagnostic to GITHUB_STEP_SUMMARY (quota vs auth vs unknown) so the operator opening the failed run sees the actionable next step at the top instead of buried in the step log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 9e3a8ec commit 3316560

3 files changed

Lines changed: 97 additions & 9 deletions

.github/workflows/hetzner-e2e.yml

Lines changed: 32 additions & 1 deletion
@@ -96,8 +96,39 @@ jobs:
         run: bun install --no-save --ignore-scripts
 
       - name: Provision Hetzner server
+        id: provision
         if: steps.secret_config.outputs.configured == 'true'
-        run: bun run packages/scripts/cloud/admin/hetzner-e2e/hetzner-e2e-provision.ts
+        run: |
+          set -o pipefail
+          log=/tmp/hetzner-e2e-provision.log
+          bun run packages/scripts/cloud/admin/hetzner-e2e/hetzner-e2e-provision.ts 2>&1 | tee "$log"
+
+      - name: Surface provision failure diagnostic
+        if: always() && steps.secret_config.outputs.configured == 'true' && steps.provision.outcome == 'failure'
+        run: |
+          log=/tmp/hetzner-e2e-provision.log
+          {
+            echo "### Hetzner E2E provisioning failed"
+            echo ""
+            if [ -f "$log" ] && grep -qE "quota exhausted|server limit reached|limit_reached|resource_limit_exceeded|quota_exceeded" "$log"; then
+              echo "**Cause:** Hetzner project quota exhausted (server cap reached)."
+              echo ""
+              echo "Operator action:"
+              echo "1. Open https://console.hetzner.cloud/ and check the CI project for leaked servers (filter labels \`ci=true\`, \`workflow=hetzner-e2e\`)."
+              echo "2. Delete any leaked servers, or wait for the half-hourly reaper workflow to sweep them."
+              echo "3. Re-run this workflow."
+              echo ""
+              echo "If the project itself has a tighter cap than the fallback ladder can survive, request a quota increase from Hetzner support or rotate \`HCLOUD_TOKEN_CI\` to a project with capacity."
+            elif [ -f "$log" ] && grep -qE "Hetzner rejected the token|missing_token|HTTP 401|HTTP 403" "$log"; then
+              echo "**Cause:** Hetzner rejected the API token (HTTP 401/403)."
+              echo ""
+              echo "Operator action: refresh \`HCLOUD_TOKEN_CI\` in the \`ci-hetzner-e2e\` GitHub environment, or confirm the project is still active."
+            else
+              echo "**Cause:** see the \"Provision Hetzner server\" step log for the raw error."
+              echo ""
+              echo "If the error message is unfamiliar, check \`packages/scripts/cloud/admin/hetzner-e2e/README.md\` and the fallback ladder in \`hetzner-e2e-provision.ts\`."
+            fi
+          } >> "$GITHUB_STEP_SUMMARY"
 
       - name: Wait for host ready
         if: steps.secret_config.outputs.configured == 'true'
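
The categorization in the new workflow step depends on check order: a quota failure log may also contain an "HTTP 403" line, so the quota patterns have to be tested before the auth patterns. The sketch below restates that logic in TypeScript purely for illustration. The function name `classifyProvisionLog` and the `ProvisionFailureCause` type are hypothetical and do not exist in the repository; the two patterns are copied from the `grep -qE` expressions in the step.

```typescript
// Hypothetical restatement of the workflow's grep-based triage.
// Names are illustrative; only the two patterns come from the diff.
type ProvisionFailureCause = "quota" | "auth" | "unknown";

const QUOTA_PATTERN =
  /quota exhausted|server limit reached|limit_reached|resource_limit_exceeded|quota_exceeded/;
const AUTH_PATTERN = /Hetzner rejected the token|missing_token|HTTP 401|HTTP 403/;

function classifyProvisionLog(log: string): ProvisionFailureCause {
  // Quota first: a quota log can also mention HTTP 403, which would
  // otherwise be reported as an auth failure.
  if (QUOTA_PATTERN.test(log)) return "quota";
  if (AUTH_PATTERN.test(log)) return "auth";
  return "unknown";
}

// Example inputs (made up for the sketch, not real run output).
console.log(classifyProvisionLog("HTTP 403: server limit reached"));
console.log(classifyProvisionLog("HTTP 401: unauthorized"));
console.log(classifyProvisionLog("socket hang up"));
```

Like `grep -E` without `-i`, these patterns are case-sensitive, so the match relies on the provision script emitting the strings in exactly this casing.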

packages/cloud-shared/src/lib/services/containers/hetzner-cloud-api.ts

Lines changed: 9 additions & 3 deletions
@@ -460,13 +460,19 @@ export class HetznerCloudClient {
 // ---------------------------------------------------------------------------
 
 function mapStatusToCode(status: number, apiCode?: string): HetznerCloudErrorCode {
+  // Explicit quota/limit apiCodes win over auth-status fallback: Hetzner
+  // returns HTTP 403 with body code `limit_reached` (or
+  // `resource_limit_exceeded`) when the project's server cap is hit. Without
+  // this priority, `status === 403` collapses both "no token" and "quota
+  // exhausted" into `missing_token`, which sends operators chasing a
+  // non-existent auth bug while the real issue is account quota.
+  if (apiCode === "limit_reached" || apiCode === "resource_limit_exceeded") {
+    return "quota_exceeded";
+  }
   if (status === 404) return "not_found";
   if (status === 401 || status === 403) return "missing_token";
   if (status === 422 || status === 400) return "invalid_input";
   if (status === 429) return "rate_limited";
-  if (apiCode === "limit_reached" || apiCode === "resource_limit_exceeded") {
-    return "quota_exceeded";
-  }
   return "server_error";
 }
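
To make the effect of the reordering concrete, the following sketch places the pre-patch and post-patch orderings side by side. It is a standalone illustration, not the module itself: `ErrorCodeSketch` is a hypothetical stand-in that lists only the codes visible in this hunk (the real `HetznerCloudErrorCode` union may contain more), and `mapStatusToCodeBefore` is a name invented here for the old ordering.

```typescript
// Hypothetical stand-in for HetznerCloudErrorCode; only the members
// visible in the diff are listed.
type ErrorCodeSketch =
  | "quota_exceeded"
  | "not_found"
  | "missing_token"
  | "invalid_input"
  | "rate_limited"
  | "server_error";

function isQuotaApiCode(apiCode?: string): boolean {
  return apiCode === "limit_reached" || apiCode === "resource_limit_exceeded";
}

// Pre-patch ordering: the status checks run first, so the apiCode is
// never consulted for a 403 response.
function mapStatusToCodeBefore(status: number, apiCode?: string): ErrorCodeSketch {
  if (status === 404) return "not_found";
  if (status === 401 || status === 403) return "missing_token";
  if (status === 422 || status === 400) return "invalid_input";
  if (status === 429) return "rate_limited";
  if (isQuotaApiCode(apiCode)) return "quota_exceeded";
  return "server_error";
}

// Post-patch ordering: an explicit quota apiCode wins over any status.
function mapStatusToCode(status: number, apiCode?: string): ErrorCodeSketch {
  if (isQuotaApiCode(apiCode)) return "quota_exceeded";
  if (status === 404) return "not_found";
  if (status === 401 || status === 403) return "missing_token";
  if (status === 422 || status === 400) return "invalid_input";
  if (status === 429) return "rate_limited";
  return "server_error";
}

console.log(mapStatusToCodeBefore(403, "limit_reached")); // "missing_token"
console.log(mapStatusToCode(403, "limit_reached")); // "quota_exceeded"
console.log(mapStatusToCode(403)); // "missing_token"
```

A 403 without a quota apiCode still maps to `missing_token` after the patch, so genuine auth failures are classified as before.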

packages/scripts/cloud/admin/hetzner-e2e/hetzner-e2e-provision.ts

Lines changed: 56 additions & 5 deletions
@@ -15,7 +15,10 @@
  * any further work — so a crash never leaks a server.
  */
 
-import { HetznerCloudClient } from "@elizaos/cloud-shared/lib/services/containers/hetzner-cloud-api";
+import {
+  HetznerCloudClient,
+  HetznerCloudError,
+} from "@elizaos/cloud-shared/lib/services/containers/hetzner-cloud-api";
 import { appendStateAtomic } from "./state-file";
 
 function requireEnv(name: string): string {
@@ -48,10 +51,18 @@ const SERVER_TYPE_FALLBACKS: ReadonlyArray<{
 // Hetzner's "this server type can't be created here" and "this server
 // type is going away" responses — both render the requested combo
 // unusable and a different shared-cpu type / location is the natural
-// remediation. Auth / quota / billing failures (HTTP 401/402/403) are
-// NOT in this list — those are surfaced unchanged so the operator
+// remediation. We also treat project-wide quota exhaustion as retryable
+// because the pre-reap pass runs immediately before the loop: if it
+// freed any slots, the next attempt may now fit under the cap. The
+// fallback ladder is finite (~5 combos) so a genuinely exhausted project
+// will still surface as the last combo's error after the loop exits.
+// Pure auth / billing failures (HTTP 401, real 403 without limit code)
+// are NOT in this list — those are surfaced unchanged so the operator
 // fixes the underlying account issue.
 function isRetryableCombo(err: unknown): boolean {
+  if (err instanceof HetznerCloudError && err.code === "quota_exceeded") {
+    return true;
+  }
   const message = err instanceof Error ? err.message.toLowerCase() : "";
   return (
     message.includes("unsupported_server_type_for_location") ||
@@ -60,7 +71,14 @@ function isRetryableCombo(err: unknown): boolean {
     message.includes("is deprecated") ||
     message.includes("server_type_deprecated") ||
     message.includes("resource_unavailable") ||
-    message.includes("not_found") // Hetzner returns 404 when a deprecated type is fully removed
+    message.includes("not_found") || // Hetzner returns 404 when a deprecated type is fully removed
+    // Hetzner returns HTTP 403 with body `{ error: { code: "limit_reached",
+    // message: "server limit reached" } }` when the project cap is hit.
+    // Match both the apiCode and the human message so we stay correct even
+    // if mapStatusToCode is bypassed (e.g. transport layer wraps the body).
+    message.includes("server limit reached") ||
+    message.includes("limit_reached") ||
+    message.includes("resource_limit_exceeded")
   );
 }
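
The two-tier check in `isRetryableCombo` (typed error code first, message substrings as a safety net) can be exercised in isolation. The sketch below is an approximation under stated assumptions: `HetznerCloudErrorStub` is a hypothetical stand-in whose constructor shape is invented for the example (the real `HetznerCloudError` constructor is not shown in this diff), and `isRetryableComboSketch` includes only the substrings visible in the two hunks above, so any patterns on the elided lines between them are missing.

```typescript
// Hypothetical stand-in for HetznerCloudError; the real constructor
// signature is not visible in the diff.
class HetznerCloudErrorStub extends Error {
  readonly code: string;
  constructor(code: string, message: string) {
    super(message);
    this.code = code;
    this.name = "HetznerCloudErrorStub";
    // Keeps instanceof working if the code is compiled to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Only the substrings visible in the diff; elided lines are not guessed.
const RETRYABLE_SUBSTRINGS: ReadonlyArray<string> = [
  "unsupported_server_type_for_location",
  "is deprecated",
  "server_type_deprecated",
  "resource_unavailable",
  "not_found",
  "server limit reached",
  "limit_reached",
  "resource_limit_exceeded",
];

function isRetryableComboSketch(err: unknown): boolean {
  // Tier 1: the typed code set by mapStatusToCode.
  if (err instanceof HetznerCloudErrorStub && err.code === "quota_exceeded") {
    return true;
  }
  // Tier 2: substring match, which still works if the error was rewrapped
  // into a plain Error that lost the typed code.
  const message = err instanceof Error ? err.message.toLowerCase() : "";
  return RETRYABLE_SUBSTRINGS.some((needle) => message.includes(needle));
}

console.log(isRetryableComboSketch(new HetznerCloudErrorStub("quota_exceeded", "cap hit")));
console.log(isRetryableComboSketch(new Error("Server limit reached")));
console.log(isRetryableComboSketch(new HetznerCloudErrorStub("missing_token", "HTTP 401")));
```

Because the message is lowercased before matching, a rewrapped "Server limit reached" is still treated as retryable, while a `missing_token` error with no matching substring is not.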

@@ -179,14 +197,47 @@ async function main(): Promise<void> {
       break;
     } catch (err) {
       lastError = err;
-      if (!isRetryableCombo(err)) throw err;
+      if (!isRetryableCombo(err)) {
+        // Surface a hint before propagating: a non-retryable failure on
+        // the first attempt is almost always an auth/account problem
+        // (missing or stale HCLOUD_TOKEN_CI, project disabled). Without
+        // this, the workflow log just shows the bare HetznerCloudError
+        // and the operator has to guess.
+        if (
+          err instanceof HetznerCloudError &&
+          err.code === "missing_token"
+        ) {
+          console.error(
+            "[hetzner-e2e-provision] Hetzner rejected the token (HTTP 401/403). " +
+              "Refresh HCLOUD_TOKEN_CI in the ci-hetzner-e2e GitHub environment, or verify the project is active.",
+          );
+        }
+        throw err;
+      }
       const reason = err instanceof Error ? err.message : String(err);
       console.error(
         `[hetzner-e2e-provision] ${attempt.serverType}@${attempt.location} unavailable (${reason}); trying next fallback`,
       );
     }
   }
   if (!server) {
+    // Layer 2 diagnostic: when every fallback combo also failed with
+    // quota_exceeded, the operator needs a single actionable next step —
+    // not a stack of "unavailable" lines that look like a transient API
+    // glitch. Refresh `HCLOUD_TOKEN_CI` only if the token's project no
+    // longer matches the CI project; otherwise delete leaked servers in
+    // the Hetzner console (https://console.hetzner.cloud/) and re-run.
+    if (
+      lastError instanceof HetznerCloudError &&
+      lastError.code === "quota_exceeded"
+    ) {
+      throw new Error(
+        `Hetzner project quota exhausted across all fallback combos (last error: ${lastError.message}). ` +
+          "Operator action required: pre-reap freed nothing in 20min window, and the workflow cannot proceed. " +
+          "Check https://console.hetzner.cloud/ for leaked CI servers (label ci=true, workflow=hetzner-e2e), " +
+          "or rotate HCLOUD_TOKEN_CI if it now points to a project with a tighter cap.",
+      );
+    }
     throw lastError instanceof Error
       ? lastError
       : new Error("Hetzner provisioning failed across all fallback combos");
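
The control flow of the fallback ladder in this hunk can be summarized in a small standalone sketch: retry while the error is retryable, rethrow immediately otherwise, and convert a ladder that ended on `quota_exceeded` into a single summary error. Everything named here is hypothetical: `StubCloudError`, `ComboSketch`, and `provisionWithLadder` are not from the repository, the retry predicate is reduced to the quota case only, and the function is synchronous for brevity whereas the real `main()` awaits the API call.

```typescript
// All names below are illustrative; the real script is async and uses
// HetznerCloudError and isRetryableCombo from the diff above.
type ComboSketch = { serverType: string; location: string };

class StubCloudError extends Error {
  readonly code: string;
  constructor(code: string, message: string) {
    super(message);
    this.code = code;
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

function isQuotaError(err: unknown): err is StubCloudError {
  return err instanceof StubCloudError && err.code === "quota_exceeded";
}

function provisionWithLadder<T>(
  combos: ReadonlyArray<ComboSketch>,
  create: (combo: ComboSketch) => T,
): T {
  let lastError: unknown = undefined;
  for (const combo of combos) {
    try {
      return create(combo);
    } catch (err) {
      lastError = err;
      // Non-retryable failures (for example auth) stop the ladder at once.
      if (!isQuotaError(err)) throw err;
    }
  }
  // Ladder exhausted: turn a trailing quota error into one clear message.
  if (isQuotaError(lastError)) {
    throw new Error(
      `Hetzner project quota exhausted across all fallback combos (last error: ${lastError.message}).`,
    );
  }
  throw lastError instanceof Error
    ? lastError
    : new Error("Hetzner provisioning failed across all fallback combos");
}

// Usage with placeholder combos: the first attempt hits the cap, the second succeeds.
const ladder: ReadonlyArray<ComboSketch> = [
  { serverType: "type-a", location: "loc-1" },
  { serverType: "type-b", location: "loc-2" },
];
let attempts = 0;
const created = provisionWithLadder(ladder, (combo) => {
  attempts += 1;
  if (attempts === 1) throw new StubCloudError("quota_exceeded", "server limit reached");
  return `${combo.serverType}@${combo.location}`;
});
console.log(created);
```

Because the ladder is finite, a project that is genuinely at its cap fails every rung and ends with the single summary error rather than looping.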
