Summary
The awf-api-proxy container intermittently fails its Docker health check during startup, causing the entire AWF-wrapped step to hard-fail. This has been observed recurring at a ~33% rate on Azure-hosted runners in microsoft/vscode-engineering today (2 of 6 errors-investigate workflow runs failed the same way).
Observed failure
In the Execute GitHub Copilot CLI step, the awf-api-proxy container starts successfully but never passes its health check within the timeout:
Container awf-api-proxy Starting
Container awf-api-proxy Started
Container awf-squid Waiting
Container awf-api-proxy Waiting
Container awf-squid Healthy ✅
Container awf-api-proxy Error ❌
dependency failed to start: container awf-api-proxy is unhealthy
[ERROR] Failed to start containers: Error: Command failed with exit code 1: docker compose up -d --pull never
The awf-squid container becomes healthy fine — the issue is isolated to awf-api-proxy.
Example runs
Both hit the identical failure within the same ~1 hour window on 2026-04-27.
Two problems
1. No retry / tolerance on container health check failure
The awf CLI invocation has no retry mechanism. A transient health check miss hard-fails the entire step with exit code 1, with no opportunity to recover. A retry or an increased --health-start-period for awf-api-proxy would handle these transient cases.
2. Misleading surfaced error message
When awf-api-proxy fails to start, the detection model never runs, so no THREAT_DETECTION_RESULT is written to the log. The downstream parse step then surfaces:
❌ No THREAT_DETECTION_RESULT found in detection log. The detection model may have failed to follow the output format.
This is misleading — the model never ran at all. The root cause (container health failure) is buried in the raw logs. It would help to detect this case and surface a clearer error like: "AWF firewall failed to start (awf-api-proxy unhealthy) — the detection model was never invoked."
Suggested mitigations
- Retry
docker compose up once when awf-api-proxy fails its health check — most transient failures would recover on a second attempt
- Increase the health check start period for
awf-api-proxy if the container needs more time to initialize on slower runners
- Improve error messaging — detect the container startup failure and emit a clear top-level error before falling through to the parse step
- Expose
awf-api-proxy container logs in the failure output to aid diagnosis (currently the compose error has no stderr/stdout attached)
Environment
- AWF version:
v0.25.22
- Runner:
ubuntu-24.04 (Azure-hosted, ubuntu24/20260426.100)
awf-api-proxy image: ghcr.io/github/gh-aw-firewall/api-proxy:0.25.22
Summary
The
awf-api-proxycontainer intermittently fails its Docker health check during startup, causing the entire AWF-wrapped step to hard-fail. This has been observed recurring at a ~33% rate on Azure-hosted runners inmicrosoft/vscode-engineeringtoday (2 of 6errors-investigateworkflow runs failed the same way).Observed failure
In the
Execute GitHub Copilot CLIstep, theawf-api-proxycontainer starts successfully but never passes its health check within the timeout:The
awf-squidcontainer becomes healthy fine — the issue is isolated toawf-api-proxy.Example runs
detection, job ID73222273375)detection, job ID73228631399)Both hit the identical failure within the same ~1 hour window on 2026-04-27.
Two problems
1. No retry / tolerance on container health check failure
The
awfCLI invocation has no retry mechanism. A transient health check miss hard-fails the entire step with exit code 1, with no opportunity to recover. A retry or an increased--health-start-periodforawf-api-proxywould handle these transient cases.2. Misleading surfaced error message
When
awf-api-proxyfails to start, the detection model never runs, so noTHREAT_DETECTION_RESULTis written to the log. The downstream parse step then surfaces:This is misleading — the model never ran at all. The root cause (container health failure) is buried in the raw logs. It would help to detect this case and surface a clearer error like: "AWF firewall failed to start (awf-api-proxy unhealthy) — the detection model was never invoked."
Suggested mitigations
docker compose uponce whenawf-api-proxyfails its health check — most transient failures would recover on a second attemptawf-api-proxyif the container needs more time to initialize on slower runnersawf-api-proxycontainer logs in the failure output to aid diagnosis (currently the compose error has no stderr/stdout attached)Environment
v0.25.22ubuntu-24.04(Azure-hosted,ubuntu24/20260426.100)awf-api-proxyimage:ghcr.io/github/gh-aw-firewall/api-proxy:0.25.22