Skip to content

awf-api-proxy container fails health check intermittently, causing hard failures with misleading error #2246

@bryanchen-d

Description

@bryanchen-d

Summary

The awf-api-proxy container intermittently fails its Docker health check during startup, causing the entire AWF-wrapped step to hard-fail. This has been observed recurring at a ~33% rate on Azure-hosted runners in microsoft/vscode-engineering today (2 of 6 errors-investigate workflow runs failed the same way).

Observed failure

In the Execute GitHub Copilot CLI step, the awf-api-proxy container starts successfully but never passes its health check within the timeout:

Container awf-api-proxy  Starting
Container awf-api-proxy  Started
Container awf-squid      Waiting
Container awf-api-proxy  Waiting
Container awf-squid      Healthy        ✅
Container awf-api-proxy  Error          ❌
dependency failed to start: container awf-api-proxy is unhealthy
[ERROR] Failed to start containers: Error: Command failed with exit code 1: docker compose up -d --pull never

The awf-squid container becomes healthy fine — the issue is isolated to awf-api-proxy.

Example runs

Both hit the identical failure within the same ~1 hour window on 2026-04-27.

Two problems

1. No retry / tolerance on container health check failure

The awf CLI invocation has no retry mechanism. A transient health check miss hard-fails the entire step with exit code 1, with no opportunity to recover. A retry or an increased --health-start-period for awf-api-proxy would handle these transient cases.

2. Misleading surfaced error message

When awf-api-proxy fails to start, the detection model never runs, so no THREAT_DETECTION_RESULT is written to the log. The downstream parse step then surfaces:

❌ No THREAT_DETECTION_RESULT found in detection log. The detection model may have failed to follow the output format.

This is misleading — the model never ran at all. The root cause (container health failure) is buried in the raw logs. It would help to detect this case and surface a clearer error like: "AWF firewall failed to start (awf-api-proxy unhealthy) — the detection model was never invoked."

Suggested mitigations

  1. Retry docker compose up once when awf-api-proxy fails its health check — most transient failures would recover on a second attempt
  2. Increase the health check start period for awf-api-proxy if the container needs more time to initialize on slower runners
  3. Improve error messaging — detect the container startup failure and emit a clear top-level error before falling through to the parse step
  4. Expose awf-api-proxy container logs in the failure output to aid diagnosis (currently the compose error has no stderr/stdout attached)

Environment

  • AWF version: v0.25.22
  • Runner: ubuntu-24.04 (Azure-hosted, ubuntu24/20260426.100)
  • awf-api-proxy image: ghcr.io/github/gh-aw-firewall/api-proxy:0.25.22

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions