[awf] agent+api-proxy: P0/P1 failure cluster — api-proxy health check + node not found in copilot container #2294

@lpcox

Problem

This is a combined P0/P1 failure report covering 28 failed/cancelled runs out of 40 total over the 07:00–13:00 UTC window on 2026-04-28. Two distinct failure clusters were identified:

P0 — awf-api-proxy health check timeout (at least 3 confirmed runs): Affects Sub-Issue Closer (copilot), Daily Team Evolution Insights (claude), and Smoke CI (copilot). The container starts, but its health check times out, which blocks docker compose up entirely.

P1 — node: command not found in copilot agent container: At least 2 scheduled copilot-engine workflows fail inside the container because node is not on PATH. Affects: Daily Issues Report Generator, Daily News.

Context

Parent investigation: github/gh-aw#28947. The api-proxy health check failure is also separately tracked.

Root Cause

P0 (api-proxy health check): See separate tracking issue for #28898/#28949 — tight healthcheck timing under runner load.

P1 (node not found): The containers/agent/entrypoint.sh script chroots to /host and drops capabilities before running the user command. The PATH inside the chroot may not include the directory where the host's node binary lives (e.g., /usr/local/bin or /opt/hostedtoolcache/node/.../bin). The bind mounts include /usr, /bin, and /sbin read-only, but /usr/local and the nvm/volta or tool-cache locations (for example under /opt) may not be fully covered, so node can exist on the host and still be unreachable from inside the chroot.
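To make the suspected failure mode concrete, here is a minimal sketch of how a PATH lookup behaves once the process root is /host. It is not code from the repository: `resolveInChroot` and `isExecutable` are hypothetical names, and the sketch ignores symlink resolution and the POSIX rule that an empty PATH entry means the current directory.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

/** True when `file` is a regular file the current user may execute. */
function isExecutable(file: string): boolean {
  try {
    fs.accessSync(file, fs.constants.X_OK);
    return fs.statSync(file).isFile();
  } catch {
    return false;
  }
}

/**
 * Resolve `cmd` against `pathValue` as it would be seen after chroot(root).
 * Returns the in-chroot path of the first match, or null when nothing matches.
 * `root` is an assumption here ("/host" per the issue text).
 */
function resolveInChroot(
  cmd: string,
  pathValue: string,
  root: string,
  check: (file: string) => boolean = isExecutable,
): string | null {
  for (const dir of pathValue.split(":")) {
    if (dir === "") continue; // simplification: skip empty entries
    const inChroot = path.posix.join(dir, cmd);
    // From outside the chroot, the same file lives under `root`.
    if (check(path.posix.join(root, inChroot))) return inChroot;
  }
  return null;
}
```

Run from outside the chroot, `resolveInChroot("node", process.env.PATH ?? "", "/host")` returning null would be consistent with the "node: command not found" symptom: either the directory is missing from PATH or it is not mounted under /host.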

Proposed Solution

P1 fix:

  1. In containers/agent/entrypoint.sh: After chroot, construct PATH by including common Node.js installation locations: /usr/local/bin, /opt/hostedtoolcache/node/*/bin (glob expanded), /home/runner/.nvm/versions/node/*/bin.
  2. Alternatively: Add a --extra-path <dir> CLI flag in src/cli.ts that appends additional directories to the agent container PATH.
  3. In src/docker-manager.ts: When generating the Docker Compose env section, include the host's PATH value or a curated superset to ensure tool availability inside the chroot.
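Options 1 and 3 both amount to building a curated PATH from the host's value plus known Node.js locations. The sketch below shows one way that could look in TypeScript. It is illustrative only: `buildAgentPath`, `expandPattern`, `FsProbe`, and `NODE_DIR_PATTERNS` are hypothetical names, not existing code in src/docker-manager.ts, and the patterns are copied from the list above, so they should be verified against the actual runner image layout.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Candidate directories from the issue text; "/*/" matches one directory level.
const NODE_DIR_PATTERNS: string[] = [
  "/usr/local/bin",
  "/opt/hostedtoolcache/node/*/bin",
  "/home/runner/.nvm/versions/node/*/bin",
];

/** Minimal filesystem interface so the logic can be exercised without real dirs. */
interface FsProbe {
  list(dir: string): string[];
  isDir(dir: string): boolean;
}

const realFs: FsProbe = {
  list: (dir) => {
    try { return fs.readdirSync(dir).sort(); } catch { return []; }
  },
  isDir: (dir) => {
    try { return fs.statSync(dir).isDirectory(); } catch { return false; }
  },
};

/** Expand a pattern containing at most one "/*/" segment, relative to `root`. */
function expandPattern(pattern: string, root: string, probe: FsProbe): string[] {
  const star = pattern.indexOf("/*/");
  let candidates: string[];
  if (star === -1) {
    candidates = [pattern];
  } else {
    const prefix = pattern.slice(0, star);
    const suffix = pattern.slice(star + 3);
    candidates = probe
      .list(path.posix.join(root, prefix))
      .map((name) => path.posix.join(prefix, name, suffix));
  }
  // Keep only directories that actually exist under the chroot root.
  return candidates.filter((dir) => probe.isDir(path.posix.join(root, dir)));
}

/** Merge the host PATH with expanded candidates, preserving order, dropping duplicates. */
function buildAgentPath(
  hostPath: string,
  patterns: string[],
  root: string, // assumption: "/host", per the issue text
  probe: FsProbe = realFs,
): string {
  const ordered: string[] = hostPath.split(":");
  for (const pattern of patterns) {
    ordered.push(...expandPattern(pattern, root, probe));
  }
  const seen = new Set<string>();
  const result: string[] = [];
  for (const dir of ordered) {
    if (dir !== "" && !seen.has(dir)) {
      seen.add(dir);
      result.push(dir);
    }
  }
  return result.join(":");
}
```

Host entries come first so the existing lookup order is preserved, and the curated directories only act as a fallback. Directories supplied through a flag such as the proposed --extra-path could be appended to `patterns` in the same way.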

P0 fix: Tracked in the api-proxy health check issues.

Generated by Firewall Issue Dispatcher
