Skip to content

[awf] cli-proxy: Harden DIFC liveness probe retry backoff #4725

@lpcox

Description

@lpcox

Problem

awf-cli-proxy fails the entire agent run when the external DIFC proxy is slow to accept connections during startup. The current liveness probe uses only 2 attempts with 1s between them — too tight for transient host-level DIFC proxy startup contention. On 2026-06-10 this knocked out two scheduled workflows (Auto-Triage #27261698585, Sub-Issue Closer #27261373050) with 0 agent turns executed before the firewall aborted.

Context

Source issue: github/gh-aw#38309

The probe resolves localhost to IPv6 [::1] but the tunnel listener may bind IPv4-only at 127.0.0.1, adding a second failure mode.

Root Cause

awf-cli-proxy entrypoint fail-fast logic: 2 liveness probe attempts at 1s intervals → immediate fatal abort on connection refused. No exponential backoff. Two concurrent scheduled jobs hit the same DIFC proxy at 07:46 UTC, exhausted the probe window before the proxy finished binding.

Proposed Solution

  1. Replace 2-attempt/1s fail-fast with exponential backoff (~5 attempts, ~15–30s total) in containers/api-proxy/ cli-proxy startup logic.
  2. Pin the localhost:18443 tunnel listener and probe to the same address family (prefer 127.0.0.1 explicitly) to eliminate IPv4/IPv6 mismatch.

Success criteria: scheduled runs survive transient DIFC proxy startup slowness; no awf-cli-proxy could not connect fatal in the next 7-day window.

Generated by Firewall Issue Dispatcher · 157.4 AIC · ⊞ 27.8K ·

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions