Make selector maxFails / failTimeout configurable (per forward or via env var) #509

Description

@cold-sword

Problem

Currently flux_agent hardcodes selector parameters as maxFails: 1 and failTimeout: 600s (10 minutes) for all forwarded services. Every UpdateService command pushed from the control plane contains:

"selector": {
  "strategy": "fifo",
  "maxFails": 1,
  "failTimeout": "600s"
}

Impact

With maxFails: 1, a single transient connection failure (brief backend timeout, DNS hiccup, network jitter) immediately marks a node as "failed" for 10 minutes. All traffic shifts to the fallback node. If the fallback also has any issue, the entire service is dead for 10 minutes — even though both backends are healthy 99% of the time.
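The failure-marking behavior described above can be sketched with a toy model. This is not flux_agent's actual selector code. The class and method names (`Node`, `FailoverSelector`, `pick`, `report_failure`) are hypothetical, and the semantics (a node is skipped once it has `max_fails` failures, until `fail_timeout` seconds have passed since its last failure) are an assumption based on the description in this issue.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    fails: int = 0                     # consecutive failures recorded so far
    last_fail: Optional[float] = None  # timestamp (seconds) of the last failure


class FailoverSelector:
    """Toy 'fifo' selector with assumed maxFails / failTimeout semantics."""

    def __init__(self, nodes: List[Node], max_fails: int = 1,
                 fail_timeout: float = 600.0) -> None:
        self.nodes = nodes
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout

    def _usable(self, node: Node, now: float) -> bool:
        if node.fails < self.max_fails or node.last_fail is None:
            return True
        # Sidelined until fail_timeout has elapsed since the last failure.
        return now - node.last_fail >= self.fail_timeout

    def pick(self, now: float) -> Optional[Node]:
        # fifo: always prefer the first node that is not sidelined.
        for node in self.nodes:
            if self._usable(node, now):
                return node
        return None

    def report_failure(self, node: Node, now: float) -> None:
        node.fails += 1
        node.last_fail = now

    def report_success(self, node: Node) -> None:
        node.fails = 0
        node.last_fail = None


def pick_after_one_failure(max_fails: int, fail_timeout: float, at: float) -> str:
    """Fail the primary once at t=0, then ask which node is picked at t=at."""
    primary, fallback = Node("primary"), Node("fallback")
    selector = FailoverSelector([primary, fallback], max_fails, fail_timeout)
    selector.report_failure(primary, now=0.0)
    return selector.pick(now=at).name


print(pick_after_one_failure(1, 600, at=30))   # fallback
print(pick_after_one_failure(1, 600, at=600))  # primary
print(pick_after_one_failure(5, 60, at=30))    # primary
```

Under these assumptions, one failure with the current values moves traffic to the fallback for the full 600 seconds, while the suggested `maxFails: 5` tolerates the same single failure without switching nodes.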

Symptoms:

  • Forwarded ports randomly stop working while tcping shows the port as open
  • Auto-recovers after ~10 minutes (failTimeout expiry)
  • Re-deploying from panel temporarily "fixes" it (counters reset)

Real-world data (IPLC VPS, Debian 13, flux_agent v3.0.0-rc4)

Over a 9-hour window, a single SSH relay port (real traffic, zero monitoring probes):

  • 38 TCP errors + 167 UDP errors
  • 619 total connections, 38 TCP failures ≈ 6% failure rate (38/619)
  • Each error potentially triggers a 10-minute fail period
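To put the last bullet in numbers, here is a rough upper-bound calculation. It assumes, hypothetically, that every one of the 38 TCP errors lands on a usable node and opens its own non-overlapping 600-second fail window. Real windows can overlap, and the issue does not say whether UDP errors count toward `maxFails`, so UDP is left out.

```python
WINDOW_HOURS = 9       # observation window from the report
TCP_ERRORS = 38        # TCP errors seen in that window
FAIL_TIMEOUT_S = 600   # current hardcoded failTimeout

window_min = WINDOW_HOURS * 60

# Upper bound: each error opens a separate 10-minute window, capped at the
# length of the observation window itself.
worst_case_sidelined_min = min(TCP_ERRORS * FAIL_TIMEOUT_S / 60, window_min)
worst_case_share = worst_case_sidelined_min / window_min

# Average spacing between errors, for comparison with the 10-minute timeout.
mean_gap_min = window_min / TCP_ERRORS

print(f"worst-case sidelined time: {worst_case_sidelined_min:.0f} of {window_min} min")
print(f"worst-case share of window: {worst_case_share:.0%}")
print(f"mean gap between errors: {mean_gap_min:.1f} min")
```

With an error roughly every 14 minutes and a 10-minute timeout, a node could spend most of the window marked as failed in the worst case, which matches the intermittent outages listed under Symptoms.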

Proposed Solution

Make maxFails and failTimeout user-configurable. MVP options (any one would help):

  1. Per-forward config: Add columns to the forward table and expose them in the panel UI; keep the current defaults for backward compatibility
  2. Environment variable: FLUX_SELECTOR_MAX_FAILS / FLUX_SELECTOR_FAIL_TIMEOUT on the flux_agent node
  3. Global panel setting: Override defaults for all forwards on a tunnel/node

Suggested safer defaults: maxFails: 5, failTimeout: 60s.
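As a minimal sketch of how options 1 and 2 could combine, the snippet below resolves the selector settings from a per-forward override, then the proposed environment variables, then the current hardcoded defaults. The precedence order, the `resolve_selector` and `parse_duration` helpers, and the accepted duration suffixes (`s`, `m`, `h`) are assumptions for illustration. Only the variable names and the payload shape come from this issue.

```python
import os
import re
from dataclasses import dataclass
from typing import Mapping, Optional

DEFAULT_MAX_FAILS = "1"         # current hardcoded value
DEFAULT_FAIL_TIMEOUT = "600s"   # current hardcoded value

_DURATION_RE = re.compile(r"^(\d+)(s|m|h)$")
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Parse strings such as '600s' or '2m' into whole seconds."""
    match = _DURATION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"invalid duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


@dataclass(frozen=True)
class SelectorConfig:
    max_fails: int
    fail_timeout_s: int

    def to_payload(self) -> dict:
        # Same shape as the selector block shown in the Problem section.
        return {
            "strategy": "fifo",
            "maxFails": self.max_fails,
            "failTimeout": f"{self.fail_timeout_s}s",
        }


def resolve_selector(per_forward: Optional[Mapping[str, str]] = None,
                     env: Mapping[str, str] = os.environ) -> SelectorConfig:
    """Assumed precedence: per-forward override, then env var, then default."""
    per_forward = per_forward or {}
    raw_fails = (per_forward.get("maxFails")
                 or env.get("FLUX_SELECTOR_MAX_FAILS")
                 or DEFAULT_MAX_FAILS)
    raw_timeout = (per_forward.get("failTimeout")
                   or env.get("FLUX_SELECTOR_FAIL_TIMEOUT")
                   or DEFAULT_FAIL_TIMEOUT)
    max_fails = int(raw_fails)
    if max_fails < 1:
        raise ValueError("maxFails must be at least 1")
    return SelectorConfig(max_fails, parse_duration(raw_timeout))
```

With no overrides, `resolve_selector(env={}).to_payload()` reproduces the current `maxFails: 1` / `failTimeout: "600s"` block, so existing deployments would keep their behavior unless an operator opts in.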

Environment

  • flux_agent: v3.0.0-rc4 (debian/amd64)
  • flux-panel-backend: v3.0.0-rc4 (Docker, PostgreSQL)
  • OS: Debian 13 (trixie)
  • Network: IPLC dedicated line

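Problem description (translated from Chinese)

Currently flux_agent hardcodes maxFails: 1 and failTimeout: 600s (10 minutes) for all forwarded services, and every UpdateService command pushed from the control plane carries these fixed values.

Impact: maxFails: 1 means that a single transient connection failure (backend timeout, DNS hiccup, network jitter) marks the node as failed for 10 minutes and shifts all traffic to the fallback node. If the fallback node also has a problem, the whole port is down for 10 minutes.

Symptoms: The forwarded port intermittently stops working (tcping succeeds but traffic is not forwarded), recovers on its own after about 10 minutes, and works again immediately after the configuration is re-pushed.

Real-world data (IPLC dedicated-line VPS, 9-hour window, real traffic only, no monitoring probes):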

  • Port 30107: 619 connections, 38 TCP errors + 167 UDP errors; 38 TCP failures out of 619 connections is a failure rate of about 6%

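Suggestion: Make maxFails and failTimeout configurable, whether through the panel UI, environment variables, or a global setting. Suggested safer defaults: maxFails: 5, failTimeout: 60s.

Environment: flux_agent v3.0.0-rc4, Debian 13, IPLC dedicated line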
