Problem
Currently flux_agent hardcodes selector parameters as maxFails: 1 and failTimeout: 600s (10 minutes) for all forwarded services. Every UpdateService command pushed from the control plane contains:
    "selector": {
      "strategy": "fifo",
      "maxFails": 1,
      "failTimeout": "600s"
    }
Impact
With maxFails: 1, a single transient connection failure (brief backend timeout, DNS hiccup, network jitter) immediately marks a node as "failed" for 10 minutes. All traffic shifts to the fallback node. If the fallback also has any issue, the entire service is dead for 10 minutes — even though both backends are healthy 99% of the time.
Symptoms:
- Forwarded ports randomly stop working while tcping shows the port as open
- Auto-recovers after ~10 minutes (failTimeout expiry)
- Re-deploying from panel temporarily "fixes" it (counters reset)
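The failure mode described above can be reproduced with a small model. This is only a sketch of how a fail-marking FIFO selector is assumed to behave; it is not flux_agent's actual code, and the names `Node`, `FifoSelector` and `simulate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical model of one backend tracked by the selector."""
    name: str
    fails: int = 0
    marked_at: Optional[float] = None  # time (seconds) the node was marked failed


class FifoSelector:
    """Simplified model: always pick the first node not currently marked failed."""

    def __init__(self, nodes, max_fails, fail_timeout):
        self.nodes = nodes
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout  # seconds

    def available(self, node, now):
        if node.marked_at is None:
            return True
        if now - node.marked_at >= self.fail_timeout:
            # failTimeout expired: clear the mark and the counter
            node.fails, node.marked_at = 0, None
            return True
        return False

    def pick(self, now):
        for node in self.nodes:
            if self.available(node, now):
                return node
        return None  # every node is marked failed, so the port is dead

    def report_failure(self, node, now):
        node.fails += 1
        if node.fails >= self.max_fails:
            node.marked_at = now


def simulate(max_fails, fail_timeout, check_at=60):
    """One transient error on each node, then ask who serves traffic at check_at."""
    primary, fallback = Node("primary"), Node("fallback")
    sel = FifoSelector([primary, fallback], max_fails, fail_timeout)
    sel.report_failure(primary, now=0)
    sel.report_failure(fallback, now=1)
    picked = sel.pick(now=check_at)
    return picked.name if picked else None
```

In this model, `simulate(1, 600)` finds no usable node one minute after two isolated errors, while `simulate(5, 60)` keeps serving from the primary because neither node reaches the failure threshold.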
Real-world data (IPLC VPS, Debian 13, flux_agent v3.0.0-rc4)
Over a 9-hour window, a single SSH relay port (real traffic, zero monitoring probes) recorded:
- 38 TCP errors + 167 UDP errors
- 619 total connections, 38 failures = 12% failure rate
- Each error potentially triggers a 10-minute fail period
Proposed Solution
Make maxFails and failTimeout user-configurable. MVP options (any one would help):
- Per-forward config: Add columns to the forward table, expose in panel UI; keep current defaults for backward compat
- Environment variable: FLUX_SELECTOR_MAX_FAILS / FLUX_SELECTOR_FAIL_TIMEOUT on the flux_agent node
- Global panel setting: Override defaults for all forwards on a tunnel/node
Suggested safer defaults: maxFails: 5, failTimeout: 60s.
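As an illustration of how the options above could combine, here is a minimal sketch that resolves the selector block from a per-forward override, then the proposed environment variables, then the values that are hardcoded today. The function name `selector_config`, the per-forward dictionary keys and the precedence order are assumptions made for this example; only the environment variable names and the JSON field names come from this report.

```python
import os
import re

# Values currently hardcoded according to this report; kept as the fallback
# so existing deployments behave the same when nothing is configured.
DEFAULT_MAX_FAILS = 1
DEFAULT_FAIL_TIMEOUT = "600s"

# This sketch accepts whole seconds only, e.g. "60s".
_DURATION = re.compile(r"^[1-9]\d*s$")


def selector_config(forward=None, env=None):
    """Resolve selector settings: per-forward value, then env var, then default."""
    forward = forward or {}
    env = os.environ if env is None else env

    max_fails = (
        forward.get("maxFails")
        or env.get("FLUX_SELECTOR_MAX_FAILS")
        or DEFAULT_MAX_FAILS
    )
    fail_timeout = (
        forward.get("failTimeout")
        or env.get("FLUX_SELECTOR_FAIL_TIMEOUT")
        or DEFAULT_FAIL_TIMEOUT
    )

    max_fails = int(max_fails)  # raises ValueError on non-numeric input
    if max_fails < 1:
        raise ValueError("maxFails must be >= 1")
    if not _DURATION.match(fail_timeout):
        raise ValueError(f"failTimeout must look like '60s', got {fail_timeout!r}")

    return {"strategy": "fifo", "maxFails": max_fails, "failTimeout": fail_timeout}
```

With no overrides the function returns the current `maxFails: 1` / `failTimeout: "600s"` block unchanged, so any of the three options could be introduced without altering behaviour for users who do not opt in.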
Environment
- flux_agent: v3.0.0-rc4 (debian/amd64)
- flux-panel-backend: v3.0.0-rc4 (Docker, PostgreSQL)
- OS: Debian 13 (trixie)
- Network: IPLC dedicated line
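Problem description (translated from Chinese)
flux_agent currently hardcodes maxFails: 1 and failTimeout: 600s (10 minutes) for all forwarded services, and every UpdateService command pushed from the control plane carries these fixed values.
Impact: maxFails: 1 means that a single transient connection failure (backend timeout, DNS hiccup, network jitter) marks the node as failed for 10 minutes and shifts all traffic to the fallback node. If the fallback node also has a problem, the whole port is down for 10 minutes.
Symptoms: forwarded ports intermittently stop working (tcping succeeds but traffic is not forwarded), recover on their own after about 10 minutes, and work again immediately after the config is re-pushed.
Real-world data (IPLC dedicated-line VPS, 9-hour window, real traffic only, no monitoring probes):
- Port 30107: 619 connections, 38 TCP errors + 167 UDP errors, 12% failure rate
Suggestion: make maxFails and failTimeout configurable, via the panel UI, environment variables, or a global setting. Suggested safer defaults: maxFails: 5, failTimeout: 60s.
Environment: flux_agent v3.0.0-rc4, Debian 13, IPLC dedicated line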