Make selector maxFails / failTimeout configurable (per forward or via env var)

## Problem

Currently `flux_agent` hardcodes selector parameters as `maxFails: 1` and `failTimeout: 600s` (10 minutes) for all forwarded services. Every `UpdateService` command pushed from the control plane contains:

```json
"selector": {
  "strategy": "fifo",
  "maxFails": 1,
  "failTimeout": "600s"
}
```

## Impact

With `maxFails: 1`, a **single transient connection failure** (brief backend timeout, DNS hiccup, network jitter) immediately marks a node as "failed" for **10 minutes**, and all traffic shifts to the fallback node. If the fallback also hits a single failure, the entire service is down for 10 minutes, even though both backends are healthy 99% of the time.

Symptoms:
- Forwarded ports randomly stop working while `tcping` shows the port as open
- Auto-recovers after ~10 minutes (failTimeout expiry)
- Re-deploying from the panel temporarily "fixes" it (failure counters reset)
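
To make the failure mode concrete, here is a minimal Go sketch of a fail-count selector. It assumes the selector counts failures per node and skips a node once the count reaches `maxFails`, until `failTimeout` has passed since the last failure. The `policy` and `node` types and the `availableAfter` helper are hypothetical names for illustration; the real `flux_agent` selector may track failures differently.

```go
package main

import (
	"fmt"
	"time"
)

// policy holds the two selector parameters discussed in this issue.
type policy struct {
	maxFails    int
	failTimeout time.Duration
}

// node is a hypothetical model of per-node failure tracking
// (an assumption, not flux_agent's actual data structure).
type node struct {
	fails    int
	failedAt time.Time // time of the most recent failure
}

// recordFailure counts one failed connection observed at time now.
func (n *node) recordFailure(now time.Time) {
	n.fails++
	n.failedAt = now
}

// available reports whether the node may receive traffic at time now.
// A node is skipped once fails reaches maxFails, until failTimeout has
// passed since the last failure; after that the counter resets.
func (n *node) available(p policy, now time.Time) bool {
	if n.fails >= p.maxFails {
		if now.Sub(n.failedAt) < p.failTimeout {
			return false
		}
		n.fails = 0
	}
	return true
}

// availableAfter records `failures` failures at t=0 and checks whether the
// node is usable elapsedSeconds later.
func availableAfter(p policy, failures, elapsedSeconds int) bool {
	start := time.Unix(0, 0)
	n := &node{}
	for i := 0; i < failures; i++ {
		n.recordFailure(start)
	}
	return n.available(p, start.Add(time.Duration(elapsedSeconds)*time.Second))
}

var (
	currentPolicy  = policy{maxFails: 1, failTimeout: 600 * time.Second}
	proposedPolicy = policy{maxFails: 5, failTimeout: 60 * time.Second}
)

func main() {
	fmt.Println("current policy, 1 failure, 1s later:", availableAfter(currentPolicy, 1, 1))
	fmt.Println("current policy, 1 failure, 600s later:", availableAfter(currentPolicy, 1, 600))
	fmt.Println("proposed policy, 1 failure, 1s later:", availableAfter(proposedPolicy, 1, 1))
}
```

Under this model, one failure with the current values takes the node out of rotation for the full 600 seconds, while the proposed values tolerate four failures and recover after 60 seconds.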

## Real-world data (IPLC VPS, Debian 13, flux_agent v3.0.0-rc4)

Over a 9-hour window, a single SSH relay port (real traffic, zero monitoring probes) recorded:
- **38 TCP errors + 167 UDP errors**
- 619 total connections, 38 failures = **12% failure rate**
- Each error potentially triggers a 10-minute fail period

## Proposed Solution

Make `maxFails` and `failTimeout` user-configurable. MVP options (any one would help):

1. **Per-forward config**: Add columns to the `forward` table and expose them in the panel UI; keep the current values as defaults for backward compatibility
2. **Environment variable**: `FLUX_SELECTOR_MAX_FAILS` / `FLUX_SELECTOR_FAIL_TIMEOUT` on the `flux_agent` node
3. **Global panel setting**: Override defaults for all forwards on a tunnel/node

Suggested safer defaults: `maxFails: 5`, `failTimeout: 60s`.
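
As a rough illustration of option 2, the Go sketch below reads the two environment variables named above and falls back to the current hardcoded values when a variable is unset or invalid. The `selectorConfig` type and the `selectorFromEnv` and `mapLookup` helpers are hypothetical, and it assumes `FLUX_SELECTOR_FAIL_TIMEOUT` uses Go duration syntax (for example `60s`); none of this reflects existing `flux_agent` code.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// selectorConfig is a hypothetical container for the two tunables.
type selectorConfig struct {
	MaxFails    int
	FailTimeout time.Duration
}

// defaultSelector mirrors the values that are hardcoded today, so leaving
// the variables unset changes nothing for existing deployments.
var defaultSelector = selectorConfig{MaxFails: 1, FailTimeout: 600 * time.Second}

// selectorFromEnv applies overrides found through lookup (os.LookupEnv in
// production) on top of def. Missing, unparsable, or non-positive values
// are ignored and the default is kept.
func selectorFromEnv(lookup func(string) (string, bool), def selectorConfig) selectorConfig {
	cfg := def
	if v, ok := lookup("FLUX_SELECTOR_MAX_FAILS"); ok {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			cfg.MaxFails = n
		}
	}
	if v, ok := lookup("FLUX_SELECTOR_FAIL_TIMEOUT"); ok {
		if d, err := time.ParseDuration(v); err == nil && d > 0 {
			cfg.FailTimeout = d
		}
	}
	return cfg
}

// mapLookup adapts a map to the lookup signature, which keeps the parsing
// logic testable without touching the process environment.
func mapLookup(m map[string]string) func(string) (string, bool) {
	return func(key string) (string, bool) {
		v, ok := m[key]
		return v, ok
	}
}

func main() {
	cfg := selectorFromEnv(os.LookupEnv, defaultSelector)
	fmt.Printf("maxFails=%d failTimeout=%s\n", cfg.MaxFails, cfg.FailTimeout)
}
```

Passing the lookup function as a parameter is a design choice for testability; a per-forward or panel-level override (options 1 and 3) could feed the same structure from the database instead of the environment.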

## Environment
- flux_agent: v3.0.0-rc4 (debian/amd64)
- flux-panel-backend: v3.0.0-rc4 (Docker, PostgreSQL)
- OS: Debian 13 (trixie)
- Network: IPLC dedicated line

---

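## Problem description (translated from Chinese)

`flux_agent` currently hardcodes `maxFails: 1` and `failTimeout: 600s` (10 minutes) for all forwarded services, and every `UpdateService` command pushed from the control plane carries these fixed values.

**Impact:** With `maxFails: 1`, a **single** transient connection failure (backend timeout, DNS hiccup, network jitter) marks the node as failed for **10 minutes**, and all traffic switches to the fallback node. If the fallback node also has a problem, the whole port is down for 10 minutes.

**Symptoms:** Forwarded ports intermittently stop working (`tcping` succeeds but forwarded traffic does not pass), recover on their own after about 10 minutes, and work again immediately after the configuration is re-pushed.

**Real-world data (IPLC dedicated-line VPS, 9-hour window, real traffic only, no monitoring probes):**
- Port 30107: 619 connections, 38 TCP errors + 167 UDP errors, 12% failure rate

**Suggestion:** Make `maxFails` and `failTimeout` configurable: panel UI settings, environment variables, or a global setting would all work. Suggested safer defaults: `maxFails: 5`, `failTimeout: 60s`.

**Environment:** flux_agent v3.0.0-rc4, Debian 13, IPLC dedicated line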
