
[Bug] Team specialist worker's channels.matrix.groupAllowFrom is missing leader / sibling / team-admin: WorkerReconciler overwrites the correct config written by TeamReconciler #799

@hlgone

Description


Observed behavior

When a Team CR references an existing standard Worker CR via spec.workers[].name, TeamReconciler does initially compute the correct channels.matrix.groupAllowFrom = [leader, admin, sibling-workers] and writes it to MinIO (agents/<worker>/openclaw.json) via DeployWorkerConfig.

But any subsequent WorkerReconciler reconcile (informer event / periodic resync / status patch trigger) regenerates openclaw.json and overwrites it back to the standalone default [manager, admin], because workerMemberContext reads only the hiclaw.io/team-leader annotation to determine team membership, and a shared external Worker CR does not carry that annotation.

The specialist's groupAllowFrom that ends up on MinIO:

[
  "@manager:<homeserver>",
  "@admin:<homeserver>"
]

(In the real scenario the user MXIDs that the accessibleTeams Human reconciler injects via ChannelPolicy.GroupAllowExtra are layered on top, but the leader and siblings are still missing.)

Minimal reproduction

# 1. apply a standard Worker CR (no hiclaw.io/team-leader annotation)
cat <<YAML | hiclaw apply -f -
apiVersion: hiclaw.io/v1beta1
kind: Worker
metadata:
  name: worker-pop
spec:
  model: qwen3.5-plus
  runtime: copaw
YAML

# 2. wait for the worker to be provisioned, then dump the MinIO config ([manager, admin], as expected)
docker exec hiclaw-controller mc cat hiclaw/hiclaw-storage/agents/worker-pop/openclaw.json | jq .channels.matrix.groupAllowFrom

# 3. apply a Team CR that references the worker above
cat <<YAML | hiclaw apply -f -
apiVersion: hiclaw.io/v1beta1
kind: Team
metadata:
  name: t1
spec:
  description: minimum repro
  leader:
    name: t1-lead
    model: qwen3.5-plus
    soul: ...
    agents: ...
  workers:
    - {name: worker-pop}
YAML

# 4. dump immediately after the Team goes active (the version written by TeamReconciler)
#    → this step occasionally shows [t1-lead, admin, ...] (race win), occasionally it has already been overwritten
docker exec hiclaw-controller mc cat hiclaw/hiclaw-storage/agents/worker-pop/openclaw.json | jq .channels.matrix.groupAllowFrom

# 5. trigger any worker reconcile (restart the container / patch status / wait for periodic resync)
docker restart hiclaw-worker-worker-pop

# 6. dump again (overwritten back to standalone by WorkerReconciler)
docker exec hiclaw-controller mc cat hiclaw/hiclaw-storage/agents/worker-pop/openclaw.json | jq .channels.matrix.groupAllowFrom
# → [
#     "@manager:<homeserver>",
#     "@admin:<homeserver>"
#   ]

Impact

When the Leader's Matrix REST PUT /rooms/<team-room>/send/m.room.message/<txn> dispatch (the pattern AGENTS.md teaches the Leader) lands on the specialist:

  • the m.text body contains an @worker-pop:<homeserver> mention
  • m.mentions.user_ids contains @worker-pop:<homeserver>
  • server-side Matrix delivery succeeds

But the first gate of the specialist worker's channel filter, the sender allowlist check, finds that sender @t1-lead:<homeserver> is not in groupAllowFrom and silently drops the event before the requireMention check ever runs. The specialist container log shows no trace of the event, no LLM inference happens, and the Team collaboration chain is broken at the leader→specialist hop.
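The drop order described above can be sketched as follows. This is a hypothetical illustration of the filter chain, not the actual worker runtime code; `matrixEvent` and `shouldDeliver` are names I made up for the sketch:

```go
package main

import "fmt"

// matrixEvent is a minimal stand-in for an incoming Matrix room event.
type matrixEvent struct {
	Sender   string
	Mentions []string
}

// shouldDeliver mirrors the described gate order: the sender allowlist
// check runs first, so an event from a sender outside groupAllowFrom is
// dropped before requireMention is ever evaluated, with no log line.
func shouldDeliver(ev matrixEvent, groupAllowFrom []string, requireMention bool, self string) bool {
	allowed := false
	for _, s := range groupAllowFrom {
		if s == ev.Sender {
			allowed = true
			break
		}
	}
	if !allowed {
		return false // silent drop: the requireMention check below is never reached
	}
	if requireMention {
		for _, m := range ev.Mentions {
			if m == self {
				return true
			}
		}
		return false
	}
	return true
}

func main() {
	// What WorkerReconciler writes for a standalone worker.
	allow := []string{"@manager:<homeserver>", "@admin:<homeserver>"}
	ev := matrixEvent{
		Sender:   "@t1-lead:<homeserver>",
		Mentions: []string{"@worker-pop:<homeserver>"},
	}
	// Leader dispatch is dropped even though the mention is present.
	fmt.Println(shouldDeliver(ev, allow, true, "@worker-pop:<homeserver>")) // prints false
}
```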

The entire ADR 0010-style Team Room (Leader-led dispatch) is therefore unusable for external consumers.

Code-level trace

internal/controller/worker_controller.go:264

TeamLeaderName: w.Annotations["hiclaw.io/team-leader"],

internal/controller/team_controller.go:711-727 + :870-892 teamWorkerSpecToWorkerSpec

// team-side projection adds leader + admin + peers to ChannelPolicy
policy = appendGroupAllowExtra(policy, t.Spec.Leader.Name)
// ...
for _, peer := range t.Spec.Workers {
    if peer.Name != w.Name {
        policy = appendGroupAllowExtra(policy, peer.Name)
    }
}

internal/agentconfig/generator.go:208-218

groupAllowFrom := []string{managerMatrixID, adminMatrixID}
if req.TeamLeaderName != "" {
    leaderMatrixID := fmt.Sprintf("@%s:%s", req.TeamLeaderName, domain)
    groupAllowFrom = []string{leaderMatrixID, adminMatrixID}
}

The two reconcilers compute different WorkerConfigRequests from different sources of truth; since they share the same MinIO write path, the last write wins and determines the specialist's actual allowlist.

On my local branch feat/team-worker-groupallow-peers I added a lock-in unit test (TestWorkerReconciler_ExternalTeamWorker_MissesTeamContext) that pins this asymmetry down: PASS means the bug is present. Once a maintainer confirms the direction, I can flip the test into a post-fix parity assertion and include it in the PR.

Proposed direction (doctrine questions, maintainer alignment needed first)

My preference is for WorkerReconciler to actively reverse-look-up Team membership (a field indexer on Team.Spec.Workers[].name) and union the leader/admin/peers of every referencing Team into groupAllowFrom; the hiclaw.io/team-leader annotation then degrades to an optional hint for the single-team case and is no longer a prerequisite.
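The union computation that the reverse lookup would feed into the config generator could look like the sketch below. All types and names here (`teamSpec`, `groupAllowFromFor`) are illustrative stand-ins, not the real hiclaw.io/v1beta1 API types, and the actual indexer wiring via controller-runtime is omitted:

```go
package main

import (
	"fmt"
	"sort"
)

// teamSpec is a minimal stand-in for the relevant Team CR fields.
type teamSpec struct {
	Name    string
	Leader  string
	Workers []string
}

// groupAllowFromFor unions manager/admin plus the leader and sibling
// workers of every Team that references the worker -- the proposed
// behavior, fed by a field-indexer reverse lookup on Team.Spec.Workers[].name.
func groupAllowFromFor(worker, domain string, teams []teamSpec) []string {
	set := map[string]struct{}{}
	mxid := func(local string) string { return fmt.Sprintf("@%s:%s", local, domain) }
	set[mxid("manager")] = struct{}{} // standalone defaults survive the union
	set[mxid("admin")] = struct{}{}
	for _, t := range teams {
		referenced := false
		for _, w := range t.Workers {
			if w == worker {
				referenced = true
				break
			}
		}
		if !referenced {
			continue
		}
		set[mxid(t.Leader)] = struct{}{}
		for _, w := range t.Workers {
			if w != worker {
				set[mxid(w)] = struct{}{} // sibling specialists
			}
		}
	}
	out := make([]string, 0, len(set))
	for id := range set {
		out = append(out, id)
	}
	sort.Strings(out) // deterministic output so repeated reconciles don't churn the config
	return out
}

func main() {
	teams := []teamSpec{
		{Name: "t1", Leader: "t1-lead", Workers: []string{"worker-pop", "worker-aux"}},
	}
	fmt.Println(groupAllowFromFor("worker-pop", "example.org", teams))
	// prints [@admin:example.org @manager:example.org @t1-lead:example.org @worker-aux:example.org]
}
```

Note the deterministic ordering: since both reconcilers would call the same function over the same indexed data, a repeated reconcile becomes idempotent instead of last-write-wins.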

Before that, though, I'd like a maintainer's position on:

  1. Is referencing an existing Worker CR from a Team CR via spec.workers[].name a supported path? Or must a Team's workers be inlined in spec.workers[] (giving TeamReconciler full spec ownership)?
  2. When the same Worker is referenced by multiple Teams (the "shared worker across teams" use case, e.g. several business gateways reusing one worker for the same user), should the allowlist union the leaders/peers of all teams, or be isolated per Team (i.e. each team gets its own copy of the worker)?
  3. Should WorkerReconciler be aware of Team membership at all? Or should TeamReconciler stamp an annotation / owner reference onto the Worker CR while reconciling the Team (WorkerReconciler stays unchanged, but then we must decide how to encode multiple team annotations)?
  4. Feasibility of the owner-reference model: if the Team CR is set as an owner reference on the Worker CR, how do multiple referencing Teams work (K8s ownerReferences allows multiple owners with controller=false, but only one with controller=true)?

Either direction lets the specialist receive the Leader's dispatch, but they have different implications for the K8s ownership model and the upgrade path; I'd like to hear maintainer preferences before writing the full PR.

Environment


