Skip to content

feat(observability): health IM boundary + agent-turn latency/error metrics (OCT-17)#407

Open
lml2468 wants to merge 39 commits into
mainfrom
feat/oct-17-observability
Open

feat(observability): health IM boundary + agent-turn latency/error metrics (OCT-17)#407
lml2468 wants to merge 39 commits into
mainfrom
feat/oct-17-observability

Conversation

@lml2468

@lml2468 lml2468 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Summary

Observability for the end-to-end agent turn on the server↔channel (WuKongIM) boundary (OCT-17, sub-task of OCT-10).

  • Health (/v1/health): already probed db + redis; now also probes WuKongIM reachability. Any HTTP response (even non-2xx) counts as reachable — the probe asserts the IM API answers on the network, not that a specific health route exists, so readiness isn't coupled to a WuKongIM version/path. Only a transport failure marks im=down; unconfigured APIURL reports im=unknown without failing overall readiness. 2s timeout bounds a hung IM control plane. HTTP stays 200 (aggregate status in body) — additive, non-breaking for existing consumers.
  • Agent-turn metrics (modules/bot_api/metrics.go): instrument the bot reply delivery leg (sendMessagedispatchMsgSendReq):
    • dmwork_agent_turn_delivery_duration_seconds (HistogramVec) — latency of bot reply delivery to the channel.
    • dmwork_agent_turn_delivery_total{result} (CounterVec) — result=error is the failed agent-turn deliveries counter.
    • Registered on the default registry via promauto (consistent with modules/oidc/metrics.go); exposed by the existing /metrics scrape. Labels kept low-cardinality: channel_type{person|group|topic|other} × result{ok|error}, pre-warmed to zero series.

Acceptance (from OCT-10)

  • Health check present (extended to the channel/IM boundary).
  • At least one latency + one error metric for the agent turn.

Test plan

  • go build ./...
  • go vet ./modules/bot_api/ ./modules/common/ (only linter gated in CI)
  • go test ./modules/bot_api/ -run 'TestAgentTurn|TestObserveAgentTurn|TestBotAPINoLegacyResponseError'
  • go test ./modules/common/ -run 'TestIMReachable'
  • CI green (full suite needs MySQL/Redis/WuKongIM)

No REST/WS contract break (health gains an additive im field; metrics are new). No auth/secrets/permissions changes.

caster-Q and others added 30 commits May 18, 2026 19:30
…time

迁移 dmworkim 老仓 feat/agent-runtime 的 runtime 模块到 octo-server,
按简化版策略只做编译必需的最小适配:

- modules/runtime/ 12 文件(api/db/model/enrich/upgrade/version_sync/1module + 4 SQL)
- import 路径:dmwork-org/{dmworkim,dmwork-lib} → Mininglamp-OSS/{octo-server,octo-lib}
- /daemon 命令安装 URL → Mininglamp-OSS/octo-daemon-cli
- internal/modules.go 注册 runtime 模块
- modules/botfather/{const,command,api}.go surgical apply CmdDaemon + handleDaemon

业务字面(channel id "dmwork"、route prefix "dmwork/<uid>"、log/注释里的 dmwork)
有意保留,等本地端到端测试阶段统一修(与 openclaw-channel-octo channel id 改名对齐)。

Build: go build ./... 通过
Tests: enrich_test 等含 dmwork 业务字面的测试会挂,留给测试阶段
Match the daemon-cli and openclaw-channel-octo changes:

- enrich.go: channel id 'dmwork' → 'octo' (route prefix matching for
  bot info JOIN in user × robot × space_member); the rename function
  collectDmworkUIDs internal name kept as-is for now
- enrich_test.go: test data updated
- api.go: plugin_has_update match on 'octo' (was hardcoded
  'openclaw-channel-dmwork')
- upgrade.go: const componentPlugin = 'octo'
- model.go + sql migration comment: update doc strings

Verified end-to-end locally:
- go test ./modules/runtime/... pass
- Runtimes UI shows Octo Plugin 1.0.7 → 1.0.8 upgrade hint
- Bot route enrich: avatar/name/online dot rendered correctly
- Plugin upgrade flow (npx -y openclaw-channel-octo@dev install) end-to-end
  completed in 86 s with task=completed, daemon-reported version 1.0.8
- adapters/claude-code-dmwork/SKILL.md: install command uses ClawHub plugin id `octo` (was `openclaw-channel-dmwork`)
- docs/2026-05-mention-all-chokepoint-audit.md: source-tree path references updated from `openclaw-channel-dmwork/src/inbound.ts` to `octo/src/inbound.ts`
- docs/changelog.md: package-name reference in V2 conflict-fix note updated

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two heartbeat-pull dispatch pipelines under one schema:

managed_runtime_agent — Web "Create agent + Add bot" UX (PoC1).
Creates an openclaw agent workspace on the daemon and mints a bot
in the same flow; status: provisioning → bot_minted → dispatched →
active. Routes:
  POST /v1/runtimes/managed-agents
  POST /v1/runtimes/:runtime_id/agents/:agent_id/bots
  POST /v1/daemon/managed-agents/:id/ack
botfather.MintBotOBO extracted from /newbot handler so non-IM
callers can mint bots on behalf of a user.

bot_task — matter @mention/assignee → agent dispatch (PoC2+3).
Resolves bot_uid → openclaw agent_id via managed_runtime_agent,
queues a row, daemon claims via heartbeat. Routes:
  POST /v1/internal/bot-tasks      (X-Internal-Token, called by matter)
  POST /v1/daemon/bot-tasks/:id/ack
ack handler writes back to matter:
  - timeline (bot reply content)
  - activity (agent_task_completed/failed with elapsed_ms + bytes)
createBotTaskReq.prompt is an optional override so mention dispatch
can supply a full continuation prompt with conversation history.

heartbeat handler claims both pending_command (managed_agent) and
pending_task (bot_task) per cycle.

tsdd.yaml: enable MinIO for local dev (admin/12345678 against
testenv-minio-1) so avatar/file uploads work — production should
set proper IAM creds via env.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Foundation for cross-service trust chain laid out in spec
docs/superpowers/specs/2026-06-01-octo-fleet-extraction-design.md.

octo-server is now the canonical JWT issuer; fleet/matter will fetch
the public key from /.well-known/jwks.json and verify tokens locally
without service-to-service calls.

Implementation:
- modules/auth_jwt/jwt.go: RSA-2048 keypair loaded from or generated
  into ~/.octo-server/jwt-priv.pem (override via JWT_PRIVATE_KEY_PATH).
  kid derived from last-8-bytes of modulus so it stays stable across
  restarts — JWKS consumers can cache long-term.
- modules/auth_jwt/resolve.go: token exchange has two paths — web
  session via existing cache lookup, daemon api_key via user_api_key
  table + space_member membership re-check.
- POST /v1/auth/token returns {token, expires_in, scope}; web=30min,
  daemon=30days.
- GET /.well-known/jwks.json (Cache-Control: public, max-age=300).

Verified locally: api-key exchange produces a valid RS256 token with
the expected claims (sub=uid, scope=daemon, space_id, daemon_id, exp).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR-A.2 cross-service contract:

  POST /v1/bot/mint        — web session auth; wraps botfather.MintBotOBO
                             and returns ONLY {bot_uid}. bot_token stays
                             in server's robot.bot_token column.
  GET  /v1/bot/:uid/token  — daemon JWT auth (scope=daemon); returns
                             {bot_token, bot_uid} after asserting
                             robot.creator_uid == JWT.sub.

Together these let octo-fleet orchestrate bot lifecycle without ever
seeing bot_token: web mints (browser knows uid, not token), web tells
fleet about bot_uid, daemon later fetches bot_token directly from
server using its daemon-scope JWT. No fleet↔server HTTP.

E2E verified: valid daemon JWT returns the right bot_token; missing/
non-owned bot returns 404 (no info leak).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…R-A.2)

Runtime/bot endpoints have moved to octo-fleet :8092. The module
still loads and runs schema migrations (no data loss), but Route()
returns early so nginx-style 404s come back from server.

Set LEGACY_RUNTIME_ROUTES=true to re-enable as a rollback path if
fleet has problems during cutover. Code path is unchanged — just
gated.

Verified:
- POST /v1/daemon/register → 404 (no route)
- GET  /v1/runtimes        → 404 (no route)
- POST /v1/auth/token      → still works (auth_jwt module)
- POST /v1/bot/mint        → still works (auth_jwt module)
- GET  /v1/bot/:uid/token  → still works (auth_jwt module)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…PR-C)

Spec PR-C: server's runtime/bot orchestration code is dead weight now
that octo-fleet has been stable through PR-A and PR-B. Server keeps
only the cross-service contract surface (modules/auth_jwt):
  - /.well-known/jwks.json   (JWKS for fleet/matter to verify tokens)
  - POST /v1/auth/token      (session/api_key → JWT exchange)
  - POST /v1/bot/mint        (web-callable bot OBO)
  - GET  /v1/bot/:uid/token  (daemon-callable bot_token lookup)

Cleanup steps already taken locally (one-time, not in commit):
  DELETE FROM gorp_migrations WHERE id LIKE 'runtime-%';
  -- the runtime_* tables themselves are harmless orphans and can be
  -- dropped manually whenever convenient; they don't affect anything.

Boot smoke:
- AuthJWT loaded key from disk (existing /Users/caster/.octo-server/jwt-priv.pem)
- /v1/runtimes → 404 (gone)
- /v1/bot/mint → 200 (new bot minted: 27w5OWCvvD2f63623a6_bot)
- /v1/auth/token (api_key) → 200 JWT

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
100 commits 主要内容:
- pkg/i18n 大规模迁移(modules/{user,message,group,workplace,category} → ResponseErrorL)
- oidc RP-Initiated Logout (#217)
- voice → octo-speech adapter 替换 (#113)
- modules/bot_api / modules/user 重构
- 各种 CI workflow

冲突点预期:internal/modules.go (我们删了 runtime import,origin 可能新增 module)

# Conflicts:
#	internal/modules.go
requireDaemonJWT only did signature verification via parsed.Claims;
expired tokens still mint bot_tokens via GET /v1/bot/:uid/token. Plan
§2 AU1 explicitly requires "JWT 过期 / kid 未知 / 签名失败 → 401" but
this code path enforced only the kid + signature halves.

Risk: a leaked daemon JWT past its 30-day TTL keeps working for any
bot the original owner had access to, since exp was never checked.
30-day window limits blast radius but is still a contract drift on
the most security-sensitive endpoint we expose.

Fix: chain cl.Validate(jwt.Expected{Issuer: jwtIssuer, Time: now})
after the signature check. Issuer pinning also blocks tokens minted
by a future second issuer (e.g. test fixtures, staging keys).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
917ff7c accidentally git-add'd a deletion of the /daemon BotFather
command (45 lines across botfather/{api,command,const}.go). The
deletion came from a stale working-tree state during merge resolution,
not from any deliberate change — and the command is the ONLY user-
facing onboarding path for the daemon:

  user sends /daemon in OCTO IM →
  BotFather generates (or reuses) user_api_key row in test DB →
  BotFather replies with a copy-paste install snippet →
  user pastes into their terminal → daemon starts + registers.

Without this, new users have no way to get an api_key and no way to
know the install URL pattern.

Restored verbatim from origin/feat/agent-runtime, plus a follow-up
update to the reply payload reflecting the PR-A/B 3-URL split:

  export OCTO_SERVER_URL=...
  export OCTO_FLEET_URL=...
  export OCTO_MATTER_URL=...
  export NOTIFY_INTERNAL_TOKEN=<ops-shared-secret>
  octo-daemon start --api-key uk_xxx --api-url <server>

New helper deriveServiceURL() best-effort swaps the port suffix on
server URL to compute fleet/matter URLs for the same host. Operators
running fleet/matter on different hosts must override via env after
copy-paste (call this out in the reply text).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the i18n merge, user/login writes sessions to Redis as
"v2:{"uid":"...","name":"..."}" but auth_jwt.resolveSession only
parsed the legacy "uid@name@..." triple-at format. Result: every
POST /v1/auth/token from the web returned 401 "malformed session
value", which the frontend treats as session-expired and bounces
the user back to the login page — so as soon as web tried to call
any fleet/matter endpoint requiring a JWT, the user got kicked.

resolveSession now detects the "v2:" prefix and JSON-unmarshals
the payload to extract uid; legacy format remains a working
fallback so nothing else breaks.

Independent of the DualAuth work — this bug was latent since the
i18n merge and surfaced when we ran the web through the new
Runtimes/智能体 tab which exercises the session→JWT exchange.
/daemon dispatch case + handler were registered correctly and the
command works when typed, but handleHelp's hardcoded command list
omitted it, leading users to think it had been removed. Adds the
line right after /quickstart since both are runtime-setup commands.

No behavioral change beyond what /help renders.
Two reviewer follow-ups bundled (both single-spot, low-risk):

resolve.go: replace the hand-rolled "v2:" + struct-unmarshal mini
decoder with pkg/auth.Decode. The pkg helper has been the canonical
v2 envelope reader since the i18n rewrite (commit 76bc59a); having a
parallel mini-decoder here meant any new v2 field would have to be
mirrored in two places. resolveSession now just calls Decode and
returns info.UID. Behavior unchanged — Decode already covers v2 JSON,
legacy "uid@name[@ROLE]" fallback, and empty-payload error mapping.

bot_api.go: cl.Validate(jwt.Expected{Issuer, Time: time.Now()}) was
silently relying on go-jose's DefaultLeeway (1 min) for daemon ↔ server
clock skew tolerance on exp/iat/nbf. Added a comment so the next
person who reads this knows not to wrap a tighter leeway around it
"for security" — that would cause spurious failures at mint time
when daemon clock drifts a few seconds ahead.
Closes 2 cross-space privilege escalations found while documenting the
auth flow:

#1 (POST /v1/auth/token): resolveSession used to trust caller-supplied
   X-Space-Id with a comment saying "downstream re-checks". In practice
   plan AU3 has downstream (fleet/matter) trust JWT.space_id directly.
   Neither side actually checked → any logged-in user could mint a JWT
   for an arbitrary space and gain cross-space read/write on fleet/
   matter endpoints. Now resolveSession calls assertSpaceMember on the
   spaceHint before returning.

#2 (POST /v1/bot/mint): mintBot took space_id from body, never
   validated caller is in that space. MintBotOBO would then drop a bot
   into space_member of an arbitrary space. Attack: pull the new bot
   into a group → observe messages of a space you don't belong to.
   Now mintBot calls assertSpaceMember(uid, req.SpaceID) → 403 if not
   a member.

Refactor: extracted the existing space_member SQL from resolveAPIKey
into a shared assertSpaceMember helper so the daemon path (which always
had this check) and the two new check sites share one implementation.

No DB migration. No API contract change. X-Internal-Token and daemon
api_key paths unaffected — they already validated this. Verified via
curl: legit space → 200/200; fake space → 401/403.
## Summary
- \`daemonTokenTTL\` 30 * 24h → 24h
- 同步 bot_api.go 里过时的 "30-day TTL" 注释 (codex review MINOR)

## Why
通信协议决策 P1-5. daemon 端 EnsureJWT 早就在过期前自动 refresh, 缩 TTL 没有运行时影响, 但把
api_key 撤销窗口从 30 天降到 1 天 (生产合规).

## Note
- 如果 auth session 决定走 "去 daemon JWT 改 api_key 直连" 路线 (review 结论 4),
这个改动就作废. 但当前 JWT 路线下这是安全的紧化, 两种方向都不冲突.

## Test plan
- [x] \`go build ./...\` 通过

## Reviewers
@lvsijia @lijianhui

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## ⚠️ Reviewer 请先读架构 doc, 再 review 代码

避免直接 nitpick 实现细节而错过架构层决策的 context (e.g. 为啥 AU5 删了但 issue #75 没修).

## 1. 背景 + Phase 拆解

合并 plan: **决策一** (web 验证改造 — 浏览器改回带 session token, 不再换 JWT) + **决策二**
(daemon api_key 直连 — daemon 不再换 JWT, 直接持 api_key 调 fleet/matter, 由它们调
server verify-api-key 验证).

5 个 Phase:

| Phase | 内容 | 涉及 repo |
|---|---|---|
| 1 | server `POST /v1/auth/verify-api-key` endpoint | server |
| 2 | matter middleware 扩三协议 (session+bot_token+api_key) + fleet 抄 |
matter + fleet |
| 3A | web 客户端切回 session token | web |
| 3B | daemon 改 api_key 直连 + 业务 SQL 加 owner_uid 过滤 | daemon-cli + fleet
+ matter |
| 4 | 清理 (JWT/AU5/DualAuth/JWKS) + 配置改名 + 单测补 | 全 5 repo |

会议依据: 鉴权架构讨论会议结论 2026-06-05 决策一+决策二.

## 2. 本 repo 改动 (3 commit, 按时间顺序)

```
6b0dac8 refactor(auth): rename auth_jwt → bot_provision + add status edge case
36fbc3e refactor(auth): tear down JWT module, botToken now uses api_key
c2cf1d0 feat(auth): add POST /v1/auth/verify-api-key endpoint
```

## 3. 配套 PR (cross-link)

- octo-server (本 PR)
- octo-matter: Mininglamp-OSS/octo-matter#78
- octo-fleet: Mininglamp-OSS/octo-fleet#24
- octo-daemon-cli: Mininglamp-OSS/octo-daemon-cli#26
- octo-web: Mininglamp-OSS/octo-web#294

合并顺序建议: **本 PR (server) 先 merge** — 它是 verify-api-key endpoint 的 base,
matter/fleet middleware 依赖它.

## 4. 本 repo review 重点

### Phase 1: `POST /v1/auth/verify-api-key` (modules/user/api.go)

- 跟现有 `/v1/auth/verify` + `/v1/auth/verify-bot` 同 group, 复用 verifyLimit
- SQL 两步走: `user_api_key WHERE api_key=? AND space_id!=''` (显式拒 legacy)
+ `space_member WHERE status=1` 校验 owner 还在 space
- 失败统一返 401 `{"msg":"invalid api_key"}` (不细分 revoked/not-found)
- 7 单测覆盖: valid / unknown / owner-left-space / nonactive-status /
legacy-empty-space / missing-field / multi-space

### Phase 4: `modules/auth_jwt` 模块清理 → 改名 `bot_provision`

- 删 JWT 签发 + JWKS endpoint + token exchange + Claims 全套
- `bot_api.go` 的 `botToken` handler 改成验 api_key (复用 `resolveAPIKey`),
不再验 daemon JWT
- 模块改名 `auth_jwt` → `bot_provision` (内容只剩 mintBot + botToken, 名字误导)

## 5. 重要决策 — issue #75 推迟到决策四

(本 PR 不涉及, 详情见 matter PR)

## 6. e2e 验证

- 本地 5 服务跑通: daemon 心跳 + web runtime 列表 + create bot + bot 私聊响应
- 单测 (核弹 test DB 后): `go test -run TestAuthVerifyAPIKey ./modules/user/`
→ 7/7 PASS

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary

- 4 件 v3.3.6 §7 follow-up nit (empty Bearer guard / dead code 删 / 测试
dedup / 注释精确化)
- 顺手删 JWT-era 410 Gone stub (`/.well-known/jwks.json` +
`/v1/auth/token`) + helper + 2 个 test — Phase 4 删 JWT 后 PoC 阶段无真 caller,
compat for nonexistent client 是 dead-code-by-construction
- **0 SQL 改动** — 不动 ban 语义, 保持 runtime 分支跟 main 一致 ("ban user ≠ ban
user's bots")
- **不影响 main** — 5 处全在 runtime-only 文件 (bot_provision 模块 /
verify-api-key test / zz file 都是 runtime 分支引入的)

净 diff: 5 files, +18 / -117 (净 **-99 行**)

## Scope explicitly excluded (留独立 PR / 决策)

- **nit-8 account-ban gate**: R0-R3 review 期间扩到 6 处 SQL D11 闭环, 但跟 main
"ban 不牵连 bot" 语义冲突, caster 拍 revert
- **nit-4 JWT-shaped Bearer hint**: 同 410 stub 同形, 假设不存在 client, revert
- nit-5 / nit-7 / nit-9 / verifyCache sha256: 独立 PR

## Review trail

6 轮 codex + cc 并行 review (R1-R6), 最后双 pass 0 finding:
- R1-R3: D11 SQL 扩 1→6 处 (双确认)
- caster review: 拍板回 main 语义, revert SQL
- R4: revert 后双 pass
- caster 进一步原则: no compat for nonexistent JWT clients
- R5-R6: 删 nit-4 + 410 stub, 双 pass

## Test plan

- [x] `go build ./...` PASS
- [x] `go vet ./modules/user/... ./modules/bot_provision/...` PASS
- [x] `go test -count=1 -p 1 -run 'TestAuthVerify|TestBotToken'
./modules/user/... ./modules/bot_provision/...` PASS
- [x] `go test -race -count=1 -p 1` (same scope) PASS
- [ ] reviewer 跑 `go build ./... && go vet ./...` 确认绿
- [ ] reviewer grep verify: `gone410Handler` / `TestGone410` /
`"/.well-known/jwks.json"` / `"/v1/auth/token"` 在 bot_provision/ 应 0 hit

## Notes

- 4 个 `TestDestroyApply_*` 失败 (full-package run) 是 **pre-existing
flaky** (clean origin/feat/agent-runtime 也挂; redis login-locked state 跨
test 残留), 跟本 PR 无关
- v3.3.6 §P1 (commit 07ca7db, 已 merged 的 daemon api_key user.status=1
gate) 是 runtime 分支跟 main 在 ban 语义上唯一遗留分叉, 若想 100% 跟 main 对齐需要单独 revert
PR

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mand

## Summary

Migrate runtime daemon onboarding from the BotFather IM `/daemon` command
to a web HTTP endpoint. The `/runtimes` page's CreateRuntimeModal will
call this endpoint to render the install + start commands directly in
the browser, rather than asking users to switch to IM and message
BotFather.

## Changes

### Add: GET /v1/runtime-onboarding (session auth)
- New file `modules/botfather/api_runtime_onboarding.go` (~150 lines)
- Route registered in `Route()` under a session-auth group
- Returns JSON containing:
  - `api_key` (lazy-create on first call, mirrors prior /daemon
    behavior — same `(uid, space_id)` row, same uk_* prefix)
  - `space_id`, `server_url`, `fleet_url`, `matter_url`
  - `commands.install` / `commands.start` — pre-rendered command
    strings for direct display
  - `env_vars` map — for users who want to set OCTO_*_URL in their
    shell rc instead of inlining per-invocation
- `space_id` resolution: X-Space-Id header > ?space_id= query > 400.
  No implicit "first space" fallback — would let caller end up with
  api_key bound to a space they didn't intend.
- `getOrCreateUserAPIKey` helper extracted from the now-deleted
  handleDaemon. INSERT race tolerant (unique constraint on
  (uid, space_id) — loser re-reads).
- `deriveOnboardingURLs` helper extracted from same.

### Remove: BotFather /daemon command
Following the same principle as PR #334 (no compat for nonexistent
clients post-Phase 4), removing rather than deprecating:
- `modules/botfather/const.go:46` — `CmdDaemon` constant
- `modules/botfather/api.go:308` — command list entry
- `modules/botfather/command.go:174-175` — `case CmdDaemon` dispatch
- `modules/botfather/command.go:489` — help text line
- `modules/botfather/command.go:1190-1258` — `handleDaemon` function
  (70 lines, behavior reproduced in the web handler)

`deriveServiceURL` (still in command.go) is preserved — both the new
web handler's `deriveOnboardingURLs` and any future caller share it.

## Verification
- `go build ./...` PASS
- `go vet ./modules/botfather/...` PASS
- Pre-existing test flake unrelated to this PR: `botfather` package
  test setup panics with `unknown migration: 20201222000001_report_
  legacy01.sql` from a polluted local test DB. Verified on clean
  `origin/feat/agent-runtime` — same panic, not introduced here.
  Fixable via `bootstrap.sh --fresh` (drops + re-creates test DB).

## Out of scope (separate PR)

octo-web `/runtimes` page restructure:
- Sidebar `Runtimes` → `运行时`
- Top-level Runtime/Bot tabs removed; tree view: device → runtime → bot
- Top-right `+` popover: 创建 Runtime / 创建 Bot
- New `CreateRuntimeModal.tsx` calls this endpoint

Frontend PR follows once this server PR is merged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cc round 1 verdict=fail 7 finding 2 P0 / codex round 1 verdict=fail 3 finding
2 P0. 双方共识 P0 + 单方 codex P0 全修.

- C1 (双方 P0): 缺 SpaceMember + user.status=1 校验. 接 spacepkg.CheckMembership
  (已 join space.status=1 + space_member.status=1) + 二重 user.status=1 SQL
  gate (D11 撤销链路审计: liftBanUser 走 redis token 已堵 session 路径,
  这里 belt-and-braces 防 banned user lazy-create dead api_key row).
  query 优先 / header 备选 (X-Space-Id / X-Space-ID 都吃) 跟 SpaceMiddleware
  顺序对齐.
- C2 (codex P0): commands.start 是 standalone multi-line block (含
  export OCTO_SERVER_URL + OCTO_FLEET_URL + OCTO_MATTER_URL + octo-daemon
  start 行), caller 复制即跑. env_vars 字段保留供想 set 到 shell rc 而
  非 inline 的场景.
- C3 (codex P0): QUICKSTART.md 删 BotFather /daemon 引导, 改成 Web
  Runtimes 页 + popover 创建 Runtime.
- C4 (cc): deriveOnboardingURLs 拼出 'http://:8090' (External.BaseURL +
  External.IP 都空) 时 handler 报 500 + log, 不再给前端展示 broken
  command.

cc 提的 minor 推 PR-3 (errcode 风格 / 单测). cc 提的 X-Space-Id 大小写 /
header-query 顺序由本 PR 自带兼容消化 (双 case 都吃, 跟 SpaceMiddleware
对齐).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cc round 2 verdict=pass / codex round 2 verdict=pass, 0 pre-push blocker.
顺手清 cc 提的 3 个 minor 提高 PR review 通过率.

- F1 (cc): 删第二个 c.GetHeader('X-Space-ID') dead branch — Go HTTP
  header 大小写不敏感 (textproto.CanonicalMIMEHeaderKey), 一次查询
  X-Space-Id / X-Space-ID 都吃, 第二次永远不会比第一次多查到. 改注释
  显式说明.
- F2 (cc): 删 bf.ctx.DB().NewSession(nil) 冗余调用, 改成 bf.ctx.DB()
  跟 SpaceMiddleware (pkg/space/middleware.go:102) 调用风格对齐.
- F4 (cc 提的 question): octo-daemon start --api-url 改回 serverURL
  跟旧 handleDaemon 一致. daemon-cli 内部 cfg.APIURL 只是 env (OCTO_FLEET_URL/
  OCTO_SERVER_URL) 缺失时的 fallback, 我们 commands.start 已经全设
  了 export, 实际 cfg.APIURL 不会被读到, 改 fleetURL/serverURL 行为
  等价 — 但语义上跟旧版对齐避免 future drift. 注释改写.

cc 的 F3 (CheckMembership + user.status 拆两次 vs verify-api-key 一条
SQL JOIN sm+s+u 简洁版) / F6 (broken URL 检测覆盖度盲区) 推 PR-3 (cosmetic).
codex 的 C4 (broken URL 漏 'http://' / 空白 IP case) 同 F6 推 PR-3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndpoint

feat(runtime-onboarding): web endpoint 替代 BotFather /daemon 命令
daemon 0.0.3 二进制不读 OCTO_MATTER_URL (它不直接调 matter), 且 matter
当前未上线; 启动命令只下发 OCTO_SERVER_URL + OCTO_FLEET_URL 两个 export.
顶层 matter_url 响应字段保留, 不影响契约.
caster-Q and others added 9 commits June 15, 2026 15:14
QUICKSTART "Connect an AI Agent" still showed the old
`go install ...octo-daemon-cli@latest`; align it with the
runtime-onboarding API (now `npm install -g @mininglamp-oss/octo-daemon`)
so the in-repo docs and the served command match. Note Node.js >= 18.
fix(onboarding): npm daemon install + drop OCTO_MATTER_URL from start
chore: sync origin/main into agent-runtime
## 问题
合入 main 后,`user_api_key` 表新增 `status`(1=active/0=revoked)与
`client_id`(默认 botfather)两列以支持吊销与第三方 integration client 维度。但 runtime 的两条
daemon 鉴权直查是在加列之前写的,只匹配 `api_key + space_id`,**未过滤 status/client_id**:

- 被吊销的 key(status=0)仍能通过 daemon 鉴权;
- 原生 daemon 路径可能接受第三方 integration client key(client_id ≠ botfather)。

属语义缺口,`go build` 抓不到。

## 改动
两条查询同步补 `status=1 AND client_id='botfather'`(绑定参数):
- `modules/user/api.go`
`authVerifyAPIKey`(/v1/auth/verify-api-key,fleet/matter 调用)
- `modules/bot_provision/resolve.go` `resolveAPIKey`(daemon→bot_token)

字面量 `1`/`'botfather'` 对应列语义;botfather 包常量未导出,且 `user` 无法 import
botfather(`botfather → user` 循环)。两处 SQL 旁加注释说明。

**不在范围**:`modules/integration/db.go` 的 Update 是 integration 自身管理 key
的写路径,非 daemon 鉴权读。

## 测试
两条路径各加回归用例:
- 吊销 key(status=0)→ 401
- legacy 明文非 botfather client(client_id≠botfather)→ 401(刻意 direct-insert
明文,使 `api_key` 仍能命中,**仅靠 client_id 过滤拒绝**,证明过滤生效)
- 正常 botfather key(status=1)→ 仍 200

`modules/user`、`modules/bot_provision` 全量测试通过;`go build ./...` + `go
vet` 通过。
…ses (#386)

## What
Migrate the runtime error surface to the registered error-code i18n
framework.

- bot_provision: all 13 hardcoded error sites -> registered codes +
httperr.ResponseErrorL / ResponseErrorLWithStatus, preserving each
site's wire status; daemon credential failures collapse to one
anti-enumeration 401; internal failures route through the shared
internal code without leaking detail.
- user (verify-api-key): the two raw 401 responses -> status-preserving
ErrUserAPIKeyInvalid (anti-enumeration, reason logged), clearing the
lint-direct-error-response (D23) regression on user/api.go.

## Tests / gates
- New i18n tests: source guard, zh parity, envelope + wire-status
contracts.
- Green: make i18n-extract-check, lint-unregistered-code,
lint-direct-error-response; full bot_provision + user module tests pass.
…362)

The /daemon onboarding flow (now GET /v1/runtime-onboarding) used to
instruct users to `export OCTO_MATTER_URL`, but the octo-daemon binary
never reads that env — it reaches matter through the fleet/server
endpoints. The instruction was a no-op that caused bot-provisioning
friction.

The runnable command block and env_vars already omit OCTO_MATTER_URL,
but nothing guarded it against reintroduction. Extract the command/env
builders into pure functions (buildDaemonStartBlock, daemonEnvVars) and
add hermetic regression tests asserting the daemon-facing instructions
expose exactly OCTO_SERVER_URL + OCTO_FLEET_URL and never
OCTO_MATTER_URL.

No behavior change; response shape unchanged.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Instrument the bot reply delivery leg (sendMessage -> dispatchMsgSendReq),
the server<->channel (WuKongIM) boundary of an agent turn:

- dmwork_agent_turn_delivery_duration_seconds (HistogramVec) — delivery latency
- dmwork_agent_turn_delivery_total{result} (CounterVec) — result=error counts
  failed agent-turn deliveries

Metrics register on the default registry via promauto (consistent with
modules/oidc/metrics.go) so the existing /metrics scrape exposes them.
Labels kept low-cardinality: channel_type{person|group|topic|other} x
result{ok|error}, pre-warmed to zero series for stable dashboards.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
The existing /v1/health already probes db + redis. Add a WuKongIM
reachability check so readiness reflects the channel/IM dependency the
agent turn relies on. Any HTTP response (even non-2xx) counts as
reachable — the probe asserts the IM API answers on the network, not that
a specific health route exists — so readiness isn't coupled to a WuKongIM
version/path. Only a transport failure marks im=down. 2s timeout bounds a
hung IM control plane; unconfigured APIURL reports im=unknown without
failing overall readiness. HTTP stays 200 (aggregate in body), preserving
existing consumer behavior.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@lml2468 lml2468 requested a review from a team as a code owner June 17, 2026 16:02
@github-actions github-actions Bot added the size/XL PR size: XL label Jun 17, 2026

@mochashanyao mochashanyao left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Octo-Q · automated review]

Verdict: Approve — no blocking findings; notes below (data-flow traced).


PR #407 Review Report — OCT-17 Observability + Bot Provisioning + Auth Extensions

Reviewer: Octo-Q (automated review)
Head SHA: dfea4c5613e9fdb3dd82c4014538c77d8f9cf9bb
Repo: Mininglamp-OSS/octo-server
Scope: security_sensitive (new auth endpoints, credential handling, authorization gates)


1. Verification Summary

Area Status Evidence
Agent-turn metrics (metrics.go) Low-cardinality labels, init pre-warm, correct data flow from req.ChannelType uint8 (send.go:27) → agentTurnChannelTypeLabel() switch with default "other"
Health IM probe (common/api.go) Config-sourced URL (not user input → no SSRF), 2s timeout, empty→"unknown", transport-fail→"down"
Bot mint (bot_provision/bot_api.go) Session auth → assertSpaceMember (3-way JOIN: space.status=1 + user.status=1 + sm.status=1) → MintBotOBO
Bot token (bot_provision/bot_api.go) api_key Bearer → resolveAPIKey (status=1 + client_id='botfather') → assertSpaceMember → SQL cross-space filter + creator check
Auth verify-token (user/api.go:4011) OwnedBots UNCONDITIONAL (v3.3.5 fix), ?include=context fail-secure (empty lists, not 500)
Auth verify-api-key (user/api.go:4333) 3-step: key lookup → 3-way JOIN membership → opt-in ownedBotsBySpace
Runtime onboarding (botfather/api_runtime_onboarding.go) Session auth → CheckMembership → user.status=1 gate → getOrCreateAPIKey → URL sanity check
SQL injection safety All new queries use parameterized ? placeholders via dbr
Anti-enumeration (bot_provision) All credential failures collapse to ErrBotProvisionAuthFailed 401
Rate limiting (botToken) 1000 req/min/IP "verify" bucket, appropriate for credential-touching endpoint
i18n error codes 5 new bot_provision codes + ErrUserAPIKeyInvalid registered with zh-CN/en-US translations

2. Findings

F1 — P2: Hardcoded MinIO credentials committed to tracked config

File: configs/tsdd.yaml (new block added by this PR)
Lines: PR diff +135–144

fileService: "minio"
minio:
  url: "http://127.0.0.1:9000"
  accessKeyID: "admin"
  secretAccessKey: "12345678"

Diff-scope: NEW — this PR adds the block. The file is tracked in git (not in .gitignore). The inline comment acknowledges "Production should set proper AccessKey/SecretKey via env vars, not commit creds" but the creds are committed.

Risk: admin/12345678 are MinIO's factory defaults. They only reach 127.0.0.1:9000 so no remote production exposure, but:

  • Anyone deploying from this config without overriding gets a publicly-known storage credential
  • Sets a precedent that weak creds in config are acceptable

Recommendation: Move the dev block to a gitignored configs/tsdd.dev.yaml (add to .gitignore), or replace with env-var references (${MINIO_ACCESS_KEY}). Keep the commented-out template as documentation.

F2 — P2: New http.Client per health probe call (no connection reuse)

File: modules/common/api.go, imReachable() function

func imReachable(apiURL string) error {
    client := &http.Client{Timeout: imHealthTimeout}  // new client each call
    ...
}

Diff-scope: NEW — introduced by this PR.

Impact: Each health probe creates a new http.Client with default Transport (no connection pooling). For a health endpoint probed every 10–30s by orchestrators, this means a fresh TCP+TLS handshake per probe. Not a correctness issue but wasteful.

Recommendation: Use a package-level var imHealthClient = &http.Client{Timeout: imHealthTimeout} or configure a custom Transport with MaxIdleConnsPerHost: 1.

3. Suggestions

  1. F1 fix: Move dev MinIO creds to gitignored config or env-var interpolation. Low effort, high hygiene value.
  2. F2 fix: Hoist the http.Client to package level. One-line change.
  3. Nit: The health endpoint's statusMap["error"] = lastError.Error() (pre-existing pattern, not this PR) can leak internal hostnames/connection strings to unauthenticated callers. Consider a future PR to redact the error detail from the public response while keeping it in the log.

4. Additional Observations

  • MintBotOBO friend-add is best-effort (mint_obo.go:59-65): AddFriend failures are logged-and-continue. This is intentional (bot is usable without friend link) but worth noting that a minted bot may lack the bidirectional friend relationship in edge cases.
  • botToken SQL no-rows → 500 not 404 (bot_api.go:155): When SelectBySql().Load(&r) finds no matching robot, dbr returns an error → 500. The r.BotToken == "" 404 path (line 160) is unreachable for the no-rows case. Functionally correct (caller gets an error either way) but the status code is imprecise. Pre-existing dbr pattern, not a blocker.
  • deriveServiceURL (command.go:1137): Simple port-swap helper, correctly handles URLs with and without trailing port. No edge-case issues found.

5. Data-Flow Tracing

Consumer Upstream Source Verified Flow
observeAgentTurnDelivery(req.ChannelType, ...) req struct field ChannelType uint8 (send.go:27), JSON-bound from HTTP body ✅ → agentTurnChannelTypeLabel() switch maps to {person,group,topic,other}, default catches all unknowns
imReachable(imURL) cn.ctx.GetConfig().WuKongIM.APIURL (config, not user input) ✅ → TrimSpace → empty check → GET with 2s timeout
assertSpaceMember(uid, spaceID) mintBot: c.GetLoginUID() + req.SpaceID; resolveAPIKey: user_api_key.uid + user_api_key.space_id ✅ → 3-way JOIN filters space.status=1, user.status=1, sm.status=1
botToken SQL callerSpace resolveAPIKey()user_api_key.space_id (DB-sourced, not user input) ✅ → parameterized query, cross-space filter enforced
authVerifyToken OwnedBots robot table JOIN user on robot_id=uid, filtered by creator_uid=resp.UID + status=1 ✅ → UNCONDITIONAL load (v3.3.5 fix), DB error → empty list (fail-secure)
authVerifyAPIKey Step 2 membership keyInfo.UID + keyInfo.SpaceID from Step 1 DB lookup ✅ → same 3-way JOIN pattern as assertSpaceMember
runtimeOnboarding spaceID c.Query("space_id") or c.GetHeader("X-Space-Id") (user input) ✅ → validated by CheckMembership + user.status=1 before any side-effect

6. Blindspot Checklist (R5 — security_sensitive)

C1 — Dual-path parity: ✅ CLEAR

  • add/remove: No symmetric pair operations in this PR
  • create-gate ↔ execute-path: mintBot checks space membership before MintBotOBO; botToken checks api_key + space + creator; runtimeOnboarding checks membership + user.status before lazy-create of api_key. All authorization gates are symmetric between creation and execution paths.
  • Parallel implementations: assertSpaceMember (bot_provision/resolve.go) and the inline 3-way JOIN in authVerifyAPIKey (user/api.go:4385) use identical SQL patterns (space.status=1 + user.status=1 + sm.status=1). No drift.

C2 — Control-flow ordering / nesting reuse: ✅ CLEAR

  • observeAgentTurnDelivery called exactly once per sendMessage — no double-counting risk
  • deriveServiceURL called twice (fleet + matter) with different port suffixes — no nesting
  • imReachable called once per health check — no re-entry
  • No security controls (escape/sanitize/regex) introduced that need non-canonical-form testing

C3 — Authorization boundary ≠ capability boundary: ✅ CLEAR

  • mintBot: session auth (web) → space membership gate → MintBotOBO. Only authenticated web users who are space members can mint.
  • botToken: api_key Bearer → resolveAPIKey (status=1 + client_id='botfather') → space membership + creator check. Only the bot's creator with a valid daemon key can pull the token.
  • runtimeOnboarding: session auth → space membership + user.status=1 → lazy-create api_key. Banned users blocked.
  • authVerifyAPIKey: internal endpoint, api_key validation → 3-way JOIN. No unauthorized path.

C4 — Authorization lifecycle / container-member state cascade: ✅ CLEAR

  • assertSpaceMember checks all three levels: space.status=1 (container), sm.status=1 (membership), user.status=1 (account ban). A disabled space cascades to deny all member operations. A banned user cascades to deny all api_key/mint operations.
  • authVerifyAPIKey Step 2 uses the same 3-way JOIN — symmetric coverage.
  • queryUserSpaceContext spaces query JOINs space.status=1 — disabled spaces excluded from context.
  • No gap found where a parent container's disabled state fails to cascade.

7. Cross-round Blocker Recheck (R6)

N/A — first review of this PR.


[Octo-Q] verdict: APPROVED

No P0/P1 findings. Two P2 items (hardcoded dev credentials in tracked config, per-call http.Client allocation) are hygiene/efficiency improvements that should be addressed but do not block landing. All authorization chains verified complete across the new bot provisioning, auth verification, and runtime onboarding surfaces. C1–C4 blindspot checks all clear.

@yujiawei yujiawei left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — PR #407 (octo-server)

Reviewed at head dfea4c5613e9fdb3dd82c4014538c77d8f9cf9bb. Scope: 5 files / +268 lines (modules/common/api.go, modules/bot_api/metrics.go, modules/bot_api/send.go, plus tests). This PR is security-sensitive (it touches an unauthenticated endpoint and adds an outbound network probe), so I looked at it carefully.

Verdict: Approve. No correctness, data-loss, or build-breaking issues. go build, go vet, and all new tests pass locally. The two new metrics are correctly wired through the single delivery chokepoint, label cardinality is bounded, and the health probe is timeout-bounded and fails safe. The items below are non-blocking — one info-disclosure point I'd like a maintainer to consciously accept, plus three minor robustness/test nits.

What's good

  • Metrics design is sound. delivery_duration_seconds / delivery_total are labeled channel_type{person|group|topic|other} × result{ok|error} — fully bounded, no robotID/channelID to explode Prometheus memory. The default -> "other" collapse is fuzz-safe and unit-tested. init() pre-warms the 8-series matrix so dashboards can distinguish "zero deliveries" from "metric absent". Namespace/subsystem/buckets are consistent with the existing dmwork_http_* and modules/oidc/metrics.go patterns.
  • Instrumentation point is correct. observeAgentTurnDelivery(req.ChannelType, elapsed, err) is recorded immediately after dispatchMsgSendReq (the single caller), before the if err != nil branch — intentional and right. The error classification captures both transport failures and non-2xx IM responses as result=error.
  • Health probe fails safe. Unconfigured WuKongIM.APIURLim=unknown without failing overall readiness; only a transport-level failure marks im=down. The 2s http.Client{Timeout} bounds a hung IM control plane so it can't stall the handler (or a liveness probe behind it). Treating any HTTP response (incl. 4xx/5xx) as reachable correctly decouples readiness from a specific WuKongIM route/version.

Non-blocking findings

[P2 — please have a human accept] Unauthenticated /v1/health discloses the internal IM address on probe failure

modules/common/api.go/v1/health is registered via raw r.GET(...) (no auth middleware), and on a transport failure the handler returns statusMap["error"] = lastError.Error() in the JSON body. The imReachable transport error embeds the configured WuKongIM.APIURL (e.g. Get "http://10.x.x.x:5001": dial tcp ...: connection refused), so an unauthenticated caller can learn internal control-plane host:port / DNS topology.

Important context: this is not a new vulnerability class — the parent commit already returned lastError.Error() for db.Ping/redis.Ping failures, which leak the DB/Redis host:port the same way. This PR adds one more internal endpoint (IM) to an already-leaky surface. I also confirmed no secret leaks: WuKongIM.ManagerToken is a separate config field sent via an HTTP header at the real call sites and is never part of APIURL.

Because it's incremental rather than novel, I'm not blocking on it — but since this PR is security-sensitive and it does widen the disclosure surface by one endpoint, a maintainer should consciously accept it. The clean fix (which also closes the pre-existing db/redis case): keep the per-component up/down/unknown fields in the public body but drop the raw lastError.Error() string, logging it server-side only (the handler already does cn.Error(...)).

[P2] imReachable follows redirects, partially contradicting "any HTTP response = reachable"

modules/common/api.goimReachable uses the default http.Client (no CheckRedirect), which follows up to 10 redirects. The doc comment says any HTTP response counts as reachable, but a 3xx from the IM root is followed rather than counted; if the redirect target then fails transport, the probe reports im=down even though WuKongIM answered. The tests cover 200/404/500 but not a redirect. Low likelihood (the IM root is a flat API server) and it fails closed, so this is a robustness nit — consider client.CheckRedirect = func(*http.Request, []*http.Request) error { return http.ErrUseLastResponse } to make the "first response wins" intent explicit.

[P2] Health probe doesn't propagate the request context

modules/common/api.go — the probe uses http.NewRequest rather than http.NewRequestWithContext(c.Request.Context(), ...). If the health caller disconnects, the outbound IM probe still runs to the 2s timeout. Bounded and minor (and a global rate limiter already guards the endpoint), but threading the request context would let it cancel early under health-check floods.

[P2 — test quality] TestAgentTurnDeliveryTotal_AllLabelsPrewarmed doesn't assert what its name claims

modules/bot_api/metrics_test.go — the test calls GetMetricWithLabelValues, which lazily creates the series if absent, so it would pass even if init() stopped pre-warming the 8 combinations. It validates label well-formedness, not prewarming. To actually assert prewarming, check testutil.CollectAndCount(metricAgentTurnDeliveryTotal) == 8 (or assert each series is present) before any observe* call runs.

Coverage notes / blind spots

  • I verified build, vet, the new tests, the WuKongIM.APIURL config field, the ChannelType* constants, and that dispatchMsgSendReq has a single caller. I did not exercise the probe against a live WuKongIM, nor confirm how octo-deployment k8s manifests consume /v1/health (whether the JSON body is parsed or only the 200 status is read) — that drives the real-world impact of the info-disclosure point above.
  • Latency semantics: the histogram measures the outbound dispatch leg only, not the end-to-end human→bot→human turn. That matches the stated OCT-17 scope and the PR description's rationale (no reliable server-side correlation key across the two async HTTP legs), so it's a deliberate, documented choice — just worth keeping in mind when naming dashboards.

@yujiawei yujiawei left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — PR #407 (octo-server)

Reviewed at head dfea4c5613e9fdb3dd82c4014538c77d8f9cf9bb. This is a full-scope review of all 36 changed files (≈3.1k additions), correcting an earlier review that only covered the OCT-17 observability slice (5 files) and missed the bot-provisioning / cross-service auth surface that makes up the bulk of this PR.

Verdict: Request changes. One P1 (a ban-revocation gap on the user-token verification path) and one credential-hygiene item I'd want a human to sign off on, plus several non-blocking nits. The auth surface is, on the whole, carefully built — the items below are specific and each has a small fix.

What's solid

  • Cross-space / IDOR gates on the daemon paths are correct. GET /v1/bot/:uid/token only returns a token for a bot that is both a member of the caller api_key's bound space (INNER JOIN space_member sm ON sm.uid=r.robot_id AND sm.space_id=? AND sm.status=1) and created by the caller (r.creator_uid == callerUID); a guessed botUID yields 403/404 with no existence leak.
  • Anti-enumeration is sound. Credential failures on the daemon path collapse to single codes (ErrBotProvisionAuthFailed / BotForbidden / BotNotFound). I confirmed the dbr semantics: botToken uses .Load() (multi-row), which returns (0, nil) on zero rows — so the no-rows case correctly falls through to the r.BotToken == "" → 404 branch (it does not 500). The 404 path is reachable and correct.
  • Caller-supplied bot_token cannot hijack an existing bot. robot.bot_token carries a functional UNIQUE index on NULLIF(bot_token,''), so a colliding INSERT fails; createBotCoreWithRetry holds the token constant across retries, so a collision errors out rather than overwriting a victim's row.
  • Parameterized SQL throughout; the only string concatenation is static identifier quoting (`user`), never user input.
  • owned_bots_by_space whitelist holds. The token path filters bot rows through validSpaces built from the caller's own active spaces, and the api_key path scopes via SQL AND sm.space_id=? — neither leaks a foreign space_id.

Blocking

[P1] POST /v1/auth/verify does not gate user.status, so a banned user keeps gateway access until token TTL

modules/user/api.goauthVerifyToken resolves the session token purely from the Redis cache (Cache().Get(TokenCachePrefix + token)) and builds the full response — identity, OwnedBots, and (with ?include=context) spaces + owned_bots_by_space + context_included — with no user.status=1 check.

The ban path does not evict that cache. Admin ban (liftBanUser, modules/user/api_manager.go) sets user.status=0, bans the IM channel, and calls QuitUserDevice(uid, -1). I traced QuitUserDevice in the pinned dependency: it issues a single POST to the IM control plane's /user/device_quit and touches nothing else. The Redis key authVerifyToken reads (token:{token}) is deleted only by revokeAllDeviceTokens, which is called only from the account-deletion path (deleteAdminUsers) — never from ban. So after a ban, the user's session token remains valid in the cache (default TTL is long), and this endpoint keeps returning a full, populated authorization context for a banned account.

Why this is in-scope for this PR rather than pre-existing noise: the PR's two new sibling credential endpoints — authVerifyAPIKey and runtime-onboarding — both explicitly add a user.status=1 gate, with comments stating they are closing exactly this account-ban bypass. Their comments also assert that the session-token path is already covered because "QuitUserDevice clears the redis token cache." That premise is false (per the trace above), and the token path is the one path left ungated. The PR builds its new cross-service authz primitives (owned_bots_by_space, context_included) on this ungated path, so it extends the gap rather than merely inheriting it.

Fix: mirror the sibling pattern — gate the resolved info.UID on user.status=1 in authVerifyToken (and/or add the INNER JOIN user u ... u.status=1 to queryUserSpaceContext's spaces query) so a banned user fails closed here as it already does on the api_key path. If the downstream gateway consumers in fact re-check ban state and this is intentional, please say so explicitly and remove the now-incorrect "clears redis token cache" comments — but as written, this is a fail-open revocation gap on a credential-verification endpoint in a security-sensitive PR.

Human verification requested (this PR carries needs-human-review): please confirm the intended ban-revocation contract for the token gateway path before merge.

Non-blocking (should fix, not gating)

[P2 — please have a human security owner confirm] Committed MinIO credentials in the default config

configs/tsdd.yaml uncomments and activates fileService: "minio" with accessKeyID: "admin" / secretAccessKey: "12345678", and this file is the default -config path (main.go). Mitigating factors: the endpoint is loopback (http://127.0.0.1:9000), the block is labeled dev-only, and the values are env-overridable (TS_-prefixed AutomaticEnv is wired, so TS_MINIO_SECRETACCESSKEY overrides). Still, committing live credentials to a public repo is a hygiene line worth not crossing, and the in-file comment ("Production should set proper creds via env vars, not commit creds") argues against the change itself. Recommend re-commenting the block or switching to env-var references before merge. The same hunk flips external.ip from auto-detected to hardcoded localhost, which would feed http://localhost:8090 to onboarding output unless External.BaseURL is set — worth a second look.

[P2] POST /v1/bot/mint has no per-UID rate limit

modules/bot_provision/bot_api.go mounts mint behind AuthMiddleware only, while the sibling bot-create route uses SharedUIDRateLimiter and even this PR's own /v1/bot/:uid/token got a StrictIPRateLimit. Each mint amplifies into ~5 DB writes + IM app creation, so an authenticated space member can script bot/space_member churn (bounded only by the global per-IP floor). Add SharedUIDRateLimiter to match the convention.

[P2] Caller-supplied bot_token has no format/entropy floor

mintBot accepts req.BotToken verbatim when non-empty (no bf_ prefix / length / entropy check). It can't hijack another bot (UNIQUE index), but it lets a caller register a low-entropy bearer value in the global bot-token namespace that authVerifyBot trusts unscoped. Prefer server-only generation, or enforce the bf_ prefix + entropy.

[P2] client_id='botfather' / status=1 duplicated as literals across 3+ sites

resolve.go, authVerifyAPIKey, and the issuance default all hard-code the daemon-gate literals (forced by a real import cycle). Not exploitable today, but a drift risk on the gate that keeps integration-client keys off the daemon path — a single exported constant would remove the class.

[P2] runtime-onboarding uses raw error responses and isn't in the legacy-response guard list

modules/botfather/api_runtime_onboarding.go uses c.ResponseErrorWithStatus(errors.New(...)) at ~8 sites (the repo's envelope/i18n contract prefers httperr.ResponseErrorL + registered errcode), and the new file isn't added to TestBotfatherNoLegacyResponseError's file list, so future edits here evade the guard. Strings are generic (no info leak today); this is a contract/consistency fix.

Build / test / coverage notes

  • go build and go vet pass for all changed packages. The new observability tests pass. The bot_provision / user auth tests require a live MySQL/Redis (they panic in testutil.NewTestServer on DB/migration setup here) — consistent with the PR's stated "CI needs MySQL/Redis/WuKongIM"; I could not execute them in this environment.
  • Out-of-repo blind spots that bound the P1's real-world blast radius: the fleet/matter consumers of owned_bots_by_space / context_included live in other repos, and the nginx internal-IP ACL the code cites as the primary control for the verify-* / token endpoints is in the deployment repo. The server-side gate is what I can assess, and it is missing on the token path.

@Jerry-Xin Jerry-Xin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is relevant to octo-server, but it contains blocking regressions outside the stated observability scope.

🔴 Blocking

  • 🔴 Critical: configs/tsdd.yaml enables local-only deployment settings in a committed config: external.ip: "localhost" and MinIO as the active fileService with admin / 12345678 credentials at lines 139-145. This changes runtime behavior for anyone using this config and commits local PoC credentials. Keep this as commented documentation or move it to a local/example override file.

  • 🔴 Critical: modules/botfather/api_runtime_onboarding.go introduces a new user-facing endpoint that returns legacy raw errors via c.ResponseErrorWithStatus at lines 66, 79, 93, 97, 107, 111, 124, and 134. This violates the repository’s localized error-envelope contract for new endpoints. The existing botfather source guard also does not include this new file: modules/botfather/api_i18n_test.go. Add registered errcodes/helpers and include the new handler file in the guard.

💬 Non-blocking

  • 🟡 Warning: authenticated new routes are missing the shared UID limiter. /v1/runtime-onboarding is mounted with only AuthMiddleware in modules/botfather/api.go, and /v1/bot/mint is mounted similarly in modules/bot_provision/bot_api.go. Project guidance says authenticated routes should mount SharedUIDRateLimiter after auth.

  • 🟡 Warning: gofmt -l reports unformatted Go files: modules/botfather/api_runtime_onboarding.go, modules/bot_provision/bot_api.go, modules/user/api.go, and modules/user/api_verify_apikey_test.go.

✅ Highlights

  • The agent-turn metrics use bounded labels and prewarm low-cardinality series.
  • The IM health probe is additive and has a timeout, which keeps the health endpoint from hanging indefinitely.

I also tried go test ./modules/bot_api ./modules/common; it failed during test server setup with an existing migration-state error (unknown migration in database), so I could not validate those packages end to end in this checkout.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants