feat(observability): health IM boundary + agent-turn latency/error metrics (OCT-17)#407
feat(observability): health IM boundary + agent-turn latency/error metrics (OCT-17)#407lml2468 wants to merge 39 commits into
Conversation
…time
迁移 dmworkim 老仓 feat/agent-runtime 的 runtime 模块到 octo-server,
按简化版策略只做编译必需的最小适配:
- modules/runtime/ 12 文件(api/db/model/enrich/upgrade/version_sync/1module + 4 SQL)
- import 路径:dmwork-org/{dmworkim,dmwork-lib} → Mininglamp-OSS/{octo-server,octo-lib}
- /daemon 命令安装 URL → Mininglamp-OSS/octo-daemon-cli
- internal/modules.go 注册 runtime 模块
- modules/botfather/{const,command,api}.go surgical apply CmdDaemon + handleDaemon
业务字面(channel id "dmwork"、route prefix "dmwork/<uid>"、log/注释里的 dmwork)
有意保留,等本地端到端测试阶段统一修(与 openclaw-channel-octo channel id 改名对齐)。
Build: go build ./... 通过
Tests: enrich_test 等含 dmwork 业务字面的测试会挂,留给测试阶段
Match the daemon-cli and openclaw-channel-octo changes: - enrich.go: channel id 'dmwork' → 'octo' (route prefix matching for bot info JOIN in user × robot × space_member); the rename function collectDmworkUIDs internal name kept as-is for now - enrich_test.go: test data updated - api.go: plugin_has_update match on 'octo' (was hardcoded 'openclaw-channel-dmwork') - upgrade.go: const componentPlugin = 'octo' - model.go + sql migration comment: update doc strings Verified end-to-end locally: - go test ./modules/runtime/... pass - Runtimes UI shows Octo Plugin 1.0.7 → 1.0.8 upgrade hint - Bot route enrich: avatar/name/online dot rendered correctly - Plugin upgrade flow (npx -y openclaw-channel-octo@dev install) end-to-end completed in 86 s with task=completed, daemon-reported version 1.0.8
- adapters/claude-code-dmwork/SKILL.md: install command uses ClawHub plugin id `octo` (was `openclaw-channel-dmwork`) - docs/2026-05-mention-all-chokepoint-audit.md: source-tree path references updated from `openclaw-channel-dmwork/src/inbound.ts` to `octo/src/inbound.ts` - docs/changelog.md: package-name reference in V2 conflict-fix note updated Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two heartbeat-pull dispatch pipelines under one schema: managed_runtime_agent — Web "Create agent + Add bot" UX (PoC1). Creates an openclaw agent workspace on the daemon and mints a bot in the same flow; status: provisioning → bot_minted → dispatched → active. Routes: POST /v1/runtimes/managed-agents POST /v1/runtimes/:runtime_id/agents/:agent_id/bots POST /v1/daemon/managed-agents/:id/ack botfather.MintBotOBO extracted from /newbot handler so non-IM callers can mint bots on behalf of a user. bot_task — matter @mention/assignee → agent dispatch (PoC2+3). Resolves bot_uid → openclaw agent_id via managed_runtime_agent, queues a row, daemon claims via heartbeat. Routes: POST /v1/internal/bot-tasks (X-Internal-Token, called by matter) POST /v1/daemon/bot-tasks/:id/ack ack handler writes back to matter: - timeline (bot reply content) - activity (agent_task_completed/failed with elapsed_ms + bytes) createBotTaskReq.prompt is an optional override so mention dispatch can supply a full continuation prompt with conversation history. heartbeat handler claims both pending_command (managed_agent) and pending_task (bot_task) per cycle. tsdd.yaml: enable MinIO for local dev (admin/12345678 against testenv-minio-1) so avatar/file uploads work — production should set proper IAM creds via env. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Foundation for cross-service trust chain laid out in spec
docs/superpowers/specs/2026-06-01-octo-fleet-extraction-design.md.
octo-server is now the canonical JWT issuer; fleet/matter will fetch
the public key from /.well-known/jwks.json and verify tokens locally
without service-to-service calls.
Implementation:
- modules/auth_jwt/jwt.go: RSA-2048 keypair loaded from or generated
into ~/.octo-server/jwt-priv.pem (override via JWT_PRIVATE_KEY_PATH).
kid derived from last-8-bytes of modulus so it stays stable across
restarts — JWKS consumers can cache long-term.
- modules/auth_jwt/resolve.go: token exchange has two paths — web
session via existing cache lookup, daemon api_key via user_api_key
table + space_member membership re-check.
- POST /v1/auth/token returns {token, expires_in, scope}; web=30min,
daemon=30days.
- GET /.well-known/jwks.json (Cache-Control: public, max-age=300).
Verified locally: api-key exchange produces a valid RS256 token with
the expected claims (sub=uid, scope=daemon, space_id, daemon_id, exp).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR-A.2 cross-service contract:
POST /v1/bot/mint — web session auth; wraps botfather.MintBotOBO
and returns ONLY {bot_uid}. bot_token stays
in server's robot.bot_token column.
GET /v1/bot/:uid/token — daemon JWT auth (scope=daemon); returns
{bot_token, bot_uid} after asserting
robot.creator_uid == JWT.sub.
Together these let octo-fleet orchestrate bot lifecycle without ever
seeing bot_token: web mints (browser knows uid, not token), web tells
fleet about bot_uid, daemon later fetches bot_token directly from
server using its daemon-scope JWT. No fleet↔server HTTP.
E2E verified: valid daemon JWT returns the right bot_token; missing/
non-owned bot returns 404 (no info leak).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…R-A.2) Runtime/bot endpoints have moved to octo-fleet :8092. The module still loads and runs schema migrations (no data loss), but Route() returns early so nginx-style 404s come back from server. Set LEGACY_RUNTIME_ROUTES=true to re-enable as a rollback path if fleet has problems during cutover. Code path is unchanged — just gated. Verified: - POST /v1/daemon/register → 404 (no route) - GET /v1/runtimes → 404 (no route) - POST /v1/auth/token → still works (auth_jwt module) - POST /v1/bot/mint → still works (auth_jwt module) - GET /v1/bot/:uid/token → still works (auth_jwt module) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…PR-C) Spec PR-C: server's runtime/bot orchestration code is dead weight now that octo-fleet has been stable through PR-A and PR-B. Server keeps only the cross-service contract surface (modules/auth_jwt): - /.well-known/jwks.json (JWKS for fleet/matter to verify tokens) - POST /v1/auth/token (session/api_key → JWT exchange) - POST /v1/bot/mint (web-callable bot OBO) - GET /v1/bot/:uid/token (daemon-callable bot_token lookup) Cleanup steps already taken locally (one-time, not in commit): DELETE FROM gorp_migrations WHERE id LIKE 'runtime-%'; -- the runtime_* tables themselves are harmless orphans and can be -- dropped manually whenever convenient; they don't affect anything. Boot smoke: - AuthJWT loaded key from disk (existing /Users/caster/.octo-server/jwt-priv.pem) - /v1/runtimes → 404 (gone) - /v1/bot/mint → 200 (new bot minted: 27w5OWCvvD2f63623a6_bot) - /v1/auth/token (api_key) → 200 JWT Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
100 commits 主要内容:
- pkg/i18n 大规模迁移(modules/{user,message,group,workplace,category} → ResponseErrorL)
- oidc RP-Initiated Logout (#217)
- voice → octo-speech adapter 替换 (#113)
- modules/bot_api / modules/user 重构
- 各种 CI workflow
冲突点预期:internal/modules.go (我们删了 runtime import,origin 可能新增 module)
# Conflicts:
# internal/modules.go
requireDaemonJWT only did signature verification via parsed.Claims;
expired tokens still mint bot_tokens via GET /v1/bot/:uid/token. Plan
§2 AU1 explicitly requires "JWT 过期 / kid 未知 / 签名失败 → 401" but
this code path enforced only the kid + signature halves.
Risk: a leaked daemon JWT past its 30-day TTL keeps working for any
bot the original owner had access to, since exp was never checked.
30-day window limits blast radius but is still a contract drift on
the most security-sensitive endpoint we expose.
Fix: chain cl.Validate(jwt.Expected{Issuer: jwtIssuer, Time: now})
after the signature check. Issuer pinning also blocks tokens minted
by a future second issuer (e.g. test fixtures, staging keys).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
917ff7c accidentally git-add'd a deletion of the /daemon BotFather command (45 lines across botfather/{api,command,const}.go). The deletion came from a stale working-tree state during merge resolution, not from any deliberate change — and the command is the ONLY user- facing onboarding path for the daemon: user sends /daemon in OCTO IM → BotFather generates (or reuses) user_api_key row in test DB → BotFather replies with a copy-paste install snippet → user pastes into their terminal → daemon starts + registers. Without this, new users have no way to get an api_key and no way to know the install URL pattern. Restored verbatim from origin/feat/agent-runtime, plus a follow-up update to the reply payload reflecting the PR-A/B 3-URL split: export OCTO_SERVER_URL=... export OCTO_FLEET_URL=... export OCTO_MATTER_URL=... export NOTIFY_INTERNAL_TOKEN=<ops-shared-secret> octo-daemon start --api-key uk_xxx --api-url <server> New helper deriveServiceURL() best-effort swaps the port suffix on server URL to compute fleet/matter URLs for the same host. Operators running fleet/matter on different hosts must override via env after copy-paste (call this out in the reply text). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the i18n merge, user/login writes sessions to Redis as
"v2:{"uid":"...","name":"..."}" but auth_jwt.resolveSession only
parsed the legacy "uid@name@..." triple-at format. Result: every
POST /v1/auth/token from the web returned 401 "malformed session
value", which the frontend treats as session-expired and bounces
the user back to the login page — so as soon as web tried to call
any fleet/matter endpoint requiring a JWT, the user got kicked.
resolveSession now detects the "v2:" prefix and JSON-unmarshals
the payload to extract uid; legacy format remains a working
fallback so nothing else breaks.
Independent of the DualAuth work — this bug was latent since the
i18n merge and surfaced when we ran the web through the new
Runtimes/智能体 tab which exercises the session→JWT exchange.
/daemon dispatch case + handler were registered correctly and the command works when typed, but handleHelp's hardcoded command list omitted it, leading users to think it had been removed. Adds the line right after /quickstart since both are runtime-setup commands. No behavioral change beyond what /help renders.
Two reviewer follow-ups bundled (both single-spot, low-risk): resolve.go: replace the hand-rolled "v2:" + struct-unmarshal mini decoder with pkg/auth.Decode. The pkg helper has been the canonical v2 envelope reader since the i18n rewrite (commit 76bc59a); having a parallel mini-decoder here meant any new v2 field would have to be mirrored in two places. resolveSession now just calls Decode and returns info.UID. Behavior unchanged — Decode already covers v2 JSON, legacy "uid@name[@ROLE]" fallback, and empty-payload error mapping. bot_api.go: cl.Validate(jwt.Expected{Issuer, Time: time.Now()}) was silently relying on go-jose's DefaultLeeway (1 min) for daemon ↔ server clock skew tolerance on exp/iat/nbf. Added a comment so the next person who reads this knows not to wrap a tighter leeway around it "for security" — that would cause spurious failures at mint time when daemon clock drifts a few seconds ahead.
Closes 2 cross-space privilege escalations found while documenting the auth flow: #1 (POST /v1/auth/token): resolveSession used to trust caller-supplied X-Space-Id with a comment saying "downstream re-checks". In practice plan AU3 has downstream (fleet/matter) trust JWT.space_id directly. Neither side actually checked → any logged-in user could mint a JWT for an arbitrary space and gain cross-space read/write on fleet/ matter endpoints. Now resolveSession calls assertSpaceMember on the spaceHint before returning. #2 (POST /v1/bot/mint): mintBot took space_id from body, never validated caller is in that space. MintBotOBO would then drop a bot into space_member of an arbitrary space. Attack: pull the new bot into a group → observe messages of a space you don't belong to. Now mintBot calls assertSpaceMember(uid, req.SpaceID) → 403 if not a member. Refactor: extracted the existing space_member SQL from resolveAPIKey into a shared assertSpaceMember helper so the daemon path (which always had this check) and the two new check sites share one implementation. No DB migration. No API contract change. X-Internal-Token and daemon api_key paths unaffected — they already validated this. Verified via curl: legit space → 200/200; fake space → 401/403.
## Summary - \`daemonTokenTTL\` 30 * 24h → 24h - 同步 bot_api.go 里过时的 "30-day TTL" 注释 (codex review MINOR) ## Why 通信协议决策 P1-5. daemon 端 EnsureJWT 早就在过期前自动 refresh, 缩 TTL 没有运行时影响, 但把 api_key 撤销窗口从 30 天降到 1 天 (生产合规). ## Note - 如果 auth session 决定走 "去 daemon JWT 改 api_key 直连" 路线 (review 结论 4), 这个改动就作废. 但当前 JWT 路线下这是安全的紧化, 两种方向都不冲突. ## Test plan - [x] \`go build ./...\` 通过 ## Reviewers @lvsijia @lijianhui Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
##⚠️ Reviewer 请先读架构 doc, 再 review 代码 避免直接 nitpick 实现细节而错过架构层决策的 context (e.g. 为啥 AU5 删了但 issue #75 没修). ## 1. 背景 + Phase 拆解 合并 plan: **决策一** (web 验证改造 — 浏览器改回带 session token, 不再换 JWT) + **决策二** (daemon api_key 直连 — daemon 不再换 JWT, 直接持 api_key 调 fleet/matter, 由它们调 server verify-api-key 验证). 5 个 Phase: | Phase | 内容 | 涉及 repo | |---|---|---| | 1 | server `POST /v1/auth/verify-api-key` endpoint | server | | 2 | matter middleware 扩三协议 (session+bot_token+api_key) + fleet 抄 | matter + fleet | | 3A | web 客户端切回 session token | web | | 3B | daemon 改 api_key 直连 + 业务 SQL 加 owner_uid 过滤 | daemon-cli + fleet + matter | | 4 | 清理 (JWT/AU5/DualAuth/JWKS) + 配置改名 + 单测补 | 全 5 repo | 会议依据: 鉴权架构讨论会议结论 2026-06-05 决策一+决策二. ## 2. 本 repo 改动 (3 commit, 按时间顺序) ``` 6b0dac8 refactor(auth): rename auth_jwt → bot_provision + add status edge case 36fbc3e refactor(auth): tear down JWT module, botToken now uses api_key c2cf1d0 feat(auth): add POST /v1/auth/verify-api-key endpoint ``` ## 3. 配套 PR (cross-link) - octo-server (本 PR) - octo-matter: Mininglamp-OSS/octo-matter#78 - octo-fleet: Mininglamp-OSS/octo-fleet#24 - octo-daemon-cli: Mininglamp-OSS/octo-daemon-cli#26 - octo-web: Mininglamp-OSS/octo-web#294 合并顺序建议: **本 PR (server) 先 merge** — 它是 verify-api-key endpoint 的 base, matter/fleet middleware 依赖它. ## 4. 本 repo review 重点 ### Phase 1: `POST /v1/auth/verify-api-key` (modules/user/api.go) - 跟现有 `/v1/auth/verify` + `/v1/auth/verify-bot` 同 group, 复用 verifyLimit - SQL 两步走: `user_api_key WHERE api_key=? AND space_id!=''` (显式拒 legacy) + `space_member WHERE status=1` 校验 owner 还在 space - 失败统一返 401 `{"msg":"invalid api_key"}` (不细分 revoked/not-found) - 7 单测覆盖: valid / unknown / owner-left-space / nonactive-status / legacy-empty-space / missing-field / multi-space ### Phase 4: `modules/auth_jwt` 模块清理 → 改名 `bot_provision` - 删 JWT 签发 + JWKS endpoint + token exchange + Claims 全套 - `bot_api.go` 的 `botToken` handler 改成验 api_key (复用 `resolveAPIKey`), 不再验 daemon JWT - 模块改名 `auth_jwt` → `bot_provision` (内容只剩 mintBot + botToken, 名字误导) ## 5. 重要决策 — issue #75 推迟到决策四 (本 PR 不涉及, 详情见 matter PR) ## 6. e2e 验证 - 本地 5 服务跑通: daemon 心跳 + web runtime 列表 + create bot + bot 私聊响应 - 单测 (核弹 test DB 后): `go test -run TestAuthVerifyAPIKey ./modules/user/` → 7/7 PASS --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary
- 4 件 v3.3.6 §7 follow-up nit (empty Bearer guard / dead code 删 / 测试
dedup / 注释精确化)
- 顺手删 JWT-era 410 Gone stub (`/.well-known/jwks.json` +
`/v1/auth/token`) + helper + 2 个 test — Phase 4 删 JWT 后 PoC 阶段无真 caller,
compat for nonexistent client 是 dead-code-by-construction
- **0 SQL 改动** — 不动 ban 语义, 保持 runtime 分支跟 main 一致 ("ban user ≠ ban
user's bots")
- **不影响 main** — 5 处全在 runtime-only 文件 (bot_provision 模块 /
verify-api-key test / zz file 都是 runtime 分支引入的)
净 diff: 5 files, +18 / -117 (净 **-99 行**)
## Scope explicitly excluded (留独立 PR / 决策)
- **nit-8 account-ban gate**: R0-R3 review 期间扩到 6 处 SQL D11 闭环, 但跟 main
"ban 不牵连 bot" 语义冲突, caster 拍 revert
- **nit-4 JWT-shaped Bearer hint**: 同 410 stub 同形, 假设不存在 client, revert
- nit-5 / nit-7 / nit-9 / verifyCache sha256: 独立 PR
## Review trail
6 轮 codex + cc 并行 review (R1-R6), 最后双 pass 0 finding:
- R1-R3: D11 SQL 扩 1→6 处 (双确认)
- caster review: 拍板回 main 语义, revert SQL
- R4: revert 后双 pass
- caster 进一步原则: no compat for nonexistent JWT clients
- R5-R6: 删 nit-4 + 410 stub, 双 pass
## Test plan
- [x] `go build ./...` PASS
- [x] `go vet ./modules/user/... ./modules/bot_provision/...` PASS
- [x] `go test -count=1 -p 1 -run 'TestAuthVerify|TestBotToken'
./modules/user/... ./modules/bot_provision/...` PASS
- [x] `go test -race -count=1 -p 1` (same scope) PASS
- [ ] reviewer 跑 `go build ./... && go vet ./...` 确认绿
- [ ] reviewer grep verify: `gone410Handler` / `TestGone410` /
`"/.well-known/jwks.json"` / `"/v1/auth/token"` 在 bot_provision/ 应 0 hit
## Notes
- 4 个 `TestDestroyApply_*` 失败 (full-package run) 是 **pre-existing
flaky** (clean origin/feat/agent-runtime 也挂; redis login-locked state 跨
test 残留), 跟本 PR 无关
- v3.3.6 §P1 (commit 07ca7db, 已 merged 的 daemon api_key user.status=1
gate) 是 runtime 分支跟 main 在 ban 语义上唯一遗留分叉, 若想 100% 跟 main 对齐需要单独 revert
PR
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mand
## Summary
Migrate runtime daemon onboarding from the BotFather IM `/daemon` command
to a web HTTP endpoint. The `/runtimes` page's CreateRuntimeModal will
call this endpoint to render the install + start commands directly in
the browser, rather than asking users to switch to IM and message
BotFather.
## Changes
### Add: GET /v1/runtime-onboarding (session auth)
- New file `modules/botfather/api_runtime_onboarding.go` (~150 lines)
- Route registered in `Route()` under a session-auth group
- Returns JSON containing:
- `api_key` (lazy-create on first call, mirrors prior /daemon
behavior — same `(uid, space_id)` row, same uk_* prefix)
- `space_id`, `server_url`, `fleet_url`, `matter_url`
- `commands.install` / `commands.start` — pre-rendered command
strings for direct display
- `env_vars` map — for users who want to set OCTO_*_URL in their
shell rc instead of inlining per-invocation
- `space_id` resolution: X-Space-Id header > ?space_id= query > 400.
No implicit "first space" fallback — would let caller end up with
api_key bound to a space they didn't intend.
- `getOrCreateUserAPIKey` helper extracted from the now-deleted
handleDaemon. INSERT race tolerant (unique constraint on
(uid, space_id) — loser re-reads).
- `deriveOnboardingURLs` helper extracted from same.
### Remove: BotFather /daemon command
Following the same principle as PR #334 (no compat for nonexistent
clients post-Phase 4), removing rather than deprecating:
- `modules/botfather/const.go:46` — `CmdDaemon` constant
- `modules/botfather/api.go:308` — command list entry
- `modules/botfather/command.go:174-175` — `case CmdDaemon` dispatch
- `modules/botfather/command.go:489` — help text line
- `modules/botfather/command.go:1190-1258` — `handleDaemon` function
(70 lines, behavior reproduced in the web handler)
`deriveServiceURL` (still in command.go) is preserved — both the new
web handler's `deriveOnboardingURLs` and any future caller share it.
## Verification
- `go build ./...` PASS
- `go vet ./modules/botfather/...` PASS
- Pre-existing test flake unrelated to this PR: `botfather` package
test setup panics with `unknown migration: 20201222000001_report_
legacy01.sql` from a polluted local test DB. Verified on clean
`origin/feat/agent-runtime` — same panic, not introduced here.
Fixable via `bootstrap.sh --fresh` (drops + re-creates test DB).
## Out of scope (separate PR)
octo-web `/runtimes` page restructure:
- Sidebar `Runtimes` → `运行时`
- Top-level Runtime/Bot tabs removed; tree view: device → runtime → bot
- Top-right `+` popover: 创建 Runtime / 创建 Bot
- New `CreateRuntimeModal.tsx` calls this endpoint
Frontend PR follows once this server PR is merged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cc round 1 verdict=fail 7 finding 2 P0 / codex round 1 verdict=fail 3 finding 2 P0. 双方共识 P0 + 单方 codex P0 全修. - C1 (双方 P0): 缺 SpaceMember + user.status=1 校验. 接 spacepkg.CheckMembership (已 join space.status=1 + space_member.status=1) + 二重 user.status=1 SQL gate (D11 撤销链路审计: liftBanUser 走 redis token 已堵 session 路径, 这里 belt-and-braces 防 banned user lazy-create dead api_key row). query 优先 / header 备选 (X-Space-Id / X-Space-ID 都吃) 跟 SpaceMiddleware 顺序对齐. - C2 (codex P0): commands.start 是 standalone multi-line block (含 export OCTO_SERVER_URL + OCTO_FLEET_URL + OCTO_MATTER_URL + octo-daemon start 行), caller 复制即跑. env_vars 字段保留供想 set 到 shell rc 而 非 inline 的场景. - C3 (codex P0): QUICKSTART.md 删 BotFather /daemon 引导, 改成 Web Runtimes 页 + popover 创建 Runtime. - C4 (cc): deriveOnboardingURLs 拼出 'http://:8090' (External.BaseURL + External.IP 都空) 时 handler 报 500 + log, 不再给前端展示 broken command. cc 提的 minor 推 PR-3 (errcode 风格 / 单测). cc 提的 X-Space-Id 大小写 / header-query 顺序由本 PR 自带兼容消化 (双 case 都吃, 跟 SpaceMiddleware 对齐). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cc round 2 verdict=pass / codex round 2 verdict=pass, 0 pre-push blocker.
顺手清 cc 提的 3 个 minor 提高 PR review 通过率.
- F1 (cc): 删第二个 c.GetHeader('X-Space-ID') dead branch — Go HTTP
header 大小写不敏感 (textproto.CanonicalMIMEHeaderKey), 一次查询
X-Space-Id / X-Space-ID 都吃, 第二次永远不会比第一次多查到. 改注释
显式说明.
- F2 (cc): 删 bf.ctx.DB().NewSession(nil) 冗余调用, 改成 bf.ctx.DB()
跟 SpaceMiddleware (pkg/space/middleware.go:102) 调用风格对齐.
- F4 (cc 提的 question): octo-daemon start --api-url 改回 serverURL
跟旧 handleDaemon 一致. daemon-cli 内部 cfg.APIURL 只是 env (OCTO_FLEET_URL/
OCTO_SERVER_URL) 缺失时的 fallback, 我们 commands.start 已经全设
了 export, 实际 cfg.APIURL 不会被读到, 改 fleetURL/serverURL 行为
等价 — 但语义上跟旧版对齐避免 future drift. 注释改写.
cc 的 F3 (CheckMembership + user.status 拆两次 vs verify-api-key 一条
SQL JOIN sm+s+u 简洁版) / F6 (broken URL 检测覆盖度盲区) 推 PR-3 (cosmetic).
codex 的 C4 (broken URL 漏 'http://' / 空白 IP case) 同 F6 推 PR-3.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndpoint feat(runtime-onboarding): web endpoint 替代 BotFather /daemon 命令
daemon 0.0.3 二进制不读 OCTO_MATTER_URL (它不直接调 matter), 且 matter 当前未上线; 启动命令只下发 OCTO_SERVER_URL + OCTO_FLEET_URL 两个 export. 顶层 matter_url 响应字段保留, 不影响契约.
QUICKSTART "Connect an AI Agent" still showed the old `go install ...octo-daemon-cli@latest`; align it with the runtime-onboarding API (now `npm install -g @mininglamp-oss/octo-daemon`) so the in-repo docs and the served command match. Note Node.js >= 18.
fix(onboarding): npm daemon install + drop OCTO_MATTER_URL from start
# Conflicts: # internal/modules.go
chore: sync origin/main into agent-runtime
## 问题 合入 main 后,`user_api_key` 表新增 `status`(1=active/0=revoked)与 `client_id`(默认 botfather)两列以支持吊销与第三方 integration client 维度。但 runtime 的两条 daemon 鉴权直查是在加列之前写的,只匹配 `api_key + space_id`,**未过滤 status/client_id**: - 被吊销的 key(status=0)仍能通过 daemon 鉴权; - 原生 daemon 路径可能接受第三方 integration client key(client_id ≠ botfather)。 属语义缺口,`go build` 抓不到。 ## 改动 两条查询同步补 `status=1 AND client_id='botfather'`(绑定参数): - `modules/user/api.go` `authVerifyAPIKey`(/v1/auth/verify-api-key,fleet/matter 调用) - `modules/bot_provision/resolve.go` `resolveAPIKey`(daemon→bot_token) 字面量 `1`/`'botfather'` 对应列语义;botfather 包常量未导出,且 `user` 无法 import botfather(`botfather → user` 循环)。两处 SQL 旁加注释说明。 **不在范围**:`modules/integration/db.go` 的 Update 是 integration 自身管理 key 的写路径,非 daemon 鉴权读。 ## 测试 两条路径各加回归用例: - 吊销 key(status=0)→ 401 - legacy 明文非 botfather client(client_id≠botfather)→ 401(刻意 direct-insert 明文,使 `api_key` 仍能命中,**仅靠 client_id 过滤拒绝**,证明过滤生效) - 正常 botfather key(status=1)→ 仍 200 `modules/user`、`modules/bot_provision` 全量测试通过;`go build ./...` + `go vet` 通过。
…ses (#386) ## What Migrate the runtime error surface to the registered error-code i18n framework. - bot_provision: all 13 hardcoded error sites -> registered codes + httperr.ResponseErrorL / ResponseErrorLWithStatus, preserving each site's wire status; daemon credential failures collapse to one anti-enumeration 401; internal failures route through the shared internal code without leaking detail. - user (verify-api-key): the two raw 401 responses -> status-preserving ErrUserAPIKeyInvalid (anti-enumeration, reason logged), clearing the lint-direct-error-response (D23) regression on user/api.go. ## Tests / gates - New i18n tests: source guard, zh parity, envelope + wire-status contracts. - Green: make i18n-extract-check, lint-unregistered-code, lint-direct-error-response; full bot_provision + user module tests pass.
…362) The /daemon onboarding flow (now GET /v1/runtime-onboarding) used to instruct users to `export OCTO_MATTER_URL`, but the octo-daemon binary never reads that env — it reaches matter through the fleet/server endpoints. The instruction was a no-op that caused bot-provisioning friction. The runnable command block and env_vars already omit OCTO_MATTER_URL, but nothing guarded it against reintroduction. Extract the command/env builders into pure functions (buildDaemonStartBlock, daemonEnvVars) and add hermetic regression tests asserting the daemon-facing instructions expose exactly OCTO_SERVER_URL + OCTO_FLEET_URL and never OCTO_MATTER_URL. No behavior change; response shape unchanged. Co-Authored-By: Paperclip <noreply@paperclip.ing>
Instrument the bot reply delivery leg (sendMessage -> dispatchMsgSendReq),
the server<->channel (WuKongIM) boundary of an agent turn:
- dmwork_agent_turn_delivery_duration_seconds (HistogramVec) — delivery latency
- dmwork_agent_turn_delivery_total{result} (CounterVec) — result=error counts
failed agent-turn deliveries
Metrics register on the default registry via promauto (consistent with
modules/oidc/metrics.go) so the existing /metrics scrape exposes them.
Labels kept low-cardinality: channel_type{person|group|topic|other} x
result{ok|error}, pre-warmed to zero series for stable dashboards.
Co-Authored-By: Paperclip <noreply@paperclip.ing>
The existing /v1/health already probes db + redis. Add a WuKongIM reachability check so readiness reflects the channel/IM dependency the agent turn relies on. Any HTTP response (even non-2xx) counts as reachable — the probe asserts the IM API answers on the network, not that a specific health route exists — so readiness isn't coupled to a WuKongIM version/path. Only a transport failure marks im=down. 2s timeout bounds a hung IM control plane; unconfigured APIURL reports im=unknown without failing overall readiness. HTTP stays 200 (aggregate in body), preserving existing consumer behavior. Co-Authored-By: Paperclip <noreply@paperclip.ing>
mochashanyao
left a comment
There was a problem hiding this comment.
[Octo-Q · automated review]
Verdict: Approve — no blocking findings; notes below (data-flow traced).
PR #407 Review Report — OCT-17 Observability + Bot Provisioning + Auth Extensions
Reviewer: Octo-Q (automated review)
Head SHA: dfea4c5613e9fdb3dd82c4014538c77d8f9cf9bb
Repo: Mininglamp-OSS/octo-server
Scope: security_sensitive (new auth endpoints, credential handling, authorization gates)
1. Verification Summary
| Area | Status | Evidence |
|---|---|---|
| Agent-turn metrics (metrics.go) | ✅ | Low-cardinality labels, init pre-warm, correct data flow from req.ChannelType uint8 (send.go:27) → agentTurnChannelTypeLabel() switch with default "other" |
| Health IM probe (common/api.go) | ✅ | Config-sourced URL (not user input → no SSRF), 2s timeout, empty→"unknown", transport-fail→"down" |
| Bot mint (bot_provision/bot_api.go) | ✅ | Session auth → assertSpaceMember (3-way JOIN: space.status=1 + user.status=1 + sm.status=1) → MintBotOBO |
| Bot token (bot_provision/bot_api.go) | ✅ | api_key Bearer → resolveAPIKey (status=1 + client_id='botfather') → assertSpaceMember → SQL cross-space filter + creator check |
| Auth verify-token (user/api.go:4011) | ✅ | OwnedBots UNCONDITIONAL (v3.3.5 fix), ?include=context fail-secure (empty lists, not 500) |
| Auth verify-api-key (user/api.go:4333) | ✅ | 3-step: key lookup → 3-way JOIN membership → opt-in ownedBotsBySpace |
| Runtime onboarding (botfather/api_runtime_onboarding.go) | ✅ | Session auth → CheckMembership → user.status=1 gate → getOrCreateAPIKey → URL sanity check |
| SQL injection safety | ✅ | All new queries use parameterized ? placeholders via dbr |
| Anti-enumeration (bot_provision) | ✅ | All credential failures collapse to ErrBotProvisionAuthFailed 401 |
| Rate limiting (botToken) | ✅ | 1000 req/min/IP "verify" bucket, appropriate for credential-touching endpoint |
| i18n error codes | ✅ | 5 new bot_provision codes + ErrUserAPIKeyInvalid registered with zh-CN/en-US translations |
2. Findings
F1 — P2: Hardcoded MinIO credentials committed to tracked config
File: configs/tsdd.yaml (new block added by this PR)
Lines: PR diff +135–144
fileService: "minio"
minio:
url: "http://127.0.0.1:9000"
accessKeyID: "admin"
secretAccessKey: "12345678"Diff-scope: NEW — this PR adds the block. The file is tracked in git (not in .gitignore). The inline comment acknowledges "Production should set proper AccessKey/SecretKey via env vars, not commit creds" but the creds are committed.
Risk: admin/12345678 are MinIO's factory defaults. They only reach 127.0.0.1:9000 so no remote production exposure, but:
- Anyone deploying from this config without overriding gets a publicly-known storage credential
- Sets a precedent that weak creds in config are acceptable
Recommendation: Move the dev block to a gitignored configs/tsdd.dev.yaml (add to .gitignore), or replace with env-var references (${MINIO_ACCESS_KEY}). Keep the commented-out template as documentation.
F2 — P2: New http.Client per health probe call (no connection reuse)
File: modules/common/api.go, imReachable() function
func imReachable(apiURL string) error {
client := &http.Client{Timeout: imHealthTimeout} // new client each call
...
}Diff-scope: NEW — introduced by this PR.
Impact: Each health probe creates a new http.Client with default Transport (no connection pooling). For a health endpoint probed every 10–30s by orchestrators, this means a fresh TCP+TLS handshake per probe. Not a correctness issue but wasteful.
Recommendation: Use a package-level var imHealthClient = &http.Client{Timeout: imHealthTimeout} or configure a custom Transport with MaxIdleConnsPerHost: 1.
3. Suggestions
- F1 fix: Move dev MinIO creds to gitignored config or env-var interpolation. Low effort, high hygiene value.
- F2 fix: Hoist the
http.Clientto package level. One-line change. - Nit: The health endpoint's
statusMap["error"] = lastError.Error()(pre-existing pattern, not this PR) can leak internal hostnames/connection strings to unauthenticated callers. Consider a future PR to redact the error detail from the public response while keeping it in the log.
4. Additional Observations
- MintBotOBO friend-add is best-effort (mint_obo.go:59-65):
AddFriendfailures are logged-and-continue. This is intentional (bot is usable without friend link) but worth noting that a minted bot may lack the bidirectional friend relationship in edge cases. - botToken SQL no-rows → 500 not 404 (bot_api.go:155): When
SelectBySql().Load(&r)finds no matching robot, dbr returns an error → 500. Ther.BotToken == ""404 path (line 160) is unreachable for the no-rows case. Functionally correct (caller gets an error either way) but the status code is imprecise. Pre-existing dbr pattern, not a blocker. - deriveServiceURL (command.go:1137): Simple port-swap helper, correctly handles URLs with and without trailing port. No edge-case issues found.
5. Data-Flow Tracing
| Consumer | Upstream Source | Verified Flow |
|---|---|---|
observeAgentTurnDelivery(req.ChannelType, ...) |
req struct field ChannelType uint8 (send.go:27), JSON-bound from HTTP body |
✅ → agentTurnChannelTypeLabel() switch maps to {person,group,topic,other}, default catches all unknowns |
imReachable(imURL) |
cn.ctx.GetConfig().WuKongIM.APIURL (config, not user input) |
✅ → TrimSpace → empty check → GET with 2s timeout |
assertSpaceMember(uid, spaceID) |
mintBot: c.GetLoginUID() + req.SpaceID; resolveAPIKey: user_api_key.uid + user_api_key.space_id |
✅ → 3-way JOIN filters space.status=1, user.status=1, sm.status=1 |
botToken SQL callerSpace |
resolveAPIKey() → user_api_key.space_id (DB-sourced, not user input) |
✅ → parameterized query, cross-space filter enforced |
authVerifyToken OwnedBots |
robot table JOIN user on robot_id=uid, filtered by creator_uid=resp.UID + status=1 |
✅ → UNCONDITIONAL load (v3.3.5 fix), DB error → empty list (fail-secure) |
authVerifyAPIKey Step 2 membership |
keyInfo.UID + keyInfo.SpaceID from Step 1 DB lookup |
✅ → same 3-way JOIN pattern as assertSpaceMember |
runtimeOnboarding spaceID |
c.Query("space_id") or c.GetHeader("X-Space-Id") (user input) |
✅ → validated by CheckMembership + user.status=1 before any side-effect |
6. Blindspot Checklist (R5 — security_sensitive)
C1 — Dual-path parity: ✅ CLEAR
- add/remove: No symmetric pair operations in this PR
- create-gate ↔ execute-path: mintBot checks space membership before MintBotOBO; botToken checks api_key + space + creator; runtimeOnboarding checks membership + user.status before lazy-create of api_key. All authorization gates are symmetric between creation and execution paths.
- Parallel implementations:
assertSpaceMember(bot_provision/resolve.go) and the inline 3-way JOIN inauthVerifyAPIKey(user/api.go:4385) use identical SQL patterns (space.status=1 + user.status=1 + sm.status=1). No drift.
C2 — Control-flow ordering / nesting reuse: ✅ CLEAR
observeAgentTurnDeliverycalled exactly once persendMessage— no double-counting riskderiveServiceURLcalled twice (fleet + matter) with different port suffixes — no nestingimReachablecalled once per health check — no re-entry- No security controls (escape/sanitize/regex) introduced that need non-canonical-form testing
C3 — Authorization boundary ≠ capability boundary: ✅ CLEAR
- mintBot: session auth (web) → space membership gate → MintBotOBO. Only authenticated web users who are space members can mint.
- botToken: api_key Bearer → resolveAPIKey (status=1 + client_id='botfather') → space membership + creator check. Only the bot's creator with a valid daemon key can pull the token.
- runtimeOnboarding: session auth → space membership + user.status=1 → lazy-create api_key. Banned users blocked.
- authVerifyAPIKey: internal endpoint, api_key validation → 3-way JOIN. No unauthorized path.
C4 — Authorization lifecycle / container-member state cascade: ✅ CLEAR
assertSpaceMemberchecks all three levels: space.status=1 (container), sm.status=1 (membership), user.status=1 (account ban). A disabled space cascades to deny all member operations. A banned user cascades to deny all api_key/mint operations.authVerifyAPIKeyStep 2 uses the same 3-way JOIN — symmetric coverage.queryUserSpaceContextspaces query JOINs space.status=1 — disabled spaces excluded from context.- No gap found where a parent container's disabled state fails to cascade.
7. Cross-round Blocker Recheck (R6)
N/A — first review of this PR.
[Octo-Q] verdict: APPROVED
No P0/P1 findings. Two P2 items (hardcoded dev credentials in tracked config, per-call http.Client allocation) are hygiene/efficiency improvements that should be addressed but do not block landing. All authorization chains verified complete across the new bot provisioning, auth verification, and runtime onboarding surfaces. C1–C4 blindspot checks all clear.
yujiawei
left a comment
There was a problem hiding this comment.
Code Review — PR #407 (octo-server)
Reviewed at head dfea4c5613e9fdb3dd82c4014538c77d8f9cf9bb. Scope: 5 files / +268 lines (modules/common/api.go, modules/bot_api/metrics.go, modules/bot_api/send.go, plus tests). This PR is security-sensitive (it touches an unauthenticated endpoint and adds an outbound network probe), so I looked at it carefully.
Verdict: Approve. No correctness, data-loss, or build-breaking issues. go build, go vet, and all new tests pass locally. The two new metrics are correctly wired through the single delivery chokepoint, label cardinality is bounded, and the health probe is timeout-bounded and fails safe. The items below are non-blocking — one info-disclosure point I'd like a maintainer to consciously accept, plus three minor robustness/test nits.
What's good
- Metrics design is sound.
delivery_duration_seconds/delivery_totalare labeledchannel_type{person|group|topic|other} × result{ok|error}— fully bounded, norobotID/channelIDto explode Prometheus memory. Thedefault -> "other"collapse is fuzz-safe and unit-tested.init()pre-warms the 8-series matrix so dashboards can distinguish "zero deliveries" from "metric absent". Namespace/subsystem/buckets are consistent with the existingdmwork_http_*andmodules/oidc/metrics.gopatterns. - Instrumentation point is correct.
observeAgentTurnDelivery(req.ChannelType, elapsed, err)is recorded immediately afterdispatchMsgSendReq(the single caller), before theif err != nilbranch — intentional and right. The error classification captures both transport failures and non-2xx IM responses asresult=error. - Health probe fails safe. Unconfigured
WuKongIM.APIURL→im=unknownwithout failing overall readiness; only a transport-level failure marksim=down. The 2shttp.Client{Timeout}bounds a hung IM control plane so it can't stall the handler (or a liveness probe behind it). Treating any HTTP response (incl. 4xx/5xx) as reachable correctly decouples readiness from a specific WuKongIM route/version.
Non-blocking findings
[P2 — please have a human accept] Unauthenticated /v1/health discloses the internal IM address on probe failure
modules/common/api.go — /v1/health is registered via raw r.GET(...) (no auth middleware), and on a transport failure the handler returns statusMap["error"] = lastError.Error() in the JSON body. The imReachable transport error embeds the configured WuKongIM.APIURL (e.g. Get "http://10.x.x.x:5001": dial tcp ...: connection refused), so an unauthenticated caller can learn internal control-plane host:port / DNS topology.
Important context: this is not a new vulnerability class — the parent commit already returned lastError.Error() for db.Ping/redis.Ping failures, which leak the DB/Redis host:port the same way. This PR adds one more internal endpoint (IM) to an already-leaky surface. I also confirmed no secret leaks: WuKongIM.ManagerToken is a separate config field sent via an HTTP header at the real call sites and is never part of APIURL.
Because it's incremental rather than novel, I'm not blocking on it — but since this PR is security-sensitive and it does widen the disclosure surface by one endpoint, a maintainer should consciously accept it. The clean fix (which also closes the pre-existing db/redis case): keep the per-component up/down/unknown fields in the public body but drop the raw lastError.Error() string, logging it server-side only (the handler already does cn.Error(...)).
[P2] imReachable follows redirects, partially contradicting "any HTTP response = reachable"
modules/common/api.go — imReachable uses the default http.Client (no CheckRedirect), which follows up to 10 redirects. The doc comment says any HTTP response counts as reachable, but a 3xx from the IM root is followed rather than counted; if the redirect target then fails transport, the probe reports im=down even though WuKongIM answered. The tests cover 200/404/500 but not a redirect. Low likelihood (the IM root is a flat API server) and it fails closed, so this is a robustness nit — consider client.CheckRedirect = func(*http.Request, []*http.Request) error { return http.ErrUseLastResponse } to make the "first response wins" intent explicit.
[P2] Health probe doesn't propagate the request context
modules/common/api.go — the probe uses http.NewRequest rather than http.NewRequestWithContext(c.Request.Context(), ...). If the health caller disconnects, the outbound IM probe still runs to the 2s timeout. Bounded and minor (and a global rate limiter already guards the endpoint), but threading the request context would let it cancel early under health-check floods.
[P2 — test quality] TestAgentTurnDeliveryTotal_AllLabelsPrewarmed doesn't assert what its name claims
modules/bot_api/metrics_test.go — the test calls GetMetricWithLabelValues, which lazily creates the series if absent, so it would pass even if init() stopped pre-warming the 8 combinations. It validates label well-formedness, not prewarming. To actually assert prewarming, check testutil.CollectAndCount(metricAgentTurnDeliveryTotal) == 8 (or assert each series is present) before any observe* call runs.
Coverage notes / blind spots
- I verified build, vet, the new tests, the
WuKongIM.APIURLconfig field, theChannelType*constants, and thatdispatchMsgSendReqhas a single caller. I did not exercise the probe against a live WuKongIM, nor confirm howocto-deploymentk8s manifests consume/v1/health(whether the JSON body is parsed or only the 200 status is read) — that drives the real-world impact of the info-disclosure point above. - Latency semantics: the histogram measures the outbound dispatch leg only, not the end-to-end human→bot→human turn. That matches the stated OCT-17 scope and the PR description's rationale (no reliable server-side correlation key across the two async HTTP legs), so it's a deliberate, documented choice — just worth keeping in mind when naming dashboards.
yujiawei
left a comment
There was a problem hiding this comment.
Code Review — PR #407 (octo-server)
Reviewed at head dfea4c5613e9fdb3dd82c4014538c77d8f9cf9bb. This is a full-scope review of all 36 changed files (≈3.1k additions), correcting an earlier review that only covered the OCT-17 observability slice (5 files) and missed the bot-provisioning / cross-service auth surface that makes up the bulk of this PR.
Verdict: Request changes. One P1 (a ban-revocation gap on the user-token verification path) and one credential-hygiene item I'd want a human to sign off on, plus several non-blocking nits. The auth surface is, on the whole, carefully built — the items below are specific and each has a small fix.
What's solid
- Cross-space / IDOR gates on the daemon paths are correct.
GET /v1/bot/:uid/tokenonly returns a token for a bot that is both a member of the caller api_key's bound space (INNER JOIN space_member sm ON sm.uid=r.robot_id AND sm.space_id=? AND sm.status=1) and created by the caller (r.creator_uid == callerUID); a guessedbotUIDyields 403/404 with no existence leak. - Anti-enumeration is sound. Credential failures on the daemon path collapse to single codes (
ErrBotProvisionAuthFailed/BotForbidden/BotNotFound). I confirmed the dbr semantics:botTokenuses.Load()(multi-row), which returns(0, nil)on zero rows — so the no-rows case correctly falls through to ther.BotToken == ""→ 404 branch (it does not 500). The 404 path is reachable and correct. - Caller-supplied
bot_tokencannot hijack an existing bot.robot.bot_tokencarries a functional UNIQUE index onNULLIF(bot_token,''), so a colliding INSERT fails;createBotCoreWithRetryholds the token constant across retries, so a collision errors out rather than overwriting a victim's row. - Parameterized SQL throughout; the only string concatenation is static identifier quoting (
`user`), never user input. owned_bots_by_spacewhitelist holds. The token path filters bot rows throughvalidSpacesbuilt from the caller's own active spaces, and the api_key path scopes via SQLAND sm.space_id=?— neither leaks a foreignspace_id.
Blocking
[P1] POST /v1/auth/verify does not gate user.status, so a banned user keeps gateway access until token TTL
modules/user/api.go — authVerifyToken resolves the session token purely from the Redis cache (Cache().Get(TokenCachePrefix + token)) and builds the full response — identity, OwnedBots, and (with ?include=context) spaces + owned_bots_by_space + context_included — with no user.status=1 check.
The ban path does not evict that cache. Admin ban (liftBanUser, modules/user/api_manager.go) sets user.status=0, bans the IM channel, and calls QuitUserDevice(uid, -1). I traced QuitUserDevice in the pinned dependency: it issues a single POST to the IM control plane's /user/device_quit and touches nothing else. The Redis key authVerifyToken reads (token:{token}) is deleted only by revokeAllDeviceTokens, which is called only from the account-deletion path (deleteAdminUsers) — never from ban. So after a ban, the user's session token remains valid in the cache (default TTL is long), and this endpoint keeps returning a full, populated authorization context for a banned account.
Why this is in-scope for this PR rather than pre-existing noise: the PR's two new sibling credential endpoints — authVerifyAPIKey and runtime-onboarding — both explicitly add a user.status=1 gate, with comments stating they are closing exactly this account-ban bypass. Their comments also assert that the session-token path is already covered because "QuitUserDevice clears the redis token cache." That premise is false (per the trace above), and the token path is the one path left ungated. The PR builds its new cross-service authz primitives (owned_bots_by_space, context_included) on this ungated path, so it extends the gap rather than merely inheriting it.
Fix: mirror the sibling pattern — gate the resolved info.UID on user.status=1 in authVerifyToken (and/or add the INNER JOIN user u ... u.status=1 to queryUserSpaceContext's spaces query) so a banned user fails closed here as it already does on the api_key path. If the downstream gateway consumers in fact re-check ban state and this is intentional, please say so explicitly and remove the now-incorrect "clears redis token cache" comments — but as written, this is a fail-open revocation gap on a credential-verification endpoint in a security-sensitive PR.
Human verification requested (this PR carries needs-human-review): please confirm the intended ban-revocation contract for the token gateway path before merge.
Non-blocking (should fix, not gating)
[P2 — please have a human security owner confirm] Committed MinIO credentials in the default config
configs/tsdd.yaml uncomments and activates fileService: "minio" with accessKeyID: "admin" / secretAccessKey: "12345678", and this file is the default -config path (main.go). Mitigating factors: the endpoint is loopback (http://127.0.0.1:9000), the block is labeled dev-only, and the values are env-overridable (TS_-prefixed AutomaticEnv is wired, so TS_MINIO_SECRETACCESSKEY overrides). Still, committing live credentials to a public repo is a hygiene line worth not crossing, and the in-file comment ("Production should set proper creds via env vars, not commit creds") argues against the change itself. Recommend re-commenting the block or switching to env-var references before merge. The same hunk flips external.ip from auto-detected to hardcoded localhost, which would feed http://localhost:8090 to onboarding output unless External.BaseURL is set — worth a second look.
[P2] POST /v1/bot/mint has no per-UID rate limit
modules/bot_provision/bot_api.go mounts mint behind AuthMiddleware only, while the sibling bot-create route uses SharedUIDRateLimiter and even this PR's own /v1/bot/:uid/token got a StrictIPRateLimit. Each mint amplifies into ~5 DB writes + IM app creation, so an authenticated space member can script bot/space_member churn (bounded only by the global per-IP floor). Add SharedUIDRateLimiter to match the convention.
[P2] Caller-supplied bot_token has no format/entropy floor
mintBot accepts req.BotToken verbatim when non-empty (no bf_ prefix / length / entropy check). It can't hijack another bot (UNIQUE index), but it lets a caller register a low-entropy bearer value in the global bot-token namespace that authVerifyBot trusts unscoped. Prefer server-only generation, or enforce the bf_ prefix + entropy.
[P2] client_id='botfather' / status=1 duplicated as literals across 3+ sites
resolve.go, authVerifyAPIKey, and the issuance default all hard-code the daemon-gate literals (forced by a real import cycle). Not exploitable today, but a drift risk on the gate that keeps integration-client keys off the daemon path — a single exported constant would remove the class.
[P2] runtime-onboarding uses raw error responses and isn't in the legacy-response guard list
modules/botfather/api_runtime_onboarding.go uses c.ResponseErrorWithStatus(errors.New(...)) at ~8 sites (the repo's envelope/i18n contract prefers httperr.ResponseErrorL + registered errcode), and the new file isn't added to TestBotfatherNoLegacyResponseError's file list, so future edits here evade the guard. Strings are generic (no info leak today); this is a contract/consistency fix.
Build / test / coverage notes
go buildandgo vetpass for all changed packages. The new observability tests pass. Thebot_provision/userauth tests require a live MySQL/Redis (they panic intestutil.NewTestServeron DB/migration setup here) — consistent with the PR's stated "CI needs MySQL/Redis/WuKongIM"; I could not execute them in this environment.- Out-of-repo blind spots that bound the P1's real-world blast radius: the fleet/matter consumers of
owned_bots_by_space/context_includedlive in other repos, and the nginx internal-IP ACL the code cites as the primary control for the verify-* / token endpoints is in the deployment repo. The server-side gate is what I can assess, and it is missing on the token path.
Jerry-Xin
left a comment
There was a problem hiding this comment.
This PR is relevant to octo-server, but it contains blocking regressions outside the stated observability scope.
🔴 Blocking
-
🔴 Critical: configs/tsdd.yaml enables local-only deployment settings in a committed config:
external.ip: "localhost"and MinIO as the activefileServicewithadmin/12345678credentials at lines 139-145. This changes runtime behavior for anyone using this config and commits local PoC credentials. Keep this as commented documentation or move it to a local/example override file. -
🔴 Critical: modules/botfather/api_runtime_onboarding.go introduces a new user-facing endpoint that returns legacy raw errors via
c.ResponseErrorWithStatusat lines 66, 79, 93, 97, 107, 111, 124, and 134. This violates the repository’s localized error-envelope contract for new endpoints. The existing botfather source guard also does not include this new file: modules/botfather/api_i18n_test.go. Add registered errcodes/helpers and include the new handler file in the guard.
💬 Non-blocking
-
🟡 Warning: authenticated new routes are missing the shared UID limiter.
/v1/runtime-onboardingis mounted with onlyAuthMiddlewarein modules/botfather/api.go, and/v1/bot/mintis mounted similarly in modules/bot_provision/bot_api.go. Project guidance says authenticated routes should mountSharedUIDRateLimiterafter auth. -
🟡 Warning:
gofmt -lreports unformatted Go files:modules/botfather/api_runtime_onboarding.go,modules/bot_provision/bot_api.go,modules/user/api.go, andmodules/user/api_verify_apikey_test.go.
✅ Highlights
- The agent-turn metrics use bounded labels and prewarm low-cardinality series.
- The IM health probe is additive and has a timeout, which keeps the health endpoint from hanging indefinitely.
I also tried go test ./modules/bot_api ./modules/common; it failed during test server setup with an existing migration-state error (unknown migration in database), so I could not validate those packages end to end in this checkout.
Summary
Observability for the end-to-end agent turn on the server↔channel (WuKongIM) boundary (OCT-17, sub-task of OCT-10).
/v1/health): already probed db + redis; now also probes WuKongIM reachability. Any HTTP response (even non-2xx) counts as reachable — the probe asserts the IM API answers on the network, not that a specific health route exists, so readiness isn't coupled to a WuKongIM version/path. Only a transport failure marksim=down; unconfiguredAPIURLreportsim=unknownwithout failing overall readiness. 2s timeout bounds a hung IM control plane. HTTP stays 200 (aggregate status in body) — additive, non-breaking for existing consumers.modules/bot_api/metrics.go): instrument the bot reply delivery leg (sendMessage→dispatchMsgSendReq):dmwork_agent_turn_delivery_duration_seconds(HistogramVec) — latency of bot reply delivery to the channel.dmwork_agent_turn_delivery_total{result}(CounterVec) —result=erroris the failed agent-turn deliveries counter.promauto(consistent withmodules/oidc/metrics.go); exposed by the existing/metricsscrape. Labels kept low-cardinality:channel_type{person|group|topic|other}×result{ok|error}, pre-warmed to zero series.Acceptance (from OCT-10)
Test plan
go build ./...go vet ./modules/bot_api/ ./modules/common/(only linter gated in CI)go test ./modules/bot_api/ -run 'TestAgentTurn|TestObserveAgentTurn|TestBotAPINoLegacyResponseError'go test ./modules/common/ -run 'TestIMReachable'No REST/WS contract break (health gains an additive
imfield; metrics are new). No auth/secrets/permissions changes.