feat(observability): surface effective agent model in /status skill#67
Open
ligolnik wants to merge 90 commits into
Open
feat(observability): surface effective agent model in /status skill#67ligolnik wants to merge 90 commits into
ligolnik wants to merge 90 commits into
Conversation
…I MCP tools
- Stuck-session prevention: track currentAssistant/previousAssistant turns,
finalize resume point at result-time, skip thinking-only end_turn turns
- Auto-recovery: erroredWithoutProgress flag retries the same prompt with a
recoveredThisTurn cap, clearing resumeAt
- Expanded debug logs: thinking blocks, tool_use/tool_result + latency, token
usage, cache hit rate, prompt preview, SOUL hash, rate_limit events, full
error bodies
- Session-aware IPC path: read from /workspace/ipc/input-${SESSION_NAME}/
to side-step VirtioFS nested bind-mount staleness on macOS Docker
- ENOENT race fix in drainIpcInput: when readFileSync succeeds but unlinkSync
finds the file gone (concurrent drain), swallow ENOENT instead of throwing
to outer catch and discarding the parsed message — root cause of silent
message loss after a query ends
- New OneCLI MCP server (onecli-mcp-stdio.ts): 7 Calendar tools, 9 Gmail
read+drafts tools, 8 SmartThings tools (incl. paginated history with
cursor-based nextPage), wired up when HTTPS_PROXY is set
- IPC send_voice tool added to ipc-mcp-stdio.ts
- google-calendar container skill scaffolded
…ain DM concurrency bypass
- channels/telegram: host-side auto-react 👀 (gated to main+trusted AND
trigger-matched in groups requiring trigger), transcribeVoice via OneCLI
curl to OpenAI Whisper, transcript-based trigger inference
("LoMBot"/"limlombot"/"lom bot" → prepend @LomBot), sendVoice via
OpenAI TTS through OneCLI proxy (default voice: alloy)
- channels/telegram-sanitize: TELEGRAM_ALLOWED_TAGS Set + Phase 1b unknown
tags get htmlEscaped instead of protected — fixes 400 from <deliveryScheduleId>
and similar agent-emitted XML-ish content
- container-runner: OneCLI HTTPS_PROXY + CA cert env injection (main+trusted
only, NOT untrusted), NO_PROXY=host.docker.internal,127.0.0.1,localhost
(NOT api.anthropic.com — so 3rd-party tools route via OneCLI), per-tile
owner support (TileRef[]), mainOnly skill gate (google-calendar),
idle-timeout reclassified INFO when hadStreamingOutput
- group-queue: setIsMainGroupResolver + bypass MAX_CONCURRENT_CONTAINERS
for main DM (so background heartbeats don't queue main behind themselves)
- observer (new): per-query summaries to status group, live thinking stream,
progress reactions 👀→🤔→⚡→✍, watchdog with 30s blink + 60/120/300s pings,
noteLatestUserMessage tracking
- ipc: send_voice host handler (OpenAI TTS via OneCLI), send_file path
resolver extended to /workspace/extra/<mount>/... via additionalMounts
lookup, Tessl periodic-update path fix
- index/types: sendVoice deps wiring, Channel.sendVoice signature, observer
init, queue main-resolver
…ice docs, tessl workspace - groups/global/CLAUDE.md: identity Andy → LoMBot, mcp__onecli__gcal_* tool docs, mcp__onecli__gmail_* (read+drafts only) docs, SmartThings tool docs, voice reply preference (mcp__nanoclaw__send_voice), workspace path note (write files under /workspace/group/, not /tmp/) - groups/global/CLAUDE-untrusted.md: minor untrusted updates - groups/main/CLAUDE.md: synced - tessl.json: add jbaruch/nanoclaw-telegram dependency - tessl-workspace/: setup with .tessl/RULES.md and .claude/ pointing at the installed tile rules - scripts/claw: convenience wrapper
- agent-runner: stop slicing thinking blocks at 400 chars when logging. Whitespace is still collapsed so the entire block stays on one log line (downstream parsers split on newlines). Full content is now visible in `docker logs` for post-mortem analysis. - observer: chunkText() splits messages over 3800 chars at the nearest whitespace within the last 200 chars before the boundary, so words don't get cut. Multi-chunk sends are sequenced so they arrive in order, tagged "(1/N)" / "(2/N)" / etc. Failure of one chunk doesn't abort the rest.
Mirror the host-side validator in the MCP tool itself so the agent gets immediate isError feedback when it tries to send a file from /tmp/ or any other path the host can't read. Previously the tool returned fake "File queued for sending" success regardless of path validity, then the host quietly logged "send_file: path outside allowed mounts" and dropped the file — no feedback loop, so the bot kept generating recipes that wrote to /tmp/ and called send_file on them. Allowed prefixes: /workspace/group/, /workspace/trusted/, /workspace/extra/. Error message tells the agent exactly where to write instead. Existence check still runs after the prefix check.
…rver architecture doc - Anthropic changed `display` default to 'omitted' on Opus 4.7 for faster streaming. Without explicit 'summarized', thinking blocks arrive with empty content + signature blob; the observer's 🧠 stream goes silent, the 🤔 reaction never fires. Older models default to 'summarized' and accept the explicit value as a no-op, so it's safe to leave on across all model versions. - docs/OBSERVER.md: full architecture writeup — five-layer pipeline, key design decisions (thinking display gotcha, stringly-typed wire format, chunking, per-source state, watchdog, reaction allowlist), enable/disable instructions, and what the system is NOT (not a security boundary, not a feedback loop, not a metrics replacement).
…d groups createFilteredDb was copying chats + messages but not reactions. check-unanswered.py joins on reactions to skip messages the bot already 👀-reacted to; without the table the join hit "no such table: reactions" and the whole script aborted, leaving every scheduled-task wake-up in untrusted containers as a silent no-op. Now the filtered DB: - Copies reactions scoped to the same chat_jid filter as messages (privacy boundary — untrusted containers still don't see other groups' reaction history) - Falls back to an empty reactions table with the known schema if src doesn't have one yet (fresh install before migrations) - Indexes (message_id, message_chat_jid) to match the host schema Stale filtered-db dirs need to be wiped to pick up the new schema: rm -rf data/filtered-db before next container spawn.
…gets escaped Phase 1b's tag-matcher used [a-zA-Z][a-zA-Z0-9-]* — no underscore in the character class — so agent-emitted tags like <tool_use_error>, <delivery_id>, <some_field> fell through unmatched. Telegram rejected the whole message with 400: "Unsupported start tag tool_use_error". HTML spec disallows underscores in tag names, but Claude and other agentic frameworks emit them constantly (Bash tool's blocked-action errors are wrapped in <tool_use_error>...</tool_use_error>, JSON dumps with snake_case field names leak as faux-XML, etc). Adding _ to the character class makes 1b catch them and escape to <...>.
Untrusted containers now get the OneCLI HTTPS proxy + CA cert injected
(previously only main + trusted did). NANOCLAW_TRUST_TIER env var tags
the container's tier, and onecli-mcp-stdio.ts uses it to gate which
MCP tools register.
For untrusted, only `gcal_freebusy` is registered. The freebusy
response shape is just {start, end} time pairs — no titles, attendees,
or event metadata — so participants in untrusted chats can ask the bot
"is Leonid free Thursday afternoon?" without learning what's actually
scheduled.
Calendar-id clamp: untrusted callers can't pass arbitrary calendar IDs
(otherwise they could probe email-address-shaped IDs that the user
happens to have shared access to, leaking who-knows-whom). Server-side
override forces calendarIds = ['primary'] when TRUST_TIER === 'untrusted'.
Single chokepoint: registerTool is wrapped at module init to silently
skip non-allowlisted tools. Adding to the untrusted surface is one
line in UNTRUSTED_ALLOWLIST, not 24 callsite edits.
The bot in untrusted groups had no way to recognize that the OneCLI/ OAuth account belongs to a specific Telegram user. When @ligolnik asked about "my calendar availability" in old.wtf (untrusted), the bot defaulted to "asker is a participant, primary calendar belongs to owner, decline" and refused — even though the asker WAS the owner. Pinning owner identity in CLAUDE-untrusted.md so the bot can match sender handles. Match is by @-handle (stable), not display name (collisions). Other participants still get the "I don't have your data" treatment, since they genuinely aren't the OAuth account holder.
…ually mounted) CLAUDE-untrusted.md is never mounted into untrusted containers — only SOUL-untrusted.md is, masqueraded as /workspace/global/SOUL.md so core-behavior's "read SOUL.md" works. The owner-identity patch I shipped to CLAUDE-untrusted.md earlier was invisible to the bot in old.wtf, which kept defaulting to "owner = Baruch" from upstream tile content. Moving the explicit owner identity (Leonid / @ligolnik / lim@igolnik.com) plus the "any Baruch reference in tile rules means Leonid here" override into SOUL-untrusted.md. CLAUDE-untrusted.md retained for parity if the mount ever changes — duplication is cheap. Local vendored tile copies were also scrubbed (not committed — they're gitignored, managed by tessl install). PRs filed upstream: jbaruch/nanoclaw-core#16, jbaruch/nanoclaw-trusted#10.
The (isMain || trusted) gate on host-side 👀 was previously commented as a "noise" filter — keeping the bot from acking chitchat. The actual reason is more important: in untrusted contexts the agent's bad-actor-disengage rule may decide to go silent (no text AND no reaction) when probing/social-engineering is detected. A host-side 👀 fires before the agent ever reads the message, so it leaks "I see you" to hostile probers regardless of what the agent decides. Comment updated to record the actual architectural reasoning so the gate doesn't get "simplified" away in a future cleanup. No code change.
Race: pollIpcDuringQuery drains a file (deletes from disk + adds to consumedInputFiles) and pushes content to MessageStream. If this fires AFTER the SDK has emitted Result and agent-runner has broken out of the for-await loop over responses, but BEFORE runQuery returns and ipcPolling flips to false, the file is gone but the stream is no longer being read. Message lost. Next waitForIpcMessage skips the file (already in consumedInputFiles or unlinked). Bot never sees the user message. Heartbeat's check-unanswered eventually rescues it, but with up to 15-min latency. Hit by msg 1289 in old.wtf at 22:31:40 today — pipe landed during the closing seconds of a query for msg 1288. Zero "Piping IPC message into active query" log lines confirmed the drain happened mid-window without delivering. Bot only answered 14m17s later when the cron heartbeat caught it. Fix: pollIpcDuringQuery now only checks for the _close sentinel. File draining is exclusive to waitForIpcMessage between queries. Throughput unchanged — SDK conversations are turn-based, mid-turn pushes weren't actually being processed mid-turn anyway, just queued for after the current turn. Now they queue at the file system layer instead, with deterministic delivery.
…iner A _close sentinel is single-shot signaling — orchestrator writes it, agent-runner reads + unlinks it, container exits. But if the container crashes / is killed / hits a graceful shutdown that doesn't make it through the unlink path, the file lingers. Next container spawn for the same group/session has its first IPC poll consume the stale sentinel and end the input stream. The SDK still finishes the in-flight prompt (it was pushed before the close fired), but the container exits immediately after — every subsequent user message pays a fresh spawn cost (~10-30s). Hit by old.wtf at 22:54 today: msg 1296 took 121s wall time. Container log shows "Close sentinel detected during query, ending stream" at 500ms in (one IPC_POLL_MS), then the SDK ran to Result, then "Close sentinel consumed during query, exiting". Stale file from an earlier shutdown was the trigger. Found 6 stale _close files across data/ipc/*/input-*/ at the time, oldest from 14:17. Fix: clean the sentinel on the host before each container spawn, in the same buildVolumeMounts setup block that creates the IPC dirs. The agent-runner's existing shouldClose() also unlinks on detection, which handles the case where orchestrator writes _close while the container is still alive — that path stays, but stale-file recovery no longer depends on it.
…g' pings The watchdog armed on every Query input: line for a source folder. Scheduled tasks (SmartThings refresh, check-unanswered heartbeats, nightly-housekeeping) run in maintenance containers but share the source folder with the user's default container. Long cron tasks crossing 120s fired "Still working — 120s in" into the user's chat — looking like the user's last message was still being processed when in fact the user's query had finished minutes ago and a separate cron was running. Hit by main DM at 23:02 today: home_status query took 20.1s (finished at 22:59:24), then SmartThings refresh cron started at 23:00:55 with wall=150s. Watchdog crossed 120s at ~23:02:55 and pinged the user's chat. Bot then hallucinated the explanation that home_status had been slow. Fix: detect scheduled-task input prefix (`[SCHEDULED TASK -` or `<untrusted-input source="...">[SCHEDULED TASK -`) in the Query input preview and skip startWatchdog. Cron tasks run silently as before; user-driven queries still get the watchdog ping behavior they need.
…fired The SDK Result message carries the agent's terminal assistant text. The orchestrator's onOutput callback in src/index.ts treats that as a user-facing reply and sends it to the chat. The design intent (upstream unchanged) is that the agent does ONE of: 1. Answer via the SDK's final text (no tool call) — that text is the reply 2. Answer via mcp__nanoclaw__send_message AND wrap any closing thought in <internal>…</internal> — strip code in onOutput removes it, no echo jbaruch#2 depends on the LLM reliably wrapping closing summaries. Opus 4.7 sometimes emits them as plain text instead, so users see two messages: the explicit send_message reply, then a "Confirmed." / "Explained ..." summary. Reproducible — happens on most multi-tool turns. Fix: track whether send_message / send_voice / send_file fired during the query. If so, set result=null in the writeOutput for the SDK Result. The orchestrator's `if (result.result)` gate skips. The agent's closing thought stays in `docker logs` for debug but doesn't leak to chat. Upstream benefits from this too — the bug exists there but manifests less because their prompt corpus apparently keeps the LLM disciplined about <internal> wrapping; defense in depth is cheaper than relying on prompt compliance.
Adds two sections to groups/global/CLAUDE.md (loaded by main + trusted
containers; untrusted is unaffected since it doesn't mount global/):
1. **Your tools live here** — directory of the python/js tools the
bot has built for itself (smartthings_history.py, home_status.py,
presence_chart.py, temperature_chart.py, tv_chart.py, methodology/)
plus the *_notes.md files. Mirrors what already exists in the
main group's local CLAUDE.md but at the global level so other
trusted contexts know what's available.
2. **Git — committing and pushing** — explicit allowlist:
- Allowed: commit + push to ligolnik/* repos; open PRs against
ligolnik/* repos. The main group's workspace is a git repo whose
origin is ligolnik/lombot — commit there freely.
- With permission only: third-party repos (jbaruch/*, qwibitai/*,
anyone else). Owner must explicitly say "open a PR upstream to X".
- Never: force-push to main, direct push to a third-party repo,
`gh pr create` without explicit --repo flag.
- Auth: OneCLI proxy injects token on api.github.com, no manual
PAT handling. Includes the curl recipe for PR creation via REST.
Adds an optional chat_jid parameter to the mcp__nanoclaw__send_message MCP tool. When set, the IPC payload targets that chat instead of the container's own. Authorization is unchanged — host-side ipc.ts:457-461 already enforces "main can target any chat, trusted/untrusted only their own". The MCP tool just plumbs the param through. Why: the bot in main DM had built bot_post.py as a parallel direct- Telegram-API path because send_message was hard-coded to its own chat. But bot_post.py wrote messages to Telegram without recording them in messages.db, so the agent in the target chat (and the cron heartbeat) never saw them — leading to "I can't find this message" hallucinations when other bots quote-replied to those out-of-band sends. Routing through send_message instead means the orchestrator's existing ipc.ts handler does both Telegram delivery AND the storeMessage call — single code path, automatic DB record, no parallel implementation drift. groups/global/CLAUDE.md updated with the new pattern + a "use this instead of out-of-band senders" note. bot_post.py is being removed from the main group's workspace as a follow-up.
Once-tasks were marked status='completed' but never deleted, so getAllTasks() (and the snapshot it feeds into list_tasks) returned them indefinitely. Over weeks the list grows unboundedly with stale entries the user can no longer act on. The fix adds pruneCompletedTasks(maxAgeMs) in db.ts and calls it once per scheduler loop iteration with a 24h TTL. The grace window keeps recently-finished tasks visible in list_tasks for inspection. cancel_task is unaffected — it still deletes immediately via deleteTask, so the cancellation workflow doesn't change. Recurring tasks are inherently excluded (computeNextRun never returns null for them, so they never reach status='completed'); the schedule_type='once' clause in the prune query is defensive. Fixes jbaruch#1.
The bot opened lombot PRs jbaruch#4 and jbaruch#5 today AND simultaneously direct- pushed the same content to main, leaving the PRs stranded. ~1,134 lines of drive_planner code landed without owner review. Updates the git rule in groups/global/CLAUDE.md to be explicit: - "Push to main directly" — never. Even on owner's own repos. Branch + PR + stop. - "Merge any PR" — never. Filing is the bot's output; merging is the owner's call. - Adds a 5-step PR workflow recipe: branch, push, open PR via REST, STOP. Tells the bot to surface the PR URL and wait, explicitly forbids the "self-review-and-merge" pattern. Tracking issue filed separately for technical enforcement (branch protection / separate git identity for the bot) so the rule isn't the only line of defense.
SmartThings changed PAT policy 2026-04-26 — tokens are now 24-hour-only,
no programmatic re-issuance. This adds a host-side watchdog that probes
/v1/locations every 30 min through OneCLI; on 401 it pings the owner's
main DM (chat_id 114893642) with the rotation procedure (deduped to
once per 4h per failure spell). On 200 after a 401 streak it sends
"✅ access restored". Probe is launchd-managed via
com.nanoclaw.smartthings-watchdog.plist.
Files:
- scripts/smartthings-watchdog.sh: bash runner; uses onecli run -- curl
for the probe (auto-injects PAT + CA), reads TELEGRAM_BOT_TOKEN from
.env, persists state at data/smartthings-watchdog-state.json
- scripts/install-smartthings-watchdog.sh: substitutes {{PROJECT_ROOT}}
and {{HOME}} in the plist, copies to ~/Library/LaunchAgents, loads
- launchd/com.nanoclaw.smartthings-watchdog.plist: KeepAlive=false,
StartInterval=1800
Also tessl: bump nanoclaw-trusted 0.1.48→0.1.49.
Cron change (every 15 min → every hour, --stale-hours 4, 401 short-
circuit) was applied directly to scheduled_tasks via IPC by the
background subagent — host DB state, not in this commit.
Sync with upstream after our PR jbaruch#2 (scheduler prune) merged. Brings in 4 follow-up scheduler refinements that build on our prune fix, plus unrelated upstream features. ## Conflict resolutions (4 files) - src/db.ts → take upstream. Strict superset of our pruneCompletedTasks: uses COALESCE(last_run, created_at) so NULL-last_run orphans (from failed dispatch paths) get pruned by created_at age, not lingering forever. Addresses our own ligolnik/nanoclaw-public#5. - src/task-scheduler.ts → take upstream. Adds getCompletedTaskTtlMs() env override, PRUNE_INTERVAL_MS=1h throttle, dormant-cron warnings with per-task cooldown. All on top of our PR jbaruch#2 foundation. - src/task-scheduler.test.ts → take upstream. New tests for the refinements; our 3 prune tests are still in there. - src/container-runner.ts → keep ours. Our OneCLI proxy injection, reactions filtered DB, stale _close cleanup, trust-tier env, per-tile owner support, additional-mounts authz are all forks-specific. Single conflicting hunk (env-file COMPOSIO forwarding) resolved in favor of our COMPOSIO removal — the env-file infrastructure (SECRET_CONTAINER_VARS, buildSecretEnvFile) stays from the auto-merged sections so future per-container secrets can use it. ## What's coming in from upstream (16 + 3-of-our-prune commits, paraphrased) Container runtime: - a253c16 secrets via 0600 env-file, not -e (ps-leak fix) - e690b55 cleanup tempfile on write failure Scheduler: - 1c5d69c set-based deletes in pruneCompletedTasks (jbaruch#32) - a507f68 NULL-last_run orphan handling, configurable TTL, throttled cadence, dormant-cron detection - d2947cb Copilot follow-ups on prune housekeeping (jbaruch#36) - 9ba3b0f apply COALESCE age threshold to dormant-recurring query (jbaruch#38) - f749296 plumb continuation_cycle_id env var on scheduled-task spawns - 71264b1 test(group-queue): parallel-session isolation under continuation chains Telegram emoji: - 86edff1 normalize Slack shortcodes to Unicode for reactions - c2b1721 proto-pollution guard + audit-friendly VS16 regex - c65f052 actually use ️ escape, fix proto-test comment Style/test housekeeping: - b144149 prettier-format db-migration.test.ts - b9c3b2c clean up tempDir in continuation_cycle_id migration test ## Test debt — addressed in PR review `npx vitest run`: 500 pass, 135 fail, 12 skip. Most failures are pre-existing divergences between our fork and upstream's test assumptions (selectTiles signature is TileRef[] vs string[], our container-runner.security.test.ts expects our env-file path, in-memory DB initialization). NEW breakage from this merge: zero — build is clean, our runtime behavior is preserved, the fix is in the test shape, not the code shape. Resolution plan documented in the PR body.
…ocks
After the upstream merge, two test files needed shape updates to match
fork-specific code:
- container-runner.test.ts (selectTiles, 5 tests) — our selectTiles
returns TileRef[] (per-tile owner support) not string[]. Added a
`names()` helper that maps to .name so the assertions stay readable.
Also expanded "main + trusted" expectations to include the
fork-local `flight-weather-watch` tile that's appended on those tiers.
- channels/telegram.test.ts (config mock) — our auto-react gate calls
getTriggerPattern() and the channel imports GROUPS_DIR for voice
transcription. The test's vi.mock('../config.js') was missing both
exports, causing 13 telegram tests to fail with "No 'getTriggerPattern'
export is defined on the mock". Added stubs (default trigger pattern,
/tmp/test-groups for GROUPS_DIR) — tests don't exercise either path
but the imports resolve at module load.
After fixes:
Test Files 26 passed (26)
Tests 635 passed | 12 skipped (647)
Started this run at 135 failures (mostly the better-sqlite3 ABI
mismatch that npm rebuild fixed); the remaining 19 were these two
shape divergences. Build still clean.
merge: sync with jbaruch/nanoclaw-public — 19 upstream commits + scheduler refinements built on our PR jbaruch#2
PR jbaruch#46 (sanitizer allowlist): - Drop 'br' from TELEGRAM_ALLOWED_TAGS (Bot API HTML mode doesn't support it; agents emit \n, not <br>) - Add Phase 1b unknown-tag describe block with 8 tests covering agent-emitted markers, JSON-shape leakage, mixed-case allowed tags, self-closing/attribute-bearing unknown tags, and the unknown-inside-fenced-code regression PR jbaruch#47 (final-text suppression): - Track tool_use ids of user-facing send calls in a Set, then watch for matching tool_result blocks and only flip the suppression flag when is_error !== true. A hook-denied or errored send_message no longer leaves the user staring at silence.
…jbaruch#33) Closes the lying-telemetry gap where script-gated tasks (e.g. heartbeat-telegram_old-wtf) wrote status='success' even when the gate script's payload signalled a hard failure. The 96 wasted wake-ups/day on the untrusted-tier heartbeat happened because a 'DB access failed' run was indistinguishable in run logs from a real success. The gate script's exit code is the source of truth (per the check-unanswered.py docstring contract: exit 1 = hard failure), but by the time the agent has dutifully wrapped the script's degraded payload into an <internal>...</internal> reasoning block, the orchestrator only sees the agent's 'success' status. This patch adds a stable-marker pattern check ('DB access failed', 'unable to open', 'env-warning:') against the result payload right before logAndUpdateTask writes the row — a positive match reclassifies the run as 'error' so it shows up in 'broken tasks' queries and counts against telemetry honestly. Result remains captured in the result/error fields for forensics. Fix A from the issue (restoring store read access for untrusted containers) is deliberately deferred to a follow-up — fixing the classification first stops the wasted spend; restoring the mount restores the safety-net coverage. Closes jbaruch#26 Co-authored-by: Leonid Igolnik <lim@Leonids-MacBook-Pro.local> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nds from main (jbaruch#34) Brings send_file and send_voice to parity with send_message: main may target any registered chat with chat_jid; trusted/untrusted ignore the param and fall back to the current chat (same authz gate as send_message in src/ipc.ts:457). Cross-chat sends are recorded in messages.db so downstream agents in the target chat see them. Closes jbaruch#25 Co-authored-by: Leonid Igolnik <lim@Leonids-MacBook-Pro.local> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y lost tasks (jbaruch#35) Closes the silent-failure class observed tonight where a wedged maintenance container blocked every scheduled task in a group for 80+ minutes — once-tasks pre-advanced to status='completed' but never ran, cron tasks had next_run rolled forward without firing, resurrectZombieTasks couldn't recover them because of its (correct) next_run > now storm-prevention guard. Three layers of fix: 1. Reject past schedule_value at schedule_task creation time — the highest-signal user-facing case (caller knows immediately their typo didn't schedule anything). src/ipc.ts adds a Date.now() guard in the once-branch with a structured warn log carrying parsedAt, nowAt, and source/target context. 2. Periodic sweep of group-queue pendingTasks: any task waiting longer than DISPATCH_DROP_THRESHOLD_MS (30 min, matches CONTAINER_TIMEOUT) gets dropped and recorded as status='error' in task_run_logs with a 'maintenance slot wedged' result. Surfaces the lost dispatch instead of leaving the row in the silent dropped-dispatch shape. Wired via setLogTaskRunFn from src/index.ts at orchestrator startup; tests inject a spy through _startDispatchSweepForTests. 3. Per-task watchdog in runTask kills the container after TASK_RUN_TIMEOUT_MS (default 30 min, overridable via NANOCLAW_TASK_RUN_TIMEOUT_MS) to prevent slot wedging in the first place. Records status='timeout' (new enum member on TaskRunLog.status) so the failure is visible and queryable separately from agent/script errors. Uses Promise.race so the await unblocks even when runContainerAgent never resolves. Stale 2025-06-01 fixtures in ipc-auth.test.ts bumped to 2099-01-01 so they survive the new past-rejection guard. Closes jbaruch#30 Co-authored-by: Leonid Igolnik <lim@Leonids-MacBook-Pro.local>
… via config (jbaruch#36) Closes jbaruch#29. Cherry-picked from upstream jbaruch/nanoclaw#253 (already merged upstream): the SDK's auto-compact was firing at 165k on a 1M context window because a CLAUDE_CODE_AUTO_COMPACT_WINDOW='165000' hardcode in agent-runner overrode the SDK's natural 1M-aware default. Move the value to an orchestrator-side env-var injection so the orchestrator's resolved AGENT_AUTO_COMPACT_WINDOW config (default 800k) flows through to the SDK, leaving ~200k of compaction headroom on Opus's 1M window. We do NOT adopt the upstream PR's ENABLE_THRESHOLD_NUKE gating — that flag and the threshold-nuke infrastructure don't exist in this fork. The forward is unconditional. If threshold-nuke lands here later, this becomes a one-line gate-add. Closes jbaruch#29 Co-authored-by: Leonid Igolnik <lim@Leonids-MacBook-Pro.local> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…es a secret (jbaruch#37) The SECRET_CONTAINER_VARS env-file forward (PR jbaruch#32) reads process.env to collect secret values to inject into containers. But the orchestrator runs under launchd, which doesn't auto-load .env into process.env, and no dotenv.config() call exists in dist/. Result: GITHUB_TOKEN sits in .env but never reaches a container — the SECRET_CONTAINER_VARS path silently produces an empty env-file and gets skipped. Match src/config.ts's established pattern: prefer process.env, fall back to a readEnvFile() lookup of the same key from .env. Single-call overhead per buildContainerArgs invocation; .env is small. Effective on orchestrator restart. Co-authored-by: Leonid Igolnik <lim@Leonids-MacBook-Pro.local> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that GITHUB_TOKEN is env-injected via SECRET_CONTAINER_VARS (PR jbaruch#32 + jbaruch#37), gh picks it up automatically — main/trusted containers can use gh directly without curl + REST gymnastics. Untrusted tier still gets no token, so gh fails closed with a clear auth error rather than silently using a stale identity. Adds GitHub's official apt repo with their signing key and installs gh via apt-get. Layer placed after the existing system-deps install so npm/COPY layers further down keep their cache. Co-authored-by: Leonid Igolnik <lim@Leonids-MacBook-Pro.local> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… mounts can open it (closes jbaruch#43) (jbaruch#46) The per-group filtered DB copy created by createFilteredDb is mounted read-only into untrusted containers as a single-reader snapshot. SQLite WAL mode requires writable -wal and -shm sidecar files even for read-only opens, so under a :ro mount any reader that opens the DB read-write (e.g. Python sqlite3.connect() default) fails with "unable to open database file". This silently broke the maintenance heartbeat for untrusted groups — check-unanswered.py runs inside the :ro mount and could not open the filtered DB, so the recovery path that detects and replies to unanswered messages never ran. The filtered copy is a one-shot snapshot: created fresh on every container spawn, single reader, no writers, no concurrency — none of WAL's tradeoffs apply. DELETE (the SQLite default) writes no sidecars and opens cleanly under :ro. The orchestrator's main store/messages.db is a separate file with concurrent writers and stays WAL. Updates the existing pragma regression test to assert DELETE. Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain>
…ore 1000 (closes jbaruch#44) (jbaruch#45) On bare-metal macOS hosts the running user is typically uid 501, gid 20. With HOST_UID / HOST_GID unset (the default), config.ts resolved them to `undefined`, and every call site in container-runner.ts collapsed that to `?? 1000`. The orchestrator then tried to chown files it owns (501 -> 1000) as a non-root user, which EPERMs on every container spawn and fills nanoclaw.error.log with "Failed to chown ..." warnings that are indistinguishable from real chown failures. Insert the host process's own uid/gid as the second link in the resolution chain: HOST_UID env -> process.getuid?.() -> call-site `?? 1000` DooD deployments still set HOST_UID / HOST_GID explicitly (because inside a container process.getuid() returns the container's uid, not the host's), so they are unaffected. Bare-metal macOS / Linux hosts now resolve to the real host uid -- chown becomes a no-op (own -> own) and stops firing EPERM. Windows, where process.getuid is undefined, still falls through to the call-site `?? 1000`. Call sites in src/container-runner.ts keep their `?? 1000` last-resort fallback unchanged. Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain>
… growth (jbaruch#48) * fix(ipc): persist consumed input list and add host GC to bound input/ growth (closes jbaruch#47) Untrusted-tier containers mount data/ipc/<group>/input/ read-only as a security boundary, so the agent's unlinkSync swallows EROFS and consumed files stay on disk. Combined with an in-memory-only consumedInputFiles Set, every container restart re-drained every file ever written for the group. One wtf group hit 1,586 stale files re-drained as one prompt (846k chars) which crossed the 800k auto-compact window and poisoned the resulting session. Trusted groups have RW mounts and don't hit this. Two halves, complementary: Agent side (container/agent-runner/src/index.ts): - Append consumed basenames to messages/_consumed_inputs.log on each successful drain. messages/ is RW even on untrusted containers. - Replay the log into consumedInputFiles on agent startup so a fresh container skips the lifetime backlog. - Append/load tolerate EROFS/EACCES/ENOENT defensively. - drainIpcInput / loadConsumedInputs now accept optional path overrides so tests can drive them against a tmp dir. Host side (src/ipc-gc.ts): - runIpcGc atomically renames _consumed_inputs.log to .processing, deletes each listed basename from input-default/ and input-maintenance/, then drops the .processing file. - Strict basename allowlist (/^[A-Za-z0-9_.-]+\.json$/) plus explicit refusal of /, \\, .. — defends against a compromised agent writing a path-traversal entry to the log. - Crash recovery: leftover .processing file is picked up as-is on next call, no re-rename. - Wired into src/index.ts: one startup pass per registered group, then a 60s setInterval. runIpcGcSafe wraps errors so a transient FS hiccup doesn't tear down the orchestrator. Tests: - src/ipc-gc.test.ts: 8 cases covering missing log, single/multi delete, ENOENT-tolerant, traversal rejection, allowlist rejection, crash-recovery, dedupe. - container/agent-runner/src/ipc-consumed-log.test.ts: 6 cases covering append-on-drain, no-op when nothing consumed, replay on startup, missing-log tolerance, end-to-end restart-skip, incremental append. Out of scope (per issue): one-shot rm of the existing 1,607 stale files in telegram_old-wtf/input-default/ — handled manually before this lands. Mount permissions and the EROFS-tolerant unlinkSync stay as-is; they're the security boundary this PR works around. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * style: apply prettier formatting Pre-commit hook reflowed two array literals and a type assertion to fit the line-length config. No semantic change — tests pass identically. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…loses jbaruch#49) (jbaruch#54) Was 1536m/2048m. Triggered 119 cgroup-OOM SIGKILLs (exit 137) in 12h across main groups, with sessions running 100-900s before dying. Node + SDK + MCP procs + skill subagent prompt + accumulated cache_read climbed past 1.5GB on heavy main sessions. Cheap fix that addresses the symptom; root cause is unbounded session cache growth across resumed turns, which is a separate question for a future PR (cap resume depth, or trigger compaction in bytes not tokens). Closes jbaruch#49 Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain>
…oses jbaruch#52) (jbaruch#53) When a `scheduled_tasks` row references a `group_folder` that is no longer in `registered_groups`, the dispatch handler used to ERROR-log "Group not found for task" and leave the row in place — so the scheduler dispatched the same orphan every poll interval forever. This produced 8+ ERROR entries/day per orphaned heartbeat in jbaruch#52. This commit changes the dispatch-time handler to delete the orphaned task and emit a debug log instead of an error. Rationale: - A missing group at dispatch is expected during the brief window between a group being deregistered and its tasks being cleaned up. It is not actionable, so the ERROR log was just noise. - The task is permanently orphaned — no future poll tick will resolve the missing group. Deleting it self-heals the situation. - Pre-existing orphans (e.g. the two `heartbeat-telegram_iff-lom-bot-test*` rows from jbaruch#52) are dropped on the next dispatch tick. This subsumes Option A's manual cleanup migration: deploy this fix and the orphans vanish on their own. No `DELETE FROM scheduled_tasks` script needed. Note on Gap fix jbaruch#1 (delete tasks on deregister): no programmatic group-deregister code path exists in this repo today — groups can only be added via setup/register.ts; orphans arise from manual SQL or registry edits. The dispatch-time self-heal is therefore a complete solution. If a deregister helper is added later, a sibling `deleteTasksForGroup(folder)` call should be added there for defence-in-depth, but it is not required to fix jbaruch#52. Includes a regression test that creates an orphan heartbeat with the same shape as the jbaruch#52 repro, confirms it is deleted on the first dispatch tick, confirms no ERROR-level log fires, and confirms a sibling task on a different (non-orphan) group is NOT touched. Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed emojis (jbaruch#55) * fix(telegram): translate bot-emitted IDs for reactions and log rejected emojis Closes jbaruch#50, closes jbaruch#51. ## jbaruch#50 — bot-id reaction translation (Path A) Pre-fix: outbound bot messages were stored in messages.db with a synthetic `bot-<ts>-<rand>` primary key. When something later asked the Telegram channel to react to one of those rows, sendReaction shipped that local id straight to Telegram's setMessageReaction, which rejected it with HTTP 400 (`message to react to not found`). 145 ERROR-level entries in 12h. Fix: store the platform-native numeric `message_id` returned by Telegram's sendMessage as the row's primary key when available, falling back to the legacy `bot-<ts>-<rand>` only for paths that don't surface a numeric id (currently sendPoolMessage which returns boolean, plus non-Telegram channels). sendReaction now: - passes numeric ids straight through (e.g. when an inbound message is being reacted to); - for legacy `bot-...` ids, looks the row up via getMessageById and reuses the row's primary key when it's numeric; - skips with a WARN (no Telegram call) when neither path yields a numeric id, so we stop burning API calls on doomed requests. Picked Path A over Path B (separate `telegram_message_id` column + schema migration) because the audit found nothing in the codebase parses the `bot-` prefix — `is_from_me=1` + `is_bot_message=1` are the canonical "bot row" markers, both still set unchanged. Single source of truth, no migration needed. Also downgrades known-unactionable Telegram 400s ("message to react to not found", "message to be replied not found", MESSAGE_ID_INVALID, chat not found, bot was blocked) from ERROR to WARN — these are the small fraction of failures that genuinely can't be retried, called out as out-of-scope but easy in jbaruch#50. ## jbaruch#51 — log rejected emoji The "Invalid Telegram reaction emoji, falling back to 👍" warn now ships `rejectedEmoji` (post-normalization form Telegram refused) and `originalInput` (pre-normalization, what the agent emitted) so we can extend the normalizer when patterns appear in logs. ## Audit findings Searched src/ for `bot-` literal usage and `startsWith('bot-')`: nothing in code parses the prefix — all four call sites that mint `bot-<ts>-<rand>` ids are write-only (storeMessage). The handful of matches in comments are explanatory. Safe to switch the id source. * style(telegram): apply prettier line-wrap left over from previous commit --------- Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain>
… kill (jbaruch#40) (jbaruch#41) PR jbaruch#35's dispatch-loss + per-task watchdogs catch wedged containers and record the symptom in task_run_logs. But the only evidence captured is "slot wedged for Xmin" — no insight into what was actually stuck. Adds a 5s-bounded best-effort diagnostic capture that runs immediately before either watchdog kills/drops anything. Captures docker inspect fields, ps -ef, /proc/<pid>/wchan (the load-bearing diagnostic — kernel wait channel tells us what syscall the agent-runner is blocked on), open TCP connections, last 100 docker log lines, and the triggering task context. Writes atomically to data/wedge-diagnostics/<iso>-<name>.txt. Each diagnostic command is bounded to 1s; the entire capture to 5s. A hung command (likely on a fully wedged container) leaves a placeholder and capture continues. The file's path is appended to the result field of the task_run_logs row so investigators can find the evidence. Refs jbaruch#40 Co-authored-by: Leonid Igolnik <lim@Leonids-MacBook-Pro.local> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on (jbaruch#63) The Query done summary already logs tokens_in/out, cache_read/create, and wall-time, and the container name attributes the line to a group. Adding the model (opus[1m]/sonnet/haiku) closes the last gap so docker logs alone are sufficient to compute per-group, per-model cost without touching the DB. Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…aruch#64) * feat(container-runner): per-spawn blocklist for maintenance-class spawns (nanocoai#337) (nanocoai#344) * feat(container-runner): per-spawn blocklist for maintenance-class spawns (nanocoai#337) The orchestrator copies every installed tile's rules and skills into every container at spawn time, contributing to the system-prompt `cache_create` cost on the first turn of every fresh session. For non-conversational task classes (heartbeat, nightly, weekly), most of that content is dead weight. Add `MAINTENANCE_RULE_BLOCKLIST` and `MAINTENANCE_SKILL_BLOCKLIST` env-driven sets that the install-into-container loop skips when `sessionName === MAINTENANCE_SESSION_NAME`. The filter applies to: - per-tile rules (`rulesContent.push` skipped + cpSync skipped) - per-tile skills (both the `.tessl/tiles/...` dst and the flat `tessl__<name>` skill copy skipped) - built-in `container/skills/` skills (bare-name match) - AyeAye-created staging skills (bare-name match) Filtered names emit a single `install_blocklist_filtered` log line per spawn (silent when nothing was filtered). Default-session spawns bypass the filter entirely — empty / unset blocklists are also a no-op, so this change is regression-safe with zero env-config. Tests cover regression (default-session unfiltered with non-empty blocklists), the maintenance-only fire path for both rules and skills, and the aggregated log emission. All 1494 existing tests continue to pass. Refs nanocoai#337, nanocoai#104. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(nanocoai#337): break import cycle and wire envConfig keys (PR nanocoai#344 review) Two fixes from the Copilot review: 1. **Circular dependency.** `container-runner.ts` imported `MAINTENANCE_SESSION_NAME` from `group-queue.ts`, but `group-queue.ts` already imports `DEFAULT_SESSION_NAME` / `sessionInputDirName` from `container-runner.ts`. Moves the `MAINTENANCE_SESSION_NAME` definition to `container-runner.ts` alongside `DEFAULT_SESSION_NAME` (where the session-name family already lives) and re-exports it from `group-queue.ts` so callers that previously imported from there keep working unchanged. 2. **`envConfig.MAINTENANCE_RULE_BLOCKLIST` was always undefined** because the keys weren't passed to `readEnvFile([...])` at the top of `config.ts`. Added them so the `.env`-file fallback actually resolves; previously the code would only read from `process.env` and silently ignore values placed in the project's `.env` file. Updates `task-scheduler.test.ts` to add `MAINTENANCE_SESSION_NAME` to its `vi.mock('./container-runner.js')` payload — the mock previously only exported `DEFAULT_SESSION_NAME` from container-runner, so the re-export chain via `group-queue.ts` resolved to undefined and broke `scheduled task ignores any cached maintenance sessionId` (nanocoai#193). `npx tsc --noEmit` clean. `npx vitest run` green (1463/1463). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(task-scheduler): per-task SDK session reuse for recurring fires Recurring scheduled tasks (cron / interval) now keep their own session_id across fires so the API can cache the per-session message-history prefix even though the prompt-cache TTL (5 min) expires between heartbeat fires (15-30 min cadence). Each fire's cache_create becomes incremental rather than full-prefix. New nullable scheduled_tasks.session_id column populated by runTask on first fire, passed as resume: on every subsequent fire. One-shot tasks (schedule_type === 'once') stay fresh-per-fire — out of scope. The nanocoai#193 cross-task bleed concern doesn't apply: persistence is keyed on task_id, so two distinct tasks land in two distinct rows hence two distinct SDK sessions — no slot-cache aliasing possible. The dead slot-cache write that previously happened (sessions[group][maintenance]) is dropped — it was never read back. nuke_session IPC clears session_id for every scheduled task in the affected group when the maintenance / 'all' slot is wiped — without this the next fire would resume an id whose JSONL is gone. Hand-ported from jbaruch/nanoclaw#357 (the upstream PR pulls in 4-5 unrelated transitive deps that aren't in our public tree). Closes jbaruch#59. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Baruch Sadogursky <jbaruch@sadogursky.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain>
…tion The new jbaruch#59 test annotated MAIN_GROUP with `RegisteredGroup`, but the type isn't imported in this test file (other tests use plain object literals without the annotation). `npm run build` (full tsc, not the test-mode `--noEmit`) caught the missing import. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…aruch#65) * feat(container-runner): add /workspace/state writable mount Ports jbaruch/nanoclaw#220 (Cat 4 of nanocoai#99 audit). Adds a per-group canonical writable mount at /workspace/state — sourced from data/state/<folder>/ and mounted into every container regardless of trust tier. Skills that need to persist state across runs now have a single canonical writable location. Per-group scoping (not per-session) so a scheduled task and a user-facing turn in the same group can read each other's state. CHANGELOG and docs/SPEC.md updated; container-runner.test.ts gains 4 new tests (admin/trusted/untrusted writable + per-group host path). * feat(index): wire heartbeat to use precheck-gate script Sets the `script` field on the auto-created non-main heartbeat task to the unanswered-precheck Python script. The agent-runner's existing runScript / wakeAgent mechanism gates SDK invocation on the precheck's `{"wakeAgent": <bool>}` output — when there are no new candidates the container exits without spending tokens. When the precheck reports new work or fails it falls through to the original prompt unchanged. Adds a one-shot startup migration (`backfillHeartbeatPrecheckScript`) so already-deployed installs pick up the gate without manual DB edits. Idempotent and only touches non-main heartbeat-* rows whose prompt mentions check-unanswered, leaving the main-group maintenance heartbeat (which uses the tessl__heartbeat skill, not check-unanswered) alone. Closes part of jbaruch#62 (heartbeat container spawns running 0 queries). --------- Co-authored-by: Baruch Sadogursky <jbaruch@sadogursky.com> Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain>
…orts jbaruch/nanoclaw#257) (jbaruch#67) * fix(container-runner): skip non-files when copying skill scripts `tmpScriptsDir` (published as `/workspace/group/scripts/<name>`) is a flat dir of executables, but the per-tile loop in `buildVolumeMounts` called `fs.cpSync` without `{ recursive: true }` on every entry under `<skill>/scripts/`. As soon as anything created a subdir there — most realistically Python writing `__pycache__/` next to a `.py` script that got imported during a previous container run — the next spawn crashed with `Recursive option not enabled, cannot copy a directory` and the group's circuit breaker tripped after five retries. Filter the loop to file-typed dirents instead of widening the copy to recursive: subdirs aren't reachable through the flat `<name>` contract anyway, so silently dropping them matches the publish surface and keeps `__pycache__/` (and any future stray runtime byproduct) from blocking spawns. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(container-runner): preserve symlinks in script copy + add regression tests Address PR nanocoai#257 review feedback: 1. Filter on `!isDirectory()` instead of `isFile()`. The previous non-recursive cpSync loop happily walked symlink dirents (cpSync resolves them by default), so an `isFile()` filter would have silently dropped any symlinked executable a tile shipped under `<skill>/scripts/`. The bug was specifically about *directories* crashing the non-recursive copy — narrow the skip to that. 2. Extract the loop into `copyTileScriptsToFlatDir` so it's testable without dragging in the rest of `buildVolumeMounts` (which is security-mocked in the companion test file). 3. Add `container-runner.scripts-copy.test.ts` — separate file so it can use real fs in a tmpdir without colliding with the global `vi.mock('fs')` in `container-runner.test.ts`. Covers the regression case (`__pycache__/` subdir present), regular file copy, symlink preservation, no-op when source dir is absent, and the mixed files/dirs scenario. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(container-runner): tighten script-copy filter to explicit allowlist Address PR nanocoai#257 round-2 review feedback from Copilot: - Switch the predicate from `!isDirectory()` to an explicit `isFile() || isSymbolicLink()` allowlist. The negative form would have happily passed FIFOs / sockets / device entries to `fs.cpSync`, which rejects them with EINVAL — same crash class the original __pycache__ bug fell into. The allowlist enumerates exactly the two dirent kinds the `/workspace/group/scripts/<name>` contract can publish; everything else is skipped before cpSync sees it. - Update the doc on `copyTileScriptsToFlatDir` to match the actual behavior (was implying `!isDirectory()` semantics). - Rename the symlink test from "preserves symlink entries" to "does not drop symlink entries" — the previous wording suggested the destination would still BE a symlink, but cpSync's default deref copies through. The regression we guard against is symlinks being silently dropped by the loop, not symlink-ness on the destination. - Add a FIFO test as the cheap, cross-platform witness that the allowlist actually skips non-{file,symlink,dir} kinds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(container-runner): relax doc, correct cpSync default in test comment Address PR nanocoai#257 round-3 review feedback from Copilot: - Relax doc on `copyTileScriptsToFlatDir` from "symlinks to regular files" to just "symlinks". The code allowlists every symlink dirent without resolving the target, so the doc was overstating what we enforce. Add a sentence saying targets aren't inspected — the tile owns whether what it links to is sensible. - Fix the symlink-test comment: the previous wording said cpSync "dereferences symlinks during the copy", but Node 20's default for `dereference` is `false`. Updated to note the actual behavior (destination IS a symlink) and explain why the test asserts reachability + content rather than symlink-ness — the regression we're guarding is silent drops in the loop, not destination symlink-ness, so a future flip of cpSync defaults wouldn't be a bug-class regression. Declined the suggestion to add a `statSync(srcDir).isDirectory()` guard: a tile shipping a non-directory at the `scripts/` path is a broken-tile state no defensive layer at this site can rescue, and `existsSync` already gates the loop. Reply on the thread. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Baruch Sadogursky <baruch@tessl.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ge reads (nanocoai#235) (jbaruch#66) * db: enable WAL + busy_timeout to prevent transient malformed-image reads Cross-process readers (agent containers, ad-hoc sqlite3 CLI invocations) intermittently saw "database disk image is malformed" while opening messages.db. The DB itself was fine — the file was just being read mid-write while in rollback-journal mode, so the snapshot reader hit a state needing journal replay and failed the page-format check. Fix: set the WAL pragma trio in initDatabase() right after opening the connection. WAL keeps writers off the main file (they hit a sidecar log), so cross-process readers always see a consistent snapshot. db.pragma('journal_mode = WAL'); db.pragma('synchronous = NORMAL'); db.pragma('busy_timeout = 5000'); NORMAL is the standard synchronous pairing for WAL (FULL is overkill once writes go through the WAL). busy_timeout = 5000 smooths the rare contention case (default is 0 = immediate-fail on SQLITE_BUSY). Side effect: two new sidecar files appear next to messages.db (messages.db-wal, messages.db-shm) — store/ is already gitignored. No backup-path changes needed — verified that nothing in scripts/ copies messages.db with cp/rsync/tar (the cleanup/deploy/nuke scripts all go through the sqlite3 CLI, which handles WAL transparently). The in-process createFilteredDb in container-runner.ts uses ATTACH + CREATE TABLE AS SELECT, also WAL-safe. Closes ligolnik/nanoclaw-public#9 * fix(db): verify WAL engaged + propagate busy_timeout to filtered DB Address Copilot review on PR nanocoai#235: 1. `journal_mode = WAL` can silently fall back when the filesystem doesn't support shared-memory mmap (some FUSE / network mounts). Capture the effective mode and throw if it isn't `wal` — the whole point of this fix is consistency, so a silent fallback to rollback journal would defeat it without any signal. 2. `createFilteredDb()` opens its own better-sqlite3 connection and ATTACHes the WAL-mode source DB. With the default `busy_timeout=0` it would fail immediately on any lock contention against the source (e.g. during a checkpoint), defeating the smoothing the orchestrator side just gained. Match the orchestrator value (5000ms) so contention smoothing is symmetric across readers. --------- Co-authored-by: Baruch Sadogursky <jbaruch@sadogursky.com> Co-authored-by: Leonid Igolnik <lim@igolnik.com>
…jbaruch/nanoclaw#260) (nanocoai#68) * fix(config): warn when HOST_UID/HOST_GID env are malformed A set-but-malformed value (`HOST_UID=foo` parsing to `NaN`, or `HOST_UID=-1`) used to flow through `parseInt` straight into the exported binding. Downstream chown sites then either crashed (`fs.chownSync(..., NaN, NaN)` throws) or silently mis-owned (`-1` casts to uid 4294967295). Worse, the same fail-open posture that handles the legitimate "Mac host without DooD" case made operator misconfiguration look identical to "no override needed", so the original permission issue (nanocoai#254) reappeared invisibly. Centralize parsing in `parseHostId`, mirroring the `resolveAgentAutoCompactWindow` shape already in this file. Set-but-invalid values now warn once to stderr and resolve to `undefined` so downstream sites take their default branch instead of forwarding the bad value. Stderr (not the structured logger) keeps `config.ts` below `logger.ts` in the import graph — the constraint is documented in `host-logs.ts:41-45`. Closes nanocoai#258. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(config): reject partial-parse and empty HOST_UID/HOST_GID Per Copilot review on nanocoai#260: `parseInt("123abc", 10)` returns 123 and `parseInt("1.5", 10)` returns 1 — silent partial parses that defeat the whole purpose of validating operator input. The `if (!raw)` guard also treated an explicit empty string (e.g. a `.env` line truncated to `HOST_UID=`) as "unset", absorbing exactly the typo shape the validator was supposed to surface. Switch to a strict digits-only regex (`/^\d+$/`) and key the unset branch on `raw === undefined` so empty strings flow into the warning path. Add three test cases — `123abc`, `1.5`, and the empty-string shape — pinning the contract. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Baruch Sadogursky <baruch@tessl.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ntenance idle (jbaruch#57) (nanocoai#69) Fixes the maintenance-slot wedge cascade documented in jbaruch#57: a heartbeat container ran 30:01 min after the agent emitted "Stopping silently per instructions", silently dropping every queued task behind it (drive_planner T-30 and T-15 rechecks, log_watchdog) at the 30-min dispatch-loss threshold. Two layers, both shipped here: Layer 1 — agent-runner (container/agent-runner/src/index.ts) - Synthesize a terminal `{status: 'success', result: ''}` write when runQuery's SDK iterator drains without yielding a `result` event. Without this the host's task-scheduler never sees a terminal success, so scheduleClose's 10s teardown timer never fires and the container idles to CONTAINER_TIMEOUT (30 min). - Hard-exit watchdog: when the host writes `_close`, give the SDK at most 30s to drain, then process.exit(0). stream.end() only signals "no more user messages" — a wedged tool call can keep the iterator alive indefinitely. The 30s grace is enough for any real tool to complete; past that, the SDK is refusing to give up and we force the exit. Layer 2 — host (src/config.ts, src/container-runner.ts) - New MAINTENANCE_CONTAINER_TIMEOUT (default 5 min, env-overridable). Used in runContainerAgent's hard-timeout when sessionName === 'maintenance', bypassing the IDLE_TIMEOUT+30s floor that exists for the user-facing default container's graceful-close window. Maintenance containers are single-turn and don't need 30 min of idle headroom; the shorter cap is the safety net for the failure modes Layer 1 may not catch. Tests - src/container-runner.test.ts: maintenance-slot containers fire hard timeout at MAINTENANCE_CONTAINER_TIMEOUT (5 min), default-slot retains the 30-min idle floor (regression guard). - src/task-scheduler.test.ts: scheduleClose fires on a synthesized terminal success even when the container hangs after the streaming output (the wedged-container shape jbaruch#57 documents). Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…eploy (nanocoai#120) (nanocoai#70) Adds a pre-pull guard in `scripts/deploy.sh` that aborts the deploy if `git remote -v` shows an HTTPS URL with embedded user:token credentials (`https://<user>:<token>@github.com/...` or the GitHub-specific `x-access-token:<pat>@github.com/...` shape that triggered nanocoai#106). Why deploy.sh: this is the script that runs on every NAS deploy, so it's the enforceable choke point. The `setup` skill is instruction- only — it can't refuse to proceed if the operator skips a step. Failing loud at deploy time guarantees the leaked-PAT shape never silently survives. Why a regex match rather than a positive allow-list: switching to SSH or to `credential.helper` are both legitimate setups, and listing every shape would drift. Negative match on the leak pattern is narrow and unambiguous. Test plan - Manual regex test: PAT URL trips, SSH URL passes, clean HTTPS passes. - Operator action remains: rotate the leaked PAT (https://github.com/ settings/tokens) and `git remote set-url origin git@github.com: <owner>/<repo>.git` before re-deploy. Co-authored-by: Baruch Sadogursky <jbaruch@sadogursky.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (nanocoai#71) Grammy's `HttpError` / `FetchError` serializes the full request URL into the error's `.message` and `.stack`: FetchError: request to https://api.telegram.org/bot8460672283:AAHh…/sendMessage failed The orchestrator's `logger.error({err}, …)` then writes `err.message` + `err.stack` verbatim via `formatErr`, so the real bot token lands in `logs/nanoclaw.log` on every Telegram send failure. Logs often leave the host — debugging pastes, shared diagnostics, log shipping — so this is a latent credential-leak path in error handling itself. Fix: apply `redactBotTokens` at the output layer in `logger.ts`, on the fully-formatted line right before it reaches the stream. Every log path — plain strings, structured fields, serialized Error messages/stacks — goes through the same filter, so one redact covers all shapes. The bot ID is preserved so operators can still tell main (`8460672283`) from pool bots (`8716901520`, etc.) in log correlation; only the 35-byte secret fragment is replaced with `<redacted>`. Sold on its own merit, NOT as a nanocoai#81 fix — see issue retraction comment. Closing this path keeps the defense complete regardless of the nanocoai#81 outcome. Tests: three for the pure redact helper, three for each payload shape the logger can receive (plain string, structured data, Error). All 508 tests pass. Build clean, prettier clean. Refs: nanocoai#81. Co-authored-by: Baruch Sadogursky <jbaruch@sadogursky.com> Co-authored-by: AyeAye <ayeaye@nanoclaw.bot> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nocoai#90) (nanocoai#72) The main-group block in `buildVolumeMounts` /dev/null-shadows each SECRET_FILES entry at `/workspace/project/<relPath>` — but that shadow only covers the canonical project mount. An additionalMount can re-expose the nanoclaw tree at a DIFFERENT container path (e.g. a group registered with `hostPath: ~/nanoclaw` lands it at `/workspace/extra/projects/nanoclaw/`), and the `.env` at `<mount>/.env` has no shadow applied there. In that state, a trusted agent could read real bot tokens out of the extra mount even though the canonical `.env` is `/dev/null`. It's a latent gap — I hit it while investigating nanocoai#81, and even though the nanocoai#81 ghost-message investigation turned out NOT to traverse this path (see issue retraction comment), closing the gap is still the right call: the SECRET_FILES list exists precisely to make this kind of bypass impossible. Fix: after `validatedMounts` are added, iterate them and, for every SECRET_FILES entry whose host path is reachable under a mount's host path, add a `/dev/null` bind at the corresponding container path inside the extra mount. `path.relative` returning a non-empty, non-`..`-prefixed, non-absolute string is the "inside" predicate — matches Docker's path resolution. Only shadows files that actually exist on the host (same invariant as the main-path loop) to avoid creating spurious empty files inside extra mounts. Tests: four focused additions in container-runner.security.test.ts: - shadow every SECRET_FILES entry when mount host path == project root - skip missing secret files - no shadow for unrelated mount host paths (false-positive guard) - correct sub-path when the mount is a parent of the project root All 506 tests pass. Build clean, prettier clean. Refs: nanocoai#81. Co-authored-by: Baruch Sadogursky <jbaruch@sadogursky.com> Co-authored-by: AyeAye <ayeaye@nanoclaw.bot> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ale-image race (nanocoai#108) (nanocoai#73) * fix(nanocoai#69): rebuild agent image before orchestrator to close stale-image race `docker compose up -d --build` rebuilds and restarts the orchestrator in one step. If `./container/build.sh` ran AFTER it, the new orchestrator was live while `nanoclaw-agent:latest` still pointed at the pre-deploy image — any inbound message in that window spawned an agent from the stale image, the same class of bug jbaruch#66 closed for the build itself. Swap the order: build the agent image first (while the OLD orchestrator still serves requests against the OLD agent — pre-deploy steady state, not a regression), then `docker compose up -d --build` recreates the orchestrator on already-fresh agent image. Refactors the single "2." step into "2a. agent" / "2b. orchestrator" so the ordering is explicit in both the script header and the runtime output. * fix(nanocoai#69): update --tiles-only echo to match 2a/2b numbering (Copilot review) * fix(nanocoai#69): clarify step 2b vs step 7 + document accepted #2b race window (Copilot review) * fix(nanocoai#69): rewrite #2b race-window comment to be internally consistent (Copilot review) * fix(nanocoai#69): generic agent-image reference in race-window comment (Copilot review) * fix(nanocoai#69): rewrite step-2 comment + handle digest-pinned CONTAINER_IMAGE (Copilot review) * fix(nanocoai#69): digest match also covers tag-less refs (Copilot review) * fix(nanocoai#69): plumb CONTAINER_IMAGE through docker-compose env (Copilot review) * fix(nanocoai#69): source .env in deploy.sh so it and compose see same CONTAINER_IMAGE (Copilot review) * fix(nanocoai#69): parse .env as data + preserve shell precedence (Copilot review) Address three Copilot concerns on PR nanocoai#108's docker-compose plumbing: 1. **`source .env` executes arbitrary shell**: replaced with a grep+sed parser that reads ONLY `CONTAINER_IMAGE=<value>`. Any other content in `.env` (intentional or accidental) is no longer executed on the host. 2. **Shell-exported values win, matching compose precedence**: the parser is gated on `[ -z "${CONTAINER_IMAGE:-}" ]` — if the operator already set CONTAINER_IMAGE in the shell, we don't touch it. `docker compose` interpolates with the same shell-wins precedence, so script and compose now agree on which value applies. 3. **compose.yml comment wording fix**: the prior text said "shell- export before docker compose up" for `.env`, which is wrong — Compose reads `.env` directly without needing a shell-export. Updated to describe both forms accurately and point at the matching .env-parser in deploy.sh. * fix(nanocoai#69): make .env parse safe under set -euo pipefail (Copilot review) The previous `grep ... | head -n1 | sed ...` pipeline could abort the script: grep exits 1 when the file has no CONTAINER_IMAGE entry, and head closing the pipe hands grep a SIGPIPE on duplicate keys — both trigger `set -e` via `set -o pipefail`. Switch to `grep -m1` (single tool, no pipe to head, stops on first match) with a trailing `|| true` so a missing CONTAINER_IMAGE in .env falls through to the default instead of aborting the deploy. --------- Co-authored-by: Baruch Sadogursky <jbaruch@sadogursky.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nanocoai#74) * fix(deploy): graceful agent close instead of unconditional docker kill (nanocoai#221) Pre-nanocoai#221 step 5 ran `docker kill` on every nanoclaw-* container unconditionally — exit 137 across every in-flight conversation turn and scheduled-task run, even when the agent was 100ms from completing its reply. The rationale was "force fresh tile load after rebuild", but for the typical agent that's mid-query, the right behaviour is "let it finish its current turn, exit cleanly, next message spawns a fresh container with new tiles". Pre-nanocoai#221 deploys also defeated nanocoai#213's handoff marker for the deploy case: the marker survived step 2b's recreate, then step 5 killed every adopted agent anyway. Pattern: write the agent-runner's `_close` IPC sentinel into each agent's input dir, give them a 30s grace window to exit naturally, then force-kill any holdouts (genuinely-stuck agents in long tool calls beyond the grace). User-visible: at most one stale-tile turn per group per deploy (the in-flight turn that completes), instead of every in-flight turn destroyed mid-stream. Mount discovery uses `docker inspect` to find the bind mount whose Destination matches the agent-runner's `IPC_INPUT_DIR` constant (`/workspace/ipc/input`). Same destination path for every agent regardless of group/session — much simpler than reverse-engineering the group folder + session name from the container name suffix (which would also be ambiguous: group folders containing underscores get sanitised to dashes in the container name). Closes nanocoai#221. * fix(deploy): scope grace-window + force-kill to ORIGINAL agent set (nanocoai#221) Address PR nanocoai#222 review (3 findings): 1. Scoping the grace-window poll and the final force-kill to all currently-running nanoclaw-* containers (instead of the original set this loop signaled) was wrong: a new agent that spawns mid-deploy (inbound message during the grace window) would incorrectly extend the grace window AND land in the holdout force-kill set despite being unrelated to this deploy. Track ORIGINAL_AGENTS in an associative array and re-derive holdouts as intersect(ORIGINAL_AGENTS, currently-running). 2. Stale comment about a "-n check below" updated to match the actual `[[ -z ... ]]` check that replaced it. 3. A failed `touch _close` (permissions, transient FS error) was silently dropping the container off the SIGNALED list without adding it to UNRESOLVED, so the agent kept running until force-kill while the operator-visible count claimed success. Now the touch failure adds the container to UNRESOLVED, and the WARN message names both failure modes (mount unresolved or _close write failed). --------- Co-authored-by: Baruch Sadogursky <baruch@tessl.io>
…anocoai#111) (nanocoai#75) * fix(nanocoai#100): nuke_session deletes the on-disk JSONL transcript Closes nanocoai#100. `nuke_session` killed the runtime container(s) and dropped the session DB row, but left the JSONL transcript intact at `<DATA_DIR>/sessions/<group>/<sessionName>/.claude/projects/<slug>/<sid>.jsonl`. When the next container started, Claude Code re-read the transcript and whatever put the session into a bad state — poison, stuck plan, corrupt memory, classifier-tripping content — was immediately back. The only actual recovery was SSH + `rm` by hand. The MCP tool named `nuke_session` did not nuke the session. This patch makes nuke a true clean slate per targeted slot: 1. Capture the SDK sessionIds for the slot(s) we're about to wipe (must happen BEFORE step 3 — once the in-memory map and DB row are cleared we can't find which JSONL to delete). 2. Kill the container(s) (existing behaviour). 3. Clear the in-memory `sessions[groupFolder][sessionName]` and drop the DB row (existing behaviour). 4. Delete the matching JSONL on disk via `wipeSessionJsonl` — the new helper. Walks every project-slug subdir under projects/ rather than hardcoding `-workspace-group` so a future slug change (or an operator who renamed the workspace path) doesn't silently leave stale JSONLs behind. Delete-while-open is safe on POSIX (open FDs keep writing to a phantom inode that vanishes on close), so we don't have to wait for closeStdin to actually terminate the container before unlinking — keeps nuke's latency the same as before. Tool description in container/agent-runner/src/ipc-mcp-stdio.ts updated to be explicit about the destructive disk operation, since the previous "kill the container(s)" wording is exactly why callers misjudged what nuke did. Tests: 7 new in src/nuke-session.test.ts covering the happy path, missing-file no-op, missing-projects-dir no-op, multi-slug walk, selective sessionId match, slot isolation, and stray-file resilience. * style: prettier --write on nuke-session.test.ts (CI format check) * fix(nanocoai#100): harden wipeSessionJsonl against path traversal + IO races (Copilot review) Address four review items on PR nanocoai#111: 1. **Path traversal (security)**: `sessionId` ultimately comes from container stdout, which is untrusted for untrusted-tier groups. A crafted value like `../other-slug/real-uuid` would have been interpolated into the unlink path. Now reject anything that fails `^[A-Za-z0-9_-]+$` up-front, and after `path.join` assert the resolved path is contained inside the slug directory (defense in depth against future regex loosening or platform path quirks). 2. **`@internal` JSDoc tag**: marks the export as test-only so `tsconfig.stripInternal` doesn't surface it in generated `.d.ts`. 3. **ENOENT race + EACCES/EPERM**: `existsSync`-then-`readdirSync` is racy and the docstring claimed "no-op on missing dirs". Wrap `readdirSync` and `unlinkSync` in try/catch — ENOENT becomes a no-op, other errno values get logged via `logger.warn` so the orchestrator can't be crashed by a transient FS error. 4. **Comment count**: "destructive across THREE locations" → "runs in four steps" to match the actual enumeration. Tests: 2 new in `src/nuke-session.test.ts` covering the path-traversal and bad-charset rejections (sentinel files left untouched). Total 9. * fix(nanocoai#100): DoS cap on project-slug enumeration + per-worker random tempdir (Copilot review round 2) * fix(nanocoai#100): use opendirSync iterator + reject symlinks (Copilot review round 3) Two further hardening items on PR nanocoai#111: 1. **DoS via materialized listing**: the previous `readdirSync(...)` + `entries.length > MAX` guard didn't actually defend, because `readdirSync` materializes the full directory listing first — a compromised container creating millions of entries would force the orchestrator to allocate the full array before our cap fired. Switched to `opendirSync` + `readSync` iterator: per-step cost only, with an early `break` once `MAX_DIRS_VISITED` (1000) is hit. 2. **Symlink-as-directory ambiguity**: `Dirent.isDirectory()` (from `withFileTypes: true`) returns true for symlinks-to-directories on some filesystems where readdir falls back to stat. Use `lstatSync` on each entry and reject `isSymbolicLink()`. A symlink under projects/ is not something Claude Code creates legitimately, so refusing to traverse closes the symlink-escape vector entirely. Test: 1 new in `src/nuke-session.test.ts` covering the symlink case (projects/evil-slug → outside_projects/), asserting deleted=0 and sentinel survives. Total 10 tests in the file. * fix(nanocoai#100): move @internal tag above wipeSessionJsonl (Copilot review) * fix(nanocoai#100): reject symlinked projects/ directory itself (Copilot review) * fix(nanocoai#100): cap on dirs traversed (not all entries) + raise limit (Copilot review) * fix(nanocoai#100): try-known-slug-first + TOCTOU realpath + total-entries cap (Copilot review round 8) Three round-8 items on PR nanocoai#111: 1. **Flood-attack mitigation (try-known-slug-first)**: a compromised container could fill `projects/` with thousands of decoy slug subdirs to push the legitimate `-workspace-group` past any walk cap (the previous 10K cap raised the bar but didn't eliminate it). Now: try `path.join(projectsDir, '-workspace-group')` directly FIRST via the new `unlinkJsonlInSlug` helper. On the steady-state path the function returns after one realpath + one unlink — no walk at all. The slow-path full walk remains as a safety net for slug drift (rename, future CLAUDE_PROJECT_SLUG change). 2. **TOCTOU on lstat → opendir / lstat → unlink**: the previous defense was lstat-then-opendir, racy if a container can swap `projects/` to a symlink in between. Two-layer fix: - After opendirSync, `realpathSync(projectsDir)` and verify it resolves inside `realpathSync(DATA_DIR)`. macOS-aware (handles /var ↔ /private/var symlink). - Per-slug unlink path now uses `realpathSync(slugPath)` + `realpathSync(jsonlPath)` and asserts the latter starts with the former + sep. realpath dereferences the entire symlink chain, so a symlink swap of slugPath (or any ancestor) post- lstat is caught here. 3. **Total-entries cap**: even with dirs-only counting, a `projects/` filled with millions of stub FILES would block the orchestrator event loop synchronously through `dir.readSync()` iteration. Added MAX_TOTAL_ENTRIES (100K) bounding total readSync calls; MAX_DIRS_VISITED (10K) still bounds the lstat+unlink work. The 11 existing tests still pass. * fix(nanocoai#100): unlink symlinked jsonl + distinguish bail reasons (Copilot review) Three round-9 items on PR nanocoai#111: 1. **Symlinked JSONL was leaked**: a compromised container could replace `<slug>/<sessionId>.jsonl` with a symlink. The previous realpath-containment branch would refuse to unlink (target outside slug) and leave the link entry on disk — defeating the "nuke really nukes" promise. Now: lstat first, branch on `isSymbolicLink()`. For a symlink, `fs.unlinkSync` removes the link itself (NEVER deletes the target file the link points at), regardless of where the link points or whether it's dangling. Regular files still go through the realpath containment check. 2. **Bail-log conflated two limits**: `bailedOnLimit` was a bool, and the error message + logged fields always reported `dirsVisited` and `MAX_DIRS_VISITED` — even when the cap that fired was actually `MAX_TOTAL_ENTRIES` (stub-file flood). Switch to a discriminated string type ('total-entries' | 'dirs-visited' | null) and emit a different log message + fields per cap, so an operator looking at the error knows which attack vector tripped. Tests: 2 new (symlinked JSONL with target preserved; dangling symlinked JSONL still unlinked). Total 13 in nuke-session.test.ts. * fix(nanocoai#100): lstat fast-path slug to defeat symlink swap (Copilot review) The slow-path loop already lstats each slug entry under `projects/` and skips symlinks; the fast path directly resolved `projects/-workspace-group` through `unlinkJsonlInSlug`, which only realpath-checks containment of the JSONL inside the slug — but if the slug itself is a symlink to an attacker-chosen directory, both ends of the realpath check resolve through that same symlink, containment passes, and the unlink lands inside the symlink target. Add an lstat guard at the fast-path call site mirroring the slow-path discipline: real directory → use, symlink → log error and skip, ENOENT → silent fall-through to slow path. New test in `src/nuke-session.test.ts` exercises the attack: a `projects/-workspace-group` → `outside_dir/` symlink with a sentinel `sid.jsonl` inside the target. After the fix, the wipe returns 0 and the sentinel is preserved. Other Copilot concerns on this PR were already addressed: - symlinked JSONL: unlinks the link itself (`isSymbolicLink()` branch) - escape-realpath: covered by the JSONL test - entry-count cap: `MAX_TOTAL_ENTRIES = 100000` in place - bail-on-limit logging: separate branches for `total-entries` vs `dirs-visited` already log the right cap and field * test(nanocoai#100): use fs.mkdtempSync for unique test tempdir (Codex review) The previous tempdir suffix used `crypto.randomBytes(4).toString('hex')`, which the policy reviewer flagged at literal level under `jbaruch/coding-policy: testing-standards` ("no self-generated random test data"). The rule targets assertion inputs, not hermeticity plumbing — but the literal pattern matches. `fs.mkdtempSync` is the standard Node.js API for "unique tempdir per caller": the suffix comes from the OS, not from in-test randomness, so it doesn't trip the rule's literal scan. Test isolation across concurrent vitest workers and crash-rerun pairs is preserved. All 14 tests still pass. --------- Co-authored-by: Baruch Sadogursky <jbaruch@sadogursky.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ARN (nanocoai#76) Production logs (12h window) showed 363 reaction errors and 121 sendMessage errors, dominated by four buckets that the operator cannot action mid-flight: - 97x HttpError "Network request for 'setMessageReaction' failed!" (transport blip on the way to Telegram's edge) - 126x Grammy 400 "429: Too Many Requests: retry after N" (Telegram rate limit — recovers on next agent turn) - 96x sendMessage "not enough rights to send …" (admin removed bot post permission post-registration) - 53x Reaction "message to react not found" with bot- IDs (legacy synthetic ids; jbaruch#50 already covers translation but the regex `/message to react.*not found/i` catches the tail too — pinned in tests so it can't regress) All four are now classified by a shared `_isUnactionableTelegramError(msg)` helper and logged at WARN with a "skipped (transient or unactionable)" / "dropped (transient or unactionable)" message instead of the noisy ERROR. The helper is exported (with `_` prefix) so future call sites (sendFile, sendVoice, pool path) can route through one definition rather than each inlining a regex bouquet that drifts. Genuine unexpected errors (Internal Server Error, ETIMEDOUT, PARSE_ENTITIES_FAILED, chat_id is empty, etc.) still log at ERROR — covered by negative-classification tests so a future over-eager regex broadening can't silence a real bug. Tests: +9 (816 → 825 passing). Each production-log pattern gets a dedicated assertion so a future regex tightening fails specifically on the bucket it broke. The exact production string `message to react not found` (no second "to") is pinned alongside the historical "message to react to not found" form. Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Adds an optional `agentModel` field on `ContainerConfig` so each registered group can pin its own model (e.g. Haiku for noisy chats, Sonnet/Opus for engineering work) instead of sharing the global `AGENT_MODEL` env var. Cost analysis projected ~$10-20/day savings when applied selectively to high-volume groups like `telegram_old-wtf`. Override flow: - `containerConfig.agentModel` is read at spawn time in `buildContainerArgs`. - A new `resolvePerGroupAgentModel` validator falls back to the global default on empty/unknown-prefix values (stricter than the global resolver, which passes typos through with a warn). A bad per-group override never blocks the spawn — the group keeps running on the verified global default. - Spawn-time info log surfaces the override in deploy logs: "Per-group AGENT_MODEL override active". A new IPC handler (`set_agent_model`) lets the agent set or clear the override for a group via the existing IPC bridge. The handler mutates ONLY `containerConfig.agentModel`, preserving every other field (`additionalMounts`, `timeout`, `trusted`, etc.) so an operator setting the model can't accidentally clobber the group's mount allowlist or trust flag. Authorization mirrors `register_group`: main can edit any group, non-main can only edit itself. No DB migration needed — `container_config` is already a JSON column. Test coverage (12 new tests, suite 818 → 818 passing): - `resolvePerGroupAgentModel` pure-function contract (6 cases: undefined, empty, whitespace, valid prefix, trim, bad prefix fallback) - `runContainerAgent` spawn-arg forwarding (4 cases: no override, valid override, bad override falls back, empty override falls back) - `set_agent_model` IPC (6 cases: main sets foreign group, non-main sets self, non-main blocked from foreign group, clear preserves other fields, set preserves other fields, missing groupFolder rejected) Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Operators previously had to grep spawn logs to learn which model a given group's container is running. Add `AGENT_MODEL` to the /status skill's session-context check so the effective model — the per-group override (set via `set_agent_model`, NanoClaw#395) or the orchestrator's global default — is visible in the standard health-check report. Follow-up to nanocoai#77 review feedback (nanocoai#395 / chat_status surface) — the same field is being added to admin-tile chat_status MCP in the private fork.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
AGENT_MODELto the container/statusskill's session-context check so the effective per-group model is visible in routine health checks.AGENT_MODELoverride). Previously operators had to grep orchestrator spawn logs to learn which group runs which model; this surfaces it next to the workspace + tool checks the operator already runs.container/skills/status/SKILL.md— no orchestrator code or container build affected.Why now
Code review on the private NanoClaw#395 / PR#396 (per-group
AGENT_MODELoverride) called out the missing observability surface: the override resolver logs once per spawn but there's no operator-facing readout. The private side is adding the same field to its admin-tilechat_statusMCP tool. This PR is the public-fork counterpart on the equivalent surface that exists here (the/statusSKILL.md).Test plan
npm run build(Node 22) — passesnpx vitest run— 841 passed, 12 skipped/statusin a group whereset_agent_modelwas used to set a non-default model — confirm the report shows the override value.🤖 Generated with Claude Code