Skip to content

feat(observability): surface effective agent model in /status skill#67

Open
ligolnik wants to merge 90 commits into
jbaruch:mainfrom
ligolnik:feat/observability-effective-agent-model
Open

feat(observability): surface effective agent model in /status skill#67
ligolnik wants to merge 90 commits into
jbaruch:mainfrom
ligolnik:feat/observability-effective-agent-model

Conversation

@ligolnik

@ligolnik ligolnik commented May 1, 2026

Copy link
Copy Markdown

Summary

  • Adds AGENT_MODEL to the container /status skill's session-context check so the effective per-group model is visible in routine health checks.
  • Follow-up to NanoClaw#77 (per-group AGENT_MODEL override). Previously operators had to grep orchestrator spawn logs to learn which group runs which model; this surfaces it next to the workspace + tool checks the operator already runs.
  • Markdown-only change to container/skills/status/SKILL.md — no orchestrator code or container build affected.

Why now

Code review on the private NanoClaw#395 / PR#396 (per-group AGENT_MODEL override) called out the missing observability surface: the override resolver logs once per spawn but there's no operator-facing readout. The private side is adding the same field to its admin-tile chat_status MCP tool. This PR is the public-fork counterpart on the equivalent surface that exists here (the /status SKILL.md).

Test plan

  • npm run build (Node 22) — passes
  • npx vitest run — 841 passed, 12 skipped
  • Operator dry-run: trigger /status in a group where set_agent_model was used to set a non-default model — confirm the report shows the override value.

🤖 Generated with Claude Code

Leonid Igolnik and others added 30 commits April 26, 2026 18:22
…I MCP tools

- Stuck-session prevention: track currentAssistant/previousAssistant turns,
  finalize resume point at result-time, skip thinking-only end_turn turns
- Auto-recovery: erroredWithoutProgress flag retries the same prompt with a
  recoveredThisTurn cap, clearing resumeAt
- Expanded debug logs: thinking blocks, tool_use/tool_result + latency, token
  usage, cache hit rate, prompt preview, SOUL hash, rate_limit events, full
  error bodies
- Session-aware IPC path: read from /workspace/ipc/input-${SESSION_NAME}/
  to side-step VirtioFS nested bind-mount staleness on macOS Docker
- ENOENT race fix in drainIpcInput: when readFileSync succeeds but unlinkSync
  finds the file gone (concurrent drain), swallow ENOENT instead of throwing
  to outer catch and discarding the parsed message — root cause of silent
  message loss after a query ends
- New OneCLI MCP server (onecli-mcp-stdio.ts): 7 Calendar tools, 9 Gmail
  read+drafts tools, 8 SmartThings tools (incl. paginated history with
  cursor-based nextPage), wired up when HTTPS_PROXY is set
- IPC send_voice tool added to ipc-mcp-stdio.ts
- google-calendar container skill scaffolded
…ain DM concurrency bypass

- channels/telegram: host-side auto-react 👀 (gated to main+trusted AND
  trigger-matched in groups requiring trigger), transcribeVoice via OneCLI
  curl to OpenAI Whisper, transcript-based trigger inference
  ("LoMBot"/"limlombot"/"lom bot" → prepend @LomBot), sendVoice via
  OpenAI TTS through OneCLI proxy (default voice: alloy)
- channels/telegram-sanitize: TELEGRAM_ALLOWED_TAGS Set + Phase 1b unknown
  tags get htmlEscaped instead of protected — fixes 400 from <deliveryScheduleId>
  and similar agent-emitted XML-ish content
- container-runner: OneCLI HTTPS_PROXY + CA cert env injection (main+trusted
  only, NOT untrusted), NO_PROXY=host.docker.internal,127.0.0.1,localhost
  (NOT api.anthropic.com — so 3rd-party tools route via OneCLI), per-tile
  owner support (TileRef[]), mainOnly skill gate (google-calendar),
  idle-timeout reclassified INFO when hadStreamingOutput
- group-queue: setIsMainGroupResolver + bypass MAX_CONCURRENT_CONTAINERS
  for main DM (so background heartbeats don't queue main behind themselves)
- observer (new): per-query summaries to status group, live thinking stream,
  progress reactions 👀→🤔→⚡→✍, watchdog with 30s blink + 60/120/300s pings,
  noteLatestUserMessage tracking
- ipc: send_voice host handler (OpenAI TTS via OneCLI), send_file path
  resolver extended to /workspace/extra/<mount>/... via additionalMounts
  lookup, Tessl periodic-update path fix
- index/types: sendVoice deps wiring, Channel.sendVoice signature, observer
  init, queue main-resolver
…ice docs, tessl workspace

- groups/global/CLAUDE.md: identity Andy → LoMBot, mcp__onecli__gcal_*
  tool docs, mcp__onecli__gmail_* (read+drafts only) docs, SmartThings
  tool docs, voice reply preference (mcp__nanoclaw__send_voice), workspace
  path note (write files under /workspace/group/, not /tmp/)
- groups/global/CLAUDE-untrusted.md: minor untrusted updates
- groups/main/CLAUDE.md: synced
- tessl.json: add jbaruch/nanoclaw-telegram dependency
- tessl-workspace/: setup with .tessl/RULES.md and .claude/ pointing at
  the installed tile rules
- scripts/claw: convenience wrapper
- agent-runner: stop slicing thinking blocks at 400 chars when logging.
  Whitespace is still collapsed so the entire block stays on one log line
  (downstream parsers split on newlines). Full content is now visible in
  `docker logs` for post-mortem analysis.
- observer: chunkText() splits messages over 3800 chars at the nearest
  whitespace within the last 200 chars before the boundary, so words don't
  get cut. Multi-chunk sends are sequenced so they arrive in order, tagged
  "(1/N)" / "(2/N)" / etc. Failure of one chunk doesn't abort the rest.
Mirror the host-side validator in the MCP tool itself so the agent gets
immediate isError feedback when it tries to send a file from /tmp/ or
any other path the host can't read. Previously the tool returned fake
"File queued for sending" success regardless of path validity, then the
host quietly logged "send_file: path outside allowed mounts" and
dropped the file — no feedback loop, so the bot kept generating
recipes that wrote to /tmp/ and called send_file on them.

Allowed prefixes: /workspace/group/, /workspace/trusted/,
/workspace/extra/. Error message tells the agent exactly where to
write instead. Existence check still runs after the prefix check.
…rver architecture doc

- Anthropic changed `display` default to 'omitted' on Opus 4.7 for
  faster streaming. Without explicit 'summarized', thinking blocks
  arrive with empty content + signature blob; the observer's 🧠 stream
  goes silent, the 🤔 reaction never fires. Older models default to
  'summarized' and accept the explicit value as a no-op, so it's safe
  to leave on across all model versions.
- docs/OBSERVER.md: full architecture writeup — five-layer pipeline,
  key design decisions (thinking display gotcha, stringly-typed wire
  format, chunking, per-source state, watchdog, reaction allowlist),
  enable/disable instructions, and what the system is NOT (not a
  security boundary, not a feedback loop, not a metrics replacement).
…d groups

createFilteredDb was copying chats + messages but not reactions.
check-unanswered.py joins on reactions to skip messages the bot already
👀-reacted to; without the table the join hit "no such table: reactions"
and the whole script aborted, leaving every scheduled-task wake-up in
untrusted containers as a silent no-op.

Now the filtered DB:
- Copies reactions scoped to the same chat_jid filter as messages
  (privacy boundary — untrusted containers still don't see other groups'
  reaction history)
- Falls back to an empty reactions table with the known schema if src
  doesn't have one yet (fresh install before migrations)
- Indexes (message_id, message_chat_jid) to match the host schema

Stale filtered-db dirs need to be wiped to pick up the new schema:
rm -rf data/filtered-db before next container spawn.
…gets escaped

Phase 1b's tag-matcher used [a-zA-Z][a-zA-Z0-9-]* — no underscore in the
character class — so agent-emitted tags like <tool_use_error>,
<delivery_id>, <some_field> fell through unmatched. Telegram rejected
the whole message with 400: "Unsupported start tag tool_use_error".

HTML spec disallows underscores in tag names, but Claude and other
agentic frameworks emit them constantly (Bash tool's blocked-action
errors are wrapped in <tool_use_error>...</tool_use_error>, JSON
dumps with snake_case field names leak as faux-XML, etc). Adding _ to
the character class makes 1b catch them and escape to &lt;...&gt;.
Untrusted containers now get the OneCLI HTTPS proxy + CA cert injected
(previously only main + trusted did). NANOCLAW_TRUST_TIER env var tags
the container's tier, and onecli-mcp-stdio.ts uses it to gate which
MCP tools register.

For untrusted, only `gcal_freebusy` is registered. The freebusy
response shape is just {start, end} time pairs — no titles, attendees,
or event metadata — so participants in untrusted chats can ask the bot
"is Leonid free Thursday afternoon?" without learning what's actually
scheduled.

Calendar-id clamp: untrusted callers can't pass arbitrary calendar IDs
(otherwise they could probe email-address-shaped IDs that the user
happens to have shared access to, leaking who-knows-whom). Server-side
override forces calendarIds = ['primary'] when TRUST_TIER === 'untrusted'.

Single chokepoint: registerTool is wrapped at module init to silently
skip non-allowlisted tools. Adding to the untrusted surface is one
line in UNTRUSTED_ALLOWLIST, not 24 callsite edits.
The bot in untrusted groups had no way to recognize that the OneCLI/
OAuth account belongs to a specific Telegram user. When @ligolnik
asked about "my calendar availability" in old.wtf (untrusted), the
bot defaulted to "asker is a participant, primary calendar belongs
to owner, decline" and refused — even though the asker WAS the owner.

Pinning owner identity in CLAUDE-untrusted.md so the bot can match
sender handles. Match is by @-handle (stable), not display name
(collisions). Other participants still get the "I don't have your
data" treatment, since they genuinely aren't the OAuth account holder.
…ually mounted)

CLAUDE-untrusted.md is never mounted into untrusted containers — only
SOUL-untrusted.md is, masqueraded as /workspace/global/SOUL.md so
core-behavior's "read SOUL.md" works. The owner-identity patch I shipped
to CLAUDE-untrusted.md earlier was invisible to the bot in old.wtf,
which kept defaulting to "owner = Baruch" from upstream tile content.

Moving the explicit owner identity (Leonid / @ligolnik / lim@igolnik.com)
plus the "any Baruch reference in tile rules means Leonid here" override
into SOUL-untrusted.md. CLAUDE-untrusted.md retained for parity if the
mount ever changes — duplication is cheap.

Local vendored tile copies were also scrubbed (not committed — they're
gitignored, managed by tessl install). PRs filed upstream:
jbaruch/nanoclaw-core#16, jbaruch/nanoclaw-trusted#10.
The (isMain || trusted) gate on host-side 👀 was previously commented
as a "noise" filter — keeping the bot from acking chitchat. The actual
reason is more important: in untrusted contexts the agent's
bad-actor-disengage rule may decide to go silent (no text AND no
reaction) when probing/social-engineering is detected. A host-side
👀 fires before the agent ever reads the message, so it leaks "I see
you" to hostile probers regardless of what the agent decides.

Comment updated to record the actual architectural reasoning so the
gate doesn't get "simplified" away in a future cleanup. No code change.
Race: pollIpcDuringQuery drains a file (deletes from disk + adds to
consumedInputFiles) and pushes content to MessageStream. If this
fires AFTER the SDK has emitted Result and agent-runner has broken
out of the for-await loop over responses, but BEFORE runQuery returns
and ipcPolling flips to false, the file is gone but the stream is
no longer being read. Message lost. Next waitForIpcMessage skips the
file (already in consumedInputFiles or unlinked). Bot never sees the
user message. Heartbeat's check-unanswered eventually rescues it,
but with up to 15-min latency.

Hit by msg 1289 in old.wtf at 22:31:40 today — pipe landed during
the closing seconds of a query for msg 1288. Zero "Piping IPC message
into active query" log lines confirmed the drain happened mid-window
without delivering. Bot only answered 14m17s later when the cron
heartbeat caught it.

Fix: pollIpcDuringQuery now only checks for the _close sentinel.
File draining is exclusive to waitForIpcMessage between queries.
Throughput unchanged — SDK conversations are turn-based, mid-turn
pushes weren't actually being processed mid-turn anyway, just
queued for after the current turn. Now they queue at the file
system layer instead, with deterministic delivery.
…iner

A _close sentinel is single-shot signaling — orchestrator writes it,
agent-runner reads + unlinks it, container exits. But if the container
crashes / is killed / hits a graceful shutdown that doesn't make it
through the unlink path, the file lingers. Next container spawn for
the same group/session has its first IPC poll consume the stale
sentinel and end the input stream. The SDK still finishes the
in-flight prompt (it was pushed before the close fired), but the
container exits immediately after — every subsequent user message
pays a fresh spawn cost (~10-30s).

Hit by old.wtf at 22:54 today: msg 1296 took 121s wall time. Container
log shows "Close sentinel detected during query, ending stream" at
500ms in (one IPC_POLL_MS), then the SDK ran to Result, then
"Close sentinel consumed during query, exiting". Stale file from an
earlier shutdown was the trigger. Found 6 stale _close files across
data/ipc/*/input-*/ at the time, oldest from 14:17.

Fix: clean the sentinel on the host before each container spawn, in
the same buildVolumeMounts setup block that creates the IPC dirs.
The agent-runner's existing shouldClose() also unlinks on detection,
which handles the case where orchestrator writes _close while the
container is still alive — that path stays, but stale-file recovery
no longer depends on it.
…g' pings

The watchdog armed on every Query input: line for a source folder.
Scheduled tasks (SmartThings refresh, check-unanswered heartbeats,
nightly-housekeeping) run in maintenance containers but share the
source folder with the user's default container. Long cron tasks
crossing 120s fired "Still working — 120s in" into the user's chat —
looking like the user's last message was still being processed when
in fact the user's query had finished minutes ago and a separate cron
was running.

Hit by main DM at 23:02 today: home_status query took 20.1s (finished
at 22:59:24), then SmartThings refresh cron started at 23:00:55 with
wall=150s. Watchdog crossed 120s at ~23:02:55 and pinged the user's
chat. Bot then hallucinated the explanation that home_status had been
slow.

Fix: detect scheduled-task input prefix (`[SCHEDULED TASK -` or
`<untrusted-input source="...">[SCHEDULED TASK -`) in the Query input
preview and skip startWatchdog. Cron tasks run silently as before;
user-driven queries still get the watchdog ping behavior they need.
…fired

The SDK Result message carries the agent's terminal assistant text. The
orchestrator's onOutput callback in src/index.ts treats that as a
user-facing reply and sends it to the chat. The design intent (upstream
unchanged) is that the agent does ONE of:

1. Answer via the SDK's final text (no tool call) — that text is the reply
2. Answer via mcp__nanoclaw__send_message AND wrap any closing thought
   in <internal>…</internal> — strip code in onOutput removes it, no
   echo

jbaruch#2 depends on the LLM reliably wrapping closing summaries. Opus 4.7
sometimes emits them as plain text instead, so users see two messages:
the explicit send_message reply, then a "Confirmed." / "Explained ..."
summary. Reproducible — happens on most multi-tool turns.

Fix: track whether send_message / send_voice / send_file fired during
the query. If so, set result=null in the writeOutput for the SDK
Result. The orchestrator's `if (result.result)` gate skips. The agent's
closing thought stays in `docker logs` for debug but doesn't leak to
chat. Upstream benefits from this too — the bug exists there but
manifests less because their prompt corpus apparently keeps the LLM
disciplined about <internal> wrapping; defense in depth is cheaper
than relying on prompt compliance.
Adds two sections to groups/global/CLAUDE.md (loaded by main + trusted
containers; untrusted is unaffected since it doesn't mount global/):

1. **Your tools live here** — directory of the python/js tools the
   bot has built for itself (smartthings_history.py, home_status.py,
   presence_chart.py, temperature_chart.py, tv_chart.py, methodology/)
   plus the *_notes.md files. Mirrors what already exists in the
   main group's local CLAUDE.md but at the global level so other
   trusted contexts know what's available.

2. **Git — committing and pushing** — explicit allowlist:
   - Allowed: commit + push to ligolnik/* repos; open PRs against
     ligolnik/* repos. The main group's workspace is a git repo whose
     origin is ligolnik/lombot — commit there freely.
   - With permission only: third-party repos (jbaruch/*, qwibitai/*,
     anyone else). Owner must explicitly say "open a PR upstream to X".
   - Never: force-push to main, direct push to a third-party repo,
     `gh pr create` without explicit --repo flag.
   - Auth: OneCLI proxy injects token on api.github.com, no manual
     PAT handling. Includes the curl recipe for PR creation via REST.
Adds an optional chat_jid parameter to the mcp__nanoclaw__send_message
MCP tool. When set, the IPC payload targets that chat instead of the
container's own. Authorization is unchanged — host-side ipc.ts:457-461
already enforces "main can target any chat, trusted/untrusted only
their own". The MCP tool just plumbs the param through.

Why: the bot in main DM had built bot_post.py as a parallel direct-
Telegram-API path because send_message was hard-coded to its own chat.
But bot_post.py wrote messages to Telegram without recording them in
messages.db, so the agent in the target chat (and the cron heartbeat)
never saw them — leading to "I can't find this message" hallucinations
when other bots quote-replied to those out-of-band sends.

Routing through send_message instead means the orchestrator's existing
ipc.ts handler does both Telegram delivery AND the storeMessage call —
single code path, automatic DB record, no parallel implementation drift.

groups/global/CLAUDE.md updated with the new pattern + a "use this
instead of out-of-band senders" note. bot_post.py is being removed
from the main group's workspace as a follow-up.
Once-tasks were marked status='completed' but never deleted, so getAllTasks()
(and the snapshot it feeds into list_tasks) returned them indefinitely. Over
weeks the list grows unboundedly with stale entries the user can no longer act on.

The fix adds pruneCompletedTasks(maxAgeMs) in db.ts and calls it once per
scheduler loop iteration with a 24h TTL. The grace window keeps recently-finished
tasks visible in list_tasks for inspection. cancel_task is unaffected — it still
deletes immediately via deleteTask, so the cancellation workflow doesn't change.

Recurring tasks are inherently excluded (computeNextRun never returns null for
them, so they never reach status='completed'); the schedule_type='once' clause
in the prune query is defensive.

Fixes jbaruch#1.
The bot opened lombot PRs jbaruch#4 and jbaruch#5 today AND simultaneously direct-
pushed the same content to main, leaving the PRs stranded. ~1,134
lines of drive_planner code landed without owner review. Updates the
git rule in groups/global/CLAUDE.md to be explicit:

- "Push to main directly" — never. Even on owner's own repos. Branch
  + PR + stop.
- "Merge any PR" — never. Filing is the bot's output; merging is the
  owner's call.
- Adds a 5-step PR workflow recipe: branch, push, open PR via REST,
  STOP. Tells the bot to surface the PR URL and wait, explicitly
  forbids the "self-review-and-merge" pattern.

Tracking issue filed separately for technical enforcement (branch
protection / separate git identity for the bot) so the rule isn't
the only line of defense.
SmartThings changed PAT policy 2026-04-26 — tokens are now 24-hour-only,
no programmatic re-issuance. This adds a host-side watchdog that probes
/v1/locations every 30 min through OneCLI; on 401 it pings the owner's
main DM (chat_id 114893642) with the rotation procedure (deduped to
once per 4h per failure spell). On 200 after a 401 streak it sends
"✅ access restored". Probe is launchd-managed via
com.nanoclaw.smartthings-watchdog.plist.

Files:
- scripts/smartthings-watchdog.sh: bash runner; uses onecli run -- curl
  for the probe (auto-injects PAT + CA), reads TELEGRAM_BOT_TOKEN from
  .env, persists state at data/smartthings-watchdog-state.json
- scripts/install-smartthings-watchdog.sh: substitutes {{PROJECT_ROOT}}
  and {{HOME}} in the plist, copies to ~/Library/LaunchAgents, loads
- launchd/com.nanoclaw.smartthings-watchdog.plist: KeepAlive=false,
  StartInterval=1800

Also tessl: bump nanoclaw-trusted 0.1.48→0.1.49.

Cron change (every 15 min → every hour, --stale-hours 4, 401 short-
circuit) was applied directly to scheduled_tasks via IPC by the
background subagent — host DB state, not in this commit.
Sync with upstream after our PR jbaruch#2 (scheduler prune) merged. Brings in
4 follow-up scheduler refinements that build on our prune fix, plus
unrelated upstream features.

## Conflict resolutions (4 files)

- src/db.ts → take upstream. Strict superset of our pruneCompletedTasks:
  uses COALESCE(last_run, created_at) so NULL-last_run orphans (from
  failed dispatch paths) get pruned by created_at age, not lingering
  forever. Addresses our own ligolnik/nanoclaw-public#5.
- src/task-scheduler.ts → take upstream. Adds getCompletedTaskTtlMs()
  env override, PRUNE_INTERVAL_MS=1h throttle, dormant-cron warnings
  with per-task cooldown. All on top of our PR jbaruch#2 foundation.
- src/task-scheduler.test.ts → take upstream. New tests for the
  refinements; our 3 prune tests are still in there.
- src/container-runner.ts → keep ours. Our OneCLI proxy injection,
  reactions filtered DB, stale _close cleanup, trust-tier env, per-tile
  owner support, additional-mounts authz are all forks-specific. Single
  conflicting hunk (env-file COMPOSIO forwarding) resolved in favor of
  our COMPOSIO removal — the env-file infrastructure (SECRET_CONTAINER_VARS,
  buildSecretEnvFile) stays from the auto-merged sections so future
  per-container secrets can use it.

## What's coming in from upstream (16 + 3-of-our-prune commits, paraphrased)

Container runtime:
- a253c16 secrets via 0600 env-file, not -e (ps-leak fix)
- e690b55 cleanup tempfile on write failure
Scheduler:
- 1c5d69c set-based deletes in pruneCompletedTasks (jbaruch#32)
- a507f68 NULL-last_run orphan handling, configurable TTL, throttled
  cadence, dormant-cron detection
- d2947cb Copilot follow-ups on prune housekeeping (jbaruch#36)
- 9ba3b0f apply COALESCE age threshold to dormant-recurring query (jbaruch#38)
- f749296 plumb continuation_cycle_id env var on scheduled-task spawns
- 71264b1 test(group-queue): parallel-session isolation under continuation chains
Telegram emoji:
- 86edff1 normalize Slack shortcodes to Unicode for reactions
- c2b1721 proto-pollution guard + audit-friendly VS16 regex
- c65f052 actually use ️ escape, fix proto-test comment
Style/test housekeeping:
- b144149 prettier-format db-migration.test.ts
- b9c3b2c clean up tempDir in continuation_cycle_id migration test

## Test debt — addressed in PR review

`npx vitest run`: 500 pass, 135 fail, 12 skip. Most failures are
pre-existing divergences between our fork and upstream's test
assumptions (selectTiles signature is TileRef[] vs string[], our
container-runner.security.test.ts expects our env-file path, in-memory
DB initialization). NEW breakage from this merge: zero — build is
clean, our runtime behavior is preserved, the fix is in the test
shape, not the code shape. Resolution plan documented in the PR body.
…ocks

After the upstream merge, two test files needed shape updates to match
fork-specific code:

- container-runner.test.ts (selectTiles, 5 tests) — our selectTiles
  returns TileRef[] (per-tile owner support) not string[]. Added a
  `names()` helper that maps to .name so the assertions stay readable.
  Also expanded "main + trusted" expectations to include the
  fork-local `flight-weather-watch` tile that's appended on those tiers.

- channels/telegram.test.ts (config mock) — our auto-react gate calls
  getTriggerPattern() and the channel imports GROUPS_DIR for voice
  transcription. The test's vi.mock('../config.js') was missing both
  exports, causing 13 telegram tests to fail with "No 'getTriggerPattern'
  export is defined on the mock". Added stubs (default trigger pattern,
  /tmp/test-groups for GROUPS_DIR) — tests don't exercise either path
  but the imports resolve at module load.

After fixes:
  Test Files  26 passed (26)
       Tests  635 passed | 12 skipped (647)

Started this run at 135 failures (mostly the better-sqlite3 ABI
mismatch that npm rebuild fixed); the remaining 19 were these two
shape divergences. Build still clean.
merge: sync with jbaruch/nanoclaw-public — 19 upstream commits + scheduler refinements built on our PR jbaruch#2
PR jbaruch#46 (sanitizer allowlist):
- Drop 'br' from TELEGRAM_ALLOWED_TAGS (Bot API HTML mode doesn't
  support it; agents emit \n, not <br>)
- Add Phase 1b unknown-tag describe block with 8 tests covering
  agent-emitted markers, JSON-shape leakage, mixed-case allowed tags,
  self-closing/attribute-bearing unknown tags, and the
  unknown-inside-fenced-code regression

PR jbaruch#47 (final-text suppression):
- Track tool_use ids of user-facing send calls in a Set, then watch
  for matching tool_result blocks and only flip the suppression flag
  when is_error !== true. A hook-denied or errored send_message no
  longer leaves the user staring at silence.
ligolnik and others added 30 commits April 28, 2026 20:39
…jbaruch#33)

Closes the lying-telemetry gap where script-gated tasks (e.g.
heartbeat-telegram_old-wtf) wrote status='success' even when the
gate script's payload signalled a hard failure. The 96 wasted
wake-ups/day on the untrusted-tier heartbeat happened because a
'DB access failed' run was indistinguishable in run logs from a
real success.

The gate script's exit code is the source of truth (per the
check-unanswered.py docstring contract: exit 1 = hard failure),
but by the time the agent has dutifully wrapped the script's
degraded payload into an <internal>...</internal> reasoning block,
the orchestrator only sees the agent's 'success' status. This patch
adds a stable-marker pattern check ('DB access failed', 'unable to
open', 'env-warning:') against the result payload right before
logAndUpdateTask writes the row — a positive match reclassifies
the run as 'error' so it shows up in 'broken tasks' queries and
counts against telemetry honestly. Result remains captured in the
result/error fields for forensics.

Fix A from the issue (restoring store read access for untrusted
containers) is deliberately deferred to a follow-up — fixing the
classification first stops the wasted spend; restoring the mount
restores the safety-net coverage.

Closes jbaruch#26

Co-authored-by: Leonid Igolnik <lim@Leonids-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nds from main (jbaruch#34)

Brings send_file and send_voice to parity with send_message: main may target
any registered chat with chat_jid; trusted/untrusted ignore the param and
fall back to the current chat (same authz gate as send_message in
src/ipc.ts:457). Cross-chat sends are recorded in messages.db so downstream
agents in the target chat see them.

Closes jbaruch#25

Co-authored-by: Leonid Igolnik <lim@Leonids-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y lost tasks (jbaruch#35)

Closes the silent-failure class observed tonight where a wedged
maintenance container blocked every scheduled task in a group for
80+ minutes — once-tasks pre-advanced to status='completed' but
never ran, cron tasks had next_run rolled forward without firing,
resurrectZombieTasks couldn't recover them because of its (correct)
next_run > now storm-prevention guard.

Three layers of fix:

1. Reject past schedule_value at schedule_task creation time — the
   highest-signal user-facing case (caller knows immediately their
   typo didn't schedule anything). src/ipc.ts adds a Date.now()
   guard in the once-branch with a structured warn log carrying
   parsedAt, nowAt, and source/target context.

2. Periodic sweep of group-queue pendingTasks: any task waiting
   longer than DISPATCH_DROP_THRESHOLD_MS (30 min, matches
   CONTAINER_TIMEOUT) gets dropped and recorded as status='error'
   in task_run_logs with a 'maintenance slot wedged' result.
   Surfaces the lost dispatch instead of leaving the row in the
   silent dropped-dispatch shape. Wired via setLogTaskRunFn from
   src/index.ts at orchestrator startup; tests inject a spy
   through _startDispatchSweepForTests.

3. Per-task watchdog in runTask kills the container after
   TASK_RUN_TIMEOUT_MS (default 30 min, overridable via
   NANOCLAW_TASK_RUN_TIMEOUT_MS) to prevent slot wedging in the
   first place. Records status='timeout' (new enum member on
   TaskRunLog.status) so the failure is visible and queryable
   separately from agent/script errors. Uses Promise.race so the
   await unblocks even when runContainerAgent never resolves.

Stale 2025-06-01 fixtures in ipc-auth.test.ts bumped to 2099-01-01
so they survive the new past-rejection guard.

Closes jbaruch#30

Co-authored-by: Leonid Igolnik <lim@Leonids-MacBook-Pro.local>
… via config (jbaruch#36)

Closes jbaruch#29. Cherry-picked from upstream jbaruch/nanoclaw#253 (already merged
upstream): the SDK's auto-compact was firing at 165k on a 1M context window
because a CLAUDE_CODE_AUTO_COMPACT_WINDOW='165000' hardcode in agent-runner
overrode the SDK's natural 1M-aware default. Move the value to an
orchestrator-side env-var injection so the orchestrator's resolved
AGENT_AUTO_COMPACT_WINDOW config (default 800k) flows through to the
SDK, leaving ~200k of compaction headroom on Opus's 1M window.

We do NOT adopt the upstream PR's ENABLE_THRESHOLD_NUKE gating — that
flag and the threshold-nuke infrastructure don't exist in this fork.
The forward is unconditional. If threshold-nuke lands here later, this
becomes a one-line gate-add.

Closes jbaruch#29

Co-authored-by: Leonid Igolnik <lim@Leonids-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…es a secret (jbaruch#37)

The SECRET_CONTAINER_VARS env-file forward (PR jbaruch#32) reads process.env to
collect secret values to inject into containers. But the orchestrator
runs under launchd, which doesn't auto-load .env into process.env, and
no dotenv.config() call exists in dist/. Result: GITHUB_TOKEN sits in
.env but never reaches a container — the SECRET_CONTAINER_VARS path
silently produces an empty env-file and gets skipped.

Match src/config.ts's established pattern: prefer process.env, fall
back to a readEnvFile() lookup of the same key from .env. Single-call
overhead per buildContainerArgs invocation; .env is small.

Effective on orchestrator restart.

Co-authored-by: Leonid Igolnik <lim@Leonids-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that GITHUB_TOKEN is env-injected via SECRET_CONTAINER_VARS (PR jbaruch#32
+ jbaruch#37), gh picks it up automatically — main/trusted containers can use
gh directly without curl + REST gymnastics. Untrusted tier still gets
no token, so gh fails closed with a clear auth error rather than
silently using a stale identity.

Adds GitHub's official apt repo with their signing key and installs
gh via apt-get. Layer placed after the existing system-deps install
so npm/COPY layers further down keep their cache.

Co-authored-by: Leonid Igolnik <lim@Leonids-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… mounts can open it (closes jbaruch#43) (jbaruch#46)

The per-group filtered DB copy created by createFilteredDb is mounted
read-only into untrusted containers as a single-reader snapshot. SQLite
WAL mode requires writable -wal and -shm sidecar files even for
read-only opens, so under a :ro mount any reader that opens the DB
read-write (e.g. Python sqlite3.connect() default) fails with
"unable to open database file".

This silently broke the maintenance heartbeat for untrusted groups —
check-unanswered.py runs inside the :ro mount and could not open the
filtered DB, so the recovery path that detects and replies to
unanswered messages never ran.

The filtered copy is a one-shot snapshot: created fresh on every
container spawn, single reader, no writers, no concurrency — none of
WAL's tradeoffs apply. DELETE (the SQLite default) writes no sidecars
and opens cleanly under :ro.

The orchestrator's main store/messages.db is a separate file with
concurrent writers and stays WAL.

Updates the existing pragma regression test to assert DELETE.

Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain>
…ore 1000 (closes jbaruch#44) (jbaruch#45)

On bare-metal macOS hosts the running user is typically uid 501, gid 20.
With HOST_UID / HOST_GID unset (the default), config.ts resolved them
to `undefined`, and every call site in container-runner.ts collapsed
that to `?? 1000`. The orchestrator then tried to chown files it owns
(501 -> 1000) as a non-root user, which EPERMs on every container
spawn and fills nanoclaw.error.log with "Failed to chown ..." warnings
that are indistinguishable from real chown failures.

Insert the host process's own uid/gid as the second link in the
resolution chain:

    HOST_UID env -> process.getuid?.() -> call-site `?? 1000`

DooD deployments still set HOST_UID / HOST_GID explicitly (because
inside a container process.getuid() returns the container's uid, not
the host's), so they are unaffected. Bare-metal macOS / Linux hosts
now resolve to the real host uid -- chown becomes a no-op (own -> own)
and stops firing EPERM. Windows, where process.getuid is undefined,
still falls through to the call-site `?? 1000`.

Call sites in src/container-runner.ts keep their `?? 1000` last-resort
fallback unchanged.

Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain>
… growth (jbaruch#48)

* fix(ipc): persist consumed input list and add host GC to bound input/ growth (closes jbaruch#47)

Untrusted-tier containers mount data/ipc/<group>/input/ read-only as a
security boundary, so the agent's unlinkSync swallows EROFS and consumed
files stay on disk. Combined with an in-memory-only consumedInputFiles
Set, every container restart re-drained every file ever written for the
group. One wtf group hit 1,586 stale files re-drained as one prompt
(846k chars) which crossed the 800k auto-compact window and poisoned
the resulting session. Trusted groups have RW mounts and don't hit this.

Two halves, complementary:

Agent side (container/agent-runner/src/index.ts):
  - Append consumed basenames to messages/_consumed_inputs.log on each
    successful drain. messages/ is RW even on untrusted containers.
  - Replay the log into consumedInputFiles on agent startup so a fresh
    container skips the lifetime backlog.
  - Append/load tolerate EROFS/EACCES/ENOENT defensively.
  - drainIpcInput / loadConsumedInputs now accept optional path
    overrides so tests can drive them against a tmp dir.

Host side (src/ipc-gc.ts):
  - runIpcGc atomically renames _consumed_inputs.log to .processing,
    deletes each listed basename from input-default/ and
    input-maintenance/, then drops the .processing file.
  - Strict basename allowlist (/^[A-Za-z0-9_.-]+\.json$/) plus explicit
    refusal of /, \\, .. — defends against a compromised agent writing
    a path-traversal entry to the log.
  - Crash recovery: leftover .processing file is picked up as-is on
    next call, no re-rename.
  - Wired into src/index.ts: one startup pass per registered group,
    then a 60s setInterval. runIpcGcSafe wraps errors so a transient
    FS hiccup doesn't tear down the orchestrator.

Tests:
  - src/ipc-gc.test.ts: 8 cases covering missing log, single/multi
    delete, ENOENT-tolerant, traversal rejection, allowlist
    rejection, crash-recovery, dedupe.
  - container/agent-runner/src/ipc-consumed-log.test.ts: 6 cases
    covering append-on-drain, no-op when nothing consumed, replay
    on startup, missing-log tolerance, end-to-end restart-skip,
    incremental append.

Out of scope (per issue): one-shot rm of the existing 1,607 stale
files in telegram_old-wtf/input-default/ — handled manually before
this lands. Mount permissions and the EROFS-tolerant unlinkSync stay
as-is; they're the security boundary this PR works around.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style: apply prettier formatting

Pre-commit hook reflowed two array literals and a type assertion to
fit the line-length config. No semantic change — tests pass identically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…loses jbaruch#49) (jbaruch#54)

Was 1536m/2048m. Triggered 119 cgroup-OOM SIGKILLs (exit 137) in 12h
across main groups, with sessions running 100-900s before dying. Node +
SDK + MCP procs + skill subagent prompt + accumulated cache_read climbed
past 1.5GB on heavy main sessions.

Cheap fix that addresses the symptom; root cause is unbounded session
cache growth across resumed turns, which is a separate question for a
future PR (cap resume depth, or trigger compaction in bytes not tokens).

Closes jbaruch#49

Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain>
…oses jbaruch#52) (jbaruch#53)

When a `scheduled_tasks` row references a `group_folder` that is no
longer in `registered_groups`, the dispatch handler used to ERROR-log
"Group not found for task" and leave the row in place — so the
scheduler dispatched the same orphan every poll interval forever.
This produced 8+ ERROR entries/day per orphaned heartbeat in jbaruch#52.

This commit changes the dispatch-time handler to delete the orphaned
task and emit a debug log instead of an error. Rationale:

- A missing group at dispatch is expected during the brief window
  between a group being deregistered and its tasks being cleaned up.
  It is not actionable, so the ERROR log was just noise.
- The task is permanently orphaned — no future poll tick will resolve
  the missing group. Deleting it self-heals the situation.
- Pre-existing orphans (e.g. the two `heartbeat-telegram_iff-lom-bot-test*`
  rows from jbaruch#52) are dropped on the next dispatch tick. This subsumes
  Option A's manual cleanup migration: deploy this fix and the orphans
  vanish on their own. No `DELETE FROM scheduled_tasks` script needed.

Note on Gap fix jbaruch#1 (delete tasks on deregister): no programmatic
group-deregister code path exists in this repo today — groups can only
be added via setup/register.ts; orphans arise from manual SQL or
registry edits. The dispatch-time self-heal is therefore a complete
solution. If a deregister helper is added later, a sibling
`deleteTasksForGroup(folder)` call should be added there for
defence-in-depth, but it is not required to fix jbaruch#52.

Includes a regression test that creates an orphan heartbeat with the
same shape as the jbaruch#52 repro, confirms it is deleted on the first
dispatch tick, confirms no ERROR-level log fires, and confirms a
sibling task on a different (non-orphan) group is NOT touched.

Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed emojis (jbaruch#55)

* fix(telegram): translate bot-emitted IDs for reactions and log rejected emojis

Closes jbaruch#50, closes jbaruch#51.

## jbaruch#50 — bot-id reaction translation (Path A)

Pre-fix: outbound bot messages were stored in messages.db with a
synthetic `bot-<ts>-<rand>` primary key. When something later asked
the Telegram channel to react to one of those rows, sendReaction
shipped that local id straight to Telegram's setMessageReaction,
which rejected it with HTTP 400 (`message to react to not found`).
145 ERROR-level entries in 12h.

Fix: store the platform-native numeric `message_id` returned by
Telegram's sendMessage as the row's primary key when available,
falling back to the legacy `bot-<ts>-<rand>` only for paths that
don't surface a numeric id (currently sendPoolMessage which returns
boolean, plus non-Telegram channels). sendReaction now:
- passes numeric ids straight through (e.g. when an inbound message
  is being reacted to);
- for legacy `bot-...` ids, looks the row up via getMessageById and
  reuses the row's primary key when it's numeric;
- skips with a WARN (no Telegram call) when neither path yields a
  numeric id, so we stop burning API calls on doomed requests.

Picked Path A over Path B (separate `telegram_message_id` column +
schema migration) because the audit found nothing in the codebase
parses the `bot-` prefix — `is_from_me=1` + `is_bot_message=1` are
the canonical "bot row" markers, both still set unchanged. Single
source of truth, no migration needed.

Also downgrades known-unactionable Telegram 400s ("message to react
to not found", "message to be replied not found", MESSAGE_ID_INVALID,
chat not found, bot was blocked) from ERROR to WARN — these are the
small fraction of failures that genuinely can't be retried, called
out as out-of-scope but easy in jbaruch#50.

## jbaruch#51 — log rejected emoji

The "Invalid Telegram reaction emoji, falling back to 👍" warn now
ships `rejectedEmoji` (post-normalization form Telegram refused) and
`originalInput` (pre-normalization, what the agent emitted) so we
can extend the normalizer when patterns appear in logs.

## Audit findings

Searched src/ for `bot-` literal usage and `startsWith('bot-')`:
nothing in code parses the prefix — all four call sites that mint
`bot-<ts>-<rand>` ids are write-only (storeMessage). The handful of
matches in comments are explanatory. Safe to switch the id source.

* style(telegram): apply prettier line-wrap left over from previous commit

---------

Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain>
… kill (jbaruch#40) (jbaruch#41)

PR jbaruch#35's dispatch-loss + per-task watchdogs catch wedged containers and
record the symptom in task_run_logs. But the only evidence captured is
"slot wedged for Xmin" — no insight into what was actually stuck.

Adds a 5s-bounded best-effort diagnostic capture that runs immediately
before either watchdog kills/drops anything. Captures docker inspect
fields, ps -ef, /proc/<pid>/wchan (the load-bearing diagnostic — kernel
wait channel tells us what syscall the agent-runner is blocked on),
open TCP connections, last 100 docker log lines, and the triggering
task context. Writes atomically to data/wedge-diagnostics/<iso>-<name>.txt.

Each diagnostic command is bounded to 1s; the entire capture to 5s.
A hung command (likely on a fully wedged container) leaves a placeholder
and capture continues. The file's path is appended to the result field
of the task_run_logs row so investigators can find the evidence.

Refs jbaruch#40

Co-authored-by: Leonid Igolnik <lim@Leonids-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on (jbaruch#63)

The Query done summary already logs tokens_in/out, cache_read/create,
and wall-time, and the container name attributes the line to a group.
Adding the model (opus[1m]/sonnet/haiku) closes the last gap so
docker logs alone are sufficient to compute per-group, per-model
cost without touching the DB.

Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…aruch#64)

* feat(container-runner): per-spawn blocklist for maintenance-class spawns (nanocoai#337) (nanocoai#344)

* feat(container-runner): per-spawn blocklist for maintenance-class spawns (nanocoai#337)

The orchestrator copies every installed tile's rules and skills into
every container at spawn time, contributing to the system-prompt
`cache_create` cost on the first turn of every fresh session. For
non-conversational task classes (heartbeat, nightly, weekly), most of
that content is dead weight.

Add `MAINTENANCE_RULE_BLOCKLIST` and `MAINTENANCE_SKILL_BLOCKLIST`
env-driven sets that the install-into-container loop skips when
`sessionName === MAINTENANCE_SESSION_NAME`. The filter applies to:
- per-tile rules (`rulesContent.push` skipped + cpSync skipped)
- per-tile skills (both the `.tessl/tiles/...` dst and the flat
  `tessl__<name>` skill copy skipped)
- built-in `container/skills/` skills (bare-name match)
- AyeAye-created staging skills (bare-name match)

Filtered names emit a single `install_blocklist_filtered` log line per
spawn (silent when nothing was filtered). Default-session spawns
bypass the filter entirely — empty / unset blocklists are also a
no-op, so this change is regression-safe with zero env-config.

Tests cover regression (default-session unfiltered with non-empty
blocklists), the maintenance-only fire path for both rules and
skills, and the aggregated log emission. All 1494 existing tests
continue to pass.

Refs nanocoai#337, nanocoai#104.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(nanocoai#337): break import cycle and wire envConfig keys (PR nanocoai#344 review)

Two fixes from the Copilot review:

1. **Circular dependency.** `container-runner.ts` imported
   `MAINTENANCE_SESSION_NAME` from `group-queue.ts`, but
   `group-queue.ts` already imports `DEFAULT_SESSION_NAME` /
   `sessionInputDirName` from `container-runner.ts`. Moves the
   `MAINTENANCE_SESSION_NAME` definition to `container-runner.ts`
   alongside `DEFAULT_SESSION_NAME` (where the session-name family
   already lives) and re-exports it from `group-queue.ts` so callers
   that previously imported from there keep working unchanged.

2. **`envConfig.MAINTENANCE_RULE_BLOCKLIST` was always undefined**
   because the keys weren't passed to `readEnvFile([...])` at the top
   of `config.ts`. Added them so the `.env`-file fallback actually
   resolves; previously the code would only read from `process.env`
   and silently ignore values placed in the project's `.env` file.

Updates `task-scheduler.test.ts` to add `MAINTENANCE_SESSION_NAME` to
its `vi.mock('./container-runner.js')` payload — the mock previously
only exported `DEFAULT_SESSION_NAME` from container-runner, so the
re-export chain via `group-queue.ts` resolved to undefined and broke
`scheduled task ignores any cached maintenance sessionId` (nanocoai#193).

`npx tsc --noEmit` clean. `npx vitest run` green (1463/1463).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(task-scheduler): per-task SDK session reuse for recurring fires

Recurring scheduled tasks (cron / interval) now keep their own
session_id across fires so the API can cache the per-session
message-history prefix even though the prompt-cache TTL (5 min)
expires between heartbeat fires (15-30 min cadence). Each fire's
cache_create becomes incremental rather than full-prefix.

New nullable scheduled_tasks.session_id column populated by runTask
on first fire, passed as resume: on every subsequent fire. One-shot
tasks (schedule_type === 'once') stay fresh-per-fire — out of scope.

The nanocoai#193 cross-task bleed concern doesn't apply: persistence is keyed
on task_id, so two distinct tasks land in two distinct rows hence two
distinct SDK sessions — no slot-cache aliasing possible. The dead
slot-cache write that previously happened (sessions[group][maintenance])
is dropped — it was never read back.

nuke_session IPC clears session_id for every scheduled task in the
affected group when the maintenance / 'all' slot is wiped — without
this the next fire would resume an id whose JSONL is gone.

Hand-ported from jbaruch/nanoclaw#357 (the upstream PR pulls in 4-5
unrelated transitive deps that aren't in our public tree). Closes jbaruch#59.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Baruch Sadogursky <jbaruch@sadogursky.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain>
…tion

The new jbaruch#59 test annotated MAIN_GROUP with `RegisteredGroup`, but the
type isn't imported in this test file (other tests use plain object
literals without the annotation). `npm run build` (full tsc, not the
test-mode `--noEmit`) caught the missing import.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…aruch#65)

* feat(container-runner): add /workspace/state writable mount

Ports jbaruch/nanoclaw#220 (Cat 4 of nanocoai#99 audit). Adds a per-group
canonical writable mount at /workspace/state — sourced from
data/state/<folder>/ and mounted into every container regardless of
trust tier. Skills that need to persist state across runs now have a
single canonical writable location.

Per-group scoping (not per-session) so a scheduled task and a
user-facing turn in the same group can read each other's state.

CHANGELOG and docs/SPEC.md updated; container-runner.test.ts gains
4 new tests (admin/trusted/untrusted writable + per-group host path).

* feat(index): wire heartbeat to use precheck-gate script

Sets the `script` field on the auto-created non-main heartbeat task to
the unanswered-precheck Python script. The agent-runner's existing
runScript / wakeAgent mechanism gates SDK invocation on the precheck's
`{"wakeAgent": <bool>}` output — when there are no new candidates the
container exits without spending tokens. When the precheck reports new
work or fails it falls through to the original prompt unchanged.

Adds a one-shot startup migration (`backfillHeartbeatPrecheckScript`)
so already-deployed installs pick up the gate without manual DB edits.
Idempotent and only touches non-main heartbeat-* rows whose prompt
mentions check-unanswered, leaving the main-group maintenance heartbeat
(which uses the tessl__heartbeat skill, not check-unanswered) alone.

Closes part of jbaruch#62 (heartbeat container spawns running 0 queries).

---------

Co-authored-by: Baruch Sadogursky <jbaruch@sadogursky.com>
Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain>
…orts jbaruch/nanoclaw#257) (jbaruch#67)

* fix(container-runner): skip non-files when copying skill scripts

`tmpScriptsDir` (published as `/workspace/group/scripts/<name>`) is a
flat dir of executables, but the per-tile loop in `buildVolumeMounts`
called `fs.cpSync` without `{ recursive: true }` on every entry under
`<skill>/scripts/`. As soon as anything created a subdir there — most
realistically Python writing `__pycache__/` next to a `.py` script that
got imported during a previous container run — the next spawn crashed
with `Recursive option not enabled, cannot copy a directory` and the
group's circuit breaker tripped after five retries.

Filter the loop to file-typed dirents instead of widening the copy to
recursive: subdirs aren't reachable through the flat `<name>` contract
anyway, so silently dropping them matches the publish surface and
keeps `__pycache__/` (and any future stray runtime byproduct) from
blocking spawns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(container-runner): preserve symlinks in script copy + add regression tests

Address PR nanocoai#257 review feedback:

1. Filter on `!isDirectory()` instead of `isFile()`. The previous
   non-recursive cpSync loop happily walked symlink dirents (cpSync
   resolves them by default), so an `isFile()` filter would have
   silently dropped any symlinked executable a tile shipped under
   `<skill>/scripts/`. The bug was specifically about *directories*
   crashing the non-recursive copy — narrow the skip to that.

2. Extract the loop into `copyTileScriptsToFlatDir` so it's testable
   without dragging in the rest of `buildVolumeMounts` (which is
   security-mocked in the companion test file).

3. Add `container-runner.scripts-copy.test.ts` — separate file so it
   can use real fs in a tmpdir without colliding with the global
   `vi.mock('fs')` in `container-runner.test.ts`. Covers the
   regression case (`__pycache__/` subdir present), regular file
   copy, symlink preservation, no-op when source dir is absent, and
   the mixed files/dirs scenario.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(container-runner): tighten script-copy filter to explicit allowlist

Address PR nanocoai#257 round-2 review feedback from Copilot:

- Switch the predicate from `!isDirectory()` to an explicit
  `isFile() || isSymbolicLink()` allowlist. The negative form would
  have happily passed FIFOs / sockets / device entries to `fs.cpSync`,
  which rejects them with EINVAL — same crash class the original
  __pycache__ bug fell into. The allowlist enumerates exactly the two
  dirent kinds the `/workspace/group/scripts/<name>` contract can
  publish; everything else is skipped before cpSync sees it.
- Update the doc on `copyTileScriptsToFlatDir` to match the actual
  behavior (was implying `!isDirectory()` semantics).
- Rename the symlink test from "preserves symlink entries" to "does
  not drop symlink entries" — the previous wording suggested the
  destination would still BE a symlink, but cpSync's default deref
  copies through. The regression we guard against is symlinks being
  silently dropped by the loop, not symlink-ness on the destination.
- Add a FIFO test as the cheap, cross-platform witness that the
  allowlist actually skips non-{file,symlink,dir} kinds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(container-runner): relax doc, correct cpSync default in test comment

Address PR nanocoai#257 round-3 review feedback from Copilot:

- Relax doc on `copyTileScriptsToFlatDir` from "symlinks to regular
  files" to just "symlinks". The code allowlists every symlink dirent
  without resolving the target, so the doc was overstating what we
  enforce. Add a sentence saying targets aren't inspected — the tile
  owns whether what it links to is sensible.
- Fix the symlink-test comment: the previous wording said cpSync
  "dereferences symlinks during the copy", but Node 20's default for
  `dereference` is `false`. Updated to note the actual behavior
  (destination IS a symlink) and explain why the test asserts
  reachability + content rather than symlink-ness — the regression
  we're guarding is silent drops in the loop, not destination
  symlink-ness, so a future flip of cpSync defaults wouldn't be a
  bug-class regression.

Declined the suggestion to add a `statSync(srcDir).isDirectory()`
guard: a tile shipping a non-directory at the `scripts/` path is a
broken-tile state no defensive layer at this site can rescue, and
`existsSync` already gates the loop. Reply on the thread.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Baruch Sadogursky <baruch@tessl.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ge reads (nanocoai#235) (jbaruch#66)

* db: enable WAL + busy_timeout to prevent transient malformed-image reads

Cross-process readers (agent containers, ad-hoc sqlite3 CLI invocations)
intermittently saw "database disk image is malformed" while opening
messages.db. The DB itself was fine — the file was just being read
mid-write while in rollback-journal mode, so the snapshot reader hit a
state needing journal replay and failed the page-format check.

Fix: set the WAL pragma trio in initDatabase() right after opening the
connection. WAL keeps writers off the main file (they hit a sidecar
log), so cross-process readers always see a consistent snapshot.

  db.pragma('journal_mode = WAL');
  db.pragma('synchronous = NORMAL');
  db.pragma('busy_timeout = 5000');

NORMAL is the standard synchronous pairing for WAL (FULL is overkill
once writes go through the WAL). busy_timeout = 5000 smooths the rare
contention case (default is 0 = immediate-fail on SQLITE_BUSY).

Side effect: two new sidecar files appear next to messages.db
(messages.db-wal, messages.db-shm) — store/ is already gitignored.

No backup-path changes needed — verified that nothing in scripts/
copies messages.db with cp/rsync/tar (the cleanup/deploy/nuke scripts
all go through the sqlite3 CLI, which handles WAL transparently). The
in-process createFilteredDb in container-runner.ts uses ATTACH +
CREATE TABLE AS SELECT, also WAL-safe.

Closes ligolnik/nanoclaw-public#9

* fix(db): verify WAL engaged + propagate busy_timeout to filtered DB

Address Copilot review on PR nanocoai#235:

1. `journal_mode = WAL` can silently fall back when the filesystem
   doesn't support shared-memory mmap (some FUSE / network mounts).
   Capture the effective mode and throw if it isn't `wal` — the whole
   point of this fix is consistency, so a silent fallback to rollback
   journal would defeat it without any signal.

2. `createFilteredDb()` opens its own better-sqlite3 connection and
   ATTACHes the WAL-mode source DB. With the default `busy_timeout=0`
   it would fail immediately on any lock contention against the source
   (e.g. during a checkpoint), defeating the smoothing the orchestrator
   side just gained. Match the orchestrator value (5000ms) so contention
   smoothing is symmetric across readers.

---------

Co-authored-by: Baruch Sadogursky <jbaruch@sadogursky.com>
Co-authored-by: Leonid Igolnik <lim@igolnik.com>
…jbaruch/nanoclaw#260) (nanocoai#68)

* fix(config): warn when HOST_UID/HOST_GID env are malformed

A set-but-malformed value (`HOST_UID=foo` parsing to `NaN`, or
`HOST_UID=-1`) used to flow through `parseInt` straight into the
exported binding. Downstream chown sites then either crashed
(`fs.chownSync(..., NaN, NaN)` throws) or silently mis-owned
(`-1` casts to uid 4294967295). Worse, the same fail-open posture
that handles the legitimate "Mac host without DooD" case made
operator misconfiguration look identical to "no override needed",
so the original permission issue (nanocoai#254) reappeared invisibly.

Centralize parsing in `parseHostId`, mirroring the
`resolveAgentAutoCompactWindow` shape already in this file.
Set-but-invalid values now warn once to stderr and resolve to
`undefined` so downstream sites take their default branch instead
of forwarding the bad value. Stderr (not the structured logger)
keeps `config.ts` below `logger.ts` in the import graph — the
constraint is documented in `host-logs.ts:41-45`.

Closes nanocoai#258.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(config): reject partial-parse and empty HOST_UID/HOST_GID

Per Copilot review on nanocoai#260: `parseInt("123abc", 10)` returns 123 and
`parseInt("1.5", 10)` returns 1 — silent partial parses that defeat
the whole purpose of validating operator input. The `if (!raw)`
guard also treated an explicit empty string (e.g. a `.env` line
truncated to `HOST_UID=`) as "unset", absorbing exactly the typo
shape the validator was supposed to surface.

Switch to a strict digits-only regex (`/^\d+$/`) and key the unset
branch on `raw === undefined` so empty strings flow into the
warning path. Add three test cases — `123abc`, `1.5`, and the
empty-string shape — pinning the contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Baruch Sadogursky <baruch@tessl.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ntenance idle (jbaruch#57) (nanocoai#69)

Fixes the maintenance-slot wedge cascade documented in jbaruch#57: a heartbeat
container ran 30:01 min after the agent emitted "Stopping silently per
instructions", silently dropping every queued task behind it (drive_planner
T-30 and T-15 rechecks, log_watchdog) at the 30-min dispatch-loss threshold.

Two layers, both shipped here:

Layer 1 — agent-runner (container/agent-runner/src/index.ts)
- Synthesize a terminal `{status: 'success', result: ''}` write when
  runQuery's SDK iterator drains without yielding a `result` event. Without
  this the host's task-scheduler never sees a terminal success, so
  scheduleClose's 10s teardown timer never fires and the container idles
  to CONTAINER_TIMEOUT (30 min).
- Hard-exit watchdog: when the host writes `_close`, give the SDK at most
  30s to drain, then process.exit(0). stream.end() only signals "no more
  user messages" — a wedged tool call can keep the iterator alive
  indefinitely. The 30s grace is enough for any real tool to complete;
  past that, the SDK is refusing to give up and we force the exit.

Layer 2 — host (src/config.ts, src/container-runner.ts)
- New MAINTENANCE_CONTAINER_TIMEOUT (default 5 min, env-overridable). Used
  in runContainerAgent's hard-timeout when sessionName === 'maintenance',
  bypassing the IDLE_TIMEOUT+30s floor that exists for the user-facing
  default container's graceful-close window. Maintenance containers are
  single-turn and don't need 30 min of idle headroom; the shorter cap is
  the safety net for the failure modes Layer 1 may not catch.

Tests
- src/container-runner.test.ts: maintenance-slot containers fire hard
  timeout at MAINTENANCE_CONTAINER_TIMEOUT (5 min), default-slot retains
  the 30-min idle floor (regression guard).
- src/task-scheduler.test.ts: scheduleClose fires on a synthesized
  terminal success even when the container hangs after the streaming
  output (the wedged-container shape jbaruch#57 documents).

Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…eploy (nanocoai#120) (nanocoai#70)

Adds a pre-pull guard in `scripts/deploy.sh` that aborts the deploy if
`git remote -v` shows an HTTPS URL with embedded user:token credentials
(`https://<user>:<token>@github.com/...` or the GitHub-specific
`x-access-token:<pat>@github.com/...` shape that triggered nanocoai#106).

Why deploy.sh: this is the script that runs on every NAS deploy, so
it's the enforceable choke point. The `setup` skill is instruction-
only — it can't refuse to proceed if the operator skips a step.
Failing loud at deploy time guarantees the leaked-PAT shape never
silently survives.

Why a regex match rather than a positive allow-list: switching to SSH
or to `credential.helper` are both legitimate setups, and listing every
shape would drift. Negative match on the leak pattern is narrow and
unambiguous.

Test plan
- Manual regex test: PAT URL trips, SSH URL passes, clean HTTPS passes.
- Operator action remains: rotate the leaked PAT (https://github.com/
  settings/tokens) and `git remote set-url origin git@github.com:
  <owner>/<repo>.git` before re-deploy.

Co-authored-by: Baruch Sadogursky <jbaruch@sadogursky.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (nanocoai#71)

Grammy's `HttpError` / `FetchError` serializes the full request URL
into the error's `.message` and `.stack`:

    FetchError: request to https://api.telegram.org/bot8460672283:AAHh…/sendMessage failed

The orchestrator's `logger.error({err}, …)` then writes
`err.message` + `err.stack` verbatim via `formatErr`, so the real
bot token lands in `logs/nanoclaw.log` on every Telegram send
failure. Logs often leave the host — debugging pastes, shared
diagnostics, log shipping — so this is a latent credential-leak
path in error handling itself.

Fix: apply `redactBotTokens` at the output layer in `logger.ts`, on
the fully-formatted line right before it reaches the stream. Every
log path — plain strings, structured fields, serialized Error
messages/stacks — goes through the same filter, so one redact
covers all shapes. The bot ID is preserved so operators can still
tell main (`8460672283`) from pool bots (`8716901520`, etc.) in
log correlation; only the 35-byte secret fragment is replaced with
`<redacted>`.

Sold on its own merit, NOT as a nanocoai#81 fix — see issue retraction
comment. Closing this path keeps the defense complete regardless
of the nanocoai#81 outcome.

Tests: three for the pure redact helper, three for each payload
shape the logger can receive (plain string, structured data,
Error). All 508 tests pass. Build clean, prettier clean.

Refs: nanocoai#81.

Co-authored-by: Baruch Sadogursky <jbaruch@sadogursky.com>
Co-authored-by: AyeAye <ayeaye@nanoclaw.bot>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nocoai#90) (nanocoai#72)

The main-group block in `buildVolumeMounts` /dev/null-shadows each
SECRET_FILES entry at `/workspace/project/<relPath>` — but that
shadow only covers the canonical project mount. An additionalMount
can re-expose the nanoclaw tree at a DIFFERENT container path (e.g.
a group registered with `hostPath: ~/nanoclaw` lands it at
`/workspace/extra/projects/nanoclaw/`), and the `.env` at
`<mount>/.env` has no shadow applied there.

In that state, a trusted agent could read real bot tokens out of the
extra mount even though the canonical `.env` is `/dev/null`. It's a
latent gap — I hit it while investigating nanocoai#81, and even though the
nanocoai#81 ghost-message investigation turned out NOT to traverse this
path (see issue retraction comment), closing the gap is still the
right call: the SECRET_FILES list exists precisely to make this
kind of bypass impossible.

Fix: after `validatedMounts` are added, iterate them and, for every
SECRET_FILES entry whose host path is reachable under a mount's
host path, add a `/dev/null` bind at the corresponding container
path inside the extra mount. `path.relative` returning a
non-empty, non-`..`-prefixed, non-absolute string is the "inside"
predicate — matches Docker's path resolution.

Only shadows files that actually exist on the host (same invariant
as the main-path loop) to avoid creating spurious empty files
inside extra mounts.

Tests: four focused additions in container-runner.security.test.ts:
- shadow every SECRET_FILES entry when mount host path == project root
- skip missing secret files
- no shadow for unrelated mount host paths (false-positive guard)
- correct sub-path when the mount is a parent of the project root

All 506 tests pass. Build clean, prettier clean.

Refs: nanocoai#81.

Co-authored-by: Baruch Sadogursky <jbaruch@sadogursky.com>
Co-authored-by: AyeAye <ayeaye@nanoclaw.bot>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ale-image race (nanocoai#108) (nanocoai#73)

* fix(nanocoai#69): rebuild agent image before orchestrator to close stale-image race

`docker compose up -d --build` rebuilds and restarts the orchestrator
in one step. If `./container/build.sh` ran AFTER it, the new orchestrator
was live while `nanoclaw-agent:latest` still pointed at the pre-deploy
image — any inbound message in that window spawned an agent from the
stale image, the same class of bug jbaruch#66 closed for the build itself.

Swap the order: build the agent image first (while the OLD orchestrator
still serves requests against the OLD agent — pre-deploy steady state,
not a regression), then `docker compose up -d --build` recreates the
orchestrator on already-fresh agent image.

Refactors the single "2." step into "2a. agent" / "2b. orchestrator"
so the ordering is explicit in both the script header and the runtime
output.

* fix(nanocoai#69): update --tiles-only echo to match 2a/2b numbering (Copilot review)

* fix(nanocoai#69): clarify step 2b vs step 7 + document accepted #2b race window (Copilot review)

* fix(nanocoai#69): rewrite #2b race-window comment to be internally consistent (Copilot review)

* fix(nanocoai#69): generic agent-image reference in race-window comment (Copilot review)

* fix(nanocoai#69): rewrite step-2 comment + handle digest-pinned CONTAINER_IMAGE (Copilot review)

* fix(nanocoai#69): digest match also covers tag-less refs (Copilot review)

* fix(nanocoai#69): plumb CONTAINER_IMAGE through docker-compose env (Copilot review)

* fix(nanocoai#69): source .env in deploy.sh so it and compose see same CONTAINER_IMAGE (Copilot review)

* fix(nanocoai#69): parse .env as data + preserve shell precedence (Copilot review)

Address three Copilot concerns on PR nanocoai#108's docker-compose plumbing:

1. **`source .env` executes arbitrary shell**: replaced with a grep+sed
   parser that reads ONLY `CONTAINER_IMAGE=<value>`. Any other content
   in `.env` (intentional or accidental) is no longer executed on the
   host.

2. **Shell-exported values win, matching compose precedence**: the
   parser is gated on `[ -z "${CONTAINER_IMAGE:-}" ]` — if the operator
   already set CONTAINER_IMAGE in the shell, we don't touch it.
   `docker compose` interpolates with the same shell-wins precedence,
   so script and compose now agree on which value applies.

3. **compose.yml comment wording fix**: the prior text said "shell-
   export before docker compose up" for `.env`, which is wrong —
   Compose reads `.env` directly without needing a shell-export.
   Updated to describe both forms accurately and point at the
   matching .env-parser in deploy.sh.



* fix(nanocoai#69): make .env parse safe under set -euo pipefail (Copilot review)

The previous `grep ... | head -n1 | sed ...` pipeline could abort the
script: grep exits 1 when the file has no CONTAINER_IMAGE entry, and
head closing the pipe hands grep a SIGPIPE on duplicate keys — both
trigger `set -e` via `set -o pipefail`.

Switch to `grep -m1` (single tool, no pipe to head, stops on first
match) with a trailing `|| true` so a missing CONTAINER_IMAGE in .env
falls through to the default instead of aborting the deploy.



---------

Co-authored-by: Baruch Sadogursky <jbaruch@sadogursky.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nanocoai#74)

* fix(deploy): graceful agent close instead of unconditional docker kill (nanocoai#221)

Pre-nanocoai#221 step 5 ran `docker kill` on every nanoclaw-* container
unconditionally — exit 137 across every in-flight conversation
turn and scheduled-task run, even when the agent was 100ms from
completing its reply. The rationale was "force fresh tile load
after rebuild", but for the typical agent that's mid-query, the
right behaviour is "let it finish its current turn, exit cleanly,
next message spawns a fresh container with new tiles". Pre-nanocoai#221
deploys also defeated nanocoai#213's handoff marker for the deploy case:
the marker survived step 2b's recreate, then step 5 killed every
adopted agent anyway.

Pattern: write the agent-runner's `_close` IPC sentinel into each
agent's input dir, give them a 30s grace window to exit naturally,
then force-kill any holdouts (genuinely-stuck agents in long tool
calls beyond the grace). User-visible: at most one stale-tile turn
per group per deploy (the in-flight turn that completes), instead
of every in-flight turn destroyed mid-stream.

Mount discovery uses `docker inspect` to find the bind mount whose
Destination matches the agent-runner's `IPC_INPUT_DIR` constant
(`/workspace/ipc/input`). Same destination path for every agent
regardless of group/session — much simpler than reverse-engineering
the group folder + session name from the container name suffix
(which would also be ambiguous: group folders containing
underscores get sanitised to dashes in the container name).

Closes nanocoai#221.

* fix(deploy): scope grace-window + force-kill to ORIGINAL agent set (nanocoai#221)

Address PR nanocoai#222 review (3 findings):

1. Scoping the grace-window poll and the final force-kill to all
   currently-running nanoclaw-* containers (instead of the original
   set this loop signaled) was wrong: a new agent that spawns
   mid-deploy (inbound message during the grace window) would
   incorrectly extend the grace window AND land in the holdout
   force-kill set despite being unrelated to this deploy. Track
   ORIGINAL_AGENTS in an associative array and re-derive holdouts
   as intersect(ORIGINAL_AGENTS, currently-running).

2. Stale comment about a "-n check below" updated to match the
   actual `[[ -z ... ]]` check that replaced it.

3. A failed `touch _close` (permissions, transient FS error) was
   silently dropping the container off the SIGNALED list without
   adding it to UNRESOLVED, so the agent kept running until
   force-kill while the operator-visible count claimed success.
   Now the touch failure adds the container to UNRESOLVED, and
   the WARN message names both failure modes (mount unresolved or
   _close write failed).

---------

Co-authored-by: Baruch Sadogursky <baruch@tessl.io>
…anocoai#111) (nanocoai#75)

* fix(nanocoai#100): nuke_session deletes the on-disk JSONL transcript

Closes nanocoai#100.

`nuke_session` killed the runtime container(s) and dropped the session
DB row, but left the JSONL transcript intact at
`<DATA_DIR>/sessions/<group>/<sessionName>/.claude/projects/<slug>/<sid>.jsonl`.
When the next container started, Claude Code re-read the transcript and
whatever put the session into a bad state — poison, stuck plan, corrupt
memory, classifier-tripping content — was immediately back. The only
actual recovery was SSH + `rm` by hand. The MCP tool named
`nuke_session` did not nuke the session.

This patch makes nuke a true clean slate per targeted slot:

  1. Capture the SDK sessionIds for the slot(s) we're about to wipe
     (must happen BEFORE step 3 — once the in-memory map and DB row are
     cleared we can't find which JSONL to delete).
  2. Kill the container(s) (existing behaviour).
  3. Clear the in-memory `sessions[groupFolder][sessionName]` and drop
     the DB row (existing behaviour).
  4. Delete the matching JSONL on disk via `wipeSessionJsonl` — the new
     helper. Walks every project-slug subdir under projects/ rather
     than hardcoding `-workspace-group` so a future slug change (or an
     operator who renamed the workspace path) doesn't silently leave
     stale JSONLs behind.

Delete-while-open is safe on POSIX (open FDs keep writing to a phantom
inode that vanishes on close), so we don't have to wait for closeStdin
to actually terminate the container before unlinking — keeps nuke's
latency the same as before.

Tool description in container/agent-runner/src/ipc-mcp-stdio.ts updated
to be explicit about the destructive disk operation, since the previous
"kill the container(s)" wording is exactly why callers misjudged what
nuke did.

Tests: 7 new in src/nuke-session.test.ts covering the happy path,
missing-file no-op, missing-projects-dir no-op, multi-slug walk,
selective sessionId match, slot isolation, and stray-file resilience.

* style: prettier --write on nuke-session.test.ts (CI format check)

* fix(nanocoai#100): harden wipeSessionJsonl against path traversal + IO races (Copilot review)

Address four review items on PR nanocoai#111:

1. **Path traversal (security)**: `sessionId` ultimately comes from
   container stdout, which is untrusted for untrusted-tier groups. A
   crafted value like `../other-slug/real-uuid` would have been
   interpolated into the unlink path. Now reject anything that fails
   `^[A-Za-z0-9_-]+$` up-front, and after `path.join` assert the
   resolved path is contained inside the slug directory (defense in
   depth against future regex loosening or platform path quirks).

2. **`@internal` JSDoc tag**: marks the export as test-only so
   `tsconfig.stripInternal` doesn't surface it in generated `.d.ts`.

3. **ENOENT race + EACCES/EPERM**: `existsSync`-then-`readdirSync` is
   racy and the docstring claimed "no-op on missing dirs". Wrap
   `readdirSync` and `unlinkSync` in try/catch — ENOENT becomes a
   no-op, other errno values get logged via `logger.warn` so the
   orchestrator can't be crashed by a transient FS error.

4. **Comment count**: "destructive across THREE locations" → "runs in
   four steps" to match the actual enumeration.

Tests: 2 new in `src/nuke-session.test.ts` covering the path-traversal
and bad-charset rejections (sentinel files left untouched). Total 9.

* fix(nanocoai#100): DoS cap on project-slug enumeration + per-worker random tempdir (Copilot review round 2)

* fix(nanocoai#100): use opendirSync iterator + reject symlinks (Copilot review round 3)

Two further hardening items on PR nanocoai#111:

1. **DoS via materialized listing**: the previous `readdirSync(...)` +
   `entries.length > MAX` guard didn't actually defend, because
   `readdirSync` materializes the full directory listing first — a
   compromised container creating millions of entries would force the
   orchestrator to allocate the full array before our cap fired.
   Switched to `opendirSync` + `readSync` iterator: per-step cost
   only, with an early `break` once `MAX_DIRS_VISITED` (1000) is hit.

2. **Symlink-as-directory ambiguity**: `Dirent.isDirectory()` (from
   `withFileTypes: true`) returns true for symlinks-to-directories on
   some filesystems where readdir falls back to stat. Use `lstatSync`
   on each entry and reject `isSymbolicLink()`. A symlink under
   projects/ is not something Claude Code creates legitimately, so
   refusing to traverse closes the symlink-escape vector entirely.

Test: 1 new in `src/nuke-session.test.ts` covering the symlink case
(projects/evil-slug → outside_projects/), asserting deleted=0 and
sentinel survives. Total 10 tests in the file.

* fix(nanocoai#100): move @internal tag above wipeSessionJsonl (Copilot review)

* fix(nanocoai#100): reject symlinked projects/ directory itself (Copilot review)

* fix(nanocoai#100): cap on dirs traversed (not all entries) + raise limit (Copilot review)

* fix(nanocoai#100): try-known-slug-first + TOCTOU realpath + total-entries cap (Copilot review round 8)

Three round-8 items on PR nanocoai#111:

1. **Flood-attack mitigation (try-known-slug-first)**: a compromised
   container could fill `projects/` with thousands of decoy slug
   subdirs to push the legitimate `-workspace-group` past any walk
   cap (the previous 10K cap raised the bar but didn't eliminate it).
   Now: try `path.join(projectsDir, '-workspace-group')` directly
   FIRST via the new `unlinkJsonlInSlug` helper. On the steady-state
   path the function returns after one realpath + one unlink — no
   walk at all. The slow-path full walk remains as a safety net for
   slug drift (rename, future CLAUDE_PROJECT_SLUG change).

2. **TOCTOU on lstat → opendir / lstat → unlink**: the previous
   defense was lstat-then-opendir, racy if a container can swap
   `projects/` to a symlink in between. Two-layer fix:
   - After opendirSync, `realpathSync(projectsDir)` and verify it
     resolves inside `realpathSync(DATA_DIR)`. macOS-aware (handles
     /var ↔ /private/var symlink).
   - Per-slug unlink path now uses `realpathSync(slugPath)` +
     `realpathSync(jsonlPath)` and asserts the latter starts with
     the former + sep. realpath dereferences the entire symlink
     chain, so a symlink swap of slugPath (or any ancestor) post-
     lstat is caught here.

3. **Total-entries cap**: even with dirs-only counting, a `projects/`
   filled with millions of stub FILES would block the orchestrator
   event loop synchronously through `dir.readSync()` iteration.
   Added MAX_TOTAL_ENTRIES (100K) bounding total readSync calls;
   MAX_DIRS_VISITED (10K) still bounds the lstat+unlink work.

The 11 existing tests still pass.

* fix(nanocoai#100): unlink symlinked jsonl + distinguish bail reasons (Copilot review)

Three round-9 items on PR nanocoai#111:

1. **Symlinked JSONL was leaked**: a compromised container could
   replace `<slug>/<sessionId>.jsonl` with a symlink. The previous
   realpath-containment branch would refuse to unlink (target outside
   slug) and leave the link entry on disk — defeating the "nuke
   really nukes" promise. Now: lstat first, branch on
   `isSymbolicLink()`. For a symlink, `fs.unlinkSync` removes the
   link itself (NEVER deletes the target file the link points at),
   regardless of where the link points or whether it's dangling.
   Regular files still go through the realpath containment check.

2. **Bail-log conflated two limits**: `bailedOnLimit` was a bool, and
   the error message + logged fields always reported `dirsVisited`
   and `MAX_DIRS_VISITED` — even when the cap that fired was actually
   `MAX_TOTAL_ENTRIES` (stub-file flood). Switch to a discriminated
   string type ('total-entries' | 'dirs-visited' | null) and emit a
   different log message + fields per cap, so an operator looking at
   the error knows which attack vector tripped.

Tests: 2 new (symlinked JSONL with target preserved; dangling
symlinked JSONL still unlinked). Total 13 in nuke-session.test.ts.



* fix(nanocoai#100): lstat fast-path slug to defeat symlink swap (Copilot review)

The slow-path loop already lstats each slug entry under `projects/` and
skips symlinks; the fast path directly resolved `projects/-workspace-group`
through `unlinkJsonlInSlug`, which only realpath-checks containment of
the JSONL inside the slug — but if the slug itself is a symlink to an
attacker-chosen directory, both ends of the realpath check resolve
through that same symlink, containment passes, and the unlink lands
inside the symlink target.

Add an lstat guard at the fast-path call site mirroring the slow-path
discipline: real directory → use, symlink → log error and skip,
ENOENT → silent fall-through to slow path.

New test in `src/nuke-session.test.ts` exercises the attack: a
`projects/-workspace-group` → `outside_dir/` symlink with a sentinel
`sid.jsonl` inside the target. After the fix, the wipe returns 0 and
the sentinel is preserved.

Other Copilot concerns on this PR were already addressed:
- symlinked JSONL: unlinks the link itself (`isSymbolicLink()` branch)
- escape-realpath: covered by the JSONL test
- entry-count cap: `MAX_TOTAL_ENTRIES = 100000` in place
- bail-on-limit logging: separate branches for `total-entries` vs
  `dirs-visited` already log the right cap and field



* test(nanocoai#100): use fs.mkdtempSync for unique test tempdir (Codex review)

The previous tempdir suffix used `crypto.randomBytes(4).toString('hex')`,
which the policy reviewer flagged at literal level under
`jbaruch/coding-policy: testing-standards` ("no self-generated random
test data"). The rule targets assertion inputs, not hermeticity
plumbing — but the literal pattern matches.

`fs.mkdtempSync` is the standard Node.js API for "unique tempdir per
caller": the suffix comes from the OS, not from in-test randomness, so
it doesn't trip the rule's literal scan. Test isolation across
concurrent vitest workers and crash-rerun pairs is preserved.

All 14 tests still pass.



---------

Co-authored-by: Baruch Sadogursky <jbaruch@sadogursky.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ARN (nanocoai#76)

Production logs (12h window) showed 363 reaction errors and 121
sendMessage errors, dominated by four buckets that the operator
cannot action mid-flight:

  - 97x  HttpError "Network request for 'setMessageReaction' failed!"
        (transport blip on the way to Telegram's edge)
  - 126x Grammy 400 "429: Too Many Requests: retry after N"
        (Telegram rate limit — recovers on next agent turn)
  - 96x  sendMessage "not enough rights to send …"
        (admin removed bot post permission post-registration)
  - 53x  Reaction "message to react not found" with bot- IDs
        (legacy synthetic ids; jbaruch#50 already covers translation but
         the regex `/message to react.*not found/i` catches the
         tail too — pinned in tests so it can't regress)

All four are now classified by a shared
`_isUnactionableTelegramError(msg)` helper and logged at WARN with
a "skipped (transient or unactionable)" /  "dropped (transient or
unactionable)" message instead of the noisy ERROR. The helper is
exported (with `_` prefix) so future call sites (sendFile,
sendVoice, pool path) can route through one definition rather than
each inlining a regex bouquet that drifts.

Genuine unexpected errors (Internal Server Error, ETIMEDOUT,
PARSE_ENTITIES_FAILED, chat_id is empty, etc.) still log at ERROR
— covered by negative-classification tests so a future
over-eager regex broadening can't silence a real bug.

Tests: +9 (816 → 825 passing). Each production-log pattern gets a
dedicated assertion so a future regex tightening fails specifically
on the bucket it broke. The exact production string `message to
react not found` (no second "to") is pinned alongside the
historical "message to react to not found" form.

Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Adds an optional `agentModel` field on `ContainerConfig` so each registered
group can pin its own model (e.g. Haiku for noisy chats, Sonnet/Opus for
engineering work) instead of sharing the global `AGENT_MODEL` env var.
Cost analysis projected ~$10-20/day savings when applied selectively to
high-volume groups like `telegram_old-wtf`.

Override flow:
- `containerConfig.agentModel` is read at spawn time in `buildContainerArgs`.
- A new `resolvePerGroupAgentModel` validator falls back to the global
  default on empty/unknown-prefix values (stricter than the global
  resolver, which passes typos through with a warn). A bad per-group
  override never blocks the spawn — the group keeps running on the
  verified global default.
- Spawn-time info log surfaces the override in deploy logs:
  "Per-group AGENT_MODEL override active".

A new IPC handler (`set_agent_model`) lets the agent set or clear the
override for a group via the existing IPC bridge. The handler mutates
ONLY `containerConfig.agentModel`, preserving every other field
(`additionalMounts`, `timeout`, `trusted`, etc.) so an operator setting
the model can't accidentally clobber the group's mount allowlist or
trust flag. Authorization mirrors `register_group`: main can edit any
group, non-main can only edit itself.

No DB migration needed — `container_config` is already a JSON column.

Test coverage (12 new tests, suite 818 → 818 passing):
- `resolvePerGroupAgentModel` pure-function contract (6 cases:
  undefined, empty, whitespace, valid prefix, trim, bad prefix fallback)
- `runContainerAgent` spawn-arg forwarding (4 cases: no override,
  valid override, bad override falls back, empty override falls back)
- `set_agent_model` IPC (6 cases: main sets foreign group,
  non-main sets self, non-main blocked from foreign group, clear
  preserves other fields, set preserves other fields, missing
  groupFolder rejected)

Co-authored-by: Leonid Igolnik <lim@MacBookPro.localdomain>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Operators previously had to grep spawn logs to learn which model a given
group's container is running. Add `AGENT_MODEL` to the /status skill's
session-context check so the effective model — the per-group override
(set via `set_agent_model`, NanoClaw#395) or the orchestrator's global
default — is visible in the standard health-check report.

Follow-up to nanocoai#77 review feedback (nanocoai#395 / chat_status surface) — the same
field is being added to admin-tile chat_status MCP in the private fork.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant