Re-armable atime sensor + --sensor selector (#28, #100) #160
LiorFink00 wants to merge 35 commits into
Adds FIFO_MODE probe (probe_fifo_mode), BAITCACHE dir, cache_path helper, and a FIFO branch in plant() that creates a named pipe and caches content for the upcoming serve loop. Regular-file path remains as FIFO_MODE=0 fallback. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t imports (#100)
- probe_fifo_mode(): add `unset _probe` to prevent the temp var from persisting as a global in POSIX sh
- plant() FIFO branch: rm -f "$cf" before returning on mkfifo failure so no stale cache file is left behind
- tests/test_agent_fifo.py: drop unused `socket` and `time` imports
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add attribute() stub, serve_fifo() (held-fd read-detection loop), watch_fifo() supervisor, and update start_watcher() to prefer watch_fifo when FIFO_MODE=1. A reader opening the bait FIFO unblocks exec 3>"$fifo", serves cached content, then fires a callback with event_type=open after debounce check. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
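The held-writer idea described above can be illustrated in isolation. This is a minimal sketch, not the agent's serve_fifo(): the file names, the bait content, and the event file are all illustrative assumptions, and a plain redirection stands in for the `exec 3>"$fifo"` the commit mentions.

```shell
#!/bin/sh
# Sketch only: a waiting writer detects a reader opening a named pipe.
# "bait", "fake-credential" and the event file are illustrative names.
dir=$(mktemp -d)
fifo="$dir/bait"
mkfifo "$fifo"

# Writer side: opening the pipe for writing blocks until a reader opens it.
# Getting past that open is the detection signal; only then is content served.
(
    printf 'fake-credential\n' > "$fifo"
    echo "event_type=open" > "$dir/event"
) &
writer=$!

# Reader side: stands in for whatever process opens the bait.
content=$(cat "$fifo")

wait "$writer"
event=$(cat "$dir/event")
rm -rf "$dir"

echo "reader got: $content"
echo "writer saw: $event"
```

The same blocking behavior explains the failure modes discussed later in this PR: a pipe with no writer hangs every reader, and a pipe with no reader hangs every writer.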
Replace stub attribute() with real implementation that full-scans lsof -nP, matches the FIFO by inode (BSD stat -f %i / GNU stat -c %i), filters for read-mode fd (field 4 ending in 'r') not owned by our serve subshell, then resolves pid -> process and os_user best-effort. Adds TDD test test_callback_includes_reader_pid_and_user (RED -> GREEN). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Add remove_fifos(): reads MANIFEST_FILE and rm -f's any path that is a FIFO
- Call remove_fifos at startup (FIFO_MODE=1) to clear orphaned pipes from a prior hard-kill
- Wire remove_fifos into EXIT/INT/TERM traps so a clean exit leaves no writer-less FIFOs
- stop_watcher is NOT modified (runs on every reconcile; must not remove kept baits)
- TDD: test_clean_exit_removes_fifos and test_startup_sweeps_a_stale_fifo RED→GREEN
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…#100) Replace the prior test body which used the current-deployment bait path (so plant()'s own EEXIST guard would remove it, bypassing the sweep) with an orphan FIFO path that is in the manifest but NOT in the current deployment set. Only the startup sweep can remove an orphan; without it the test fails. RED/GREEN verified: commented-out sweep → FAIL, restored → PASS. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rify (#100) Removes watch_fs_usage; reverses the atime stat probe order so Linux reads real access time (stat -c %X first; stat -f %a was statfs free-blocks on Linux). verify_planted treats a regular file where a FIFO should be as tampering. Updates architecture.md and the legacy plant/state tests to expect FIFO bait. Known follow-up (next commit): verify-pass re-plant must restart the FIFO watcher so a re-planted pipe is served — otherwise it blocks readers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When verify_planted successfully re-plants a missing FIFO bait, nothing previously restarted the watcher, so the new pipe had no serve_fifo loop writing to it. Any opener then blocked forever on open() and no callback ever fired. Fix: introduce a REPLANTED flag, set it on a successful re-plant inside verify_planted, and restart the watcher (stop_watcher + start_watcher) at the end of each sync cycle when FIFO_MODE=1 and something was re-planted. REPLANTED is cleared before each call so it only reflects the current cycle. Also fix a pre-existing deadlock in tests/test_agent_plant.py::test_refreshes_its_own_bait: in FIFO mode the planted path is a named pipe, so Path.write_text() blocks forever (open O_WRONLY on a pipe with no reader). Fix: unlink the FIFO before writing a regular file to simulate stale bait. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
) Adds test_two_agents_one_host_both_detect: starts two concurrent agents each with their own per-bait FIFO, reads both pipes, asserts >= 2 event_type=open callbacks reach the stub server — proving no single-consumer collision (the original kdebug/fs_usage shared-sensor bug) is possible with per-bait FIFOs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
C1: move remove_fifos() to after acquire_singleton() so a duplicate invocation (MDM re-push with the same --state-file) exits at the mutex rather than deleting the live agent's shared bait FIFOs.
Also:
- m1: delete dead READ_OPS constant and is_read_op() (only removed watch_fs_usage used them)
- m2: chmod 700 $BAITCACHE after mkdir -p in plant() to prevent deployment-ID enumeration via directory listing
- m3: rm -rf $BAITCACHE in self_destruct before rmdir so fake creds are not left on the endpoint after decommission
- m5: replace \t in attribute() awk inode regex with [[:space:]] for POSIX portability
- m6: split test_atime_stat_order_prefers_portable_access_time into two distinct site-pinning assertions plus a wrong-order guard
- C1 regression test: test_duplicate_install_does_not_sweep_live_agents_fifo verified RED before fix, GREEN after
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
I1: rewrite header comment to document FIFO named-pipe as primary
(unprivileged) sensor and atime poll as mkfifo-unavailable fallback;
drop "run as root for fs_usage" from example.
I2: update routes.py install script docstring and inline comment, and
deploy.py _install_command comment: root is only needed to plant in
system paths (e.g. /etc/ssh), NOT for read detection.
I3: change watch_atime log line from "fs_usage unavailable" to
"mkfifo unavailable".
I4: fix three stale inline comments — target-user expansion header
("for fs_usage" → "for system-path planting"), stop_watcher docstring
("fs_usage/grep children" → "serve_fifo children"), and _fire NOTE
("fs_usage grep filter" → "dep_index_for_line re-matches the path").
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Kept FIFO sensor (macOS-only, gated by probe_fifo_mode platform check)
- Kept inotify sensor (Linux) from origin/main; dropped fs_usage entirely
- Fallback chain: macOS → FIFO, Linux → inotify, else → atime poll
- Added WATCH_STOP_FLAG to start_watcher/stop_watcher (from origin/main)
- watch_fifo now degrades to atime poll on unexpected exit (mirrors inotify)
- Tests: test_agent_fifo.py skips on non-Darwin; FIFO assertions in test_agent_plant.py and test_agent_state_report.py are platform-aware
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
probe_fifo_mode short-circuits with `[ "$(platform)" = "darwin" ] || return` on non-macOS. A bare `return` yields the exit status of the preceding test, which is 1 on Linux - and since run() calls probe_fifo_mode bare under `set -e`, the whole agent aborted immediately (exit 1, no output). This broke every agent test on the Linux CI runner while passing on macOS (where the darwin test is true). Return 0 explicitly; same for the mkfifo-missing branch. Verified in a python:3.12-slim (dash) container: full suite 250 passed, 9 skipped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
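The failure mode is easy to reproduce outside the agent. The sketch below uses a simplified stand-in named `probe` (not the real probe_fifo_mode) and runs each variant in a fresh shell under `set -e`, so the caller's own error handling does not mask the effect.

```shell
#!/bin/sh
# Sketch: why a bare `return` after a failed test aborts a `set -e` script.
# probe() is a simplified stand-in, not the agent's probe_fifo_mode.
bare='probe() { [ "$1" = "darwin" ] || return; echo fifo; }'
fixed='probe() { [ "$1" = "darwin" ] || return 0; echo fifo; }'

# Run one definition in a fresh shell under `set -e`, calling probe bare
# (not inside an `if` or `||`), which is how the commit says run() calls it.
run_probe() {
    sh -c "set -e; $1; probe linux; echo survived"
}

bare_status=0
bare_out=$(run_probe "$bare") || bare_status=$?
fixed_status=0
fixed_out=$(run_probe "$fixed") || fixed_status=$?

echo "bare:  status=$bare_status output='$bare_out'"
echo "fixed: status=$fixed_status output='$fixed_out'"
```

With the bare `return`, the child shell exits non-zero before printing anything; with `return 0` it reaches `echo survived`.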
The FIFO read sensor is macOS-only (#100) and its tests are skipif!=Darwin, so an ubuntu-only CI never exercises them. Add macos-latest to the matrix (Linux still covers inotify/atime). This would have caught the Linux probe_fifo_mode regression that shipped because the FIFO/Linux paths had no matrixed coverage. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds the atime regular-file primary detection layer from the validated layered design (#100): a normal regular-file bait whose atime is armed to the past, so a read bumps it; the agent fires AND re-arms, making every subsequent read detectable too (the old watch_atime fired at most once under APFS relatime). Detection-only (no pid), but honors all sensor constraints (regular file, no kdebug, no mount, no privilege), so it covers the FIFO sensor's blind spots (statSync-guarded / mmap / scan-only readers) and the "normal file" requirement the FIFO can't.
- `--sensor auto|fifo|atime` (default auto). `atime` forces a regular-file bait + the atime watcher on any platform (incl. Linux), making the layer selectable and testable.
- Centralize the atime read in read_atime() (GNU %X before BSD %a: the #28 fix, now DRY across the one call site) and arm/re-arm via arm_atime().
- Tests: deterministic re-arm cycle driven by os.utime() (relatime-policy independent), cross-platform. Update the #28 stat-order guard for the refactor to a single helper.
40 agent tests pass; shellcheck -S error clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_018DARDAxeg4NM8FKoyGMQZy
… Linux) arm_atime used `touch -a` which creates the file if missing. watch_atime arms every dep path, so a failed-plant dep (no bait on disk) got an empty file created at its path - verify_planted then saw it as planted and never re-planted it. Broke test_replant_is_bounded on Linux (atime sensor); macOS uses FIFO so it passed there. Add `-c` (--no-create) so arming only touches existing bait. Verified in a Debian/dash container: full agent suite 32 passed, 8 skipped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
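The difference `-c` makes can be seen with two missing paths. This is a small sketch with illustrative file names; the timestamp is an arbitrary year-2000 value, not necessarily the one the agent uses.

```shell
#!/bin/sh
# Sketch: `touch -a` creates a missing file; `touch -a -c` does not.
dir=$(mktemp -d)
past=200001010000   # arbitrary "armed" timestamp (CCYYMMDDhhmm)

touch -a -t "$past" "$dir/without_c"      # missing path: an empty file appears
touch -a -c -t "$past" "$dir/with_c"      # missing path: nothing is created

if [ -e "$dir/without_c" ]; then created_without_c=yes; else created_without_c=no; fi
if [ -e "$dir/with_c" ]; then created_with_c=yes; else created_with_c=no; fi
rm -rf "$dir"

echo "without -c: created=$created_without_c"
echo "with -c:    created=$created_with_c"
```

The empty file created without `-c` is what made a failed plant look planted to verify_planted.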
Resolve agent/thumper_agent.sh: keep the FIFO sensor rewrite, fold in #162's --help/--version + AGENT_VERSION, and drop the dead READ_OPS (fs_usage is gone). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_018DARDAxeg4NM8FKoyGMQZy
Propagate main's #162 (--help/--version) up the stack; agent arg-parsing auto-merged cleanly with #160's --sensor. Verified: 40 agent tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_018DARDAxeg4NM8FKoyGMQZy
AnguIar
left a comment
Finding 1 — Switching from FIFO to atime mode after a hard-kill hangs the agent (line 375)
The agent can run in two modes: FIFO (named pipes) and atime (regular files). If a FIFO-mode agent is force-killed, the named pipes stay on disk. On restart with --sensor atime, the FIFO cleanup step is skipped (FIFO_MODE=0, so the remove_fifos guard fails). When the agent tries to plant baits, it writes content into those leftover named pipe paths with curl. Writing to a named pipe with no reader blocks forever — the agent hangs at startup with no error.
Finding 2 — Re-planted atime bait fires a false-positive alert (line 904)
The atime sensor works by setting each bait's "last accessed" timestamp to the year 2000. A read jumps it to the present; the agent detects the jump and fires. If a bait is deleted mid-run and re-planted, the new file's timestamp is set to now, not year 2000 — nobody arms it. The watcher still holds its old baseline (year 2000), sees the jump from 2000 to 2025, and fires a ghost alert with no actual read.
Finding 3 — --sensor fifo accepted but silently ignored on Linux (line 918)
The --sensor flag accepts auto, fifo, and atime. Passing --sensor atime forces atime mode. But --sensor fifo doesn't force FIFO mode — it falls through to auto-detection. On macOS that happens to pick FIFO anyway, but on Linux auto-detection picks inotify or atime. The agent runs a different sensor than requested with no warning.
- F1: sweep stale FIFOs at startup regardless of sensor mode, and have plant() rm a leftover FIFO before `curl -o` into it. A FIFO→atime restart after a hard-kill no longer hangs writing into a no-reader pipe.
- F2 (+ #164 F1): restart the watcher on ANY re-plant, not just FIFO_MODE=1, so a re-planted atime bait is re-armed (no ghost alert from the stale year-2000 baseline) and a re-planted FIFO is re-served on Linux too.
- F3: `--sensor fifo` now FORCES a pipe wherever mkfifo works (incl. Linux/CI) via a platform-agnostic mkfifo_works() probe, instead of silently falling through to auto-detection; errors out loudly if mkfifo is unavailable.
New regression tests: atime mode doesn't hang on a leftover FIFO; --sensor fifo forces a named pipe. 42 agent tests pass; shellcheck -S error clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_018DARDAxeg4NM8FKoyGMQZy
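The F1 guard amounts to checking the path type before writing. A minimal sketch, assuming a hypothetical helper named `plant_regular`; `printf` stands in for the agent's `curl -o`, and the real plant() may be structured differently.

```shell
#!/bin/sh
# Sketch of the F1 guard: never write bait content into a leftover named pipe.
# plant_regular is a hypothetical helper; printf stands in for `curl -o`.
plant_regular() {
    path=$1
    content=$2
    # A FIFO left behind by a hard-killed FIFO-mode agent has no reader,
    # so writing into it would block forever. Remove it first.
    if [ -p "$path" ]; then
        rm -f "$path"
    fi
    printf '%s\n' "$content" > "$path"
}

dir=$(mktemp -d)
mkfifo "$dir/bait"                          # simulate the orphaned pipe
plant_regular "$dir/bait" "fake-credential"

if [ -f "$dir/bait" ]; then kind=regular; else kind=other; fi
planted=$(cat "$dir/bait")
rm -rf "$dir"

echo "$kind: $planted"
```

Without the `[ -p ... ]` check, the redirection in this sketch would hang at the same point the review describes.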
…server #164 F2: an operator's explicit `--sensor fifo|atime` is an intentional override and must win over the server's per-deployment `sensor` field. effective_sensor precedence is now: explicit --sensor -> per-deployment -> platform default, so `--sensor atime` reliably opts a host out of FIFOs even when the server sends sensor=fifo. (#164 F1 — re-planted FIFO unserved on Linux — is fixed by the mode-independent watcher restart propagated from the #160 fixes.) New test: --sensor atime plants a regular file even when the server asks for fifo. 44 agent tests pass; shellcheck -S error clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_018DARDAxeg4NM8FKoyGMQZy
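The stated precedence can be written as a three-way fallback. This is a sketch under assumptions: the argument names and the use of `auto` / empty string as "not set" markers are illustrative, and the agent's actual effective_sensor may take its inputs differently.

```shell
#!/bin/sh
# Sketch of the precedence: explicit --sensor -> per-deployment -> platform default.
# Argument names are illustrative; the real effective_sensor may differ.
effective_sensor() {
    cli_sensor=$1         # value of --sensor ("auto" when not given)
    dep_sensor=$2         # server's per-deployment field ("" when absent)
    platform_default=$3   # what auto-detection would pick on this host
    if [ "$cli_sensor" != "auto" ]; then
        echo "$cli_sensor"
    elif [ -n "$dep_sensor" ]; then
        echo "$dep_sensor"
    else
        echo "$platform_default"
    fi
}

effective_sensor atime fifo fifo      # prints "atime": operator override wins
effective_sensor auto  fifo inotify   # prints "fifo": per-deployment value
effective_sensor auto  ""   inotify   # prints "inotify": platform default
```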
- F1: verify_planted now RECOVERS a FIFO replaced by a regular file (plant() rm's the impostor + re-creates the pipe, REPLANTED restarts the watcher), instead of reporting "failed" forever and going permanently blind. Replacing the bait was a stronger attack than deleting it (which already self-heals).
- F2: --simulate sweeps its FIFOs before exiting (the cleanup traps are armed later), so it never leaves a no-reader pipe that blocks real open()s forever.
- F3: watch_fifo re-serves on unexpected exit instead of "degrading to atime" - atime can't detect FIFO reads (open() blocks on the writerless pipe, atime never moves), so the old fallback looked healthy while detecting zero.
New tests: tampered FIFO is recovered; --simulate leaves no FIFO. 40 agent tests pass; shellcheck -S error clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_018DARDAxeg4NM8FKoyGMQZy
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_018DARDAxeg4NM8FKoyGMQZy
The test bumped atime for "read #2" in a race with the agent's re-arm. The agent re-arms in two steps (touch atime->past, then re-read the baseline); a bump landing in that ms-wide window is captured as the new baseline and missed, so the second read intermittently went undetected on loaded CI (same commit passed on the PR run, failed on the push run). Production is unaffected - DEBOUNCE_SECS coalesces reads that close together. Wait one poll cycle for the re-arm to settle before the next read. 12/12 stable locally. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_018DARDAxeg4NM8FKoyGMQZy
Merge current main (brings #177's ruff config + auto-merges #178's interval validation cleanly). test_agent_fifo.py is a #123-introduced file in the old compact style, so #177 never touched it; ruff-format it (E701/E702), rename the ambiguous loop var l->ln (E741), split the import line (E401). Tree-wide ruff clean; agent suite green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_018DARDAxeg4NM8FKoyGMQZy
Resolve the arg-parsing conflict: combine #178's is_uint interval validation with --sensor. Resolve test_agent_fifo.py to #160's read_atime() stat-order assertion (its agent has the helper). ruff-format test_agent_atime.py and drop an unused subprocess.run result (F841). Tree-wide ruff clean; 48 agent tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_018DARDAxeg4NM8FKoyGMQZy
Clean merge of the reconciled #160 (main/#178/#177 already resolved below). ruff-format + fix test_agent_mixed.py. Tree-wide ruff clean; 50 agent tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_018DARDAxeg4NM8FKoyGMQZy
@AnguIar all of your findings are addressed, and the branch is now reconciled with current |
F3: watch_fifo is now a poll-supervisor that checks each writer's liveness (kill -0) and restarts a SINGLE dead writer whose FIFO still exists. The bare `wait` only recovered when ALL writers died - a lone writer death left that bait writerless and silently blind (verify only checks FIFO existence, not writer liveness) until the next full re-plant restart. F2: --simulate now also removes the cached fake-credential content it planted, not just the FIFOs (test-only mode shouldn't leave either behind). New tests: --simulate leaves no cache; watch_fifo uses per-writer restart (not bare-wait). 44 agent tests pass; shellcheck + ruff clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_018DARDAxeg4NM8FKoyGMQZy
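One supervisor step of the per-writer approach might look like the following. It is a sketch only: `sleep` stands in for a serve_fifo writer, the function names are hypothetical, and the "FIFO still exists" check mentioned above is omitted for brevity.

```shell
#!/bin/sh
# Sketch of a per-writer supervisor step: probe liveness with `kill -0`
# and restart only the dead writer. `sleep` stands in for a serve_fifo loop.
start_writer() {
    sleep 30 >/dev/null 2>&1 &
    writer_pid=$!
}

restarts=0
supervise_once() {
    # kill -0 sends no signal; it only reports whether the pid is alive.
    if ! kill -0 "$writer_pid" 2>/dev/null; then
        start_writer
        restarts=$((restarts + 1))
    fi
}

start_writer
first_pid=$writer_pid
supervise_once                            # writer alive: no restart

kill "$first_pid"
wait "$first_pid" 2>/dev/null || true     # reap it so kill -0 sees it gone
supervise_once                            # writer dead: exactly one restart

second_pid=$writer_pid
kill "$second_pid"                        # clean up the replacement writer
wait "$second_pid" 2>/dev/null || true

echo "restarts=$restarts"
```

A bare `wait` over all writers would not return after the single death simulated here, which is the gap the commit describes.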
Branch protection requires a status check named `test`, but the matrix job emitted per-leg names (`test (ubuntu-latest)` / `test (macos-latest)`), so the required check never appeared and approved PRs on this branch were BLOCKED forever. Rename the matrix job to `test-matrix` and add a lightweight aggregate `test` job that fails unless every leg passes. The required check name is now stable and independent of the matrix, and main-based PRs (single `test` job) keep satisfying the same required check. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_018DARDAxeg4NM8FKoyGMQZy
# Conflicts: # .github/workflows/ci.yml
plant() created the FIFO with mkfifo and only recorded it in the manifest on the next line. A clean-exit signal (INT/TERM) delivered in that gap ran the teardown trap's remove_fifos against a manifest that did not yet list the path, leaving the FIFO behind - a slow-runner race that surfaced as a flaky macOS CI failure in test_clean_exit_removes_fifos (passes locally, fails under CI load). Record before mkfifo (record_planted is idempotent) and forget on mkfifo failure so the manifest never lists a phantom path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_018DARDAxeg4NM8FKoyGMQZy
# Conflicts: # agent/thumper_agent.sh
# Conflicts: # agent/thumper_agent.sh # tests/test_agent_fifo.py
…ant 1/3) (#164) * feat(agent): per-deployment sensor — run FIFO + atime baits together (#100) Increment 1 of dual-plant: each deployment record carries an optional 6th `sensor` field (fifo|atime|inotify). The agent now plants and watches EACH bait under its own sensor, so a FIFO bait (canonical, definitive pid) and an atime bait (companion, normal-file detection) run side by side from one agent — the foundation for always deploying the pair. Backward-compatible: an absent field falls back to today's global behavior (`effective_sensor` → per-deployment value, else the auto-probe/--sensor default), so this merges safely before the server sends pairs. The existing homogeneous FIFO/atime/inotify paths are byte-for-byte unchanged; the new `watch_mixed` dispatcher only engages when the server sends explicit sensors. - Parse the `sensor` field -> `dep_sensor_$i`; `effective_sensor`, `has_explicit_sensors` helpers. - `plant()` keys FIFO-vs-regular on the deployment's sensor, not global FIFO_MODE. - Refactor `watch_atime` -> `atime_poll "<indices>"` so a subset can be polled; `watch_atime()` keeps polling all (homogeneous + degradation fallback). - `watch_mixed`: FIFO baits served individually, the rest atime-polled as a group. - verify's FIFO-tamper check keys on the deployment's sensor too. New test: a fifo+atime pair from one agent — both plant as the right type and both fire. Full suite: 261 passed, 1 skipped; shellcheck -S error clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_018DARDAxeg4NM8FKoyGMQZy * agent: arm_atime must not create the bait file (fixes re-plant cap on Linux) Same fix as the atime-rearm branch: `touch -a` creates the file if missing, so arming a failed-plant dep's path left an empty file behind and verify_planted stopped re-planting it. Broke test_replant_is_bounded on Linux. Add `-c`. Verified in a Debian/dash container: agent suite 32 passed, 9 skipped. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(agent): address Roee's #164 review — explicit --sensor wins over server #164 F2: an operator's explicit `--sensor fifo|atime` is an intentional override and must win over the server's per-deployment `sensor` field. effective_sensor precedence is now: explicit --sensor -> per-deployment -> platform default, so `--sensor atime` reliably opts a host out of FIFOs even when the server sends sensor=fifo. (#164 F1 — re-planted FIFO unserved on Linux — is fixed by the mode-independent watcher restart propagated from the #160 fixes.) New test: --sensor atime plants a regular file even when the server asks for fifo. 44 agent tests pass; shellcheck -S error clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_018DARDAxeg4NM8FKoyGMQZy * chore: reconcile #164 with main + ruff-clean test_agent_mixed.py (#177) Clean merge of the reconciled #160 (main/#178/#177 already resolved below). ruff-format + fix test_agent_mixed.py. Tree-wide ruff clean; 50 agent tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_018DARDAxeg4NM8FKoyGMQZy --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds the atime regular-file primary detection layer from the validated layered design (#100), stacked on the FIFO companion (#123).
Why
The FIFO sensor (#123) gives a deterministic pid for raw readers, but it is a named pipe: a worm that `statSync().isFile()`-guards, `mmap`s, or only scan-discovers a path walks past it, and it doesn't satisfy the "normal file" requirement. The atime layer covers exactly those gaps: a normal regular-file bait, with detection under every constraint (no kdebug, no mount, no privilege).
What
- A normal regular-file bait whose atime is armed to the past (`touch -a -t`), so a read bumps it; the agent fires and re-arms, so every subsequent read is detectable. The old `watch_atime` fired at most once under APFS relatime, which was the core bug.
- `--sensor auto|fifo|atime` (default `auto`). `atime` forces a regular-file bait + the atime watcher on any platform (incl. Linux), making the layer selectable and testable.
- The `read_atime()` helper centralizes the #28 (Agent atime fallback sensor is broken) stat-order fix (GNU `%X` before BSD `%a`), DRY across the one call site; `arm_atime()` arms/re-arms.
Validated
On macOS 26.5.1: armed→read→bump→fire→re-arm→read→fire confirmed; relatime bumps atime on read.
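The armed→read→fire→re-arm cycle can be sketched in a few lines. The function bodies below are assumptions modelled on the helper names in this PR, not the agent's code; `touch -a` simulates a read so the run does not depend on the filesystem's relatime or noatime policy, and the year-2000 timestamp is illustrative.

```shell
#!/bin/sh
# Sketch of the arm -> read -> fire -> re-arm cycle on a regular file.
# Function bodies are assumptions modelled on the names in this PR.
read_atime() {
    # GNU stat first: on Linux `stat -f %a` reports filesystem free blocks.
    stat -c %X "$1" 2>/dev/null || stat -f %a "$1"
}
arm_atime() {
    # -c: never create a missing bait; -t: push atime into the past.
    touch -a -c -t 200001010000 "$1"
}

bait=$(mktemp)
fired=0

arm_atime "$bait"
baseline=$(read_atime "$bait")

poll_once() {
    now=$(read_atime "$bait")
    if [ "$now" -gt "$baseline" ]; then
        fired=$((fired + 1))
        arm_atime "$bait"                   # re-arm so the next read also shows
        baseline=$(read_atime "$bait")
    fi
}

touch -a "$bait"; poll_once     # simulated read #1
poll_once                       # no read in between: must not fire
touch -a "$bait"; poll_once     # simulated read #2, visible only after re-arm

rm -f "$bait"
echo "fired=$fired"
```

Dropping the re-arm inside `poll_once` leaves atime at the present, so the second simulated read would no longer move it past the baseline on a relatime filesystem.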
Tests
- `tests/test_agent_atime.py` (2): plant-as-regular-file + arm, and the re-arm cycle, driven by `os.utime()` so it's deterministic regardless of the filesystem's relatime policy, cross-platform.
- `shellcheck -S error` clean.
Closes part of #28 (re-armable + primary-layer role). Implements the atime layer of #100 / #124.
🤖 Generated with Claude Code
Relates to #28 (broken atime sensor) and #100 (sensor refactor).