You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
feat(worker): fetch inventory policy from control plane at agentd startup (#30)
* feat(core): add InventoryPolicySnapshot for worker policy fetch
* feat(config): inventory_policy_snapshot accessor + with_inventory_policy_snapshot apply
* feat(worker-api): LoadInventoryPolicy request/response variant + client method
* feat(api): serve LoadInventoryPolicy from worker_control endpoint
* feat(worker): fetch inventory policy from control plane at agentd startup
Worker now fetches the live allowed_host_suffixes / allowed_hosts /
allowed_cidrs / allowed_ports from anyscan-api at startup before
claiming any port-scans, and refreshes on a configurable cadence
(default 300s, env AGENT_INVENTORY_REFRESH_SECONDS).
This unblocks the prod scan #16 failure mode where workers fell back to
InventoryConfig::default() (allowed_host_suffixes=["localhost"]),
causing the streaming follow-on flusher to drop every internet IP via
config.normalize_target_definition's host_is_allowed gate even after
the API allowlist had been widened.
Local /etc/agentd/runtime.env values remain a fallback when the
control plane is unreachable; fetch failures log warn! and keep the
prior policy in memory. Workers do NOT crash on fetch failure.
* test(worker): make inventory refresh interval test pure to avoid env-var races
The previous test used unsafe std::env::set_var/remove_var which
flaked under cargo test's default multithreaded executor (set_var
races with other threads reading env). Extract a pure
parse_inventory_refresh_interval(Option<&str>) that takes the raw env
value as a parameter; the env-reading wrapper is one line and
trivially correct. The test now drives the pure function with no
shared mutable state.
* fix(worker): always run inventory refresh + register before run_once fetch
Two review issues from codex on PR #30:
1. run_daemon: the periodic refresh check was placed near the bottom of
the loop, after every claim arm that calls 'continue'. On a busy
worker repeatedly claiming bootstrap_jobs / port_scans / runs the
refresh block was never reached, so control-plane allowlist changes
never propagated until the worker idled — defeating the whole point
of the periodic refresh for the heavy-traffic case. Move the refresh
to immediately after seed_bootstrap_inventory /
queue_due_schedules_with_events (which always run), before any
continue-able claim attempt. Bonus: claims spawned in the same
iteration now operate against the freshest fetched policy.
2. run_once: the initial fetch ran before register_worker_or_bail. But
non-register /api/worker/control requests in worker_control require
an already-registered worker token; a fresh agent's first
LoadInventoryPolicy returned 401, fell through to local fallback,
and run_once never refreshed again. Register first, then fetch —
matches the run_daemon ordering already in place.
* fix(worker): hoist inventory refresh above all continue-able loop branches
Codex P2 follow-up: the previous fix (dde4324) placed the refresh after
seed_bootstrap_inventory / queue_due_schedules_with_events, which fixed
the bootstrap-job / port-scan / run claim shadowing — but two earlier
branches still continue above it: claim_next_pending_remote_command and
the remote_update scheduling block. A worker continuously receiving
remote debug commands or remote updates would loop forever without
re-evaluating the refresh predicate.
Hoist the refresh to the very top of each loop iteration, immediately
after the try_join_next completion-collection (which doesn't continue).
Now every iteration runs the refresh predicate exactly once regardless
of which downstream fast-path fires.
---------
Co-authored-by: skullcmd <skullcmd@anyvm.tech>
0 commit comments