Move counters from Iris to Zephyr per Russell's feedback
Counters are now a Zephyr-only concept. Instead of file-based I/O through
the Iris heartbeat/DB path, counters accumulate in-memory on each Zephyr
worker and flow to the coordinator via the existing heartbeat RPC.
- Remove all counter code from Iris: proto fields, DB column, migration,
transitions, service aggregation, worker monitor, task_attempt
- Add zephyr/counters.py with increment() / get_counters() API backed by
WorkerContext (pure in-memory, zero I/O per increment)
- Extend WorkerContext protocol with increment_counter/get_counter_snapshot
- Wire counters through ZephyrWorker heartbeat → ZephyrCoordinator state
- Add counters to JobStatus dataclass and get_status() aggregation
- Log counters in coordinator periodic status lines for agent visibility
- Only send counters when values change (avoid steady-state DB/RPC churn)
- Reset counters per-task in _execute_shard to prevent cross-task leakage
- Update babysit-zephyr and babysit-job skills with counter monitoring docs
Co-authored-by: Rafal Wojdyla <ravwojdyla@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Counters are sent to the coordinator via the worker heartbeat (every 5s) and only transmitted when values change — no overhead for idle workers.
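The accumulate-in-memory, send-on-change behavior described above can be sketched roughly as follows. This is a hypothetical illustration, not the real `zephyr/counters.py`: the class and method names beyond `increment()` / `get_counters()` are assumptions.

```python
from collections import defaultdict
from typing import Dict


class WorkerCounters:
    """Sketch of per-worker in-memory counters: zero I/O per increment."""

    def __init__(self) -> None:
        self._counts: Dict[str, int] = defaultdict(int)
        self._last_sent: Dict[str, int] = {}  # values from the last heartbeat

    def increment(self, name: str, amount: int = 1) -> None:
        self._counts[name] += amount

    def get_counters(self) -> Dict[str, int]:
        return dict(self._counts)

    def reset(self) -> None:
        # Hypothetical per-task reset, preventing cross-task leakage.
        self._counts.clear()

    def delta_for_heartbeat(self) -> Dict[str, int]:
        """Return only counters whose values changed since the last
        heartbeat, so idle workers cause no steady-state RPC/DB churn."""
        changed = {k: v for k, v in self._counts.items()
                   if self._last_sent.get(k) != v}
        self._last_sent.update(changed)
        return changed
```

On each 5s heartbeat the worker would attach `delta_for_heartbeat()` to the existing RPC payload; an empty dict means nothing extra is transmitted.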
Fetch via the Iris CLI:

```bash
uv run iris --config lib/iris/examples/marin.yaml rpc controller get-task-logs \
```
@@ -115,10 +129,15 @@ After submitting, monitor in escalating stages:
3. Get the run command (or reuse the previous one).
4. Submit and resume monitoring.
## Monitoring Counters
When babysitting a Zephyr job, check coordinator logs for counter lines. Counters give you insight into pipeline throughput (e.g. `documents_processed`, `bytes_written`, `validation_errors`). If counters stop advancing while shards are still in-flight, this may indicate a straggler or stuck worker — escalate to debug-zephyr-job.
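A simple way to apply the "counters stop advancing" check is to compare two status polls taken some interval apart. The helper below is a hypothetical sketch; the counter dictionaries and the in-flight count are assumed to come from `get_status()`, whose exact shape is not shown here.

```python
from typing import Dict


def counters_stalled(prev: Dict[str, int], curr: Dict[str, int],
                     tasks_in_flight: int) -> bool:
    """True when no counter advanced between two polls even though tasks
    are still in-flight -- a possible straggler or stuck worker."""
    if tasks_in_flight == 0:
        # Nothing running, so flat counters are expected, not a stall.
        return False
    names = set(prev) | set(curr)
    return all(curr.get(k, 0) <= prev.get(k, 0) for k in names)
```

If this returns True across a couple of consecutive polls, escalate to debug-zephyr-job as described below.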
## When to Escalate
Escalate to **debug-zephyr-job** when:
- A stage is stuck (no shard progress for an extended period)
- Stragglers are holding up a stage (few in-flight, 0 queued, most workers idle)
- Workers are failing repeatedly with the same error
- Counters stop advancing while tasks remain in-flight
- For controller issues (e.g., RPCs timing out), use the **debug-iris-controller** skill