Describe the bug
The K8s `LogCollector` does not collect logs for cache-copy (and likely cache-probe) task pods. The `ResourceCollector` tracks them correctly (`kubectl top` calls appear in controller logs), but no `kubectl logs` calls are made for the same pods. The `get-task-logs` RPC returns zero entries.
To Reproduce
- Run a tokenize job with cache-copy on the K8s provider (e.g. `nemotron_data.py` on CoreWeave).
- Wait for the cache-copy coordinator and worker pods to reach `Running` state.
- Query logs: `iris rpc controller get-task-logs --id <cache-copy-coord-job-id>`
- Observe: empty response with `cursor: 0` and no log entries.
- Verify the pod IS producing logs: `kubectl logs <pod> -c task --tail=5` shows output.
- Verify the `ResourceCollector` IS tracking the pod: controller process logs show `kubectl top <pod>` calls.
- Verify the `LogCollector` is NOT tracking the pod: no `kubectl logs <pod>` calls in the controller process logs.
Expected behavior
`LogCollector` should track all task pods that `_track_pod` is called for, including those from nested zephyr pipelines (cache-probe, cache-copy).
Additional context
The `_track_pod` method (`tasks.py:784-790`) calls both `log_collector.track()` and `resource_collector.track()`. The resource collector works, but the log collector silently drops these pods. The `log_store` is wired correctly at `controller.py:970`.
These pods have the correct labels (`iris.managed=true`, `iris.runtime=iris-kubernetes`) and appear in the `_poll_pods` managed pod list. 125 other pods ARE being log-fetched, but zero cache-copy/cache-probe pods are.
Suspected cause: either a race/ordering issue where nested child jobs' tasks are polled before the `LogCollector` is ready, or the `LogCollector`'s `_pods` dict silently ignoring re-track calls for pods it has already seen on subsequent sync cycles.
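Neither hypothesis is confirmed, but the combination is easy to illustrate. The sketch below is hypothetical (the names `_seen`, `_streams`, `start`, and `track` are illustrative, not from the iris codebase): if dedup happens before the "collector started" check, a pod tracked too early is dropped once, and every re-track on later sync cycles is a no-op, which would match the observed symptoms exactly.

```python
# Hypothetical sketch of how an early track() plus dedup-before-start
# could permanently drop a pod. Not the actual iris LogCollector.

class LogCollector:
    def __init__(self) -> None:
        self._seen: set[str] = set()          # pods ever handed to track()
        self._streams: dict[str, str] = {}    # pod -> task_id with active log fetch
        self._started = False

    def start(self) -> None:
        self._started = True

    def track(self, pod: str, task_id: str) -> None:
        if pod in self._seen:
            return  # dedup happens BEFORE the started check ...
        self._seen.add(pod)
        if not self._started:
            return  # ... so a pod tracked before start() is lost for good
        self._streams[pod] = task_id


collector = LogCollector()
# Nested child job's pod is polled before the collector starts:
collector.track("cache-copy-coord", "job-1")
collector.start()
# Re-track on the next sync cycle is silently ignored by the dedup set:
collector.track("cache-copy-coord", "job-1")
assert "cache-copy-coord" not in collector._streams  # logs never fetched
```

If this is the shape of the bug, making `track()` idempotent (re-registering the stream whenever the collector is started, regardless of `_seen`) would let the next sync cycle recover the dropped pods.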