[iris] K8s LogCollector silently skips log collection for nested zephyr pipeline pods #4414

@ravwojdyla-agent

Describe the bug

The K8s LogCollector does not collect logs for cache-copy (and likely cache-probe) task pods. The ResourceCollector tracks them correctly (kubectl top calls appear in controller logs), but no kubectl logs calls are made for the same pods. The get-task-logs RPC returns zero entries.

To Reproduce

  1. Run a tokenize job with cache-copy on the K8s provider (e.g. nemotron_data.py on CoreWeave).
  2. Wait for cache-copy coordinator and worker pods to reach Running state.
  3. Query logs: iris rpc controller get-task-logs --id <cache-copy-coord-job-id>
  4. Observe: empty response with cursor: 0, no log entries.
  5. Verify the pod IS producing logs: kubectl logs <pod> -c task --tail=5 shows output.
  6. Verify ResourceCollector IS tracking the pod: controller process logs show kubectl top <pod> calls.
  7. Verify LogCollector is NOT tracking the pod: no kubectl logs <pod> calls in controller process logs.
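
Steps 6 and 7 can be checked mechanically rather than by eye. The sketch below is a hypothetical helper, not part of iris: it scans controller log lines and reports pods that appear in a `kubectl top` call but never in a `kubectl logs` call. It assumes only that each relevant log line contains the words `kubectl`, the verb, and the pod name; the real controller log format may differ, and the pod names in the usage example are made up.

```python
import re

# Pod names are DNS-style, so split lines into runs of these characters.
TOKEN_RE = re.compile(r"[A-Za-z0-9._-]+")


def collector_coverage(log_lines, pod_names):
    """Map each pod name to whether a `kubectl top` / `kubectl logs` call mentions it."""
    coverage = {name: {"top": False, "logs": False} for name in pod_names}
    for line in log_lines:
        # Strip trailing dots so a pod name at the end of a sentence still matches.
        tokens = {t.strip(".") for t in TOKEN_RE.findall(line)}
        if "kubectl" not in tokens:
            continue
        for verb in ("top", "logs"):
            if verb not in tokens:
                continue
            for name in pod_names:
                if name in tokens:
                    coverage[name][verb] = True
    return coverage


def missing_log_collection(log_lines, pod_names):
    """Pods seen by the resource path (`top`) but never by the log path (`logs`)."""
    coverage = collector_coverage(log_lines, pod_names)
    return sorted(
        name for name, seen in coverage.items() if seen["top"] and not seen["logs"]
    )


if __name__ == "__main__":
    # Hypothetical log lines and pod names, for illustration only.
    sample = [
        "INFO exec: kubectl top pod cache-copy-coord-0",
        "INFO exec: kubectl top pod worker-1",
        "INFO exec: kubectl logs worker-1 -c task --tail=5",
    ]
    print(missing_log_collection(sample, ["cache-copy-coord-0", "worker-1"]))
```

Feeding it the real controller log and the pod names from the managed pod list should return exactly the cache-copy and cache-probe pods if the report above is accurate.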

Expected behavior

LogCollector should track all task pods that _track_pod is called for, including those from nested zephyr pipelines (cache-probe, cache-copy).

Additional context

The _track_pod method (tasks.py:784-790) calls both log_collector.track() and resource_collector.track(). The resource collector works, but the log collector silently drops these pods. The log_store is wired correctly at controller.py:970.

These pods have correct labels (iris.managed=true, iris.runtime=iris-kubernetes) and appear in the _poll_pods managed pod list. 125 other pods ARE being log-fetched, but zero cache-copy/cache-probe pods are.

Suspected cause (unconfirmed), one of:

  1. A race condition or ordering issue where nested child jobs' tasks are polled before the LogCollector is ready.
  2. The LogCollector's _pods dict silently rejects duplicate keys when a pod is re-tracked on subsequent sync cycles.
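
To make the suspected failure mode concrete, here is a toy model of how the two hypotheses could combine. It is not the iris implementation: `ToyLogCollector`, `UpdatingLogCollector`, the `ready` flag, and `pods_to_fetch` are all invented for illustration, and the only detail taken from this report is that the collector keeps a `_pods` dict. In the toy, a pod first tracked before it is ready is never upgraded, because later `track()` calls return early on the existing key.

```python
import logging

logger = logging.getLogger("toy_log_collector")


class ToyLogCollector:
    """Hypothetical collector whose track() ignores keys it already holds."""

    def __init__(self):
        self._pods = {}  # pod name -> state dict

    def track(self, pod_name, ready):
        if pod_name in self._pods:
            # Silent early return: the stale "not ready" entry is kept forever.
            return
        self._pods[pod_name] = {"ready": ready}

    def pods_to_fetch(self):
        return sorted(name for name, state in self._pods.items() if state["ready"])


class UpdatingLogCollector(ToyLogCollector):
    """Variant that refreshes state on re-track and logs the transition."""

    def track(self, pod_name, ready):
        previous = self._pods.get(pod_name)
        if previous is not None and previous["ready"] != ready:
            logger.info(
                "re-tracking %s: ready %s -> %s", pod_name, previous["ready"], ready
            )
        self._pods[pod_name] = {"ready": ready}


if __name__ == "__main__":
    for collector in (ToyLogCollector(), UpdatingLogCollector()):
        collector.track("cache-copy-0", ready=False)  # first sync: pod not ready yet
        collector.track("cache-copy-0", ready=True)   # later sync: pod is Running
        print(type(collector).__name__, collector.pods_to_fetch())
```

If the real `track()` has a similar early return, logging at that branch would show whether cache-copy and cache-probe pods hit it.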

Labels: agent-generated, bug