
[iris] Prune dead resource_usage paths; query iris.task in JobDetail#5453

Merged
rjpower merged 1 commit into main from prune-resource-usage-paths on May 5, 2026

Conversation

@rjpower rjpower (Collaborator) commented May 5, 2026

Drop the worker-pushed and controller-stored ResourceUsage plumbing that the finelog stats migration left behind: WorkerTaskStatus.resource_usage, TaskUpdate.resource_usage, JobStatus.resource_usage, and the GetJobStatusResponse min/max aggregates are reserved. Rewire the k8s ResourceCollector to write IrisTaskStat rows directly to iris.task so CoreWeave pods stay observable, and point the JobDetail dashboard at the same namespace so per-task MEM/CPU columns populate again.

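A minimal sketch of the collector-to-table path described above, under stated assumptions: IrisTaskStat, Table, and record_sample are simplified, hypothetical stand-ins for the iris internals named in the review notes below (ResourceCollector._fetch_one, build_task_stat, Table.write), not the real APIs.

from dataclasses import dataclass
from datetime import datetime, timezone
from threading import Condition
from typing import Optional

@dataclass
class IrisTaskStat:
    # Simplified stand-in for the real row type; column names follow the
    # dashboard query quoted later in this thread.
    task_id: str
    attempt_id: int
    ts: datetime
    cpu_millicores: int
    memory_mb: int

class Table:
    # Stand-in for the finelog table. The real Table.write is described as
    # thread-safe via a condition variable, mirrored here.
    def __init__(self) -> None:
        self._cond = Condition()
        self.rows: list[IrisTaskStat] = []

    def write(self, row: IrisTaskStat) -> None:
        with self._cond:
            self.rows.append(row)

class ResourceCollector:
    def __init__(self, task_stats_table: Optional[Table]) -> None:
        # controller.py wires the table in after LogClient.connect; without
        # finelog it stays None and collection is skipped entirely.
        self._table = task_stats_table

    def record_sample(self, task_id: str, attempt_id: int,
                      cpu_millicores: int, memory_mb: int) -> None:
        # The real collector samples the k8s API per pod; here the numbers
        # are passed in. Timestamp construction mirrors _emit_task_stat:
        # naive UTC from an epoch-seconds clock.
        if self._table is None:
            return
        ts = datetime.now(timezone.utc).replace(tzinfo=None)
        self._table.write(IrisTaskStat(task_id, attempt_id, ts,
                                       cpu_millicores, memory_mb))
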
@rjpower rjpower added the agent-generated (Created by automation/agent) label May 5, 2026
@claude claude bot (Contributor) commented May 5, 2026

Claude finished @rjpower's task in 5m 19s.


Code review

No issues found. Checked for bugs and AGENTS.md compliance.

Scope verified

  • WorkerTaskStatus.resource_usage (proto field 8), JobStatus.resource_usage (8), TaskStatus.resource_history (20), and GetJobStatusResponse.resource_min/resource_max (3, 4) are all properly reserved. No remaining writers anywhere in lib/iris/src or lib/iris/dashboard/src.
  • TaskStatus.resource_usage (field 12) is intentionally retained — TaskAttempt.to_proto() still populates it for Worker.GetTaskStatus/ListTasks, and WorkerTaskDetail.vue still consumes it. The PR description and the new comment in job.proto correctly call this out.
  • K8sTaskProvider.task_stats_table is wired in controller.py after LogClient.connect; _ensure_resource_collector() returns None when the table is unset, so test paths without finelog stay green (verified in updated conftest.py).
  • ResourceCollector._fetch_one mirrors worker/task_attempt._emit_task_stat — same Timestamp.now().epoch_seconds() → naive-UTC datetime construction, same build_task_stat helper, same Table.write (thread-safe via _cond).
  • Dashboard JobDetail.vue matches existing TaskDetail.vue / WorkerDetail.vue patterns: same useLogServerStatsRpc('Query', …) shape, same decodeArrowIpc consumption, manual '…' SQL escape (no param binding on QueryRequest), and QUALIFY row_number() OVER (PARTITION BY task_id ORDER BY ts DESC) = 1 correctly returns the latest sample per task.
  • apply_heartbeats_batch no longer needs a steady-state pass; the transition_updates filter retains only state changes and updates carrying error/exit_code (a minimal sketch of the filter follows this list). update.container_id is still consumed in _apply_task_transitions, but in practice container_id is set on the same heartbeat as the BUILDING/RUNNING transition, so dropping pure-container_id steady-state updates is harmless.
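
As referenced in the last bullet, a hedged sketch of the steady-state filter, assuming simplified heartbeat types: the field names state, error, exit_code, and container_id follow the review notes; everything else is illustrative.

from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskUpdate:
    # Simplified stand-in for the real heartbeat update message.
    task_id: str
    state: str
    error: Optional[str] = None
    exit_code: Optional[int] = None
    container_id: Optional[str] = None

def is_transition(update: TaskUpdate, prev_state: str) -> bool:
    # Keep state changes and updates carrying error/exit_code; a heartbeat
    # whose only payload is container_id is dropped, since container_id
    # arrives on the same heartbeat as the BUILDING/RUNNING transition.
    return (update.state != prev_state
            or update.error is not None
            or update.exit_code is not None)

def filter_transitions(updates: list[TaskUpdate],
                       prev_states: dict[str, str]) -> list[TaskUpdate]:
    return [u for u in updates
            if is_transition(u, prev_states.get(u.task_id, ""))]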

@rjpower rjpower requested a review from yonromai May 5, 2026 23:04
@rjpower rjpower enabled auto-merge (squash) May 5, 2026 23:04
@yonromai yonromai (Contributor) left a comment


🤖 Approved. The cleanup path is coherent and CI is green. I left one P2 inline comment about matching stats rows to the current attempt before using them in JobDetail; I do not think it needs to block this PR.

Validation:

  • gh pr checks 5453 --repo marin-community/marin --watch=false: green/pass, with expected path-filter skips.
  • cd lib/iris && uv run --group dev python -m pytest -n1 --tb=short tests/cluster/providers/k8s/test_provider.py: 53 passed in 42.92s.
  • cd lib/iris/dashboard && npm ci && npm run build:check: failed on pre-existing TypeScript errors outside this PR's changed files; no error was reported in JobDetail.vue.

Generated with Codex.

Inline comment on the JobDetail.vue query:

SELECT task_id, attempt_id, cpu_millicores, memory_mb, memory_peak_mb
FROM "iris.task"
WHERE task_id IN (${list})
QUALIFY row_number() OVER (PARTITION BY task_id ORDER BY ts DESC) = 1

🤖 P2: This drops the attempt dimension when choosing the latest resource row. iris.task rows are per attempt, but the query keeps only the latest row per task_id and taskUsageMap then applies it to whatever the controller says is the current task state. On a retry/preemption, a previous attempt can remain the latest row until the new attempt emits a sample, so a running retry can briefly or permanently show the failed attempt’s MEM/CPU values. Please keep attempt identity in the selection, for example by returning the latest row per (task_id, attempt_id) and only mapping rows whose attempt_id matches TaskStatus.currentAttemptId.

Generated with Codex.
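
A sketch of the attempt-aware selection suggested in the comment above. The SQL mirrors the quoted query with the attempt dimension added to the window (the ${list} placeholder is kept from the quoted query); the Python mapping is illustrative only, since the real consumer is JobDetail.vue in TypeScript, and current_attempts stands in for the TaskStatus.currentAttemptId lookup.

ATTEMPT_AWARE_SQL = """
SELECT task_id, attempt_id, cpu_millicores, memory_mb, memory_peak_mb
FROM "iris.task"
WHERE task_id IN (${list})
QUALIFY row_number() OVER (PARTITION BY task_id, attempt_id ORDER BY ts DESC) = 1
"""

def build_usage_map(rows: list[dict],
                    current_attempts: dict[str, int]) -> dict[str, dict]:
    # Map a stats row onto a task only when it belongs to the attempt the
    # controller currently reports, so a retrying task never shows a
    # previous attempt's MEM/CPU sample.
    return {
        row["task_id"]: row
        for row in rows
        if current_attempts.get(row["task_id"]) == row["attempt_id"]
    }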

@rjpower rjpower merged commit 5966ee1 into main May 5, 2026
39 of 40 checks passed
@rjpower rjpower deleted the prune-resource-usage-paths branch May 5, 2026 23:21
