
[iris] Prune dead resource_usage paths; query iris.task in JobDetail#5453

Merged
rjpower merged 1 commit into main from prune-resource-usage-paths on May 5, 2026

Conversation

@rjpower rjpower (Collaborator) commented May 5, 2026

Drop the worker-pushed and controller-stored ResourceUsage plumbing that the finelog stats migration left behind: WorkerTaskStatus.resource_usage, TaskUpdate.resource_usage, JobStatus.resource_usage, and the GetJobStatusResponse min/max aggregates are reserved. Rewire the k8s ResourceCollector to write IrisTaskStat rows directly to iris.task so CoreWeave pods stay observable, and point the JobDetail dashboard at the same namespace so per-task MEM/CPU columns populate again.

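A minimal sketch of the collector-to-table path described above, under stated assumptions: IrisTaskStat, Table, and record_sample are simplified, hypothetical stand-ins for the iris internals named in the review notes below (ResourceCollector._fetch_one, build_task_stat, Table.write), not the real APIs.

from dataclasses import dataclass
from datetime import datetime, timezone
from threading import Condition
from typing import Optional

@dataclass
class IrisTaskStat:
    # Simplified stand-in for the real row type; column names follow the
    # dashboard query quoted later in this thread.
    task_id: str
    attempt_id: int
    ts: datetime
    cpu_millicores: int
    memory_mb: int

class Table:
    # Stand-in for the finelog table. The real Table.write is described as
    # thread-safe via a condition variable, mirrored here.
    def __init__(self) -> None:
        self._cond = Condition()
        self.rows: list[IrisTaskStat] = []

    def write(self, row: IrisTaskStat) -> None:
        with self._cond:
            self.rows.append(row)

class ResourceCollector:
    def __init__(self, task_stats_table: Optional[Table]) -> None:
        # controller.py wires the table in after LogClient.connect; without
        # finelog it stays None and collection is skipped entirely.
        self._table = task_stats_table

    def record_sample(self, task_id: str, attempt_id: int,
                      cpu_millicores: int, memory_mb: int) -> None:
        # The real collector samples the k8s API per pod; here the numbers
        # are passed in. Timestamp construction mirrors _emit_task_stat:
        # naive UTC from an epoch-seconds clock.
        if self._table is None:
            return
        ts = datetime.now(timezone.utc).replace(tzinfo=None)
        self._table.write(IrisTaskStat(task_id, attempt_id, ts,
                                       cpu_millicores, memory_mb))
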
@rjpower rjpower added the agent-generated (Created by automation/agent) label May 5, 2026
@claude claude bot (Contributor) commented May 5, 2026

Claude finished @rjpower's task in 5m 19s.


Code review

No issues found. Checked for bugs and AGENTS.md compliance.

Scope verified

  • WorkerTaskStatus.resource_usage (proto field 8), JobStatus.resource_usage (8), TaskStatus.resource_history (20), and GetJobStatusResponse.resource_min/resource_max (3, 4) are all properly reserved. No remaining writers anywhere in lib/iris/src or lib/iris/dashboard/src.
  • TaskStatus.resource_usage (field 12) is intentionally retained — TaskAttempt.to_proto() still populates it for Worker.GetTaskStatus/ListTasks, and WorkerTaskDetail.vue still consumes it. The PR description and the new comment in job.proto correctly call this out.
  • K8sTaskProvider.task_stats_table is wired in controller.py after LogClient.connect; _ensure_resource_collector() returns None when the table is unset, so test paths without finelog stay green (verified in updated conftest.py).
  • ResourceCollector._fetch_one mirrors worker/task_attempt._emit_task_stat — same Timestamp.now().epoch_seconds() → naive-UTC datetime construction, same build_task_stat helper, same Table.write (thread-safe via _cond).
  • Dashboard JobDetail.vue matches existing TaskDetail.vue / WorkerDetail.vue patterns: same useLogServerStatsRpc('Query', …) shape, same decodeArrowIpc consumption, manual '…' SQL escape (no param binding on QueryRequest), and QUALIFY row_number() OVER (PARTITION BY task_id ORDER BY ts DESC) = 1 correctly returns the latest sample per task.
  • apply_heartbeats_batch no longer needs a steady-state pass; the transition_updates filter retains only state changes and updates carrying error/exit_code (a minimal sketch of the filter follows this list). update.container_id is still consumed in _apply_task_transitions, but in practice container_id is set on the same heartbeat as the BUILDING/RUNNING transition, so dropping pure-container_id steady-state updates is harmless.
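
As referenced in the last bullet, a hedged sketch of the steady-state filter, assuming simplified heartbeat types: the field names state, error, exit_code, and container_id follow the review notes; everything else is illustrative.

from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskUpdate:
    # Simplified stand-in for the real heartbeat update message.
    task_id: str
    state: str
    error: Optional[str] = None
    exit_code: Optional[int] = None
    container_id: Optional[str] = None

def is_transition(update: TaskUpdate, prev_state: str) -> bool:
    # Keep state changes and updates carrying error/exit_code; a heartbeat
    # whose only payload is container_id is dropped, since container_id
    # arrives on the same heartbeat as the BUILDING/RUNNING transition.
    return (update.state != prev_state
            or update.error is not None
            or update.exit_code is not None)

def filter_transitions(updates: list[TaskUpdate],
                       prev_states: dict[str, str]) -> list[TaskUpdate]:
    return [u for u in updates
            if is_transition(u, prev_states.get(u.task_id, ""))]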

@rjpower rjpower requested a review from yonromai May 5, 2026 23:04
@rjpower rjpower enabled auto-merge (squash) May 5, 2026 23:04
@yonromai yonromai (Contributor) left a comment


🤖 Approved. The cleanup path is coherent and CI is green. I left one P2 inline comment about matching stats rows to the current attempt before using them in JobDetail; I do not think it needs to block this PR.

Validation:

  • gh pr checks 5453 --repo marin-community/marin --watch=false: green/pass, with expected path-filter skips.
  • cd lib/iris && uv run --group dev python -m pytest -n1 --tb=short tests/cluster/providers/k8s/test_provider.py: 53 passed in 42.92s.
  • cd lib/iris/dashboard && npm ci && npm run build:check: failed on pre-existing TypeScript errors outside this PR's changed files; no error was reported in JobDetail.vue.

Generated with Codex.

Inline comment on the JobDetail.vue query:

SELECT task_id, attempt_id, cpu_millicores, memory_mb, memory_peak_mb
FROM "iris.task"
WHERE task_id IN (${list})
QUALIFY row_number() OVER (PARTITION BY task_id ORDER BY ts DESC) = 1

🤖 P2: This drops the attempt dimension when choosing the latest resource row. iris.task rows are per attempt, but the query keeps only the latest row per task_id and taskUsageMap then applies it to whatever the controller says is the current task state. On a retry/preemption, a previous attempt can remain the latest row until the new attempt emits a sample, so a running retry can briefly or permanently show the failed attempt’s MEM/CPU values. Please keep attempt identity in the selection, for example by returning the latest row per (task_id, attempt_id) and only mapping rows whose attempt_id matches TaskStatus.currentAttemptId.

Generated with Codex.
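
A sketch of the attempt-aware selection suggested in the comment above. The SQL mirrors the quoted query with the attempt dimension added to the window (the ${list} placeholder is kept from the quoted query); the Python mapping is illustrative only, since the real consumer is JobDetail.vue in TypeScript, and current_attempts stands in for the TaskStatus.currentAttemptId lookup.

ATTEMPT_AWARE_SQL = """
SELECT task_id, attempt_id, cpu_millicores, memory_mb, memory_peak_mb
FROM "iris.task"
WHERE task_id IN (${list})
QUALIFY row_number() OVER (PARTITION BY task_id, attempt_id ORDER BY ts DESC) = 1
"""

def build_usage_map(rows: list[dict],
                    current_attempts: dict[str, int]) -> dict[str, dict]:
    # Map a stats row onto a task only when it belongs to the attempt the
    # controller currently reports, so a retrying task never shows a
    # previous attempt's MEM/CPU sample.
    return {
        row["task_id"]: row
        for row in rows
        if current_attempts.get(row["task_id"]) == row["attempt_id"]
    }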

@rjpower rjpower merged commit 5966ee1 into main May 5, 2026
39 of 40 checks passed
@rjpower rjpower deleted the prune-resource-usage-paths branch May 5, 2026 23:21
