[iris] Prune dead resource_usage paths; query iris.task in JobDetail#5453
Conversation
Drop the worker-pushed and controller-stored ResourceUsage plumbing that the finelog stats migration left behind (WorkerTaskStatus.resource_usage, TaskUpdate.resource_usage, JobStatus.resource_usage, and the GetJobStatusResponse min/max aggregates); their proto field numbers are reserved. Rewire the k8s ResourceCollector to write IrisTaskStat rows directly to iris.task so CoreWeave pods stay observable, and point the JobDetail dashboard at the same namespace so per-task MEM/CPU columns populate again.
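The dashboard side of this rewiring can be sketched as follows. This is a minimal sketch, not the actual JobDetail.vue code: the `TaskStatRow` shape and `latestByTask` helper are hypothetical names, assumed to mirror the columns the ResourceCollector writes to iris.task and the latest-row-per-task selection the dashboard query performs.

```typescript
// Hypothetical shape of a sample row the k8s ResourceCollector writes
// to the iris.task namespace (column names assumed from the PR's query).
interface TaskStatRow {
  task_id: string;
  attempt_id: number;
  cpu_millicores: number;
  memory_mb: number;
  memory_peak_mb: number;
  ts: number; // sample timestamp
}

// Reduce raw samples to the latest row per task_id, mirroring
// QUALIFY row_number() OVER (PARTITION BY task_id ORDER BY ts DESC) = 1.
function latestByTask(rows: TaskStatRow[]): Map<string, TaskStatRow> {
  const byTask = new Map<string, TaskStatRow>();
  for (const row of rows) {
    const prev = byTask.get(row.task_id);
    if (!prev || row.ts > prev.ts) byTask.set(row.task_id, row);
  }
  return byTask;
}
```

The resulting map is what a per-task MEM/CPU column render would look rows up in, keyed by task id.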
Claude finished @rjpower's task in 5m 19s. Code review: No issues found. Checked for bugs and AGENTS.md compliance. Scope verified.
yonromai
left a comment
🤖 Approved. The cleanup path is coherent and CI is green. I left one P2 inline comment about matching stats rows to the current attempt before using them in JobDetail; I do not think it needs to block this PR.
Validation:
- `gh pr checks 5453 --repo marin-community/marin --watch=false`: green/pass, with expected path-filter skips.
- `cd lib/iris && uv run --group dev python -m pytest -n1 --tb=short tests/cluster/providers/k8s/test_provider.py`: 53 passed in 42.92s.
- `cd lib/iris/dashboard && npm ci && npm run build:check`: failed on existing TypeScript errors outside this PR's changed files; no JobDetail.vue error was reported.
Generated with Codex.
SELECT task_id, attempt_id, cpu_millicores, memory_mb, memory_peak_mb
FROM "iris.task"
WHERE task_id IN (${list})
QUALIFY row_number() OVER (PARTITION BY task_id ORDER BY ts DESC) = 1
🤖 P2: This drops the attempt dimension when choosing the latest resource row. iris.task rows are per attempt, but the query keeps only the latest row per task_id and taskUsageMap then applies it to whatever the controller says is the current task state. On a retry/preemption, a previous attempt can remain the latest row until the new attempt emits a sample, so a running retry can briefly or permanently show the failed attempt’s MEM/CPU values. Please keep attempt identity in the selection, for example by returning the latest row per (task_id, attempt_id) and only mapping rows whose attempt_id matches TaskStatus.currentAttemptId.
Generated with Codex.
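The attempt-aware selection the reviewer asks for can be sketched like this. It is a sketch under assumptions, not the real JobDetail.vue change: `TaskStatRow`, `usageForCurrentAttempts`, and the `currentAttempt` map are hypothetical names standing in for the taskUsageMap logic and TaskStatus.currentAttemptId lookup.

```typescript
// Hypothetical sample row, assumed from the columns in the quoted query.
interface TaskStatRow {
  task_id: string;
  attempt_id: number;
  cpu_millicores: number;
  memory_mb: number;
  ts: number;
}

// Keep the latest sample per (task_id, attempt_id), then surface only rows
// whose attempt matches the controller's current attempt, so a running retry
// never shows a stale previous attempt's MEM/CPU values.
function usageForCurrentAttempts(
  rows: TaskStatRow[],
  currentAttempt: Map<string, number>, // task_id -> TaskStatus.currentAttemptId
): Map<string, TaskStatRow> {
  const latest = new Map<string, TaskStatRow>(); // keyed by task_id:attempt_id
  for (const row of rows) {
    const key = `${row.task_id}:${row.attempt_id}`;
    const prev = latest.get(key);
    if (!prev || row.ts > prev.ts) latest.set(key, row);
  }
  const result = new Map<string, TaskStatRow>();
  for (const row of latest.values()) {
    if (currentAttempt.get(row.task_id) === row.attempt_id) {
      result.set(row.task_id, row);
    }
  }
  return result;
}
```

Until the new attempt emits its first sample, the current attempt simply has no row, which renders as empty rather than as the failed attempt's numbers.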