feat: distinguish externally killed tasks from code failures (#84) #187

Open
Harshil-Malisetty wants to merge 5 commits into Netflix:master from Harshil-Malisetty:fix/killed-task-status


Harshil-Malisetty commented Mar 21, 2026

Description of the Change

Closes #84

All terminal task failures currently render as identical red "Failed" states, conflating two fundamentally different failure modes:

  • Code failures: a Python exception was raised; the user's code is the cause
  • Killed tasks: SIGKILL from the OOM killer, a Kubernetes eviction, or another external process termination; the infrastructure is the cause

This PR adds a distinct killed status with an amber color and a warning icon, so users can immediately tell whether to debug their code or their infrastructure.

The distinguishing signal is derived in the backend SQL (companion PR: Netflix/metaflow-service#468). When a task fails via a Python exception, Metaflow writes attempt_ok = "False" in the finally block of task.py. When the task is killed with SIGKILL, the finally block never runs, so no attempt_ok metadata is written. The backend now derives killed from this absence and exposes it in the task status field.
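For reviewers who want the decision logic at a glance, here is a minimal TypeScript sketch of the rule described above. It is not the backend implementation (that lives in SQL in the companion PR); the interface, field names, and function name are hypothetical, and the 60-second default is taken from the heartbeat threshold mentioned under Possible Drawbacks.

```typescript
// Hypothetical input shape; these field names are illustrative and are
// not taken from metaflow-service.
interface AttemptSignals {
  // Value of the attempt_ok metadata, or null if it was never written.
  attemptOk: 'True' | 'False' | null;
  // Epoch milliseconds of the task's last heartbeat.
  lastHeartbeatMs: number;
}

type DerivedStatus = 'running' | 'completed' | 'failed' | 'killed';

// Assumed default, based on the "about 60 seconds" stated in this PR.
const HEARTBEAT_THRESHOLD_MS = 60_000;

function deriveStatus(
  signals: AttemptSignals,
  nowMs: number,
  thresholdMs: number = HEARTBEAT_THRESHOLD_MS
): DerivedStatus {
  // The finally block ran and recorded an outcome.
  if (signals.attemptOk === 'True') return 'completed';
  if (signals.attemptOk === 'False') return 'failed';

  // No attempt_ok at all: either the task is still alive, or it was
  // terminated before the finally block could run.
  const heartbeatExpired = nowMs - signals.lastHeartbeatMs > thresholdMs;
  return heartbeatExpired ? 'killed' : 'running';
}
```

Under these assumptions, a task with no attempt_ok metadata reads as running until its heartbeat goes stale, and only then as killed.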


Files changed

  • src/types.ts: added killed to TaskStatus union
  • src/utils/style.ts: amber color and warning icon for killed status
  • src/components/Timeline/taskdataUtils.ts: killed in RowCounts and getStepStatus
  • src/components/Timeline/useTaskData.ts: killed: 0 in initial counts
  • src/pages/Home/ResultGroup/TimelinePreview.tsx: killed in zeroCounts
  • src/components/TaskListingHeader/components/StatusLights.tsx: killed status light
  • src/components/TaskListingHeader/components/CustomSettings.tsx: killed filter option
  • src/translations/en.ts: filter-killed translation
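The shape of these changes can be illustrated with a small, self-contained sketch. It does not reproduce the repository's actual definitions: the members of the union other than failed and killed, and the names StatusCounts, zeroCounts, countByStatus, and failureColor, are assumptions made for illustration. Only the amber-for-killed and red-for-failed pairing comes from this PR.

```typescript
// Assumed union; only 'failed' and 'killed' are confirmed by this PR.
type TaskStatus = 'running' | 'completed' | 'failed' | 'killed';

// Keying the counts by TaskStatus means the compiler rejects any
// initial-counts object that omits a member of the union.
type StatusCounts = Record<TaskStatus, number>;

// Hypothetical helper returning a fresh all-zero counts object.
const zeroCounts = (): StatusCounts => ({
  running: 0,
  completed: 0,
  failed: 0,
  killed: 0,
});

// Hypothetical color labels: red for code failures, amber for killed tasks.
const failureColor: Record<'failed' | 'killed', 'red' | 'amber'> = {
  failed: 'red',
  killed: 'amber',
};

// Tally tasks per status, keeping killed separate from failed.
function countByStatus(tasks: ReadonlyArray<{ status: TaskStatus }>): StatusCounts {
  const counts = zeroCounts();
  for (const task of tasks) {
    counts[task.status] += 1;
  }
  return counts;
}
```

With a count type keyed like this, adding a member to the union forces every place that builds an initial counts object to be updated, which would be consistent with the small killed: 0 edits listed above.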

Note: Run-level status on the home screen continues to show "Failed" for both failure modes. This is correct, since the run itself failed in both cases. The distinction between killed and failed is exposed at the task level in the timeline view and filter, which is where users investigate failure causes.


Alternate Designs

I considered deriving the killed status on the frontend by checking for the absence of attempt_ok metadata in useTaskMetadata. I rejected this because metadata loads asynchronously after the task, which would cause a flicker and create two sources of truth. Backend derivation via SQL is consistent with how the existing codebase handles all other status derivation, and matches Sakari's stated preference for the backend as the single source of truth.


Possible Drawbacks

The killed status appears after the heartbeat threshold passes (about 60 seconds by default). During this window the task shows as running. This is existing behavior for all failed tasks and is not introduced by this change.


Verification Process

Ran two flows locally against a full metaflow-service Docker stack:

FailFlow

  • raises ValueError("deliberate failure")
  • curl result: "status": "failed"
  • UI: red dot, Failed (1) in filter, not counted as killed

KillFlow

  • calls os.kill(os.getpid(), signal.SIGKILL)
  • curl result after 90s: "status": "killed"
  • UI: amber dot, Killed (1) in filter, Failed (0), correctly not counted as failed

Screenshots

  • FailFlow (terminal)
  • KillFlow (terminal)
  • FailFlow UI timeline
  • KillFlow UI timeline and dropdown


Release Notes

Externally killed tasks (SIGKILL, OOM killer, Kubernetes eviction) now display with a distinct amber warning indicator instead of the red failed indicator, allowing users to immediately distinguish infrastructure failures from code failures.

Development

Successfully merging this pull request may close these issues.

Distinguish "killed by orchestrator" and actual failures
