feat: distinguish externally killed tasks from code failures (#84) #187
Open
Harshil-Malisetty wants to merge 5 commits into Netflix:master
Description of the Change
Closes #84
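All terminal task failures currently render as identical red "Failed" states, conflating two fundamentally different failure modes: code failures, where the task raised a Python exception, and external kills, where the process was terminated from outside (SIGKILL, OOM killer, Kubernetes eviction).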
This PR adds a distinct killed status with amber color and warning icon so users can immediately distinguish whether to debug their code or their infrastructure.
The distinguishing signal lives in the backend SQL (companion PR in Netflix/metaflow-service: Netflix/metaflow-service#468). When a task fails via a Python exception, Metaflow writes attempt_ok = "False" in the finally block of task.py. When the task is killed with SIGKILL, the finally block never runs, so no attempt_ok metadata is written. The backend now derives killed from this absence and exposes it in the task status field.
Files changed
src/types.ts: added killed to TaskStatus union
src/utils/style.ts: amber color and warning icon for killed status
src/components/Timeline/taskdataUtils.ts: killed in RowCounts and getStepStatus
src/components/Timeline/useTaskData.ts: killed: 0 in initial counts
src/pages/Home/ResultGroup/TimelinePreview.tsx: killed in zeroCounts
src/components/TaskListingHeader/components/StatusLights.tsx: killed status light
src/components/TaskListingHeader/components/CustomSettings.tsx: killed filter option
src/translations/en.ts: filter-killed translation
Note: Run-level status on the home screen continues to show "Failed" for both failure modes. This is correct since the run itself failed in both cases. The killed or failed distinction is exposed at the task level in the timeline view and filter, which is where users investigate failure causes.
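For reviewers who want a feel for the shape of the frontend change, here is a minimal TypeScript sketch of how a new status can be threaded through a status union, per-status counts, and a style map. It is not the code in this PR: the union members other than failed and killed, the helper names zeroCounts, countByStatus and rollUpStepStatus, the roll-up precedence, and the color and icon values are all assumptions made for illustration.

```typescript
// Hypothetical stand-in for the TaskStatus union in src/types.ts.
// Only 'failed' and 'killed' come from the PR text; the other members are assumed.
type TaskStatus = 'running' | 'completed' | 'failed' | 'killed';

// Per-status counters; Record forces every status, including 'killed', to have an entry.
type RowCounts = Record<TaskStatus, number>;

// Initial counts with killed: 0, mirroring the "killed: 0 in initial counts" item above.
function zeroCounts(): RowCounts {
  return { running: 0, completed: 0, failed: 0, killed: 0 };
}

// Tally a list of task statuses into counters.
function countByStatus(statuses: TaskStatus[]): RowCounts {
  const counts = zeroCounts();
  for (const status of statuses) {
    counts[status] += 1;
  }
  return counts;
}

// Assumed precedence for a step-level roll-up: running, then failed, then killed.
// The real getStepStatus may order these differently.
function rollUpStepStatus(counts: RowCounts): TaskStatus {
  if (counts.running > 0) return 'running';
  if (counts.failed > 0) return 'failed';
  if (counts.killed > 0) return 'killed';
  return 'completed';
}

// Placeholder style values; the PR only states "amber color and warning icon".
const statusStyle: Record<TaskStatus, { color: string; icon: string }> = {
  running: { color: 'green', icon: 'spinner' },
  completed: { color: 'green', icon: 'check' },
  failed: { color: 'red', icon: 'error' },
  killed: { color: 'amber', icon: 'warning' },
};
```

Typing the counters and the style map as Record over the union means the compiler reports any place that has not been updated for the new member, which is one way to keep an eight-file change like this one consistent.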
Alternate Designs
Considered deriving killed status on the frontend by checking attempt_ok metadata absence in useTaskMetadata. Rejected because metadata loads asynchronously after the task, causing a flicker and creating two sources of truth. Backend derivation via SQL is consistent with how the existing codebase handles all other status derivation, and matches Sakari's stated preference for backend as single source of truth.
Possible Drawbacks
The killed status appears after the heartbeat threshold passes (about 60 seconds by default). During this window the task shows as running. This is existing behavior for all failed tasks and is not introduced by this change.
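The rule described in this PR (an explicit attempt_ok value wins; a missing value becomes killed only once the heartbeat is stale) can be sketched in TypeScript as follows. The actual derivation is done in backend SQL in the companion PR, so this is only an illustration: the AttemptSnapshot shape, the deriveStatus name, the "True" value for successful attempts, and the exact 60000 ms threshold are assumptions.

```typescript
type DerivedStatus = 'running' | 'completed' | 'failed' | 'killed';

// Hypothetical view of what is known about one task attempt.
interface AttemptSnapshot {
  // Value of the attempt_ok metadata, or null when none was written.
  // "False" on exception is from the PR text; "True" on success is assumed.
  attemptOk: 'True' | 'False' | null;
  // Time of the last heartbeat, in milliseconds since the epoch.
  lastHeartbeatMs: number;
}

// Assumed default, based on "about 60 seconds" in the PR text.
const HEARTBEAT_THRESHOLD_MS = 60000;

function deriveStatus(
  attempt: AttemptSnapshot,
  nowMs: number,
  thresholdMs: number = HEARTBEAT_THRESHOLD_MS
): DerivedStatus {
  // An explicit attempt_ok value settles the outcome.
  if (attempt.attemptOk === 'True') return 'completed';
  if (attempt.attemptOk === 'False') return 'failed';
  // No attempt_ok: the finally block has not run (yet, or ever).
  // While heartbeats are fresh the task is treated as running;
  // once they go stale, the missing metadata is read as an external kill.
  return nowMs - attempt.lastHeartbeatMs > thresholdMs ? 'killed' : 'running';
}
```

Under these assumptions, a task with no attempt_ok metadata and a last heartbeat 30 seconds ago is still reported as running, and the same task is reported as killed once more than 60 seconds have passed since that heartbeat.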
Verification Process
Ran two flows locally against a full metaflow-service Docker stack:
FailFlow: raises ValueError("deliberate failure"); resulting task "status": "failed"
KillFlow: calls os.kill(os.getpid(), signal.SIGKILL); resulting task "status": "killed"
Screenshots
FailFlow (terminal)

KillFlow (terminal)

FailFlow UI timeline

KillFlow UI timeline and dropdown

Release Notes
Externally killed tasks (SIGKILL, OOM killer, Kubernetes eviction) now display with a distinct amber warning indicator instead of the red failed indicator, allowing users to immediately distinguish infrastructure failures from code failures.