feat: derive killed status for externally terminated tasks #468
Open
Harshil-Malisetty wants to merge 2 commits into Netflix:master from
Conversation
Interesting approach. One thing I noticed: the SQL and refiner seem to have slightly different definitions of "killed." The SQL checks task_ok_location IS NULL (no _task_ok artifact row in the DB at all), but the refiner reclassifies failed to killed when the _task_ok row exists but its value is None (unreadable from datastore). These are different failure modes. On very old tasks with corrupted artifact data, the refiner path could mark them as "killed" when they're really just missing data. Might be worth aligning the two conditions so they catch the same cases.
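The mismatch the reviewer describes can be made concrete. A minimal sketch with both predicates reduced to booleans (function and parameter names here are illustrative, not taken from the PR's code):

```python
def sql_killed(attempt_ok, task_ok_location, heartbeat_expired, finished):
    # SQL CASE condition: no attempt_ok metadata AND no _task_ok
    # artifact row in the DB at all.
    return (
        attempt_ok is None
        and task_ok_location is None
        and heartbeat_expired
        and not finished
    )

def refiner_killed(status, task_ok_row_exists, task_ok_value):
    # Refiner condition: the _task_ok row exists but its value could
    # not be read back from the datastore.
    return status == "failed" and task_ok_row_exists and task_ok_value is None

# Divergent case: an old task with a corrupted artifact. The _task_ok row
# exists, so the SQL does NOT mark it killed, but its value is unreadable,
# so the refiner does.
```

A task with `task_ok_location` set but an unreadable artifact value satisfies only the refiner's predicate, which is exactly the divergence being flagged.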
Summary
Companion to Netflix/metaflow-ui issue 84; see Netflix/metaflow-ui#187.
Adds a new killed task status to distinguish tasks terminated externally (SIGKILL, OOM killer, Kubernetes eviction) from tasks that failed due to Python exceptions.

The fix could have been implemented on the frontend by checking for the absence of attempt_ok in the task metadata response. This was considered and rejected for two reasons. First, task metadata loads asynchronously after the task row itself, which would cause a visible flicker from killed back to failed during the loading window. Second, it would create two sources of truth for task status: the SQL-derived status field and a frontend-derived killed flag, which conflicts with how every other status in the codebase is handled.

Deriving killed in the SQL status CASE statement in task.py keeps the backend as the single source of truth, consistent with Sakari's stated design preference.
Problem
When a Metaflow task terminates abnormally, the UI currently shows a red "Failed" indicator regardless of the cause. But there are two fundamentally different failure modes — a Python exception means the user's code has a bug, while an external kill (SIGKILL, OOM killer, Kubernetes eviction) means the infrastructure failed. Conflating these forces engineers to manually inspect raw metadata just to know where to start debugging, which defeats the purpose of having a monitoring UI.
Investigation
The distinguishing signal is already in the backend. When a task fails via a Python exception, Metaflow writes attempt_ok = "False" in the finally block of task.py. When a task is killed externally with SIGKILL, the finally block never executes, leaving three observable differences in the data:
No attempt_ok metadata entry is written
No _task_ok artifact exists at any location
last_heartbeat_ts is present but the heartbeat eventually times out
This means the backend already has everything needed to distinguish the two cases — no new schema changes or API endpoints are required.
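The three signals above can be checked programmatically. A minimal sketch of the classification logic (field names like `attempt_ok` and `last_heartbeat_ts` follow the PR text, but the dict-based record shape and the threshold value are hypothetical simplifications, not the actual DB row or backend setting):

```python
import time

# Illustrative value only; the backend's real heartbeat threshold differs.
HEARTBEAT_THRESHOLD = 300  # seconds

def classify_failure(task):
    """Classify an abnormally terminated task as 'failed' or 'killed'.

    `task` is a plain dict standing in for the joined DB row:
      attempt_ok        -- "True"/"False" if written by the finally block, else None
      task_ok_location  -- location of the _task_ok artifact, else None
      last_heartbeat_ts -- epoch seconds of the last heartbeat, else None
      finished_at       -- epoch seconds if the task finished, else None
    """
    if task["attempt_ok"] == "False":
        # The finally block in task.py ran and recorded a Python exception.
        return "failed"
    heartbeat_expired = (
        task["last_heartbeat_ts"] is not None
        and time.time() - task["last_heartbeat_ts"] > HEARTBEAT_THRESHOLD
    )
    if (
        task["attempt_ok"] is None
        and task["task_ok_location"] is None
        and heartbeat_expired
        and task["finished_at"] is None
    ):
        # No exception record, no _task_ok artifact, heartbeat timed out,
        # never finished: the process was terminated externally.
        return "killed"
    return "unknown"
```

The `killed` branch mirrors the five conditions of the SQL WHEN clause added in this PR; the `failed` branch mirrors the existing exception path.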
Verified by running both flow types locally and diffing the metadata API responses:
- Code failure (ValueError)
- Killed task (SIGKILL)
Solution
Added a new WHEN clause to the SQL status CASE statement in services/ui_backend_service/data/db/tables/task.py, placed before the existing heartbeat-based failed condition:
```sql
WHEN attempt.attempt_ok IS NULL
  AND attempt.task_ok_location IS NULL
  AND {table_name}.last_heartbeat_ts IS NOT NULL
  AND (extract(epoch from now()) - {table_name}.last_heartbeat_ts) > {heartbeat_threshold}
  AND {finished_at_column} IS NULL
THEN 'killed'
```
This correctly identifies tasks where:
- no attempt_ok metadata entry was ever written (the finally block never ran),
- no _task_ok artifact exists at any location,
- a heartbeat was recorded but has since exceeded the timeout threshold, and
- the task never recorded a finish timestamp.
No new database columns, no new API endpoints, no performance impact. It uses existing indexed columns already queried in the lateral join.
Also updated task_refiner.py to be aware of the killed status so it does not incorrectly override it during postprocessing.
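The refiner change can be sketched as a guard in postprocessing. This is an illustrative approximation assuming the refiner rewrites statuses based on the `_task_ok` artifact value; the real `task_refiner.py` logic differs in detail:

```python
def refine_status(status, task_ok_row_exists, task_ok_value):
    """Postprocess a task's SQL-derived status.

    Hypothetical simplification of the refiner's decision:
      status             -- status from the SQL CASE ('killed', 'failed', ...)
      task_ok_row_exists -- whether a _task_ok artifact row exists in the DB
      task_ok_value      -- decoded _task_ok value, or None if unreadable
    """
    if status == "killed":
        # Trust the SQL-derived killed status; do not let artifact
        # postprocessing flip it back to 'failed'.
        return status
    if status == "failed" and task_ok_row_exists and task_ok_value is None:
        # Artifact row exists but its value could not be read from the
        # datastore; per the PR, reclassify as 'killed'.
        return "killed"
    return status
```

The first guard is the new behavior: without it, a task the SQL already classified as killed could be rewritten during artifact postprocessing.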
Files changed
Verification
KillFlow: SIGKILL task
FailFlow: ValueError task