[Feature][history server] support endpoint /api/v0/tasks/timeline#4437
[Feature][history server] support endpoint /api/v0/tasks/timeline#4437AndySung320 wants to merge 29 commits intoray-project:masterfrom
/api/v0/tasks/timeline#4437Conversation
|
Manual test:
Missing Profile EventsBased on testing, the following task detail events are often not present in the collected logs:
The root cause is still under investigation. Additional EventsThe history server may include extra events not shown in Dashboard:
For best results, wait 1-2 minutes after job completion before deleting the cluster. |
7889ccb to
bca027a
Compare
| return fmt.Errorf("event does not have 'taskProfileEvents'") | ||
| } | ||
|
|
||
| var profileData struct { |
There was a problem hiding this comment.
Would it be better not to use anonymous struct?
There was a problem hiding this comment.
yes, i will refactor it.
| } | ||
|
|
||
| // Convert events to ProfileEventRaw format | ||
| var rawEvents []types.ProfileEventRaw |
There was a problem hiding this comment.
nit:
| var rawEvents []types.ProfileEventRaw | |
| var rawEvents = make([]types.ProfileEventRaw, 0, len(profileData.ProfileEvents.Events)) |
There was a problem hiding this comment.
thanks for pointing it out~
| existingKeys := make(map[eventKey]bool) | ||
| for _, e := range t.ProfileData.Events { | ||
| existingKeys[eventKey{e.EventName, e.StartTime, e.EndTime}] = true | ||
| } | ||
|
|
||
| for _, e := range rawEvents { | ||
| key := eventKey{e.EventName, e.StartTime, e.EndTime} | ||
| if !existingKeys[key] { | ||
| t.ProfileData.Events = append(t.ProfileData.Events, e) | ||
| existingKeys[key] = true |
There was a problem hiding this comment.
nit:
| existingKeys := make(map[eventKey]bool) | |
| for _, e := range t.ProfileData.Events { | |
| existingKeys[eventKey{e.EventName, e.StartTime, e.EndTime}] = true | |
| } | |
| for _, e := range rawEvents { | |
| key := eventKey{e.EventName, e.StartTime, e.EndTime} | |
| if !existingKeys[key] { | |
| t.ProfileData.Events = append(t.ProfileData.Events, e) | |
| existingKeys[key] = true | |
| existingKeys := make(map[eventKey]struct{}) | |
| for _, e := range t.ProfileData.Events { | |
| existingKeys[eventKey{e.EventName, e.StartTime, e.EndTime}] = struct{}{} | |
| } | |
| for _, e := range rawEvents { | |
| key := eventKey{e.EventName, e.StartTime, e.EndTime} | |
| if _, ok := existingKeys[key]; !ok { | |
| t.ProfileData.Events = append(t.ProfileData.Events, e) | |
| existingKeys[key] = struct{}{} |
| args, ok := event["args"].(map[string]any) | ||
| gg.Expect(ok).To(BeTrue(), "process_name should have args") |
There was a problem hiding this comment.
It seems the ok means the type assertion works or not, which might be a bit confused with the following error message. May I know what is aim for?
There was a problem hiding this comment.
Good catch! The ok variable checks whether the type assertion succeeded (i.e., whether event["args"] is a map[string]any), but the error message "process_name should have args" might be misleading since it could be interpreted as checking for the existence of args rather than its type.
I can refactor this to make it more explicit. Same as line:407
argsAny, exists := event["args"]
gg.Expect(exists).To(BeTrue(), "process_name should have 'args' field")
args, ok := argsAny.(map[string]any)
gg.Expect(ok).To(BeTrue(), "process_name args should be a map[string]any")| @@ -223,11 +224,23 @@ func (h *EventHandler) storeEvent(eventMap map[string]any) error { | |||
| taskMap.CreateOrMergeAttempt(currTask.TaskID, currTask.AttemptNumber, func(t *types.Task) { | |||
| // Merge definition fields (preserve existing Events if any) | |||
There was a problem hiding this comment.
| // Merge definition fields (preserve existing Events if any) | |
| // Merge definition fields (preserve existing Events, ProfileData, and identifiers if any) |
nit
| } | ||
|
|
||
| // Direct mapping for known event names | ||
| switch eventName { |
There was a problem hiding this comment.
I think this is referenced from: https://github.com/ray-project/ray/blob/68d01c4c48a59c7768ec9c2359a1859966c446b6/python/ray/_private/profiling.py#L25-L25?
Could we add a comment here to show that this part is following the above link in ray?
There was a problem hiding this comment.
sure. i will add it
| if strings.HasPrefix(profEvent.EventName, "task::") { | ||
| if extraData != nil { | ||
| if name, ok := extraData["name"].(string); ok { | ||
| args["name"] = name | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // Determine event name for display | ||
| eventName := profEvent.EventName | ||
| displayName := profEvent.EventName | ||
|
|
||
| // For overall task events like "task::slow_task", use the full name from extraData | ||
| if strings.HasPrefix(profEvent.EventName, "task::") && extraData != nil { | ||
| if name, ok := extraData["name"].(string); ok { | ||
| displayName = name | ||
| } | ||
| } |
There was a problem hiding this comment.
I think those two if block can be merged? Can we just do
// Determine event name for display
eventName := profEvent.EventName
displayName := profEvent.EventName
// For overall task events like "task::slow_task", extract name from extraData
if strings.HasPrefix(profEvent.EventName, "task::") && extraData != nil {
if name, ok := extraData["name"].(string); ok {
args["name"] = name
displayName = name
}
}| Name: "thread_name", | ||
| PID: pid, | ||
| TID: &tidVal, | ||
| Phase: "M", |
There was a problem hiding this comment.
Just curious, what does phase "M" and "X" mean? Could we document it in ChromeTraceEvent struct? As it's not obvious without a comment
There was a problem hiding this comment.
"M" means Metadata event (e.g., process_name, thread_name)
"X" means Complete event (duration events with start and end times)
I will add the comment in ChromeTraceEvent struct.
| continue | ||
| } | ||
| nodeIP := task.ProfileData.NodeIPAddress | ||
| workerID := task.ProfileData.ComponentID |
There was a problem hiding this comment.
In Ray Dashboard, we filter out component type that's not worker or driver. Do we need to add it here?
There was a problem hiding this comment.
yes, thanks for pointing this out.
we should filter it out to align with Ray Dashboard.
| StartTime int64 `json:"start_time"` // nanoseconds | ||
| EndTime int64 `json:"end_time"` // nanoseconds |
There was a problem hiding this comment.
Could we add a comment showing why they are in nanoseconds?
I think it's here: https://github.com/ray-project/ray/blob/68d01c4c48a59c7768ec9c2359a1859966c446b6/src/ray/core_worker/profile_event.cc#L51-L51
There was a problem hiding this comment.
sure, i will add a comment
| continue | ||
| } | ||
| nodeIP := task.ProfileData.NodeIPAddress | ||
| workerID := task.ProfileData.ComponentID |
There was a problem hiding this comment.
In Ray Dashboard, they use component type + component id to form cluster id, which I think is similar to our workerID here? Could you explain why we just use workerID rather than cluster id that consider different component types?
There was a problem hiding this comment.
You’re right that in Ray Dashboard they form a single logical id from component_type and component_id (e.g. "worker:xyz", "driver:abc"), which is effectively the “cluster id” for that component.
In current implementation I only used ComponentID (the raw id) as the key and didn’t take ComponentType into account. I didn’t consider that distinction at the time.
Using the composite key is safer (no collision if the same id is reused across driver/worker on a node) and matches the Dashboard. I’ll update our code with that and use component_type + ":" + component_id for both the internal key and the displayed thread name.
| if existingProfileData != nil { | ||
| t.ProfileData = existingProfileData | ||
| } | ||
| if existingNodeID != "" { | ||
| t.NodeID = existingNodeID | ||
| } | ||
| if existingWorkerID != "" { | ||
| t.WorkerID = existingWorkerID | ||
| } |
There was a problem hiding this comment.
What does these lines mean ? It seems to get existingProfileData from t.ProfileData and set it back.
There was a problem hiding this comment.
These lines preserve ProfileData, NodeID, and WorkerID that may have been set by earlier events (TASK_PROFILE_EVENT or TASK_LIFECYCLE_EVENT) before the TASK_DEFINITION_EVENT arrives. Since events are processed concurrently and order is not guaranteed, we save these values before overwriting with *t = currTask (line 230), then restore them to prevent data loss.
This follows the same pattern as ACTOR_DEFINITION_EVENT handling.
| // Skip if clusterID is empty (consistent with first pass) | ||
| if clusterID == "" { | ||
| continue | ||
| } |
There was a problem hiding this comment.
Empty string check for clusterID is always false
Low Severity
The check if clusterID == "" can never be true because clusterID is constructed as ComponentType + ":" + ComponentID. Even when both fields are empty, clusterID will be ":", not an empty string. The comment says "Skip if clusterID is empty" but this code path is unreachable. Similarly, the check if clusterID != "" at line 871 always evaluates to true. If the intent is to skip tasks with empty ComponentID, the check needs to validate ComponentID directly rather than checking the concatenated string.
Additional Locations (1)
| if name, ok := extraData["name"].(string); ok { | ||
| displayName = name | ||
| args["name"] = name | ||
| } |
There was a problem hiding this comment.
Missing empty string check for displayName assignment
Low Severity
The condition at line 979 checks if name, ok := extraData["name"].(string); ok without verifying that name is non-empty. This is inconsistent with line 626 which correctly checks ok && name != "". If extraData["name"] is an empty string, displayName will become empty instead of keeping the original profEvent.EventName as a fallback. This would result in trace events with empty names in the Chrome Tracing visualization.
| } | ||
| if existingWorkerID != "" { | ||
| t.WorkerID = existingWorkerID | ||
| } |
There was a problem hiding this comment.
TASK_DEFINITION_EVENT overwrites FuncOrClassName set by profile event
Medium Severity
When TASK_PROFILE_EVENT arrives before TASK_DEFINITION_EVENT, the FuncOrClassName and Name fields extracted from the profile event's extraData are lost. The TASK_DEFINITION_EVENT handler preserves Events, ProfileData, NodeID, and WorkerID before doing *t = currTask, but doesn't preserve FuncOrClassName or Name. Since profile events contain more detailed naming information (e.g., task::Counter.increment format for actor methods), this causes the timeline to show less informative or empty function names when events arrive out of order.
Additional Locations (1)
| // Check if all Fs (no actor) | ||
| if actorPortion == "ffffffffffffffffffffffff" { | ||
| return "" | ||
| } |
There was a problem hiding this comment.
Case-sensitive hex comparison may fail for uppercase IDs
Low Severity
The extractActorIDFromTaskID function checks if a task has no associated actor by comparing actorPortion == "ffffffffffffffffffffffff" (all lowercase). This comparison is case-sensitive, so if Ray ever outputs task IDs with uppercase hex characters (e.g., "FFFFFFFFFFFFFFFFFFFFFFFF"), the check would fail. This would cause tasks without actors to incorrectly have an actor_id value set in the timeline output instead of nil.
ddbc3d0 to
af41ebf
Compare
| jobID := req.QueryParameter("job_id") | ||
| download := req.QueryParameter("download") | ||
|
|
||
| timeline := s.eventHandler.GetTasksTimeline(clusterNameID, jobID) |
There was a problem hiding this comment.
Missing session name in cluster key for timeline lookup
High Severity
The getTasksTimeline function constructs clusterNameID as clusterName + "_" + clusterNamespace, but task data is stored using utils.BuildClusterSessionKey(clusterName, clusterNamespace, sessionName) which includes the session name. This key mismatch causes GetTasksTimeline to always return an empty timeline since it can't find any tasks. Other task endpoints like getTaskSummarize and getTasks correctly use utils.BuildClusterSessionKey.
| for _, task := range tasks { | ||
| if task.ProfileData == nil { | ||
| continue | ||
| } |
There was a problem hiding this comment.
Inconsistent filtering causes orphan metadata events
Low Severity
The first pass that collects node/worker mappings only checks task.ProfileData == nil, while the second pass that generates trace events additionally checks len(task.ProfileData.Events) == 0. This discrepancy causes process_name and thread_name metadata events to be emitted for nodes/workers that have no actual trace events, producing orphaned metadata entries in the Chrome Tracing output.
Additional Locations (1)
|
hi @Future-Outlier PTAL ~ |
| type eventKey struct { | ||
| EventName string | ||
| StartTime int64 | ||
| EndTime int64 | ||
| } |
There was a problem hiding this comment.
Just curious, only using EventName as key is not enough?
There was a problem hiding this comment.
I think this key is sufficient because:
- Deduplication is scoped to a single task attempt (
TaskID+AttemptNumberalready isolate different tasks) - Ray's profiling semantics: a task cannot have two events with the same name at the same time
- Consistent with
TASK_LIFECYCLE_EVENTdeduplication pattern (State + Timestamp)
| StartTime int64 | ||
| EndTime int64 | ||
| } | ||
| existingKeys := make(map[eventKey]struct{}) |
There was a problem hiding this comment.
| existingKeys := make(map[eventKey]struct{}) | |
| existingKeys := make(map[eventKey]struct{}, len(t.ProfileData.Events)+len(rawEvents)) |
I think we can pre-allocate the size to prevent resizing, as we already know the maximum element we will have
|
|
||
| // Extract func_or_class_name from extraData if available | ||
| for _, e := range rawEvents { | ||
| if strings.HasPrefix(e.EventName, "task::") && e.ExtraData != "" { |
There was a problem hiding this comment.
nit: I think we may extract task:: into constant like TaskPrefix or something else, as I saw this in multiple places
Just minor, no need to be in this PR
There was a problem hiding this comment.
yea, we can do it in other PR .
| // Skip if clusterID is empty | ||
| if clusterID == ":" { | ||
| continue | ||
| } |
There was a problem hiding this comment.
Will there be a case that ComponentType has value but ComponentID is empty?
There was a problem hiding this comment.
Based on the log that I've seen for now, I haven't seen any case that ComponentType has value while ComponentID is empty and vise versa.
If it did happen, the impact would be minimal (something like a display issue).
I think we can fix it if this issue does occur in the future. what do you think ?
| // Build PID/TID mappings | ||
| // PID: Node IP -> numeric ID | ||
| // TID: clusterID (componentType:componentId) -> numeric ID per node |
There was a problem hiding this comment.
It seems like the PID and TID here are not getting from Ray Event, but created here by counter (0, 1, 2, ...)? Is it what Ray Dashboard do or is it just an alternative here as we currently cannot get PID and TID from Ray exported event?
There was a problem hiding this comment.
Yes, PID and TID here are created by counters (0, 1, 2, …), not read from the Ray event payload. This is the same approach Ray Dashboard uses.
Ray’s task profile events do not include pid or tid at all; they only have node_ip_address, component_type, and component_id (WorkerID). The Dashboard’s timeline is built by chrome_tracing_dump() in ray/_private/profiling.py.
The PID logic matches Ray’s, but the TID logic slightly differs: Ray uses globally unique TIDs, while our implementation with Go assigns TIDs per node. Do we need to switch to a globally unique TID to align with Ray?
There was a problem hiding this comment.
Yes, I think it's better for us to totally follow how Ray Dashboard API did if possible. While this is minor, I think it's fine to do in a follow-up PR.
You can create a refactor issue to deal with this and following comments:
| } | ||
| } | ||
|
|
||
| func extractActorIDFromTaskID(taskIDHex string) string { |
There was a problem hiding this comment.
Could you provide where you get the rules used in this function (link to the Ray code) for future reference?
There was a problem hiding this comment.
sure, i will add a comment above this function.
we can trace back to src/ray/design_docs/id_specification.md which shows TaskID = 8B unique + 16B ActorID; ActorID = 12B unique + 4B JobID.
|
|
||
| ws.Route(ws.GET("/v0/tasks/timeline").To(s.getTasksTimeline).Filter(s.CookieHandle). | ||
| Doc("get tasks timeline"). | ||
| Param(ws.QueryParameter("job_id", "filter by job_id")). |
There was a problem hiding this comment.
Should we also add download here?
There was a problem hiding this comment.
yes, thanks for pointing this out, i will add it
| return | ||
| } | ||
|
|
||
| resp.Header().Set("Content-Type", "application/json") |
There was a problem hiding this comment.
Could we use .Produces(restful.MIME_JSON) in the router level instead to follow how other endpoints did?
There was a problem hiding this comment.
sure, thanks for the suggestion.
|
Hi, @AndySung320 |
Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
…olution Signed-off-by: AndySung320 <andysung0320@gmail.com>
…econd Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
9f8d4a6 to
ce1a652
Compare
There was a problem hiding this comment.
- both dead and live cluster work in my env
- can you add tests for query parameters
- download
- job_id
- For all variables that include DTO in their names, I’d like to replace them.
Could you help open an issue and assign it to yourself?
You can work on it as a separate PR. - trust @machichima and @fscnick 's detailed review
- I checked ray's source code, the endpoint's response is the same as ray's dashboard, so high level LGTM
machichima
left a comment
There was a problem hiding this comment.
LGTM! Thank you
Just a comment for follow-up issue: #4437 (comment)
Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>
| // This shouldn't happen if first pass worked correctly, | ||
| // but skip to avoid null TID | ||
| continue | ||
| } |
There was a problem hiding this comment.
Inconsistent TID pointer pattern compared to nearby code
Low Severity
The code at line 1152 takes the address of tid directly (tidPtr = &tid) without first copying to a local variable. However, the metadata event generation code just 40 lines above (lines 1113-1118) uses an explicit copy pattern (tidVal := tid) before taking the address. This inconsistency within the same function is a code smell. While Go's escape analysis makes both patterns work correctly, the explicit copy pattern is more defensive and consistent with the existing code style.
Additional Locations (1)
| // Skip if clusterID is empty | ||
| if clusterID == ":" { | ||
| continue | ||
| } |
There was a problem hiding this comment.
Empty ComponentID check is unreachable due to prior filter
Medium Severity
The check clusterID == ":" is unreachable and provides no protection. Since componentType is filtered to only "worker" or "driver" at lines 1074-1076, clusterID (built as ComponentType + ":" + ComponentID) will always be "worker:..." or "driver:...", never just ":". If ComponentID is empty, tasks would incorrectly share the same TID in timeline visualization (e.g., all workers with empty ComponentID on a node would appear as one thread).
Additional Locations (1)
| } | ||
| if t.JobID == "" { | ||
| t.JobID = profileData.JobID | ||
| } |
There was a problem hiding this comment.
Missing AttemptNumber initialization in profile event handler
Medium Severity
The TASK_PROFILE_EVENT handler sets TaskID and JobID on the task if empty, but does not set AttemptNumber. While profileData.AttemptNumber is used for storage indexing via CreateOrMergeAttempt, the Task.AttemptNumber field remains at its default value (0). The timeline endpoint later outputs task.AttemptNumber, resulting in incorrect attempt_number values for tasks that only have profile events without a corresponding definition event.
…er handling Signed-off-by: AndySung320 <andysung0320@gmail.com>
| t.TaskID = profileData.TaskID | ||
| } | ||
| if t.JobID == "" { | ||
| t.JobID = profileData.JobID |
There was a problem hiding this comment.
JobID format inconsistency breaks timeline job filtering
High Severity
The TASK_PROFILE_EVENT handler stores profileData.JobID directly without converting it from base64 to hex format. Other event handlers (JOB_DEFINITION_EVENT at line 579, DRIVER_JOB_LIFECYCLE_EVENT at line 639) convert JobID using utils.ConvertBase64ToHex. This means tasks created from TASK_PROFILE_EVENT have base64-encoded JobIDs while the /api/jobs endpoint returns hex-encoded JobIDs. When users filter timeline by job_id from /api/jobs, the string comparison in GetTasksByJobID fails because the formats don't match. The e2e test works around this by manually converting hex to base64, but API consumers would hit this inconsistency.
There was a problem hiding this comment.
I think the ID encoding issue will need more discussion,
so we may want to keep the current behavior for now ?
… to hex Signed-off-by: AndySung320 <andysung0320@gmail.com>
Signed-off-by: AndySung320 <andysung0320@gmail.com>




Why are these changes needed?
This PR implements the
/api/v0/tasks/timelineendpoint for History Server to support task execution timeline visualization.Currently, Ray Dashboard provides timeline data only for live clusters. Once a cluster is deleted, this profiling information is lost. This implementation enables users to query task timelines from historical data stored in MinIO.
The timeline data is formatted in Chrome Tracing Format.
Manual test: #4437 (comment)
Related issue number
Closes #4390
Checks