
[Feature][history server] support endpoint /api/v0/tasks/timeline#4437

Open
AndySung320 wants to merge 29 commits into ray-project:master from AndySung320:feature/history-sever-task-timeline

Conversation

@AndySung320 (Contributor) commented Jan 25, 2026

Why are these changes needed?

This PR implements the /api/v0/tasks/timeline endpoint for the History Server to support task execution timeline visualization.

Currently, Ray Dashboard provides timeline data only for live clusters. Once a cluster is deleted, this profiling information is lost. This implementation enables users to query task timelines from historical data stored in MinIO.

The timeline data is formatted in Chrome Tracing Format.

Manual test: #4437 (comment)

Related issue number

Closes #4390

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@AndySung320 (Contributor, Author) commented Jan 25, 2026

Manual test:

  1. Follow the setup guide here, but skip the "7. Delete Ray Cluster (Trigger Log Upload)" step
  2. Get timeline from the live cluster
SESSION="live"
curl -c ~/cookies.txt "http://localhost:8080/enter_cluster/default/raycluster-historyserver/$SESSION"
curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks/timeline" 
[screenshot: test1]
  3. Delete the RayCluster to trigger the upload
  4. Get timeline from the dead cluster
curl "http://localhost:8080/clusters"  # to get the session number
SESSION="YOUR SESSION"
curl -c ~/cookies.txt "http://localhost:8080/enter_cluster/default/raycluster-historyserver/$SESSION"
curl -b ~/cookies.txt "http://localhost:8080/api/v0/tasks/timeline"
[screenshot: test3]

Missing Profile Events

Based on testing, the following task detail events are often not present in the collected logs:

  • task:deserialize_arguments
  • task:execute
  • task:store_outputs

The root cause is still under investigation.

Additional Events

The history server may include extra events not shown in Dashboard:

  • submit_task events from the driver node
    Note: In the Ray Dashboard API, ProfileEvents is defined as optional (not repeated) in the Protobuf. When events are merged with MergeFrom, profile events from different component types overwrite each other. Because worker events usually arrive later than driver events, the driver’s profile events get overwritten by the worker’s.
    In contrast, the event export API emits driver and worker events as separate RayEvents, so they are never merged and each is sent independently. That is why the history server can show these extra events.

For best results, wait 1-2 minutes after job completion before deleting the cluster.

@AndySung320 marked this pull request as ready for review on January 26, 2026 22:22
@AndySung320 force-pushed the feature/history-sever-task-timeline branch from 7889ccb to bca027a on January 26, 2026 22:54
return fmt.Errorf("event does not have 'taskProfileEvents'")
}

var profileData struct {
Collaborator

Would it be better not to use an anonymous struct?

Contributor Author

Yes, I will refactor it.

}

// Convert events to ProfileEventRaw format
var rawEvents []types.ProfileEventRaw
Collaborator

nit:

Suggested change:
- var rawEvents []types.ProfileEventRaw
+ var rawEvents = make([]types.ProfileEventRaw, 0, len(profileData.ProfileEvents.Events))

Contributor Author

Thanks for pointing it out~

Comment on lines 610 to 619
existingKeys := make(map[eventKey]bool)
for _, e := range t.ProfileData.Events {
existingKeys[eventKey{e.EventName, e.StartTime, e.EndTime}] = true
}

for _, e := range rawEvents {
key := eventKey{e.EventName, e.StartTime, e.EndTime}
if !existingKeys[key] {
t.ProfileData.Events = append(t.ProfileData.Events, e)
existingKeys[key] = true
Collaborator

nit:

Suggested change:
- existingKeys := make(map[eventKey]bool)
- for _, e := range t.ProfileData.Events {
-     existingKeys[eventKey{e.EventName, e.StartTime, e.EndTime}] = true
- }
- for _, e := range rawEvents {
-     key := eventKey{e.EventName, e.StartTime, e.EndTime}
-     if !existingKeys[key] {
-         t.ProfileData.Events = append(t.ProfileData.Events, e)
-         existingKeys[key] = true
+ existingKeys := make(map[eventKey]struct{})
+ for _, e := range t.ProfileData.Events {
+     existingKeys[eventKey{e.EventName, e.StartTime, e.EndTime}] = struct{}{}
+ }
+ for _, e := range rawEvents {
+     key := eventKey{e.EventName, e.StartTime, e.EndTime}
+     if _, ok := existingKeys[key]; !ok {
+         t.ProfileData.Events = append(t.ProfileData.Events, e)
+         existingKeys[key] = struct{}{}

Comment on lines 391 to 392
args, ok := event["args"].(map[string]any)
gg.Expect(ok).To(BeTrue(), "process_name should have args")
Collaborator

It seems ok indicates whether the type assertion succeeded or not, which might be a bit confusing given the following error message. May I know what the aim is here?

Contributor Author

Good catch! The ok variable checks whether the type assertion succeeded (i.e., whether event["args"] is a map[string]any), but the error message "process_name should have args" might be misleading, since it could be read as checking for the existence of args rather than its type.

I can refactor this to make it more explicit. The same applies to line 407:

argsAny, exists := event["args"]
gg.Expect(exists).To(BeTrue(), "process_name should have 'args' field")

args, ok := argsAny.(map[string]any)
gg.Expect(ok).To(BeTrue(), "process_name args should be a map[string]any")

Contributor Author

updated in 218d97b.

@@ -223,11 +224,23 @@ func (h *EventHandler) storeEvent(eventMap map[string]any) error {
taskMap.CreateOrMergeAttempt(currTask.TaskID, currTask.AttemptNumber, func(t *types.Task) {
// Merge definition fields (preserve existing Events if any)
Collaborator

Suggested change:
- // Merge definition fields (preserve existing Events if any)
+ // Merge definition fields (preserve existing Events, ProfileData, and identifiers if any)

nit

}

// Direct mapping for known event names
switch eventName {
@machichima (Collaborator) commented Jan 28, 2026

I think this is referenced from https://github.com/ray-project/ray/blob/68d01c4c48a59c7768ec9c2359a1859966c446b6/python/ray/_private/profiling.py#L25-L25?
Could we add a comment here noting that this part follows that Ray code?

Contributor Author

Sure, I will add it.

Comment on lines 963 to 980
if strings.HasPrefix(profEvent.EventName, "task::") {
if extraData != nil {
if name, ok := extraData["name"].(string); ok {
args["name"] = name
}
}
}

// Determine event name for display
eventName := profEvent.EventName
displayName := profEvent.EventName

// For overall task events like "task::slow_task", use the full name from extraData
if strings.HasPrefix(profEvent.EventName, "task::") && extraData != nil {
if name, ok := extraData["name"].(string); ok {
displayName = name
}
}
Collaborator

I think those two if blocks can be merged? Can we just do:

  // Determine event name for display
  eventName := profEvent.EventName
  displayName := profEvent.EventName

  // For overall task events like "task::slow_task", extract name from extraData
  if strings.HasPrefix(profEvent.EventName, "task::") && extraData != nil {
      if name, ok := extraData["name"].(string); ok {
          args["name"] = name
          displayName = name
      }
  }

Contributor Author

Sure, updated in 18a6f71.

Name: "thread_name",
PID: pid,
TID: &tidVal,
Phase: "M",
Collaborator

Just curious, what do phases "M" and "X" mean? Could we document them in the ChromeTraceEvent struct? It's not obvious without a comment.

Contributor Author

"M" means Metadata event (e.g., process_name, thread_name)
"X" means Complete event (duration events with start and end times)
I will add the comment to the ChromeTraceEvent struct.
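
For reference, a minimal sketch of what the documented struct could look like (the field set is assumed from the snippets in this PR, not the actual definition):

```go
// ChromeTraceEvent is one entry in a Chrome Tracing Format dump.
// Phase follows the Trace Event Format:
//   "M" - metadata event (e.g. process_name, thread_name)
//   "X" - complete event (a duration event with a start time and a duration)
type ChromeTraceEvent struct {
	Name  string         `json:"name"`
	Phase string         `json:"ph"`
	PID   int            `json:"pid"`
	TID   *int           `json:"tid,omitempty"`
	Args  map[string]any `json:"args,omitempty"`
}
```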

continue
}
nodeIP := task.ProfileData.NodeIPAddress
workerID := task.ProfileData.ComponentID
Collaborator

In Ray Dashboard, component types that are not worker or driver are filtered out. Do we need to add that here?

https://github.com/ray-project/ray/blob/68d01c4c48a59c7768ec9c2359a1859966c446b6/python/ray/_private/profiling.py#L166-L167

Contributor Author

Yes, thanks for pointing this out.
We should filter them out to align with Ray Dashboard.
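
A minimal sketch of the kind of filter meant here, assuming the ComponentType field name used elsewhere in this PR:

```go
// Skip component types other than worker and driver, mirroring Ray Dashboard.
componentType := task.ProfileData.ComponentType
if componentType != "worker" && componentType != "driver" {
	continue
}
```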

Comment on lines +107 to +110
StartTime int64 `json:"start_time"` // nanoseconds
EndTime int64 `json:"end_time"` // nanoseconds
Collaborator

Contributor Author

Sure, I will add a comment.

continue
}
nodeIP := task.ProfileData.NodeIPAddress
workerID := task.ProfileData.ComponentID
Collaborator

In Ray Dashboard, they use component type + component id to form a cluster id, which I think is similar to our workerID here? Could you explain why we use just workerID rather than a cluster id that takes the different component types into account?

https://github.com/ray-project/ray/blob/68d01c4c48a59c7768ec9c2359a1859966c446b6/python/ray/_private/profiling.py#L164-L164

Contributor Author

You’re right that in Ray Dashboard they form a single logical id from component_type and component_id (e.g. "worker:xyz", "driver:abc"), which is effectively the “cluster id” for that component.
In the current implementation I only used ComponentID (the raw id) as the key and didn't take ComponentType into account; I didn't consider that distinction at the time.

Using the composite key is safer (no collision if the same id is reused across driver/worker on a node) and matches the Dashboard. I’ll update our code with that and use component_type + ":" + component_id for both the internal key and the displayed thread name.
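
A one-line sketch of the composite key described above (field names assumed from the surrounding snippets):

```go
// Composite key avoids collisions if the same raw id were reused across driver/worker on a node.
clusterID := task.ProfileData.ComponentType + ":" + task.ProfileData.ComponentID
```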

Contributor Author

updated in 18a6f71

Comment on lines 235 to 262
if existingProfileData != nil {
t.ProfileData = existingProfileData
}
if existingNodeID != "" {
t.NodeID = existingNodeID
}
if existingWorkerID != "" {
t.WorkerID = existingWorkerID
}
@fscnick (Collaborator) commented Jan 30, 2026

What do these lines mean? It seems to read existingProfileData from t.ProfileData and then set it back.

Contributor Author

These lines preserve ProfileData, NodeID, and WorkerID that may have been set by earlier events (TASK_PROFILE_EVENT or TASK_LIFECYCLE_EVENT) before the TASK_DEFINITION_EVENT arrives. Since events are processed concurrently and order is not guaranteed, we save these values before overwriting with *t = currTask (line 230), then restore them to prevent data loss.
This follows the same pattern as ACTOR_DEFINITION_EVENT handling.
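
A condensed sketch of the save-then-restore pattern described here, using the variable names from the quoted diff above:

```go
// Save fields that earlier, out-of-order events may already have populated.
existingProfileData := t.ProfileData
existingNodeID := t.NodeID
existingWorkerID := t.WorkerID

// Overwrite the task with the definition event's fields.
*t = currTask

// Restore the previously collected values so they are not lost.
if existingProfileData != nil {
	t.ProfileData = existingProfileData
}
if existingNodeID != "" {
	t.NodeID = existingNodeID
}
if existingWorkerID != "" {
	t.WorkerID = existingWorkerID
}
```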

Collaborator

Thanks for the explanation.

// Skip if clusterID is empty (consistent with first pass)
if clusterID == "" {
continue
}

Empty string check for clusterID is always false

Low Severity

The check if clusterID == "" can never be true because clusterID is constructed as ComponentType + ":" + ComponentID. Even when both fields are empty, clusterID will be ":", not an empty string. The comment says "Skip if clusterID is empty" but this code path is unreachable. Similarly, the check if clusterID != "" at line 871 always evaluates to true. If the intent is to skip tasks with empty ComponentID, the check needs to validate ComponentID directly rather than checking the concatenated string.


Contributor Author

updated in 83ba2dc

if name, ok := extraData["name"].(string); ok {
displayName = name
args["name"] = name
}

Missing empty string check for displayName assignment

Low Severity

The condition at line 979 checks if name, ok := extraData["name"].(string); ok without verifying that name is non-empty. This is inconsistent with line 626 which correctly checks ok && name != "". If extraData["name"] is an empty string, displayName will become empty instead of keeping the original profEvent.EventName as a fallback. This would result in trace events with empty names in the Chrome Tracing visualization.
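
A sketch of the fallback-preserving check the finding suggests, mirroring the ok && name != "" pattern it references:

```go
// Only override the display name when extraData carries a non-empty name;
// otherwise keep profEvent.EventName as the fallback.
if name, ok := extraData["name"].(string); ok && name != "" {
	displayName = name
	args["name"] = name
}
```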


Contributor Author

updated in 83ba2dc

}
if existingWorkerID != "" {
t.WorkerID = existingWorkerID
}

TASK_DEFINITION_EVENT overwrites FuncOrClassName set by profile event

Medium Severity

When TASK_PROFILE_EVENT arrives before TASK_DEFINITION_EVENT, the FuncOrClassName and Name fields extracted from the profile event's extraData are lost. The TASK_DEFINITION_EVENT handler preserves Events, ProfileData, NodeID, and WorkerID before doing *t = currTask, but doesn't preserve FuncOrClassName or Name. Since profile events contain more detailed naming information (e.g., task::Counter.increment format for actor methods), this causes the timeline to show less informative or empty function names when events arrive out of order.
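
One possible shape of the fix, extending the preserve-then-restore pattern above to the naming fields named in the finding (whether to always prefer the profile event's more detailed name is a judgment call):

```go
// Also carry over naming fields that an earlier profile event may have set.
existingFuncOrClassName := t.FuncOrClassName
existingName := t.Name

*t = currTask

if existingFuncOrClassName != "" {
	t.FuncOrClassName = existingFuncOrClassName
}
if existingName != "" {
	t.Name = existingName
}
```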


Contributor Author

updated in ddbc3d0

// Check if all Fs (no actor)
if actorPortion == "ffffffffffffffffffffffff" {
return ""
}

Case-sensitive hex comparison may fail for uppercase IDs

Low Severity

The extractActorIDFromTaskID function checks if a task has no associated actor by comparing actorPortion == "ffffffffffffffffffffffff" (all lowercase). This comparison is case-sensitive, so if Ray ever outputs task IDs with uppercase hex characters (e.g., "FFFFFFFFFFFFFFFFFFFFFFFF"), the check would fail. This would cause tasks without actors to incorrectly have an actor_id value set in the timeline output instead of nil.
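
A sketch of a case-insensitive variant of the check, using strings.EqualFold from the standard library:

```go
// Treat both lowercase and uppercase hex as "no actor".
if strings.EqualFold(actorPortion, "ffffffffffffffffffffffff") {
	return ""
}
```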


Contributor Author

updated in 6e3e214

@AndySung320 force-pushed the feature/history-sever-task-timeline branch from ddbc3d0 to af41ebf on February 4, 2026 05:02
@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

jobID := req.QueryParameter("job_id")
download := req.QueryParameter("download")

timeline := s.eventHandler.GetTasksTimeline(clusterNameID, jobID)

Missing session name in cluster key for timeline lookup

High Severity

The getTasksTimeline function constructs clusterNameID as clusterName + "_" + clusterNamespace, but task data is stored using utils.BuildClusterSessionKey(clusterName, clusterNamespace, sessionName) which includes the session name. This key mismatch causes GetTasksTimeline to always return an empty timeline since it can't find any tasks. Other task endpoints like getTaskSummarize and getTasks correctly use utils.BuildClusterSessionKey.
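
A sketch of the key fix the finding describes; the exact signature of utils.BuildClusterSessionKey is assumed from its usage in the other task endpoints:

```go
// Use the same key format the task store uses when writing, which includes the session name.
clusterNameID := utils.BuildClusterSessionKey(clusterName, clusterNamespace, sessionName)
timeline := s.eventHandler.GetTasksTimeline(clusterNameID, jobID)
```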


Contributor Author

updated in b56e672

@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

for _, task := range tasks {
if task.ProfileData == nil {
continue
}

Inconsistent filtering causes orphan metadata events

Low Severity

The first pass that collects node/worker mappings only checks task.ProfileData == nil, while the second pass that generates trace events additionally checks len(task.ProfileData.Events) == 0. This discrepancy causes process_name and thread_name metadata events to be emitted for nodes/workers that have no actual trace events, producing orphaned metadata entries in the Chrome Tracing output.
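
A sketch of aligning the first pass with the second pass's filter, as the finding suggests (names taken from the quoted code):

```go
// First pass: skip tasks that would produce no trace events, so no orphan
// process_name/thread_name metadata is emitted for their node/worker.
if task.ProfileData == nil || len(task.ProfileData.Events) == 0 {
	continue
}
```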


Contributor Author

updated in 0439a98

@AndySung320 (Contributor, Author)

Hi @Future-Outlier, PTAL ~

Comment on lines +787 to +791
type eventKey struct {
EventName string
StartTime int64
EndTime int64
}
Collaborator

Just curious, is using only EventName as the key not enough?

Contributor Author

I think this key is sufficient because:

  1. Deduplication is scoped to a single task attempt (TaskID + AttemptNumber already isolate different tasks)
  2. Ray's profiling semantics: a task cannot have two events with the same name at the same time
  3. Consistent with TASK_LIFECYCLE_EVENT deduplication pattern (State + Timestamp)

StartTime int64
EndTime int64
}
existingKeys := make(map[eventKey]struct{})
Collaborator

Suggested change:
- existingKeys := make(map[eventKey]struct{})
+ existingKeys := make(map[eventKey]struct{}, len(t.ProfileData.Events)+len(rawEvents))

I think we can pre-allocate the size to prevent resizing, as we already know the maximum number of elements we will have.

Contributor Author

sure~

Contributor Author

updated in 7249ace


// Extract func_or_class_name from extraData if available
for _, e := range rawEvents {
if strings.HasPrefix(e.EventName, "task::") && e.ExtraData != "" {
Collaborator

nit: I think we could extract task:: into a constant like TaskPrefix or something similar, as I saw this in multiple places.

Just minor, no need to do it in this PR.

Contributor Author

Yeah, we can do it in another PR.
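
For reference, the kind of constant the nit refers to (the name is hypothetical):

```go
// Hypothetical constant for the repeated "task::" event-name prefix.
const taskEventNamePrefix = "task::"
```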

Comment on lines 1084 to 1087
// Skip if clusterID is empty
if clusterID == ":" {
continue
}
Collaborator

Will there be a case where ComponentType has a value but ComponentID is empty?

Contributor Author

Based on the logs I've seen so far, I haven't seen any case where ComponentType has a value while ComponentID is empty, or vice versa.
If it did happen, the impact would be minimal (something like a display issue).
I think we can fix it if this issue does occur in the future. What do you think?

Collaborator

SG!

Comment on lines +1059 to +1061
// Build PID/TID mappings
// PID: Node IP -> numeric ID
// TID: clusterID (componentType:componentId) -> numeric ID per node
Collaborator

It seems like the PID and TID here are not taken from the Ray event, but created here by a counter (0, 1, 2, ...)? Is that what Ray Dashboard does, or is it just an alternative because we currently cannot get PID and TID from the Ray exported event?

Contributor Author

Yes, PID and TID here are created by counters (0, 1, 2, …), not read from the Ray event payload. This is the same approach Ray Dashboard uses.
Ray’s task profile events do not include pid or tid at all; they only have node_ip_address, component_type, and component_id (WorkerID). The Dashboard’s timeline is built by chrome_tracing_dump() in ray/_private/profiling.py.

The PID logic matches Ray's, but the TID logic slightly differs: Ray uses globally unique TIDs, while our Go implementation assigns TIDs per node. Do we need to switch to globally unique TIDs to align with Ray?
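
A condensed sketch of the counter-based assignment described above (the map names are assumed, not the actual implementation):

```go
// Assign synthetic PIDs per node IP and synthetic TIDs per component within a node,
// since Ray's exported profile events carry no real pid/tid.
pidByNodeIP := map[string]int{}
tidByNode := map[string]map[string]int{} // node IP -> clusterID -> tid

if _, ok := pidByNodeIP[nodeIP]; !ok {
	pidByNodeIP[nodeIP] = len(pidByNodeIP) // next PID: 0, 1, 2, ...
}
if _, ok := tidByNode[nodeIP]; !ok {
	tidByNode[nodeIP] = map[string]int{}
}
if _, ok := tidByNode[nodeIP][clusterID]; !ok {
	tidByNode[nodeIP][clusterID] = len(tidByNode[nodeIP]) // next TID within this node
}
```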

Collaborator

Yes, I think it's better for us to fully follow what the Ray Dashboard API does, if possible. While this is minor, I think it's fine to do it in a follow-up PR.

You can create a refactor issue to deal with this and the following comments:

}
}

func extractActorIDFromTaskID(taskIDHex string) string {
Collaborator

Could you note where the rules used in this function come from (a link to the Ray code) for future reference?

Contributor Author

Sure, I will add a comment above this function.
We can trace it back to src/ray/design_docs/id_specification.md, which shows TaskID = 8B unique + 16B ActorID, and ActorID = 12B unique + 4B JobID.

Contributor Author

updated in 7249ace


ws.Route(ws.GET("/v0/tasks/timeline").To(s.getTasksTimeline).Filter(s.CookieHandle).
Doc("get tasks timeline").
Param(ws.QueryParameter("job_id", "filter by job_id")).
Collaborator

Should we also add download here?

Contributor Author

Yes, thanks for pointing this out, I will add it.

Contributor Author

updated in 9f8d4a6

return
}

resp.Header().Set("Content-Type", "application/json")
Collaborator

Could we use .Produces(restful.MIME_JSON) at the router level instead, to follow how the other endpoints do it?

Contributor Author

Sure, thanks for the suggestion.

Contributor Author

updated in 9f8d4a6

@Future-Outlier (Member)

Hi @AndySung320, do you mind helping me merge master?

@AndySung320 force-pushed the feature/history-sever-task-timeline branch from 9f8d4a6 to ce1a652 on February 5, 2026 06:05
@Future-Outlier (Member) left a comment

  1. Both dead and live clusters work in my env.
  2. Can you add tests for the query parameters?
    1. download
    2. job_id
  3. For all variables that include DTO in their names, I'd like to replace them.
    Could you help open an issue and assign it to yourself?
    You can work on it as a separate PR.
  4. I trust @machichima's and @fscnick's detailed reviews.
  5. I checked Ray's source code; the endpoint's response is the same as Ray's dashboard, so high-level LGTM.

@machichima (Collaborator) left a comment

LGTM! Thank you

Just a comment for follow-up issue: #4437 (comment)

@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

// This shouldn't happen if first pass worked correctly,
// but skip to avoid null TID
continue
}

Inconsistent TID pointer pattern compared to nearby code

Low Severity

The code at line 1152 takes the address of tid directly (tidPtr = &tid) without first copying to a local variable. However, the metadata event generation code just 40 lines above (lines 1113-1118) uses an explicit copy pattern (tidVal := tid) before taking the address. This inconsistency within the same function is a code smell. While Go's escape analysis makes both patterns work correctly, the explicit copy pattern is more defensive and consistent with the existing code style.
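
A sketch of the explicit-copy pattern the finding refers to, with names from the quoted text:

```go
// Copy to a local variable before taking the address, matching the metadata-event path above.
tidVal := tid
tidPtr := &tidVal
```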


Contributor Author

updated in 811d9b6

// Skip if clusterID is empty
if clusterID == ":" {
continue
}

Empty ComponentID check is unreachable due to prior filter

Medium Severity

The check clusterID == ":" is unreachable and provides no protection. Since componentType is filtered to only "worker" or "driver" at lines 1074-1076, clusterID (built as ComponentType + ":" + ComponentID) will always be "worker:..." or "driver:...", never just ":". If ComponentID is empty, tasks would incorrectly share the same TID in timeline visualization (e.g., all workers with empty ComponentID on a node would appear as one thread).
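
A sketch of checking the component id directly, as the finding suggests (field names from the quoted code):

```go
// Guard on the raw ComponentID rather than the concatenated clusterID,
// so tasks with an empty component id are actually skipped.
if task.ProfileData.ComponentID == "" {
	continue
}
```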


Contributor Author

updated in 811d9b6

}
if t.JobID == "" {
t.JobID = profileData.JobID
}

Missing AttemptNumber initialization in profile event handler

Medium Severity

The TASK_PROFILE_EVENT handler sets TaskID and JobID on the task if empty, but does not set AttemptNumber. While profileData.AttemptNumber is used for storage indexing via CreateOrMergeAttempt, the Task.AttemptNumber field remains at its default value (0). The timeline endpoint later outputs task.AttemptNumber, resulting in incorrect attempt_number values for tasks that only have profile events without a corresponding definition event.
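
One possible shape of the fix described here (hedged; whether to set it unconditionally or only when the attempt entry is first created is a judgment call):

```go
// Carry the attempt number from the profile event onto the task,
// mirroring how TaskID and JobID are filled in above.
t.AttemptNumber = profileData.AttemptNumber
```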


Contributor Author

updated in 811d9b6

@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

t.TaskID = profileData.TaskID
}
if t.JobID == "" {
t.JobID = profileData.JobID

JobID format inconsistency breaks timeline job filtering

High Severity

The TASK_PROFILE_EVENT handler stores profileData.JobID directly without converting it from base64 to hex format. Other event handlers (JOB_DEFINITION_EVENT at line 579, DRIVER_JOB_LIFECYCLE_EVENT at line 639) convert JobID using utils.ConvertBase64ToHex. This means tasks created from TASK_PROFILE_EVENT have base64-encoded JobIDs while the /api/jobs endpoint returns hex-encoded JobIDs. When users filter timeline by job_id from /api/jobs, the string comparison in GetTasksByJobID fails because the formats don't match. The e2e test works around this by manually converting hex to base64, but API consumers would hit this inconsistency.
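
A sketch of the consistency fix described above, assuming utils.ConvertBase64ToHex returns a hex string as its usage in the other handlers suggests:

```go
// Store the job id in hex so it matches what /api/jobs returns
// and what GetTasksByJobID compares against.
if t.JobID == "" {
	t.JobID = utils.ConvertBase64ToHex(profileData.JobID)
}
```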


Contributor Author

I think the ID encoding issue will need more discussion, so we may want to keep the current behavior for now?

Contributor Author

fixed in 701b0c6
