Skip to content
Open
Show file tree
Hide file tree
Changes from 27 commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
db17d4f
feat: add ChromeTraceEvent type for timeline API
AndySung320 Jan 25, 2026
3ce0d4d
feat: implement GetTasksTimeline for /api/v0/tasks/timeline endpoint
AndySung320 Jan 25, 2026
c12d717
feat: add /api/v0/tasks/timeline endpoint
AndySung320 Jan 25, 2026
06df4cc
remove logging...
AndySung320 Jan 25, 2026
16c372d
add e2e test for endpoint api/v0/tasks/timeline
AndySung320 Jan 26, 2026
12f65ef
fix GetTasksTimeline, correct accidental deletion during conflict res…
AndySung320 Jan 27, 2026
270f836
test: clarify schema assertions for trace events
AndySung320 Jan 27, 2026
164bff1
refactor: replace anonymous struct with named type
AndySung320 Jan 27, 2026
4a7e8dd
fix: preserve ProfileData when handling TASK_DEFINITION_EVENT
AndySung320 Jan 27, 2026
95d041c
docs: add comments for TASK_PROFILE_EVENT DTO
AndySung320 Jan 27, 2026
b6d574f
add comment, correct timestamp conversion from microseconds to millis…
AndySung320 Jan 28, 2026
48d6dda
fix timestamp conversion
AndySung320 Jan 28, 2026
654fb87
fix: avoid null tid in trace events when workerID is empty
AndySung320 Jan 28, 2026
6b78b07
docs: add comments to ChromeTraceEvent struct
AndySung320 Jan 30, 2026
caf17c1
fix logic to match Ray Dashboard implementation
AndySung320 Jan 30, 2026
910ea55
add download parameter
AndySung320 Jan 30, 2026
fbcee06
set Content-Type for JSON response
AndySung320 Jan 30, 2026
13fb028
fix minor
AndySung320 Jan 30, 2026
f0381bf
preserve FuncOrClassName and Name when handling TASK_DEFINITION_EVENT
AndySung320 Jan 30, 2026
0d6e6c5
update getTasksTimeline to use clusterSessionKey
AndySung320 Feb 4, 2026
8e26017
minor fix of extractActorIDFromTaskID
AndySung320 Feb 4, 2026
b6aee6a
Fix inconsistent filtering of tasks without profile events
AndySung320 Feb 4, 2026
fe2674d
add comment for extractActorIDFromTaskID
AndySung320 Feb 5, 2026
ce1a652
Declare JSON response type at router level
AndySung320 Feb 5, 2026
5441ec8
e2e: add timeline endpoint tests for job_id filtering and download mode
AndySung320 Feb 6, 2026
cc5eb86
modify test comment
AndySung320 Feb 6, 2026
811d9b6
fix task attempt number, unreachable ComponentID check, and TID point…
AndySung320 Feb 6, 2026
701b0c6
fix id conversion (task_id, job_id, worker_id, driver_id) from base64…
AndySung320 Feb 6, 2026
ba135b0
remove hexToBase64 fuction
AndySung320 Feb 6, 2026
7610784
better handle base64 to hex to align job endpoint's logic
Future-Outlier Feb 7, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
366 changes: 365 additions & 1 deletion historyserver/pkg/eventserver/eventserver.go
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@ import (
"io"
"regexp"
"sort"
"strconv"
"strings"
"sync"
"time"
Expand Down Expand Up @@ -226,12 +227,38 @@ func (h *EventHandler) storeEvent(eventMap map[string]any) error {

taskMap := h.ClusterTaskMap.GetOrCreateTaskMap(currentClusterName)
taskMap.CreateOrMergeAttempt(currTask.TaskID, currTask.AttemptNumber, func(t *types.Task) {
// Merge definition fields (preserve existing Events if any)
// Merge definition fields (preserve existing Events, ProfileData, and identifiers if any)
existingEvents := t.Events
existingProfileData := t.ProfileData
existingNodeID := t.NodeID
existingWorkerID := t.WorkerID
existingFuncOrClassName := t.FuncOrClassName
existingName := t.Name

*t = currTask

// Restore lifecycle-derived fields (from TASK_LIFECYCLE_EVENT)
if len(existingEvents) > 0 {
t.Events = existingEvents
t.State = existingEvents[len(existingEvents)-1].State
if existingNodeID != "" {
t.NodeID = existingNodeID
}
if existingWorkerID != "" {
t.WorkerID = existingWorkerID
}
}

// Restore profile-derived fields (from TASK_PROFILE_EVENT)
// All three come from the same event, so check together
if existingProfileData != nil {
t.ProfileData = existingProfileData
if existingFuncOrClassName != "" {
t.FuncOrClassName = existingFuncOrClassName
}
if existingName != "" {
t.Name = existingName
}
}
Comment on lines 266 to 274
Copy link
Collaborator

@fscnick fscnick Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does these lines mean ? It seems to get existingProfileData from t.ProfileData and set it back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines preserve ProfileData, NodeID, and WorkerID that may have been set by earlier events (TASK_PROFILE_EVENT or TASK_LIFECYCLE_EVENT) before the TASK_DEFINITION_EVENT arrives. Since events are processed concurrently and order is not guaranteed, we save these values before overwriting with *t = currTask (line 230), then restore them to prevent data loss.
This follows the same pattern as ACTOR_DEFINITION_EVENT handling.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explanation.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TASK_DEFINITION_EVENT overwrites FuncOrClassName set by profile event

Medium Severity

When TASK_PROFILE_EVENT arrives before TASK_DEFINITION_EVENT, the FuncOrClassName and Name fields extracted from the profile event's extraData are lost. The TASK_DEFINITION_EVENT handler preserves Events, ProfileData, NodeID, and WorkerID before doing *t = currTask, but doesn't preserve FuncOrClassName or Name. Since profile events contain more detailed naming information (e.g., task::Counter.increment format for actor methods), this causes the timeline to show less informative or empty function names when events arrive out of order.

Additional Locations (1)

Fix in Cursor Fix in Web

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated in ddbc3d0

})

Expand Down Expand Up @@ -694,6 +721,105 @@ func (h *EventHandler) storeEvent(eventMap map[string]any) error {
j.EndTime = lastStateTransition.Timestamp
}
})
case types.TASK_PROFILE_EVENT:
taskProfileEvent, ok := eventMap["taskProfileEvents"]
if !ok {
return fmt.Errorf("event does not have 'taskProfileEvents'")
}
jsonBytes, err := json.Marshal(taskProfileEvent)
if err != nil {
return err
}

var profileData types.TaskProfileEventDTO
if err := json.Unmarshal(jsonBytes, &profileData); err != nil {
logrus.Errorf("Failed to unmarshal TASK_PROFILE_EVENT: %v", err)
return err
}

if profileData.TaskID == "" || len(profileData.ProfileEvents.Events) == 0 {
logrus.Debugf("TASK_PROFILE_EVENT has no taskId or events, skipping")
return nil
}

// Convert events to ProfileEventRaw format
var rawEvents = make([]types.ProfileEventRaw, 0, len(profileData.ProfileEvents.Events))
for _, e := range profileData.ProfileEvents.Events {
startTime, err := strconv.ParseInt(e.StartTime, 10, 64)
if err != nil {
logrus.Warnf("Failed to parse StartTime '%s': %v", e.StartTime, err)
continue
}
endTime, err := strconv.ParseInt(e.EndTime, 10, 64)
if err != nil {
logrus.Warnf("Failed to parse EndTime '%s': %v", e.EndTime, err)
continue
}

rawEvents = append(rawEvents, types.ProfileEventRaw{
EventName: e.EventName,
StartTime: startTime,
EndTime: endTime,
ExtraData: e.ExtraData,
})
}

taskMap := h.ClusterTaskMap.GetOrCreateTaskMap(currentClusterName)
taskMap.CreateOrMergeAttempt(profileData.TaskID, profileData.AttemptNumber, func(t *types.Task) {
// Ensure core identifiers are set
if t.TaskID == "" {
t.TaskID = profileData.TaskID
}
if t.JobID == "" {
t.JobID = profileData.JobID
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JobID format inconsistency breaks timeline job filtering

High Severity

The TASK_PROFILE_EVENT handler stores profileData.JobID directly without converting it from base64 to hex format. Other event handlers (JOB_DEFINITION_EVENT at line 579, DRIVER_JOB_LIFECYCLE_EVENT at line 639) convert JobID using utils.ConvertBase64ToHex. This means tasks created from TASK_PROFILE_EVENT have base64-encoded JobIDs while the /api/jobs endpoint returns hex-encoded JobIDs. When users filter timeline by job_id from /api/jobs, the string comparison in GetTasksByJobID fails because the formats don't match. The e2e test works around this by manually converting hex to base64, but API consumers would hit this inconsistency.

Fix in Cursor Fix in Web

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the ID encoding issue will need more discussion,
so we may want to keep the current behavior for now ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 701b0c6

}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing AttemptNumber initialization in profile event handler

Medium Severity

The TASK_PROFILE_EVENT handler sets TaskID and JobID on the task if empty, but does not set AttemptNumber. While profileData.AttemptNumber is used for storage indexing via CreateOrMergeAttempt, the Task.AttemptNumber field remains at its default value (0). The timeline endpoint later outputs task.AttemptNumber, resulting in incorrect attempt_number values for tasks that only have profile events without a corresponding definition event.

Fix in Cursor Fix in Web

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated in 811d9b6

// Set AttemptNumber to match the attempt we're merging into
t.AttemptNumber = profileData.AttemptNumber

// Initialize ProfileData if not exists
if t.ProfileData == nil {
t.ProfileData = &types.ProfileData{
ComponentID: profileData.ProfileEvents.ComponentID,
ComponentType: profileData.ProfileEvents.ComponentType,
NodeIPAddress: profileData.ProfileEvents.NodeIPAddress,
}
}

// Merge events with deduplication based on (eventName, startTime, endTime)
type eventKey struct {
EventName string
StartTime int64
EndTime int64
}
Comment on lines +813 to +817
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, only using EventName as key is not enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this key is sufficient because:

  1. Deduplication is scoped to a single task attempt (TaskID + AttemptNumber already isolate different tasks)
  2. Ray's profiling semantics: a task cannot have two events with the same name at the same time
  3. Consistent with TASK_LIFECYCLE_EVENT deduplication pattern (State + Timestamp)

existingKeys := make(map[eventKey]struct{}, len(t.ProfileData.Events)+len(rawEvents))
for _, e := range t.ProfileData.Events {
existingKeys[eventKey{e.EventName, e.StartTime, e.EndTime}] = struct{}{}
}
for _, e := range rawEvents {
key := eventKey{e.EventName, e.StartTime, e.EndTime}
if _, ok := existingKeys[key]; !ok {
t.ProfileData.Events = append(t.ProfileData.Events, e)
existingKeys[key] = struct{}{}
}
}

// Extract func_or_class_name from extraData if available
for _, e := range rawEvents {
if strings.HasPrefix(e.EventName, "task::") && e.ExtraData != "" {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think we may extract task:: into constant like TaskPrefix or something else, as I saw this in multiple places

Just minor, no need to be in this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, we can do it in other PR .

var extra map[string]interface{}
if err := json.Unmarshal([]byte(e.ExtraData), &extra); err == nil {
if name, ok := extra["name"].(string); ok && name != "" {
// For actor methods, name might be just "increment" or "get_count"
// But eventName has the full form like "task::Counter.increment"
// Use eventName to get the full func_or_class_name
t.FuncOrClassName = strings.TrimPrefix(e.EventName, "task::")
t.Name = name
}
}
}
}
})

default:
logrus.Infof("Event not supported, skipping: %v", eventMap)
}
Expand Down Expand Up @@ -915,3 +1041,241 @@ func (h *EventHandler) GetJobByJobID(clusterName, jobID string) (types.Job, bool
}
return job.DeepCopy(), true
}

// GetTasksTimeline returns timeline data in Chrome Tracing Format
// Output format matches Ray Dashboard's /api/v0/tasks/timeline endpoint
func (h *EventHandler) GetTasksTimeline(clusterName string, jobID string) []types.ChromeTraceEvent {
var tasks []types.Task
if jobID != "" {
tasks = h.GetTasksByJobID(clusterName, jobID)
} else {
tasks = h.GetTasks(clusterName)
}

if len(tasks) == 0 {
return []types.ChromeTraceEvent{}
}

events := []types.ChromeTraceEvent{}

// Build PID/TID mappings
// PID: Node IP -> numeric ID
// TID: clusterID (componentType:componentId) -> numeric ID per node
Comment on lines +1085 to +1087
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the PID and TID here are not getting from Ray Event, but created here by counter (0, 1, 2, ...)? Is it what Ray Dashboard do or is it just an alternative here as we currently cannot get PID and TID from Ray exported event?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, PID and TID here are created by counters (0, 1, 2, …), not read from the Ray event payload. This is the same approach Ray Dashboard uses.
Ray’s task profile events do not include pid or tid at all; they only have node_ip_address, component_type, and component_id (WorkerID). The Dashboard’s timeline is built by chrome_tracing_dump() in ray/_private/profiling.py.

The PID logic matches Ray’s, but the TID logic slightly differs: Ray uses globally unique TIDs, while our implementation with Go assigns TIDs per node. Do we need to switch to a globally unique TID to align with Ray?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it's better for us to totally follow how Ray Dashboard API did if possible. While this is minor, I think it's fine to do in a follow-up PR.

You can create a refactor issue to deal with this and following comments:

nodeIPToPID := make(map[string]int)
nodeIPToClusterIDToTID := make(map[string]map[string]int) // nodeIP -> clusterID (componentType:componentId) -> tid
pidCounter := 0
tidCounters := make(map[string]int) // per-node tid counter

// First pass: collect all unique nodes and workers
for _, task := range tasks {
if task.ProfileData == nil || len(task.ProfileData.Events) == 0 {
continue
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent filtering causes orphan metadata events

Low Severity

The first pass that collects node/worker mappings only checks task.ProfileData == nil, while the second pass that generates trace events additionally checks len(task.ProfileData.Events) == 0. This discrepancy causes process_name and thread_name metadata events to be emitted for nodes/workers that have no actual trace events, producing orphaned metadata entries in the Chrome Tracing output.

Additional Locations (1)

Fix in Cursor Fix in Web

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated in 0439a98

// Only include worker and driver components (consistent with Ray's profiling implementation in profiling.py)
componentType := task.ProfileData.ComponentType
if componentType != "worker" && componentType != "driver" {
continue
}

nodeIP := task.ProfileData.NodeIPAddress
clusterID := task.ProfileData.ComponentType + ":" + task.ProfileData.ComponentID

if nodeIP == "" {
continue
}
if _, exists := nodeIPToPID[nodeIP]; !exists {
nodeIPToPID[nodeIP] = pidCounter
pidCounter++
nodeIPToClusterIDToTID[nodeIP] = make(map[string]int)
tidCounters[nodeIP] = 0
}

if _, exists := nodeIPToClusterIDToTID[nodeIP][clusterID]; !exists {
nodeIPToClusterIDToTID[nodeIP][clusterID] = tidCounters[nodeIP]
tidCounters[nodeIP]++
}
}

// Generate process_name and thread_name metadata events
for nodeIP, pid := range nodeIPToPID {
events = append(events, types.ChromeTraceEvent{
Name: "process_name",
PID: pid,
TID: nil,
Phase: "M",
Args: map[string]interface{}{
"name": "Node " + nodeIP,
},
})

for clusterID, tid := range nodeIPToClusterIDToTID[nodeIP] {
tidVal := tid
events = append(events, types.ChromeTraceEvent{
Name: "thread_name",
PID: pid,
TID: &tidVal,
Phase: "M",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, what does phase "M" and "X" mean? Could we document it in ChromeTraceEvent struct? As it's not obvious without a comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"M" means Metadata event (e.g., process_name, thread_name)
"X" means Complete event (duration events with start and end times)
I will add the comment in ChromeTraceEvent struct.

Args: map[string]interface{}{
"name": clusterID,
},
})
}
}

// Generate trace events from ProfileData
for _, task := range tasks {
if task.ProfileData == nil || len(task.ProfileData.Events) == 0 {
continue
}
// Only include worker and driver components (consistent with Ray's profiling implementation in profiling.py)
componentType := task.ProfileData.ComponentType
if componentType != "worker" && componentType != "driver" {
continue
}

nodeIP := task.ProfileData.NodeIPAddress
clusterID := task.ProfileData.ComponentType + ":" + task.ProfileData.ComponentID

pid, ok := nodeIPToPID[nodeIP]
if !ok {
continue
}
var tidPtr *int
if tid, ok := nodeIPToClusterIDToTID[nodeIP][clusterID]; ok {
tidVal := tid
tidPtr = &tidVal
} else {
// This shouldn't happen if first pass worked correctly,
// but skip to avoid null TID
continue
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent TID pointer pattern compared to nearby code

Low Severity

The code at line 1152 takes the address of tid directly (tidPtr = &tid) without first copying to a local variable. However, the metadata event generation code just 40 lines above (lines 1113-1118) uses an explicit copy pattern (tidVal := tid) before taking the address. This inconsistency within the same function is a code smell. While Go's escape analysis makes both patterns work correctly, the explicit copy pattern is more defensive and consistent with the existing code style.

Additional Locations (1)

Fix in Cursor Fix in Web

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated in 811d9b6


for _, profEvent := range task.ProfileData.Events {
// Convert nanoseconds to microseconds
startTimeUs := float64(profEvent.StartTime) / 1000.0
durationUs := float64(profEvent.EndTime-profEvent.StartTime) / 1000.0

// Parse extraData for additional fields
var extraData map[string]interface{}
if profEvent.ExtraData != "" {
json.Unmarshal([]byte(profEvent.ExtraData), &extraData)
}

// Determine task_id and func_or_class_name
taskIDForArgs := task.TaskID
funcOrClassName := task.FuncOrClassName

// Try to get from extraData if available (for hex format task_id)
if extraData != nil {
if tid, ok := extraData["task_id"].(string); ok && tid != "" {
taskIDForArgs = tid
}
}

// Build args
actorID := extractActorIDFromTaskID(taskIDForArgs)
args := map[string]interface{}{
"task_id": taskIDForArgs,
"job_id": task.JobID,
"attempt_number": task.AttemptNumber,
"func_or_class_name": funcOrClassName,
"actor_id": nil,
}

if actorID != "" {
args["actor_id"] = actorID
}

// Determine event name for display
eventName := profEvent.EventName
displayName := profEvent.EventName

// For overall task events like "task::slow_task", use the full name from extraData
if strings.HasPrefix(profEvent.EventName, "task::") && extraData != nil {
if name, ok := extraData["name"].(string); ok && name != "" {
displayName = name
args["name"] = name
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing empty string check for displayName assignment

Low Severity

The condition at line 979 checks if name, ok := extraData["name"].(string); ok without verifying that name is non-empty. This is inconsistent with line 626 which correctly checks ok && name != "". If extraData["name"] is an empty string, displayName will become empty instead of keeping the original profEvent.EventName as a fallback. This would result in trace events with empty names in the Chrome Tracing visualization.

Fix in Cursor Fix in Web

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated in 83ba2dc

}

traceEvent := types.ChromeTraceEvent{
Category: eventName,
Name: displayName,
PID: pid,
TID: tidPtr,
Timestamp: &startTimeUs,
Duration: &durationUs,
Color: getChromeTraceColor(eventName),
Args: args,
Phase: "X",
}

events = append(events, traceEvent)
}
}

return events
}

// getChromeTraceColor maps event names to Chrome trace colors
// Based on Ray's _default_color_mapping in profiling.py
func getChromeTraceColor(eventName string) string {
// Handle task::xxx pattern (overall task event)
if strings.HasPrefix(eventName, "task::") {
return "generic_work"
}

// Direct mapping for known event names
// This logic follows Ray's profiling implementation:
// https://github.com/ray-project/ray/blob/68d01c4c48a59c7768ec9c2359a1859966c446b6/python/ray/_private/profiling.py#L25
switch eventName {
Copy link
Collaborator

@machichima machichima Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is referenced from: https://github.com/ray-project/ray/blob/68d01c4c48a59c7768ec9c2359a1859966c446b6/python/ray/_private/profiling.py#L25-L25?
Could we add a comment here to show that this part is following the above link in ray?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. i will add it

case "task:deserialize_arguments":
return "rail_load"
case "task:execute":
return "rail_animation"
case "task:store_outputs":
return "rail_idle"
case "task:submit_task", "task":
return "rail_response"
case "worker_idle":
return "cq_build_abandoned"
case "ray.get":
return "good"
case "ray.put":
return "terrible"
case "ray.wait":
return "vsync_highlight_color"
case "submit_task":
return "background_memory_dump"
case "wait_for_function", "fetch_and_run_function", "register_remote_function":
return "detailed_memory_dump"
default:
return "generic_work"
}
}

// extractActorIDFromTaskID extracts the ActorID from a TaskID following Ray's ID specification.
//
// Design doc: src/ray/design_docs/id_specification.md
// - TaskID: 8B unique + 16B ActorID (total 24 bytes = 48 hex chars)
// - ActorID: 12B unique + 4B JobID (total 16 bytes = 32 hex chars)
//
// For a 48-character hex TaskID, the last 32 hex characters (bytes 16–48)
// correspond to the ActorID. This function further checks the "unique" portion
// of the ActorID (first 24 hex chars) and returns an empty string if it is all Fs,
// which indicates normal/driver tasks with no associated actor.
func extractActorIDFromTaskID(taskIDHex string) string {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you provide where you get the rules used in this function (link to the Ray code) for future reference?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, i will add a comment above this function.
we can trace back to src/ray/design_docs/id_specification.md which shows TaskID = 8B unique + 16B ActorID; ActorID = 12B unique + 4B JobID.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated in 7249ace

if len(taskIDHex) != 48 {
return "" // can't process if encoded in base64
}

actorPortion := taskIDHex[16:40] // 24 chars for actor id (12 bytes)
jobPortion := taskIDHex[40:48] // 8 chars for job id (4 bytes)

// Check if all Fs (no actor)
if strings.ToLower(actorPortion) == "ffffffffffffffffffffffff" {
return ""
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Case-sensitive hex comparison may fail for uppercase IDs

Low Severity

The extractActorIDFromTaskID function checks if a task has no associated actor by comparing actorPortion == "ffffffffffffffffffffffff" (all lowercase). This comparison is case-sensitive, so if Ray ever outputs task IDs with uppercase hex characters (e.g., "FFFFFFFFFFFFFFFFFFFFFFFF"), the check would fail. This would cause tasks without actors to incorrectly have an actor_id value set in the timeline output instead of nil.

Fix in Cursor Fix in Web

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated in 6e3e214


return actorPortion + jobPortion
}
Loading
Loading