Skip to content

[Feat] [history server] Add node endpoint#4436

Open
JiangJiaWei1103 wants to merge 48 commits intoray-project:masterfrom
JiangJiaWei1103:epic-4374/add-node-endpoint
Open

[Feat] [history server] Add node endpoint#4436
JiangJiaWei1103 wants to merge 48 commits intoray-project:masterfrom
JiangJiaWei1103:epic-4374/add-node-endpoint

Conversation

@JiangJiaWei1103
Copy link
Contributor

@JiangJiaWei1103 JiangJiaWei1103 commented Jan 24, 2026

Why are these changes needed?

For full context, please refer to this design_doc.

This PR adds support for the /nodes and /nodes/{node_id} endpoints to the history server.

With these endpoints, the history server can reconstruct node state for terminated (dead) clusters, enabling post-mortem analysis through historical event replay. This includes node summaries and snapshots of node-level resources at different time points.

Change Summary

Screenshot 2026-01-26 at 11 41 27 AM

At a high level, this PR introduces changes across two main layers:

History Server Layer

  • Implement the core logic for the /nodes?view=summary and /nodes/{node_id} endpoints
    • Retrieve the target node map from the event server layer
    • Construct and format node summaries and node-level resource snapshots for API responses

Event Server Layer

  • Define node-related data structures and public interfaces
  • Introduce a reusable state transition interface for merging node state transitions
  • Process NODE_DEFINITION_EVENT and NODE_LIFECYCLE_EVENT
  • Build and maintain a ClusterNodeMap, and expose a GetNodeMap helper for consumption by the history server layer

Test Result

Screenshot 2026-01-27 at 11 58 40 AM

Related issue number

Closes #4376.

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
@@ -0,0 +1,52 @@
package eventserver
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file defines the interface and processing helpers (e.g., MergeStateTransitions) that can be reused across different events' state transitions.

Comment on lines 20 to 31
// injectCollectorRayClusterID injects the ray-cluster-id argument into all collector containers.
func injectCollectorRayClusterID(containers []corev1.Container, rayClusterID string) {
for i := range containers {
if containers[i].Name == "collector" {
containers[i].Command = append(
containers[i].Command,
fmt.Sprintf("--ray-cluster-id=%s", rayClusterID),
)
}
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This patch is proposed by @machichima in this PR.

body, err := io.ReadAll(resp.Body)
gg.Expect(err).NotTo(HaveOccurred())
gg.Expect(resp.StatusCode).To(Equal(200),
gg.Expect(resp.StatusCode).To(Equal(http.StatusOK),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use HTTP status code to preserve semantics without hardcoding the code number.

// isLive indicates whether the response is from a live cluster or a dead cluster:
// - isLive: true for a live cluster (current snapshot)
// - isLive: false for a dead cluster (historical replay)
func verifyNodesRespSchema(test Test, g *WithT, nodesResp map[string]any, isLive bool) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering that different endpoints will have exclusive schema verification, we should extract them to separate files to avoid messing up the main test file.

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Comment on lines 246 to 255
func ConvertBase64ToHex(input string) (string, error) {
bytes, err := base64.StdEncoding.DecodeString(input)
if err != nil {
return "", err
}

hexStr := hex.EncodeToString(bytes)

return hexStr, nil
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This conversion is implemented by @chiayi in this PR.

@JiangJiaWei1103 JiangJiaWei1103 marked this pull request as ready for review January 26, 2026 03:51
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
JiangJiaWei1103 and others added 5 commits January 30, 2026 09:26
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Future-Outlier and others added 3 commits January 30, 2026 23:15
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Future-Outlier and others added 2 commits January 30, 2026 23:43
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature][history server] Node Endpoint

3 participants