[history server] Improve api/v0/logs/file#4456
[history server] Improve api/v0/logs/file#4456machichima wants to merge 38 commits intoray-project:masterfrom
api/v0/logs/file#4456Conversation
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
timeout only test validation, not testing the behavior Signed-off-by: machichima <nary12321@gmail.com>
api/v0/logs/file testapi/v0/logs/file
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
18fc3cb to
e01c113
Compare
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
|
Test parameters manually (without 0. List all available log files
NODE_ID=$(curl -s -b ~/cookies.txt "http://localhost:8080/nodes?view=summary" | jq -r '.data.summary[0].raylet.nodeId')
curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs?node_id=${NODE_ID}"1. Use filename (specify the file directly)NODE_ID=$(curl -s -b ~/cookies.txt "http://localhost:8080/nodes?view=summary" | jq -r '.data.summary[0].raylet.nodeId')
curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs/file?node_id=${NODE_ID}&filename=raylet.out&lines=100"2. Use node_ip (alternative to node_id)
NODE_IP=$(curl -s -b ~/cookies.txt "http://localhost:8080/nodes?view=summary" | jq -r '.data.summary[0].ip')
curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs/file?node_ip=${NODE_IP}&filename=raylet.out"3. Use pid
NODE_ID=$(curl -s -b ~/cookies.txt "http://localhost:8080/nodes?view=summary" | jq -r '.data.summary[0].raylet.nodeId')
PID=$(curl -s -b ~/cookies.txt "http://localhost:8080/api/v0/logs?node_id=${NODE_ID}" | jq -r '.data.result.core_worker[0]' | sed -E 's/.*_([0-9]+)\.log/\1/')
curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs/file?node_id=${NODE_ID}&pid=${PID}&suffix=out"4. Use actor_id (stdout)ACTOR_ID=$(curl -s -b ~/cookies.txt "http://localhost:8080/logical/actors" | jq -r '.data.actors | keys[0]')
curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs/file?actor_id=${ACTOR_ID}&suffix=out"5. Use actor_id (stderr)ACTOR_ID=$(curl -s -b ~/cookies.txt "http://localhost:8080/logical/actors" | jq -r '.data.actors | keys[0]')
curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs/file?actor_id=${ACTOR_ID}&suffix=err"6. Use task_id
TASK_ID=$(curl -s -b ~/cookies.txt "http://localhost:8080/api/v0/tasks" | jq -r '.data.result.result[] | select(.type == "NORMAL_TASK") | .task_id' | head -1)
curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs/file?task_id=${TASK_ID}&suffix=out"7. Use task_id and specify attempt_number
TASK_ID=$(curl -s -b ~/cookies.txt "http://localhost:8080/api/v0/tasks" | jq -r '.data.result.result[] | select(.type == "NORMAL_TASK") | .task_id' | head -1)
curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs/file?task_id=${TASK_ID}&suffix=out&attempt_number=0"8. Use download_file parameter to download the file
NODE_ID=$(curl -s -b ~/cookies.txt "http://localhost:8080/nodes?view=summary" | jq -r '.data.summary[0].raylet.nodeId')
curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs/file?node_id=${NODE_ID}&filename=raylet.out&download_filename=newfile.txt" -J -O9. Use filter_ansi_code parameter to filter ANSI codesNODE_ID=$(curl -s -b ~/cookies.txt "http://localhost:8080/nodes?view=summary" | jq -r '.data.summary[0].raylet.nodeId')
curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs/file?node_id=${NODE_ID}&filename=raylet.out&filter_ansi_code=true"10. Combination test (actor_id + attempt_number + all params)ACTOR_ID=$(curl -s -b ~/cookies.txt "http://localhost:8080/logical/actors" | jq -r '.data.actors | keys[0]')
curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs/file?actor_id=${ACTOR_ID}&attempt_number=0&suffix=out&lines=1000&filter_ansi_code=true"11. Combination test (task_id + attempt_number + all params)
TASK_ID=$(curl -s -b ~/cookies.txt "http://localhost:8080/api/v0/tasks" | jq -r '.data.result.result[] | select(.type == "NORMAL_TASK") | .task_id' | head -1)
curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs/file?task_id=${TASK_ID}&attempt_number=0&suffix=out&lines=1000&filter_ansi_code=true"13. Test pid with node_ip
NODE_IP=$(curl -s -b ~/cookies.txt "http://localhost:8080/nodes?view=summary" | jq -r '.data.summary[0].ip')
NODE_ID=$(curl -s -b ~/cookies.txt "http://localhost:8080/nodes?view=summary" | jq -r '.data.summary[0].raylet.nodeId')
PID=$(curl -s -b ~/cookies.txt "http://localhost:8080/api/v0/logs?node_id=${NODE_ID}" | jq -r '.data.result.core_worker[0]' | sed -E 's/.*_([0-9]+)\.log/\1/')
curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs/file?node_ip=${NODE_IP}&pid=${PID}&suffix=out"14. Test actor_id with node_ip (actor automatically resolves node_id)ACTOR_ID=$(curl -s -b ~/cookies.txt "http://localhost:8080/logical/actors" | jq -r '.data.actors | keys[0]')
curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs/file?actor_id=${ACTOR_ID}&suffix=out&lines=200"15. Test task_id with node_ip (task automatically resolves node_id)
TASK_ID=$(curl -s -b ~/cookies.txt "http://localhost:8080/api/v0/tasks" | jq -r '.data.result.result[] | select(.type == "NORMAL_TASK") | .task_id' | head -1)
curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs/file?task_id=${TASK_ID}&suffix=out&lines=300" |
Signed-off-by: machichima <nary12321@gmail.com>
api/v0/logs/fileapi/v0/logs/file
Signed-off-by: machichima <nary12321@gmail.com>
|
Test ❯ NODE_ID=$(curl -s -b ~/cookies.txt "http://localhost:8080/nodes?view=summary" | jq -r '.data.summary[0].raylet.nodeId')
curl -b ~/cookies.txt "http://localhost:8080/api/v0/logs/stream?node_id=${NODE_ID}&filename=raylet.out&filter_ansi_code=true&lines=20"
[state-dump] RaySyncer.BroadcastMessage - 2 total (0 active), Execution time: mean = 0.03ms, total = 0.07ms, Queueing time: mean = 0.00ms, max = 0.00ms, min = 0.00ms, total = 0.00ms
[state-dump] RaySyncerRegister - 2 total (0 active), Execution time: mean = 0.00ms, total = 0.00ms, Queueing time: mean = 0.00ms, max = 0.00ms, min = 0.00ms, total = 0.00ms
[state-dump] ReporterService.grpc_client.HealthCheck - 2 total (0 active), Execution time: mean = 0.51ms, total = 1.02ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
[state-dump] - 2 total (0 active), Execution time: mean = 0.03ms, total = 0.07ms, Queueing time: mean = 0.02ms, max = 0.02ms, min = 0.01ms, total = 0.04ms
[state-dump] NodeManager.GCTaskFailureReason - 2 total (1 active), Execution time: mean = 3.40ms, total = 6.80ms, Queueing time: mean = 0.02ms, max = 0.04ms, min = 0.04ms, total = 0.04ms
[state-dump] ReporterService.grpc_client.HealthCheck.OnReplyReceived - 2 total (0 active), Execution time: mean = 0.06ms, total = 0.11ms, Queueing time: mean = 0.43ms, max = 0.84ms, min = 0.02ms, total = 0.85ms
[state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), Execution time: mean = 0.60ms, total = 1.20ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
[state-dump] MetricsAgentClient.WaitForServerReadyWithRetry - 1 total (0 active), Execution time: mean = 0.17ms, total = 0.17ms, Queueing time: mean = 1000.67ms, max = 1000.67ms, min = 1000.67ms, total = 1000.67ms
[state-dump] ray::rpc::JobInfoGcsService.grpc_client.GetAllJobInfo.OnReplyReceived - 1 total (0 active), Execution time: mean = 0.01ms, total = 0.01ms, Queueing time: mean = 0.03ms, max = 0.03ms, min = 0.03ms, total = 0.03ms
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode.OnReplyReceived - 1 total (0 active), Execution time: mean = 0.11ms, total = 0.11ms, Queueing time: mean = 0.02ms, max = 0.02ms, min = 0.02ms, total = 0.02ms
[state-dump] Subscriber.HandlePublishedMessage_GCS_NODE_ADDRESS_AND_LIVENESS_CHANNEL - 1 total (0 active), Execution time: mean = 0.05ms, total = 0.05ms, Queueing time: mean = 0.44ms, max = 0.44ms, min = 0.44ms, total = 0.44ms
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeAddressAndLiveness.OnReplyReceived - 1 total (0 active), Execution time: mean = 0.06ms, total = 0.06ms, Queueing time: mean = 0.02ms, max = 0.02ms, min = 0.02ms, total = 0.02ms
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), Execution time: mean = 1.12ms, total = 1.12ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.GetAllNodeAddressAndLiveness - 1 total (0 active), Execution time: mean = 1.74ms, total = 1.74ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
[state-dump] ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 8.97ms, total = 8.97ms, Queueing time: mean = 0.01ms, max = 0.01ms, min = 0.01ms, total = 0.01ms
[state-dump] ray::rpc::JobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), Execution time: mean = 0.21ms, total = 0.21ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
[state-dump] ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig - 1 total (0 active), Execution time: mean = 0.32ms, total = 0.32ms, Queueing time: mean = 0.00ms, max = -0.00ms, min = 9223372036854.78ms, total = 0.00ms
[state-dump] DebugString() time ms: 0
[state-dump]
|
| if resp.StatusCode != tc.expectedStatus { | ||
| LogWithTimestamp(t, "Test case '%s' failed: expected %d, got %d, body: %s", | ||
| tc.name, tc.expectedStatus, resp.StatusCode, string(body)) | ||
| } | ||
|
|
||
| g.Expect(resp.StatusCode).To(Equal(tc.expectedStatus), | ||
| "Test case '%s' failed: expected %d, got %d", tc.name, tc.expectedStatus, resp.StatusCode) |
There was a problem hiding this comment.
| if resp.StatusCode != tc.expectedStatus { | |
| LogWithTimestamp(t, "Test case '%s' failed: expected %d, got %d, body: %s", | |
| tc.name, tc.expectedStatus, resp.StatusCode, string(body)) | |
| } | |
| g.Expect(resp.StatusCode).To(Equal(tc.expectedStatus), | |
| "Test case '%s' failed: expected %d, got %d", tc.name, tc.expectedStatus, resp.StatusCode) | |
| g.Expect(resp.StatusCode).To(Equal(tc.expectedStatus), | |
| "Test case '%s' failed: expected %d, got %d", tc.name, tc.expectedStatus, resp.StatusCode) |
Thanks Nary! This seems redundant.
There was a problem hiding this comment.
Thank you for point this out! I would prefer keeping the
if resp.StatusCode != tc.expectedStatus {
LogWithTimestamp(t, "Test case '%s' failed: expected %d, got %d, body: %s",
tc.name, tc.expectedStatus, resp.StatusCode, string(body))
}and remove the one below. In this case, when the status code mismatch, the body is printed out so that we can know what error we are getting.
Signed-off-by: machichima <nary12321@gmail.com>
| // 1. Already hex format - returns as-is | ||
| // 2. Base64-encoded - decodes to hex | ||
| // It tries RawURLEncoding first (Ray's default), falling back to StdEncoding if that fails. | ||
| func decodeBase64ToHex(id string) (string, error) { |
There was a problem hiding this comment.
I noticed that we currently have multiple places dealing with Base64 → hex ID conversion:
reader.godefinesdecodeBase64ToHex(...)historyserver/pkg/utils/utils.goalready has a similar helperrouter.goalso has a TODO noting that Ray Base Event IDs are Base64-encoded, while other APIs use hex, and that Base64 IDs can break URL routing
Among these, the implementation in reader.go is actually more robust.
Would it make sense to centralize this logic in pkg/utils and reuse it from reader.go and router.go or some other component?
There was a problem hiding this comment.
Make sense! I update the ConvertBase64ToHex in utils.go with more robust logic, and use it in reader.go in: d9b7dba
Signed-off-by: machichima <nary12321@gmail.com>
…logs-file-test Signed-off-by: machichima <nary12321@gmail.com>
put in utils and use in reader Signed-off-by: machichima <nary12321@gmail.com>
bf3ff58 to
d9b7dba
Compare
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
Signed-off-by: machichima <nary12321@gmail.com>
| }) | ||
|
|
||
| if disposition == "" { | ||
| logrus.Errorf("Failed to format Content-Disposition header for filename %q: %v", downloadFilename, err) |
There was a problem hiding this comment.
Error variable is nil when logged
Low Severity
The error log at line 775 references err from the earlier _getNodeLogFile call, but at this point err is guaranteed to be nil (the function returns early if there was an error). The mime.FormatMediaType function doesn't return an error - it returns an empty string on failure. This results in a misleading log message showing <nil> as the error value.
| func ConvertBase64ToHex(id string) (string, error) { | ||
| // Check if already hex (only [0-9a-f]) | ||
| if matched, _ := regexp.MatchString("^[0-9a-f]+$", id); matched { | ||
| return id, nil |
There was a problem hiding this comment.
Hex detection regex ignores uppercase characters
Low Severity
The regex ^[0-9a-f]+$ only matches lowercase hex characters. If an ID arrives as uppercase hex (e.g., "ABC123DEF"), it won't match and the function attempts Base64 decoding. Since uppercase letters are valid Base64 characters, the decode may "succeed" but produce completely wrong byte data. The hex string "ABCD" represents bytes 0xAB 0xCD, but Base64-decoding "ABCD" produces entirely different bytes.
Signed-off-by: machichima <nary12321@gmail.com>
|
I think right now the following two tests are flaky: TestHistoryServer//v0/logs/file_endpoint_(dead_cluster)/task_id_parameter
TestHistoryServer//v0/logs/file_endpoint_(dead_cluster)/actor_id_parameterAs they relies on Should we remove / comment those tests out for now and open follow-up issue to solve this? |
AndySung320
left a comment
There was a problem hiding this comment.
One thing to note: req.QueryParameter("actor_id") returns a URL-decoded value.
If actor_id contains standard Base64 characters such as + (or /, =), + will be decoded into a space when parsing the query string. This makes the original Base64 unrecoverable.
We can already see this in the test logs:
YL1IU+OndDKZT2iTAgAAAA== becomes YL1IU OndDKZT2iTAgAAAA==, which then causes the lookup to fail.
Maybe we should avoid passing standard Base64 IDs directly via query parameters, or ensure they are URL-encoded / normalized (e.g. to hex) earlier.
Signed-off-by: machichima <nary12321@gmail.com>
|
@AndySung320 Nice catch!! Just updated in 9980d22. I run the test 7 times locally and they all passed ❯ go test -run TestHistoryServer
PASS
ok github.com/ray-project/kuberay/historyserver/test/e2e 113.564s
~/workData/open-source/kuberay/historyserver/test/e2e improve-logs-file-test *4 !1 ?4 1m 55s
❯ go test -run TestHistoryServer
PASS
ok github.com/ray-project/kuberay/historyserver/test/e2e 112.837s
~/workData/open-source/kuberay/historyserver/test/e2e improve-logs-file-test *4 !1 ?4 1m 56s
❯ go test -run TestHistoryServer
PASS
ok github.com/ray-project/kuberay/historyserver/test/e2e 116.831s
~/workData/open-source/kuberay/historyserver/test/e2e improve-logs-file-test *4 !1 ?4 2m 0s
❯ go test -run TestHistoryServer
PASS
ok github.com/ray-project/kuberay/historyserver/test/e2e 115.684s
~/workData/open-source/kuberay/historyserver/test/e2e improve-logs-file-test *4 !1 ?4 1m 57s
❯ go test -run TestHistoryServer
PASS
ok github.com/ray-project/kuberay/historyserver/test/e2e 115.424s
~/workData/open-source/kuberay/historyserver/test/e2e improve-logs-file-test *4 !1 ?4 1m 58s
❯ go test -run TestHistoryServer
PASS
ok github.com/ray-project/kuberay/historyserver/test/e2e 116.670s
~/workData/open-source/kuberay/historyserver/test/e2e improve-logs-file-test *4 !1 ?4 2m 0s
❯ go test -run TestHistoryServer
PASS
ok github.com/ray-project/kuberay/historyserver/test/e2e 116.547s
|


Why are these changes needed?
This PR implement TODOs mentioned in: #4387 (comment), which contains following:
Add more test cases for
api/v0/logs/fileand collect test cases with for loop to improve the readability.timeout, attempt_numbe, download_file, filter_ansi_cparametersAdd implementation for
attempt_numbe, download_filename, filter_ansi_codein dead clusterStorageReaderimplementations to addcontextsupport, so we are skipping this now and will do in follow-upAdd support for
node_ip, actor_id, task_id, pid, suffixsubmission_idcannot be implemented now asDriverJobDefinitionEventis missingsubmission_idfield, which we cannot getsubmission_id->driver_node_idmappingnode_idornode_ip,pidorfilenameneed to be provided. Otherwise user can just provideactor_idortask_iddirectlyRelated issue number
Related to #4387
Checks