[History Server Tests] end-to-end Tests for Dead Cluster Endpoints #4470

Open
sb-hakunamatata wants to merge 5 commits into ray-project:master from sb-hakunamatata:fix-history-server-dead-url-tests-4378

@sb-hakunamatata sb-hakunamatata commented Jan 31, 2026

Why are these changes needed?

This PR fixes RayJob submission failures in History Server E2E tests and adds coverage for History Server behavior when the Ray cluster is deleted (dead cluster scenario).

What changed

  • Added a new dead-cluster E2E test suite under tests/e2e to verify History Server endpoints return HTTP 200 and valid JSON after the live Ray cluster is deleted:

    • /api/v0/tasks
    • /api/v0/tasks?filter_keys=job_id...
    • /api/v0/tasks/summarize
    • /logical/actors
    • /logical/actors/{actor_id}
    • /nodes?view=summary
  • Updated the test helper ApplyRayJobAndWaitForCompletion to assign a unique RayJob name on every submission by appending a UUID.

Why

  • The same RayJob manifest is submitted multiple times within a single E2E test.

  • Reusing the same RayJob name caused submission failures due to Kubernetes resource name collisions.

  • Making the RayJob name unique ensures:

    • RayJob submissions succeed reliably
    • Tests remain deterministic and isolated
    • No interference between multiple job submissions in the same namespace
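
As a rough illustration of the naming change described above, the sketch below appends a random suffix to a base RayJob name so that repeated submissions of the same manifest do not collide. It is not the code from this PR: `uniqueRayJobName` is a hypothetical helper, and it uses random hex from `crypto/rand` instead of a UUID only to stay within the Go standard library.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// uniqueRayJobName is a hypothetical helper (not the PR's implementation).
// It appends 16 random lowercase hex characters to base, which keeps the
// result within the lowercase alphanumeric-and-dash character set.
func uniqueRayJobName(base string) (string, error) {
	buf := make([]byte, 8)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return fmt.Sprintf("%s-%s", base, hex.EncodeToString(buf)), nil
}

func main() {
	// "rayjob-sample" is an assumed manifest name used only for illustration.
	for i := 0; i < 2; i++ {
		name, err := uniqueRayJobName("rayjob-sample")
		if err != nil {
			panic(err)
		}
		fmt.Println(name) // a different name on each submission
	}
}
```

A helper along these lines would set the generated name on the RayJob object before it is applied, so each submission creates a distinct resource in the namespace.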

Testing

  • Ran History Server E2E tests locally.
  • Verified RayJob submission succeeds when the same manifest is applied multiple times.
  • Confirmed all History Server endpoints return 200 OK and non-empty JSON responses after the cluster is deleted.
  • Added soft validation for the data returned by those endpoints. Currently none of the endpoints returns exactly the same JSON as the live server, so the data-validation assertions are set to false.
  • Test results are available at https://pub.microbin.eu/p/eel-swan-gecko

Rollback plan

  • Revert this PR to restore the previous RayJob naming behavior and remove the new E2E coverage.

Related issue number

Fixes #4378

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@sb-hakunamatata
Author

@Future-Outlier please review.

@Future-Outlier Future-Outlier self-assigned this Feb 2, 2026
Copilot AI mentioned this pull request Feb 5, 2026
4 tasks
Member

@Future-Outlier Future-Outlier left a comment

Hi, thanks for the PR.

  1. Can you do something like PR #4461?
  2. Don't add a new file; instead, add the endpoints to historyserver/test/e2e/historyserver_test.go.
  3. The goal of this PR is to add the task endpoint, so please don't include the actor endpoints.
  4. Please change the title.

Development

Successfully merging this pull request may close these issues.

[Feature][history server] E2E test for dead cluster task endpoint

2 participants