
Feat/async job info query use informer#4484

Draft
fscnick wants to merge 6 commits into ray-project:master from fscnick:feat/async-job-info-query-use-informer

@fscnick fscnick commented Feb 5, 2026

Why are these changes needed?

Following #4160 (comment), this PR uses the informer cache to list RayJobs and queries their JobInfo in advance, before getJobInfo is called. It also makes the query interval and related settings configurable.

How it works:

  • A producer goroutine lists RayJobs from the informer cache and puts them into a queue.
  • A group of workers each take a RayJob from the queue, fetch its JobInfo, and store it in a cache for later use.
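
The flow above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `JobInfo`, `fetchFunc`, `runOnce`, `cachedStatus`, `newCache`, and `fakeFetch` are hypothetical names, a plain slice of job IDs stands in for the informer cache listing, and a `sync.Map` stands in for the cache storage. It shows a single producer pass; the real producer would presumably repeat this every query interval.

```go
package main

import (
	"fmt"
	"sync"
)

// JobInfo is a hypothetical stand-in for the job status returned by the dashboard.
type JobInfo struct {
	JobID  string
	Status string
}

// fetchFunc is a hypothetical stand-in for the dashboard client's job info call.
type fetchFunc func(jobID string) (JobInfo, error)

// newCache returns the shared storage that workers write to and readers load from.
func newCache() *sync.Map { return &sync.Map{} }

// runOnce performs one producer pass: it enqueues every listed job ID, lets
// `workers` goroutines fetch the JobInfo for each one, and stores the results.
func runOnce(jobIDs []string, workers int, fetch fetchFunc, cache *sync.Map) {
	queue := make(chan string, len(jobIDs))
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range queue {
				info, err := fetch(id)
				if err != nil {
					continue // leave any previous entry untouched; the next pass retries
				}
				cache.Store(id, info)
			}
		}()
	}

	// Producer side: hand every job to the workers, then signal completion.
	for _, id := range jobIDs {
		queue <- id
	}
	close(queue)
	wg.Wait()
}

// cachedStatus is the read path: it never calls the dashboard itself.
func cachedStatus(cache *sync.Map, jobID string) (string, bool) {
	v, ok := cache.Load(jobID)
	if !ok {
		return "", false
	}
	return v.(JobInfo).Status, true
}

// fakeFetch simulates the dashboard; the job named "broken" always fails.
func fakeFetch(jobID string) (JobInfo, error) {
	if jobID == "broken" {
		return JobInfo{}, fmt.Errorf("dashboard unreachable for %s", jobID)
	}
	return JobInfo{JobID: jobID, Status: "RUNNING"}, nil
}

func main() {
	cache := newCache()
	runOnce([]string{"job-a", "job-b", "broken"}, 2, fakeFetch, cache)
	status, ok := cachedStatus(cache, "job-a")
	fmt.Println(status, ok)
}
```

The point of the split is that the read path only ever touches the cache, so a slow or unreachable dashboard delays the workers rather than the caller of getJobInfo.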

Related issue number

Closes #4069

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: fscnick <fscnick.dev@gmail.com>
Comment on lines +75 to +83
cacheStorage := otter.Must(&otter.Options[string, *JobInfoCache]{
	ExpiryCalculator: otter.ExpiryAccessing[string, *JobInfoCache](cacheExpiry), // Reset timer on reads/writes
	OnDeletion: func(e otter.DeletionEvent[string, *JobInfoCache]) {
		if !e.WasEvicted() {
			return
		}
		logger.WithName("cacheStorage").Info("Evict cache for key.", "key", e.Key, "cause", e.Cause.String())
	},
})
Collaborator Author

Use a cache without LRU eviction, in case there are so many RayJobs in the cluster that they compete for the limited slots of an LRU cache.

This cache also provides automatic expiry, so the cleanup goroutine can be removed.
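
To illustrate the expire-after-access behaviour described here, below is a minimal standard-library sketch. It is not otter and does not mirror otter's internals: `ttlCache`, `fakeClock`, and `newDemoCache` are hypothetical names, values are plain strings, and expired entries are dropped lazily on lookup rather than by the library's own cleanup mechanism. The cache has no size bound, so an entry disappears only when it has gone unread and unwritten for the full expiry period.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	value    string
	expireAt time.Time
}

// ttlCache is a hypothetical, simplified expire-after-access cache.
// It is unbounded: entries never compete for slots, they only time out.
type ttlCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	now   func() time.Time
	items map[string]entry
}

func newTTLCache(ttl time.Duration, now func() time.Time) *ttlCache {
	return &ttlCache{ttl: ttl, now: now, items: make(map[string]entry)}
}

// Set stores the value and starts (or restarts) its expiry timer.
func (c *ttlCache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key] = entry{value: value, expireAt: c.now().Add(c.ttl)}
}

// Get returns the value if it has not expired and resets its expiry timer.
func (c *ttlCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.items[key]
	if !ok {
		return "", false
	}
	if !c.now().Before(e.expireAt) {
		delete(c.items, key) // lazy expiry, so no separate cleanup goroutine
		return "", false
	}
	e.expireAt = c.now().Add(c.ttl)
	c.items[key] = e
	return e.value, true
}

// fakeClock lets the example advance time deterministically.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time      { return f.t }
func (f *fakeClock) Advance(seconds int) { f.t = f.t.Add(time.Duration(seconds) * time.Second) }

// newDemoCache wires a ttlCache to a fake clock for demonstration purposes.
func newDemoCache(ttlSeconds int) (*ttlCache, *fakeClock) {
	clock := &fakeClock{t: time.Unix(0, 0)}
	return newTTLCache(time.Duration(ttlSeconds)*time.Second, clock.Now), clock
}

func main() {
	cache, clock := newDemoCache(10)
	cache.Set("rayjob-a", "RUNNING")

	clock.Advance(6)
	_, ok := cache.Get("rayjob-a") // read within the TTL resets the timer
	fmt.Println("after 6s:", ok)

	clock.Advance(6)
	_, ok = cache.Get("rayjob-a") // 12s since the write, but only 6s since the last read
	fmt.Println("after 12s:", ok)

	clock.Advance(11)
	_, ok = cache.Get("rayjob-a") // untouched for longer than the TTL
	fmt.Println("after 23s:", ok)
}
```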

continue
}

logger.Info("Listing RayJobs from cache", "total", len(rayJobs))
Collaborator Author

Should we keep this log line? It might be a bit noisy, since it prints every queryInterval, but without it the producer goroutine runs almost silently.

Comment on lines +61 to +63
if features.Enabled(features.AsyncJobInfoQuery) {
	dashboardClientFunc = dashboardclient.GetCachedDashboardClientFunc()
}
Collaborator Author

Relying on a global variable avoids modifying the existing function signature. Alternatively, should we pass GetCachedDashboardClientFunc() in as an argument from the upper layer?
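
For comparison, the two options in this question might look roughly like the sketch below. Every name in it (`DashboardClient`, `ClientFunc`, `asyncJobInfoQuery`, `defaultClientFunc`, `reconciler`, `newReconciler`) is hypothetical and much simpler than the real KubeRay types; it only contrasts selecting the client factory through package-level state with injecting it from the caller.

```go
package main

import "fmt"

// DashboardClient is a hypothetical, trimmed-down client interface.
type DashboardClient interface {
	Kind() string
}

type directClient struct{}

func (directClient) Kind() string { return "direct" }

type cachedClient struct{}

func (cachedClient) Kind() string { return "cached" }

// ClientFunc builds a dashboard client.
type ClientFunc func() DashboardClient

func newDirectClient() DashboardClient { return directClient{} }
func newCachedClient() DashboardClient { return cachedClient{} }

// Option A: a package-level flag picks the factory, so constructors that
// call defaultClientFunc keep their existing signatures.
var asyncJobInfoQuery bool

func defaultClientFunc() ClientFunc {
	if asyncJobInfoQuery {
		return newCachedClient
	}
	return newDirectClient
}

// Option B: the upper layer decides and passes the factory in explicitly.
type reconciler struct {
	newClient ClientFunc
}

func newReconciler(f ClientFunc) *reconciler {
	return &reconciler{newClient: f}
}

func (r *reconciler) clientKind() string {
	return r.newClient().Kind()
}

func main() {
	// Option A: behaviour depends on hidden global state.
	asyncJobInfoQuery = true
	fmt.Println(defaultClientFunc()().Kind())

	// Option B: behaviour is visible at the call site.
	fmt.Println(newReconciler(newDirectClient).clientKind())
}
```

Option A keeps the diff small; Option B makes the dependency explicit, which tends to make unit tests easier because a test can pass in a fake factory without toggling shared state.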

Development

Successfully merging this pull request may close these issues.

[Feature] Make RayJob dashboard polling interval configurable

2 participants