[core] Fix task event loss during shutdown #60247
Code Review
This pull request aims to fix task event loss during shutdown by ensuring that buffered events are flushed before the io_service is stopped. The changes in TaskEventBuffer correctly implement a synchronous wait for gRPC calls to complete. However, the same fix for RayEventRecorder is incomplete, as StopExportingEvents remains asynchronous, which could still lead to event loss. I've also found an issue in one of the new tests. My review includes suggestions to address these points.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
During shutdown, TaskEventBufferImpl::Stop() and RayEventRecorder were losing buffered events because the io_service was stopped immediately after calling async gRPC flush methods.

This fix:
- Adds a synchronous flush with configurable timeout in TaskEventBuffer::Stop()
- Adds StopExportingEvents() method to RayEventRecorder for graceful shutdown
- Calls StopExportingEvents() from GcsServer::Stop() before stopping io_service
- Adds new config option task_events_shutdown_flush_timeout_ms (default 5s)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: ruo <[email protected]>
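The synchronous flush with a timeout described in this commit message can be sketched roughly as follows. This is a minimal illustration, not Ray's implementation: `InFlightTracker` and `FlushAndWait` are hypothetical names, and the async gRPC call is simulated with a thread that sleeps for a given latency. The completion path updates the flag and notifies while holding the mutex, and the waiter uses a predicate so a completion that happens before the wait starts is not missed.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical helper (not a Ray class): tracks one in-flight async call
// and lets a shutdown path wait for it with a deadline.
class InFlightTracker {
 public:
  void MarkStarted() {
    std::lock_guard<std::mutex> lock(mu_);
    in_flight_ = true;
  }

  // Called from the async completion callback.
  void MarkDone() {
    std::lock_guard<std::mutex> lock(mu_);
    in_flight_ = false;
    cv_.notify_all();  // signal while still holding the mutex
  }

  // Returns true if the call finished before the timeout elapsed.
  bool WaitUntilIdle(std::chrono::milliseconds timeout) {
    assert(timeout.count() >= 0);
    std::unique_lock<std::mutex> lock(mu_);
    // The predicate is re-checked under the lock, so a MarkDone() that ran
    // before this call is still observed.
    return cv_.wait_for(lock, timeout, [this] { return !in_flight_; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool in_flight_ = false;
};

// Simulates an async flush that completes after `latency` on another thread,
// then waits for it the way a synchronous Stop() might.
bool FlushAndWait(std::chrono::milliseconds latency,
                  std::chrono::milliseconds timeout) {
  InFlightTracker tracker;
  tracker.MarkStarted();
  std::thread worker([&tracker, latency] {
    std::this_thread::sleep_for(latency);
    tracker.MarkDone();
  });
  bool completed = tracker.WaitUntilIdle(timeout);
  worker.join();  // keep the tracker alive until the worker is done
  return completed;
}
```

When `FlushAndWait` returns false, the timeout elapsed before the simulated call completed. In a real shutdown path that is the point where the caller would give up waiting rather than hang indefinitely.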
Changes:
- Wait for in-flight gRPC to complete before final flush
- Perform final flush to send all buffered events
- Wait for flush gRPC to complete before shutdown
- Add payload checks to avoid sending empty gRPC requests
- Use lambda helper for cleaner wait logic
- Fix race condition: signal under mutex to avoid lost wakeup
- Fix overlapping gRPC: skip export if gRPC already in progress
- Fix re-enable window: use stopping_ flag instead of re-enabling to prevent new events during shutdown flush

This ensures no events are lost during shutdown while respecting timeouts to avoid hanging indefinitely.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: ruo <[email protected]>
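Three of the changes listed above (the payload check, skipping an export while one is in progress, and the stopping_ flag) can be illustrated with a small state machine. This is a sketch under stated assumptions, not the code in this PR: `SketchRecorder` and all of its members are hypothetical, the RPC is simulated by an explicit `OnExportDone()` call, and the class is single-threaded for brevity, whereas real code would guard this state with a mutex.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical recorder (not Ray's RayEventRecorder). Single-threaded sketch.
class SketchRecorder {
 public:
  // Returns false when the event is rejected because shutdown has begun.
  bool AddEvent(std::string event) {
    if (stopping_) return false;  // no new events during the shutdown flush
    buffer_.push_back(std::move(event));
    return true;
  }

  // Starts a simulated export. Returns the number of events handed to the
  // RPC, or 0 if the export was skipped.
  std::size_t ExportEvents() {
    if (export_in_progress_) return 0;  // avoid overlapping RPCs
    if (buffer_.empty()) return 0;      // payload check: nothing to send
    export_in_progress_ = true;
    in_flight_ = std::move(buffer_);
    buffer_.clear();  // leave the moved-from buffer in a known empty state
    return in_flight_.size();
  }

  // Simulated RPC completion callback.
  void OnExportDone() {
    assert(export_in_progress_);
    sent_ += in_flight_.size();
    in_flight_.clear();
    export_in_progress_ = false;
  }

  // Marks the start of shutdown; buffered events can still be exported.
  void BeginStop() { stopping_ = true; }

  std::size_t sent() const { return sent_; }

 private:
  std::vector<std::string> buffer_;
  std::vector<std::string> in_flight_;
  bool export_in_progress_ = false;
  bool stopping_ = false;
  std::size_t sent_ = 0;
};
```

In this sketch an event added while an export is in flight stays in the buffer and goes out with the next export. After `BeginStop()`, events that were already buffered can still be flushed, but new ones are rejected.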
Summary
Fixes #60218
During shutdown, TaskEventBufferImpl::Stop() and RayEventRecorder were losing buffered events because the io_service was stopped immediately after calling async gRPC flush methods, without waiting for the gRPC calls to complete.

This PR:
- Adds a synchronous flush with timeout to TaskEventBuffer::Stop(): it waits up to 5 seconds (configurable via task_events_shutdown_flush_timeout_ms) for in-flight gRPC calls to complete
- Adds a StopExportingEvents() method to RayEventRecorder for graceful shutdown
- Calls StopExportingEvents() from GcsServer::Stop() before stopping io_service
- Adds the config option task_events_shutdown_flush_timeout_ms (default 5000ms)

Test plan
- TestStopFlushesEvents for TaskEventBuffer, which verifies that events are flushed during Stop()
- TestStopFlushesEvents for RayEventRecorder, which verifies that events are exported during StopExportingEvents()

🤖 Generated with Claude Code
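The ordering issue behind this PR (stopping the io_service before queued work has run) can be shown with a toy event loop. This is an illustrative sketch only: `TinyLoop` is a hypothetical stand-in and not Boost.Asio or Ray's io_service wrapper, and `Shutdown` is an invented helper that counts how many buffered events reach the sink.

```cpp
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for an io_service: a queue of pending handlers.
class TinyLoop {
 public:
  void Post(std::function<void()> task) {
    if (!stopped_) tasks_.push_back(std::move(task));
  }

  // Runs handlers until the queue is empty or the loop is stopped.
  void RunPending() {
    while (!stopped_ && !tasks_.empty()) {
      auto task = std::move(tasks_.front());
      tasks_.pop_front();
      task();
    }
  }

  // In this sketch, stopping discards handlers that have not run yet.
  void Stop() {
    stopped_ = true;
    tasks_.clear();
  }

 private:
  std::deque<std::function<void()>> tasks_;
  bool stopped_ = false;
};

// Returns how many of `buffered` events are delivered before shutdown.
int Shutdown(int buffered, bool flush_before_stop) {
  TinyLoop loop;
  int delivered = 0;
  for (int i = 0; i < buffered; ++i) {
    loop.Post([&delivered] { ++delivered; });
  }
  if (flush_before_stop) loop.RunPending();  // drain first, then stop
  loop.Stop();
  loop.RunPending();  // no-op once stopped
  return delivered;
}
```

With `flush_before_stop` set to false, every queued handler is dropped, which is the toy analogue of the event loss described in the summary. Draining before `Stop()` delivers all of them.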