@ruoliu2 ruoliu2 commented Jan 16, 2026

Summary

Fixes #60218

During shutdown, TaskEventBufferImpl::Stop() and RayEventRecorder were losing buffered events because the io_service was stopped immediately after calling async gRPC flush methods, without waiting for the gRPC calls to complete.
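The failure mode described above can be reproduced with a toy model. This is a minimal sketch, not Ray code: `ToyLoop` and `ToyBuffer` are hypothetical stand-ins for the io_service and an event buffer, and the "send" is simulated by a handler queued on the loop rather than a real gRPC call.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an io_service: handlers run only inside Poll(),
// and Stop() discards whatever is still queued.
class ToyLoop {
 public:
  void Post(std::function<void()> handler) {
    if (!stopped_) pending_.push(std::move(handler));
  }
  void Poll() {
    while (!stopped_ && !pending_.empty()) {
      std::function<void()> handler = std::move(pending_.front());
      pending_.pop();
      handler();
    }
  }
  void Stop() {
    stopped_ = true;
    while (!pending_.empty()) pending_.pop();
  }

 private:
  std::queue<std::function<void()>> pending_;
  bool stopped_ = false;
};

// Hypothetical buffer with an asynchronous flush: FlushAsync() only queues
// the delivery, so events reach the sink later, when the loop runs it.
class ToyBuffer {
 public:
  ToyBuffer(ToyLoop &loop, std::vector<std::string> &sink)
      : loop_(loop), sink_(sink) {}

  void Add(std::string event) { buffered_.push_back(std::move(event)); }

  void FlushAsync() {
    std::vector<std::string> batch;
    batch.swap(buffered_);  // take ownership of everything buffered so far
    std::vector<std::string> *sink = &sink_;
    loop_.Post([sink, batch] {
      sink->insert(sink->end(), batch.begin(), batch.end());
    });
  }

 private:
  ToyLoop &loop_;
  std::vector<std::string> &sink_;
  std::vector<std::string> buffered_;
};
```

Calling `FlushAsync()` and then `Stop()` without letting the loop run leaves the sink empty, which mirrors the event loss reported in the linked issue. Running the loop before `Stop()` delivers the batch.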

This PR:

  • Adds a synchronous flush with configurable timeout in TaskEventBuffer::Stop() - waits up to 5 seconds (configurable via task_events_shutdown_flush_timeout_ms) for in-flight gRPC calls to complete
  • Adds StopExportingEvents() method to RayEventRecorder for graceful shutdown
  • Calls StopExportingEvents() from GcsServer::Stop() before stopping io_service
  • Adds new config option task_events_shutdown_flush_timeout_ms (default 5000ms)

Test plan

  • Added unit test TestStopFlushesEvents for TaskEventBuffer that verifies events are flushed during Stop()
  • Added unit test TestStopFlushesEvents for RayEventRecorder that verifies events are exported during StopExportingEvents()
  • Ray CI tests

🤖 Generated with Claude Code

@ruoliu2 ruoliu2 requested a review from a team as a code owner January 16, 2026 23:37

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to fix task event loss during shutdown by ensuring that buffered events are flushed before the io_service is stopped. The changes in TaskEventBuffer correctly implement a synchronous wait for gRPC calls to complete. However, the same fix for RayEventRecorder is incomplete, as StopExportingEvents remains asynchronous, which could still lead to event loss. I've also found an issue in one of the new tests. My review includes suggestions to address these points.

@ruoliu2 ruoliu2 marked this pull request as draft January 16, 2026 23:40
@ruoliu2 ruoliu2 force-pushed the ray-event-loss-during-shutdown branch 4 times, most recently from 0ffd54a to 438e741 Compare January 17, 2026 19:17
@ruoliu2 ruoliu2 marked this pull request as ready for review January 17, 2026 20:08
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@ruoliu2 ruoliu2 force-pushed the ray-event-loss-during-shutdown branch 3 times, most recently from 9cc8050 to b6b0dc8 Compare January 17, 2026 22:01
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@ruoliu2 ruoliu2 force-pushed the ray-event-loss-during-shutdown branch 2 times, most recently from 396248b to 7bd049c Compare January 17, 2026 22:25
@ruoliu2 ruoliu2 marked this pull request as draft January 17, 2026 22:27
@ruoliu2 ruoliu2 force-pushed the ray-event-loss-during-shutdown branch 3 times, most recently from c05dd88 to a8ae6c3 Compare January 18, 2026 03:08
ruoliu2 and others added 2 commits January 17, 2026 21:28
During shutdown, TaskEventBufferImpl::Stop() and RayEventRecorder were
losing buffered events because the io_service was stopped immediately
after calling async gRPC flush methods.

This fix:
- Adds a synchronous flush with configurable timeout in TaskEventBuffer::Stop()
- Adds StopExportingEvents() method to RayEventRecorder for graceful shutdown
- Calls StopExportingEvents() from GcsServer::Stop() before stopping io_service
- Adds new config option task_events_shutdown_flush_timeout_ms (default 5s)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: ruo <[email protected]>

Changes:
- Wait for in-flight gRPC to complete before final flush
- Perform final flush to send all buffered events
- Wait for flush gRPC to complete before shutdown
- Add payload checks to avoid sending empty gRPC requests
- Use lambda helper for cleaner wait logic
- Fix race condition: signal under mutex to avoid lost wakeup
- Fix overlapping gRPC: skip export if gRPC already in progress
- Fix re-enable window: use stopping_ flag instead of re-enabling
  to prevent new events during shutdown flush

This ensures no events are lost during shutdown while respecting
timeouts to avoid hanging indefinitely.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: ruo <[email protected]>
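
The shutdown sequence listed in the second commit (reject new events, skip overlapping exports, wait, final flush, wait again) can be sketched as below. This is an illustration under assumptions, not the PR's implementation: `ToyRecorder` and its methods are hypothetical, and the gRPC call is simulated by the caller invoking `OnExportDone()` where a completion callback would run.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical recorder (not Ray's RayEventRecorder).
class ToyRecorder {
 public:
  // Rejects events once shutdown has begun, so the final flush sees a
  // buffer that can no longer grow.
  bool AddEvent(std::string event) {
    std::lock_guard<std::mutex> lock(mu_);
    if (stopping_) return false;
    buffer_.push_back(std::move(event));
    return true;
  }

  void BeginStop() {
    std::lock_guard<std::mutex> lock(mu_);
    stopping_ = true;
  }

  // Hands the buffered events to the caller for sending. Returns false and
  // leaves the buffer untouched if an export is already in flight or there
  // is nothing to send (no overlapping or empty requests).
  bool TryStartExport(std::vector<std::string> *batch) {
    std::lock_guard<std::mutex> lock(mu_);
    if (export_in_progress_ || buffer_.empty()) return false;
    batch->clear();
    batch->swap(buffer_);
    export_in_progress_ = true;
    return true;
  }

  // Completion callback. The flag is cleared and the waiter notified while
  // holding mu_, so a waiter cannot observe the old value and then miss
  // the wakeup.
  void OnExportDone() {
    std::lock_guard<std::mutex> lock(mu_);
    export_in_progress_ = false;
    cv_.notify_all();
  }

  // Returns true if no export is in flight by the deadline.
  bool WaitForExport(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mu_);
    return cv_.wait_for(lock, timeout,
                        [this] { return !export_in_progress_; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::vector<std::string> buffer_;
  bool stopping_ = false;
  bool export_in_progress_ = false;
};
```

Under these assumptions, a stop routine would call `BeginStop()`, then `WaitForExport()` for any earlier call, then `TryStartExport()` for the final batch, then `WaitForExport()` once more before stopping the io_service.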
@ruoliu2 ruoliu2 force-pushed the ray-event-loss-during-shutdown branch from a8ae6c3 to 6d38b2d Compare January 18, 2026 05:28

Development

Successfully merging this pull request may close these issues.

[Core] Ray Events are lost during ray.shutdown() when using Event Export to external collector