[core] SetTaskStatus should only be called within the same lock scope where task_entry is retrieved #52770

Open: 1 commit to be merged into base branch master

kevin85421
Member

@kevin85421 kevin85421 commented May 4, 2025

Why are these changes needed?

This PR reverts #52695 and adds comments to explain where SetTaskStatus should be called.

#52695 updates the value in submissible_tasks_ without holding the mutex. If multiple threads or coroutines write to the map, a rehash or deletion may occur and invalidate the pointer to the value.
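A minimal sketch of the rule in the title, assuming simplified stand-ins: `MiniTaskManager`, `MarkStatus`, `GetStatus`, and the reduced `TaskEntry` are hypothetical names, and `std::unordered_map` stands in for the real map type so the example needs only the standard library. The lookup and the status update share one critical section, so the reference into the map never outlives the lock.

```cpp
#include <cassert>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins; these are not Ray's real TaskID / TaskEntry types.
using TaskID = std::string;
enum class TaskStatus { kPending, kRunning, kFinished };
struct TaskEntry {
  TaskStatus status = TaskStatus::kPending;
};

class MiniTaskManager {
 public:
  void AddTask(const TaskID &id) {
    std::lock_guard<std::mutex> lock(mu_);
    tasks_.emplace(id, TaskEntry{});
  }

  // Finds the entry and updates it inside one critical section, so the
  // reference into the map is never used after the lock is released.
  bool MarkStatus(const TaskID &id, TaskStatus status) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = tasks_.find(id);
    if (it == tasks_.end()) {
      return false;
    }
    SetTaskStatus(it->second, status);
    return true;
  }

  // Returns a copy of the status, or nullopt if the task is unknown.
  std::optional<TaskStatus> GetStatus(const TaskID &id) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = tasks_.find(id);
    if (it == tasks_.end()) {
      return std::nullopt;
    }
    return it->second.status;
  }

 private:
  // Must be called with mu_ held, in the scope where `entry` was retrieved.
  void SetTaskStatus(TaskEntry &entry, TaskStatus status) { entry.status = status; }

  std::mutex mu_;
  std::unordered_map<TaskID, TaskEntry> tasks_;
};

// Usage sketch.
inline void DemoSameScopeUpdate() {
  MiniTaskManager manager;
  manager.AddTask("task-1");
  const bool updated = manager.MarkStatus("task-1", TaskStatus::kRunning);
  assert(updated);
  (void)updated;
}
```

Keeping `SetTaskStatus` private and reachable only from methods that already hold the lock is one way to make the misuse harder to write.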

Outdated PR statement

Question

#52695 (comment)

Pointers to values in a flat_hash_map become invalid after a rehash. Additionally, we dereference those pointers in RetryTask, which doesn’t hold a mutex lock. Hence, it’s possible for the pointers to become invalid when other coroutines or threads insert or delete elements from the map, triggering a rehash.
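The hazard can be illustrated by analogy using only the standard library. In this sketch `std::vector` stands in for the flat table's backing array (an assumption made for illustration; it is not how `absl::flat_hash_map` is implemented in detail): elements are stored inline, so growing the storage moves them to a new address, and any pointer taken earlier dangles. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true if growing the container moved its first element.
bool InlineElementMovesOnGrowth() {
  std::vector<int> slots;
  slots.push_back(42);

  // Record the address as an integer; the old pointer must not be
  // dereferenced once the storage has been reallocated.
  const auto before = reinterpret_cast<std::uintptr_t>(&slots[0]);

  // Push past the current capacity to force a reallocation, which plays
  // the role of a rehash in this analogy.
  const std::size_t old_capacity = slots.capacity();
  while (slots.size() <= old_capacity) {
    slots.push_back(0);
  }
  assert(slots.capacity() > old_capacity);

  const auto after = reinterpret_cast<std::uintptr_t>(&slots[0]);
  return before != after;
}
```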

"Iterators, references, and pointers to elements are invalidated on rehash." (reference)

Solution

Changing submissible_tasks_ from absl::flat_hash_map<TaskID, TaskEntry> to absl::flat_hash_map<TaskID, std::unique_ptr<TaskEntry>> requires a lot of changes.
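Why the extra indirection would give stable pointers can be sketched as follows. Again `std::vector` stands in for inline table storage, and `TaskEntry` with `num_retries_left` is a simplified, hypothetical stand-in: when the storage grows, only the `std::unique_ptr` handles move, while the heap-allocated entry stays at the same address.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical, reduced stand-in for the real TaskEntry.
struct TaskEntry {
  int num_retries_left = 3;
};

// Returns true if a raw pointer to the entry is still valid and still
// refers to the stored entry after the container has grown.
bool PointeeStaysPutOnGrowth() {
  std::vector<std::unique_ptr<TaskEntry>> slots;
  slots.push_back(std::make_unique<TaskEntry>());
  TaskEntry *entry = slots[0].get();

  // Force a reallocation of the slot array (the analogue of a rehash).
  const std::size_t old_capacity = slots.capacity();
  while (slots.size() <= old_capacity) {
    slots.push_back(std::make_unique<TaskEntry>());
  }
  assert(slots.capacity() > old_capacity);

  // The unique_ptr handle moved, but the TaskEntry it owns did not.
  entry->num_retries_left -= 1;
  return slots[0].get() == entry && slots[0]->num_retries_left == 2;
}
```

Erasing the element would still destroy the entry, so this addresses rehashing only, not deletion by another thread.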

Hence, this PR implements a short-term solution: copy the value (i.e., TaskEntry) while holding the mutex, so that no other thread or coroutine can trigger a rehash during the copy.
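The copy-under-lock idea might look roughly like this. It is a sketch under stated assumptions: `SnapshotTaskManager`, `CopyTaskEntry`, and the reduced `TaskEntry` are hypothetical, and `std::unordered_map` is used in place of the real map type to keep the example standard-library only.

```cpp
#include <cassert>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins; not Ray's real types.
using TaskID = std::string;
struct TaskEntry {
  int num_retries_left = 0;
};

class SnapshotTaskManager {
 public:
  void AddTask(const TaskID &id, int retries) {
    std::lock_guard<std::mutex> lock(mu_);
    TaskEntry entry;
    entry.num_retries_left = retries;
    tasks_[id] = entry;
  }

  void RemoveTask(const TaskID &id) {
    std::lock_guard<std::mutex> lock(mu_);
    tasks_.erase(id);
  }

  // Copies the entry while the lock is held. The caller owns the copy, so
  // later inserts, erases, or rehashes of the map cannot invalidate it.
  std::optional<TaskEntry> CopyTaskEntry(const TaskID &id) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = tasks_.find(id);
    if (it == tasks_.end()) {
      return std::nullopt;
    }
    return it->second;
  }

 private:
  std::mutex mu_;
  std::unordered_map<TaskID, TaskEntry> tasks_;
};

// Usage sketch: the snapshot stays readable after the entry is erased.
inline int DemoSnapshotSurvivesErase() {
  SnapshotTaskManager manager;
  manager.AddTask("task-1", 3);
  std::optional<TaskEntry> snapshot = manager.CopyTaskEntry("task-1");
  manager.RemoveTask("task-1");
  assert(snapshot.has_value());
  return snapshot.has_value() ? snapshot->num_retries_left : -1;
}
```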

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@kevin85421
Member Author

@dentiny does this solution make sense to you? Thanks!

Contributor

@dentiny dentiny left a comment

Or, even simpler, you could use std::unordered_map, which guarantees pointer stability.

Internally, flat_hash_map (a Swiss table) uses open addressing to resolve collisions,
while std::unordered_map uses separate chaining, so each entry is heap-allocated in its own node.
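The pointer-stability guarantee of `std::unordered_map` can be checked with a small standard-library sketch (the function name is hypothetical). A pointer to a mapped value is taken, the table is grown and explicitly rehashed, and the pointer still refers to the same element.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Returns true if a pointer to a mapped value survives a rehash.
bool PointerSurvivesRehash() {
  std::unordered_map<int, std::string> tasks;
  tasks.emplace(0, "first");
  const std::string *value = &tasks.at(0);

  const std::size_t buckets_before = tasks.bucket_count();
  for (int i = 1; i <= 1000; ++i) {
    tasks.emplace(i, "other");
  }
  // Force a rehash to a strictly larger bucket count.
  tasks.rehash(2 * buckets_before + 1);
  assert(tasks.bucket_count() > buckets_before);

  // Rehashing relinks nodes but does not move the elements themselves.
  return value == &tasks.at(0) && *value == "first";
}
```

Stability covers rehashing only: erasing the element still destroys it, and concurrent access to the map itself still needs the mutex.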

Signed-off-by: Kai-Hsun Chen <[email protected]>
@kevin85421 kevin85421 force-pushed the laptop-ray-20250503 branch from 7a62fb1 to 1933eb8 Compare May 4, 2025 23:16
@kevin85421 kevin85421 changed the title [core] Change TaskEntry pointer to optional TaskEntry to avoid the pointer becomes invalid if map rehash happens [core] SetTaskStatus should only be called within the same lock scope where task_entry is retrieved May 4, 2025
@kevin85421 kevin85421 added the "go" label (add ONLY when ready to merge, run all tests) May 5, 2025
@kevin85421 kevin85421 marked this pull request as ready for review May 5, 2025 08:59
@@ -749,6 +740,11 @@ class TaskManager : public TaskFinisherInterface, public TaskResubmissionInterfa
/// \param attempt_number The attempt number to record the task status change
/// event. If not specified, the attempt number will be the current attempt number of
/// the task.
///
Member Author

This is the only change; the others are for the revert.

@kevin85421
Member Author

kevin85421 commented May 5, 2025

@dentiny I found that the original solution (copying it) doesn't work because we need to update the task_entry. Hence, I decided to revert the old solution for now until I have time to implement a better one.
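One way to read this limitation, sketched with hypothetical names and a reduced `TaskEntry`: a change made to a copy never reaches the entry stored in the map, whereas a change made through a reference does.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical, reduced stand-in for the real TaskEntry.
struct TaskEntry {
  int num_retries_left = 3;
};

// Decrements a copy, then returns the value still stored in the map.
int RetriesAfterUpdatingCopy() {
  std::unordered_map<std::string, TaskEntry> tasks;
  tasks.emplace("task-1", TaskEntry{});
  assert(tasks.count("task-1") == 1);

  TaskEntry copy = tasks.at("task-1");  // independent object
  copy.num_retries_left -= 1;           // the map does not see this
  return tasks.at("task-1").num_retries_left;
}

// Decrements through a reference, then returns the value stored in the map.
int RetriesAfterUpdatingInPlace() {
  std::unordered_map<std::string, TaskEntry> tasks;
  tasks.emplace("task-1", TaskEntry{});
  assert(tasks.count("task-1") == 1);

  TaskEntry &entry = tasks.at("task-1");  // refers to the stored entry
  entry.num_retries_left -= 1;
  return tasks.at("task-1").num_retries_left;
}
```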
