-
Notifications
You must be signed in to change notification settings - Fork 6.6k
[core] Reset next_task_reply_position
when the actor restarts to avoid submitting the same task twice
#52759
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Kai-Hsun Chen <[email protected]>
Signed-off-by: Kai-Hsun Chen <[email protected]>
Signed-off-by: Kai-Hsun Chen <[email protected]>
Signed-off-by: Kai-Hsun Chen <[email protected]>
Some changes in this PR revert those introduced in #52249. |
didn't read closely yet, but cpp unit tests? |
void SequentialActorSubmitQueue::OnClientConnected() { | ||
auto head = requests.begin(); | ||
if (head != requests.end()) { | ||
RAY_CHECK(head->first >= next_task_reply_position); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline: we might be able to completely remove next_task_reply_position
and out_of_order_completed_tasks
/// Called when client is connected/reconnected. Resets `next_task_reply_position` to | ||
/// the smallest sequence number of the tasks in `requests`. | ||
void OnClientConnected() override; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this API need to be thread safe? Or is all concurrency control intended to be handled by the caller?
|
||
iter = 10 | ||
for i in range(iter): | ||
time.sleep(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please try to write the test in a way that doesn't rely on time.sleep
here or inside of f
. if you need help with how to do this, lmk
Use #52833 instead. |
Why are these changes needed?
Currently, we don't update the
next_task_reply_position
correctly. Hence, some tasks may belong to both out_of_order_completed_tasks and requests.That is, when a actor restarts and calls
ConnectActor
, some tasks may be submitted by bothResendOutOfOrderCompletedTasks
(out_of_order_completed_tasks) andSendPendingTasks
(requests).ray/src/ray/core_worker/transport/actor_task_submitter.cc
Lines 334 to 335 in 96a632d
However, the Ray task status transitions to a different state during the first submission (
ResendOutOfOrderCompletedTasks
), so the second task submission (i.e.,SendPendingTasks
) fails on the following line.ray/src/ray/core_worker/task_manager.cc
Line 1436 in 96a632d
Example
Add more details for the above statement. For example,
seq_no
: 0-99) to Actor A.seq_no
(0 ~ 3). The driver callsSequentialActorSubmitQueue::MarkSeqnoCompleted
4 times to updatenext_task_reply_position
to 4.seq_no
: 100 ~ 199)seq_no
(100 ~ 103). However, becausenext_task_reply_position
is 4, these 4 tasks are put intoout_of_order_completed_tasks
.seq_no
: 200 ~ 299). At the same time, the tasks (100 - 103) in theout_of_order_completed_tasks
have already been updated toseq_no
(200 - 203).ConnectActor
out_of_order_completed_tasks
andrequests
.Reproduction
Run the test without this PR
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.