[core] Fixing actor state counter bug#63647
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates GcsActorManager::AddDestroyedActorObservabilityData to decrement the actor state counter for non-DEAD states upon eviction, mirroring the behavior of GcsActor::~GcsActor. The reviewer pointed out a potential safety issue where holding a reference to the front element of sorted_destroyed_actor_observability_list_ could lead to a dangling reference after it is popped, and suggested copying the lightweight ActorID instead.
| const auto &actor_id = sorted_destroyed_actor_observability_list_.front().first; | ||
| gcs_table_storage_->ActorTable().Delete(actor_id, {[](auto) {}, io_context_}); | ||
| destroyed_actor_observability_data_.erase(actor_id); | ||
| const auto &evict_id = sorted_destroyed_actor_observability_list_.front().first; |
There was a problem hiding this comment.
Holding a reference to sorted_destroyed_actor_observability_list_.front().first is risky because the element is popped from the list at the end of this block (sorted_destroyed_actor_observability_list_.pop_front()). While it is not used after the pop in the current implementation, copying the ActorID (which is a lightweight, cheap-to-copy identifier) is much safer and prevents potential dangling reference or use-after-free issues if the code is refactored or reordered in the future.
| const auto &evict_id = sorted_destroyed_actor_observability_list_.front().first; | |
| const auto evict_id = sorted_destroyed_actor_observability_list_.front().first; |
There was a problem hiding this comment.
it was the same in the code before this PR: #63551
therefore keeping it unchanged. also, its not a bug right now
| const auto &actor_id = sorted_destroyed_actor_observability_list_.front().first; | ||
| gcs_table_storage_->ActorTable().Delete(actor_id, {[](auto) {}, io_context_}); | ||
| destroyed_actor_observability_data_.erase(actor_id); | ||
| const auto &evict_id = sorted_destroyed_actor_observability_list_.front().first; |
There was a problem hiding this comment.
Could we describe in the PR description when we'd end up with a non dead actor in the destroyed actor cache?
There was a problem hiding this comment.
added in the description. lmk if you have any additional questions
| destroyed_actor_observability_data_.erase(actor_id); | ||
| const auto &evict_id = sorted_destroyed_actor_observability_list_.front().first; | ||
|
|
||
| // Mirror GcsActor::~GcsActor: decrement the counter for non-DEAD states on eviction. |
There was a problem hiding this comment.
Should we also add a test for this?
There was a problem hiding this comment.
yeah we should. added!
Signed-off-by: Kartica Modi <karticamodi@gmail.com>
14f2819 to
2fe8755
Compare
Signed-off-by: Kartica Modi <karticamodi@gmail.com>
This is followup to this PR: ray-project#63551 There was a bug found by cursor bot in ray-project#63551 where, when we evict from the destroyed actor cache, if the state of actor is not DEAD, actor state counter will not be decremented. This is the link to the comment: ray-project#63551 (comment) A non-DEAD actor can be in the destroyed actor cache in the following way: 1. When initializing GCS after failure, we read persisted data for actors. 2. For each actor, we either put it in registered actors or destroyed actors based on this check: `OnInitializeActorShouldLoad`. We put an actor in destroyed actor cache if: - It is dead and not restartable. - The job it belongs to is dead or the root detached actor it belongs to is dead. So when the job is dead, but the actor isn't marked DEAD yet, it might go to the destroyed cache even when its not DEAD. Before ray-project#63551, we used to construct `GcsActor` in Initialize path, which would increment the actor state counter and then eviction from destroyed actor cache would trigger `~GcsActor` which would decrement the actor state counter if the actor state is not DEAD. Now that we don't use GcsActor, we do it manually. --------- Signed-off-by: Kartica Modi <karticamodi@gmail.com> Signed-off-by: Neelansh Khare <kharen@uci.edu>
This is followup to this PR: #63551
There was a bug found by cursor bot in #63551 where, when we evict from the destroyed actor cache, if the state of actor is not DEAD, actor state counter will not be decremented. This is the link to the comment: #63551 (comment)
A non-DEAD actor can be in the destroyed actor cache in the following way:
OnInitializeActorShouldLoad. We put an actor in destroyed actor cache if:So when the job is dead, but the actor isn't marked DEAD yet, it might go to the destroyed cache even when its not DEAD.
Before #63551, we used to construct
GcsActorin Initialize path, which would increment the actor state counter and then eviction from destroyed actor cache would trigger~GcsActorwhich would decrement the actor state counter if the actor state is not DEAD. Now that we don't use GcsActor, we do it manually.