Skip to content

[core] Fixing actor state counter bug#63647

Merged
edoakes merged 2 commits into
ray-project:masterfrom
karticam:karticam/followup-fix-act-state-counter
May 27, 2026
Merged

[core] Fixing actor state counter bug#63647
edoakes merged 2 commits into
ray-project:masterfrom
karticam:karticam/followup-fix-act-state-counter

Conversation

@karticam

@karticam karticam commented May 26, 2026

Copy link
Copy Markdown
Contributor

This is followup to this PR: #63551

There was a bug found by cursor bot in #63551 where, when we evict from the destroyed actor cache, if the state of actor is not DEAD, actor state counter will not be decremented. This is the link to the comment: #63551 (comment)

A non-DEAD actor can be in the destroyed actor cache in the following way:

  1. When initializing GCS after failure, we read persisted data for actors.
  2. For each actor, we either put it in registered actors or destroyed actors based on this check: OnInitializeActorShouldLoad. We put an actor in destroyed actor cache if:
  • It is dead and not restartable.
  • The job it belongs to is dead or the root detached actor it belongs to is dead.

So when the job is dead, but the actor isn't marked DEAD yet, it might go to the destroyed cache even when its not DEAD.

Before #63551, we used to construct GcsActor in Initialize path, which would increment the actor state counter and then eviction from destroyed actor cache would trigger ~GcsActor which would decrement the actor state counter if the actor state is not DEAD. Now that we don't use GcsActor, we do it manually.

@karticam karticam requested a review from a team as a code owner May 26, 2026 20:29

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates GcsActorManager::AddDestroyedActorObservabilityData to decrement the actor state counter for non-DEAD states upon eviction, mirroring the behavior of GcsActor::~GcsActor. The reviewer pointed out a potential safety issue where holding a reference to the front element of sorted_destroyed_actor_observability_list_ could lead to a dangling reference after it is popped, and suggested copying the lightweight ActorID instead.

const auto &actor_id = sorted_destroyed_actor_observability_list_.front().first;
gcs_table_storage_->ActorTable().Delete(actor_id, {[](auto) {}, io_context_});
destroyed_actor_observability_data_.erase(actor_id);
const auto &evict_id = sorted_destroyed_actor_observability_list_.front().first;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Holding a reference to sorted_destroyed_actor_observability_list_.front().first is risky because the element is popped from the list at the end of this block (sorted_destroyed_actor_observability_list_.pop_front()). While it is not used after the pop in the current implementation, copying the ActorID (which is a lightweight, cheap-to-copy identifier) is much safer and prevents potential dangling reference or use-after-free issues if the code is refactored or reordered in the future.

Suggested change
const auto &evict_id = sorted_destroyed_actor_observability_list_.front().first;
const auto evict_id = sorted_destroyed_actor_observability_list_.front().first;

@karticam karticam May 26, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was the same in the code before this PR: #63551
therefore keeping it unchanged. also, its not a bug right now

@karticam karticam added core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests labels May 26, 2026
const auto &actor_id = sorted_destroyed_actor_observability_list_.front().first;
gcs_table_storage_->ActorTable().Delete(actor_id, {[](auto) {}, io_context_});
destroyed_actor_observability_data_.erase(actor_id);
const auto &evict_id = sorted_destroyed_actor_observability_list_.front().first;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we describe in the PR description when we'd end up with a non dead actor in the destroyed actor cache?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added in the description. lmk if you have any additional questions

destroyed_actor_observability_data_.erase(actor_id);
const auto &evict_id = sorted_destroyed_actor_observability_list_.front().first;

// Mirror GcsActor::~GcsActor: decrement the counter for non-DEAD states on eviction.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also add a test for this?

@karticam karticam May 27, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah we should. added!

Signed-off-by: Kartica Modi <karticamodi@gmail.com>
@karticam karticam force-pushed the karticam/followup-fix-act-state-counter branch from 14f2819 to 2fe8755 Compare May 27, 2026 01:09
Signed-off-by: Kartica Modi <karticamodi@gmail.com>
@edoakes edoakes merged commit a3604b1 into ray-project:master May 27, 2026
6 checks passed
@karticam karticam changed the title Fixing actor state counter bug [core] Fixing actor state counter bug May 27, 2026
Neelansh-Khare pushed a commit to Neelansh-Khare/ray-clone that referenced this pull request Jun 5, 2026
This is followup to this PR:
ray-project#63551

There was a bug found by cursor bot in
ray-project#63551 where, when we evict from
the destroyed actor cache, if the state of actor is not DEAD, actor
state counter will not be decremented. This is the link to the comment:
ray-project#63551 (comment)

A non-DEAD actor can be in the destroyed actor cache in the following
way:
1. When initializing GCS after failure, we read persisted data for
actors.
2. For each actor, we either put it in registered actors or destroyed
actors based on this check: `OnInitializeActorShouldLoad`. We put an
actor in destroyed actor cache if:

- It is dead and not restartable.
- The job it belongs to is dead or the root detached actor it belongs to
is dead.

So when the job is dead, but the actor isn't marked DEAD yet, it might
go to the destroyed cache even when its not DEAD.

Before ray-project#63551, we used to
construct `GcsActor` in Initialize path, which would increment the actor
state counter and then eviction from destroyed actor cache would trigger
`~GcsActor` which would decrement the actor state counter if the actor
state is not DEAD. Now that we don't use GcsActor, we do it manually.

---------

Signed-off-by: Kartica Modi <karticamodi@gmail.com>
Signed-off-by: Neelansh Khare <kharen@uci.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants