[core] Removing destroyed_actors_ cache #63551

Merged: edoakes merged 3 commits into ray-project:master from karticam:karticam/remove-destroyed-actor-cache on May 26, 2026

@karticam karticam commented May 20, 2026

Contributor

Issue

GcsActorManager::destroyed_actors_ caches dead actors as flat_hash_map<ActorID, shared_ptr<GcsActor>>. Each entry keeps the full GcsActor alive (including task_spec_ and lease_spec_), which leads to increasing memory consumption on the head node. The cap on the number of dead actors that can be accumulated in the cache is 100k.
An instance was observed where GCS memory consumption grew from ~1GB to ~5.5GB over a month due to ~16.3k dead actors.
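
As a rough back-of-the-envelope reading of those figures, the sketch below assumes that all of the observed growth is attributable to cached dead actors and uses decimal units. Under that assumption each cached entry costs a few hundred kilobytes on average. The helper and constant names are made up for illustration, and the projection to the cap is an extrapolation, not a measurement.

```cpp
// Hypothetical helper: average retained bytes per cached dead actor,
// assuming every byte of growth comes from the destroyed-actor cache.
constexpr double BytesPerDeadActor(double growth_bytes, double dead_actors) {
  return growth_bytes / dead_actors;
}

// Approximate figures quoted in the issue description above.
constexpr double kGrowthBytes = (5.5 - 1.0) * 1e9;  // ~4.5 GB of growth
constexpr double kDeadActors = 16300;               // ~16.3k dead actors
constexpr double kCacheCap = 100000;                // cap of 100k entries

// Average cost per entry, and the projected cost if the cache filled up
// with entries of the same average size.
constexpr double kPerActorBytes = BytesPerDeadActor(kGrowthBytes, kDeadActors);
constexpr double kAtCapBytes = kPerActorBytes * kCacheCap;
```

With these inputs the average comes out to roughly 276 KB per dead actor, so a cache filled to its 100k cap with similar entries would hold on the order of 27.6 GB.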

Fix

The cache is used mainly for observability (e.g. ray list actors) and a few control-plane paths. However, every consumer of the cached GcsActor ends up reading only its rpc::ActorTableData.

To release the bulky GcsActor when an actor is destroyed, we replace the cache with destroyed_actor_observability_data_, holding a lightweight ActorObservabilityData struct that wraps only rpc::ActorTableData. When DestroyActor runs, the heavy GcsActor is now actually freed.

  • Added GetActorTableData(actor_id): a best-effort lookup returning const rpc::ActorTableData * from either live or destroyed state.
  • Refactored AddActorInfo and the Gen*Cause death-cause helpers to take const rpc::ActorTableData * instead of const GcsActor *.
  • Migrated all GetActor() callsites and deleted GetActor().
  • Initialize() rehydrates dead actors directly into the lightweight cache and bumps the DEAD counter to preserve the cumulative-deaths gauge across GCS restarts.
  • Test: added a weak_ptr.expired() assertion proving the GcsActor heap object is freed after DestroyActor.
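
The shape of the change can be sketched in isolation. The following is a simplified illustration, not the code in this PR: `ActorManagerSketch`, `HeavyActor`, and the string-based `ActorId` and `state` fields are hypothetical stand-ins for `GcsActorManager`, `GcsActor`, `ActorID`, and the protobuf state enum, and it uses standard containers instead of `absl::flat_hash_map`. For brevity it stores the table data directly in the map rather than in a wrapper struct, and it assumes actor IDs are never reused.

```cpp
#include <cstddef>
#include <deque>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

using ActorId = std::string;  // stand-in for ActorID
struct ActorTableData {       // stand-in for rpc::ActorTableData
  std::string class_name;
  std::string state;
};
struct HeavyActor {           // stand-in for GcsActor
  ActorTableData table_data;
  std::string task_spec;      // represents the bulky payload
};

class ActorManagerSketch {
 public:
  explicit ActorManagerSketch(std::size_t max_destroyed)
      : max_destroyed_(max_destroyed) {}

  void Register(const ActorId &id, std::shared_ptr<HeavyActor> actor) {
    registered_[id] = std::move(actor);
  }

  // Copy out the lightweight snapshot, then drop the heavy object.
  void Destroy(const ActorId &id) {
    auto it = registered_.find(id);
    if (it == registered_.end()) return;
    ActorTableData snapshot = it->second->table_data;
    snapshot.state = "DEAD";
    registered_.erase(it);  // releases the manager's shared_ptr
    destroyed_[id] = std::move(snapshot);
    destroyed_order_.push_back(id);
    if (destroyed_order_.size() > max_destroyed_) {  // evict the oldest entry
      destroyed_.erase(destroyed_order_.front());
      destroyed_order_.pop_front();
    }
  }

  // Best-effort lookup: live actors first, then destroyed ones.
  // Returns nullptr when the actor is unknown or already evicted.
  const ActorTableData *GetActorTableData(const ActorId &id) const {
    if (auto it = registered_.find(id); it != registered_.end()) {
      return &it->second->table_data;
    }
    if (auto it = destroyed_.find(id); it != destroyed_.end()) {
      return &it->second;
    }
    return nullptr;
  }

 private:
  std::size_t max_destroyed_;
  std::unordered_map<ActorId, std::shared_ptr<HeavyActor>> registered_;
  std::unordered_map<ActorId, ActorTableData> destroyed_;
  std::deque<ActorId> destroyed_order_;  // insertion order for eviction
};
```

Callers that only need table data go through the single lookup and handle `nullptr`, so they do not need to know whether the actor is live or destroyed.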

More details can be found here: https://docs.google.com/document/d/1ocSw8EdU9dNjNbhIUUySU0YOCOgAQp-BTZ4TY1G6A1E/edit?tab=t.0

@karticam karticam requested a review from a team as a code owner May 20, 2026 20:06
@karticam karticam added the core (Issues that should be addressed in Ray Core) and go (add ONLY when ready to merge, run all tests) labels May 20, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request refactors the GcsActorManager to use a lightweight ActorObservabilityData structure for caching destroyed actors, replacing the previous approach of storing full GcsActor instances. This change significantly reduces memory usage by allowing heavy resources like task and lease specifications to be freed upon actor destruction. The feedback suggests further simplifying the HandleGetActorInfo method by leveraging the new GetActorTableData helper function to improve code conciseness and reuse.

Comment thread src/ray/gcs/actor/gcs_actor_manager.cc Outdated
Comment on lines 473 to 482
   const auto &registered_actor_iter = registered_actors_.find(actor_id);
-  GcsActor *ptr = nullptr;
   if (registered_actor_iter != registered_actors_.end()) {
-    ptr = registered_actor_iter->second.get();
+    *reply->mutable_actor_table_data() =
+        registered_actor_iter->second->GetActorTableData();
   } else {
-    const auto &destroyed_actor_iter = destroyed_actors_.find(actor_id);
-    if (destroyed_actor_iter != destroyed_actors_.end()) {
-      ptr = destroyed_actor_iter->second.get();
+    const auto &observability_iter = destroyed_actor_observability_data_.find(actor_id);
+    if (observability_iter != destroyed_actor_observability_data_.end()) {
+      *reply->mutable_actor_table_data() = observability_iter->second.actor_table_data;
     }
   }

Contributor

medium

This logic for finding actor data can be simplified by using the newly introduced GetActorTableData helper function. This would improve code reuse and make the implementation more concise.

  const auto *actor_data = GetActorTableData(actor_id);
  if (actor_data) {
    *reply->mutable_actor_table_data() = *actor_data;
  }

Contributor Author

done

@Yicheng-Lu-llll Yicheng-Lu-llll self-assigned this May 20, 2026
Comment thread src/ray/gcs/actor/gcs_actor_manager.h Outdated
Comment on lines +471 to +475
/// Lightweight observability snapshots of destroyed actors. Stores only
/// `ActorTableData` (not the full `GcsActor` with `task_spec_`/`lease_spec_`)
/// so that the heavy heap state is freed when `registered_actors_` releases
/// its `shared_ptr`.
absl::flat_hash_map<ActorID, ActorObservabilityData>

Collaborator

no need for the indirection with ActorObservabilityData struct, better to just make it <ActorID, ActorTableData>

the name of the map and list already convey that they're for observability only

Contributor Author

done

Comment thread src/ray/gcs/actor/gcs_actor_manager.cc Outdated
Comment on lines +1793 to +1794
actor_state_counter_->Increment(
{rpc::ActorTableData::DEAD, actor_table_data.class_name()});

Collaborator

was this previously done inside of the GcsActor constructor?

Contributor Author

yes
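
For readers following this exchange, here is a minimal sketch of the difference between counting as a constructor side effect and counting explicitly. It is an illustration under simplifying assumptions: `StateCounter`, `HeavyActor`, and `LoadDeadActorSnapshot` are hypothetical names, and the real counter and `GcsActor` constructor take more arguments than shown.

```cpp
#include <map>
#include <string>
#include <utility>

enum class State { ALIVE, DEAD };
using CounterKey = std::pair<State, std::string>;  // (state, class name)

// Hypothetical stand-in for the manager's actor state counter.
struct StateCounter {
  std::map<CounterKey, long> counts;
  void Increment(const CounterKey &key) { ++counts[key]; }
  long Get(const CounterKey &key) const {
    auto it = counts.find(key);
    return it == counts.end() ? 0 : it->second;
  }
};

// Before: building the heavy object recorded its state as a side effect.
struct HeavyActor {
  HeavyActor(State state, const std::string &class_name, StateCounter &counter) {
    counter.Increment(CounterKey{state, class_name});
  }
};

// After: only a snapshot is kept, so the load path increments explicitly.
void LoadDeadActorSnapshot(const std::string &class_name, StateCounter &counter) {
  counter.Increment(CounterKey{State::DEAD, class_name});
}
```

Both paths leave the counter in the same state for a dead actor. The difference is that the second one no longer allocates the heavy object just to obtain the side effect.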

Comment thread src/ray/gcs/actor/gcs_actor_manager.cc Outdated
actor_table_data, actor_state_counter_, ray_event_recorder_, session_name_);
destroyed_actors_.emplace(actor_id, actor);
sorted_destroyed_actor_list_.emplace_back(
// Rehydrate dead actors directly into the lightweight observability cache

Collaborator

"Rehydrate" is a bit confusing here -- isn't it the first time we're inserting into the observability cache? I don't really understand the point about "instead of constructing a full GcsActor we'd immediately discard". That implementation wouldn't really make sense here

@karticam karticam May 23, 2026

Contributor Author

I used "rehydrate" earlier because this flow is also triggered when GCS restarts after actor deaths (when data has been persisted to Redis). And yes, I agree with the second point.
Rephrased the comment here.

Comment on lines +400 to +404
// Capture a weak_ptr to an actor that will end up in the observability cache
// (actors 10..19 survive eviction; pick 15). If destroyed_actor_observability_data_
// accidentally pins the heavy GcsActor, this weak_ptr would stay alive.
std::weak_ptr<gcs::GcsActor> weak_cached_actor;
ActorID cached_actor_id;

Collaborator

this mechanism for testing feels a little too "clever". we don't really need to test if the observability data cache pins this, the compiler already tells us that since the observability cache only stores the ActorTableData

@karticam karticam May 23, 2026

Contributor Author

claude came up with the idea lol
but removed it now. just checking the number of dead actors and cache size in the test now.
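
For reference, the mechanism discussed in this thread can be reproduced in a few lines. This is a standalone sketch with hypothetical types (`Snapshot` and `Heavy` stand in for `ActorTableData` and `GcsActor`), not the test that was removed from the PR.

```cpp
#include <memory>
#include <string>
#include <unordered_map>

struct Snapshot { std::string class_name; };              // lightweight part
struct Heavy { Snapshot snapshot; std::string spec; };    // bulky object

// New layout: the cache stores copies of the snapshot, so releasing the
// last shared_ptr frees the heavy object.
bool HeavyObjectFreedWithSnapshotCache() {
  std::unordered_map<int, Snapshot> cache;
  auto owner = std::make_shared<Heavy>();
  owner->snapshot.class_name = "Foo";

  std::weak_ptr<Heavy> observer = owner;  // observes without owning
  cache[15] = owner->snapshot;            // copy only the lightweight part
  owner.reset();                          // last strong reference released

  return observer.expired() && cache.at(15).class_name == "Foo";
}

// Old layout: the cache holds a shared_ptr, which keeps the heavy object alive.
bool HeavyObjectPinnedBySharedPtrCache() {
  std::unordered_map<int, std::shared_ptr<Heavy>> cache;
  auto owner = std::make_shared<Heavy>();

  std::weak_ptr<Heavy> observer = owner;
  cache[15] = owner;  // second strong reference
  owner.reset();

  return !observer.expired();
}
```

The reviewer's point shows up in the types: once the map's value type is the snapshot itself, the cache cannot hold a reference to the heavy object, so the runtime check adds little beyond what the compiler already enforces.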

Comment thread src/ray/gcs/actor/tests/gcs_actor_manager_test.cc
Comment thread src/ray/gcs/actor/tests/gcs_actor_manager_test.cc
@edoakes edoakes enabled auto-merge (squash) May 24, 2026 00:19
@github-actions github-actions Bot disabled auto-merge May 24, 2026 00:20
Comment thread src/ray/gcs/actor/gcs_actor_manager.cc Outdated
karticam added 2 commits May 25, 2026 23:23
Signed-off-by: Kartica Modi <karticamodi@gmail.com>
Signed-off-by: Kartica Modi <karticamodi@gmail.com>
@karticam karticam force-pushed the karticam/remove-destroyed-actor-cache branch from 41dfa0d to 6232cc4 Compare May 26, 2026 06:23
Signed-off-by: Kartica Modi <karticamodi@gmail.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 8827e43.

} else {
dead_actors.push_back(actor_id);
auto actor = std::make_shared<GcsActor>(
actor_table_data, actor_state_counter_, ray_event_recorder_, session_name_);

State counter leak for non-DEAD actors during cache eviction

Low Severity

During Initialize, OnInitializeActorShouldLoad can return false for actors in non-DEAD states (e.g. ALIVE actors whose job or owner died). The new code increments actor_state_counter_ for these actors' actual state, but when they are later evicted from destroyed_actor_observability_data_, no decrement occurs. Previously, the GcsActor destructor would decrement the counter for non-DEAD states upon eviction. This causes a permanent inflation of non-DEAD state counters in metrics after GCS restart with orphaned actors.

Contributor Author

This is valid. Fixed in #63647

@edoakes edoakes merged commit 9ff20ef into ray-project:master May 26, 2026
6 checks passed
edoakes pushed a commit that referenced this pull request May 27, 2026
This is a follow-up to this PR:
#63551

Cursor Bugbot found a bug in
#63551: when an entry is evicted from
the destroyed actor cache and the actor's state is not DEAD, the actor
state counter is not decremented. This is the link to the comment:
#63551 (comment)

A non-DEAD actor can end up in the destroyed actor cache in the
following way:
1. When initializing GCS after a failure, we read the persisted data
for actors.
2. For each actor, we put it in either the registered actors or the
destroyed actors based on this check: `OnInitializeActorShouldLoad`. We
put an actor in the destroyed actor cache if:

- It is dead and not restartable.
- The job it belongs to is dead or the root detached actor it belongs to
is dead.

So when the job is dead but the actor isn't marked DEAD yet, the actor
can go into the destroyed cache even though it's not DEAD.

Before #63551, we constructed a
`GcsActor` in the Initialize path, which incremented the actor state
counter, and eviction from the destroyed actor cache then triggered
`~GcsActor`, which decremented the counter if the actor's state was not
DEAD. Now that the cache no longer holds a `GcsActor`, we do the
decrement manually.
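
The manual bookkeeping described above can be sketched as follows. This is an illustration under simplifying assumptions, not the code from the follow-up PR: `DestroyedCacheSketch` and its members are hypothetical, integer IDs are assumed to be unique, and eviction is plain FIFO.

```cpp
#include <cstddef>
#include <deque>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>

enum class State { ALIVE, DEAD };
struct Snapshot {  // stand-in for rpc::ActorTableData
  State state;
  std::string class_name;
};
using CounterKey = std::pair<State, std::string>;

class DestroyedCacheSketch {
 public:
  explicit DestroyedCacheSketch(std::size_t cap) : cap_(cap) {}

  // Mirrors what constructing a GcsActor used to do: count the actor's state.
  void Insert(int id, Snapshot snap) {
    CounterKey key{snap.state, snap.class_name};
    ++counts_[key];
    order_.push_back(id);
    cache_.emplace(id, std::move(snap));
    while (order_.size() > cap_) EvictOldest();
  }

  long Count(State state, const std::string &class_name) const {
    auto it = counts_.find(CounterKey{state, class_name});
    return it == counts_.end() ? 0 : it->second;
  }

 private:
  // Mirrors what ~GcsActor used to do: a non-DEAD state is decremented on
  // eviction, while the DEAD count is left untouched.
  void EvictOldest() {
    auto it = cache_.find(order_.front());
    if (it->second.state != State::DEAD) {
      CounterKey key{it->second.state, it->second.class_name};
      --counts_[key];
    }
    cache_.erase(it);
    order_.pop_front();
  }

  std::size_t cap_;
  std::unordered_map<int, Snapshot> cache_;
  std::deque<int> order_;  // insertion order for eviction
  std::map<CounterKey, long> counts_;
};
```

Without the decrement in `EvictOldest`, an orphaned ALIVE entry loaded at startup would stay in the ALIVE count after it left the cache, which is the inflation the bug report describes.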

---------

Signed-off-by: Kartica Modi <karticamodi@gmail.com>
@karticam karticam changed the title from "Removing destroyed_actors_ cache" to "[core] Removing destroyed_actors_ cache" May 27, 2026
Neelansh-Khare pushed a commit to Neelansh-Khare/ray-clone that referenced this pull request Jun 5, 2026