[core] Removing destroyed_actors_ cache by karticam · Pull Request #63551 · ray-project/ray

karticam · 2026-05-20T20:06:02Z

Issue

GcsActorManager::destroyed_actors_ caches dead actors as flat_hash_map<ActorID, shared_ptr<GcsActor>>. Each entry keeps the full GcsActor alive (including task_spec_ and lease_spec_), which leads to increasing memory consumption on the head node. The cap on the number of dead actors that can be accumulated in the cache is 100k.
An instance was observed where GCS memory consumption grew from ~1GB to ~5.5GB over a month due to ~16.3k dead actors.

Fix

The cache is used mainly for observability (e.g. ray list actors) and a few control-plane paths. However, every consumer of GcsActor ends up reading only the rpc::ActorTableData.

To release the bulky GcsActor when an actor is destroyed, we replace the cache with destroyed_actor_observability_data_, holding a lightweight ActorObservabilityData struct that wraps only rpc::ActorTableData. When DestroyActor runs, the heavy GcsActor is now actually freed.
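A minimal sketch of the ownership change described above. It uses hypothetical stand-in types (`TableData`, `HeavyActor`, `Manager`) and plain `int` IDs rather than Ray's real classes, so it illustrates the idea only: once the destroyed-side map stores a copy of the small record instead of a `shared_ptr` to the whole actor, erasing the entry from the registered map drops the last reference.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins; these are not Ray's real types.
struct TableData {  // plays the role of rpc::ActorTableData
  std::string class_name;
  bool dead = false;
};

struct HeavyActor {  // plays the role of GcsActor
  TableData table_data;
  std::vector<char> task_spec = std::vector<char>(1 << 20);  // bulky payload
};

class Manager {
 public:
  // Registers a live actor and returns a weak_ptr so callers can observe
  // when the heavy object is actually freed.
  std::weak_ptr<HeavyActor> Register(int id, const std::string &class_name) {
    auto actor = std::make_shared<HeavyActor>();
    actor->table_data.class_name = class_name;
    registered_[id] = actor;
    return actor;
  }

  // Copies only the small record into the destroyed-side map, then erases
  // the registered entry, which releases the last shared_ptr.
  void Destroy(int id) {
    auto it = registered_.find(id);
    if (it == registered_.end()) return;
    assert(it->second != nullptr);
    TableData snapshot = it->second->table_data;
    snapshot.dead = true;
    destroyed_data_[id] = snapshot;
    registered_.erase(it);
  }

  bool HasDestroyedRecord(int id) const { return destroyed_data_.count(id) > 0; }

 private:
  std::unordered_map<int, std::shared_ptr<HeavyActor>> registered_;
  std::unordered_map<int, TableData> destroyed_data_;
};
```

In this sketch the `weak_ptr` returned by `Register` expires right after `Destroy`, while the small record stays queryable.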

- Added GetActorTableData(actor_id): a best-effort lookup returning const rpc::ActorTableData * from either live or destroyed state.
- Refactored AddActorInfo and the Gen*Cause death-cause helpers to take const rpc::ActorTableData * instead of const GcsActor *.
- Migrated all GetActor() callsites and deleted GetActor().
- Initialize() rehydrates dead actors directly into the lightweight cache and bumps the DEAD counter to preserve the cumulative-deaths gauge across GCS restarts.
- Test: added a weak_ptr.expired() assertion proving the GcsActor heap object is freed after DestroyActor.
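The best-effort lookup in the first bullet could look roughly like the sketch below. The types, the `int` IDs, and the `AddLive`/`AddDestroyed` helpers are assumptions made for illustration; only the name `GetActorTableData` and its "live first, then destroyed, else null" behavior come from the description above.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins; these are not Ray's real types.
struct TableData {
  std::string class_name;
  std::string state;
};

struct HeavyActor {
  TableData table_data;
};

class Manager {
 public:
  void AddLive(int id, TableData data) {
    auto actor = std::make_shared<HeavyActor>();
    actor->table_data = std::move(data);
    registered_[id] = std::move(actor);
  }

  void AddDestroyed(int id, TableData data) {
    destroyed_data_[id] = std::move(data);
  }

  // Checks live actors first, then the destroyed-side records.
  // Returns nullptr when the ID is unknown in both maps.
  const TableData *GetActorTableData(int id) const {
    auto live = registered_.find(id);
    if (live != registered_.end()) {
      assert(live->second != nullptr);
      return &live->second->table_data;
    }
    auto dead = destroyed_data_.find(id);
    if (dead != destroyed_data_.end()) return &dead->second;
    return nullptr;
  }

 private:
  std::unordered_map<int, std::shared_ptr<HeavyActor>> registered_;
  std::unordered_map<int, TableData> destroyed_data_;
};
```

One design caveat: a raw pointer into a hash map is only safe to use briefly. With `absl::flat_hash_map`, element pointers are invalidated by rehashing, so a caller would presumably copy what it needs before the map is mutated again.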

More details can be found here: https://docs.google.com/document/d/1ocSw8EdU9dNjNbhIUUySU0YOCOgAQp-BTZ4TY1G6A1E/edit?tab=t.0

gemini-code-assist

Code Review

This pull request refactors the GcsActorManager to use a lightweight ActorObservabilityData structure for caching destroyed actors, replacing the previous approach of storing full GcsActor instances. This change significantly reduces memory usage by allowing heavy resources like task and lease specifications to be freed upon actor destruction. The feedback suggests further simplifying the HandleGetActorInfo method by leveraging the new GetActorTableData helper function to improve code conciseness and reuse.

gemini-code-assist · 2026-05-20T20:10:35Z

  const auto &registered_actor_iter = registered_actors_.find(actor_id);
-  GcsActor *ptr = nullptr;
  if (registered_actor_iter != registered_actors_.end()) {
-    ptr = registered_actor_iter->second.get();
+    *reply->mutable_actor_table_data() =
+        registered_actor_iter->second->GetActorTableData();
  } else {
-    const auto &destroyed_actor_iter = destroyed_actors_.find(actor_id);
-    if (destroyed_actor_iter != destroyed_actors_.end()) {
-      ptr = destroyed_actor_iter->second.get();
+    const auto &observability_iter = destroyed_actor_observability_data_.find(actor_id);
+    if (observability_iter != destroyed_actor_observability_data_.end()) {
+      *reply->mutable_actor_table_data() = observability_iter->second.actor_table_data;
    }
  }


This logic for finding actor data can be simplified by using the newly introduced GetActorTableData helper function. This would improve code reuse and make the implementation more concise.

  const auto *actor_data = GetActorTableData(actor_id);
  if (actor_data) {
    *reply->mutable_actor_table_data() = *actor_data;
  }

edoakes · 2026-05-21T19:57:30Z

+  /// Lightweight observability snapshots of destroyed actors. Stores only
+  /// `ActorTableData` (not the full `GcsActor` with `task_spec_`/`lease_spec_`)
+  /// so that the heavy heap state is freed when `registered_actors_` releases
+  /// its `shared_ptr`.
+  absl::flat_hash_map<ActorID, ActorObservabilityData>


no need for the indirection with ActorObservabilityData struct, better to just make it <ActorID, ActorTableData>

the name of the map and list already convey that they're for observability only

edoakes · 2026-05-21T19:58:30Z

+      actor_state_counter_->Increment(
+          {rpc::ActorTableData::DEAD, actor_table_data.class_name()});


was this previously done inside of the GcsActor constructor?

edoakes · 2026-05-21T19:59:23Z

-          actor_table_data, actor_state_counter_, ray_event_recorder_, session_name_);
-      destroyed_actors_.emplace(actor_id, actor);
-      sorted_destroyed_actor_list_.emplace_back(
+      // Rehydrate dead actors directly into the lightweight observability cache


"Rehydrate" is a bit confusing here -- isn't it the first time we're inserting into the observability cache? I don't really understand the point about "instead of constructing a full GcsActor we'd immediately discard". That implementation wouldn't really make sense here

I used "rehydrate" earlier because this flow is also triggered when GCS restarts after actor deaths (in case we have persisted data to Redis). And yes, I agree with the second point.
Rephrased the comment here.

edoakes · 2026-05-21T20:03:35Z

+  // Capture a weak_ptr to an actor that will end up in the observability cache
+  // (actors 10..19 survive eviction; pick 15). If destroyed_actor_observability_data_
+  // accidentally pins the heavy GcsActor, this weak_ptr would stay alive.
+  std::weak_ptr<gcs::GcsActor> weak_cached_actor;
+  ActorID cached_actor_id;


this mechanism for testing feels a little too "clever". we don't really need to test if the observability data cache pins this, the compiler already tells us that since the observability cache only stores the ActorTableData

claude came up with the idea lol
but removed it now. just checking the number of dead actors and cache size in the test now.

Signed-off-by: Kartica Modi <karticamodi@gmail.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 8827e43.

cursor · 2026-05-26T17:15:19Z

    } else {
      dead_actors.push_back(actor_id);
-      auto actor = std::make_shared<GcsActor>(
-          actor_table_data, actor_state_counter_, ray_event_recorder_, session_name_);


State counter leak for non-DEAD actors during cache eviction

Low Severity

During Initialize, OnInitializeActorShouldLoad can return false for actors in non-DEAD states (e.g. ALIVE actors whose job or owner died). The new code increments actor_state_counter_ for these actors' actual state, but when they are later evicted from destroyed_actor_observability_data_, no decrement occurs. Previously, the GcsActor destructor would decrement the counter for non-DEAD states upon eviction. This causes a permanent inflation of non-DEAD state counters in metrics after GCS restart with orphaned actors.

Additional Locations (1)

src/ray/gcs/actor/gcs_actor_manager.cc#L1939-L1946
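To make the reported imbalance concrete, here is a small sketch under stated assumptions: `StateCounter`, `CountedActor`, and `PlainRecord` are hypothetical types, not Ray's. It contrasts an object whose constructor and destructor keep a gauge balanced with a plain record that is incremented manually at load time and then simply dropped on eviction.

```cpp
#include <cassert>
#include <map>
#include <memory>

enum class State { ALIVE, DEAD };

// Hypothetical gauge keyed by state; not Ray's actual counter class.
struct StateCounter {
  std::map<State, int> counts;
  void Increment(State s) { ++counts[s]; }
  void Decrement(State s) {
    assert(counts[s] > 0);
    --counts[s];
  }
  int Get(State s) const {
    auto it = counts.find(s);
    return it == counts.end() ? 0 : it->second;
  }
};

// Old shape: the object owns its bookkeeping, so destroying it
// rebalances the gauge for non-DEAD states.
class CountedActor {
 public:
  CountedActor(State s, StateCounter *c) : state_(s), counter_(c) {
    counter_->Increment(state_);
  }
  ~CountedActor() {
    // DEAD is treated as cumulative here, so it is never decremented.
    if (state_ != State::DEAD) counter_->Decrement(state_);
  }

 private:
  State state_;
  StateCounter *counter_;
};

// New shape without the fix: a plain record has no destructor side effects.
struct PlainRecord {
  State state;
};

// Returns the ALIVE gauge value left behind after one load and one eviction.
int AliveCountAfterEviction(bool use_raii) {
  StateCounter counter;
  if (use_raii) {
    auto actor = std::make_shared<CountedActor>(State::ALIVE, &counter);
    actor.reset();  // eviction runs ~CountedActor
  } else {
    PlainRecord record{State::ALIVE};
    counter.Increment(record.state);  // manual increment at load time
    // Eviction: the record just goes away and nothing decrements.
  }
  return counter.Get(State::ALIVE);
}
```

In this model the RAII path leaves the ALIVE gauge at 0, while the plain-record path leaves it at 1, which is the permanent inflation described in the comment.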


This is valid. Fixed in #63647

This is a follow-up to #63551. Cursor Bugbot found a bug in #63551: when we evict an entry from the destroyed actor cache and the actor's state is not DEAD, the actor state counter is not decremented. This is the link to the comment: #63551 (comment)

A non-DEAD actor can be in the destroyed actor cache in the following way:

1. When initializing GCS after failure, we read persisted data for actors.
2. For each actor, we either put it in registered actors or destroyed actors based on this check: `OnInitializeActorShouldLoad`. We put an actor in the destroyed actor cache if:
   - It is dead and not restartable.
   - The job it belongs to is dead or the root detached actor it belongs to is dead.

So when the job is dead but the actor isn't marked DEAD yet, it might go to the destroyed cache even when it's not DEAD.

Before #63551, we constructed `GcsActor` in the Initialize path, which incremented the actor state counter. Eviction from the destroyed actor cache then triggered `~GcsActor`, which decremented the actor state counter if the actor state was not DEAD. Now that we don't use GcsActor, we do it manually.

Signed-off-by: Kartica Modi <karticamodi@gmail.com>
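The manual bookkeeping described above can be sketched as a bounded cache that decrements the gauge itself when it evicts a non-DEAD entry. Everything here is hypothetical (`DestroyedCache`, the `int` IDs, the FIFO eviction order, and the assumption that each ID is inserted once); it is meant to show the shape of the fix, not the code merged in #63647.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <unordered_map>

enum class State { ALIVE, DEAD };

// Hypothetical bounded cache of destroyed-actor records.
// Assumes each ID is inserted at most once.
class DestroyedCache {
 public:
  explicit DestroyedCache(std::size_t cap) : cap_(cap) { assert(cap_ > 0); }

  // Counts the actor's current state in the gauge and caches the record,
  // evicting the oldest entry once the cap is exceeded.
  void Insert(int id, State s) {
    ++gauge_[s];
    data_[id] = s;
    order_.push_back(id);
    if (order_.size() > cap_) EvictOldest();
  }

  int Gauge(State s) const {
    auto it = gauge_.find(s);
    return it == gauge_.end() ? 0 : it->second;
  }

  std::size_t Size() const { return data_.size(); }

 private:
  void EvictOldest() {
    int id = order_.front();
    order_.pop_front();
    auto it = data_.find(id);
    assert(it != data_.end());
    // Manual stand-in for what the destructor used to do: non-DEAD states
    // are decremented, DEAD stays as a cumulative count.
    if (it->second != State::DEAD) --gauge_[it->second];
    data_.erase(it);
  }

  std::size_t cap_;
  std::list<int> order_;                // insertion order, oldest first
  std::unordered_map<int, State> data_;
  std::map<State, int> gauge_;
};
```

With a cap of 2, inserting one ALIVE record followed by two DEAD records evicts the ALIVE one and brings its gauge back to 0, while the DEAD count stays at 2.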



karticam requested a review from a team as a code owner May 20, 2026 20:06

karticam added the core (Issues that should be addressed in Ray Core) and go (add ONLY when ready to merge, run all tests) labels May 20, 2026

gemini-code-assist Bot reviewed May 20, 2026

Yicheng-Lu-llll self-assigned this May 20, 2026

edoakes reviewed May 21, 2026

Yicheng-Lu-llll reviewed May 22, 2026

Comment thread src/ray/gcs/actor/tests/gcs_actor_manager_test.cc

Comment thread src/ray/gcs/actor/tests/gcs_actor_manager_test.cc

edoakes approved these changes May 24, 2026

edoakes enabled auto-merge (squash) May 24, 2026 00:19

github-actions Bot disabled auto-merge May 24, 2026 00:20

cursor Bot reviewed May 24, 2026

Comment thread src/ray/gcs/actor/gcs_actor_manager.cc Outdated

karticam added 2 commits May 25, 2026 23:23

Removing destroyed_actors_ cache

0a4fc11

Signed-off-by: Kartica Modi <karticamodi@gmail.com>

Resolving comments

6232cc4

Signed-off-by: Kartica Modi <karticamodi@gmail.com>

karticam force-pushed the karticam/remove-destroyed-actor-cache branch from 41dfa0d to 6232cc4 Compare May 26, 2026 06:23

Fixing state counter bug

8827e43

Signed-off-by: Kartica Modi <karticamodi@gmail.com>

cursor Bot reviewed May 26, 2026

edoakes merged commit 9ff20ef into ray-project:master May 26, 2026
6 checks passed

karticam mentioned this pull request May 26, 2026

[core] Fixing actor state counter bug #63647

Merged

karticam changed the title ~~Removing destroyed_actors_ cache~~ [core] Removing destroyed_actors_ cache May 27, 2026
