[core] Removing destroyed_actors_ cache#63551
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the GcsActorManager to use a lightweight ActorObservabilityData structure for caching destroyed actors, replacing the previous approach of storing full GcsActor instances. This change significantly reduces memory usage by allowing heavy resources like task and lease specifications to be freed upon actor destruction. The feedback suggests further simplifying the HandleGetActorInfo method by leveraging the new GetActorTableData helper function to improve code conciseness and reuse.
| const auto ®istered_actor_iter = registered_actors_.find(actor_id); | ||
| GcsActor *ptr = nullptr; | ||
| if (registered_actor_iter != registered_actors_.end()) { | ||
| ptr = registered_actor_iter->second.get(); | ||
| *reply->mutable_actor_table_data() = | ||
| registered_actor_iter->second->GetActorTableData(); | ||
| } else { | ||
| const auto &destroyed_actor_iter = destroyed_actors_.find(actor_id); | ||
| if (destroyed_actor_iter != destroyed_actors_.end()) { | ||
| ptr = destroyed_actor_iter->second.get(); | ||
| const auto &observability_iter = destroyed_actor_observability_data_.find(actor_id); | ||
| if (observability_iter != destroyed_actor_observability_data_.end()) { | ||
| *reply->mutable_actor_table_data() = observability_iter->second.actor_table_data; | ||
| } | ||
| } |
There was a problem hiding this comment.
This logic for finding actor data can be simplified by using the newly introduced GetActorTableData helper function. This would improve code reuse and make the implementation more concise.
const auto *actor_data = GetActorTableData(actor_id);
if (actor_data) {
*reply->mutable_actor_table_data() = *actor_data;
}| /// Lightweight observability snapshots of destroyed actors. Stores only | ||
| /// `ActorTableData` (not the full `GcsActor` with `task_spec_`/`lease_spec_`) | ||
| /// so that the heavy heap state is freed when `registered_actors_` releases | ||
| /// its `shared_ptr`. | ||
| absl::flat_hash_map<ActorID, ActorObservabilityData> |
There was a problem hiding this comment.
no need for the indirection with ActorObservabilityData struct, better to just make it <ActorID, ActorTableData>
the name of the map and list already convey that they're for observability only
| actor_state_counter_->Increment( | ||
| {rpc::ActorTableData::DEAD, actor_table_data.class_name()}); |
There was a problem hiding this comment.
was this previously done inside of the GcsActor constructor?
| actor_table_data, actor_state_counter_, ray_event_recorder_, session_name_); | ||
| destroyed_actors_.emplace(actor_id, actor); | ||
| sorted_destroyed_actor_list_.emplace_back( | ||
| // Rehydrate dead actors directly into the lightweight observability cache |
There was a problem hiding this comment.
"Rehydrate" is a bit confusing here -- isn't it the first time we're inserting into the observability cache? I don't really understand the point about "instead of constructing a full GcsActor we'd immediately discard". That implementation wouldn't really make sense here
There was a problem hiding this comment.
Used "rehydrate" earlier since this flow also triggered when we restart GCS after deaths (in case we have persisted data to redis). and yes, agree to the second point.
Rephrased the comment here.
| // Capture a weak_ptr to an actor that will end up in the observability cache | ||
| // (actors 10..19 survive eviction; pick 15). If destroyed_actor_observability_data_ | ||
| // accidentally pins the heavy GcsActor, this weak_ptr would stay alive. | ||
| std::weak_ptr<gcs::GcsActor> weak_cached_actor; | ||
| ActorID cached_actor_id; |
There was a problem hiding this comment.
this mechanism for testing feels a little too "clever". we don't really need to test if the observability data cache pins this, the compiler already tells us that since the observability cache only stores the ActorTableData
There was a problem hiding this comment.
claude came up with the idea lol
but removed it now. just checking the number of dead actors and cache size in the test now.
Signed-off-by: Kartica Modi <karticamodi@gmail.com>
Signed-off-by: Kartica Modi <karticamodi@gmail.com>
41dfa0d to
6232cc4
Compare
Signed-off-by: Kartica Modi <karticamodi@gmail.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 8827e43. Configure here.
| } else { | ||
| dead_actors.push_back(actor_id); | ||
| auto actor = std::make_shared<GcsActor>( | ||
| actor_table_data, actor_state_counter_, ray_event_recorder_, session_name_); |
There was a problem hiding this comment.
State counter leak for non-DEAD actors during cache eviction
Low Severity
During Initialize, OnInitializeActorShouldLoad can return false for actors in non-DEAD states (e.g. ALIVE actors whose job or owner died). The new code increments actor_state_counter_ for these actors' actual state, but when they are later evicted from destroyed_actor_observability_data_, no decrement occurs. Previously, the GcsActor destructor would decrement the counter for non-DEAD states upon eviction. This causes a permanent inflation of non-DEAD state counters in metrics after GCS restart with orphaned actors.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 8827e43. Configure here.
This is followup to this PR: #63551 There was a bug found by cursor bot in #63551 where, when we evict from the destroyed actor cache, if the state of actor is not DEAD, actor state counter will not be decremented. This is the link to the comment: #63551 (comment) A non-DEAD actor can be in the destroyed actor cache in the following way: 1. When initializing GCS after failure, we read persisted data for actors. 2. For each actor, we either put it in registered actors or destroyed actors based on this check: `OnInitializeActorShouldLoad`. We put an actor in destroyed actor cache if: - It is dead and not restartable. - The job it belongs to is dead or the root detached actor it belongs to is dead. So when the job is dead, but the actor isn't marked DEAD yet, it might go to the destroyed cache even when its not DEAD. Before #63551, we used to construct `GcsActor` in Initialize path, which would increment the actor state counter and then eviction from destroyed actor cache would trigger `~GcsActor` which would decrement the actor state counter if the actor state is not DEAD. Now that we don't use GcsActor, we do it manually. --------- Signed-off-by: Kartica Modi <karticamodi@gmail.com>
## Issue `GcsActorManager::destroyed_actors_` caches dead actors as `flat_hash_map<ActorID, shared_ptr<GcsActor>>`. Each entry keeps the full `GcsActor` alive (including `task_spec_` and `lease_spec_`), which leads to increasing memory consumption on the head node. The cap on the number of dead actors that can be accumulated in the cache is 100k. An instance was observed where GCS memory consumption grew from ~1GB to ~5.5GB over a month due to ~16.3k dead actors. ## Fix The cache is used mainly for observability (eg `ray list actors`) and some control plane stuff. However all the consumers of `GcsActor` just end up consuming just the `rpc::ActorTableData`. To release the bulky `GcsActor` when an actor is destroyed, we replace the cache with `destroyed_actor_observability_data_`, holding a lightweight `ActorObservabilityData` struct that wraps only `rpc::ActorTableData`. When `DestroyActor` runs, the heavy `GcsActor` is now actually freed. - Added `GetActorTableData(actor_id)` — best-effort lookup returning`const rpc::ActorTableData *` from either live or destroyed state. - Refactored `AddActorInfo` and the `Gen*Cause` death-cause helpers to take `const rpc::ActorTableData *` instead of `const GcsActor *`. - Migrated all `GetActor()` callsites and deleted `GetActor()`. - `Initialize()` rehydrates dead actors directly into the lightweight cache and bumps the `DEAD` counter to preserve the cumulative-deaths gauge across GCS restarts. - Test: added a `weak_ptr.expired()` assertion proving the `GcsActor` heap object is freed after `DestroyActor`. More details can be found here: https://docs.google.com/document/d/1ocSw8EdU9dNjNbhIUUySU0YOCOgAQp-BTZ4TY1G6A1E/edit?tab=t.0 --------- Signed-off-by: Kartica Modi <karticamodi@gmail.com> Signed-off-by: Neelansh Khare <kharen@uci.edu>
This is followup to this PR: ray-project#63551 There was a bug found by cursor bot in ray-project#63551 where, when we evict from the destroyed actor cache, if the state of actor is not DEAD, actor state counter will not be decremented. This is the link to the comment: ray-project#63551 (comment) A non-DEAD actor can be in the destroyed actor cache in the following way: 1. When initializing GCS after failure, we read persisted data for actors. 2. For each actor, we either put it in registered actors or destroyed actors based on this check: `OnInitializeActorShouldLoad`. We put an actor in destroyed actor cache if: - It is dead and not restartable. - The job it belongs to is dead or the root detached actor it belongs to is dead. So when the job is dead, but the actor isn't marked DEAD yet, it might go to the destroyed cache even when its not DEAD. Before ray-project#63551, we used to construct `GcsActor` in Initialize path, which would increment the actor state counter and then eviction from destroyed actor cache would trigger `~GcsActor` which would decrement the actor state counter if the actor state is not DEAD. Now that we don't use GcsActor, we do it manually. --------- Signed-off-by: Kartica Modi <karticamodi@gmail.com> Signed-off-by: Neelansh Khare <kharen@uci.edu>


Issue
GcsActorManager::destroyed_actors_caches dead actors asflat_hash_map<ActorID, shared_ptr<GcsActor>>. Each entry keeps the fullGcsActoralive (includingtask_spec_andlease_spec_), which leads to increasing memory consumption on the head node. The cap on the number of dead actors that can be accumulated in the cache is 100k.An instance was observed where GCS memory consumption grew from ~1GB to ~5.5GB over a month due to ~16.3k dead actors.
Fix
The cache is used mainly for observability (eg
ray list actors) and some control plane stuff. However all the consumers ofGcsActorjust end up consuming just therpc::ActorTableData.To release the bulky
GcsActorwhen an actor is destroyed, we replace the cache withdestroyed_actor_observability_data_, holding a lightweightActorObservabilityDatastruct that wraps onlyrpc::ActorTableData. WhenDestroyActorruns, the heavyGcsActoris now actually freed.GetActorTableData(actor_id)— best-effort lookup returningconst rpc::ActorTableData *from either live or destroyed state.AddActorInfoand theGen*Causedeath-cause helpers to takeconst rpc::ActorTableData *instead ofconst GcsActor *.GetActor()callsites and deletedGetActor().Initialize()rehydrates dead actors directly into the lightweight cache and bumps theDEADcounter to preserve the cumulative-deaths gauge across GCS restarts.weak_ptr.expired()assertion proving theGcsActorheap object is freed afterDestroyActor.More details can be found here: https://docs.google.com/document/d/1ocSw8EdU9dNjNbhIUUySU0YOCOgAQp-BTZ4TY1G6A1E/edit?tab=t.0