[Bug]: BatchEvict Over-Eviction Due to Inflated Target When SSD Offload Is Enabled #2243

@Colors-111

Problem

When SSD offload is enabled, every eviction cycle over-evicts memory objects.

The eviction target far exceeds what is needed, evicting 70–90% of memory objects in a single cycle and dropping memory usage from ~97% to ~10%, well beyond the intended eviction ratio (5%). As data is continuously offloaded to SSD, the proportion of disk-only objects in metadata keeps growing, making every eviction susceptible to this issue.

Root Cause

BatchEvict uses object_count = sum(shard->metadata.size()) as the denominator for computing the eviction target:

ideal_evict_num = ceil(object_count * evict_ratio_target)

However, metadata.size() includes all objects, many of which are disk-only objects that have already been offloaded to SSD and retain only a LOCAL_DISK replica. These objects:

  • Have no memory replica and can never be evicted
  • Are nonetheless counted in the denominator, severely inflating the eviction target

With SSD offload enabled, as data is continuously written and offloaded, the fraction of disk-only objects only grows, and the inflation factor of the eviction target increases accordingly.

Measured data:

object_count         = 85,622,691  (all metadata)
evictable_count      =  3,776,944  (objects with evictable memory replicas)
non_evictable_count  = 81,845,747  (disk-only, never evictable)

ideal_evict_num_inflated = 4,310,991  (using object_count)
ideal_evict_num_correct  =   190,165  (using evictable_count)
                                        ↑ 22.7x difference

actual_evict_ratio = 0.92  (evicted 92% of memory objects)
target_evict_ratio = 0.05  (target was only 5%)

A target of 4.3M exceeds the evictable pool of ~3.8M, so the cycle evicts nearly every memory object (92% in this measurement).
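To make the inflation concrete, here is a small sketch that plugs the measured counts into the target formula. It assumes the nominal 5% ratio, so the absolute targets come out slightly below the logged values (which imply a marginally higher effective ratio). EvictTarget and InflationFactor are hypothetical helper names, not functions in the codebase.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Eviction target for a given denominator, mirroring
// ideal_evict_num = ceil(count * evict_ratio_target).
inline std::int64_t EvictTarget(std::int64_t count, double ratio) {
    assert(count >= 0 && ratio >= 0.0);
    return static_cast<std::int64_t>(std::ceil(count * ratio));
}

// Counts copied from the measured data in this issue.
constexpr std::int64_t kObjectCount = 85'622'691;    // all metadata entries
constexpr std::int64_t kEvictableCount = 3'776'944;  // objects with evictable memory replicas

// How much the denominator (and therefore the target) is inflated.
inline double InflationFactor() {
    return static_cast<double>(kObjectCount) / static_cast<double>(kEvictableCount);
}
```

With these inputs the inflated target (about 4.28M) is larger than the whole evictable pool, while the target based on evictable_count is about 189K. The ratio between the two denominators is roughly 22.7x regardless of the eviction ratio used.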

Design

Add a mem_object_count field to each MetadataShard, tracking the number of objects in that shard that have at least one completed memory replica. Use this in place of metadata.size() as the denominator for eviction target calculation.

Field Definition

struct MetadataShard {
    mutable SharedMutex mutex;
    std::unordered_map<std::string, ObjectMetadata> metadata GUARDED_BY(mutex);
    // ...other fields...
    // Count of objects that have at least one completed memory replica.
    // Used as the denominator for eviction quota, avoiding inflation by disk-only objects.
    long mem_object_count GUARDED_BY(mutex) = 0;
};

Increment: OnMemReplicaCompleted

Called after a memory replica is marked complete. Because mark_complete() runs before this method, checking whether the object "already had a completed mem replica" would always be true: the just-completed replica satisfies it. Instead, we count how many completed memory replicas the object has now. If the count is exactly 1, this replica moved the object from 0 to 1, so increment the counter.

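void OnMemReplicaCompleted(const ObjectMetadata& metadata) {
    // Count completed memory replicas, including the one just marked complete.
    size_t completed_mem_count = metadata.CountReplicas(
        [](const Replica& r) {
            return r.is_memory_replica() && r.is_completed();
        });
    // Exactly 1 means the object just went from 0 to 1 completed memory replicas.
    if (completed_mem_count == 1) shard_.mem_object_count++;
}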

Call sites (4):

  • PutEnd: after memory replica mark_complete
  • CopyEnd: after target memory replica mark_complete
  • MoveEnd: after target memory replica mark_complete
  • PromotionCommit: after promoted memory replica mark_complete
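The 0-to-1 transition check can be exercised in isolation. The following is a minimal sketch with hypothetical stand-in types (Replica, ObjectMetadata, Shard, and CompleteReplica here are simplified illustrations, not the real classes), assuming the hook is called once after each individual replica is marked complete.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-ins for the real types.
struct Replica {
    bool memory;
    bool completed;
};

struct ObjectMetadata {
    std::vector<Replica> replicas;

    long CompletedMemReplicas() const {
        return static_cast<long>(std::count_if(
            replicas.begin(), replicas.end(),
            [](const Replica& r) { return r.memory && r.completed; }));
    }
};

struct Shard {
    long mem_object_count = 0;

    // Increment only when the object has exactly one completed memory replica,
    // i.e. it just crossed from 0 to 1.
    void OnMemReplicaCompleted(const ObjectMetadata& obj) {
        if (obj.CompletedMemReplicas() == 1) ++mem_object_count;
    }
};

// Marks replica `idx` complete, then notifies the shard, in that order.
inline void CompleteReplica(Shard& shard, ObjectMetadata& obj, std::size_t idx) {
    assert(idx < obj.replicas.size());
    obj.replicas[idx].completed = true;
    shard.OnMemReplicaCompleted(obj);
}
```

Completing a second memory replica of the same object leaves the counter unchanged, which is the intended behavior. Note that the check relies on the per-replica call order: if several memory replicas of one object were marked complete before a single call to the hook, the count would already be greater than 1 and the increment would be skipped.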

Decrement: OnMemReplicasEvicted

Called after a memory replica is evicted. Checks whether the object still has a completed memory replica after eviction. If not, all memory replicas of this object have been evicted, so decrement the counter.

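void OnMemReplicasEvicted(const ObjectMetadata& metadata) {
    // Evaluated after the eviction, so the evicted replica is no longer counted.
    bool still_has_completed_mem =
        metadata.HasReplica([](const Replica& r) {
            return r.is_memory_replica() && r.is_completed();
        });
    // No completed memory replica left: the object is now disk-only (or empty).
    if (!still_has_completed_mem) shard_.mem_object_count--;
}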

Call site (1):

  • try_evict_or_offload inside BatchEvict: after evicting a memory replica
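The decrement side can be sketched the same way. The types and the EvictOneMemReplica helper below are hypothetical simplifications, and the assert is a sketch-only sanity check that is not part of the proposal.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-ins for the real types.
struct Replica {
    bool memory;
    bool completed;
};

struct ObjectMetadata {
    std::vector<Replica> replicas;

    bool HasCompletedMemReplica() const {
        return std::any_of(
            replicas.begin(), replicas.end(),
            [](const Replica& r) { return r.memory && r.completed; });
    }
};

struct Shard {
    long mem_object_count = 0;

    // Decrement only when the last completed memory replica is gone.
    void OnMemReplicasEvicted(const ObjectMetadata& obj) {
        if (!obj.HasCompletedMemReplica()) {
            assert(mem_object_count > 0);  // sketch-only sanity check
            --mem_object_count;
        }
    }
};

// Removes one completed memory replica (if any), then notifies the shard.
// Returns false when the object has nothing evictable.
inline bool EvictOneMemReplica(Shard& shard, ObjectMetadata& obj) {
    auto it = std::find_if(
        obj.replicas.begin(), obj.replicas.end(),
        [](const Replica& r) { return r.memory && r.completed; });
    if (it == obj.replicas.end()) return false;
    obj.replicas.erase(it);
    shard.OnMemReplicasEvicted(obj);
    return true;
}
```

For an object with two memory replicas and one disk replica, the counter drops only on the second eviction, after which the object is disk-only and no longer contributes to the denominator.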

BatchEvict Quota Calculation Changes

// Before: using metadata.size() (includes disk-only objects)
ideal_evict_num = ceil(object_count * evict_ratio_target);

// After: using mem_object_count (memory objects only)
ideal_evict_num = ceil(shard.mem_object_count * evict_ratio_target);

The second-pass lowerbound also switches to total_mem_object_count, the sum of mem_object_count across shards:

// Before
target_evict_num = ceil(evictable_count * evict_ratio_lowerbound) - evicted_count - released_discarded_cnt;

// After
target_evict_num = ceil(total_mem_object_count * evict_ratio_lowerbound) - evicted_count - released_discarded_cnt;

The empty-pool check is updated accordingly:

// Before
if (object_count == 0) { need_mem_eviction_ = false; }

// After
if (total_mem_object_count == 0) { need_mem_eviction_ = false; }
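Taken together, the three changes amount to deriving every quota from the summed memory-object counter. The sketch below shows this under simplifying assumptions: ComputeQuota and EvictQuota are hypothetical names, the per-shard counters are passed in as a plain vector, and locking and the actual eviction passes are omitted.

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Hypothetical result bundle for illustration only.
struct EvictQuota {
    long total_mem_object_count;  // sum of per-shard mem_object_count
    long ideal_evict_num;         // first-pass target
    long target_evict_num;        // second-pass (lowerbound) target
    bool pool_empty;              // true when there is nothing in memory to evict
};

inline EvictQuota ComputeQuota(const std::vector<long>& shard_mem_object_counts,
                               double evict_ratio_target,
                               double evict_ratio_lowerbound,
                               long evicted_count,
                               long released_discarded_cnt) {
    EvictQuota q{};
    q.total_mem_object_count = std::accumulate(
        shard_mem_object_counts.begin(), shard_mem_object_counts.end(), 0L);
    // First pass: target ratio applied to memory objects only.
    q.ideal_evict_num = static_cast<long>(
        std::ceil(q.total_mem_object_count * evict_ratio_target));
    // Second pass: lowerbound ratio minus what has already been freed.
    q.target_evict_num =
        static_cast<long>(
            std::ceil(q.total_mem_object_count * evict_ratio_lowerbound)) -
        evicted_count - released_discarded_cnt;
    // Empty-pool check from the proposal.
    q.pool_empty = (q.total_mem_object_count == 0);
    return q;
}
```

Disk-only objects never enter shard_mem_object_counts, so growth in offloaded data no longer moves any of the three values.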

Snapshot Restore

mem_object_count is not serialized in snapshots. On restore:

  1. Reset() clears mem_object_count to 0 along with metadata.clear().
  2. DeserializeShard() increments mem_object_count for each restored object that has at least one completed memory replica.

This ensures the counter is correctly reconstructed from the restored metadata without requiring serialization format changes.
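A minimal sketch of the restore path, assuming simplified stand-in types: Reset and RestoreObject below are hypothetical methods on a toy Shard, whereas the proposal places this logic in MetadataSerializer::Reset and MetadataSerializer::DeserializeShard.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for the real types.
struct Replica {
    bool memory;
    bool completed;
};

struct ObjectMetadata {
    std::vector<Replica> replicas;
};

struct Shard {
    std::unordered_map<std::string, ObjectMetadata> metadata;
    long mem_object_count = 0;

    // Clear the derived counter together with the metadata it describes.
    void Reset() {
        metadata.clear();
        mem_object_count = 0;
    }

    // Insert one deserialized object and rebuild the counter from its replicas.
    void RestoreObject(std::string key, ObjectMetadata obj) {
        const bool has_completed_mem = std::any_of(
            obj.replicas.begin(), obj.replicas.end(),
            [](const Replica& r) { return r.memory && r.completed; });
        const auto result = metadata.emplace(std::move(key), std::move(obj));
        // Count each object at most once, and only if it was actually inserted.
        if (result.second && has_completed_mem) ++mem_object_count;
    }
};
```

Because the counter is derived purely from replica state, nothing extra needs to be written into the snapshot.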

Untracked Paths (Acceptable Minor Drift)

The following infrequent paths do not update the counter:

| Path | Reason |
| --- | --- |
| BatchReplicaClear (replica removal during segment unmount) | Segment unmount is infrequent; clients re-register after unmount |
| PutRevoke (write revocation) | Only triggered on write failure, extremely rare |
| CleanupStaleHandles (stale handle cleanup) | Infrequent background cleanup |

The counter may be slightly over-estimated after these paths, causing the eviction target to be marginally higher than necessary. This is preferable to the previous ~22.7x inflation: slight over-eviction only frees a small amount of extra memory and does not cause write failures, whereas under-estimation leads to insufficient eviction and persistent write failures.

Impact

  • mooncake-store/include/master_service.h: Add mem_object_count to MetadataShard, add OnMemReplicaCompleted / OnMemReplicasEvicted to MetadataShardAccessorRW
  • mooncake-store/src/master_service.cpp:
    • PutEnd, CopyEnd, MoveEnd, PromotionCommit: call OnMemReplicaCompleted
    • BatchEvict: use mem_object_count for quota calculation
    • try_evict_or_offload: call OnMemReplicasEvicted after eviction
    • MetadataSerializer::Reset: clear mem_object_count
    • MetadataSerializer::DeserializeShard: recompute mem_object_count from restored metadata
