Problem
When SSD offload is enabled, every eviction cycle over-evicts memory objects.
The computed eviction target far exceeds what is needed: a single cycle evicts 70–90% of memory objects and drops memory usage from ~97% to ~10%, well beyond the intended eviction ratio of 5%. Because data is continuously offloaded to SSD, the proportion of disk-only objects in metadata keeps growing, so every eviction cycle is susceptible to this issue.
Root Cause
BatchEvict uses object_count = sum(shard->metadata.size()) as the denominator for computing the eviction target:
ideal_evict_num = ceil(object_count * evict_ratio_target)
However, metadata.size() includes all objects, many of which are disk-only objects that have already been offloaded to SSD and retain only a LOCAL_DISK replica. These objects:
- Have no memory replica and can never be evicted
- Are nonetheless counted in the denominator, severely inflating the eviction target
With SSD offload enabled, as data is continuously written and offloaded, the fraction of disk-only objects only grows, and the inflation factor of the eviction target increases accordingly.
Measured data:
object_count = 85,622,691 (all metadata)
evictable_count = 3,776,944 (objects with evictable memory replicas)
non_evictable_count = 81,845,747 (disk-only, never evictable)
ideal_evict_num_inflated = 4,310,991 (using object_count)
ideal_evict_num_correct = 190,165 (using evictable_count)
↑ 22.7x difference
actual_evict_ratio = 0.92 (evicted 92% of memory objects)
target_evict_ratio = 0.05 (target was only 5%)
The inflated target of 4.3M exceeds the evictable pool of about 3.8M, so the cycle tries to evict every memory object (92% were actually evicted in this measurement).
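To make the inflation concrete, the following minimal C++ sketch recomputes both targets from the measured counts. It assumes a nominal evict_ratio_target of exactly 0.05, so its results (4,281,135 and 188,848) are close to, but not identical to, the measured targets above. EvictTarget and TargetExceedsPool are illustrative names, not functions from the codebase.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Counts taken from the measurement above.
constexpr std::int64_t kObjectCount = 85622691;    // all metadata entries
constexpr std::int64_t kEvictableCount = 3776944;  // objects with evictable memory replicas
constexpr double kNominalRatio = 0.05;             // assumption: nominal evict_ratio_target

// Eviction target for a given denominator: ceil(count * ratio).
inline std::int64_t EvictTarget(std::int64_t count, double ratio) {
    assert(count >= 0 && ratio >= 0.0);
    return static_cast<std::int64_t>(std::ceil(static_cast<double>(count) * ratio));
}

// True when the target is larger than the pool it is supposed to be a fraction of.
inline bool TargetExceedsPool(std::int64_t target, std::int64_t pool) {
    return target > pool;
}
```

With kObjectCount as the denominator the target is larger than the whole evictable pool; with kEvictableCount it stays at about 5% of that pool.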
Design
Add a mem_object_count field to each MetadataShard, tracking the number of objects in that shard that have at least one completed memory replica. Use this in place of metadata.size() as the denominator for eviction target calculation.
Field Definition
struct MetadataShard {
    mutable SharedMutex mutex;
    std::unordered_map<std::string, ObjectMetadata> metadata GUARDED_BY(mutex);
    // ...other fields...
    // Count of objects that have at least one completed memory replica.
    // Used as the denominator for eviction quota, avoiding inflation by disk-only objects.
    long mem_object_count GUARDED_BY(mutex) = 0;
};
Increment: OnMemReplicaCompleted
Called after a memory replica is marked complete. Because mark_complete() runs before this method, a check for "already had a completed mem replica" cannot be used: the just-completed replica would make it always true. Instead, count how many completed memory replicas the object has now. If the count is exactly 1, this replica moved the object from 0 to 1 completed memory replicas, so increment the counter.
void OnMemReplicaCompleted(const ObjectMetadata& metadata) {
    size_t completed_mem_count = metadata.CountReplicas(
        [](const Replica& r) {
            return r.is_memory_replica() && r.is_completed();
        });
    if (completed_mem_count == 1) shard_.mem_object_count++;
}
Call sites (4):
- PutEnd: after memory replica mark_complete
- CopyEnd: after target memory replica mark_complete
- MoveEnd: after target memory replica mark_complete
- PromotionCommit: after promoted memory replica mark_complete
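The 0-to-1 transition rule can be exercised in isolation with a minimal sketch. The Replica, ObjectMetadata, and Shard types below are simplified, hypothetical stand-ins for the real ones, and CompleteReplica is a helper invented for the sketch that plays the role of a call site such as PutEnd.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-ins for the real types.
struct Replica {
    bool in_memory = false;
    bool completed = false;
    bool is_memory_replica() const { return in_memory; }
    bool is_completed() const { return completed; }
};

struct ObjectMetadata {
    std::vector<Replica> replicas;
    // Number of replicas that are both in memory and completed.
    long CountCompletedMemReplicas() const {
        long n = 0;
        for (const Replica& r : replicas) {
            if (r.is_memory_replica() && r.is_completed()) ++n;
        }
        return n;
    }
};

struct Shard {
    long mem_object_count = 0;
};

// Runs after the replica is already marked complete, so a count of exactly 1
// means this object just gained its first completed memory replica.
inline void OnMemReplicaCompleted(Shard& shard, const ObjectMetadata& metadata) {
    if (metadata.CountCompletedMemReplicas() == 1) ++shard.mem_object_count;
}

// Sketch-only call site: mark replica `idx` complete, then notify the shard.
inline void CompleteReplica(Shard& shard, ObjectMetadata& metadata, std::size_t idx) {
    assert(idx < metadata.replicas.size());
    metadata.replicas[idx].completed = true;
    OnMemReplicaCompleted(shard, metadata);
}
```

Completing a second memory replica of the same object leaves the counter unchanged, because the count is then 2 rather than 1.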
Decrement: OnMemReplicasEvicted
Called after a memory replica is evicted. Checks whether the object still has a completed memory replica after eviction. If not, all memory replicas of this object have been evicted, so decrement the counter.
void OnMemReplicasEvicted(const ObjectMetadata& metadata) {
    bool still_has_completed_mem =
        metadata.HasReplica([](const Replica& r) {
            return r.is_memory_replica() && r.is_completed();
        });
    if (!still_has_completed_mem) shard_.mem_object_count--;
}
Call site (1):
- try_evict_or_offload inside BatchEvict: after evicting a memory replica
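The decrement side can be sketched the same way. As before, the types are simplified, hypothetical stand-ins; EvictOneMemReplica is a helper invented for the sketch in place of try_evict_or_offload, and the assert guarding against underflow is an addition of the sketch, not something the change itself states.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-ins for the real types.
struct Replica {
    bool in_memory = false;
    bool completed = false;
    bool is_memory_replica() const { return in_memory; }
    bool is_completed() const { return completed; }
};

inline bool IsCompletedMem(const Replica& r) {
    return r.is_memory_replica() && r.is_completed();
}

struct ObjectMetadata {
    std::vector<Replica> replicas;
    bool HasCompletedMemReplica() const {
        return std::any_of(replicas.begin(), replicas.end(), IsCompletedMem);
    }
};

struct Shard {
    long mem_object_count = 0;
};

// Runs after a memory replica has been removed: decrement only when the
// object has no completed memory replica left.
inline void OnMemReplicasEvicted(Shard& shard, const ObjectMetadata& metadata) {
    if (!metadata.HasCompletedMemReplica()) {
        assert(shard.mem_object_count > 0);  // sketch-only underflow guard
        --shard.mem_object_count;
    }
}

// Sketch-only eviction step: drop one completed memory replica, if any,
// then notify the shard. Returns false when nothing was evictable.
inline bool EvictOneMemReplica(Shard& shard, ObjectMetadata& metadata) {
    auto it = std::find_if(metadata.replicas.begin(), metadata.replicas.end(),
                           IsCompletedMem);
    if (it == metadata.replicas.end()) return false;
    metadata.replicas.erase(it);
    OnMemReplicasEvicted(shard, metadata);
    return true;
}
```

An object with two completed memory replicas keeps its contribution to the counter after the first eviction and loses it only after the second.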
BatchEvict Quota Calculation Changes
// Before: using metadata.size() (includes disk-only objects)
ideal_evict_num = ceil(object_count * evict_ratio_target);
// After: using mem_object_count (memory objects only)
ideal_evict_num = ceil(shard.mem_object_count * evict_ratio_target);
The second-pass lowerbound also switches to total_mem_object_count:
// Before
target_evict_num = ceil(evictable_count * evict_ratio_lowerbound) - evicted_count - released_discarded_cnt;
// After
target_evict_num = ceil(total_mem_object_count * evict_ratio_lowerbound) - evicted_count - released_discarded_cnt;
The empty-pool check is updated accordingly:
// Before
if (object_count == 0) { need_mem_eviction_ = false; }
// After
if (total_mem_object_count == 0) { need_mem_eviction_ = false; }
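The three quota changes can be read together as a small set of pure functions. This is a sketch under assumptions: ShardCounts is a hypothetical stand-in for MetadataShard, total_mem_object_count is assumed to be the sum of the per-shard counters, and clamping the second-pass target at zero is a choice made for the sketch rather than behavior stated above.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical stand-in carrying only the counter that matters here.
struct ShardCounts {
    long mem_object_count = 0;
};

// Assumption: the total is the sum of the per-shard counters.
inline long TotalMemObjectCount(const std::vector<ShardCounts>& shards) {
    long total = 0;
    for (const ShardCounts& s : shards) total += s.mem_object_count;
    return total;
}

// First pass: ceil(mem_object_count * evict_ratio_target).
inline long IdealEvictNum(long mem_object_count, double evict_ratio_target) {
    return static_cast<long>(std::ceil(mem_object_count * evict_ratio_target));
}

// Second pass: ceil(total * lowerbound) - evicted - released, clamped at 0
// (the clamp is a sketch-only choice).
inline long SecondPassTarget(long total_mem_object_count,
                             double evict_ratio_lowerbound,
                             long evicted_count,
                             long released_discarded_cnt) {
    long target = static_cast<long>(
                      std::ceil(total_mem_object_count * evict_ratio_lowerbound)) -
                  evicted_count - released_discarded_cnt;
    return std::max(target, 0L);
}

// Empty-pool check: with no memory objects there is nothing to evict.
inline void UpdateNeedMemEviction(long total_mem_object_count, bool& need_mem_eviction) {
    if (total_mem_object_count == 0) need_mem_eviction = false;
}
```

Because disk-only objects never contribute to mem_object_count, none of these values grows as more data is offloaded to SSD.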
Snapshot Restore
mem_object_count is not serialized in snapshots. On restore:
- Reset() clears mem_object_count to 0 along with metadata.clear().
- DeserializeShard() increments mem_object_count for each restored object that has at least one completed memory replica.
This ensures the counter is correctly reconstructed from the restored metadata without requiring serialization format changes.
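A minimal sketch of the restore path, assuming simplified stand-in types: ResetShard and RestoreObject are hypothetical helpers that play the roles of Reset() and the per-object step of DeserializeShard(); the real serializer interfaces are not shown here.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for the real types.
struct Replica {
    bool in_memory = false;
    bool completed = false;
};

struct ObjectMetadata {
    std::vector<Replica> replicas;
};

struct Shard {
    std::unordered_map<std::string, ObjectMetadata> metadata;
    long mem_object_count = 0;
};

// Plays the role of Reset(): clear the metadata and the counter together.
inline void ResetShard(Shard& shard) {
    shard.metadata.clear();
    shard.mem_object_count = 0;
}

// Plays the role of the per-object step in DeserializeShard(): insert the
// restored object and count it if it has a completed memory replica.
inline void RestoreObject(Shard& shard, const std::string& key, ObjectMetadata md) {
    const bool has_completed_mem =
        std::any_of(md.replicas.begin(), md.replicas.end(),
                    [](const Replica& r) { return r.in_memory && r.completed; });
    const bool inserted = shard.metadata.emplace(key, std::move(md)).second;
    if (inserted && has_completed_mem) ++shard.mem_object_count;
}
```

Disk-only objects and objects whose memory replica is not yet complete are restored into the map but do not contribute to the counter.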
Untracked Paths (Acceptable Minor Drift)
The following infrequent paths do not update the counter:
| Path | Reason |
| --- | --- |
| BatchReplicaClear (replica removal during segment unmount) | Segment unmount is infrequent; clients re-register after unmount |
| PutRevoke (write revocation) | Only triggered on write failure, extremely rare |
| CleanupStaleHandles (stale handle cleanup) | Infrequent background cleanup |
The counter may be slightly over-estimated after these paths, which makes the eviction target marginally higher than necessary. This is preferable to the previous ~22x inflation: slight over-eviction frees only a small amount of extra memory and does not cause write failures, whereas under-estimation would lead to insufficient eviction and persistent write failures.
Impact
- mooncake-store/include/master_service.h: add mem_object_count to MetadataShard; add OnMemReplicaCompleted / OnMemReplicasEvicted to MetadataShardAccessorRW
- mooncake-store/src/master_service.cpp:
  - PutEnd, CopyEnd, MoveEnd, PromotionCommit: call OnMemReplicaCompleted
  - BatchEvict: use mem_object_count for quota calculation
  - try_evict_or_offload: call OnMemReplicasEvicted after eviction
  - MetadataSerializer::Reset: clear mem_object_count
  - MetadataSerializer::DeserializeShard: recompute mem_object_count from restored metadata