
Conversation

@fimanishi

Signed-off-by: fimanishi [email protected]

What changed?
Added a generic budget manager for cache capacity control with the following features:

  • Two-tier soft cap logic: Capacity is split into free space (shared) and fair share (per-cache) tiers based on a
    configurable threshold (see the sketch after this list)
  • Per-cache usage tracking: Mutex-based tracking of bytes/count usage for each cache
  • Two admission modes: Optimistic (add-then-undo) and Strict (CAS pre-check)
  • Two reclaim patterns: Self-release (cache calls Release per-item) and Manager-release (manager calls Release once with totals)
  • Callback-based wrappers: Reserve/Release/Reclaim methods with automatic cleanup on errors
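
To make the two-tier logic concrete, here is a minimal sketch of the byte-admission check. It is illustrative only: the names (canReserveBytes, softThreshold, activeCaches) and the exact fair-share formula are assumptions for this example, not the PR's actual code.

package budget

import "errors"

var (
	errHardCapExceeded = errors.New("budget: hard cap exceeded")
	errSoftCapExceeded = errors.New("budget: soft cap exceeded")
)

// canReserveBytes applies the two-tier soft cap: below softThreshold the
// whole budget is a shared free-space pool; above it, each active cache
// is held to an equal fair share of the total capacity.
func canReserveBytes(capacity, globalUsed, cacheUsed, n uint64, softThreshold float64, activeCaches uint64) error {
	if globalUsed+n > capacity {
		return errHardCapExceeded // hard cap is absolute
	}
	soft := uint64(float64(capacity) * softThreshold)
	if globalUsed+n <= soft {
		return nil // free-space tier: shared, first come first served
	}
	// fair-share tier: the per-cache limit kicks in past the threshold
	if activeCaches > 0 && cacheUsed+n > capacity/activeCaches {
		return errSoftCapExceeded
	}
	return nil
}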

Why?
To provide a unified, host-scoped budget management system that can be shared across multiple caches (both evicting and non-evicting) with:

  1. Fair resource allocation: Prevents any single cache from monopolizing capacity through the two-tier soft cap system
  2. Flexible integration: Supports different cache eviction patterns (self-release vs manager-release) and admission
    policies (optimistic vs strict), sketched after this list
  3. Safe concurrent access: Mutex-based per-cache tracking combined with atomic global counters ensures thread-safety
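
The difference between the two admission modes is roughly the following sketch (assumed names again; errHardCapExceeded is as in the sketch above, and the atomic counter stands in for the manager's global usage counter):

import "sync/atomic"

// Optimistic (add-then-undo): add first, roll back on overflow.
// Cheap under contention, but can transiently overshoot the cap.
func reserveOptimistic(used *atomic.Uint64, capacity, n uint64) error {
	if used.Add(n) > capacity {
		used.Add(^(n - 1)) // undo the addition (unsigned subtract of n)
		return errHardCapExceeded
	}
	return nil
}

// Strict (CAS pre-check): never admits past the cap, but may retry
// the compare-and-swap under contention.
func reserveStrict(used *atomic.Uint64, capacity, n uint64) error {
	for {
		cur := used.Load()
		if cur+n > capacity {
			return errHardCapExceeded
		}
		if used.CompareAndSwap(cur, cur+n) {
			return nil
		}
	}
}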

How did you test it?
Unit tests. More tests will be done when the manager is used by the replication cache.

Potential risks
No risk. This is just the manager implementation; it is not yet used anywhere.

Release notes
Added a new budget manager for cache capacity control. No migration required - this is a new internal component that can be optionally adopted by existing caches. New metrics available:

  • cadence_budget_manager_capacity_bytes/count
  • cadence_budget_manager_used_bytes/count
  • cadence_budget_manager_soft_threshold
  • cadence_budget_manager_active_cache_count
  • cadence_budget_manager_hard_cap_exceeded
  • cadence_budget_manager_soft_cap_exceeded

Documentation Changes

Signed-off-by: fimanishi <[email protected]>

# Conflicts:
#	common/metrics/defs.go
Signed-off-by: fimanishi <[email protected]>
@davidporter-id-au

As a general point, I'm struggling a little to follow the implementation, I think in part because I've forgotten the cache side a bit. Would either another PR or an example of how to use it be possible?

@davidporter-id-au left a comment

Partway through review but just submitting comments for now which are... possibly not super helpful but hopefully illustrate the general theme.

Apart from examples, which I think would help both a user and the review, I worry mostly about the degree of initial complexity upfront.

A lot of these features seem really quite reasonable and quite useful in a cache context, but as a user it's a little overwhelming, and a few I think may not add a lot of value (the spin-lock limit check, for example). So I'd suggest paring the API back down to what's needed for its initial purpose, with the option to expand it as needed.

// Cache-aware reservation methods for two-tier soft cap enforcement

// ReserveForCache reserves usage for a specific cache, applying two-tier soft cap logic.
ReserveForCache(cacheID string, nBytes uint64, nCount int64) error

If we have the below methods, do we need to put this in the interface? It can be in the implementation, but it'd be good to start with as small a surface area as possible.

You probably know this already, but FYI the functions here don't need to be the complete set in the implementation; the interface only needs to cover the user-facing part, so functions you're using internally don't need to show up in it.
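
For instance, the initial user-facing surface could be as small as this (illustrative, not a prescription):

// Manager is the narrow, user-facing interface; internal helpers such as
// ReserveForCache can live on the concrete implementation only.
type Manager interface {
	ReserveBytesForCache(cacheID string, nBytes uint64) error
	ReleaseForCache(cacheID string, nBytes uint64, nCount int64)
}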

// ReserveForCache reserves usage for a specific cache, applying two-tier soft cap logic.
ReserveForCache(cacheID string, nBytes uint64, nCount int64) error
// ReserveBytesForCache reserves bytes for a specific cache, applying two-tier soft cap logic.
ReserveBytesForCache(cacheID string, n uint64) error

nit: consider a more informative name than n, just to show folks in their IDEs that this is bytes

// Cache-aware release methods

// ReleaseForCache releases usage for a specific cache.
ReleaseForCache(cacheID string, nBytes uint64, nCount int64)

What calls this?

name string,
maxBytes dynamicproperties.IntPropertyFn,
maxCount dynamicproperties.IntPropertyFn,
admission AdmissionMode,

You can always add a new kind of admission later, I'd suggest picking one that works best for now

}

if m.logger != nil {
	m.logger.Warn("Hard capacity limit exceeded",

I would worry this would be extremely noisy; might be worth putting it in at log level debug, potentially with sampling, so that we can still get the data?

}
}

func (m *manager) reserveBytesStrict(n uint64) error {

I... apologize for bringing up spin-locks earlier. Unless I'm mistaken, I'm not sure this degree of complexity is warranted in the manager use-case. I'd suggest just going with reserveBytesOptimistic and lowering the complexity if possible

isActive := cacheUsage.usedBytes > 0 || cacheUsage.usedCount > 0
cacheUsage.mu.Unlock()

if !wasActive && isActive {

I don't think we got to discuss the active vs inactive state the various caches go through. What is the number of active caches used to drive? I'm guessing it's used in the fairness calculations?

"github.com/uber/cadence/common/metrics"
)

// This implements a generic host-scoped budget manager suitable for

nit: Put this comment over / mentioning the Manager interface so that Golang's tooling, IDE support, and Godoc can read it. If you just have a floating comment, it won't be visible unless someone reads the source code

// return c.budgetMgr.ReserveOrReclaimManagerReleaseWithCallback(
// ctx, c.cacheID, itemSizeBytes, 1, true,
// func(needBytes uint64, needCount int64) (uint64, int64) {
// return c.cache.EvictLRU(needBytes, needCount)
@davidporter-id-au Nov 3, 2025

Comment: No action required

Happy to 👍 this to get unblocked / allow experimentation, but imho this is something that would be less complex if it lived as a callback on the cache side.

i.e., (to use the example below) something like this? (I might be missing some context)

func (c *myCache) PutWithEviction(ctx context.Context, key, value interface{}) error {
	itemSizeBytes := calculateSize(value)
	err := c.cache.CanReserveCB() // checks if there's free space in the manager
	if err != nil {
		return err // overall manager is full
	}
	defer c.cache.PutCB() // which calls c.budgetMgr.Update(c.cacheID, c.Size(), c.count())
	// evict until there's room for the new item
	needBytes, needCount := itemSizeBytes, int64(1)
	var freedBytes uint64
	var freedCount int64
	for freedBytes < needBytes || freedCount < needCount {
		evictedBytes, evictedCount, err := c.cache.EvictOldest()
		if err != nil {
			break
		}
		freedBytes += evictedBytes
		freedCount += evictedCount
	}
	return c.cache.Put(key, value)
}

More generally, I think it's reasonable to not want the cache implementations to be super aware of an overarching accounting thing (the manager in this instance), but I think changing their implementations to have a callback for get/put results could get you close enough.

ReserveOrReclaimSelfReleaseWithCallback(ctx context.Context, cacheID string, nBytes uint64, nCount int64, retriable bool, reclaim ReclaimSelfRelease, callback func() error) error

// ReserveOrReclaimManagerReleaseWithCallback reserves/reclaims capacity, executes callback, releases on callback error.
ReserveOrReclaimManagerReleaseWithCallback(ctx context.Context, cacheID string, nBytes uint64, nCount int64, retriable bool, reclaim ReclaimManagerRelease, callback func() error) error

I would suggest also having an Update(cacheID, size, count) method which allows the caller to do a more basic write operation, something like the sketch below
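
Roughly this shape (hypothetical signature), called after the cache has already mutated itself:

// Update is an absolute write: the cache reports its current totals and
// the manager reconciles its counters, instead of tracking deltas.
Update(cacheID string, usedBytes uint64, usedCount int64)

// e.g. from the cache side after a put or an eviction:
// m.Update(c.cacheID, c.SizeBytes(), c.Count())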

type ReclaimManagerRelease func(needBytes uint64, needCount int64) (freedBytes uint64, freedCount int64)

// CapEnforcementResult contains the result of capacity enforcement (both hard and soft cap) for a cache
type CapEnforcementResult struct {

in other programming languages, this would be a normal pattern (option types), but in golang you're really going against the grain of convention by putting an error inside another value. Because we must write code for the team and not ourselves, I think you probably want to pull the error value out into a separate return value.
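
In other words, the conventional Go shape would be something like this (illustrative; the manager type is the one in this PR):

// CapEnforcementResult keeps only the enforcement facts; the error is
// returned separately, per Go convention.
type CapEnforcementResult struct {
	HardCapExceeded bool
	SoftCapExceeded bool
}

func (m *manager) enforceCaps(cacheID string, nBytes uint64, nCount int64) (CapEnforcementResult, error) {
	// ... enforcement logic ...
	return CapEnforcementResult{}, nil
}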

func (m *manager) updateMetrics() {
	// Emit capacity metrics
	capacityBytes := m.CapacityBytes()
	if capacityBytes != math.MaxUint64 {

I don't think it's super likely you need the guard, as it'll wrap around if it's exceeding max uint, though by that time whatever you're measuring is pretty big, so it's probably beyond any reasonable size that could be in memory for current computing.

I know I was being annoying about it before; I think I was fixating on the wrong thing in practice. If you do want to put guards in place, they probably should be <


if nBytes > 0 {
	if err := m.reserveBytes(nBytes); err != nil {
		cacheUsage.mu.Unlock()

for future safety, do you mind moving this into a defer under line 367 instead, so that it can't be missed on a refactor?

This code looks fine, but if someone half paying attention does a return err on some random codepath, they could miss the unlock
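
i.e., the usual pattern:

cacheUsage.mu.Lock()
defer cacheUsage.mu.Unlock() // runs on every return path, including panics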


cacheUsage := m.getCacheUsage(cacheID)

cacheUsage.mu.Lock()

same comment as above: putting defer unlock next to the lock is generally safer / easier / works on panic recovery

if nCount < 0 {
	if m.logger != nil {
		m.logger.Error("Invalid negative count value in ReserveCountForCache",
			tag.Dynamic("cache_id", cacheID),

nit: if the cache_id is to be logged frequently, it's probably worth making a dedicated log tag


cacheUsage := m.getCacheUsage(cacheID)

cacheUsage.mu.Lock()

same comment as above: please put the unlock in a defer

// can safely take its own locks (e.g., for protecting internal data structures
// during eviction) without risk of deadlock. The cache must call ReleaseForCache
// separately after evicting items.
reclaim(needB, needC)

Question: as far as I'm aware, the LRU cache does this internally. I might not quite be understanding the purpose of doing this reclaim step here too? Is this due to the risk of concurrent updates or something else?

If you're happy with a possible overshoot / using the optimistic update approach, then I would think you shouldn't need to loop here?

