
Conversation

@fimanishi

Signed-off-by: fimanishi [email protected]

What changed?
Added a generic budget manager for cache capacity control with the following features:

  • Two-tier soft cap logic: Capacity is split into free space (shared) and fair share (per-cache) tiers based on a
    configurable threshold (see the sketch after this list)
  • Per-cache usage tracking: Mutex-based tracking of bytes/count usage for each cache
  • Two admission modes: Optimistic (add-then-undo) and Strict (CAS pre-check)
  • Two reclaim patterns: Self-release (cache calls Release per-item) and Manager-release (manager calls Release once with totals)
  • Callback-based wrappers: Reserve/Release/Reclaim methods with automatic cleanup on errors
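
To make the two-tier logic concrete, here is a minimal sketch of the byte-admission check. It is illustrative only: the names (canReserveBytes, softThreshold, activeCaches) and the exact fair-share formula are assumptions for this example, not the PR's actual code.

package budget

import "errors"

var (
	errHardCapExceeded = errors.New("budget: hard cap exceeded")
	errSoftCapExceeded = errors.New("budget: soft cap exceeded")
)

// canReserveBytes applies the two-tier soft cap: below softThreshold the
// whole budget is a shared free-space pool; above it, each active cache
// is held to an equal fair share of the total capacity.
func canReserveBytes(capacity, globalUsed, cacheUsed, n uint64, softThreshold float64, activeCaches uint64) error {
	if globalUsed+n > capacity {
		return errHardCapExceeded // hard cap is absolute
	}
	soft := uint64(float64(capacity) * softThreshold)
	if globalUsed+n <= soft {
		return nil // free-space tier: shared, first come first served
	}
	// fair-share tier: the per-cache limit kicks in past the threshold
	if activeCaches > 0 && cacheUsed+n > capacity/activeCaches {
		return errSoftCapExceeded
	}
	return nil
}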

Why?
To provide a unified, host-scoped budget management system that can be shared across multiple caches (both evicting and non-evicting) with:

  1. Fair resource allocation: Prevents any single cache from monopolizing capacity through the two-tier soft cap system
  2. Flexible integration: Supports different cache eviction patterns (self-release vs manager-release) and admission
    policies (optimistic vs strict), sketched after this list
  3. Safe concurrent access: Mutex-based per-cache tracking combined with atomic global counters ensures thread-safety
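
The difference between the two admission modes is roughly the following sketch (assumed names again; errHardCapExceeded is as in the sketch above, and the atomic counter stands in for the manager's global usage counter):

import "sync/atomic"

// Optimistic (add-then-undo): add first, roll back on overflow.
// Cheap under contention, but can transiently overshoot the cap.
func reserveOptimistic(used *atomic.Uint64, capacity, n uint64) error {
	if used.Add(n) > capacity {
		used.Add(^(n - 1)) // undo the addition (unsigned subtract of n)
		return errHardCapExceeded
	}
	return nil
}

// Strict (CAS pre-check): never admits past the cap, but may retry
// the compare-and-swap under contention.
func reserveStrict(used *atomic.Uint64, capacity, n uint64) error {
	for {
		cur := used.Load()
		if cur+n > capacity {
			return errHardCapExceeded
		}
		if used.CompareAndSwap(cur, cur+n) {
			return nil
		}
	}
}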

How did you test it?
Unit tests. More tests will be done when the manager is used by the replication cache.

Potential risks
No risk. This is just the manager implementation; it is not yet used anywhere.

Release notes
Added a new budget manager for cache capacity control. No migration required - this is a new internal component that can be optionally adopted by existing caches. New metrics available:

  • cadence_budget_manager_capacity_bytes/count
  • cadence_budget_manager_used_bytes/count
  • cadence_budget_manager_soft_threshold
  • cadence_budget_manager_active_cache_count
  • cadence_budget_manager_hard_cap_exceeded
  • cadence_budget_manager_soft_cap_exceeded

Documentation Changes

Signed-off-by: fimanishi <[email protected]>

# Conflicts:
#	common/metrics/defs.go
Signed-off-by: fimanishi <[email protected]>
@davidporter-id-au

As a general point, I'm struggling a little to follow the implementation, I think in part because I've forgotten the cache side a bit. Would either another PR or an example of how to use it be possible?

@davidporter-id-au left a comment

Partway through review but just submitting comments for now which are... possibly not super helpful but hopefully illustrate the general theme.

Apart from examples, which I think would help both a user and the review, I worry mostly about the degree of initial complexity upfront.

A lot of these features seem really quite reasonable and quite useful in a cache context, but as a user it's a little overwhelming, and a few I think may not add a lot of value (the spin-lock limit check, for example). So I'd suggest paring the API back down to what's needed for its initial purpose, with the option to expand it as needed.

// Cache-aware reservation methods for two-tier soft cap enforcement

// ReserveForCache reserves usage for a specific cache, applying two-tier soft cap logic.
ReserveForCache(cacheID string, nBytes uint64, nCount int64) error

If we have the below methods, do we need to put this in the interface? It can be in the implementation, but it'd be good to start with as small a surface area as possible.

You probably know this already, but FYI the functions here don't need to be the complete set in the implementation; the interface only needs to cover the user-facing part, so functions you're using internally don't need to show up in it.
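
For instance, the initial user-facing surface could be as small as this (illustrative, not a prescription):

// Manager is the narrow, user-facing interface; internal helpers such as
// ReserveForCache can live on the concrete implementation only.
type Manager interface {
	ReserveBytesForCache(cacheID string, nBytes uint64) error
	ReleaseForCache(cacheID string, nBytes uint64, nCount int64)
}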

// ReserveForCache reserves usage for a specific cache, applying two-tier soft cap logic.
ReserveForCache(cacheID string, nBytes uint64, nCount int64) error
// ReserveBytesForCache reserves bytes for a specific cache, applying two-tier soft cap logic.
ReserveBytesForCache(cacheID string, n uint64) error

nit: consider a more informative name than n, just to show folks in their IDEs that this is bytes

// Cache-aware release methods

// ReleaseForCache releases usage for a specific cache.
ReleaseForCache(cacheID string, nBytes uint64, nCount int64)

What calls this?

name string,
maxBytes dynamicproperties.IntPropertyFn,
maxCount dynamicproperties.IntPropertyFn,
admission AdmissionMode,

You can always add a new kind of admission later, I'd suggest picking one that works best for now

}

if m.logger != nil {
	m.logger.Warn("Hard capacity limit exceeded",

I would worry this would be extremely noisy; might be worth putting it in at log level debug, potentially with sampling, so that we can still get the data?

}
}

func (m *manager) reserveBytesStrict(n uint64) error {

I... apologize for bringing up spin-locks earlier. Unless I'm mistaken, I'm not sure this degree of complexity is warranted in the manager use-case. I'd suggest just going with reserveBytesOptimistic and lowering the complexity if possible

isActive := cacheUsage.usedBytes > 0 || cacheUsage.usedCount > 0
cacheUsage.mu.Unlock()

if !wasActive && isActive {

I don't think we got to discuss the active vs inactive state the various caches go through. What is the number of active caches used to drive? I'm guessing it's used in the fairness calculations?

"github.com/uber/cadence/common/metrics"
)

// This implements a generic host-scoped budget manager suitable for

nit: Put this comment over / mentioning the Manager interface so that Golang's tooling, IDE support, and Godoc can read it. If you just have a floating comment, it won't be visible unless someone reads the source code

// return c.budgetMgr.ReserveOrReclaimManagerReleaseWithCallback(
// ctx, c.cacheID, itemSizeBytes, 1, true,
// func(needBytes uint64, needCount int64) (uint64, int64) {
// return c.cache.EvictLRU(needBytes, needCount)
@davidporter-id-au Nov 3, 2025

Comment: No action required

Happy to 👍 this to get unblocked / allow experimentation, but imho this is something that would be less complex if it lived as a callback on the cache side.

i.e., (to use the example below) something like this? (I might be missing some context)

func (c *myCache) PutWithEviction(ctx context.Context, key, value interface{}) error {
	itemSizeBytes := calculateSize(value)
	err := c.cache.CanReserveCB() // checks if there's free space in the manager
	if err != nil {
		return err // overall manager is full
	}
	defer c.cache.PutCB() // which calls c.budgetMgr.Update(c.cacheID, c.Size(), c.count())
	// evict until there's room for the new item
	needBytes, needCount := itemSizeBytes, int64(1)
	var freedBytes uint64
	var freedCount int64
	for freedBytes < needBytes || freedCount < needCount {
		evictedBytes, evictedCount, err := c.cache.EvictOldest()
		if err != nil {
			break
		}
		freedBytes += evictedBytes
		freedCount += evictedCount
	}
	return c.cache.Put(key, value)
}

More generally, I think it's reasonable to not want the cache implementations to be super aware of an overarching accounting thing (the manager in this instance), but I think changing their implementations to have a callback for get/put results could get you close enough.

ReserveOrReclaimSelfReleaseWithCallback(ctx context.Context, cacheID string, nBytes uint64, nCount int64, retriable bool, reclaim ReclaimSelfRelease, callback func() error) error

// ReserveOrReclaimManagerReleaseWithCallback reserves/reclaims capacity, executes callback, releases on callback error.
ReserveOrReclaimManagerReleaseWithCallback(ctx context.Context, cacheID string, nBytes uint64, nCount int64, retriable bool, reclaim ReclaimManagerRelease, callback func() error) error

I would suggest also having an Update(cacheID, size, count) method which allows the caller to do a more basic write operation, something like the sketch below
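
Roughly this shape (hypothetical signature), called after the cache has already mutated itself:

// Update is an absolute write: the cache reports its current totals and
// the manager reconciles its counters, instead of tracking deltas.
Update(cacheID string, usedBytes uint64, usedCount int64)

// e.g. from the cache side after a put or an eviction:
// m.Update(c.cacheID, c.SizeBytes(), c.Count())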

type ReclaimManagerRelease func(needBytes uint64, needCount int64) (freedBytes uint64, freedCount int64)

// CapEnforcementResult contains the result of capacity enforcement (both hard and soft cap) for a cache
type CapEnforcementResult struct {

in other programming languages, this would be a normal pattern (option types), but in golang you're really going against the grain of convention by putting an error inside another value. Because we must write code for the team and not ourselves, I think you probably want to pull the error value out into a separate return value.
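
In other words, the conventional Go shape would be something like this (illustrative; the manager type is the one in this PR):

// CapEnforcementResult keeps only the enforcement facts; the error is
// returned separately, per Go convention.
type CapEnforcementResult struct {
	HardCapExceeded bool
	SoftCapExceeded bool
}

func (m *manager) enforceCaps(cacheID string, nBytes uint64, nCount int64) (CapEnforcementResult, error) {
	// ... enforcement logic ...
	return CapEnforcementResult{}, nil
}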

func (m *manager) updateMetrics() {
	// Emit capacity metrics
	capacityBytes := m.CapacityBytes()
	if capacityBytes != math.MaxUint64 {

I don't think it's super likely you need the guard, as it'll wrap around if it's exceeding max uint, though by that time whatever you're measuring is pretty big, so it's probably beyond any reasonable size that could be in memory for current computing.

I know I was being annoying about it before; I think I was fixating on the wrong thing in practice. If you do want to put guards in place, they probably should be <


if nBytes > 0 {
	if err := m.reserveBytes(nBytes); err != nil {
		cacheUsage.mu.Unlock()

for future safety, do you mind moving this into a defer under line 367 instead, so that it can't be missed on a refactor?

This code looks fine, but if someone half paying attention does a return err on some random codepath, they could miss the unlock
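
i.e., the usual pattern:

cacheUsage.mu.Lock()
defer cacheUsage.mu.Unlock() // runs on every return path, including panics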


cacheUsage := m.getCacheUsage(cacheID)

cacheUsage.mu.Lock()

same comment as above: putting defer unlock next to the lock is generally safer / easier / works on panic recovery

if nCount < 0 {
	if m.logger != nil {
		m.logger.Error("Invalid negative count value in ReserveCountForCache",
			tag.Dynamic("cache_id", cacheID),

nit: if the cache_id is to be logged frequently, it's probably worth making a dedicated log tag


cacheUsage := m.getCacheUsage(cacheID)

cacheUsage.mu.Lock()

same comment as above: please put the unlock in a defer

// can safely take its own locks (e.g., for protecting internal data structures
// during eviction) without risk of deadlock. The cache must call ReleaseForCache
// separately after evicting items.
reclaim(needB, needC)

Question: as far as I'm aware, the LRU cache does this internally. I might not quite be understanding the purpose of doing this reclaim step here too? Is this due to the risk of concurrent updates or something else?

If you're happy with a possible overshoot / using the optimistic update approach, then I would think you shouldn't need to loop here?

