feat(cache): Add budget manager for cache capacity control #7399
base: master
Conversation
Signed-off-by: fimanishi <[email protected]> # Conflicts: # common/metrics/defs.go
Signed-off-by: fimanishi <[email protected]>
As a general point, I'm struggling a little to follow the implementation, I think in part because I've forgotten the cache side a bit. Would either another PR or an example of how to use it be possible?
Partway through review, but just submitting comments for now, which are... possibly not super helpful, but hopefully illustrate the general theme.
Apart from examples, which I think would help both a user and the review, I worry mostly about the degree of upfront complexity.
A lot of these features seem quite reasonable and quite useful in a cache context, but as a user it's a little overwhelming, and a few I think may not add a lot of value (the spin-lock limit check, for example), so I would suggest paring the API back down to fit its initial purpose, with the option to expand it as needed.
common/cache/budget.go
Outdated
// Cache-aware reservation methods for two-tier soft cap enforcement

// ReserveForCache reserves usage for a specific cache, applying two-tier soft cap logic.
ReserveForCache(cacheID string, nBytes uint64, nCount int64) error
If we have the below methods, do we need to put this in the interface? It can be in the implementation, but it'd be good to start with as small a surface area as possible.
You probably know this already, but FYI: the functions here don't need to be the complete set in the implementation; the interface only needs to cover the user-facing part, so if there are functions you're using internally, they don't need to show up here.
common/cache/budget.go
Outdated
// ReserveForCache reserves usage for a specific cache, applying two-tier soft cap logic.
ReserveForCache(cacheID string, nBytes uint64, nCount int64) error
// ReserveBytesForCache reserves bytes for a specific cache, applying two-tier soft cap logic.
ReserveBytesForCache(cacheID string, n uint64) error
nit: consider a more informative name than n, just to tell folks reading in their IDEs that this is bytes
common/cache/budget.go
Outdated
// Cache-aware release methods

// ReleaseForCache releases usage for a specific cache.
ReleaseForCache(cacheID string, nBytes uint64, nCount int64)
What calls this?
name string,
maxBytes dynamicproperties.IntPropertyFn,
maxCount dynamicproperties.IntPropertyFn,
admission AdmissionMode,
You can always add a new kind of admission later, I'd suggest picking one that works best for now
common/cache/budget.go
Outdated
}

if m.logger != nil {
	m.logger.Warn("Hard capacity limit exceeded",
I would worry this would be extremely noisy; it might be worth logging at debug level so that we can still gather the data, potentially with sampling?
}
}

func (m *manager) reserveBytesStrict(n uint64) error {
I... apologize for bringing up spin-locks earlier; unless I'm mistaken, I'm not sure this degree of complexity is warranted in the manager use-case. I'd suggest just going with reserveBytesOptimistic and lowering the complexity if possible.
isActive := cacheUsage.usedBytes > 0 || cacheUsage.usedCount > 0
cacheUsage.mu.Unlock()

if !wasActive && isActive {
I don't think we got to discuss the active vs inactive states the various caches go through. What is the number of active caches used to drive? I'm guessing it's used in the fairness calculations?
… to debug Signed-off-by: fimanishi <[email protected]>
…nterface Signed-off-by: fimanishi <[email protected]>
	"github.com/uber/cadence/common/metrics"
)

// This implements a generic host-scoped budget manager suitable for
nit: put this comment over (or mentioning) the Manager interface so that Golang's tooling, IDE support, and Godoc can read it. A floating comment won't be visible unless someone reads the source code.
// return c.budgetMgr.ReserveOrReclaimManagerReleaseWithCallback(
//     ctx, c.cacheID, itemSizeBytes, 1, true,
//     func(needBytes uint64, needCount int64) (uint64, int64) {
//         return c.cache.EvictLRU(needBytes, needCount)
Comment: no action required.
Happy to 👍 this to get unblocked / allow experimentation, but IMHO this is something that would be less complex if it lived as a callback on the cache side.
I.e. (to use the example below), something like this (I might be missing some context):
func (c *myCache) PutWithEviction(ctx context.Context, key, value interface{}) error {
	itemSizeBytes := calculateSize(value)
	// checks if there's free space in the manager, returning how much must be freed
	needBytes, needCount, err := c.cache.CanReserveCB(itemSizeBytes, 1)
	if err != nil {
		return err // overall manager is full
	}
	defer c.cache.PutCB() // which calls c.budgetMgr.Update(c.cacheID, c.Size(), c.Count())
	var freedBytes uint64
	var freedCount int64
	for freedBytes < needBytes || freedCount < needCount {
		evictedBytes, evictedCount, err := c.cache.EvictOldest()
		if err != nil {
			break
		}
		freedBytes += evictedBytes
		freedCount += evictedCount
	}
	return c.cache.Put(key, value)
}
More generally, I think it's reasonable to not want the cache implementations to be super aware of an overarching accounting thing (the manager in this instance), but changing their implementations to have a callback for get/put results I think could get you close enough.
ReserveOrReclaimSelfReleaseWithCallback(ctx context.Context, cacheID string, nBytes uint64, nCount int64, retriable bool, reclaim ReclaimSelfRelease, callback func() error) error

// ReserveOrReclaimManagerReleaseWithCallback reserves/reclaims capacity, executes callback, releases on callback error.
ReserveOrReclaimManagerReleaseWithCallback(ctx context.Context, cacheID string, nBytes uint64, nCount int64, retriable bool, reclaim ReclaimManagerRelease, callback func() error) error
I would suggest having an Update(cacheID, size, count) method which allows the caller to do a more basic write operation as well.
type ReclaimManagerRelease func(needBytes uint64, needCount int64) (freedBytes uint64, freedCount int64)

// CapEnforcementResult contains the result of capacity enforcement (both hard and soft cap) for a cache
type CapEnforcementResult struct {
In other programming languages this would be a normal pattern (option types), but in Golang you're really going against the grain of convention by putting an error inside another value. Because we must write code for the team and not ourselves, I think you probably want to pull the error value out into a separate return value.
func (m *manager) updateMetrics() {
	// Emit capacity metrics
	capacityBytes := m.CapacityBytes()
	if capacityBytes != math.MaxUint64 {
I don't think it's super likely you need the guard, as it'll wrap around if it exceeds the max uint, though by that time whatever you're measuring is pretty big, so it's probably beyond any reasonable size that could be in memory for current computing.
I know I was being annoying about it before; I think I was fixating on the wrong thing in practice. If you do want to put guards in place, they probably should be <.
if nBytes > 0 {
	if err := m.reserveBytes(nBytes); err != nil {
		cacheUsage.mu.Unlock()
For future safety, do you mind moving this into a defer under line 367 instead, so that it can't be missed in a refactor?
This code looks fine, but if someone half-paying-attention adds a return err on some random codepath, they could miss the unlock.
cacheUsage := m.getCacheUsage(cacheID)

cacheUsage.mu.Lock()
same comment as above: putting defer unlock next to the lock is generally safer / easier / works on panic recovery
if nCount < 0 {
	if m.logger != nil {
		m.logger.Error("Invalid negative count value in ReserveCountForCache",
			tag.Dynamic("cache_id", cacheID),
nit: if the cache_id is to be logged frequently, it's probably worth making a dedicated log tag for it
cacheUsage := m.getCacheUsage(cacheID)

cacheUsage.mu.Lock()
same comment as above: request to put the unlock in a defer
// can safely take its own locks (e.g., for protecting internal data structures
// during eviction) without risk of deadlock. The cache must call ReleaseForCache
// separately after evicting items.
reclaim(needB, needC)
Question: as far as I'm aware, the LRU cache does this internally. I might not quite be understanding the purpose of doing this reclaim step here too? Is this due to the risk of concurrent updates, or something else?
If you're happy with a possible overshoot / using the optimistic update approach, then I would think you shouldn't need to loop here?
Signed-off-by: fimanishi [email protected]
What changed?
Added a generic budget manager for cache capacity control with the following features:
- configurable threshold
Why?
To provide a unified, host-scoped budget management system that can be shared across multiple caches (both evicting and non-evicting) with:
- policies (optimistic vs strict)
How did you test it?
Unit tests. More tests will be done when the manager is used by the replication cache.
Potential risks
No risk. This is just the manager implementation. It is not being used.
Release notes
Added new budget manager for cache capacity control. No migration required - this is a new internal component that can be optionally adopted by existing caches. New metrics available:
- cadence_budget_manager_capacity_bytes/count
- cadence_budget_manager_used_bytes/count
- cadence_budget_manager_soft_threshold
- cadence_budget_manager_active_cache_count
- cadence_budget_manager_hard_cap_exceeded
- cadence_budget_manager_soft_cap_exceeded

Documentation Changes