Bucket store: Support GCS rate limiting #13703
Conversation
		ConstLabels: constLabels,
	}, []string{"allowed"})
	rl.currentQPSGauge.Set(float64(startQPS))
}
Bug: Rate limiter metrics lack component labels causing duplicate registration
The rate limiter metrics (cortex_gcs_rate_limited_seconds_total, cortex_gcs_current_qps, cortex_gcs_requests_total) are registered with only an operation label but no component/bucket name differentiation. In NewClient, the registerer is passed directly to gcs.NewBucketClient before being wrapped with prometheus.WrapRegistererWith(prometheus.Labels{"component": name}, reg) in bucketWithMetrics. This means if multiple GCS bucket clients with rate limiting enabled are created in the same process (e.g., compactor and store-gateway in monolithic mode), the duplicate metric registration will cause a panic at startup. The name parameter is passed to NewBucketClient but not used in the rate limiter metrics.
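A minimal sketch of the kind of fix the bot is pointing at, assuming the same wrapping approach used by bucketWithMetrics; the helper name and signature below are illustrative, not the PR's actual code:

```go
package gcs

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Illustrative sketch (not the PR's code): wrapping the registerer with a
// per-client "component" label means that two rate-limited GCS bucket clients
// in the same process (e.g. compactor and store-gateway in monolithic mode)
// register distinct series instead of panicking on duplicate registration.
func newRateLimitedSecondsCounter(reg prometheus.Registerer, component, operation string) prometheus.Counter {
	wrapped := prometheus.WrapRegistererWith(prometheus.Labels{"component": component}, reg)
	return promauto.With(wrapped).NewCounter(prometheus.CounterOpts{
		Name:        "cortex_gcs_rate_limited_seconds_total",
		Help:        "Total time spent waiting on the GCS rate limiter.",
		ConstLabels: prometheus.Labels{"operation": operation},
	})
}
```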
Thanks Cursor, I believe it's now resolved.
# (advanced) Initial queries per second limit for GCS uploads. The rate doubles
# every ramp period until it reaches the maximum.
# CLI flag: -<prefix>.gcs.upload-initial-qps
[upload_initial_qps: <int> | default = 1000]
In Google's best practices I'm reading:
If your request rate is expected to go over these thresholds, you should start with a request rate below or near the thresholds and then gradually increase the rate, no faster than doubling over a period of 20 minutes.
Here we have the default set to the maximum value for both uploads (1000) and reads (5000). Should we instead start below the recommended thresholds and ramp up, rather than starting at the maximum? I'm thinking we might otherwise run into rate limiting.
Aren't the recommended limits 1000 and 5000, respectively? I don't know of any others. Also, when I spoke with Google Cloud support, they were only concerned about going above those limits. Additionally, in practice we won't hit these limits exactly, since we have to approximate them by dividing by the expected number of replicas. All in all, I wouldn't worry, especially as the limiter automatically backs off if it receives a rate limiting error.
default:
	panic(fmt.Errorf("unrecognized rateLimiterMode %v", mode))
}
startQPS := min(initialQPS, maxQPS)
I wonder if we should disallow maxQps < initialQps instead of taking the min here. With the current config it's possible to set:
- initialQps: 10
- maxQps: 5
Is that something we should allow? Or should we at least document that maxQps should be higher?
In practice one shouldn't set max QPS lower than initial QPS; this is just a guard. It doesn't really matter to me whether we return a validation error instead.
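A minimal sketch of the validation-error alternative discussed above; the function name is an assumption, not part of the PR:

```go
package gcs

import "fmt"

// validateRateLimitQPS is an illustrative sketch: reject configurations where
// the maximum QPS is lower than the initial QPS instead of silently clamping
// the starting value with min().
func validateRateLimitQPS(initialQPS, maxQPS int) error {
	if maxQPS < initialQPS {
		return fmt.Errorf("GCS rate limit max QPS (%d) must not be lower than initial QPS (%d)", maxQPS, initialQPS)
	}
	return nil
}
```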
if reg != nil {
	constLabels := prometheus.Labels{"name": name, "operation": operation}
	rl.rateLimitedSeconds = promauto.With(reg).NewCounter(prometheus.CounterOpts{
		Name: "cortex_gcs_rate_limited_seconds_total",
Why do we use a counter for measuring seconds, and not a histogram? Just curious.
I haven't given it much thought yet. The PR is still in an early state. Do you have specific arguments for using a histogram instead?
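For comparison, a histogram variant would look roughly like the sketch below; the metric name and buckets are assumptions, and the PR currently uses the counter cortex_gcs_rate_limited_seconds_total. A counter only sums throttled time, while a histogram would also expose the per-request wait distribution at the cost of extra series.

```go
package gcs

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Illustrative sketch of the histogram alternative raised in review: it would
// record the distribution of per-request waits on the rate limiter instead of
// only their running total.
func newRateLimitedWaitHistogram(reg prometheus.Registerer, operation string) prometheus.Histogram {
	return promauto.With(reg).NewHistogram(prometheus.HistogramOpts{
		// Hypothetical metric name, not the one used by the PR.
		Name:        "cortex_gcs_rate_limited_wait_duration_seconds",
		Help:        "Time spent waiting on the GCS rate limiter per request.",
		Buckets:     prometheus.DefBuckets,
		ConstLabels: prometheus.Labels{"operation": operation},
	})
}
```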
if newQPS != rl.currentQPS {
	rl.currentQPS = newQPS
	rl.limiter.SetLimit(rate.Limit(newQPS))
	rl.limiter.SetBurst(newQPS * 2)
I think this can end up higher than maxQps 🤔, which could cause rate limiting, no?
Above we set newQPS = rl.maxQPS
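For context, a rough sketch of the ramp-up behavior described in the flag docs (the QPS doubles every ramp period but is clamped to the configured maximum before the limiter is updated); the type and field names are illustrative, not necessarily the PR's:

```go
package gcs

import "golang.org/x/time/rate"

// Illustrative sketch of a rate limiter that ramps up by doubling.
type rampingLimiter struct {
	currentQPS int
	maxQPS     int
	limiter    *rate.Limiter
}

// rampUp doubles the current QPS, clamps it to maxQPS, and only then updates
// the limiter, so the sustained rate never exceeds the configured maximum.
// The burst of twice the current QPS only bounds short spikes, not the
// sustained request rate enforced by SetLimit.
func (rl *rampingLimiter) rampUp() {
	newQPS := rl.currentQPS * 2
	if newQPS > rl.maxQPS {
		newQPS = rl.maxQPS
	}
	if newQPS != rl.currentQPS {
		rl.currentQPS = newQPS
		rl.limiter.SetLimit(rate.Limit(newQPS))
		rl.limiter.SetBurst(newQPS * 2)
	}
}
```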
tacole02 left a comment:
Docs look good! I left a few minor suggestions. Thank you!
tacole02 left a comment:
Docs look good! Thank you!
What this PR does
Augment the bucket store subsystem with support for GCS rate limiting, per Google's best practices. There are separate configurations for upload and read rate limiting, because the GCS guidelines differ for each.
A remaining question is whether to divide initial and max QPS by the expected number of replicas at deployment time.
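For illustration, dividing the bucket-wide limits across replicas could look like the sketch below; the helper is hypothetical and not part of the PR:

```go
package gcs

// perReplicaQPS is an illustrative sketch of the open question above: split a
// bucket-wide QPS limit evenly across the expected number of replicas, with a
// floor of 1 so no replica ends up with a zero limit.
func perReplicaQPS(bucketQPS, expectedReplicas int) int {
	if expectedReplicas < 1 {
		expectedReplicas = 1
	}
	qps := bucketQPS / expectedReplicas
	if qps < 1 {
		qps = 1
	}
	return qps
}
```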
Which issue(s) this PR fixes or relates to
Checklist
- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]. If a changelog entry is not needed, please add the changelog-not-needed label to the PR.
- about-versioning.md updated with experimental features.

Note
Adds GCS upload/read rate limiting with exponential ramp-up, wiring it into the bucket client, plus new flags, docs, defaults, metrics, validation, and tests.
- Registers the rate limiter metrics via the provided prometheus.Registerer.
- Adds the flags *-gcs.upload-rate-limit-enabled, *-gcs.upload-initial-qps, *-gcs.upload-max-qps, *-gcs.upload-ramp-period, *-gcs.read-rate-limit-enabled, *-gcs.read-initial-qps, *-gcs.read-max-qps, *-gcs.read-ramp-period across blocks-storage, ruler-storage, alertmanager-storage, and common.storage.
- Updates flag defaults (operations/mimir/mimir-flags-defaults.json), help output (help-all.txt.tmpl), and docs (configuration-parameters/index.md).
- Changes gcs.NewBucketClient and its caller to pass a Registerer; adds the rate limiter implementation (rate_limiter.go).

Written by Cursor Bugbot for commit 48d19a3. This will update automatically on new commits.