
Conversation

Contributor

@aknuds1 aknuds1 commented Nov 28, 2025

What this PR does

Augment the bucket store subsystem with support for GCS rate limiting, per Google's best practices. There are separate configurations for upload and read rate limiting respectively, since Google's guidelines differ for each.

A remaining question is whether to divide initial and max QPS by the expected number of replicas at deployment time.

Which issue(s) this PR fixes or relates to

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]. If changelog entry is not needed, please add the changelog-not-needed label to the PR.
  • about-versioning.md updated with experimental features.

Note

Adds GCS upload/read rate limiting with exponential ramp-up, wiring it into the bucket client, plus new flags, docs, defaults, metrics, validation, and tests.

  • Enhancement:
    • Add GCS request rate limiting (uploads and reads) with exponential ramp-up following Google best practices.
  • Storage/GCS:
    • Wrap GCS bucket client with retry and new rate-limiting layer; expose Prometheus metrics and accept a prometheus.Registerer.
    • Validate GCS config on use.
  • Configuration & Flags:
    • New flags and config fields: *-gcs.upload-rate-limit-enabled, *-gcs.upload-initial-qps, *-gcs.upload-max-qps, *-gcs.upload-ramp-period, *-gcs.read-rate-limit-enabled, *-gcs.read-initial-qps, *-gcs.read-max-qps, *-gcs.read-ramp-period across blocks-storage, ruler-storage, alertmanager-storage, and common.storage.
    • Update defaults (operations/mimir/mimir-flags-defaults.json) and help output (help-all.txt.tmpl), and docs (configuration-parameters/index.md).
  • Code Changes:
    • Change gcs.NewBucketClient and its caller to pass a Registerer; add rate limiter implementation (rate_limiter.go).
    • Add config validation for rate-limiting parameters.
  • Tests:
    • Add comprehensive unit tests for rate limiter and rate-limited bucket behavior.
  • Changelog:
    • Document new GCS rate limiting support and flags.

Written by Cursor Bugbot for commit 48d19a3. This will update automatically on new commits.

Contributor

github-actions bot commented Nov 28, 2025

@aknuds1 aknuds1 force-pushed the arve/gcs-rate-limiter branch 10 times, most recently from 268072a to c0249fa Compare December 1, 2025 15:28
@aknuds1 aknuds1 added the enhancement New feature or request label Dec 1, 2025
@aknuds1 aknuds1 changed the title WIP: Bucket store: Support GCS rate limiting Bucket store: Support GCS rate limiting Dec 1, 2025
@aknuds1 aknuds1 marked this pull request as ready for review December 1, 2025 15:29
@aknuds1 aknuds1 requested review from a team and tacole02 as code owners December 1, 2025 15:29
@aknuds1 aknuds1 force-pushed the arve/gcs-rate-limiter branch from c0249fa to 48950f5 Compare December 1, 2025 15:36
Contributor

github-actions bot commented Dec 1, 2025

@aknuds1 aknuds1 force-pushed the arve/gcs-rate-limiter branch 2 times, most recently from a23f386 to 48d19a3 Compare December 1, 2025 16:57
@aknuds1 aknuds1 marked this pull request as draft December 1, 2025 16:58
ConstLabels: constLabels,
}, []string{"allowed"})
rl.currentQPSGauge.Set(float64(startQPS))
}

Bug: Rate limiter metrics lack component labels causing duplicate registration

The rate limiter metrics (cortex_gcs_rate_limited_seconds_total, cortex_gcs_current_qps, cortex_gcs_requests_total) are registered with only an operation label but no component/bucket name differentiation. In NewClient, the registerer is passed directly to gcs.NewBucketClient before being wrapped with prometheus.WrapRegistererWith(prometheus.Labels{"component": name}, reg) in bucketWithMetrics. This means if multiple GCS bucket clients with rate limiting enabled are created in the same process (e.g., compactor and store-gateway in monolithic mode), the duplicate metric registration will cause a panic at startup. The name parameter is passed to NewBucketClient but not used in the rate limiter metrics.


Contributor Author

Thanks Cursor, I believe it's now resolved.

# (advanced) Initial queries per second limit for GCS uploads. The rate doubles
# every ramp period until it reaches the maximum.
# CLI flag: -<prefix>.gcs.upload-initial-qps
[upload_initial_qps: <int> | default = 1000]
Contributor

In the Google best practices, I'm reading:

If your request rate is expected to go over these thresholds, you should start with a request rate below or near the thresholds and then gradually increase the rate, no faster than doubling over a period of 20 minutes.

Here we have the default set to the maximum value for both uploads (1000) and reads (5000). Should we instead start with values below the recommended thresholds rather than at the maximum? I'm thinking we might run into rate limiting otherwise.

Contributor Author

Aren't the recommended limits 1000 and 5000 respectively? I don't know of any others. Also, when I was speaking with Google Cloud support, they were only concerned about going above those limits. Additionally, in practice we won't hit these limits exactly, since we have to approximate them by dividing by the expected number of replicas. All in all, I wouldn't worry, especially as the limiter automatically backs off if it receives a rate-limiting error.
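
Dividing a global GCS limit across replicas, as mentioned here and in the PR description's open question, could be sketched like this. The helper is purely illustrative and not part of the PR:

```go
package main

import "fmt"

// perReplicaQPS approximates each replica's share of a global GCS QPS limit,
// so that N replicas together stay near (but under) the bucket-wide limit.
// Illustrative sketch only.
func perReplicaQPS(globalQPS, replicas int) int {
	if replicas < 1 {
		replicas = 1
	}
	qps := globalQPS / replicas
	if qps < 1 {
		qps = 1 // never drop below one request per second
	}
	return qps
}

func main() {
	// e.g. the 1000 QPS upload guideline split across 4 store-gateways,
	// and the 5000 QPS read guideline split across 10 replicas.
	fmt.Println(perReplicaQPS(1000, 4))
	fmt.Println(perReplicaQPS(5000, 10))
}
```

Integer division makes the aggregate slightly conservative, which errs on the safe side of the bucket-wide limit.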

default:
panic(fmt.Errorf("unrecognized rateLimiterMode %v", mode))
}
startQPS := min(initialQPS, maxQPS)
Contributor

I wonder if we should disallow maxQps < initialQps instead of taking the min here. With the current config it's possible to set:

  • initialQps: 10
  • maxQps: 5

Is that something we should allow? Or should we at least recommend that maxQps be higher?

Contributor Author

In practice one shouldn't have max QPS lower than initial QPS; this is just a guard. I don't mind if we return a validation error instead.
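
The validation-error alternative discussed here could look like the following. Names are illustrative, not the PR's actual identifiers:

```go
package main

import (
	"errors"
	"fmt"
)

// validateRateLimitConfig rejects maxQPS < initialQPS up front instead of
// silently clamping with min(). Illustrative sketch only.
func validateRateLimitConfig(initialQPS, maxQPS int) error {
	if initialQPS <= 0 || maxQPS <= 0 {
		return errors.New("initial and max QPS must be positive")
	}
	if maxQPS < initialQPS {
		return fmt.Errorf("max QPS (%d) must not be lower than initial QPS (%d)", maxQPS, initialQPS)
	}
	return nil
}

func main() {
	// The misconfiguration from the comment above fails fast...
	fmt.Println(validateRateLimitConfig(10, 5))
	// ...while a sane config passes.
	fmt.Println(validateRateLimitConfig(1000, 5000))
}
```

Failing fast at config load surfaces the mistake to the operator, whereas silently taking the min hides it.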

if reg != nil {
constLabels := prometheus.Labels{"name": name, "operation": operation}
rl.rateLimitedSeconds = promauto.With(reg).NewCounter(prometheus.CounterOpts{
Name: "cortex_gcs_rate_limited_seconds_total",
Contributor

Why do we use a counter for measuring seconds, and not a histogram? Just curious.

Contributor Author

I haven't given it much thought yet. The PR is still in an early state. Do you have specific arguments for using a histogram instead?

if newQPS != rl.currentQPS {
rl.currentQPS = newQPS
rl.limiter.SetLimit(rate.Limit(newQPS))
rl.limiter.SetBurst(newQPS * 2)
Contributor

I think this can end up higher than maxQps 🤔, which could trigger rate limiting, no?

Contributor

Above we set newQPS = rl.maxQPS
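
One way to address the concern in this thread would be to clamp the burst so it never exceeds maxQPS, even while `newQPS * 2` provides headroom during ramp-up. A sketch with assumed names, not the PR's code:

```go
package main

import "fmt"

// burstFor caps the token-bucket burst at maxQPS, so the limiter never
// admits more than maxQPS requests in a single burst while still giving
// 2x headroom during ramp-up. Illustrative sketch only.
func burstFor(newQPS, maxQPS int) int {
	burst := newQPS * 2
	if burst > maxQPS {
		burst = maxQPS
	}
	return burst
}

func main() {
	// Early in the ramp, 2x headroom stays under the cap...
	fmt.Println(burstFor(1000, 8000))
	// ...but near the cap, the burst is clamped to maxQPS.
	fmt.Println(burstFor(5000, 8000))
}
```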

Contributor

@tacole02 tacole02 left a comment

Docs look good! I left a few minor suggestions. Thank you!

@aknuds1 aknuds1 force-pushed the arve/gcs-rate-limiter branch from d99a957 to 4fe3e06 Compare December 5, 2025 09:09
@aknuds1 aknuds1 force-pushed the arve/gcs-rate-limiter branch from 9e60897 to 054ea7a Compare December 5, 2025 10:09
Signed-off-by: Arve Knudsen <[email protected]>
Contributor

@tacole02 tacole02 left a comment

Docs look good! Thank you!
