Skip to content

Rate limiting observability (metrics and tracing) PR A#5701

Open
Sanskarzz wants to merge 1 commit into
stacklok:mainfrom
Sanskarzz:ratelimiting-olly
Open

Rate limiting observability (metrics and tracing) PR A#5701
Sanskarzz wants to merge 1 commit into
stacklok:mainfrom
Sanskarzz:ratelimiting-olly

Conversation

@Sanskarzz

Copy link
Copy Markdown
Contributor

Summary

  • Preserve which configured rate-limit bucket rejected a request so later metrics and tracing can report the enforcement source accurately.
  • Add a typed RejectionSource contract and Decision.RejectedBy with stable identifiers for shared server, shared tool, per-user server, and per-user
    tool buckets.
  • Pair each applicable bucket with its rejection identifier while retaining the existing atomic Lua result and bucket priority order.
  • Preserve the source through the existing ratelimit.Allow adapter and RateLimitedError, which are already used by MCPServer and vMCP.
  • Add focused tests for every rejection source, allowed decisions, adapter propagation, vMCP propagation, and the case where multiple buckets are
    exhausted.

Part of #4553

Type of change

  • Bug fix
  • New feature
  • Refactoring (no behavior change)
  • Dependency update
  • Documentation
  • Other (describe):

Test plan

  • Unit tests (focused package tests listed below)
  • E2E tests (task test-e2e)
  • Linting (focused package lint listed below)
  • Manual testing (describe below)

API Compatibility

  • This PR does not break the v1beta1 API, OR the api-break-allowed label is applied and the migration guidance is described above.

Changes

File Change
pkg/ratelimit/limiter.go Defines the typed rejection-source contract and adds it to decisions.
pkg/ratelimit/errors.go Preserves rejection source on the existing rate-limit error.
pkg/ratelimit/limiter_test.go Verifies all rejection sources and first-rejection ordering.
pkg/vmcp/ratelimit/decorator_test.go Verifies rejection source reaches the vMCP enforcement boundary.

Does this introduce a user-facing change?

No. Rate-limit enforcement, responses, Redis keys, retry timing, and fail-open behavior are unchanged.

Implementation plan

Approved implementation plan
  1. Define bounded rejection identifiers with a named RejectionSource type.
  2. Pair applicable buckets with those stable identifiers.
  3. Populate Decision.RejectedBy from the first rejected index returned by the
    existing atomic Lua check.
  4. Preserve the source when the existing ratelimit.Allow helper returns a
    RateLimitedError.
  5. Leave RejectedBy empty for allowed decisions.
  6. Verify every bucket source, adapter propagation, vMCP propagation, and
    first-rejection ordering.

Follow-up PRs

This is the first PR in the rate-limit observability work. To keep each review
focused, the remaining changes will be opened sequentially after the preceding
PR merges:

  1. Core rate-limit metrics

    Add OpenTelemetry instruments for allowed/rejected decisions, Redis errors,
    and Lua check latency. Decision metrics will use the rejection metadata from
    this PR to distinguish shared versus per-user and server versus tool checks.
    Redis failures will be classified into bounded error types, and latency will
    cover each attempted atomic Lua check without adding tool names or user IDs
    as high-cardinality labels.

    This PR will include focused metric-reader unit tests, update MCPServer
    middleware ordering so telemetry is active when rate limiting executes, and
    extend the existing vMCP rate-limit E2E to enable Prometheus, send allowed
    and rejected traffic, scrape /metrics, and verify non-zero decision and
    latency series. The new metric names, labels, and counting semantics will be
    documented in the same PR.

  2. Fail-open metric

    Add a dedicated counter for requests that proceed because Redis was
    unavailable. This is separate from the Redis-error counter: the error metric
    reports the dependency failure, while the fail-open metric reports its
    effect on enforcement. The implementation will preserve the current policy
    in both production paths: MCPServer HTTP middleware and the vMCP
    core.VMCP decorator continue forwarding requests after infrastructure
    errors.

    Focused tests will verify one increment per affected request, no increments
    for normal allows or rate-limit rejections, repeated outage behavior, and
    continued delegation through both enforcement paths. Its operational
    meaning will be added to the metric documentation in the same PR.

  3. Request-span attributes

    Annotate the existing request span rather than creating a second span.
    Allowed requests will report rate_limit.decision=allowed,
    rate_limit.rejected_by=none, and rate_limit.fail_open=false. Rejected
    requests will identify the bucket preserved by this PR. Redis failures will
    report an allowed decision with rate_limit.fail_open=true, making it
    possible to distinguish a normal allow from an unenforced request during an
    outage.

    Span-recorder tests will cover allowed, rejected, and fail-open outcomes
    through both the MCPServer middleware and vMCP decorator, including the
    invariant that rejected calls never reach the next handler or inner
    core.VMCP. The tracing attribute contract will be documented in that PR.

Each follow-up will contain the tests and documentation as needed.

Special notes for reviewers

The new metadata uses the rejected bucket index already returned by bucket.ConsumeAll; it does not introduce another rate-limit implementation or perform a second Redis check. It is propagated through the existing ratelimit.Allow and RateLimitedError path, while the vMCP core.VMCP decorator implementation remains unchanged.

Signed-off-by: Sanskarzz <sanskar.gur@gmail.com>
@github-actions github-actions Bot added the size/S Small PR: 100-299 lines changed label Jul 1, 2026
@codecov

codecov Bot commented Jul 1, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.66%. Comparing base (99e233d) to head (733dc33).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5701   +/-   ##
=======================================
  Coverage   70.66%   70.66%           
=======================================
  Files         667      667           
  Lines       67607    67625   +18     
=======================================
+ Hits        47775    47790   +15     
- Misses      16361    16365    +4     
+ Partials     3471     3470    -1     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/S Small PR: 100-299 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant