Skip to content

Conversation

@pranav-merugu
Copy link

Description

Fixes #19580

This PR addresses a bug in rate_counter() where it returns approximately half the expected rate when log samples fall exactly at range boundaries.

Problem

When querying rate_counter() with samples at exact range boundaries (e.g., t=0 for a range query at t=120s with [2m]), the sample at the range start was incorrectly excluded. This caused:

  • Loss of the starting counter value
  • Incorrect rate calculations
  • Observed rates appearing as ~50% of expected values

For example, with logs emitted every 60 seconds showing counter increases of ~2200/minute:

  • Expected rate: 2200/60 = ~36.67 messages/second
  • Actual rate (before fix): ~18.33 messages/second (exactly half)

Root Cause

The streamRangeVectorIterator.load() method used a <= start check that excluded samples at the boundary:

if sample.Timestamp <= start {  // Excludes samples AT the boundary
    _ = r.iter.Next()
    continue
}

For counter metrics, this is problematic because we need the starting counter value to calculate accurate rates. The range semantics (start, end] should exclude samples before start, but include samples at start for counter calculations.

Solution

Modified the boundary check to be operation-specific:

  • For rate_counter (counter metrics): Use < start to include boundary samples
  • For other operations: Keep <= start to preserve existing behavior
isCounter := r.r.Operation == syntax.OpRangeTypeRateCounter

if isCounter {
    if sample.Timestamp < start {  // Include samples AT boundary
        _ = r.iter.Next()
        continue
    }
} else {
    if sample.Timestamp <= start {  // Keep original behavior
        _ = r.iter.Next()
        continue
    }
}

Testing

New Tests Added (5 comprehensive test functions)

  1. TestRateCounterBug19580 - Reproduces the exact scenario from issue Recording rule generates incorrect values? #19580

    • 3 samples at t=0, 60, 120s
    • Verifies rate is ~36.93/s (not ~18.47/s)
  2. TestRateCounterBugTwoSamples - Validates 2-sample case still works correctly

  3. TestRateCounterWithReset - Counter reset edge case

    • Value decreases (counter wraps)
    • Verifies positive rate despite reset
  4. TestRateCounterIrregularIntervals - Non-uniform sampling

    • Samples at irregular intervals (0s, 45s, 90s, 180s)
    • Tests extrapolation with irregular intervals
  5. TestRateCounterSingleSampleAtBoundary - Degenerate case

    • Single sample at range boundary
    • Verifies rate=0 (need 2+ samples)

Test Results

$ go test -v -run TestRateCounter ./pkg/logql/
=== RUN   TestRateCounterBug19580
    rate_counter_bug_test.go:105: Expected rate: 36.933333 messages/second
    rate_counter_bug_test.go:106: Actual rate: 36.933370 messages/second
    rate_counter_bug_test.go:107: Ratio (actual/expected): 1.000001
--- PASS: TestRateCounterBug19580 (0.00s)
[... all tests PASS]
PASS
ok      github.com/grafana/loki/v3/pkg/logql   0.590s

Regression Testing

  • ✅ All existing rate_counter tests pass
  • ✅ All existing rate() tests pass (no impact on non-counter metrics)
  • ✅ Full pkg/logql test suite passes (8 packages, 0 failures)

Impact

  • Scope: Minimal - single function, ~10 lines changed
  • Risk: Low - targeted fix with operation-specific logic
  • Affected functionality: Only rate_counter() queries where samples fall at exact range boundaries
  • Backwards compatibility: Preserved for all non-counter operations

Checklist

…t rates

When querying rate_counter() with samples at exact range boundaries,
the sample at the range start was incorrectly excluded, causing rates
to be calculated as approximately half the expected value.

This occurred because the streamRangeVectorIterator.load() method
used a '<= start' check that excluded samples at the boundary. For
counter metrics, this is problematic because we need the starting
counter value to calculate accurate rates.

Changes:
- Modified streamRangeVectorIterator.load() to include samples at
  range boundaries for counter metrics (rate_counter operation)
- Preserved existing behavior for non-counter aggregations
- Added comprehensive test coverage for edge cases

Fixes grafana#19580
@pranav-merugu pranav-merugu requested a review from a team as a code owner December 11, 2025 02:50
@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Recording rule generates incorrect values?

2 participants