fix(logql): include boundary samples for rate_counter to fix incorrect rates #20203

pranav-merugu · 2025-12-11T02:50:14Z

Description

Fixes #19580

This PR addresses a bug in rate_counter() where it returns approximately half the expected rate when log samples fall exactly at range boundaries.

Problem

When querying rate_counter() with samples at exact range boundaries (e.g., t=0 for a range query at t=120s with [2m]), the sample at the range start was incorrectly excluded. This caused:

Loss of the starting counter value
Incorrect rate calculations
Observed rates appearing as ~50% of expected values

For example, with logs emitted every 60 seconds showing counter increases of ~2200/minute:

Expected rate: 2200/60 = ~36.67 messages/second
Actual rate (before fix): ~18.33 messages/second (exactly half)

Root Cause

The streamRangeVectorIterator.load() method used a <= start check that excluded samples at the boundary:

if sample.Timestamp <= start {  // Excludes samples AT the boundary
    _ = r.iter.Next()
    continue
}

For counter metrics, this is problematic because we need the starting counter value to calculate accurate rates. The range semantics (start, end] should exclude samples before start, but include samples at start for counter calculations.

Solution

Modified the boundary check to be operation-specific:

For rate_counter (counter metrics): Use < start to include boundary samples
For other operations: Keep <= start to preserve existing behavior

isCounter := r.r.Operation == syntax.OpRangeTypeRateCounter

if isCounter {
    if sample.Timestamp < start {  // Include samples AT boundary
        _ = r.iter.Next()
        continue
    }
} else {
    if sample.Timestamp <= start {  // Keep original behavior
        _ = r.iter.Next()
        continue
    }
}

Testing

New Tests Added (5 comprehensive test functions)

TestRateCounterBug19580 - Reproduces the exact scenario from issue Recording rule generates incorrect values? #19580
- 3 samples at t=0, 60, 120s
- Verifies rate is ~36.93/s (not ~18.47/s)
TestRateCounterBugTwoSamples - Validates 2-sample case still works correctly
TestRateCounterWithReset - Counter reset edge case
- Value decreases (counter wraps)
- Verifies positive rate despite reset
TestRateCounterIrregularIntervals - Non-uniform sampling
- Samples at irregular intervals (0s, 45s, 90s, 180s)
- Tests extrapolation with irregular intervals
TestRateCounterSingleSampleAtBoundary - Degenerate case
- Single sample at range boundary
- Verifies rate=0 (need 2+ samples)

Test Results

$ go test -v -run TestRateCounter ./pkg/logql/
=== RUN   TestRateCounterBug19580
    rate_counter_bug_test.go:105: Expected rate: 36.933333 messages/second
    rate_counter_bug_test.go:106: Actual rate: 36.933370 messages/second
    rate_counter_bug_test.go:107: Ratio (actual/expected): 1.000001
--- PASS: TestRateCounterBug19580 (0.00s)
[... all tests PASS]
PASS
ok      github.com/grafana/loki/v3/pkg/logql   0.590s

Regression Testing

✅ All existing rate_counter tests pass
✅ All existing rate() tests pass (no impact on non-counter metrics)
✅ Full pkg/logql test suite passes (8 packages, 0 failures)

Impact

Scope: Minimal - single function, ~10 lines changed
Risk: Low - targeted fix with operation-specific logic
Affected functionality: Only rate_counter() queries where samples fall at exact range boundaries
Backwards compatibility: Preserved for all non-counter operations

Checklist

Added comprehensive test coverage
All existing tests pass
Code follows project formatting standards (gofmt)
Commit message follows Conventional Commits format
Links to related issue (Recording rule generates incorrect values? #19580)

…t rates When querying rate_counter() with samples at exact range boundaries, the sample at the range start was incorrectly excluded, causing rates to be calculated as approximately half the expected value. This occurred because the streamRangeVectorIterator.load() method used a '<= start' check that excluded samples at the boundary. For counter metrics, this is problematic because we need the starting counter value to calculate accurate rates. Changes: - Modified streamRangeVectorIterator.load() to include samples at range boundaries for counter metrics (rate_counter operation) - Preserved existing behavior for non-counter aggregations - Added comprehensive test coverage for edge cases Fixes grafana#19580

CLAassistant · 2025-12-11T02:51:22Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

pranav-merugu requested a review from a team as a code owner December 11, 2025 02:50

pull-request-size bot added the size/L label Dec 11, 2025

Merge branch 'main' into fix-rate-counter-boundary-samples

e26a618

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(logql): include boundary samples for rate_counter to fix incorrect rates #20203

fix(logql): include boundary samples for rate_counter to fix incorrect rates #20203

Uh oh!

pranav-merugu commented Dec 11, 2025

Uh oh!

CLAassistant commented Dec 11, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fix(logql): include boundary samples for rate_counter to fix incorrect rates #20203

Are you sure you want to change the base?

fix(logql): include boundary samples for rate_counter to fix incorrect rates #20203

Uh oh!

Conversation

pranav-merugu commented Dec 11, 2025

Description

Problem

Root Cause

Solution

Testing

New Tests Added (5 comprehensive test functions)

Test Results

Regression Testing

Impact

Checklist

Uh oh!

CLAassistant commented Dec 11, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants