Contributor

@calghar calghar commented Oct 31, 2025

Summary

This PR fixes the flaky Test_rateLimitExport test by replacing the fixed sleep duration with a polling-based synchronization mechanism, addressing the timing race condition reported in #2789.

Related Issue

Fixes #2789

Root Cause

The test used a fixed 200ms sleep, which let several rate limiter ticker intervals (50ms each, so up to four) elapse during the wait. This created a timing window where:

  • Multiple tickers could emit rate-limit-info messages if events were still processing
  • A race condition at ticker boundaries could cause off-by-one errors in event counts
  • No synchronization existed between "events sent" and "events fully processed"
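The boundary effect in the list above can be illustrated with a deliberately simplified model. This sketch assumes a fixed-window limiter (at most `limit` events per window, one rate-limit-info message per window that dropped anything); that model, the `simulate` function, and the timestamps are hypothetical and are not taken from the actual rate limiter code. It only shows how the same burst can produce different counts depending on where it falls relative to a 50ms boundary.

```go
package main

import "fmt"

// simulate is a hypothetical fixed-window model, not the real rate limiter.
// arrivals holds event timestamps in milliseconds. Each window lets at most
// limit events through; a window that dropped at least one event counts as
// one rate-limit-info message.
func simulate(arrivals []int, windowMs, limit int) (passed, infos int) {
	perWindow := map[int]int{}
	for _, t := range arrivals {
		perWindow[t/windowMs]++
	}
	for _, n := range perWindow {
		if n > limit {
			passed += limit
			infos++
		} else {
			passed += n
		}
	}
	return passed, infos
}

func main() {
	// The same burst of four events, limit of three per 50ms window.
	inside := []int{10, 11, 12, 13}   // whole burst lands in one window
	straddle := []int{48, 49, 50, 51} // burst crosses the 50ms boundary

	p1, i1 := simulate(inside, 50, 3)
	p2, i2 := simulate(straddle, 50, 3)
	fmt.Println("inside:  ", p1, "events,", i1, "rate-limit-info")
	fmt.Println("straddle:", p2, "events,", i2, "rate-limit-info")
}
```

In this model the burst inside one window yields 3 events and 1 rate-limit-info message, while the straddling burst yields 4 events and none, which is the kind of off-by-one a fixed sleep cannot guard against.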

Proposed Changes

  • Added countEvents() helper function: Non-blocking function to count events and rate-limit-info messages without assertions
  • Replaced fixed sleep with polling: Poll every 10ms until expected number of events and rate-limit-info messages are received
  • Added timeout with clear diagnostics: 500ms timeout (2.5× original sleep) with descriptive error message showing actual vs. expected counts

Testing Performed

  • Test passes 20 consecutive runs without failures (verified via Docker with golang:1.25)
  • No performance degradation: the test typically completes faster than the original, because polling stops as soon as the expected output arrives
  • Deterministic behavior regardless of system timing or load

Backward Compatibility

No breaking changes. The fix only modifies the test implementation, not the rate limiter functionality itself.

Changelog


@calghar calghar requested a review from a team as a code owner October 31, 2025 13:20
@calghar calghar requested a review from tpapagian October 31, 2025 13:20
Replace fixed sleep with polling-based synchronization to eliminate
timing-dependent race condition that caused intermittent test failures.

The test was using a fixed 200ms sleep which created a window where
multiple ticker intervals could fire, causing off-by-one errors in
event counts. The new approach polls every 10ms until expected events
are received, with a 500ms timeout for robustness.

Fixes cilium#2789

Signed-off-by: Farooq Shaikh <[email protected]>
@calghar calghar force-pushed the fix/flaky-rate-limit-test-2789 branch from 299f80b to 57d5e47 Compare October 31, 2025 13:23
@calghar calghar changed the title exporter: fix flaky Test_rateLimitExport by replacing fixed sleep wit… exporter: fix flaky Test_rateLimitExport with polling Oct 31, 2025
Contributor Author

calghar commented Oct 31, 2025

This PR fixes a flaky test with no user-facing changes. Could a maintainer please add the release-note/misc label? Thanks!

@sayboras sayboras added the release-note/ci (This PR makes changes to the CI.) and release-note/misc (This PR makes changes that have no direct user impact.) labels Nov 1, 2025
Contributor

@FedeDP FedeDP left a comment

LGTM, thanks!

@mtardy mtardy removed the release-note/ci (This PR makes changes to the CI.) label Nov 21, 2025
Member

@mtardy mtardy left a comment

LGTM as well, only a few Go nits. I'll merge this later to give you some time to update the patch, or not, depending on what you prefer. Thanks!

Comment on lines +125 to +126
func countEvents(eventsJSON []string) (int, int, uint64) {
gotEvents, gotRateLimitInfo, gotDropped := 0, 0, uint64(0)
Member

I'd use the named return parameters feature if you're going to do it anyway

Suggested change
-func countEvents(eventsJSON []string) (int, int, uint64) {
-gotEvents, gotRateLimitInfo, gotDropped := 0, 0, uint64(0)
+func countEvents(eventsJSON []string) (gotEvents int, gotRateLimitInfo int, gotDropped uint64) {
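For context on the suggestion above: named results in Go are declared in the signature and start at their zero values, so the separate `:=` declaration becomes unnecessary. The sketch below keeps the suggested signature, but the body and the JSON shape (`rate_limit_info` with a `dropped` field) are assumptions for illustration, not the exporter's actual output format or the PR's implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// line is a hypothetical shape for one exported JSON line.
type line struct {
	RateLimitInfo *struct {
		Dropped uint64 `json:"dropped"`
	} `json:"rate_limit_info"`
}

// countEvents uses named results: gotEvents, gotRateLimitInfo and gotDropped
// are already declared and zero-initialized when the body starts.
func countEvents(eventsJSON []string) (gotEvents int, gotRateLimitInfo int, gotDropped uint64) {
	for _, raw := range eventsJSON {
		var l line
		if err := json.Unmarshal([]byte(raw), &l); err != nil {
			continue // this sketch simply skips malformed lines
		}
		if l.RateLimitInfo != nil {
			gotRateLimitInfo++
			gotDropped += l.RateLimitInfo.Dropped
		} else {
			gotEvents++
		}
	}
	return gotEvents, gotRateLimitInfo, gotDropped
}

func main() {
	items := []string{
		`{"process_exec":{}}`,
		`{"rate_limit_info":{"dropped":3}}`,
	}
	fmt.Println(countEvents(items))
}
```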

Comment on lines +252 to +257
pollLoop:
for {
gotEvents, gotRateLimitInfo, _ := countEvents(results.items)
if gotEvents >= tt.wantEvents && gotRateLimitInfo >= tt.wantRateLimitInfo {
// We have all the expected output, proceed to cleanup and assertions.
break pollLoop
Member

I don't think you need to name it in your case?

Suggested change
-pollLoop:
-for {
-gotEvents, gotRateLimitInfo, _ := countEvents(results.items)
-if gotEvents >= tt.wantEvents && gotRateLimitInfo >= tt.wantRateLimitInfo {
-// We have all the expected output, proceed to cleanup and assertions.
-break pollLoop
+for {
+gotEvents, gotRateLimitInfo, _ := countEvents(results.items)
+if gotEvents >= tt.wantEvents && gotRateLimitInfo >= tt.wantRateLimitInfo {
+// We have all the expected output, proceed to cleanup and assertions.
+break
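A short illustration of the rule behind this nit: in Go, an unlabeled `break` exits the innermost enclosing `for`, `switch`, or `select`. A `break` nested only inside an `if` therefore exits the loop directly, while a `break` inside a `select` needs a label to leave the surrounding loop. The functions `plainBreak` and `drain` below are hypothetical examples, not code from the PR.

```go
package main

import "fmt"

// plainBreak: the break sits inside an if, so the innermost breakable
// statement is the for loop and no label is needed.
func plainBreak(limit int) int {
	n := 0
	for {
		if n >= limit {
			break
		}
		n++
	}
	return n
}

// drain: here an unlabeled break would only leave the select statement,
// so a label is required to exit the surrounding for loop.
func drain(values <-chan int, done <-chan struct{}) int {
	sum := 0
loop:
	for {
		select {
		case v := <-values:
			sum += v
		case <-done:
			break loop
		}
	}
	return sum
}

func main() {
	fmt.Println(plainBreak(5))

	values := make(chan int) // unbuffered: each send completes only once received
	done := make(chan struct{})
	go func() {
		for i := 1; i <= 3; i++ {
			values <- i
		}
		close(done) // closed only after all values were received
	}()
	fmt.Println(drain(values, done))
}
```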

Labels

release-note/misc This PR makes changes that have no direct user impact.

Development

Successfully merging this pull request may close these issues.

tests: pkg.sensors.tracing.TestKprobeRateLimit is flaky

4 participants