
[test] FlexCounter.bulkChunksize is flaky due to timing-dependent usleep #1765

@rustiqly

Description


FlexCounter.bulkChunksize in unittest/syncd/TestFlexCounter.cpp is a flaky test that intermittently fails in CI due to a timing race condition.

Failure Signature

TestFlexCounter.cpp:1390: Failure
Expected equality of these values:
  object_count
    Which is: 6
  unifiedBulkChunkSize
    Which is: 3

The test expects bulk counter polling to split 6 objects into chunks of 3, but all 6 arrive in a single chunk.

Reproduction Evidence

The failure is not caused by code changes — it occurs on completely unrelated PRs:

PR      Content                               Build    Result
#1763   vslib oper speed fix                  1035621  FlexCounter.bulkChunksize FAILED
#1764   docs only (copilot-instructions.md)   1035471  FlexCounter.bulkChunksize FAILED
#1757   counter stats fix                     1030510  All tests PASSED

Both failures show identical symptoms: object_count=6 vs expected 3, test duration ~9467ms.

Root Cause Analysis

The test uses usleep(1000*1050) (1.05s) as a hardcoded wait for the FlexCounter polling thread, which has a 1-second poll interval. The flaky scenario is the third sub-test ("Remove per counter bulk chunk size after initializing it"):

// Line 1480: Remove per counter bulk chunk size after initializing it
unifiedBulkChunkSize = 3;
initialCheckCount = 6;
testAddRemoveCounter(6, ..., "3",
    "SAI_PORT_STAT_IF_OUT_QLEN:0;SAI_PORT_STAT_IF_IN_FEC:2",
    true,   // bulkChunkSizeAfterPort
    "",     // pluginName
    true);  // immediatelyRemoveBulkChunkSizePerCounter

The race condition

Inside testAddRemoveCounter, the sequence is:

1. Add 6 ports one-at-a-time via bulkAddCounter() — each calls notifyPoll()
2. addCounterPlugin() with bulk_chunk_size=3 + per-counter prefix config
3. addCounterPlugin() with empty per-counter prefix (removes it)  ← immediate
4. usleep(1000*1050)  ← 1.05s fixed wait
5. Read counters from DB and verify chunk sizes

What goes wrong under CI load:

  • Port additions in step 1 each trigger notifyPoll(), waking the polling thread
  • The polling thread and test thread contend for the FlexCounter mutex
  • initialCheckCount=6 gates init vs. data collection calls, but init calls happen across poll cycles triggered by port additions
  • Under CPU pressure, the timing between steps 2-3 (config application) and the polling thread's collectCounters() shifts
  • The chunk size update from setBulkChunkSize(3) propagates through BulkContextType objects. If the polling thread runs its data collection before the per-prefix removal in step 3 is fully processed, or if the bulk contexts are rebuilt with default_bulk_chunk_size=0 during the removal, the chunking falls back to the full object count (all 6 objects in a single call)

Key code paths

  • FlexCounter::addCounterPlugin() (line 2668) — takes mutex, sets chunk size, calls notifyPoll()
  • FlexCounter::flexCounterThreadRunFunction() (line 3033) — takes mutex, calls collectCounters()
  • bulkCollectData() (line 1290) — uses ctx.default_bulk_chunk_size for chunking; falls back to full size when chunk size is 0
  • Mock bulkGetStats (line 1330) — initialCheckCount-- gates init vs data, assertion at line 1390 checks chunk size

Why the 1.05s sleep is insufficient

The test assumes one complete data collection cycle happens within 50ms after the 1s poll interval. But:

  • Thread scheduling under CI load can add hundreds of ms of jitter
  • Multiple poll cycles from port-addition notifications burn through initialCheckCount at unpredictable rates
  • The correction calculation in the polling thread (line 3058-3059) can shift the actual poll timing

Suggested Fix

Replace usleep(1000*1050) with a deterministic synchronization mechanism:

Option A: Poll-wait for expected DB state

// Instead of: usleep(1000*1050);
auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(5);
while (std::chrono::steady_clock::now() < deadline) {
    countersTable.getKeys(keys);
    removeTimeStamp(keys, countersTable);
    if (keys.size() == object_ids.size()) break;
    usleep(100 * 1000); // 100ms between polls
}
ASSERT_EQ(keys.size(), object_ids.size()) << "timed out waiting for counter collection";

Option B: Add a callback/condition variable to FlexCounter
Signal the test when collectCounters() completes, so the test waits for exactly one data collection cycle instead of guessing timing.

Option C: Increase sleep margin (band-aid)

usleep(1000*3000); // 3s instead of 1.05s — reduces flakiness but doesn't eliminate it

Option A is the least invasive. Option B is the most robust.

Affected Code

  • unittest/syncd/TestFlexCounter.cpp — 8 occurrences of usleep(1000*1050) (lines 175, 1696, 1715, 1726, 1738, 1749, 1759, 1820)
  • All sub-tests in FlexCounter.bulkChunksize (line 1192) and testAddRemoveCounter (line 93) share this pattern
