Description
`FlexCounter.bulkChunksize` in `unittest/syncd/TestFlexCounter.cpp` is a flaky test that intermittently fails in CI due to a timing race condition.
Failure Signature
```
TestFlexCounter.cpp:1390: Failure
Expected equality of these values:
  object_count
    Which is: 6
  unifiedBulkChunkSize
    Which is: 3
```
The test expects bulk counter polling to split 6 objects into chunks of 3, but all 6 arrive in a single chunk.
Reproduction Evidence
The failure is not caused by code changes; it occurs on completely unrelated PRs:
| PR | Content | Build | Result |
|---|---|---|---|
| #1763 | vslib oper speed fix | 1035621 | FlexCounter.bulkChunksize FAILED |
| #1764 | docs only (copilot-instructions.md) | 1035471 | FlexCounter.bulkChunksize FAILED |
| #1757 | counter stats fix | 1030510 | All tests PASSED |
Both failures show identical symptoms: `object_count=6` vs expected `3`, test duration ~9467ms.
Root Cause Analysis
The test uses `usleep(1000*1050)` (1.05s) as a hardcoded wait for the FlexCounter polling thread, which has a 1-second poll interval. The flaky scenario is the third sub-test ("Remove per counter bulk chunk size after initializing it"):
```cpp
// Line 1480: Remove per counter bulk chunk size after initializing it
unifiedBulkChunkSize = 3;
initialCheckCount = 6;
testAddRemoveCounter(6, ..., "3",
                     "SAI_PORT_STAT_IF_OUT_QLEN:0;SAI_PORT_STAT_IF_IN_FEC:2",
                     true,  // bulkChunkSizeAfterPort
                     "",    // pluginName
                     true); // immediatelyRemoveBulkChunkSizePerCounter
```
The race condition
Inside testAddRemoveCounter, the sequence is:
1. Add 6 ports one-at-a-time via `bulkAddCounter()` — each calls `notifyPoll()`
2. `addCounterPlugin()` with `bulk_chunk_size=3` + per-counter prefix config
3. `addCounterPlugin()` with empty per-counter prefix (removes it) ← immediate
4. `usleep(1000*1050)` ← 1.05s fixed wait
5. Read counters from DB and verify chunk sizes
What goes wrong under CI load:
- Port additions in step 1 each trigger `notifyPoll()`, waking the polling thread
- The polling thread and test thread contend for the FlexCounter mutex
- `initialCheckCount=6` gates init vs. data collection calls, but init calls happen across poll cycles triggered by port additions
- Under CPU pressure, the timing between steps 2-3 (config application) and the polling thread's `collectCounters()` shifts
- The chunk size update from `setBulkChunkSize(3)` propagates through `BulkContextType` objects, but if the polling thread runs its data collection before the per-prefix removal in step 3 is fully processed, or if the bulk contexts are rebuilt with `default_bulk_chunk_size=0` during the removal, the chunking falls back to the full `size` (all 6 objects)
Key code paths
- `FlexCounter::addCounterPlugin()` (line 2668): takes the mutex, sets the chunk size, calls `notifyPoll()`
- `FlexCounter::flexCounterThreadRunFunction()` (line 3033): takes the mutex, calls `collectCounters()`
- `bulkCollectData()` (line 1290): uses `ctx.default_bulk_chunk_size` for chunking; falls back to full size when the chunk size is 0
- Mock `bulkGetStats` (line 1330): `initialCheckCount--` gates init vs. data calls; the assertion at line 1390 checks the chunk size
Why the 1.05s sleep is insufficient
The test assumes one complete data collection cycle happens within 50ms after the 1s poll interval. But:
- Thread scheduling under CI load can add hundreds of ms of jitter
- Multiple poll cycles from port-addition notifications burn through `initialCheckCount` at unpredictable rates
- The `correction` calculation in the polling thread (lines 3058-3059) can shift the actual poll timing
Suggested Fix
Replace `usleep(1000*1050)` with a deterministic synchronization mechanism:
Option A: Poll-wait for expected DB state
```cpp
// Instead of: usleep(1000*1050);
auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(5);
while (std::chrono::steady_clock::now() < deadline) {
    countersTable.getKeys(keys);
    removeTimeStamp(keys, countersTable);
    if (keys.size() == object_ids.size()) break;
    usleep(100 * 1000); // 100ms poll
}
```
Option B: Add a callback/condition variable to FlexCounter
Signal the test when collectCounters() completes, so the test waits for exactly one data collection cycle instead of guessing timing.
Option C: Increase sleep margin (band-aid)
```cpp
usleep(1000*3000); // 3s instead of 1.05s; reduces flakiness but doesn't eliminate it
```
Option A is the least invasive. Option B is the most robust.
Affected Code
- `unittest/syncd/TestFlexCounter.cpp`: 8 occurrences of `usleep(1000*1050)` (lines 175, 1696, 1715, 1726, 1738, 1749, 1759, 1820)
- All sub-tests in `FlexCounter.bulkChunksize` (line 1192) and `testAddRemoveCounter` (line 93) share this pattern