
[test] FlexCounter.bulkChunksize is flaky due to timing-dependent usleep #1765

@rustiqly

Description


FlexCounter.bulkChunksize in unittest/syncd/TestFlexCounter.cpp is a flaky test that intermittently fails in CI due to a timing race condition.

Failure Signature

TestFlexCounter.cpp:1390: Failure
Expected equality of these values:
  object_count
    Which is: 6
  unifiedBulkChunkSize
    Which is: 3

The test expects bulk counter polling to split 6 objects into chunks of 3, but all 6 arrive in a single chunk.

Reproduction Evidence

The failure is not caused by code changes — it occurs on completely unrelated PRs:

PR      Content                               Build    Result
#1763   vslib oper speed fix                  1035621  FlexCounter.bulkChunksize FAILED
#1764   docs only (copilot-instructions.md)   1035471  FlexCounter.bulkChunksize FAILED
#1757   counter stats fix                     1030510  All tests PASSED

Both failures show identical symptoms: object_count=6 vs expected 3, test duration ~9467ms.

Root Cause Analysis

The test uses usleep(1000*1050) (1.05s) as a hardcoded wait for the FlexCounter polling thread, which has a 1-second poll interval. The flaky scenario is the third sub-test ("Remove per counter bulk chunk size after initializing it"):

// Line 1480: Remove per counter bulk chunk size after initializing it
unifiedBulkChunkSize = 3;
initialCheckCount = 6;
testAddRemoveCounter(6, ..., "3",
    "SAI_PORT_STAT_IF_OUT_QLEN:0;SAI_PORT_STAT_IF_IN_FEC:2",
    true,   // bulkChunkSizeAfterPort
    "",     // pluginName
    true);  // immediatelyRemoveBulkChunkSizePerCounter

The race condition

Inside testAddRemoveCounter, the sequence is:

1. Add 6 ports one-at-a-time via bulkAddCounter() — each calls notifyPoll()
2. addCounterPlugin() with bulk_chunk_size=3 + per-counter prefix config
3. addCounterPlugin() with empty per-counter prefix (removes it)  ← immediate
4. usleep(1000*1050)  ← 1.05s fixed wait
5. Read counters from DB and verify chunk sizes

What goes wrong under CI load:

  • Port additions in step 1 each trigger notifyPoll(), waking the polling thread
  • The polling thread and test thread contend for the FlexCounter mutex
  • initialCheckCount=6 gates init vs. data collection calls, but init calls happen across poll cycles triggered by port additions
  • Under CPU pressure, the timing between steps 2-3 (config application) and the polling thread's collectCounters() shifts
  • The chunk size update from setBulkChunkSize(3) propagates through BulkContextType objects. If the polling thread runs its data collection before the per-prefix removal in step 3 is fully processed, or if the bulk contexts are rebuilt with default_bulk_chunk_size=0 during the removal, the chunking falls back to the full object count (all 6 objects in a single call)

Key code paths

  • FlexCounter::addCounterPlugin() (line 2668) — takes mutex, sets chunk size, calls notifyPoll()
  • FlexCounter::flexCounterThreadRunFunction() (line 3033) — takes mutex, calls collectCounters()
  • bulkCollectData() (line 1290) — uses ctx.default_bulk_chunk_size for chunking; falls back to full size when chunk size is 0
  • Mock bulkGetStats (line 1330) — initialCheckCount-- gates init vs data, assertion at line 1390 checks chunk size

Why the 1.05s sleep is insufficient

The test assumes one complete data collection cycle happens within 50ms after the 1s poll interval. But:

  • Thread scheduling under CI load can add hundreds of ms of jitter
  • Multiple poll cycles from port-addition notifications burn through initialCheckCount at unpredictable rates
  • The correction calculation in the polling thread (line 3058-3059) can shift the actual poll timing

Suggested Fix

Replace usleep(1000*1050) with a deterministic synchronization mechanism:

Option A: Poll-wait for expected DB state

// Instead of: usleep(1000*1050);
auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(5);
while (std::chrono::steady_clock::now() < deadline) {
    countersTable.getKeys(keys);
    removeTimeStamp(keys, countersTable);
    if (keys.size() == object_ids.size()) break;
    usleep(100 * 1000); // 100ms between polls
}
ASSERT_EQ(keys.size(), object_ids.size()) << "timed out waiting for counter collection";

Option B: Add a callback/condition variable to FlexCounter
Signal the test when collectCounters() completes, so the test waits for exactly one data collection cycle instead of guessing timing.

Option C: Increase sleep margin (band-aid)

usleep(1000*3000); // 3s instead of 1.05s — reduces flakiness but doesn't eliminate it

Option A is the least invasive. Option B is the most robust.

Affected Code

  • unittest/syncd/TestFlexCounter.cpp — 8 occurrences of usleep(1000*1050) (lines 175, 1696, 1715, 1726, 1738, 1749, 1759, 1820)
  • All sub-tests in FlexCounter.bulkChunksize (line 1192) and testAddRemoveCounter (line 93) share this pattern
