## Summary
When using `--overprovisioned` or `--idle-poll-time-us=1` together with a non-default io-properties file appropriate for the host, single-shard I/O performance drops to ~2.7K IOPS, compared to ~12K IOPS with the baseline configuration (no `--overprovisioned`), a roughly 4-5x reduction. The issue appears to be caused by the reactor sleeping longer than necessary while waiting for token bucket capacity: the per-grab amount is large, and the reactor waits for the full amount to be replenished.
## Environment
- Seastar version: Current master (commit bbd0001)
- Test workload: Random 4KB reads (or writes, behavior is the same), iodepth=1
- Storage: NVMe SSD on XFS
- Disk capabilities (measured by iotune):
  - Random read: 268,337 IOPS
### io-properties Configuration
```yaml
disks:
  - mountpoint: /mnt/xfs
    read_iops: 268337
    read_bandwidth: 1259085440
    write_iops: 134175
    write_bandwidth: 604742528
```

## Test Results
### Baseline (no `--overprovisioned`)
```
$ build/release/apps/io_tester/io_tester --io-properties-file ~/io-props.yaml \
    --conf ~/io2.yaml --storage /mnt/xfs/io_tester --duration=5 -c1
Job highprio -> sched class highprio
IOPS: 12126.1523
```

Result: ~12,126 IOPS
### With `--overprovisioned`
```
$ build/release/apps/io_tester/io_tester --io-properties-file ~/io-props.yaml \
    --conf ~/io2.yaml --storage /mnt/xfs/io_tester --duration=5 -c1 --overprovisioned
Job highprio -> sched class highprio
IOPS: 2676.50513
```

Result: ~2,677 IOPS (about 4.5x slower)
### With `--idle-poll-time-us=1` (same behavior)
```
$ build/release/apps/io_tester/io_tester --io-properties-file ~/io-props.yaml \
    --conf ~/io2.yaml --storage /mnt/xfs/io_tester --duration=5 -c1 --idle-poll-time-us=1
Job highprio -> sched class highprio
IOPS: 2670.78101
```

Result: ~2,671 IOPS (same degradation as `--overprovisioned`, confirming the issue is caused by reduced polling frequency)
### Without io-properties (control)
```
$ build/release/apps/io_tester/io_tester \
    --conf ~/io2.yaml --storage /mnt/xfs/io_tester --duration=5 -c1 --overprovisioned
Job highprio -> sched class highprio
IOPS: 12289.8516
```

Result: ~12,290 IOPS (similar to baseline; no degradation without io-properties)
## Root Cause Analysis
The issue appears to be caused by two factors:
### Issue 1: Sleep Time Based on Full Deficiency, Not I/O Need
The sleep duration is calculated based on replenishing the full token deficiency (the gap between pending reservation and bucket head), not just the tokens needed for the next I/O.
From `ioinfo -c1 --directory /mnt/xfs`:
```yaml
fair_queue:
  capacities:
    4096:
      read: 117101              # tokens needed for 4KB read
  per_tick_grab_threshold: 12582912
token_bucket:
  rate: 16777216                # tokens per millisecond
```
For a 4KB read:
- Tokens needed: 117,101
- Time to replenish one I/O's worth: 117,101 tokens / 16,777,216 tokens-per-ms ≈ 7 us

But the token bucket reserves in large chunks:
- per_tick_grab_threshold: 12,582,912 tokens
- Time to replenish the full reservation: 12,582,912 tokens / 16,777,216 tokens-per-ms ≈ 750 us
This means the system may sleep considerably longer than needed to dispatch a single I/O.
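A quick back-of-the-envelope check of the two sleep targets, using the numbers from `ioinfo` above (a standalone sketch with illustrative names, not Seastar's API):

```cpp
#include <cstdio>

int main() {
    // Numbers reported by ioinfo above.
    const double rate_per_ms = 16777216;   // token_bucket rate (tokens/ms)
    const double io_cost = 117101;         // tokens for one 4KB read
    const double grab = 12582912;          // per_tick_grab_threshold (tokens)

    // Microseconds of replenishment needed to cover a token amount.
    auto us_to_replenish = [&](double tokens) { return tokens * 1000.0 / rate_per_ms; };

    std::printf("next I/O:        ~%.0f us\n", us_to_replenish(io_cost)); // ~7 us
    std::printf("full deficiency: ~%.0f us\n", us_to_replenish(grab));    // ~750 us
}
```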
### Issue 2: Tokens Only Added During Polling
Token bucket replenishment is not passive: tokens are only added when `maybe_replenish_capacity()` is called, which only happens from `poll_io_queue()`.
- With `--overprovisioned`, `max_poll_time=0us` causes the reactor to sleep immediately when idle
- While sleeping, no polling occurs, so no tokens are added to the bucket
- The reactor sleeps for the full calculated deficiency time
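A minimal sketch of this poll-driven behavior (assumed and simplified, not Seastar's actual implementation): tokens accrue only when the replenish function is invoked, so nothing accrues while the reactor sleeps.

```cpp
#include <chrono>
#include <cstdint>

struct poll_driven_bucket {
    using clock = std::chrono::steady_clock;
    uint64_t tokens = 0;
    uint64_t rate_per_ms = 16777216;       // rate from the io-properties above
    clock::time_point last = clock::now();

    // Called from the poller. A sleeping reactor never calls this,
    // so the bucket stays empty for the entire sleep.
    void maybe_replenish() {
        auto now = clock::now();
        auto elapsed_us = std::chrono::duration_cast<std::chrono::microseconds>(now - last).count();
        tokens += static_cast<uint64_t>(elapsed_us) * rate_per_ms / 1000;
        last = now;
    }

    // Dispatch is possible only if enough tokens have already been added.
    bool try_grab(uint64_t need) {
        if (tokens < need) {
            return false;
        }
        tokens -= need;
        return true;
    }
};
```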
### The Flow
1. I/O completes at time T=0
2. io-queue checks for the next dispatch; not enough tokens are available
3. `next_pending_aio()` calculates the delay based on the full deficiency, which is typically close to `per_tick_grab_threshold` (12.5M tokens, ~750us), since that is the reservation size used in `grab_capacity()`
4. A timer is armed for T+delay and the reactor goes to sleep
5. The reactor sleeps for the full delay (even though only ~7us worth of tokens are needed for the next I/O)
6. The timer fires and the reactor wakes
7. `poll_io_queue()` is called, which calls `maybe_replenish_capacity()`
8. Tokens are now added (for the full elapsed time)
9. The next I/O dispatches
The result is that I/O dispatch frequency is limited by the sleep duration rather than the actual token replenishment time needed.
### Why Baseline Works
With the baseline configuration:
- `max_poll_time=200us` means the reactor actively polls before sleeping
- `poll_io_queue()` is called frequently
- `maybe_replenish_capacity()` runs every few microseconds
- Tokens appear in the bucket almost as soon as time passes
With `--overprovisioned`:
- `max_poll_time=0us` means no active polling
- Tokens only appear when the reactor wakes from sleep
- Sleep duration becomes the limiting factor for throughput
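To make the contrast concrete, here is a hypothetical driver for the `poll_driven_bucket` sketch above (same assumptions, not actual reactor code):

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

// Reuses poll_driven_bucket from the sketch above.
void wait_for_tokens(poll_driven_bucket& tb, uint64_t io_cost,
                     std::chrono::microseconds deficiency_delay, bool active_poll) {
    if (active_poll) {
        // Baseline: keep polling. Tokens become visible almost as soon as
        // wall-clock time passes, so a 4KB read waits only ~7us.
        while (!tb.try_grab(io_cost)) {
            tb.maybe_replenish();
        }
    } else {
        // Overprovisioned: sleep for the full computed deficiency (~750us),
        // then replenish once on wake-up. Dispatch rate is bounded by the sleep.
        std::this_thread::sleep_for(deficiency_delay);
        tb.maybe_replenish();
        tb.try_grab(io_cost);
    }
}
```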
### Evidence from Logging
Baseline replenishment pattern:
```
io_throttler::maybe_replenish: elapsed=0us, extra=10335, REPLENISHING
io_throttler::maybe_replenish: elapsed=0us, extra=7650, REPLENISHING
io_throttler::maybe_replenish: elapsed=1us, extra=26055, REPLENISHING
```
Tokens are replenished every 0-1 microseconds.
Overprovisioned replenishment pattern:
```
io_throttler::maybe_replenish: elapsed=543us, extra=9124541, REPLENISHING
io_throttler::maybe_replenish: elapsed=431us, extra=7236634, REPLENISHING
io_throttler::maybe_replenish: elapsed=589us, extra=9894866, REPLENISHING
```
Tokens are only replenished every 400-600 microseconds (when the reactor wakes).
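As a consistency check, the logged `extra` values match elapsed time multiplied by the configured rate (16,777,216 tokens/ms ≈ 16,777 tokens/us) almost exactly; the small gaps are plausibly due to the elapsed time being truncated to whole microseconds in the log:

```
543us * 16777 tokens/us ≈ 9.11M   (logged extra=9124541)
431us * 16777 tokens/us ≈ 7.23M   (logged extra=7236634)
589us * 16777 tokens/us ≈ 9.88M   (logged extra=9894866)
```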
## Proposed Solutions
Some of the solutions proposed for the multi-shard io-properties issue (see #3201) may also apply here, particularly those related to passive token replenishment or changes to how sleep duration is calculated.
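As an illustration of the second direction, here is a hypothetical delay calculation that arms the wake-up timer for only the head request's cost rather than the full reservation deficiency (the names are invented for this sketch, not Seastar's actual functions):

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical: how long to sleep before the *next* request can dispatch.
std::chrono::microseconds next_io_delay(uint64_t head_request_cost,   // e.g. 117101 tokens
                                        uint64_t available_tokens,
                                        uint64_t rate_per_ms) {       // e.g. 16777216
    if (available_tokens >= head_request_cost) {
        return std::chrono::microseconds(0);   // can dispatch immediately
    }
    // Sleep only until the head request is covered (~7us for a 4KB read),
    // instead of until the full per_tick_grab_threshold deficiency (~750us).
    uint64_t missing = head_request_cost - available_tokens;
    return std::chrono::microseconds(missing * 1000 / rate_per_ms + 1);
}
```

Once awake, the queue could presumably still grab the full `per_tick_grab_threshold`, preserving batching; only the sleep bound would change.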
## Impact
This issue affects workloads that:
- Use `--overprovisioned` mode (common in containerized/virtualized environments)
- Have io-properties configured for disk throttling
- Have low iodepth, where token bucket throttling dominates
The performance reduction may make `--overprovisioned` less suitable for use with io-properties in latency-sensitive scenarios.
## Reproducibility
100% reproducible with the test configurations provided.
## Steps to Reproduce
1. Create the io-properties file (`~/io-props.yaml`):

```yaml
disks:
  - mountpoint: /mnt/xfs
    read_iops: 268337
    read_bandwidth: 1259085440
    write_iops: 134175
    write_bandwidth: 604742528
```

2. Create the test config (`~/io2.yaml`):

```yaml
- name: highprio
  shards: [0]
  type: randread
  shard_info:
    parallelism: 1
    reqsize: 4kB
    shares: 1000
    think_time: 0
```

3. Build Seastar:

```sh
./configure.py --mode=release
ninja -C build/release apps/io_tester/io_tester
```

4. Run the baseline test:

```sh
build/release/apps/io_tester/io_tester \
  --io-properties-file ~/io-props.yaml \
  --conf ~/io2.yaml \
  --storage /mnt/xfs/io_tester \
  --duration=5 -c1
```

5. Run the overprovisioned test:

```sh
build/release/apps/io_tester/io_tester \
  --io-properties-file ~/io-props.yaml \
  --conf ~/io2.yaml \
  --storage /mnt/xfs/io_tester \
  --duration=5 -c1 --overprovisioned
```

6. Compare the results: the overprovisioned run should show noticeably lower IOPS.
## Related Issues
This issue is related to, but distinct from, the multi-shard io-properties issue (#3201). Both stem from the token bucket's grab granularity, but they have different root causes:
- Multi-shard issue: token loss when `ready_tokens` are discarded on an empty queue
- Overprovisioned issue: tokens are not replenished without active polling