Low performance (high effective IO latency) with io_scheduler and no/short polling or --overprovisioned #3202

@travisdowns

Description


Summary

When using --overprovisioned or --idle-poll-time-us=1 together with a non-default io-properties file appropriate for the host, single-shard I/O performance drops to ~2.7K IOPS, compared to ~12K IOPS with the baseline configuration (no --overprovisioned): a roughly 4-5x reduction. The issue appears to be caused by the reactor sleeping longer than "necessary" while waiting for token bucket capacity: the per-grab amount is large, and the sleep covers the full grab rather than just the tokens needed for the next I/O.

Environment

  • Seastar version: Current master (commit bbd0001)
  • Test workload: Random 4KB reads (or writes, behavior is the same), iodepth=1
  • Storage: NVMe SSD on XFS
  • Disk capabilities (measured by iotune):
    • Random read: 268,337 IOPS

io-properties Configuration

disks:
  - mountpoint: /mnt/xfs
    read_iops: 268337
    read_bandwidth: 1259085440
    write_iops: 134175
    write_bandwidth: 604742528

Test Results

Baseline (no --overprovisioned)

$ build/release/apps/io_tester/io_tester --io-properties-file ~/io-props.yaml \
    --conf ~/io2.yaml --storage /mnt/xfs/io_tester --duration=5 -c1
Job highprio -> sched class highprio
    IOPS: 12126.1523

Result: ~12,126 IOPS

With --overprovisioned

$ build/release/apps/io_tester/io_tester --io-properties-file ~/io-props.yaml \
    --conf ~/io2.yaml --storage /mnt/xfs/io_tester --duration=5 -c1 --overprovisioned
Job highprio -> sched class highprio
    IOPS: 2676.50513

Result: ~2,677 IOPS (about 4.5x slower)

With --idle-poll-time-us=1 (same behavior)

$ build/release/apps/io_tester/io_tester --io-properties-file ~/io-props.yaml \
    --conf ~/io2.yaml --storage /mnt/xfs/io_tester --duration=5 -c1 --idle-poll-time-us=1
Job highprio -> sched class highprio
    IOPS: 2670.78101

Result: ~2,671 IOPS (same degradation as --overprovisioned, confirming the issue is caused by reduced polling frequency)

Without io-properties (control)

$ build/release/apps/io_tester/io_tester \
    --conf ~/io2.yaml --storage /mnt/xfs/io_tester --duration=5 -c1 --overprovisioned
Job highprio -> sched class highprio
    IOPS: 12289.8516

Result: ~12,290 IOPS (similar to baseline, no degradation without io-properties)

Root Cause Analysis

The issue appears to be caused by two factors:

Issue 1: Sleep Time Based on Full Deficiency, Not I/O Need

The sleep duration is calculated based on replenishing the full token deficiency (the gap between pending reservation and bucket head), not just the tokens needed for the next I/O.

From ioinfo -c1 --directory /mnt/xfs:

fair_queue:
  capacities:
    4096:
      read: 117101    # tokens needed for 4KB read
  per_tick_grab_threshold: 12582912
  token_bucket:
    rate: 16777216    # tokens per millisecond

For a 4KB read:

  • Tokens needed: 117,101
  • Time to replenish tokens for one I/O: ~7 us

But the token bucket reserves in large chunks:

  • per_tick_grab_threshold: 12,582,912 tokens
  • Time to replenish full reservation: ~750 us

This means the system may sleep considerably longer than needed to dispatch a single I/O.
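As a sanity check, the two durations above follow directly from the token_bucket rate reported by ioinfo (illustrative arithmetic only, not Seastar code):

#include <cstdio>

int main() {
    // Numbers from `ioinfo -c1 --directory /mnt/xfs` above.
    const double tokens_per_ms = 16777216.0;   // token_bucket rate
    const double read_4k_tokens = 117101.0;    // capacity of one 4KB read
    const double grab_threshold = 12582912.0;  // per_tick_grab_threshold

    // Time to replenish enough tokens for one 4KB read: ~7 us.
    std::printf("one 4KB read: %.1f us\n", read_4k_tokens / tokens_per_ms * 1000.0);
    // Time to replenish a full per-tick grab (what the sleep is based on): ~750 us.
    std::printf("full grab:    %.1f us\n", grab_threshold / tokens_per_ms * 1000.0);
}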

Issue 2: Tokens Only Added During Polling

Token bucket replenishment is not passive — tokens are only added when maybe_replenish_capacity() is called, which only happens from poll_io_queue().

  1. With --overprovisioned, max_poll_time=0us causes the reactor to sleep immediately when idle
  2. While sleeping, no polling occurs, so no tokens are added to the bucket
  3. The reactor sleeps for the full calculated deficiency time
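The pattern can be sketched as follows (hypothetical and heavily simplified; these are not the actual Seastar classes or signatures):

#include <algorithm>
#include <chrono>

// Simplified "replenish only when polled" token bucket, for illustration only.
struct lazy_token_bucket {
    using clock = std::chrono::steady_clock;
    double tokens_per_us;    // refill rate
    double limit;            // bucket capacity
    double tokens = 0;       // current level
    clock::time_point last = clock::now();

    // Nothing refills the bucket automatically: this must be called explicitly,
    // which in Seastar happens from the io-queue poller.
    void maybe_replenish() {
        auto now = clock::now();
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(now - last).count();
        tokens = std::min(limit, tokens + us * tokens_per_us);
        last = now;
    }

    bool try_grab(double amount) {
        if (tokens < amount) {
            return false;    // caller decides how long to sleep
        }
        tokens -= amount;
        return true;
    }
};

The important property is that a failed try_grab() does not schedule any future refill: until something calls maybe_replenish() again, the level stays put no matter how much wall-clock time passes, which is exactly the situation while the reactor sleeps.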

The Flow

  1. I/O completes at time T=0
  2. io-queue checks for next dispatch — not enough tokens available
  3. next_pending_aio() calculates delay based on the full deficiency, which is typically close to per_tick_grab_threshold (12.5M tokens, ~750us) since that's the reservation size used in grab_capacity()
  4. Timer armed for T+delay, reactor goes to sleep
  5. Reactor sleeps for full delay (even though only ~7us worth of tokens are needed for the next I/O)
  6. Timer fires, reactor wakes
  7. poll_io_queue() called, which calls maybe_replenish_capacity()
  8. Tokens are now added (for the full elapsed time)
  9. Next I/O dispatches

The result is that I/O dispatch frequency is limited by the sleep duration rather than the actual token replenishment time needed.
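A rough sketch of the delay computed in step 3 (assumed and simplified; the real logic lives in next_pending_aio()/grab_capacity() and is not reproduced here):

#include <chrono>

// Simplified version of the step-3 delay: wait for the *full* deficiency.
// Because grab_capacity() reserves per_tick_grab_threshold-sized chunks,
// the deficiency is typically ~12.5M tokens rather than the ~117K tokens
// actually needed for the next 4KB read.
std::chrono::microseconds delay_until_replenished(double deficiency_tokens,
                                                  double tokens_per_us /* ~16777 */) {
    return std::chrono::microseconds(
        static_cast<long>(deficiency_tokens / tokens_per_us + 0.5));
}
// deficiency_tokens ~= 12,582,912  ->  ~750 us of sleep per dispatched I/O.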

Why Baseline Works

With baseline configuration:

  • max_poll_time=200us means the reactor actively polls before sleeping
  • poll_io_queue() is called frequently
  • maybe_replenish_capacity() runs every few microseconds
  • Tokens appear in the bucket almost as soon as time passes

With --overprovisioned:

  • max_poll_time=0us means no active polling
  • Tokens only appear when the reactor wakes from sleep
  • Sleep duration becomes the limiting factor for throughput
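In rough pseudocode (assumed structure and names, not the actual Seastar reactor loop), the only difference between the two cases above is the value of max_poll_time:

#include <chrono>

bool poll_io_queue();               // stand-in: one io-queue poll, calls maybe_replenish_capacity()
void sleep_until_timer_or_event();  // stand-in: block until the wakeup timer fires

void idle_handling(std::chrono::microseconds max_poll_time) {
    auto idle_since = std::chrono::steady_clock::now();
    for (;;) {
        if (poll_io_queue()) {                     // work done, tokens replenished
            idle_since = std::chrono::steady_clock::now();
            continue;
        }
        // Baseline: max_poll_time = 200us, so we keep polling (and replenishing)
        // for a while before giving up.
        // --overprovisioned: max_poll_time = 0us, so we sleep immediately and no
        // tokens are added until the timer from "The Flow" above fires.
        if (std::chrono::steady_clock::now() - idle_since >= max_poll_time) {
            sleep_until_timer_or_event();
            idle_since = std::chrono::steady_clock::now();
        }
    }
}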

Evidence from Logging

Baseline replenishment pattern:

io_throttler::maybe_replenish: elapsed=0us, extra=10335, REPLENISHING
io_throttler::maybe_replenish: elapsed=0us, extra=7650, REPLENISHING
io_throttler::maybe_replenish: elapsed=1us, extra=26055, REPLENISHING

Tokens replenished every 0-1 microseconds.

Overprovisioned replenishment pattern:

io_throttler::maybe_replenish: elapsed=543us, extra=9124541, REPLENISHING
io_throttler::maybe_replenish: elapsed=431us, extra=7236634, REPLENISHING
io_throttler::maybe_replenish: elapsed=589us, extra=9894866, REPLENISHING

Tokens are only replenished every 400-600 microseconds (i.e., when the reactor wakes).

Proposed Solutions

Some of the solutions proposed for the multi-shard io-properties issue (see #3201) may also apply here, particularly those related to passive token replenishment or changes to how sleep duration is calculated.
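For illustration, the "sleep only for what the head request needs" direction could look roughly like this; it is a sketch of the idea only, not a patch against the actual io_queue code:

#include <algorithm>
#include <chrono>

// Sketch: sleep just long enough to cover the request at the head of the queue,
// given what is already in the bucket, instead of the full per-tick grab.
std::chrono::microseconds sleep_for_head_request(double head_request_tokens,  // e.g. 117101
                                                 double tokens_available,
                                                 double tokens_per_us) {       // e.g. ~16777
    double missing = std::max(0.0, head_request_tokens - tokens_available);
    return std::chrono::microseconds(
        static_cast<long>(missing / tokens_per_us + 0.5));  // ~7 us for a 4KB read
}

Combined with passive (or wakeup-time) replenishment, this would bound the idle sleep by the cost of the next I/O rather than by the full reservation.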

Impact

This issue affects workloads that:

  • Use --overprovisioned mode (common in containerized/virtualized environments)
  • Have io-properties configured for disk throttling
  • Have low iodepth where token bucket throttling dominates

The performance reduction may make --overprovisioned less suitable for use with io-properties in latency-sensitive scenarios.

Reproducibility

100% reproducible with the test configurations provided.

Steps to Reproduce

  1. Create io-properties file (~/io-props.yaml):
disks:
  - mountpoint: /mnt/xfs
    read_iops: 268337
    read_bandwidth: 1259085440
    write_iops: 134175
    write_bandwidth: 604742528
  2. Create test config (~/io2.yaml):
- name: highprio
  shards: [0]
  type: randread
  shard_info:
    parallelism: 1
    reqsize: 4kB
    shares: 1000
    think_time: 0
  3. Build Seastar:
./configure.py --mode=release
ninja -C build/release apps/io_tester/io_tester
  4. Run baseline test:
build/release/apps/io_tester/io_tester \
    --io-properties-file ~/io-props.yaml \
    --conf ~/io2.yaml \
    --storage /mnt/xfs/io_tester \
    --duration=5 -c1
  5. Run overprovisioned test:
build/release/apps/io_tester/io_tester \
    --io-properties-file ~/io-props.yaml \
    --conf ~/io2.yaml \
    --storage /mnt/xfs/io_tester \
    --duration=5 -c1 --overprovisioned
  6. Compare results: overprovisioned should show noticeably lower IOPS.

Related Issues

This issue is related to, but distinct from, the multi-shard io-properties issue (#3201). Both stem from the token bucket design, specifically the grab granularity, but they have different root causes:

  • Multi-shard issue: Token loss when ready_tokens are discarded on empty queue
  • Overprovisioned issue: Tokens not replenished without active polling
