Commit 8f547af

viraatc and claude committed
fix(metrics): raise metrics drain-timeout default to 300s
A 1M-sample run holds ~2M deferred tokenizations at ENDED; the drain fans the whole buffer into one encode_batch per shard, so a 60s budget expires before any chunk returns and the entire backlog is dropped. 300s covers 1M-sample runs with headroom.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 770ae30 commit 8f547af
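
The failure mode described in the commit message can be illustrated with a small simulation. This is only a sketch under stated assumptions: `encode_shard`, `drain`, and `per_item_s` are hypothetical names, the tokenizer cost is simulated with `asyncio.sleep`, and the real aggregator's drain loop is not shown in this commit. It shows why a budget shorter than one shard's batch records nothing at all, rather than a partial result.

```python
import asyncio


async def encode_shard(texts, per_item_s):
    """Hypothetical stand-in for one encode_batch call over a whole shard.

    Token counts only become available once the entire shard has finished.
    """
    await asyncio.sleep(len(texts) * per_item_s)  # simulated tokenizer cost
    return [len(t.split()) for t in texts]


async def drain(buffer, n_shards, budget_s, per_item_s):
    """Return (recorded, dropped) for a drain bounded by budget_s seconds."""
    shards = [buffer[i::n_shards] for i in range(n_shards)]
    tasks = [asyncio.create_task(encode_shard(s, per_item_s)) for s in shards]
    # 0 means "unlimited", matching the field description in schema.py.
    done, pending = await asyncio.wait(tasks, timeout=budget_s or None)
    for task in pending:
        task.cancel()  # budget expired: this shard's whole batch is lost
    await asyncio.gather(*pending, return_exceptions=True)
    recorded = sum(len(task.result()) for task in done)
    return recorded, len(buffer) - recorded


if __name__ == "__main__":
    samples = ["some prompt text"] * 20
    # Budget shorter than one shard's batch: nothing is recorded.
    print(asyncio.run(drain(samples, n_shards=2, budget_s=0.05, per_item_s=0.01)))
    # Budget with headroom: every sample is recorded.
    print(asyncio.run(drain(samples, n_shards=2, budget_s=0.5, per_item_s=0.01)))
```

Because each shard is a single all-or-nothing batch in this model, raising the budget is the only knob that changes the outcome; splitting shards into smaller chunks would be a different fix and is not what this commit does.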

8 files changed

Lines changed: 9 additions & 9 deletions

AGENTS.md

Lines changed: 1 addition & 1 deletion
@@ -115,7 +115,7 @@ The aggregator is a separate process (`python -m inference_endpoint.async_utils.

 - **Series storage**: each `SeriesSampler` keeps three parallel views: O(1) cheap rollups (count/total/min/max/sum_sq, exact), an HDR Histogram (cheap live percentiles), and an in-memory `array.array` of raw values (for exact percentiles in the `COMPLETE` snapshot). Hot path is `registry.record(name, value)` — no allocation, no I/O.
 - **Counter API**: `registry.increment(name, delta=1)` for sample-event counters. `registry.set_counter(name, value)` only for the two duration counters (`total_duration_ns` max-of-elapsed, `tracked_duration_ns` sum-of-blocks).
-- **Lifecycle**: `INITIALIZE` (constructed, awaiting first `STARTED`) → `LIVE` (run in progress, ticking every `--publish-interval` seconds) → `DRAINING` (set on `ENDED`; tick continues; bounded by the `--drain-timeout` budget — schema default 60 s) → terminal: `COMPLETE` (clean end via `publish_final`, exact stats) **or** `INTERRUPTED` (signal-handler-triggered final via SIGTERM/SIGINT; best-effort partial stats). Drain timeout detected by consumers as `state == COMPLETE and n_pending_tasks > 0`; interrupted runs are detected as `state == INTERRUPTED` directly.
+- **Lifecycle**: `INITIALIZE` (constructed, awaiting first `STARTED`) → `LIVE` (run in progress, ticking every `--publish-interval` seconds) → `DRAINING` (set on `ENDED`; tick continues; bounded by the `--drain-timeout` budget — schema default 300 s) → terminal: `COMPLETE` (clean end via `publish_final`, exact stats) **or** `INTERRUPTED` (signal-handler-triggered final via SIGTERM/SIGINT; best-effort partial stats). Drain timeout detected by consumers as `state == COMPLETE and n_pending_tasks > 0`; interrupted runs are detected as `state == INTERRUPTED` directly.
 - **Final delivery is dual-path with separated concerns**: `publish_final` atomically writes `final_snapshot.json` (`tmp + fsync(file) + rename + fsync(parent_dir)`) — this is the **primary** Report source — AND emits the terminal-state snapshot over pub/sub as a TUI shutdown signal. Each path is wrapped in its own try/except so one failure cannot suppress the other. Main process consumer reads `final_snapshot.json` (via `json.loads` to dict, no Struct decode); falls back to the subscriber's `latest` live snapshot only if the file is missing (e.g. SIGKILL / OOM before the signal handler ran). The dict form is the canonical consumer contract (see `snapshot_to_dict`).
 - **Histogram bucket edges are dynamic per snapshot**: log-spaced over the observed `[min, max]`. Bucket count is fixed at construction; consumers MUST re-render from the snapshot's `(lo, hi, count)` triples each frame and MUST NOT track bucket-by-index across snapshots.
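
The consumer-side detection rule in the Lifecycle bullet can be sketched as follows. This is a minimal illustration, not the project's consumer: the field names `state` and `n_pending_tasks` come from the text above, but `load_final`, `classify_final`, the returned labels, and the assumption that `state` is serialized as the upper-case enum name are all hypothetical.

```python
import json
from pathlib import Path


def load_final(path):
    """Read final_snapshot.json into a plain dict (json.loads, no Struct decode)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def classify_final(snapshot: dict) -> str:
    """Classify a terminal snapshot using the rule from the Lifecycle bullet.

    Assumes `state` is stored as the upper-case state name; adjust the
    comparisons if the real serialization differs.
    """
    state = snapshot.get("state")
    pending = snapshot.get("n_pending_tasks", 0)
    if state == "INTERRUPTED":
        return "interrupted"  # best-effort partial stats
    if state == "COMPLETE":
        # COMPLETE with work still pending means the drain budget expired.
        return "drain_timeout" if pending > 0 else "clean"
    return "not_final"  # INITIALIZE, LIVE, or DRAINING


if __name__ == "__main__":
    print(classify_final({"state": "COMPLETE", "n_pending_tasks": 0}))
    print(classify_final({"state": "COMPLETE", "n_pending_tasks": 1200}))
```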

src/inference_endpoint/async_utils/services/metrics_aggregator/aggregator.py

Lines changed: 1 addition & 1 deletion
@@ -124,7 +124,7 @@ def __init__(
 ):
 # drain_timeout_s is injected (not derived) because the right
 # value is workload-dependent: long-context tokenize-heavy runs
-# need more headroom than the schema default 60 s, and the
+# need more headroom than the schema default 300 s, and the
 # aggregator itself can't measure that ahead of time. Keeping it
 # as an arg lets the __main__ CLI flag plumb the user's choice
 # through without coupling this class to argparse.

src/inference_endpoint/async_utils/services/metrics_aggregator/snapshot.py

Lines changed: 1 addition & 1 deletion
@@ -45,7 +45,7 @@ class SessionState(str, Enum):
 LIVE → run in progress; tick task publishing live HDR-derived stats.
 DRAINING → ``SessionEventType.ENDED`` has been received; the aggregator
 is tokenizing the buffered samples (bounded by the
-``--drain-timeout`` budget — schema default 60 s). Tick task
+``--drain-timeout`` budget — schema default 300 s). Tick task
 continues at this stage, still HDR-derived; no new events
 will arrive.
 COMPLETE → terminal clean state. The ``publish_final()`` snapshot

src/inference_endpoint/config/schema.py

Lines changed: 2 additions & 2 deletions
@@ -558,11 +558,11 @@ class DrainConfig(BaseModel):
 ),
 ),
 ] = Field(
-60.0,
+300.0,
 ge=0,
 description=(
 "Wall-clock budget (seconds) to finish tokenizing buffered samples "
-"after ENDED (default: 60.0; 0 = unlimited)."
+"after ENDED (default: 300.0; 0 = unlimited)."
 ),
 )
 metrics_tokenizer_workers: Annotated[
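
One plausible way to read the field's contract (`ge=0`, default 300.0, `0 = unlimited`) is shown below using only the standard library. The helper name `effective_drain_timeout` and the module-level constant are hypothetical; how the aggregator actually converts the value into a deadline is not part of this diff.

```python
from typing import Optional

# Schema default after this commit (was 60.0).
DEFAULT_METRICS_DRAIN_TIMEOUT_S = 300.0


def effective_drain_timeout(
    drain_timeout_s: float = DEFAULT_METRICS_DRAIN_TIMEOUT_S,
) -> Optional[float]:
    """Map the configured budget to a timeout usable with asyncio-style waits.

    Returns None for 0 ("unlimited") and rejects negative values, mirroring
    the ge=0 constraint on the schema field.
    """
    if drain_timeout_s < 0:
        raise ValueError("metrics_drain_timeout_s must be >= 0")
    return None if drain_timeout_s == 0 else float(drain_timeout_s)
```

Returning `None` for the unlimited case keeps the result directly usable as the `timeout` argument of standard-library waits, which treat `None` as "no deadline".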

src/inference_endpoint/config/templates/concurrency_template_full.yaml

Lines changed: 1 addition & 1 deletion
@@ -79,7 +79,7 @@ settings:
 warmup_timeout_s: 240.0 # Warmup drain timeout in seconds (None = wait indefinitely)
 performance_timeout_s: 240.0 # Performance drain timeout in seconds (None = wait indefinitely)
 accuracy_timeout_s: null # Accuracy drain timeout in seconds (None = wait indefinitely)
-metrics_drain_timeout_s: 60.0 # Wall-clock budget (seconds) to finish tokenizing buffered samples after ENDED (default: 60.0; 0 = unlimited).
+metrics_drain_timeout_s: 300.0 # Wall-clock budget (seconds) to finish tokenizing buffered samples after ENDED (default: 300.0; 0 = unlimited).
 metrics_tokenizer_workers: 2 # In-process tokenizer threads for live (mid-run) ISL/OSL/TPOT (default: 2; 0 = defer everything to the end-of-run drain).
 warmup:
 enabled: false # Enable warmup phase before performance run

src/inference_endpoint/config/templates/offline_template_full.yaml

Lines changed: 1 addition & 1 deletion
@@ -79,7 +79,7 @@ settings:
 warmup_timeout_s: 240.0 # Warmup drain timeout in seconds (None = wait indefinitely)
 performance_timeout_s: 240.0 # Performance drain timeout in seconds (None = wait indefinitely)
 accuracy_timeout_s: null # Accuracy drain timeout in seconds (None = wait indefinitely)
-metrics_drain_timeout_s: 60.0 # Wall-clock budget (seconds) to finish tokenizing buffered samples after ENDED (default: 60.0; 0 = unlimited).
+metrics_drain_timeout_s: 300.0 # Wall-clock budget (seconds) to finish tokenizing buffered samples after ENDED (default: 300.0; 0 = unlimited).
 metrics_tokenizer_workers: 2 # In-process tokenizer threads for live (mid-run) ISL/OSL/TPOT (default: 2; 0 = defer everything to the end-of-run drain).
 warmup:
 enabled: false # Enable warmup phase before performance run

src/inference_endpoint/config/templates/online_template_full.yaml

Lines changed: 1 addition & 1 deletion
@@ -79,7 +79,7 @@ settings:
 warmup_timeout_s: 240.0 # Warmup drain timeout in seconds (None = wait indefinitely)
 performance_timeout_s: 240.0 # Performance drain timeout in seconds (None = wait indefinitely)
 accuracy_timeout_s: null # Accuracy drain timeout in seconds (None = wait indefinitely)
-metrics_drain_timeout_s: 60.0 # Wall-clock budget (seconds) to finish tokenizing buffered samples after ENDED (default: 60.0; 0 = unlimited).
+metrics_drain_timeout_s: 300.0 # Wall-clock budget (seconds) to finish tokenizing buffered samples after ENDED (default: 300.0; 0 = unlimited).
 metrics_tokenizer_workers: 2 # In-process tokenizer threads for live (mid-run) ISL/OSL/TPOT (default: 2; 0 = defer everything to the end-of-run drain).
 warmup:
 enabled: false # Enable warmup phase before performance run

tests/unit/commands/test_benchmark.py

Lines changed: 1 addition & 1 deletion
@@ -489,7 +489,7 @@ def test_defaults(self):
 assert cfg.warmup_timeout_s == 240.0
 assert cfg.performance_timeout_s == 240.0
 assert cfg.accuracy_timeout_s is None
-assert cfg.metrics_drain_timeout_s == 60.0
+assert cfg.metrics_drain_timeout_s == 300.0

 @pytest.mark.unit
 @pytest.mark.parametrize(
