perf(metrics): batch tokenization with defer-to-flush drain by viraatc · Pull Request #350 · mlcommons/endpoints

viraatc · 2026-06-09T20:33:38Z

What

ISL/OSL/TPOT need a tokenizer pass per completion. `main` dispatches one
asyncio task per event into a 2-thread pool; at high completion rates the
backlog grows without bound and the end-of-run drain takes about an hour per
million samples. This PR batches instead: triggers enqueue in O(1); a small
live lane keeps live metrics current; and the end-of-run drain tokenizes
everything left through a process-sharded pool that uses the whole machine.

How

- BatchTokenizer: the drain runs encode_batch_fast (Rust, rayon)
  across auto-sized worker processes, one pinned per 8-core block of the
  allowed CPU universe (probed via expand_to_all_online_cpus(), after which
  the aggregator's inherited mask is restored, so the service stays wherever
  the parent placed it). No silent fallbacks: a tokenizer without a fast
  backend, or a failed/over-budget warmup, is a clean startup error. macOS
  shards run unpinned (rayon capped per worker) at full speed.
- Live lane: in-process threads (--metrics-tokenizer-workers, schema
  default 2, the pre-existing knob and footprint; 0 = defer everything to
  the drain), rayon-capped, slice-capped per flush. Owned by the queue
  (start_live); the publisher knows nothing about tokenization.
- TokenBatchQueue: buffers (text, on_count) per event; live
  failures/cancellations re-queue items (no sample loss), drain failures are
  terminal and stay counted in n_pending_tasks (incomplete-drain contract:
  state == complete && n_pending_tasks > 0). Drain budget --drain-timeout
  (default 60 s, 0 = unlimited); finalize always runs.
- MetricsTable is fully synchronous; CORES_PER_WORKER is a module
  constant. Defaults are single-sourced in config/schema.py
  (metrics_drain_timeout_s 60 s, metrics_tokenizer_workers 2); the
  service args are required and always forwarded by the benchmark.
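
The queue contract above (O(1) enqueue, slice-capped live flush with re-queue on failure, terminal drain failures that stay counted as pending) can be sketched roughly as follows. This is a simplified synchronous illustration, not the PR's code: `SketchTokenQueue`, `flush_live`, `drain`, and `count_batch` are hypothetical names, and the real queue is async and lock-protected.

```python
from typing import Callable, List, Tuple

# (text, on_count) pairs, as described for TokenBatchQueue above.
Item = Tuple[str, Callable[[int], None]]


class SketchTokenQueue:
    """Illustrative stand-in for TokenBatchQueue; all names are hypothetical."""

    def __init__(self, count_batch: Callable[[List[str]], List[int]], live_cap: int) -> None:
        self._count_batch = count_batch  # e.g. a batched tokenizer call
        self._live_cap = live_cap        # slice cap per live flush
        self._items: List[Item] = []

    @property
    def pending(self) -> int:
        return len(self._items)

    def enqueue(self, text: str, on_count: Callable[[int], None]) -> None:
        # O(1): triggers only buffer work, they never tokenize.
        self._items.append((text, on_count))

    def flush_live(self) -> None:
        # Detach at most live_cap items in place, then tokenize them.
        batch = self._items[: self._live_cap]
        del self._items[: self._live_cap]
        if not batch:
            return
        try:
            self._record(batch)
        except Exception:
            # Live failure: put the detached items back so the drain retries them.
            self._items[:0] = batch

    def drain(self) -> int:
        # End of run: tokenize everything left and return the un-tokenized count.
        batch, self._items = self._items, []
        try:
            self._record(batch)
        except Exception:
            # Drain failure is terminal: the items stay counted as pending.
            self._items = batch
        return self.pending

    def _record(self, batch: List[Item]) -> None:
        counts = self._count_batch([text for text, _ in batch])
        for (_, on_count), n in zip(batch, counts):
            on_count(n)
```

A non-zero return from `drain()` corresponds to the incomplete-drain signal (state == complete && n_pending_tasks > 0) in this simplified model.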

Validation

- Unit suite green (176 metrics-aggregator: queue contract, shard sizing,
  drain timeout/failure, live requeue, RAYON caps, wiring seams);
  pre-commit clean. Offline-burst e2e: state=complete, all series
  populated, drain to n_pending_tasks=0.
- Sharding is default-on through the real launch path (verified on a
  48-core x86 host and a 144-core GB200): the drain shards span the machine
  while the aggregator keeps its inherited mask.
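
The probe-and-restore behaviour described above (shards sized from the full allowed CPU set while the aggregator keeps its inherited mask) could look roughly like this. It is a sketch built on the standard `os.sched_getaffinity` / `os.sched_setaffinity` calls; `probe_allowed_cpus` is a hypothetical name and the PR's expand_to_all_online_cpus() may differ in detail.

```python
import os
from typing import Set


def probe_allowed_cpus() -> Set[int]:
    """Return the CPUs this process could use, leaving its own mask unchanged."""
    if not hasattr(os, "sched_getaffinity"):
        # No affinity API (e.g. macOS): fall back to the online CPU count.
        return set(range(os.cpu_count() or 1))

    inherited = os.sched_getaffinity(0)
    try:
        # Ask for every CPU; the kernel clamps the request to the cpuset.
        os.sched_setaffinity(0, range(os.cpu_count() or 1))
        return set(os.sched_getaffinity(0))
    except OSError:
        # Expansion refused: size from the inherited mask instead.
        return set(inherited)
    finally:
        # Restore so the caller stays wherever its parent placed it.
        os.sched_setaffinity(0, inherited)
```

This sketch lets a failed restore raise; the final commit in this PR logs that case instead.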

Tokenizer micro-benchmark (GB200, real DeepSeek-R1 tokenizer)

144-core Grace, corpus = MLPerf DS-R1 prompts tiled to the dataset-mean OSL
of 3877 tokens; identical token counts both sides.

| impl | parallelism | texts/s | tokens/s | speedup |
|---|---|---|---|---|
| `main` | 2 threads, per-text encode | 313 | 1.21 M | - |
| this PR | 18 shards, batched encode | 11,951 | 46.3 M | 38× |
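
As a consistency check on the table, the tokens/s column should be about texts/s times the mean OSL, assuming every tiled text is close to the stated 3877 tokens. A short calculation:

```python
MEAN_OSL_TOKENS = 3877  # dataset-mean OSL from the benchmark description

texts_per_s = {"main": 313, "this PR": 11_951}

# tokens/s = texts/s * tokens per text (assumes uniform text length)
tokens_per_s = {impl: rate * MEAN_OSL_TOKENS for impl, rate in texts_per_s.items()}
speedup = texts_per_s["this PR"] / texts_per_s["main"]

for impl, rate in tokens_per_s.items():
    print(f"{impl}: {rate / 1e6:.2f} M tokens/s")
print(f"speedup: {speedup:.1f}x")
```

This comes out to about 1.21 M and 46.3 M tokens/s with a ratio of roughly 38×, matching the reported figures.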

1M-sample end-to-end A/B vs `main`

Offline 1M samples, streaming, DS-R1 tokenizer, server-paced at 8k QPS with
~1k-token outputs. Both sides: 1,000,000/1,000,000, state=complete,
n_pending_tasks=0, identical token series.

| host | impl | backlog at `ENDED` | drain | total | speedup |
|---|---|---|---|---|---|
| GB200 144c | `main` | 2,970,972 | 3,362 s | 58.1 min | - |
| GB200 144c | this PR | 2,782,912 | 42.9 s | 3.2 min | 18.1× |
| B200 192c | `main` | 2,994,925 | 3,286 s | 56.9 min | - |
| B200 192c | this PR | 2,788,032 | 61 s | 3.4 min | 16.5× |

Measured on the final design (in-process live lane, --tokenizer-workers 2,
300 s drain default). The live lane keeps ~7% of tokenizations current; the
rest (~2.78M) defer to the end-of-run drain, which the sharded pool clears in
43-61 s. A 1M-sample run needs the 300 s budget; with a 60 s budget the
backlog is dropped. The `main` rows (run with an unlimited drain budget, since
they would otherwise never finish) and the micro-benchmark are unaffected by
this default.
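
The ~7% live-lane share and the drain-only speedup can be recomputed from the table. The sketch below assumes three tokenizations per sample (one each for ISL, OSL, TPOT), which is an inference from the ~3M backlog on `main` rather than something the description states.

```python
SAMPLES = 1_000_000
TOKENIZATIONS_PER_SAMPLE = 3  # assumption: one each for ISL, OSL, TPOT

runs = {
    "GB200 144c": {"main_drain_s": 3362.0, "pr_backlog": 2_782_912, "pr_drain_s": 42.9},
    "B200 192c": {"main_drain_s": 3286.0, "pr_backlog": 2_788_032, "pr_drain_s": 61.0},
}

total = SAMPLES * TOKENIZATIONS_PER_SAMPLE
summary = {}
for host, r in runs.items():
    summary[host] = {
        # fraction tokenized mid-run by the live lane
        "live_share": 1 - r["pr_backlog"] / total,
        # deferred tokenizations cleared per second by the sharded drain
        "drain_rate": r["pr_backlog"] / r["pr_drain_s"],
        # drain phase only, as opposed to the table's total-time speedup
        "drain_speedup": r["main_drain_s"] / r["pr_drain_s"],
    }

for host, s in summary.items():
    print(f"{host}: live {s['live_share']:.1%}, "
          f"{s['drain_rate']:,.0f} tok-jobs/s, drain {s['drain_speedup']:.0f}x")
```

Under that assumption the live share is a little over 7% on both hosts, and the drain phase alone is roughly 78× (GB200) and 54× (B200) faster than on `main`.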

🤖 Generated with Claude Code

github-actions · 2026-06-09T20:33:48Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request replaces the thread-based TokenizePool with a process-sharded BatchTokenizer and a TokenBatchQueue to buffer and batch tokenization work (ISL/OSL/TPOT) during metrics aggregation, preventing the system from falling behind on high-throughput runs. The review feedback highlights critical reliability improvements in token_metrics.py. Specifically, it is recommended to wrap the queue's flush logic in a try...finally block to prevent self._inflight from leaking on exceptions or cancellations. Additionally, count_texts and count_texts_async should explicitly check if the tokenizer is closed, and close() should wait for process pools to shut down to avoid resource leaks.

Replace the per-event async tokenize model (one asyncio task per sample's ISL/OSL/TPOT) with a deferred batch design that keeps tokenization ahead of completions on high-completion-rate runs, where the per-event tasks otherwise piled up faster than a single tokenizer thread could clear them and stretched the end-of-run drain.

- BatchTokenizer: counts whole batches via the raw tokenizers backend (encode_batch_fast), sharded across worker processes each pinned to a disjoint CORES_PER_WORKER-core block so their rayon pools stay NUMA-local. Falls back to a single in-process thread when there is no fast backend or fewer than two core blocks fit.
- TokenBatchQueue: triggers enqueue (text/message + a recorder callback) instead of spawning tasks; the buffer is tokenized in one sharded call at each publish tick (live ISL/OSL/TPOT) and once at end-of-run (flush_remaining, bounded by the drain budget). n_pending_tasks now counts un-tokenized items, preserving the Report "incomplete drain" contract.
- MetricsTable is now fully synchronous (drops the in-flight task set, drain_tasks, and in_flight_tasks_count).
- CORES_PER_WORKER is a module constant; removes the metrics_tokenizer_workers config knob (schema/execute/CLI) and regenerates the YAML templates.

Validated: 234 unit + 3 integration tests pass. Offline-burst e2e (echo server, streaming, real tokenizer) shows a 3000-tokenization backlog at ENDED drained to n_pending_tasks=0 with the final report state=complete.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The ENDED drain sat outside the finalization try/finally and flush_remaining caught only TimeoutError: any other tokenizer failure (e.g. BrokenProcessPool from a dead shard) escaped the fire-and-forget process() task, skipped publish_final, and hung the subprocess with no final_snapshot.json. The drain now runs inside the finalization boundary and flush_remaining swallows non-timeout failures, logs them, and returns the un-tokenized count, surfacing as an incomplete drain (n_pending_tasks > 0) instead of a hang.

Cleanup (review feedback):
- delete the test-only sync API (count_texts / token_count / token_count_message); production uses only the async paths, and count_texts_async now raises RuntimeError after close()
- rename AsyncTokenTrigger -> TokenTrigger (fire() is sync; it enqueues)
- extract _encode_batch_lengths shared by the worker and in-process paths
- pending_tokens property collapses the triple None-guard; the SIGTERM handler takes a pending_tokens callback instead of reaching into aggregator._token_queue
- drop vestigial return None and quoted forward-ref; trim stale "async tasks" wording in docs and the drain-timeout help text (templates regenerated); document the wait=False shard shutdown

Tests: sharded-path reassembly + BrokenProcessPool propagation, _even_chunks, and queue/aggregator drain-failure regression tests. 145 aggregator unit + 160 config/commands/integration tests pass; pre-commit clean.

Validated on GB200 (ptyche, 144-core Grace, 18 shards, real DeepSeek-R1 tokenizer at mean OSL 3877): 38x vs the per-event pool; 1M-output drain 84s vs ~53min.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
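
The "never raises, returns the un-tokenized count" shape described in this commit might look roughly like the sketch below. `flush_remaining_sketch` and `count_async` are hypothetical names; the real flush_remaining operates on the queue's own buffer rather than a list argument.

```python
import asyncio
import logging
from typing import Awaitable, Callable, List

logger = logging.getLogger(__name__)


async def flush_remaining_sketch(
    count_async: Callable[[List[str]], Awaitable[List[int]]],
    texts: List[str],
    timeout_s: float,
) -> int:
    """Return how many texts were left un-tokenized; never raise."""
    if not texts:
        return 0
    budget = None if timeout_s == 0 else timeout_s  # 0 = unlimited
    try:
        await asyncio.wait_for(count_async(texts), timeout=budget)
    except asyncio.TimeoutError:
        logger.warning("drain budget of %ss expired", timeout_s)
        return len(texts)
    except Exception:
        # e.g. BrokenProcessPool from a dead shard: log it, report as pending.
        logger.exception("drain failed")
        return len(texts)
    return 0
```

Because the caller only sees a count, finalization can always proceed and a failure shows up as n_pending_tasks > 0 rather than a hang.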

…izer-workers

Review-council + e2e findings on the batch-tokenization branch. The tokenizer drain runs after the benchmark, so the loadgen/worker affinity partition does not apply to it, but the aggregator subprocess inherited the loadgen's narrow pin (subprocess.Popen propagates the parent mask) and sharding silently never engaged under the default enable_cpu_affinity=true.

- cpu_affinity: add expand_to_all_online_cpus(), which resets the current process to every online CPU (kernel still clamps to the cgroup/Slurm cpuset). The aggregator calls it before constructing the tokenizer, so shards size to the full machine by default.
- Restore the --tokenizer-workers service flag with shard semantics: -1 auto (one process per 8-core block), explicit count clamped to capacity, 0 disables sharding. Every fallback path logs its reason and the success log includes setup time.
- flush() phase isolation: a text-batch failure no longer drops the message items (separate failure scopes per executor; first error re-raised after both phases), and a raising recorder callback is logged instead of poisoning the rest of the batch.
- Shard workers ignore SIGINT: Ctrl-C goes to the whole process group; the parent drain must control worker lifetime.
- Stale "in-flight async tokenize tasks" wording updated in snapshot.py, publisher.py, and AGENTS.md (TokenizePool reference); documented the wait=False shard shutdown.

Validated e2e through the real launch path (echo server, default flags, 48-CPU host): aggregator expands 10 -> 48 CPUs, "BatchTokenizer: 6 shards x 8 cores", drain to n_pending_tasks=0, state=complete. 166 unit tests pass; pre-commit clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The ellipsis bodies trip the code-quality bot's "statement has no effect" check on every push; pass is semantically identical for Protocol method declarations and keeps the report clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…erview

New docs/async_utils/services/metrics_aggregator/DESIGN.md (mirroring the event_logger convention) covering the service lifecycle and the token metrics pipeline: defer-to-flush batching, process-sharded batch encoding, the post-run affinity expansion, failure isolation, and the n_pending_tasks contract. The services overview 6.2 entry now reflects the batched tokenizer, the snapshot outputs, and the current CLI flags, and links the new doc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…d shard warmup

Review-council findings (handled locally):
- A persistently failing pre_publish flush aborted every tick before the snapshot was built, silently stopping ALL live metrics publishing, not just token series. The flush now fails in its own handler (logged once) and the tick always proceeds to build and publish; unflushed items stay visible as n_pending_tasks. Regression-tested: a failing flush must not suppress state capture/publish.
- Shard warmup waits are bounded (_SHARD_WARMUP_TIMEOUT_S): a hung tokenizer load (e.g. stuck network filesystem) now degrades to the in-process path instead of wedging service startup forever.
- close() and warmup cleanup terminate shard workers (cancel_futures + SIGTERM) so an in-flight encode cannot stall interpreter exit after a drain timeout.
- TokenCounter protocol stubs use docstring + raise NotImplementedError (the one body shape CodeQL, mypy, and Pyright all accept).
- New TestSetupShardsDecisions pins the --tokenizer-workers contract (auto/clamp/disable thresholds, block pinning, affinity and warmup failure fallbacks); the decision logic previously had zero coverage.

162 aggregator unit tests pass; pre-commit clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A fully set-up environment (fast Rust tokenizer backend + Linux affinity) always shards; anything else was previously a silent in-process fallback that cannot keep up with completions and only surfaces much later as an incomplete drain. Setup is now strict:
- no fast backend / no CPU affinity / failed or over-budget warmup -> RuntimeError, surfaced by the service entry as a FATAL launch failure
- --tokenizer-workers 0 is the only (explicit) in-process mode
- auto mode always shards: max(1, cpus // 8). The "fewer than two blocks" in-process heuristic is gone; one pinned shard below a full block

Also converts the new shard-decision tests to context-managed BatchTokenizer construction (CodeQL: use-with-statement). 164 aggregator unit tests pass; pre-commit clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
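
The auto-sizing rule max(1, cpus // 8) with one shard per 8-core block can be illustrated as below. `plan_shard_blocks` is a hypothetical helper, and how the PR treats CPUs left over after the last whole block is not stated; this sketch simply leaves them unused.

```python
from typing import List, Sequence

CORES_PER_WORKER = 8  # block size named in the PR description


def plan_shard_blocks(
    cpus: Sequence[int], cores_per_worker: int = CORES_PER_WORKER
) -> List[List[int]]:
    """Split the allowed CPUs into disjoint blocks, one per shard."""
    ordered = sorted(cpus)
    if len(ordered) < cores_per_worker:
        # Below a full block: still shard, with a single smaller block.
        return [ordered]
    n_shards = max(1, len(ordered) // cores_per_worker)
    return [
        ordered[i * cores_per_worker : (i + 1) * cores_per_worker]
        for i in range(n_shards)
    ]
```

With 144 allowed CPUs this yields 18 blocks and with 48 it yields 6, in line with the "18 shards" and "6 shards x 8 cores" figures quoted elsewhere in this PR.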

The affinity API's absence is a platform property, not a broken environment: sharding works identically without pinning — the OS scheduler spreads the workers and only cache/NUMA locality is lost. _setup_shards now sizes blocks from the online CPU count when sched_getaffinity is unavailable, and each worker that cannot pin caps its rayon pool to its block size via RAYON_NUM_THREADS so unpinned shards do not oversubscribe each other. The strict startup errors remain for genuine environment problems: a tokenizer without a fast (Rust) backend, and a failed or over-budget shard warmup. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
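
A worker-side "pin if possible, otherwise cap rayon" step along the lines described here might look like the following sketch. `pin_or_cap` is a hypothetical name; the `env` and `setaffinity` parameters exist only so the logic can be exercised without touching the real process. It is meant to run before the first encode, since rayon reads RAYON_NUM_THREADS when its global pool is initialized.

```python
import os
from typing import Callable, MutableMapping, Optional, Sequence

SetAffinity = Callable[[int, Sequence[int]], None]


def pin_or_cap(
    block: Sequence[int],
    env: MutableMapping[str, str] = os.environ,
    setaffinity: Optional[SetAffinity] = getattr(os, "sched_setaffinity", None),
) -> str:
    """Pin this process to its core block, or cap rayon to the block size."""
    if setaffinity is not None:
        try:
            setaffinity(0, block)  # pid 0 = the calling process
            return "pinned"
        except OSError:
            pass  # pinning refused: fall through to the cap
    # Unpinned shards must not oversubscribe each other.
    env["RAYON_NUM_THREADS"] = str(len(block))
    return "capped"
```

In a process pool this would typically be passed as the worker initializer with the shard's block as its argument.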

…hook

The publisher no longer knows about tokenization: TokenBatchQueue owns its flush cadence via start_live(interval), removing the pre_publish callback (and its failure-isolation machinery) added earlier in this branch. Mid-run flushes go through a bounded live lane (--live-tokenizers shards, default 1, taken from the highest core blocks, farthest from the loadgen's low cores) so live ISL/OSL/TPOT stay current without contending with the benchmark hot path; --live-tokenizers 0 defers all tokenization to the end-of-run drain, which always uses every shard.

Live-flush failures and cancellations re-queue the detached items so a mid-run hiccup never loses samples (the drain retries them); drain failures remain terminal and pending-counted. Default metrics-drain-timeout rises 60s -> 300s since the live lane is sized for currency, not for keeping up with peak completion rates.

For comparison, main tokenizes continuously during the run on 2 threads inside the aggregator process, which inherits the loadgen's pinned mask, i.e. directly on the loadgen's cores.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Workstreams from the full design audit:
- Live flushes take at most _LIVE_FLUSH_MAX_ITEMS per kind: bounds the queue-lock hold time, the unstoppable in-flight thread encode left behind by a drain-start cancellation (close(wait=True) is now bounded by ~one slice), and the drain's re-encode of requeued items.
- BatchTokenizer live_workers ctor default aligned to 2 (the CLI default); the aggregator class drain-timeout default aligned to 300s (the CLI default); --tokenizer-workers < 0 rejected at startup.
- A failed restore of the inherited CPU mask is logged instead of silently leaving the aggregator expanded.
- Comment/docstring hygiene: removed prior-implementation narration and stale shard-lane/warmup-degrade/publish-tick wording; SIGTERM-only phrasing in publisher docs.
- Tests: shard-decision suite no longer issues real sched_setaffinity syscalls (probes and restore are patched and asserted); live lane pinned as in-process-only; new coverage for RAYON caps (ctor, operator override, per-shard block override), live flush slice cap, live cancellation/message-failure requeue, and STARTED arming the live loop with ENDED stopping it; live-method aliases on all stubs.
- DESIGN.md rewritten for the final shape (in-process live lane, drain-only auto-sized shards, probe-and-restore affinity, requeue semantics, diagram + CLI table); services overview and AGENTS.md row aligned.

345 unit tests pass; pre-commit clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

nvzhihanj

Review Council — re-audit after rebase (HEAD `4633699d`)

Reviewed by: Claude + Codex (low reasoning, correctness pass) · Depth: thorough. Focus (as requested): is the metrics-tokenization change modular / clean / non-intrusive (existing benchmark behavior preserved), and are redundant/meaningless tests added.

Verdict: the rebase reduced the intrusiveness but did not resolve it. Replacing the pre_publish hook (full tokenizer pool fired from the publish tick) with a bounded single-shard live lane is a real improvement. But mid-run tokenization is still on by default (--live-tokenizers defaults to 1 → 0.25s live flush), the live shard runs on the highest core block which overlaps the HTTP workers (compute_affinity_plan Phase-3 spillover), and the PR removed the only opt-out (metrics_tokenizer_workers) without replacement — so the observability component can perturb the SUT during measurement and the operator can't turn it off via config. Headline recommendation: default --live-tokenizers 0 for measurement-grade runs (defer all tokenization to the post-run drain), or confine the live shard to cores disjoint from worker_cpu_sets; restore a benchmark-reachable knob. (A1, A2, A3.)

Otherwise clean / non-intrusive. The change stays in the aggregator subprocess (only cross-module touch: importing endpoint_client.cpu_affinity). The consumer contract is verified intact — SessionState, the MetricsSnapshot schema, publisher cadence, and the state==COMPLETE && n_pending_tasks>0 incomplete-drain signal are unchanged; flush_remaining is bounded by the drain budget and never raises; the live-loop's failure cannot skip publish_final. The "shard or exit cleanly" fallback and the unpinned-without-affinity (macOS) path are correct and tested.

Tests: no redundant or meaningless tests. The new branches are mostly well covered with behavior-grounded assertions (the _setup_shards decision matrix, no-fast-backend-exit, unpinned-without-affinity, warmup-failure-exit, flush_remaining timeout/failure, live-loop start/stop/survives-failure, expand_to_all_online_cpus). Removing the old metrics_tokenizer_workers tests was correct (dead). The problems are coverage gaps, not redundancy: the aggregator-side start_live wiring is untested (A5) and TestAggregatorArgs no longer pins the forwarded-args contract (A6). Two _FakeProc-injection tests are borderline-coupled to internals but still verify fan-out/reassembly; TestEvenChunks is trivial-but-cheap. No mock-only or duplicate tests found.

Codex findings — not posted: (1) a multi-turn-ISL precompute regression at execute.py:351 — that's PR #349's change, out of scope here; (2) a shutdown(wait=False) worker-terminate race — _terminate_procs already defensively handles _processes is None and CPython doesn't synchronously null it, so the specific mechanism couldn't be verified → dropped. Existing gemini/github-code-quality token_metrics.py comments (flush-exception inflight; closed-tokenizer guards; close() shutdown leak; Protocol ...→pass) are unaddressed but deduped here, not re-posted.

…args seam

- flush_remaining gathers the cancelled live task (return_exceptions) instead of a bare suppressed await; the cancellation test awaits via wait_for. Both silence the code-quality ineffectual-statement check without changing semantics.
- New TestAggregatorArgs case pins the SUT-intrusion seam: --tokenizer is forwarded, and no live/worker knobs are; the service defaults deliberately govern mid-run tokenization (review feedback).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Review feedback (human + council), with the API surface pulled back toward main:
- A live-flush cancellation landing in the text encode dropped the already-detached message items: lost tool-call samples and a final snapshot stuck at n_pending_tasks > 0 for work the drain could never reach. The text-phase CancelledError handler now re-queues both kinds; regression test covers text+message together.
- count_texts_live_async is gone: the live lane is a live= keyword on count_texts_async, so the TokenCounter protocol is back to two methods and every test stub lost its alias.
- The SIGTERM handler takes the token queue object again (reads .pending), not a callable.
- Live flushes take their slice in place (del list[:cap]) instead of copying the whole backlog tail under the queue lock each tick.
- Shard warmup budget reduced to 25s so its diagnostic FATAL fires before the parent's 30s service-launch kill.
- TestAggregatorArgs pins the SUT-intrusion seam: --tokenizer is forwarded; live/worker knobs deliberately are not.

276 unit tests pass; pre-commit clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
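
The warmup-budget point (fail with a diagnostic at 25s, before the parent's 30s launch kill) can be sketched with `concurrent.futures.wait`. The constant and function names below are hypothetical; the PR's own constant is _SHARD_WARMUP_TIMEOUT_S, and its cleanup also terminates the worker processes, which this sketch omits.

```python
from concurrent.futures import Future, wait
from typing import Iterable

SERVICE_LAUNCH_KILL_S = 30.0  # parent's launch deadline, per the commit message
SHARD_WARMUP_BUDGET_S = 25.0  # must expire first so the diagnostic is seen


def await_warmup(
    futures: Iterable[Future], budget_s: float = SHARD_WARMUP_BUDGET_S
) -> None:
    """Block until every shard has warmed up, or raise within the budget."""
    done, not_done = wait(list(futures), timeout=budget_s)
    if not_done:
        for f in not_done:
            f.cancel()  # best effort; a running worker needs terminating too
        raise RuntimeError(
            f"{len(not_done)} shard(s) exceeded the {budget_s}s warmup budget"
        )
    for f in done:
        err = f.exception()
        if err is not None:
            raise RuntimeError("shard warmup failed") from err
```

Keeping the budget below the parent's deadline means the operator sees the specific warmup error rather than a generic launch timeout.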

The service entry wires the SIGTERM handler from the aggregator's table and token queue; expose them as read-only properties instead of reaching into private attributes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The end-of-run drain runs on the full shard pool, so 60s covers roughly a million buffered tokenizations on a large node; bigger runs set --metrics-drain-timeout explicitly (0 = unlimited). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…docs

metrics_tokenizer_workers returns to DrainConfig (default 2, ge=0; 0 = defer all to drain) and execute.py forwards it again. --drain-timeout and --tokenizer-workers become required service args; the aggregator ctor and BatchTokenizer lose their duplicated defaults. Docs and comments trimmed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…edits Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A 1M-sample run holds ~2M deferred tokenizations at ENDED; the drain fans the whole buffer into one encode_batch per shard, so a 60s budget expires before any chunk returns and the entire backlog is dropped. 300s covers 1M-sample runs with headroom. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

viraatc requested a review from a team June 9, 2026 20:33

github-actions Bot requested review from arekay-nv and nvzhihanj June 9, 2026 20:33

github-code-quality Bot found potential problems Jun 9, 2026

Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/token_metrics.py Fixed

Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/token_metrics.py Fixed

gemini-code-assist Bot reviewed Jun 9, 2026
