H100 SXM5 HSTU inference profile: kernel mix, padding waste, and two proposed Python-side improvements

@zoranzhao, I spent the past week profiling the public `dlrm_v3` inference path on H100 SXM5 (against the AOT-T kernel) to understand HSTU's perf surface. Sharing measurements and a couple of observations below.

## Measured baseline (H100 SXM5)

Setup: synthetic 5-layer STULayer stack matching `dlrm_v3/configs.py` defaults (`hstu_num_heads=4`, `hstu_attn_qk_dim=128`, `hstu_attn_linear_dim=128`, `hstu_embedding_dim=512`, `num_layers=5`). Ragged batch shape: `B=32, avg uih_len=1024, num_candidates=2048`, giving S_max ≈ 4020 and ~99k jagged tokens per iteration. BF16, dispatched via `HammerKernel.TRITON`.

Hardware: NVIDIA H100 80GB HBM3 SXM5 (Vast.ai), driver 580.126, CUDA 13.0.

| Metric | Value |
|---|---|
| Forward latency (50-iter avg) | **9.30 ms / iter** |
| Throughput | 10.63 M tokens / sec |
| Top kernel mix (nsys) | `_addmm_fwd` 39.8% / `_hstu_attn_fwd` 39.1% / `_ln_mul_dropout_fwd` 9.2% / silu 7.9% / `_weighted_layer_norm_fwd` 3.6% |
| Per-batch padding waste | 23.2% at headline; 9.2% @ `uih=256`; 37.5% @ `uih=4096` |
| ncu hardware counters | Not captured. Vast doesn't grant `CAP_SYS_ADMIN`, so SM occupancy, HBM BW, and Tensor Core util need a counter-enabled host. |

One thing stood out: `_addmm_fwd` (the UVQK projection) takes slightly more time than the attention kernel itself. An optimization plan that targets only the attention path leaves that ~40% of forward time untouched.
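For anyone cross-checking the table, the throughput and headline padding rows can be approximately reproduced from the setup numbers. This is a minimal sketch, assuming padding waste is defined as 1 − (jagged tokens) / (B × S_max); the token count and S_max are the rounded values quoted above, so the results only match the table to within rounding.

```python
def padding_waste(lengths):
    """Fraction of a dense (B x S_max) batch that is padding."""
    padded_tokens = len(lengths) * max(lengths)
    return 1.0 - sum(lengths) / padded_tokens


def throughput_tokens_per_sec(num_tokens, latency_ms):
    """Tokens processed per second for one forward pass of `latency_ms`."""
    return num_tokens / (latency_ms * 1e-3)


# Rounded headline values from the setup paragraph and the table.
B, S_MAX, JAGGED_TOKENS, LATENCY_MS = 32, 4020, 99_000, 9.30

# Same definition as padding_waste(), applied to the aggregate numbers.
headline_waste = 1.0 - JAGGED_TOKENS / (B * S_MAX)
headline_tput = throughput_tokens_per_sec(JAGGED_TOKENS, LATENCY_MS)

print(f"padding waste ~ {headline_waste:.1%}")
print(f"throughput    ~ {headline_tput / 1e6:.2f} M tokens/s")
```

`padding_waste` takes a list of per-request sequence lengths, so the same helper can be pointed at a real length trace instead of the aggregate figures.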

Analytical activation HBM partition: per-request peak working set is ~45 MB at this shape. K/V together (the natural target for an FP4 follow-up) is ~18-22% of that. An 80 GB H100 can in principle host ~1700 concurrent peak-working-sets, so HBM isn't saturating concurrency at the public-config shape.
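The K/V share and the concurrency ceiling can be sanity-checked with a few lines of arithmetic. This sketch assumes decimal MB/GB, one request at S_max ≈ 4020 tokens, and a per-token layout in which K is `hstu_attn_qk_dim` wide and V is `hstu_attn_linear_dim` wide for each head; the 45 MB working set is taken from the text, not derived here.

```python
BYTES_BF16 = 2
NUM_HEADS, QK_DIM, LINEAR_DIM = 4, 128, 128  # dlrm_v3 defaults quoted above
S_MAX = 4020                                 # tokens in one request (approximate)
PEAK_WORKING_SET_MB = 45.0                   # analytical figure from the text
HBM_GB = 80.0

# Assumed layout: per token and per head, K has QK_DIM and V has LINEAR_DIM elements.
kv_mb = S_MAX * NUM_HEADS * (QK_DIM + LINEAR_DIM) * BYTES_BF16 / 1e6
kv_fraction = kv_mb / PEAK_WORKING_SET_MB

# Upper bound on concurrent peak working sets, ignoring weights and allocator overhead.
concurrent_working_sets = HBM_GB * 1e3 / PEAK_WORKING_SET_MB

print(f"K+V ~ {kv_mb:.2f} MB ({kv_fraction:.0%} of peak working set)")
print(f"concurrent peak working sets ~ {concurrent_working_sets:.0f}")
```

Model weights and embedding tables are left out, so the real ceiling is lower than this bound.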

## Two proposed Python-side improvements

### 1. Dynamic length-aware batching with CUDA Graph caching (~6-10 weeks)

Replace the static-batch path in `generative_recommenders/dlrm_v3/inference/main.py` `runner.enqueue()` / `run_one_item()` (lines 146-284) with a length-bucketed scheduler. Time-bounded batch formation T_form ≈ 100-200 µs; 4-8 seqlen buckets; CUDA Graph capture+replay for common (bucket × batch_size) shapes; LRU-capped at 32 entries (~1 GB GPU memory).

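This recovers the measured padding waste (23.2% at the headline shape, up to 37.5% at `uih=4096`). Projected per-request latency improvement is ~1.3-1.6×, depending on production UIH variance; that range is what an ideal 1/(1 − waste) speedup gives at those two waste levels.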
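To make the proposal concrete, here is a minimal sketch of the two data structures involved: a length-bucketed batch former and an LRU cache keyed by (bucket, batch_size). The bucket edges, `bucket_for`, `form_batches`, and `GraphCache` are all hypothetical names for illustration, not part of `dlrm_v3`. The time-bounded T_form window is omitted, and `capture_fn` stands in for an actual CUDA Graph capture.

```python
from collections import OrderedDict, defaultdict

BUCKET_EDGES = [512, 1024, 2048, 4096, 8192]  # hypothetical seqlen buckets
MAX_GRAPHS = 32                               # LRU cap from the proposal


def bucket_for(seq_len):
    """Return the smallest bucket edge that fits the request."""
    for edge in BUCKET_EDGES:
        if seq_len <= edge:
            return edge
    raise ValueError(f"seq_len {seq_len} exceeds the largest bucket")


def form_batches(seq_lens, max_batch_size):
    """Group request lengths by bucket, then chunk each bucket into batches."""
    by_bucket = defaultdict(list)
    for n in seq_lens:
        by_bucket[bucket_for(n)].append(n)
    batches = []
    for edge in sorted(by_bucket):
        reqs = by_bucket[edge]
        for i in range(0, len(reqs), max_batch_size):
            batches.append((edge, reqs[i:i + max_batch_size]))
    return batches


class GraphCache:
    """LRU cache keyed by (bucket, batch_size); values stand in for captured graphs."""

    def __init__(self, capacity=MAX_GRAPHS):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get_or_capture(self, key, capture_fn):
        if key in self._entries:
            self._entries.move_to_end(key)     # mark as most recently used
            return self._entries[key]
        graph = capture_fn(key)                # e.g. a torch.cuda.CUDAGraph capture
        self._entries[key] = graph
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return graph

    def __contains__(self, key):
        return key in self._entries

    def __len__(self):
        return len(self._entries)


batches = form_batches([300, 900, 1000, 3900, 4000, 700], max_batch_size=2)
cache = GraphCache()
for edge, reqs in batches:
    cache.get_or_capture((edge, len(reqs)), capture_fn=lambda key: f"graph{key}")
```

Because graph replay needs static shapes, each batch would presumably be padded to its bucket edge, so the bucket count trades residual padding against the number of cached graphs.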

### 2. Post-attention `uvqk` eviction when KV caching is disabled (~1-2 weeks, surgical)

`hstu_preprocess_and_attention` returns `(u, attn_output, k, v)`. The `k`, `v` outputs are views into the 16.47 MB `uvqk` tensor produced by the UVQK addmm in `hstu_compute_uqvk` (`generative_recommenders/ops/hstu_compute.py:115-126`). Because they're views, `uvqk` stays alive across the entire `STULayer.forward()`: through `update_kv_cache`, through `hstu_compute_output`, until the function returns.

When `max_kv_caching_len=0` (which I observe is the production inference path; `STULayer.forward` is called, not `cached_forward`), `k` and `v` are never consumed downstream. `update_kv_cache` is a no-op on the prefill-disabled path, and `hstu_compute_output` reads only `u`, `attn`, and `x`. An explicit `del k, v` after the no-op call, or a signature change that doesn't return `k, v` when caching is disabled, lets the 16.47 MB `uvqk` allocation be freed post-attention.

I verified the storage relationship with a standalone PyTorch test (no Triton dependency): the `torch.split` outputs (u, v, q, k) share storage with the parent `uvqk` allocation, and after `del uvqk` in the local scope, reading from `k` still succeeds. That proves `k`'s view alone is keeping the full 16.47 MB alive. This is ~37% of per-request peak working set, no Triton changes, no accuracy effect.
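A check along these lines can be reproduced on CPU. This is a sketch, not the exact script I ran: the tensor shape assumes one request at S_max ≈ 4020 tokens with the default head and dimension settings, and the split order follows the (u, v, q, k) order described above. It relies only on `torch.split` and `Tensor.untyped_storage()` (PyTorch 2.x).

```python
import torch

S_MAX, NUM_HEADS, QK_DIM, LINEAR_DIM = 4020, 4, 128, 128
width = NUM_HEADS * (2 * LINEAR_DIM + 2 * QK_DIM)       # 2048 columns per token

# CPU stand-in for the UVQK addmm output.
uvqk = torch.zeros(S_MAX, width, dtype=torch.bfloat16)
full_bytes = uvqk.untyped_storage().nbytes()

u, v, q, k = torch.split(
    uvqk,
    [NUM_HEADS * LINEAR_DIM, NUM_HEADS * LINEAR_DIM,
     NUM_HEADS * QK_DIM, NUM_HEADS * QK_DIM],
    dim=1,
)
shares_storage = (
    k.untyped_storage().data_ptr() == uvqk.untyped_storage().data_ptr()
)

del uvqk, u, v, q                                 # only the k view remains
k_view_bytes = k.numel() * k.element_size()       # bytes k logically covers
k_storage_bytes = k.untyped_storage().nbytes()    # bytes k keeps allocated

# For contrast, an owning copy holds only its own bytes.
k_copy = k.clone(memory_format=torch.contiguous_format)
k_copy_storage_bytes = k_copy.untyped_storage().nbytes()

print(f"uvqk allocation:       {full_bytes / 1e6:.2f} MB")
print(f"k view covers:         {k_view_bytes / 1e6:.2f} MB")
print(f"k view keeps alive:    {k_storage_bytes / 1e6:.2f} MB")
print(f"k.clone() keeps alive: {k_copy_storage_bytes / 1e6:.2f} MB")
```

On GPU the same relationship should show up as `torch.cuda.memory_allocated()` not dropping until the last view is released, though I have not included that variant here.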

Question (3) below asks whether this is already known and addressed internally. It would surprise me if it wasn't.

## An FP4 K/V compression direction I scoped but deferred

I originally scoped an NVFP4 (E2M1 elements with E4M3-FN block scale, group=16) K/V compression in the AOT-T Triton attention kernel as a primary contribution. The bit-budget math works (~4.5 bits/elem alone, ~5.1-5.3 bits/elem with a QJL residual recovery), Triton 3.3+ exposes `tl.dot_scaled`, and the existing AOT-T kernel is the natural integration point.
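For reference, the ~4.5 bits/elem figure is the 4-bit E2M1 element plus one 8-bit E4M3 scale amortized over a 16-element group, and the compression ratio against BF16 follows directly. The QJL residual overhead is not derived here.

$$
b_{\text{NVFP4}} = 4 + \frac{8}{16} = 4.5 \ \text{bits/elem},
\qquad
\frac{b_{\text{BF16}}}{b_{\text{NVFP4}}} = \frac{16}{4.5} \approx 3.56
$$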

The measurement data made me uncertain it's worth the 7-8 weeks of kernel work at the public-config H100 shape:

- HSTU attention is compute-bound at the public dlrm_v3 shape (intensity `I(S) = S/2` vs H100 SXM5 ridge ≈ 295 ops/byte; at S≈4k that's ~7× over the ridge). K/V bandwidth savings don't reduce latency on the attention kernel itself.
- HBM isn't constraining concurrency at the public shape (per-request peak ~45 MB vs 80 GB available), so K/V compression's capacity multiplier is ~1.0-1.2× rather than the 3.5× a per-tensor compression ratio would suggest.
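Both bullets reduce to short arithmetic. The sketch below assumes the H100 SXM5 datasheet peaks (about 989 TFLOP/s dense BF16 and 3.35 TB/s HBM3), takes `I(S) = S/2` from the text as given, and models the capacity multiplier as shrinking only the K/V share of the working set by the NVFP4 ratio; that last model is my assumption about how the ~1.0-1.2× range arises.

```python
PEAK_BF16_FLOPS = 989e12  # H100 SXM5 dense BF16 peak, datasheet value (assumed)
HBM_BANDWIDTH = 3.35e12   # H100 SXM5 HBM3 bandwidth in bytes/s, datasheet value (assumed)


def ridge_point(peak_flops, bandwidth):
    """Arithmetic intensity (ops/byte) at which a kernel becomes compute-bound."""
    return peak_flops / bandwidth


def attention_intensity(seq_len):
    """I(S) = S / 2 ops/byte, the estimate used in the text."""
    return seq_len / 2


def capacity_multiplier(kv_fraction, compression_ratio):
    """Concurrency gain when only the K/V share of the working set shrinks."""
    saved = kv_fraction * (1.0 - 1.0 / compression_ratio)
    return 1.0 / (1.0 - saved)


ridge = ridge_point(PEAK_BF16_FLOPS, HBM_BANDWIDTH)
over_ridge = attention_intensity(4020) / ridge
nvfp4_ratio = 16 / 4.5  # BF16 bits over NVFP4 bits per element

print(f"ridge ~ {ridge:.0f} ops/byte, attention at S~4k is {over_ridge:.1f}x over it")
for frac in (0.18, 0.22):
    mult = capacity_multiplier(frac, nvfp4_ratio)
    print(f"K/V share {frac:.0%}: capacity multiplier ~ {mult:.2f}x")
```

Under the same `I(S) = S/2` estimate, attention would only become bandwidth-bound below roughly S ≈ 2 × ridge, far shorter than the public-config sequences.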

The direction has a clear case if either (a) internal HSTU runs at long sequence lengths (S ≥ 20k) where per-request working set crosses 200+ MB and concurrency becomes HBM-bound, or (b) B200 migration is on the roadmap (native FP4 Tensor Core MMA closes the latency gap entirely). Both are mentioned in `generative_recommenders/ops/triton_aot/README.md` as future directions.

## Three calibration questions

1. **Does internal HSTU run at materially longer sequence lengths than the public `dlrm_v3` streaming/movielens-large config (uih ≈ 1k, candidates ≈ 2k, S ≈ 4k)?** Is S ≥ 20k a production target anywhere? That's the regime where FP4 K/V compression's HBM-pressure relief actually buys concurrency.

2. **Is B200 migration on the team's near-term roadmap?** Native FP4 Tensor Core MMA would turn the K/V compression direction from a borderline H100 capacity tweak into a real latency win.

3. **Is the `uvqk`-post-attention-eviction observation in section 2 above already addressed in the internal stack?** I'm fairly confident in the analysis but would like to verify before treating it as a free win.

Happy to share the profiling scripts (nsys/ncu wrappers, a synthetic STULayer driver, and the storage-relationship verification) if reproducing any of this would be useful.
