@zoranzhao, I spent the past week profiling the public dlrm_v3 inference path on H100 SXM5 (against the AOT-T kernel) to understand HSTU's perf surface. Sharing measurements and a couple of observations below.
Measured baseline (H100 SXM5)
Setup: synthetic 5-layer STULayer stack matching dlrm_v3/configs.py defaults (hstu_num_heads=4, hstu_attn_qk_dim=128, hstu_attn_linear_dim=128, hstu_embedding_dim=512, num_layers=5). Ragged batch shape: B=32, avg uih_len=1024, num_candidates=2048, giving S_max ≈ 4020 and ~99k jagged tokens per iteration. BF16, dispatched via HammerKernel.TRITON.
Hardware: NVIDIA H100 80GB HBM3 SXM5 (Vast.ai), driver 580.126, CUDA 13.0.
| Metric | Value |
|---|---|
| Forward latency (50-iter avg) | 9.30 ms / iter |
| Throughput | 10.63 M tokens / sec |
| Top kernel mix (nsys) | _addmm_fwd 39.8% / _hstu_attn_fwd 39.1% / _ln_mul_dropout_fwd 9.2% / silu 7.9% / _weighted_layer_norm_fwd 3.6% |
| Per-batch padding waste | 23.2% at headline; 9.2% @ uih=256; 37.5% @ uih=4096 |
| ncu hardware counters | Not captured. Vast doesn't grant CAP_SYS_ADMIN, so SM occupancy, HBM BW, and Tensor Core util need a counter-enabled host. |
One thing stood out: _addmm_fwd (the UVQK projection) takes slightly more of the forward pass than the attention kernel itself (39.8% vs 39.1%). Any optimization plan that targets only the attention path misses ~40% of forward time.
Analytical activation HBM partition: per-request peak working set is ~45 MB at this shape. K/V together (the natural target for an FP4 follow-up) is ~18-22% of that. An 80 GB H100 can in principle host ~1700 concurrent peak-working-sets, so HBM isn't saturating concurrency at the public-config shape.
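The sizes quoted in this post can be reproduced with a few lines of arithmetic. The sketch below is a back-of-envelope check, not the profiling script. It assumes the uvqk tensor holds u and v at hstu_attn_linear_dim and q and k at hstu_attn_qk_dim per head, that a request is sized at S_max, and that HBM is counted in decimal GB. The 45 MB peak is taken from the text rather than derived.

```python
# Back-of-envelope activation sizes at the headline shape (BF16).
NUM_HEADS = 4
QK_DIM = 128                 # hstu_attn_qk_dim
LINEAR_DIM = 128             # hstu_attn_linear_dim
S_MAX = 4020                 # tokens in the longest request
BYTES_PER_ELEM = 2           # BF16
PEAK_WORKING_SET_MB = 45.0   # figure quoted in the text, not derived here
HBM_MB = 80_000.0            # 80 GB in decimal units (assumption)

# Assumed layout: u and v use the linear dim, q and k use the qk dim, per head.
uvqk_cols = NUM_HEADS * (2 * LINEAR_DIM + 2 * QK_DIM)
kv_cols = NUM_HEADS * (LINEAR_DIM + QK_DIM)

uvqk_mb = S_MAX * uvqk_cols * BYTES_PER_ELEM / 1e6
kv_mb = S_MAX * kv_cols * BYTES_PER_ELEM / 1e6
kv_share = kv_mb / PEAK_WORKING_SET_MB

# Upper bound only: ignores weights, embedding tables, and allocator overhead.
max_concurrent = int(HBM_MB // PEAK_WORKING_SET_MB)

print(f"uvqk: {uvqk_mb:.2f} MB, K+V: {kv_mb:.2f} MB ({kv_share:.1%} of peak)")
print(f"concurrent peak working sets (upper bound): {max_concurrent}")
```

Under these assumptions the uvqk size lands on the 16.47 MB figure used later in this post, and the K/V share falls at the low end of the 18-22% range.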
Two proposed Python-side improvements
1. Dynamic length-aware batching with CUDA Graph caching (~6-10 weeks)
Replace the static-batch path in generative_recommenders/dlrm_v3/inference/main.py runner.enqueue() / run_one_item() (lines 146-284) with a length-bucketed scheduler. Time-bounded batch formation T_form ≈ 100-200 µs; 4-8 seqlen buckets; CUDA Graph capture+replay for common (bucket × batch_size) shapes; LRU-capped at 32 entries (~1 GB GPU memory).
Recovers the measured 23-37% padding waste. Projected ~1.3-1.6× per-request latency improvement (roughly 1/(1 - waste)), depending on production UIH variance.
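A minimal sketch of the bucketing idea and of where the 1.3-1.6× projection comes from. Everything here is illustrative: the function names, the bucket edges, and the uniform length distribution are hypothetical, the time-bounded formation window and CUDA Graph cache are omitted, and the speedup bound assumes latency scales linearly with padded token count, which is only a first-order estimate.

```python
import bisect
import random

def static_batches(lengths, batch_size):
    """Cut the arrival-ordered stream into fixed-size batches."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]

def bucketed_batches(lengths, edges, batch_size):
    """Route each request to a seqlen bucket, then batch within the bucket."""
    buckets = {}
    for n in lengths:
        buckets.setdefault(bisect.bisect_left(edges, n), []).append(n)
    batches = []
    for idx in sorted(buckets):
        batches.extend(static_batches(buckets[idx], batch_size))
    return batches

def padding_waste(batches):
    """Fraction of token slots that are padding when each batch pads to its longest member."""
    padded = sum(max(b) * len(b) for b in batches)
    real = sum(sum(b) for b in batches)
    return 1.0 - real / padded

def ideal_speedup(waste):
    """First-order bound, assuming latency is proportional to padded token count."""
    return 1.0 / (1.0 - waste)

rng = random.Random(0)
lengths = [rng.randint(256, 4096) for _ in range(1024)]  # synthetic, not production data
edges = [512, 1024, 2048, 3072]                          # hypothetical upper edges, 5 buckets

waste_static = padding_waste(static_batches(lengths, 32))
waste_bucketed = padding_waste(bucketed_batches(lengths, edges, 32))
print(f"static: {waste_static:.1%} padding, bucketed: {waste_bucketed:.1%} padding")
print(f"bound at 23.2% waste: {ideal_speedup(0.232):.2f}x, at 37.5%: {ideal_speedup(0.375):.2f}x")
```

A real scheduler that replays CUDA Graphs would likely pad to the bucket's upper edge rather than to the longest member of each batch, so it would recover somewhat less than this sketch suggests.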
2. Post-attention uvqk eviction when KV caching is disabled (~1-2 weeks, surgical)
hstu_preprocess_and_attention returns (u, attn_output, k, v). The k, v outputs are views into the 16.47 MB uvqk tensor produced by the UVQK addmm in hstu_compute_uqvk (generative_recommenders/ops/hstu_compute.py:115-126). Because they're views, uvqk stays alive across the entire STULayer.forward(): through update_kv_cache, through hstu_compute_output, until the function returns.
When max_kv_caching_len=0 (which I observe is the production inference path; STULayer.forward is called, not cached_forward), k and v are never consumed downstream. update_kv_cache is a no-op on the prefill-disabled path, and hstu_compute_output reads only u, attn, and x. An explicit del k, v after the no-op call, or a signature change that doesn't return k, v when caching is disabled, lets the 16.47 MB uvqk allocation be freed post-attention.
I verified the storage relationship with a standalone PyTorch test (no Triton dependency): the torch.split outputs (u, v, q, k) share storage with the parent uvqk allocation, and after del uvqk in the local scope, reading from k still succeeds. That proves k's view alone is keeping the full 16.47 MB alive. This is ~37% of per-request peak working set, no Triton changes, no accuracy effect.
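The storage relationship can be reproduced along these lines. This is a sketch in the spirit of that test, not the exact script: it runs on CPU, uses a zero-filled stand-in for the projection output, assumes an even 512-column split per output, and relies on Tensor.untyped_storage(), which needs PyTorch 2.0 or newer.

```python
import torch

# Stand-in for the UVQK projection output at the headline shape:
# 4020 tokens x 2048 columns of BF16 = 16,465,920 bytes.
uvqk = torch.zeros(4020, 2048, dtype=torch.bfloat16)
parent_ptr = uvqk.untyped_storage().data_ptr()

# torch.split returns views into the parent tensor, not copies.
u, v, q, k = torch.split(uvqk, [512, 512, 512, 512], dim=1)
shares_storage = all(
    t.untyped_storage().data_ptr() == parent_ptr for t in (u, v, q, k)
)

# Drop every reference except k. The whole parent allocation stays reachable.
del uvqk, u, v, q
k_sum = k.float().sum().item()                  # reading from k still works
bytes_pinned_by_k = k.untyped_storage().nbytes()
k_own_bytes = k.numel() * k.element_size()

print(f"shares storage: {shares_storage}")
print(f"k needs {k_own_bytes} bytes but pins {bytes_pinned_by_k} bytes")
```

The gap between the bytes k needs and the bytes it pins is what an explicit del k, v (or not returning them) would release once attention has run.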
Question (3) below asks whether this is already known and addressed internally. It would surprise me if it wasn't.
An FP4 K/V compression direction I scoped but deferred
I originally scoped an NVFP4 (E2M1 elements with E4M3-FN block scale, group=16) K/V compression in the AOT-T Triton attention kernel as a primary contribution. The bit-budget math works (~4.5 bits/elem alone, ~5.1-5.3 bits/elem with a QJL residual recovery), Triton 3.3+ exposes tl.dot_scaled, and the existing AOT-T kernel is the natural integration point.
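For reference, the bit-budget arithmetic works out as follows. This only restates the numbers above: 4-bit E2M1 elements, one 8-bit E4M3 scale per group of 16, compared against BF16. The QJL figures are the quoted 5.1-5.3 bits/elem taken as given, not derived.

```python
ELEM_BITS = 4      # E2M1 element
SCALE_BITS = 8     # E4M3 block scale
GROUP = 16         # elements sharing one scale
BF16_BITS = 16

# Each element carries its own 4 bits plus 1/16 of the shared scale.
nvfp4_bits_per_elem = ELEM_BITS + SCALE_BITS / GROUP
compression_vs_bf16 = BF16_BITS / nvfp4_bits_per_elem

# With the QJL residual, using the quoted range as an input.
qjl_ratio_low = BF16_BITS / 5.3
qjl_ratio_high = BF16_BITS / 5.1

print(f"NVFP4: {nvfp4_bits_per_elem} bits/elem, {compression_vs_bf16:.2f}x vs BF16")
print(f"with QJL residual: {qjl_ratio_low:.2f}x to {qjl_ratio_high:.2f}x vs BF16")
```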
The measurement data made me uncertain it's worth the 7-8 weeks of kernel work at the public-config H100 shape:
- HSTU attention is compute-bound at the public dlrm_v3 shape (intensity I(S) = S/2 vs H100 SXM5 ridge ≈ 295 ops/byte; at S≈4k that's ~7× over the ridge). K/V bandwidth savings don't reduce latency on the attention kernel itself.
- HBM isn't constraining concurrency at the public shape (per-request peak ~45 MB vs 80 GB available), so K/V compression's capacity multiplier is ~1.0-1.2× rather than the 3.5× a per-tensor compression ratio would suggest.
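Both bullets reduce to short calculations. The sketch below assumes datasheet peaks for H100 SXM5 (about 989 TFLOPS dense BF16 and 3.35 TB/s HBM3) to get the ridge point, uses the I(S) = S/2 intensity model from the first bullet, and treats the capacity multiplier as the ratio of working-set sizes before and after compressing only the K/V share. The function names are illustrative.

```python
PEAK_BF16_FLOPS = 989e12   # assumed H100 SXM5 dense BF16 peak (datasheet)
PEAK_HBM_BW = 3.35e12      # assumed HBM3 bandwidth in bytes/s (datasheet)

ridge = PEAK_BF16_FLOPS / PEAK_HBM_BW   # ops/byte where compute and bandwidth balance

def intensity(seq_len):
    """Arithmetic intensity model from the text: I(S) = S / 2 ops/byte."""
    return seq_len / 2

def capacity_multiplier(kv_share, compression):
    """Extra concurrent requests when only the K/V share of the working set shrinks."""
    return 1.0 / (1.0 - kv_share * (1.0 - 1.0 / compression))

over_ridge = intensity(4020) / ridge
mult_low = capacity_multiplier(0.18, 3.5)
mult_high = capacity_multiplier(0.22, 3.5)

print(f"ridge: {ridge:.0f} ops/byte, S=4020 is {over_ridge:.1f}x over it")
print(f"capacity multiplier: {mult_low:.2f}x to {mult_high:.2f}x")
```

Because K/V is only about a fifth of the working set, even a 3.5× compression of that slice stays under 1.2× in total capacity.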
The direction has a clear case if either (a) internal HSTU runs at long sequence lengths (S ≥ 20k) where per-request working set crosses 200+ MB and concurrency becomes HBM-bound, or (b) B200 migration is on the roadmap (native FP4 GMMA closes the latency gap entirely). Both are mentioned in generative_recommenders/ops/triton_aot/README.md as future directions.
Three calibration questions
1. Does internal HSTU run at materially longer sequence lengths than the public dlrm_v3 streaming/movielens-large config (uih ≈ 1k, candidates ≈ 2k, S ≈ 4k)? Is S ≥ 20k a production target anywhere? That's the regime where FP4 K/V compression's HBM-pressure relief actually buys concurrency.
2. Is B200 migration on the team's near-term roadmap? Native FP4 GMMA would turn the K/V compression direction from a borderline H100 capacity tweak into a real latency win.
3. Is the uvqk-post-attention-eviction observation in section 2 above already addressed in the internal stack? I'm fairly confident in the analysis but would like to verify before treating it as a free win.
Happy to share the profiling scripts (nsys/ncu wrappers, a synthetic STULayer driver, and the storage-relationship verification) if reproducing any of this would be useful.