Update GB300 FP4 GLM5 low-latency sweep #175

Open
weireweire wants to merge 1 commit into NVIDIA:main from weireweire:gb300-fp4-glm5-lowlat-concurrency

@weireweire
Collaborator

Updates the GB300 FP4 GLM5 8k1k low-latency sweep to add a fifth decode point and align max-running-requests, CUDA graph batch sizes, and SA-Bench concurrencies.

Validation:

  • Parsed recipes/gb300-fp4/glm5.yaml with the Ruby YAML parser.
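
A parse check of this kind can be sketched in a few lines of Ruby. This is only an illustration of the validation step, not the command the author ran: the helper name `yaml_parses?` is hypothetical, and it checks YAML syntax only, not whether the recipe's keys or values are meaningful.

```ruby
require "yaml"

# Hypothetical helper: returns [true, nil] when the text is syntactically
# valid YAML, or [false, message] when the parser reports a syntax error.
# YAML.parse only builds the syntax tree, so no recipe semantics are checked.
def yaml_parses?(text)
  YAML.parse(text)
  [true, nil]
rescue Psych::SyntaxError => e
  [false, e.message]
end

# Usage against the recipe named in the PR (path relative to the repo root):
#   ok, err = yaml_parses?(File.read("recipes/gb300-fp4/glm5.yaml"))
#   abort("YAML error: #{err}") unless ok
```

A syntax-only parse catches indentation and quoting mistakes but will not notice a misspelled option name, so it is a weak form of validation for a sweep change.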

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@3523798). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #175   +/-   ##
=======================================
  Coverage        ?   65.10%           
=======================================
  Files           ?       67           
  Lines           ?     8217           
  Branches        ?        0           
=======================================
  Hits            ?     5350           
  Misses          ?     2867           
  Partials        ?        0           

@weireweire marked this pull request as ready for review on May 25, 2026, 03:35
@weireweire force-pushed the gb300-fp4-glm5-lowlat-concurrency branch from 80952de to 41d73ff on May 25, 2026, 03:36
functionstackx added a commit to SemiAnalysisAI/InferenceX that referenced this pull request May 29, 2026
…pology (#1583)

* glm5-fp4-gb300-dynamo-sglang: extend 8k1k low-lat sweep with 1p17d topology

Mirrors NVIDIA/srt-slurm#175: adds a 5th 8k1k_stp_lowlat_4 recipe with
decode_nodes/workers=17, and lowers per-zip-index decode
max-running-requests / cuda-graph-max-bs from a flat 4096 to
128/64/32/16/1 across lowlat_0..4. Benchmark concurrencies follow suit:
128/64/32/16/12. nvidia-master.yaml conc-list updated to match for each
of the five 1p{3,5,9,15,17}d entries.

* perf-changelog: set PR link to #1583

---------

Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
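
The per-index values quoted in the commit message above can be laid out side by side with a short Ruby sketch. The numbers are taken from that message; the hash keys, the helper names, and the assumption that lowlat_0..4 map to the 1p{3,5,9,15,17}d topologies in that order are illustrative and not confirmed by the source (only lowlat_4 with 17 decode nodes is stated explicitly).

```ruby
# Values quoted in the commit message; names below are illustrative only.
DECODE_NODES = [3, 5, 9, 15, 17].freeze
MAX_RUNNING_REQUESTS = [128, 64, 32, 16, 1].freeze # also used for cuda-graph-max-bs
BENCH_CONCURRENCY = [128, 64, 32, 16, 12].freeze

# Build one record per sweep point, pairing the lists by index.
# Assumes lowlat_i corresponds to the i-th topology in the order listed.
def lowlat_points
  DECODE_NODES.each_with_index.map do |nodes, i|
    {
      name: "8k1k_stp_lowlat_#{i}",
      topology: "1p#{nodes}d",
      max_running_requests: MAX_RUNNING_REQUESTS[i],
      cuda_graph_max_bs: MAX_RUNNING_REQUESTS[i],
      concurrency: BENCH_CONCURRENCY[i]
    }
  end
end

# Return the points whose benchmark concurrency differs from max-running-requests.
def concurrency_differs(points)
  points.reject { |pt| pt[:concurrency] == pt[:max_running_requests] }
end

lowlat_points.each do |pt|
  puts format("%-18s %-6s max-running=%-4d concurrency=%d",
              pt[:name], pt[:topology], pt[:max_running_requests], pt[:concurrency])
end
```

With the numbers as quoted, `concurrency_differs(lowlat_points)` returns only the last point, where the message gives 1 for max-running-requests and 12 for benchmark concurrency.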