Closed
Changes from all commits
581 commits
f8a20f6
[Standup] Convert step 5 to python (#480)
maugustosilva Oct 29, 2025
0f65966
Initial Integration of WVA (#481)
Vezio Oct 29, 2025
ef9eb94
GuideLLM v0.4.0 Enablement (#479)
sjmonson Oct 29, 2025
7fabfa1
Fix: Quote annotations (#483)
sjmonson Oct 30, 2025
f9b521e
fix version of modelservice (#486)
kalantar Oct 30, 2025
2fd4a3f
Allow import and conversion of all runs in GuideLLM results (#487)
namasl Oct 31, 2025
3b833c6
Add ability to set bounds in config explorer library (#484)
namasl Oct 31, 2025
dcd65fe
fix api handling (#488)
Vezio Oct 31, 2025
8845a44
Add more GuideLLM parameters to DataFrame (#490)
namasl Nov 2, 2025
ec19466
Enable bounds in i/o length in config UI (#491)
jgchn Nov 3, 2025
d60d8a5
Remove ISL/OSL binning hack now that config explorer UI supports boun…
namasl Nov 3, 2025
86f346d
Allow symbolic link recursion for Python versions that support it (#493)
namasl Nov 3, 2025
7e754f2
[Run] Removed `fmperf` as a supported harness (#498)
maugustosilva Nov 6, 2025
2ac07a5
Add warmup to workload profile, disable prefix caching where not eval…
namasl Nov 7, 2025
014c9b3
Revert unsupported warmup, move no-enable-prefix-caching to correct s…
namasl Nov 7, 2025
7ae4141
[Standup] Fix for issue 495 (#501)
maugustosilva Nov 7, 2025
ade556e
Initializing KubeCon tutorial (#497)
jgchn Nov 10, 2025
1652027
[Standup] bug fix for failed k8s context in setup/functions.py (#503)
mengmeiye Nov 10, 2025
526b9c0
[KubeCon NA 2025] Add a basic getting-started tutoral (#504)
sjmonson Nov 11, 2025
bb7e27e
[Admin] Update owners file (#505)
maugustosilva Nov 11, 2025
6e1cd0c
Restore `get_image` in `functions.sh` (#507)
maugustosilva Nov 11, 2025
2c66ae0
Fill in PD tutorial (#508)
jgchn Nov 11, 2025
70e6404
Add more details to PD tutorial (#509)
jgchn Nov 11, 2025
c7ce070
Kubecon tutorial (#510)
jjk-g Nov 11, 2025
9b56c90
Re-add Kubernetes as a dependency on pod (#512)
maugustosilva Nov 11, 2025
15a2c68
[fix] Use -p/--namespace when kubeconfig context has no namespace set…
petecheslock Nov 12, 2025
7b7c595
Allow configuration explorer to import benchmark reports with limited…
namasl Nov 12, 2025
06dc3b8
Benchmark tweaks (#515)
namasl Nov 12, 2025
64f0dac
[Standup] Added an example of profiling (with nsys) for standalone (#…
maugustosilva Nov 13, 2025
061a0c3
Add LLMDBENCH_VLLM_STANDALONE_ENABLE_SLEEP_MODE env. to make sleep/wa…
manoelmarques Nov 13, 2025
a572a21
Add GPU persistence mode to benchmark report (#520)
manoelmarques Nov 14, 2025
9f2e030
use newer modelservice (#521)
kalantar Nov 17, 2025
88613e3
[Standup/Run] Significant (python) code refactor and bugfixes (#519)
maugustosilva Nov 18, 2025
f06d0ac
minor grep change for hf secret (#523)
effi-ofer Nov 18, 2025
7595c7d
[Standup] Add a "standalone" test for pre-merge CI/CD (Kind) (#525)
maugustosilva Nov 19, 2025
a5f1f94
Add convert to benchmark report for InferenceMAX (#528)
namasl Nov 20, 2025
d31b305
Updates in preparation for release 0.4 (#527)
maugustosilva Nov 20, 2025
0be84a4
Add Inferencemax (#529)
mengmeiye Nov 21, 2025
c76e78a
[Standup] Add feature for early detection of pod crash (#526)
mengmeiye Nov 21, 2025
e91d285
Added "tiered-prefix-cache" well-lit path (#530)
maugustosilva Nov 21, 2025
b5265a8
Update conversion to match GuideLLM's updated schema (#533)
namasl Nov 21, 2025
14dab49
[Run] Better support for "interactive run" (#532)
maugustosilva Nov 21, 2025
47e4d85
Enable Deploying One or More Harness Pods (#531)
Vezio Nov 21, 2025
7d2f026
Preparing release 0.4 (#535)
maugustosilva Nov 21, 2025
3b66bd8
Add load parallelism to benchmark report (#534)
namasl Nov 21, 2025
8cc07b0
Fix capacity planner bug on unknown inference kv cache type for quant…
jgchn Nov 24, 2025
ed4262a
Fix new command line option -j/--parallelism (#536)
maugustosilva Nov 24, 2025
bf8fb01
[Standup] use the llm-d-benchmark image to on the job to download mod…
maugustosilva Nov 24, 2025
16cd8ff
Add detection of init:CrashLoopBackOff for pods (#539)
mengmeiye Nov 24, 2025
04802fa
Bugfix pod crash detection (#541)
mengmeiye Nov 25, 2025
9ea412d
move load_kube_config out of wait_for_pods_created_running_ready() (#…
mengmeiye Nov 25, 2025
0d4351a
spyre fixes (#542)
kalantar Nov 25, 2025
0094f6b
[Standup] Additional fixes to allow standup to use non-gpu accelerato…
maugustosilva Nov 26, 2025
e9ffcdd
[Standup] Added a new cpu-only (NOT simulated) example (#545)
maugustosilva Nov 26, 2025
beeb141
[Standup] In case *.gateway.networking.k8s.io CRDs are found, do not …
maugustosilva Dec 2, 2025
cea7f28
handle deploy methods when vllm name contains modelservice or standal…
dmitripikus Dec 2, 2025
0a59726
Add experiment ID to benchmark report (#550)
namasl Dec 2, 2025
1abeb83
fix port selection when deploy method is a vLLM pod (#548)
deanlorenz Dec 2, 2025
87e54bb
standup.sh & teardown.sh for non-cluster-level-admin users (#546)
NaomiEisen Dec 3, 2025
07c7e36
[Standup] Functional deployments with both `istio` and `kgateway` (#551)
maugustosilva Dec 3, 2025
9f2600a
parallelism options (#554)
kalantar Dec 7, 2025
53c8110
[Standup] Serve models from read-only pvcs (#553)
maugustosilva Dec 8, 2025
bb26e74
accelerator update (#558)
kalantar Dec 8, 2025
6091db9
Small fix to have lists of models behave correctly in setup/functions…
galmasi Dec 8, 2025
6754f61
Fix virtual environment detection in install_deps.sh (#557)
sagearc Dec 9, 2025
151c0b1
[Standup] Improvements for deployments with NIXL (and Wide-EP) (#562)
maugustosilva Dec 10, 2025
21677d7
Add Prerequisites to README (#563)
NaomiEisen Dec 10, 2025
7b63180
[Standup] small fixes for spyre and adding `set_llmdbench_environment…
maugustosilva Dec 11, 2025
dac6929
delete app=llm-d-benchmark-harness pods (#560)
effi-ofer Dec 11, 2025
1309792
[Standup] Compatibility with the new `istio` (1.28.1) (#567)
maugustosilva Dec 13, 2025
4f7f6d0
Prepare yaml configuration for run_only.sh (#565)
deanlorenz Dec 15, 2025
e221583
[Enhancement] Upgrade WVA to v0.4.1 Release (#571)
Vezio Dec 15, 2025
534dd31
[Bug Fix] Issue in Prometheus Installation (#572)
Vezio Dec 16, 2025
d91c01b
run in sep. ns (#574)
Vezio Dec 17, 2025
895598a
fix comment (#575)
Vezio Dec 17, 2025
2ecbef6
[Standup] Restore the ability to dump all generate yamls (#577)
maugustosilva Dec 18, 2025
cd81ed6
wva docs (#576)
Vezio Dec 18, 2025
0ceb2aa
gaie config for wide-ep (#570)
kalantar Dec 18, 2025
9fdb095
Final touches for release v0.4 (#578)
maugustosilva Dec 19, 2025
75e4e98
Last minute fixes in convert.py (#579)
maugustosilva Dec 19, 2025
33f1108
Set all tags for release v0.4 (#580)
maugustosilva Dec 19, 2025
bba0d0d
add multi turn chat workload template for inference-perf (#584)
rshavitt Jan 3, 2026
aa00864
Add precise-prefix epp config (#582)
NaomiEisen Jan 3, 2026
1209f6b
Benchmark runner for an existing stack (#566)
dmitripikus Jan 5, 2026
9487086
GPU recommender using llm-optimizer's roofline analysis tool (#583)
jgchn Jan 5, 2026
dd994c2
Add standalone inference server launcher benchmark (#573)
manoelmarques Jan 5, 2026
f9c051d
[Run] fix broken vllm-benchmark (#585)
maugustosilva Jan 5, 2026
cf3aae1
CLI for config explorer (#586)
jgchn Jan 7, 2026
18150d0
Clarify warning message in analysis notebook (#591)
namasl Jan 7, 2026
cfd206c
CLI subcommand for gpu recommender engine (#592)
jgchn Jan 8, 2026
a42a26a
Reorganize Benchmark Report code, add Benchmark Report v0.2 (#593)
namasl Jan 8, 2026
5b3c7f4
Fix benchmark report imports inside nop-analyze_results.py (#594)
manoelmarques Jan 8, 2026
49e0f6e
Add GPURecommender example to config explorer (#595)
jgchn Jan 9, 2026
4ae748e
Fix up component validation (#596)
namasl Jan 9, 2026
9a7de28
Clarify authentication error message in GPU recommender UI (#598)
jgchn Jan 9, 2026
3c87816
Tighten up config explorer README (#599)
namasl Jan 9, 2026
95536ee
Fix typo in README (#600)
namasl Jan 9, 2026
8c82f0b
This is large but well tested - **all** `examples` and most `guides` …
maugustosilva Jan 9, 2026
3dd365a
Fix launcher arguments when deploying vllm standalone (#603)
manoelmarques Jan 12, 2026
0ce16cc
[Standup] Remove `v0` as the fixed version when deploying as non-admi…
maugustosilva Jan 12, 2026
de42f2f
Add envars for expt times in ISO-8601 (#605)
namasl Jan 13, 2026
1bd775d
Add UID envar to harness pod (#606)
namasl Jan 13, 2026
d8605a0
[Run] Add support for pull secret in pods created by modelservice (#607)
maugustosilva Jan 13, 2026
ce931c8
Preserve parent process env. variables on call to benchmark-report (#…
manoelmarques Jan 13, 2026
c3416bc
[WVA] Additional Documentation on Running Experiments (#609)
Vezio Jan 14, 2026
0f121a8
Benchmark report code fixes and cleanup (#610)
namasl Jan 15, 2026
bfabbf7
Add ev argument to get_random_node_port (#612)
manoelmarques Jan 16, 2026
73b3c2e
[Standup] Automatically detected the presence of rdma (roce_gdr/ib) (…
maugustosilva Jan 19, 2026
4cc57ad
[Standup] Smoketest is now retriable. (#615)
maugustosilva Jan 19, 2026
9b70a72
Fix handling of potential null values in configuration explorer impor…
namasl Jan 20, 2026
e6a3715
Include creation of benchmark report v0.2 in harness pod. (#613)
namasl Jan 22, 2026
08fd6c2
Add support for latest benchmarking from vllm (#619)
namasl Jan 22, 2026
7e65b8a
[Standup] Improvements for smoketest (#618)
maugustosilva Jan 22, 2026
25f9872
[Standup] Improvements for deployment of Spyre accelerators (#620)
maugustosilva Jan 23, 2026
13f32a9
Add cost configuration to GPU recommender (#617)
jgchn Jan 23, 2026
52fb9fa
Update GKE CI (#623)
jjk-g Jan 26, 2026
0282dcf
Estimate activation and intermediate memory in capacity planner (#622)
jgchn Jan 27, 2026
c7b4c63
small fix to detect conda virtual env (#624)
mengmeiye Jan 27, 2026
e4f1448
Add component health to benchmark report (#625)
namasl Jan 27, 2026
b24ce7a
Add inference scheduler config to v0.2 benchmark report (#626)
namasl Jan 28, 2026
c29a523
Loosen required package versions (#631)
namasl Jan 29, 2026
cfc21a0
[Standup] Add all rendered environment variables as a configmap (#630)
maugustosilva Jan 30, 2026
f94c2cb
Fix inference-perf std -> std_dev (#633)
jjk-g Jan 30, 2026
598f11f
Check for OSL of `None` or `<1` in workload when creating benchark re…
namasl Feb 2, 2026
2efad69
add random concurrent workload template for inference-perf (#635)
huaxig Feb 2, 2026
39da67d
Feat(run_only): Add cloud storage support and remove PVC dependency (…
huaxig Feb 3, 2026
7a17717
Use parameters ConfigMap for benchmark report stack details (#634)
namasl Feb 3, 2026
0d181c8
Run_only handle disconnects (#636)
deanlorenz Feb 3, 2026
9b7c801
[Run] Capture the logs for all pods (from llm-d stack) at the end (#638)
maugustosilva Feb 3, 2026
f5f2d9a
[Run] Add the ability to upload results to object storage bucket (#639)
maugustosilva Feb 4, 2026
69394e3
Adding namespace in pods lookup kubectl command (#640)
gushob21 Feb 4, 2026
c33358d
Minor fix in config-explorer UI (#644)
jgchn Feb 6, 2026
91c8dbe
add log verbosity for gaie epp deployment (#643)
mengmeiye Feb 7, 2026
4b1afb7
Update inference-perf (#645)
jjk-g Feb 7, 2026
79b113f
[Standup] Enable the use of gateway provided by RHOAI (#646)
maugustosilva Feb 7, 2026
128b6fb
Fix: Quote path variables in shell scripts to handle space characters…
vknaik Feb 10, 2026
93646d4
[Run] Move the benchmark report generation to analysis. (#647)
maugustosilva Feb 10, 2026
70534bd
[Standup] Automatically add several VLLM-specific environment variabl…
maugustosilva Feb 11, 2026
927b4dc
Simplify usage of JSON schema generation (#651)
namasl Feb 12, 2026
ec67b89
Add Benchmark Report v0.2 JSON schema (#652)
namasl Feb 12, 2026
a2cd9d5
🌱 Standardize governance workflows via llm-d-infra (#650)
clubanderson Feb 12, 2026
d2b4fc4
🌱 Add typos config and Dependabot for automated dependency updates (#…
clubanderson Feb 12, 2026
f6ab588
🌱 Remove redundant auto-assign workflow (#659)
clubanderson Feb 12, 2026
26f653d
deps(actions): bump google-github-actions/auth from 2.1.12 to 3.0.0 (…
dependabot[bot] Feb 12, 2026
9e81e20
deps(actions): bump google-github-actions/setup-gcloud (#657)
dependabot[bot] Feb 12, 2026
040fea7
deps(actions): bump actions/upload-artifact from 4 to 6 (#658)
dependabot[bot] Feb 12, 2026
380cf49
deps(docker): bump python in /build (#660)
dependabot[bot] Feb 12, 2026
85b4a61
deps(actions): bump actions/setup-python from 5 to 6 (#655)
dependabot[bot] Feb 12, 2026
7b1b1de
support inference scheduling flags (#662)
kalantar Feb 12, 2026
20367d0
Lazy load kubernetes in benchmark report convert script (#663)
namasl Feb 13, 2026
e9662a7
Update benchmark report README (#664)
namasl Feb 13, 2026
65a9743
🐛 Fix broken reusable workflow references (#667)
clubanderson Feb 13, 2026
02ab48e
add custom dataset profile for vllm-benchmark and add request_timeout…
mengmeiye Feb 13, 2026
429dc38
🐛 Add failure diagnostics to nightly benchmark workflows (#670)
clubanderson Feb 13, 2026
8405a32
🐛 Allow accelerator count=0 in benchmark report schema for simulator …
clubanderson Feb 13, 2026
ef2a8b9
fix: build nightly image from main instead of using stale release tag…
clubanderson Feb 13, 2026
fbcaec5
🐛 Split image build into separate job on ubuntu-latest (#676)
clubanderson Feb 13, 2026
60d91f8
🐛 Revert Python to 3.13 — vllm requires <3.14 (#677)
clubanderson Feb 13, 2026
afca635
🐛 Build nightly image for amd64 only (fix arm64 scikit_build_core) (#…
clubanderson Feb 13, 2026
270a4ab
🐛 Fix ZeroDivisionError when vllm-benchmark has 0 completions (#679)
clubanderson Feb 14, 2026
ecd6184
🐛 Add connectivity wait to vllm-benchmark harness (#680)
clubanderson Feb 14, 2026
18ea0f9
🌱 Remove legacy typo and link checker workflows (#681)
clubanderson Feb 14, 2026
812fcf0
✨ Add GitHub Agentic Workflows for typo, link, and upstream checks (#…
clubanderson Feb 16, 2026
9a3cc8b
Signed-off-by: Dean H Lorenz <dean@il.ibm.com> (#649)
deanlorenz Feb 16, 2026
95645ac
fix: Quote path variables and fix config_explorer directory path reso…
vknaik Feb 16, 2026
9c27b4e
Downgrade the main image back to python 3.12.9 (#688)
maugustosilva Feb 16, 2026
edde2a3
add NUM_CPU_BLOCKS to vllm command (#687)
shashwatj07 Feb 16, 2026
1331bbe
Run only token (#686)
deanlorenz Feb 16, 2026
4370f33
Workload Variant Autoscaler Version Upgrade (#689)
Vezio Feb 16, 2026
d8a3834
Fix WVA NS Override (#690)
Vezio Feb 16, 2026
b70cf76
✨ Add CKS nightly benchmark workflow (#694)
clubanderson Feb 17, 2026
0259867
[Standup] Functional wide-ep-lws standup (on Openshift) (#692)
maugustosilva Feb 17, 2026
454ad3c
Remove readarray to support MAC (#696)
deanlorenz Feb 17, 2026
c1aaadd
fix redundant volume and volume mounts in the standalone yaml file (#…
mengmeiye Feb 17, 2026
516d91b
Allow for all fails reports from Inference Perf in Benchmark Report v…
namasl Feb 17, 2026
4a3a6de
🐛 Account for preemptible GPUs in CKS benchmark simulator fallback (#…
clubanderson Feb 18, 2026
2f27ece
WVA Version RC Bump (#703)
Vezio Feb 18, 2026
bcc969a
[Standup] Additional fixes for end-to-end wide-ep-lws deployment (#704)
maugustosilva Feb 19, 2026
86886a2
deps(actions): bump actions/download-artifact from 6.0.0 to 7.0.0 (#705)
dependabot[bot] Feb 20, 2026
2e5939b
deps(docker): bump python in /build (#706)
dependabot[bot] Feb 20, 2026
92041f2
deps(actions): bump actions/checkout from 4 to 6 (#707)
dependabot[bot] Feb 20, 2026
311db72
deps(actions): bump github/gh-aw from 0.45.0 to 0.46.2 (#708)
dependabot[bot] Feb 20, 2026
a1d0ce3
[Standup] Allow per-pod VLLM cli values. (#710)
maugustosilva Feb 20, 2026
a2aa0f8
Update config explorer tests (#711)
jgchn Feb 20, 2026
f3a92eb
allow routing sidecar to be disabled (#709)
kalantar Feb 20, 2026
77bf659
Add missing percentiles in vllm bench conversion (#712)
namasl Feb 20, 2026
58b18df
change dockerfile base image to 3.13 (#720)
mengmeiye Feb 23, 2026
1bfd9e4
[Standup] Allow preprocess to automatically tag model serving pods (#…
maugustosilva Feb 23, 2026
356b8e0
Update safetensor metadata retrieval (#723)
jgchn Feb 24, 2026
e724935
[Standup] Allow variables defined by `LLMDBENCH_VLLM_COMMON_ENVVARS_T…
maugustosilva Feb 25, 2026
0e1ca83
[Standup] Standardize the use non-default service accounts all steps …
maugustosilva Feb 25, 2026
53c3721
deps(actions): bump actions/checkout from 5.0.1 to 6.0.2 (#729)
dependabot[bot] Feb 26, 2026
30e0a12
deps(actions): bump github/gh-aw from 0.46.2 to 0.50.4 (#730)
dependabot[bot] Feb 26, 2026
2204fe8
Fill in stack details from ev.yaml if ConfigMap unavailable (#727)
namasl Feb 26, 2026
98d3414
Fix stray parenthesis (#731)
namasl Feb 26, 2026
87ab01b
Add architecture-aware activation memory estimation to capacity plann…
jgchn Feb 26, 2026
e98db5a
[Run] Remove a few environment variables from harness pods (#733)
maugustosilva Feb 26, 2026
1845a3c
Fix path handling for directories with spaces in run.sh and functions…
vknaik Feb 27, 2026
6639ce1
fixes rayon thread issue (#735)
Vezio Feb 27, 2026
a32079f
Priority Class Name Implementation (#737)
Vezio Feb 27, 2026
42c1ad8
Provide blank filler data when envar undefined (#738)
namasl Feb 27, 2026
32a73aa
fix default val (#739)
Vezio Feb 27, 2026
b9fcc8c
Bump GAIE chart version to v1.3.0 (#740)
Vezio Feb 27, 2026
b86879a
add pod monitor support and collect metrics data (#734)
mengmeiye Feb 27, 2026
8a4177c
✨ Add upstream auto-fix agentic workflow (#745)
clubanderson Mar 2, 2026
85ada2a
✨ Replace agentic workflow with Copilot SWE agent assignment (#747)
clubanderson Mar 2, 2026
ddaed37
🐛 Restore permissions needed for Copilot SWE agent assignment (#750)
clubanderson Mar 2, 2026
dd32ea3
🐛 Use COPILOT_PAT secret for Copilot SWE agent assignment (#753)
clubanderson Mar 2, 2026
f418aaa
📖 Populate upstream dependency version tracking (#744)
clubanderson Mar 2, 2026
bc0f04d
[Standup] fixes for pd-disaggregation (#756)
maugustosilva Mar 2, 2026
509755a
Remove model-storage volume mount from script (#758)
maugustosilva Mar 3, 2026
7a3fb26
reorg admin level deps (#743)
Vezio Mar 3, 2026
7f2b707
deps(actions): bump actions/upload-artifact from 6.0.0 to 7.0.0 (#765)
dependabot[bot] Mar 5, 2026
e89c2df
deps(actions): bump actions/download-artifact from 7.0.0 to 8.0.0 (#766)
dependabot[bot] Mar 5, 2026
c6cf622
deps(actions): bump github/gh-aw from 0.50.4 to 0.53.2 (#767)
dependabot[bot] Mar 5, 2026
7593b1a
Auto calculate max-model-len (#774)
jgchn Mar 6, 2026
dcadf39
Sync gh-aw workflows from llm-d-infra (a566d16) (#770)
clubanderson Mar 6, 2026
135f61a
Bump version 0.5.0 (#777)
maugustosilva Mar 6, 2026
111c6ff
🌱 Remove per-repo gh-aw typo/link/upstream workflows (#778)
clubanderson Mar 6, 2026
015ff59
⬆️ Bump yq from v4.45.4 to v4.45.5 (#748)
github-actions[bot] Mar 9, 2026
d2e2335
Fix logs for new vllm on nop harness (#781)
manoelmarques Mar 10, 2026
adf7d03
Add memory and cache metrics #2 (#742)
DolevAdas Mar 11, 2026
43aa522
[Experimental] Add a new production trace replay for real-world multi…
achandrasekar Mar 11, 2026
b9e84d4
Update GAIE InferencePool v1.3.0 to v1.3.1 (#830)
diegocastanibm Mar 11, 2026
08dee11
fix the bug where the metrics data failed to collect sometimes (#834)
mengmeiye Mar 11, 2026
d369a2e
update istio (#840)
diegocastanibm Mar 11, 2026
77c7ef8
update vllm (#837)
diegocastanibm Mar 11, 2026
aac309a
update yq (#836)
diegocastanibm Mar 11, 2026
90504a6
update inferecemax (#835)
diegocastanibm Mar 11, 2026
310f66c
update kgateway (#839)
diegocastanibm Mar 12, 2026
8bf152b
update helmfile to v1.4.1 (#832)
diegocastanibm Mar 12, 2026
9780bfb
update wva (#838)
diegocastanibm Mar 12, 2026
2e3c6e2
update inference-perf (#833)
diegocastanibm Mar 12, 2026
b503119
v0.5.3 tagged release (#831)
diegocastanibm Mar 12, 2026
de271e7
[Standup] Add the ability to use initContainers. (#851)
maugustosilva Mar 17, 2026
bc303fe
[Standup] Additional fixes (accelerator automatic selection) (#852)
maugustosilva Mar 18, 2026
5b6423c
🌱 Add missing governance files per CNCF audit (#783)
clubanderson Mar 18, 2026
2515dbe
Feat/small cluster config (#853)
michael-desmond Mar 20, 2026
7f11460
[Standup] Consolidate all sim scenarios (with small gateway pod) (#856)
maugustosilva Mar 20, 2026
c9a86bf
Fix metrics scrape (#854)
mengmeiye Mar 23, 2026
8ec5178
Fix standalone preprocess env. variable (#860)
manoelmarques Mar 23, 2026
3d83e02
Epp log scrape (#855)
mengmeiye Mar 23, 2026
05d7ed5
[Run] Add --repeat flag to repeat experiments N times with aggregatio…
jia-gao Mar 24, 2026
bb42822
remove accessLogging for helm chart schema validation error (#861)
mengmeiye Mar 25, 2026
30fe5a8
Add 'src/config_explorer/' from commit 'bb4282221d3e6a8623530a5420a03…
namasl Mar 26, 2026
d45edb4
Remove refs to benchmark report in config explorer
jgchn Mar 27, 2026
ecd96db
Merge pull request #122 from jgchn/conf-exp
namasl Mar 27, 2026
39b1a5d
Remove stale git/GitHub files
namasl Mar 27, 2026
2c647cf
Merge remote-tracking branch 'upstream/main' into HEAD
jgchn Mar 27, 2026
f02e937
Temporarily exclude src/config_explorer/ from ruff
namasl Mar 27, 2026
1 change: 1 addition & 0 deletions pyproject.toml
@@ -64,6 +64,7 @@ exclude = [
"dist",
"generated_configs",
"logs",
"src/config_explorer",
]

[tool.ruff.lint]
806 changes: 806 additions & 0 deletions src/config_explorer/Capacity_Planner.py

Large diffs are not rendered by default.

121 changes: 121 additions & 0 deletions src/config_explorer/README.md
@@ -0,0 +1,121 @@
# Configuration Explorer

The configuration explorer is a library that helps find the most cost-effective configuration for serving models on llm-d, based on hardware specifications, workload characteristics, and SLO requirements. A CLI and a web-app front end make the library usable out of the box.

Features include:

- **Capacity planning**:
- Get per-GPU memory requirements to load and serve a model, and compare parallelism strategies.
- Determine KV cache memory requirements based on workload characteristics.
  - Estimate peak activation memory, CUDA graph overhead, and non-torch memory for accurate capacity planning (see [empirical results for intermediate memory](./empirical-vllm-memory-results.md)).
- **GPU recommendation**:
- Recommend GPU configurations using BentoML's llm-optimizer roofline algorithm.
- Analyze throughput, latency (TTFT, ITL, E2E), and concurrency trade-offs across different GPU types.
- Export recommendations in JSON format for integration with other tools.

Core functionality currently ships as a Python module within `llm-d-benchmark`; depending on community interest, it may later be published as a separate package.

## Installation

**Requires Python 3.11+**

1. (optional) Set up a Python virtual environment

```bash
python -m venv .venv
source .venv/bin/activate
```

2. Install the `config_explorer` Python module after cloning the `llm-d-benchmark` repository.

```bash
git clone https://github.com/llm-d/llm-d-benchmark.git
cd llm-d-benchmark
pip install -e ./config_explorer
```

## Usage

## CLI

After installation, the `config-explorer` command will become available:

```bash
# Run capacity planning
config-explorer plan --model Qwen/Qwen2.5-3B --gpu-memory 80 --max-model-len 16000

# Run GPU recommendation and performance estimation (BentoML's roofline model)
config-explorer estimate --model Qwen/Qwen2.5-3B --input-len 512 --output-len 128 --max-gpus 8

# Human-readable output
config-explorer estimate --model Qwen/Qwen2.5-3B --input-len 512 --output-len 128 --pretty

# Override GPU costs with custom pricing
config-explorer estimate --model Qwen/Qwen2.5-3B \
--input-len 512 --output-len 128 \
--custom-gpu-cost H100:30.50 \
--custom-gpu-cost A100:22 \
--custom-gpu-cost L40:25.00 \
--pretty

# Start the Streamlit web app
pip install -r requirements-streamlit.txt # one-time installation (run from config_explorer/ dir)
config-explorer start

# Get help
config-explorer --help
```

## Web Application

A Streamlit front end showcases the capabilities of the Configuration Explorer in a more intuitive way. It has additional requirements that must be installed first.

After installing them (`pip install -r requirements-streamlit.txt`), start the web app with:
```bash
cd config_explorer # must run from within the config_explorer directory
config-explorer start
```

### Pages

The Streamlit frontend includes the following pages:

1. **Capacity Planner** - Analyze GPU memory requirements and capacity planning for LLM models
2. **GPU Recommender** - Get optimal GPU recommendations based on model and workload requirements

### Using the GPU Recommender

The GPU Recommender page helps you find the optimal GPU for running LLM inference. To use it:

1. **Configure Model**: Enter a HuggingFace model ID (e.g., `meta-llama/Llama-2-7b-hf`)
2. **Set Workload Parameters**:
- Input sequence length (tokens)
- Output sequence length (tokens)
- Maximum number of GPUs
3. **Define Constraints (Optional)**:
- Maximum Time to First Token (TTFT) in milliseconds
- Maximum Inter-Token Latency (ITL) in milliseconds
- Maximum End-to-End Latency in seconds
4. **Run Analysis**: Click the "Run Analysis" button to evaluate all available GPUs
5. **Review Results**:
- Compare GPUs through interactive visualizations
- Examine throughput, latency metrics, and optimal concurrency
- View detailed analysis for each GPU
6. **Export**: Download results as JSON or CSV for further analysis

The GPU Recommender uses BentoML's llm-optimizer roofline algorithm to provide synthetic performance estimates across different GPU types, helping you make informed decisions about hardware selection.
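The roofline idea itself is simple to sketch: each phase of inference is bounded either by compute (FLOPs over peak FLOP/s) or by memory traffic (bytes moved over bandwidth), and the larger of the two bounds wins. The sketch below is illustrative only, not llm-optimizer's implementation; the model size and hardware peaks are assumed numbers.

```python
def roofline_time_s(flops: float, bytes_moved: float,
                    peak_flops: float, mem_bw: float) -> float:
    """Time lower bound: the slower of the compute-bound and memory-bound estimates."""
    return max(flops / peak_flops, bytes_moved / mem_bw)

# Single decode step for a hypothetical 8B-parameter fp16 model on an
# H100-class GPU (assumed peaks: ~1e15 FLOP/s, ~3.35e12 B/s HBM bandwidth).
flops = 2 * 8e9        # ~2 FLOPs per parameter per generated token
bytes_moved = 2 * 8e9  # weights read once per token at 2 bytes/param
t = roofline_time_s(flops, bytes_moved, 1e15, 3.35e12)
print(f"{t * 1e3:.2f} ms/token")  # memory-bound: dominated by weight traffic
```

Decode is memory-bound here (the bandwidth term dominates the compute term by two orders of magnitude), which is why the recommender weighs memory bandwidth heavily for latency-sensitive workloads.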

**Note**: You'll need a HuggingFace token set as the `HF_TOKEN` environment variable to access gated models.

### Cost Information

The GPU Recommender displays cost information to help you find cost-effective GPU configurations:

- **Default GPU Costs**: Built-in reference costs for common GPUs (H200, H100, A100, L40, etc.)
- **Custom Cost Override**: Specify your own GPU costs using any numbers you prefer (e.g., your actual $/hour or $/token pricing)
- **Cost-Based Sorting**: Sort results by cost to find the most economical option

**⚠️ IMPORTANT**: Default costs are **reference values for relative comparison only**. They do **NOT** represent actual pricing from any provider. Lower values indicate better value. Use custom costs that reflect your actual infrastructure pricing.
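The `NAME:VALUE` format accepted by `--custom-gpu-cost` is simple to parse; the sketch below is illustrative, not the CLI's actual implementation:

```python
def parse_gpu_cost(arg: str) -> tuple[str, float]:
    """Parse a NAME:VALUE pair such as 'H100:30.50' into (name, cost)."""
    name, _, value = arg.partition(":")
    if not name or not value:
        raise ValueError(f"expected NAME:VALUE, got {arg!r}")
    return name, float(value)

# Collect overrides from repeated --custom-gpu-cost flags
overrides = dict(parse_gpu_cost(a) for a in ["H100:30.50", "A100:22", "L40:25.00"])
print(overrides)
```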

## Library

For GPU recommender API usage see [./examples/gpu_recommender_example.py](./examples/gpu_recommender_example.py).
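The capacity planner's core arithmetic is easy to illustrate. The sketch below is not the module's actual API (the function name and signature are invented for illustration); it shows the kind of estimate involved: per-GPU weight memory is parameter count times bytes per parameter, split across tensor-parallel ranks.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int, tp: int = 1) -> float:
    """Per-GPU weight memory: parameters x precision, divided across TP ranks."""
    return num_params * bytes_per_param / 2**30 / tp

# Llama-3.1-8B (~8.03B parameters) in BF16 (2 bytes/param) on a single GPU:
print(round(weight_memory_gib(8.03e9, 2), 2))        # ~14.96 GiB
# The same model split across two tensor-parallel ranks:
print(round(weight_memory_gib(8.03e9, 2, tp=2), 2))  # ~7.48 GiB per GPU
```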
Empty file added src/config_explorer/__init__.py
Empty file.
50 changes: 50 additions & 0 deletions src/config_explorer/db.json
@@ -0,0 +1,50 @@
{
"AMD_INSTINCT_MI300X": {
"memory": 192,
"prefix": "MI300X"
},
"NVIDIA-H100-80GB-HBM3": {
"memory": 80,
"prefix": "H100"
},
"NVIDIA-A100-40GB": {
"memory": 40,
"prefix": "A100"
},
"NVIDIA-A100-80GB": {
"memory": 80,
"prefix": "A100"
},
"NVIDIA-H100-80GB": {
"memory": 80,
"prefix": "H100"
},
"NVIDIA-L40-40GB": {
"memory": 40,
"prefix": "L40"
},
"NVIDIA-RTX-4090": {
"memory": 24,
"prefix": "RTX4090"
},
"NVIDIA-RTX-5090": {
"memory": 32,
"prefix": "RTX5090"
},
"NVIDIA-RTX-6000": {
"memory": 48,
"prefix": "RTX6000"
},
"NVIDIA-A6000": {
"memory": 48,
"prefix": "A6000"
},
"NVIDIA-A4000": {
"memory": 16,
"prefix": "A4000"
},
"NVIDIA-T4": {
"memory": 16,
"prefix": "T4"
}
}
11 changes: 11 additions & 0 deletions src/config_explorer/db.py
@@ -0,0 +1,11 @@
"""
Mocks DB storing info about common accelerators used for LLM serving and inference
"""
import json
import os

gpu_specs = {}

_dir = os.path.dirname(os.path.abspath(__file__))
with open(os.path.join(_dir, "db.json")) as f:
gpu_specs = json.load(f)
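For a flavor of how this mapping might be consumed, here is a hedged sketch (the helper is illustrative, not part of the module; the two inlined entries mirror `db.json` so the snippet stands alone, whereas real callers would import `gpu_specs` from `config_explorer.db`):

```python
# Two entries copied from db.json so the sketch is self-contained
gpu_specs = {
    "NVIDIA-H100-80GB-HBM3": {"memory": 80, "prefix": "H100"},
    "NVIDIA-T4": {"memory": 16, "prefix": "T4"},
}

def gpus_with_at_least(min_gib: int) -> list[str]:
    """Return accelerator names with at least `min_gib` GiB of memory."""
    return [name for name, spec in gpu_specs.items() if spec["memory"] >= min_gib]

print(gpus_with_at_least(40))  # only the H100 entry qualifies
```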
179 changes: 179 additions & 0 deletions src/config_explorer/empirical-vllm-memory-results.md
@@ -0,0 +1,179 @@
# vLLM Empirical Memory Profiling Results

Test environment: H100 GPU (79.18 GiB), vLLM with FlashAttention, `VLLM_LOGGING_LEVEL=DEBUG`.

All tests use `--enable-prefix-caching --block-size=128`. Default `--gpu-memory-utilization=0.9` unless noted.

## Summary

| Model | Weights | Activation | Non-torch | CUDA Graph | KV Cache | TP | Util | max-model-len |
| ----- | ------- | ---------- | --------- | ---------- | -------- | -- | ---- | ------------- |
| gpt-oss-20b (MoE) | 13.47 | 7.38 | 0.13 | 0.39 | 50.28 | 1 | 0.9 | 16000 |
| gpt-oss-120b (MoE) | 64.38 | 7.38 | 0.13 | 1.03 | 3.33 | 1 | 0.9 | 16000 |
| Llama-3.3-70B-FP8 | 33.88 | 4.84 | 0.55 | -0.42 | 32.00 | 2 | 0.9 | 16000 |
| Llama-3.1-8B | 14.99 | 4.76 | 0.13 | -0.45 | 51.38 | 1 | 0.9 | 16000 |
| Qwen3-0.6B | 1.12 | 5.56 | 0.13 | 0.10 | 64.45 | 1 | 0.9 | 16000 |
| Qwen3-32B | 61.03 | 5.64 | 0.14 | -0.88 | 4.45 | 1 | 0.9 | 16000 |
| Qwen3-32B | 30.59 | 5.64 | 0.54 | -0.33 | 34.49 | 2 | 0.9 | 16000 |
| Mistral-Small-3.2-24B | 44.76 | 2.12 | 0.14 | -0.76 | 28.20 | 1 | 0.95 | 16000 |

All values in GiB. "Activation" = torch peak memory increase. "CUDA Graph" = memory change during graph capture (negative = freed).
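The KV cache column follows from the others: vLLM caps usable memory at total memory times `--gpu-memory-utilization`, and whatever the weights, activation, non-torch, and CUDA graph memory leave behind becomes KV cache. A quick check against two rows of the table, using the 79.18 GiB H100 from the test environment (small residuals are expected from rounding):

```python
def kv_cache_gib(total_gib, util, weights, activation, non_torch, cuda_graph):
    """KV cache budget: utilization-capped memory minus all other consumers."""
    return total_gib * util - weights - activation - non_torch - cuda_graph

# gpt-oss-20b row (measured: 50.28 GiB)
print(round(kv_cache_gib(79.18, 0.9, 13.47, 7.38, 0.13, 0.39), 2))  # ~49.89
# Qwen3-0.6B row (measured: 64.45 GiB)
print(round(kv_cache_gib(79.18, 0.9, 1.12, 5.56, 0.13, 0.10), 2))   # ~64.35
```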

### Failed Configurations

| Model | TP | Failure | Root Cause |
| ----- | -- | ------- | ---------- |
| Deepseek-R1 (FP8) | 1 | OOM during load | Weights exceeded single GPU; needs TP |
| Llama-3.3-70B-FP8 | 1 | No KV cache room | 67.72 GiB weights, -1.44 GiB remaining; use TP=2 |
| Qwen3-32B | 1 | No KV cache room | 61.03 GiB weights at max-model-len=32000; use TP=2 or reduce context |

## Key Patterns

**Activation memory is constant per model type** (independent of max-model-len and batch-size):
- Multimodal: ~2.1 GiB (vision encoder skips CUDA graph capture)
- Dense text-only: ~4.8-5.6 GiB
- MoE: ~7.4 GiB

**Non-torch memory** scales with TP: ~0.13 GiB (TP=1), ~0.55 GiB (TP=2).

**CUDA graph memory** ranges from -0.88 to +1.03 GiB. Negative values (memory freed) are common for large dense models.

**Activation is constant across context lengths**: Qwen3-0.6B at max-model-len=16000 and max-model-len=32000 both measured 5.56 GiB activation and 64.45 GiB KV cache.

## Per-Model Notes

### gpt-oss-20b / gpt-oss-120b (MoE)

- **Model:** openai/gpt-oss-20b, openai/gpt-oss-120b
- MoE models have the highest activation memory (~7.38 GiB) due to expert routing overhead
- gpt-oss-120b barely fits on a single H100 (64.38 GiB weights, only 3.33 GiB for KV cache)

### Llama-3.3-70B-FP8

- **Model:** RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic
- Requires TP=2 (67.72 GiB weights at TP=1 leaves no room for KV cache)
- At TP=2: 33.88 GiB weights per GPU, 32.0 GiB KV cache available

### Llama-3.1-8B

- **Model:** meta-llama/Llama-3.1-8B-Instruct
- Small footprint (14.99 GiB), generous KV cache (51.38 GiB)

### Qwen3-0.6B / Qwen3-32B

- **Models:** Qwen/Qwen3-0.6B, Qwen/Qwen3-32B
- Qwen3-0.6B: smallest model tested, 64.45 GiB KV cache available
- Qwen3-32B at TP=1: only 4.45 GiB KV cache (tight); TP=2 gives 34.49 GiB

### Mistral-Small-3.2-24B

- **Model:** mistralai/Mistral-Small-3.2-24B-Instruct-2506
- **Architecture:** Mistral3ForConditionalGeneration (multimodal / vision-language)
- **vLLM:** v0.11.0 (V1 engine), `--gpu-memory-utilization=0.95`, `--tokenizer-mode=mistral --config-format=mistral --load-format=mistral`
- **Notable:** Lowest activation memory measured (2.12 GiB), likely because vision encoder does not participate in CUDA graph capture

**Model architecture:** GQA, 40 layers, 32 attention heads, 8 KV heads, head_dim=128, hidden_size=5120

**KV cache validation** -- per-token formula matches vLLM exactly:

```
Per-token KV = num_layers x 2 x head_dim x num_kv_heads x dtype_bytes
= 40 x 2 x 128 x 8 x 2 = 163,840 bytes (160 KB/token)

vLLM empirical: 28.20 GiB / 184,832 tokens = 163,840 bytes/token (exact match)
```
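The arithmetic above can be replayed directly; all numbers come from this section's logs and tables:

```python
# Per-token KV bytes: layers x 2 (K and V) x head_dim x kv_heads x fp16 bytes
per_token = 40 * 2 * 128 * 8 * 2
print(per_token)                      # 163,840 bytes = 160 KiB/token

# vLLM reported 184,832 total KV tokens for this model
kv_gib = 184_832 * per_token / 2**30
print(round(kv_gib, 2))               # ~28.20 GiB, matching the log

# Maximum concurrency at the configured context length
max_concurrency = 184_832 / 16_000
print(round(max_concurrency, 2))      # ~11.55x
```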

**Live request validation** (15,049 tokens, measured via Prometheus /metrics):

| Metric | Measured | Expected |
| ------ | -------- | -------- |
| KV cache usage | 8.18% | 8.17% (118 blocks / 1,444 total) |
| Blocks allocated | 118 | ceil(15,049 / 128) = 118 |
| Prompt throughput | ~1,481 tok/s | -- |
| Prefix cache hit rate | 30% | -- |
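The block accounting is likewise easy to verify:

```python
import math

# 15,049 prompt tokens at --block-size=128
blocks = math.ceil(15_049 / 128)
print(blocks)                    # 118 blocks allocated

# Usage against the 1,444 total KV cache blocks
usage = blocks / 1_444
print(round(usage * 100, 2))     # ~8.17%, matching the measured 8.18%
```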

**Capacity planner accuracy** (before/after adding validated activation profiles):

| Metric | Before | After | vLLM Actual |
| ------ | ------ | ----- | ----------- |
| Activation estimate | 5.5 GiB | 2.5 GiB | 2.12 GiB |
| Available KV cache | 24.82 GiB | 27.82 GiB | 28.20 GiB |
| Error | -3.38 GiB | **-0.38 GiB** | -- |
| Max concurrent @16K | 10.2x | **11.4x** | 11.55x |

## How to Replicate

### Setup

Requirements: Kubernetes cluster with H100 GPU nodes, HuggingFace token secret.

Deploy a vLLM pod with `VLLM_LOGGING_LEVEL=DEBUG`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vllm-profiling
spec:
  restartPolicy: Never
  containers:
  - name: vllm
    image: vllm/vllm-openai:v0.11.0
    command: ["vllm", "serve"]
    args:
    - <model-name>                  # e.g. Qwen/Qwen3-32B
    - --tensor-parallel-size=<tp>   # 1 or 2
    - --gpu-memory-utilization=0.90
    - --max-model-len=16000
    - --block-size=128
    - --enable-prefix-caching
    - --host=0.0.0.0
    - --port=8000
    resources:
      requests:
        nvidia.com/gpu: "<tp>"      # must match tensor-parallel-size
      limits:
        nvidia.com/gpu: "<tp>"
    env:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef: { name: llm-d-hf-token, key: HF_TOKEN }
    - name: VLLM_LOGGING_LEVEL
      value: DEBUG
    - name: HF_HOME
      value: /tmp/cache
    volumeMounts:
    - { name: cache, mountPath: /tmp/cache }
  volumes:
  - { name: cache, emptyDir: {} }
```

Wait for "Application startup complete" in logs.

### Extract Metrics

Search the pod logs for these strings:

| Log substring | What it gives you |
| ------------- | ----------------- |
| `"Model loading took"` | Weight memory (GiB) and load time |
| `"torch peak memory increase"` | Activation memory (GiB) |
| `"non-torch forward increase memory"` | Non-torch memory (GiB) |
| `"Available KV cache memory"` | KV cache allocation (GiB) |
| `"Free memory on device"` | Total/free GPU memory at startup |
| `"GPU KV cache size"` | Total KV cache tokens and block count |
| `"Maximum concurrency for"` | Max concurrent requests at max-model-len |

### Validate KV Cache at Runtime

```bash
# Port-forward to the pod
kubectl port-forward pod/<name> -n <ns> 8000:8000 &

# Send a request and check metrics
curl -X POST localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model":"<model>","messages":[{"role":"user","content":"<long prompt>"}],"max_tokens":10}'

# Check KV cache usage
curl -s localhost:8000/metrics | grep kv_cache_usage_perc
```