feat: generator add benchmark mode by setting rule plugins #290
Open · Ethan-ES wants to merge 1 commit into main from etshen/generator_add_benchmark_mode
+131 −11
src/aiconfigurator/generator/rule_plugin/benchmark/sglang.rule (21 additions, 0 deletions)
```
prefill max_batch_size = (max_batch_size if max_batch_size else 1)
agg_decode max_batch_size = (max_batch_size if max_batch_size else 128)

agg_prefill_decode max_prefill_tokens = SlaConfig.isl + 1500
agg enable_mixed_chunk = true

agg_prefill_decode cuda_graph_batch_sizes = ((range(1, max_batch_size + 1) | list) if max_batch_size else [])

# GPUs per worker follow the same TP/PP/DP product that SGLang expects
agg_prefill_decode gpus_per_worker = (tensor_parallel_size or 1) * (pipeline_parallel_size or 1) * (data_parallel_size or 1)

agg_prefill_decode kv_cache_dtype = ("fp8_e4m3" if kv_cache_dtype == "fp8" else kv_cache_dtype)
prefill_decode kv_transfer_backend = (kv_transfer_backend if kv_transfer_backend else "nixl")

when (ModelConfig.prefix or 0) > 0:
    disable_prefix_cache = false
    DynConfig.enable_router = true

when (ModelConfig.nextn or 0) > 0:
    speculative_decoding_type = "NEXTN"
    speculative_num_steps = ModelConfig.nextn
```
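The rule expressions use a small Python/Jinja-like expression language (note the `range(...) | list` filter syntax). As a rough illustration of what the `cuda_graph_batch_sizes` rule computes, here is a plain-Python sketch; the standalone function and its name are mine, not part of the rule-plugin API:

```python
def cuda_graph_batch_sizes(max_batch_size):
    # Mirrors ((range(1, max_batch_size + 1) | list) if max_batch_size else []):
    # capture every batch size from 1..max_batch_size when one is set,
    # otherwise fall back to an empty list (no CUDA graphs captured).
    return list(range(1, max_batch_size + 1)) if max_batch_size else []

# cuda_graph_batch_sizes(4)    -> [1, 2, 3, 4]
# cuda_graph_batch_sizes(None) -> []
```

A falsy `max_batch_size` (`None` or `0`) disables CUDA-graph capture entirely, which is why the rule guards with `if max_batch_size` rather than checking for `None` alone.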
src/aiconfigurator/generator/rule_plugin/benchmark/trtllm.rule (31 additions, 0 deletions)
```
prefill max_batch_size = (max_batch_size if max_batch_size else 1)
agg_decode max_batch_size = (max_batch_size if max_batch_size else 128)

prefill disable_overlap_scheduler = true
decode disable_overlap_scheduler = false
agg disable_overlap_scheduler = false

prefill max_num_tokens = SlaConfig.isl + 1500
decode max_num_tokens = max_batch_size
agg max_num_tokens = max_batch_size + SlaConfig.isl + 1500

agg_prefill_decode cuda_graph_batch_sizes = ((range(1, max_batch_size + 1) | list) if max_batch_size else [])

# Enforce TensorRT-LLM MoE parallelism: moe_tp × moe_ep = tp
when ModelConfig.is_moe and (moe_tensor_parallel_size and moe_expert_parallel_size):
    agg_prefill_decode tensor_parallel_size = moe_tensor_parallel_size * moe_expert_parallel_size

# GPUs per worker (fallback to 1 if any dimension missing)
agg_prefill_decode gpus_per_worker = (tensor_parallel_size or 1) * (pipeline_parallel_size or 1) * (data_parallel_size or 1)

agg_prefill_decode enable_attention_dp = ((data_parallel_size or 1) > 1) and ModelConfig.is_moe

when (ModelConfig.prefix or 0) > 0:
    agg_prefill_decode disable_prefix_cache = false
    DynConfig.enable_router = true

# Speculative decoding
when (ModelConfig.nextn or 0) > 0:
    agg_decode speculative_decoding_type = "MTP"
    agg_decode num_nextn_predict_layers = ModelConfig.nextn
```
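The parallelism arithmetic in trtllm.rule reduces to two small formulas: the TP/PP/DP product that sizes each worker, and the MoE constraint that ties tensor parallelism to `moe_tp * moe_ep`. A hypothetical plain-Python rendering (function names are illustrative, not identifiers from aiconfigurator):

```python
def gpus_per_worker(tp=None, pp=None, dp=None):
    # Each missing parallelism dimension falls back to 1, as in the rule:
    # (tensor_parallel_size or 1) * (pipeline_parallel_size or 1) * (data_parallel_size or 1)
    return (tp or 1) * (pp or 1) * (dp or 1)

def moe_tensor_parallel_size(moe_tp, moe_ep):
    # TensorRT-LLM MoE parallelism constraint: tp = moe_tp * moe_ep.
    return moe_tp * moe_ep

# gpus_per_worker(tp=4, pp=2, dp=2) -> 16
# moe_tensor_parallel_size(2, 4)   -> 8
```

Using `or 1` rather than a default argument matches the rule's behavior when a dimension is present but set to `None` or `0`.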
src/aiconfigurator/generator/rule_plugin/benchmark/vllm.rule (21 additions, 0 deletions)
```
prefill max_batch_size = (max_batch_size if max_batch_size else 1)
decode max_batch_size = (max_batch_size if max_batch_size else 128)

agg_prefill_decode gpus_per_worker = (tensor_parallel_size or 1) * (pipeline_parallel_size or 1) * (data_parallel_size or 1)
agg_prefill_decode enable_expert_parallel = ((moe_expert_parallel_size or 1) > 1)

agg_prefill_decode cuda_graph_batch_sizes = ((range(1, max_batch_size + 1) | list) if max_batch_size else [])

prefill max_num_tokens = (SlaConfig.isl or 0) + 1500
decode max_num_tokens = max_batch_size
agg max_num_tokens = (max_batch_size or 0) + (SlaConfig.isl or 0) + 1500
agg max_seq_len = (SlaConfig.isl or 0) + (SlaConfig.osl or 0) + 1500

when (ModelConfig.prefix or 0) > 0:
    disable_prefix_cache = false
    DynConfig.enable_router = true

when (ModelConfig.nextn or 0) > 0:
    speculative_decoding_type = "mtp"
    num_nextn_predict_layers = ModelConfig.nextn
```
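The token-budget rules above all add a fixed 1500-token slack on top of the SLA input/output lengths. A hypothetical Python sketch of that arithmetic (the `HEADROOM` constant and function names are my labels, not identifiers from the codebase):

```python
HEADROOM = 1500  # fixed slack the rules add on top of isl/osl

def prefill_max_num_tokens(isl):
    # prefill max_num_tokens = (SlaConfig.isl or 0) + 1500
    return (isl or 0) + HEADROOM

def agg_max_num_tokens(max_batch_size, isl):
    # agg max_num_tokens = (max_batch_size or 0) + (SlaConfig.isl or 0) + 1500
    return (max_batch_size or 0) + (isl or 0) + HEADROOM

def agg_max_seq_len(isl, osl):
    # agg max_seq_len = (SlaConfig.isl or 0) + (SlaConfig.osl or 0) + 1500
    return (isl or 0) + (osl or 0) + HEADROOM

# agg_max_num_tokens(128, 4096) -> 5724
# agg_max_seq_len(4096, 1024)   -> 6620
```

The `(x or 0)` guards make the vLLM budgets degrade gracefully when an SLA field is unset, unlike the trtllm.rule variants, which assume `SlaConfig.isl` is always present.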
Do we want to include cuda_graph_batch_sizes in vllm.rule as well?
Thanks for the review. I've added cuda_graph_batch_sizes to vllm.rule on line 8 to stay consistent with trtllm.rule and sglang.rule. Please let me know if this isn't appropriate.