feat: expose quantization and kv_cache_dtype in server args builder #246
Open
ZhitongGuo wants to merge 2 commits into sgl-project:main
SGLang natively supports quantization (INT8, FP8, AWQ, GPTQ) via ServerArgs, but sglang-omni's helper functions did not expose these params, making them undiscoverable.

- Add `quantization` and `kv_cache_dtype` named parameters to `build_sglang_server_args()` so callers can set them without resorting to `**overrides`
- Add `quantization` parameter to S2-Pro TTS engine factory (`create_sglang_tts_engine_executor`) and pass through to ServerArgs
- Add unit tests for all passthrough paths
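The conditional passthrough described in the first bullet might look roughly like the sketch below. It is not the code from this PR: the `ServerArgs` dataclass is a minimal stand-in for SGLang's real class, the `model_path` parameter and the `"auto"` default are assumptions made for illustration, and only the ordering (named parameters first, `**overrides` applied last) is taken from the description.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ServerArgs:
    """Stand-in for SGLang's ServerArgs; fields and defaults are assumed."""
    model_path: str
    quantization: Optional[str] = None
    kv_cache_dtype: str = "auto"


def build_sglang_server_args(
    model_path: str,
    quantization: Optional[str] = None,
    kv_cache_dtype: Optional[str] = None,
    **overrides: Any,
) -> ServerArgs:
    kwargs: Dict[str, Any] = {"model_path": model_path}
    # Forward only values the caller set, so ServerArgs defaults still apply.
    if quantization is not None:
        kwargs["quantization"] = quantization
    if kv_cache_dtype is not None:
        kwargs["kv_cache_dtype"] = kv_cache_dtype
    # Overrides are merged last, so existing callers keep their behavior.
    kwargs.update(overrides)
    return ServerArgs(**kwargs)


args = build_sglang_server_args("some/model", quantization="awq")
```

Leaving unset parameters out of the kwargs dict, rather than passing `None` through, means the builder never overwrites whatever default the underlying class defines.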
Motivation
SGLang natively supports model quantization (INT8, FP8, AWQ, GPTQ) and KV cache dtype configuration via `ServerArgs`, but sglang-omni's helper functions (`build_sglang_server_args` and `create_sglang_tts_engine_executor`) did not expose these parameters as named arguments, making them undiscoverable for users.

Modifications
- `server_args_builder.py`: Added `quantization` and `kv_cache_dtype` as explicit named parameters to `build_sglang_server_args()`. They are conditionally inserted into the kwargs dict before `**overrides`, preserving backward compatibility.
- `stages.py`: Added the same two parameters to `create_sglang_tts_engine_executor()` for the S2-Pro TTS pipeline, with conditional passthrough to `ServerArgs`.
- `tests/test_quantization_passthrough.py`: Added 4 unit tests verifying argument passthrough for both named params and the existing `**overrides` path.

Accuracy Test
The four unit tests cover `quantization="awq"` passthrough, the default `None`, `kv_cache_dtype="fp8_e5m2"` passthrough, and the `**overrides` path for `quantization="gptq"`.

Benchmark & Profiling
No performance impact: this is a configuration passthrough change. Actual quantization performance depends on the underlying SGLang engine and model.
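For reference, the four cases listed under Accuracy Test could be exercised along the following lines. This is an illustrative sketch, not the contents of `tests/test_quantization_passthrough.py`: `build_kwargs` is a hypothetical stand-in that returns the kwargs dict that would be handed to `ServerArgs`, so the checks run without SGLang installed.

```python
from typing import Any, Dict, Optional


def build_kwargs(
    quantization: Optional[str] = None,
    kv_cache_dtype: Optional[str] = None,
    **overrides: Any,
) -> Dict[str, Any]:
    """Hypothetical stand-in: returns the kwargs that would reach ServerArgs."""
    kwargs: Dict[str, Any] = {}
    if quantization is not None:
        kwargs["quantization"] = quantization
    if kv_cache_dtype is not None:
        kwargs["kv_cache_dtype"] = kv_cache_dtype
    kwargs.update(overrides)
    return kwargs


def test_quantization_passthrough() -> None:
    assert build_kwargs(quantization="awq")["quantization"] == "awq"


def test_quantization_default_is_unset() -> None:
    # With the default of None, nothing is forwarded.
    assert "quantization" not in build_kwargs()


def test_kv_cache_dtype_passthrough() -> None:
    assert build_kwargs(kv_cache_dtype="fp8_e5m2")["kv_cache_dtype"] == "fp8_e5m2"


def test_overrides_path() -> None:
    # An unpacked key that matches a named parameter binds to that parameter.
    assert build_kwargs(**{"quantization": "gptq"})["quantization"] == "gptq"
```

The functions follow pytest naming, so saving the block as a test module and running `pytest` on it should collect all four.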