Default tp_size based on slurm number of GPUs #17
Pull request overview
Updates the LLM processor’s SGLang configuration so tp_size defaults dynamically based on SLURM-provided GPU allocation (SLURM_GPUS_ON_NODE), improving out-of-the-box behavior in SLURM environments.
Changes:
- Import `os` in the LLM config module to read environment variables.
- Change `SGLangServerArgs.tp_size` to default from `SLURM_GPUS_ON_NODE` (fallback to `1`).
      model_path: str = "none"
    - tp_size: int = 1
    + tp_size: int = field(default_factory=lambda: int(os.environ.get("SLURM_GPUS_ON_NODE") or 1))
      trust_remote_code: bool = True
The `tp_size` default parsing can raise `ValueError` if `SLURM_GPUS_ON_NODE` is set but is not a plain integer (for example, a whitespace-only or non-numeric value). It can also result in `tp_size=0` if the env var is `"0"`, which is likely invalid for tensor parallelism and will fail later when constructing the SGLang engine. Consider using a small helper to parse the env var defensively (strip, try/except, and clamp to >=1), falling back to `1` on invalid values.
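The defensive parsing suggested here could look roughly like the sketch below. The helper name `default_tp_size` is hypothetical and is not taken from the repository or from the follow-up pull request; only the `SLURM_GPUS_ON_NODE` variable and the fallback of `1` come from this thread.

```python
import os


def default_tp_size(env_var: str = "SLURM_GPUS_ON_NODE") -> int:
    """Return a tensor-parallel size from the environment, never below 1.

    Hypothetical helper: the name and signature are assumptions made
    for illustration only.
    """
    # Treat an unset variable like an empty string, then trim whitespace.
    raw = os.environ.get(env_var, "").strip()
    try:
        value = int(raw)
    except ValueError:
        # Empty or non-numeric values (e.g. "gpu:4") fall back to 1.
        return 1
    # Clamp zero or negative counts to 1.
    return max(value, 1)
```

With a helper like this, the field default could become `field(default_factory=default_tp_size)`, which keeps the parsing logic out of the lambda and makes it testable on its own.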
@copilot open a new pull request to apply changes based on this feedback
@fabnemEPFL I've opened a new pull request, #18, to work on those changes. Once the pull request is ready, I'll request review from you.
Co-authored-by: fabnemEPFL <117652591+fabnemEPFL@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
This pull request updates the configuration logic for the `SGLangServerArgs` class in the LLM processor module. The main change is to make the `tp_size` parameter dynamically default to the value of the `SLURM_GPUS_ON_NODE` environment variable if it is set, or fall back to `1` otherwise. This allows for more flexible configuration in distributed or SLURM-managed environments.

Configuration improvements:
- `tp_size` in the `SGLangServerArgs` class now uses the `SLURM_GPUS_ON_NODE` environment variable if available, improving compatibility with SLURM-based GPU scheduling.
- The `os` module is imported to support environment variable access in `config.py`.
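To illustrate how the new default behaves, here is a minimal sketch that assumes `SGLangServerArgs` is a standard-library `dataclass` (the use of `field(default_factory=...)` suggests this, but the thread does not confirm it). Only the three fields visible in the diff are reproduced; the real class presumably has more.

```python
import os
from dataclasses import dataclass, field


@dataclass
class SGLangServerArgs:
    # Reduced stand-in for the real class: only fields shown in the diff.
    model_path: str = "none"
    tp_size: int = field(
        default_factory=lambda: int(os.environ.get("SLURM_GPUS_ON_NODE") or 1)
    )
    trust_remote_code: bool = True


# Simulate a SLURM allocation with four GPUs on the node.
os.environ["SLURM_GPUS_ON_NODE"] = "4"

print(SGLangServerArgs().tp_size)           # 4: env var is read at instantiation
print(SGLangServerArgs(tp_size=2).tp_size)  # 2: an explicit value still wins
```

Because `default_factory` runs each time an instance is created, the environment variable is read at construction time and not at import time, and callers can still override `tp_size` explicitly.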