Skip to content

Default tp_size based on slurm number of GPUs#17

Merged
fabnemEPFL merged 3 commits into
mainfrom
smart_tp_size
Mar 3, 2026
Merged

Default tp_size based on slurm number of GPUs#17
fabnemEPFL merged 3 commits into
mainfrom
smart_tp_size

Conversation

@fabnemEPFL
Copy link
Copy Markdown
Collaborator

This pull request updates the configuration logic for the SGLangServerArgs class in the LLM processor module. The main change is to make the tp_size parameter dynamically default to the value of the SLURM_GPUS_ON_NODE environment variable if it is set, or fall back to 1 otherwise. This allows for more flexible configuration in distributed or SLURM-managed environments.

Configuration improvements:

  • Updated the default value of tp_size in the SGLangServerArgs class to use the SLURM_GPUS_ON_NODE environment variable if available, improving compatibility with SLURM-based GPU scheduling.
  • Added an import of the os module to support environment variable access in config.py.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the LLM processor’s SGLang configuration so tp_size defaults dynamically based on SLURM-provided GPU allocation (SLURM_GPUS_ON_NODE), improving out-of-the-box behavior in SLURM environments.

Changes:

  • Import os in the LLM config module to read environment variables.
  • Change SGLangServerArgs.tp_size to default from SLURM_GPUS_ON_NODE (fallback to 1).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 30 to 32
model_path: str = "none"
tp_size: int = 1
tp_size: int = field(default_factory=lambda: int(os.environ.get("SLURM_GPUS_ON_NODE") or 1))
trust_remote_code: bool = True
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tp_size default parsing can raise ValueError if SLURM_GPUS_ON_NODE is set but not a plain integer (or includes whitespace). It can also result in tp_size=0 if the env var is "0", which is likely invalid for tensor parallelism and will fail later when constructing the SGLang engine. Consider using a small helper to parse the env var defensively (strip, try/except, and clamp to >=1), falling back to 1 on invalid values.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot open a new pull request to apply changes based on this feedback

Copy link
Copy Markdown
Contributor

Copilot AI commented Feb 9, 2026

@fabnemEPFL I've opened a new pull request, #18, to work on those changes. Once the pull request is ready, I'll request review from you.

Co-authored-by: fabnemEPFL <117652591+fabnemEPFL@users.noreply.github.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
@fabnemEPFL fabnemEPFL merged commit bbc2a3f into main Mar 3, 2026
0 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants