Add num_nodes() method to ClusterEnvironment #21638

Closed
crawfordxx wants to merge 1 commit into Lightning-AI:master from crawfordxx:feat-cluster-env-num-nodes

crawfordxx commented Apr 1, 2026

Summary

  • Adds an abstract num_nodes() method to the ClusterEnvironment base class so the number of nodes can be queried directly from the cluster environment
  • Implements num_nodes() in all seven concrete subclasses (SLURM, TorchElastic, Lightning, Kubeflow, LSF, MPI, XLA), each deriving the value from the appropriate environment-specific source
  • Adds unit tests for the new method across all testable environments
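
The shape of the change summarized above can be sketched roughly as follows. This is a minimal stand-in, not the PR's actual diff: `ClusterEnvironmentSketch` and `SlurmLikeEnvironment` are hypothetical names, and the other abstract methods of the real `ClusterEnvironment` are left out.

```python
import os
from abc import ABC, abstractmethod


class ClusterEnvironmentSketch(ABC):
    """Hypothetical stand-in for the ClusterEnvironment base class."""

    @abstractmethod
    def num_nodes(self) -> int:
        """Return the number of nodes participating in the job."""


class SlurmLikeEnvironment(ClusterEnvironmentSketch):
    """Hypothetical subclass that reads the node count from SLURM_NNODES."""

    def num_nodes(self) -> int:
        # Fall back to a single node when the variable is not set.
        return int(os.environ.get("SLURM_NNODES", 1))
```

Because the method is abstract, any subclass that does not implement `num_nodes()` cannot be instantiated, which is presumably why the PR touches all seven concrete environments.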

Motivation

Closes #7361. Currently num_nodes must be specified manually by the user, but in most cluster environments this information is already available (e.g., SLURM_NNODES, GROUP_WORLD_SIZE). This change allows the cluster environment to provide it automatically.

Implementation Details

| Environment | Source |
|---|---|
| SLURMEnvironment | SLURM_NNODES env var (default: 1) |
| TorchElasticEnvironment | GROUP_WORLD_SIZE env var (default: 1) |
| LightningEnvironment | NUM_NODES env var (default: 1) |
| KubeflowEnvironment | NUM_NODES env var (default: world_size) |
| LSFEnvironment | Unique hosts in LSB_DJOB_RANKFILE |
| MPIEnvironment | Unique hostnames via MPI.COMM_WORLD.gather |
| XLAEnvironment | xr.host_count() (XLA >= 2.1) or HOST_WORLD_SIZE env var |
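
The two lookup patterns listed above (an environment variable with a default, and a count of unique hosts) might look something like the sketch below. The helper names `num_nodes_from_env`, `kubeflow_num_nodes`, and `count_unique_hosts` are hypothetical and do not come from the PR. How the PR treats special entries in the LSF rank file, such as a launch node, is not shown on this page, so the sketch simply counts distinct non-empty names.

```python
import os
from typing import Iterable


def num_nodes_from_env(var: str, default: int = 1) -> int:
    """Read a node count from an environment variable, falling back to a default."""
    value = os.environ.get(var)
    return int(value) if value is not None else default


def kubeflow_num_nodes(world_size: int) -> int:
    """Use NUM_NODES if set, otherwise the world size (the default listed above)."""
    return num_nodes_from_env("NUM_NODES", default=world_size)


def count_unique_hosts(hostnames: Iterable[str]) -> int:
    """Count distinct hostnames, e.g. rank-file lines or gathered MPI hostnames."""
    # Strip whitespace and ignore blank entries before deduplicating.
    return len({name.strip() for name in hostnames if name.strip()})
```

Under these assumptions, the SLURM and TorchElastic rows would reduce to `num_nodes_from_env("SLURM_NNODES")` and `num_nodes_from_env("GROUP_WORLD_SIZE")`, while the LSF and MPI rows would both pass a list of hostnames to `count_unique_hosts`.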

Test plan

  • Added dedicated test_num_nodes and test_num_nodes_default tests for SLURM, TorchElastic, Lightning, and Kubeflow environments
  • Added an inline assertion to the existing test_attributes_from_environment_variables tests for SLURM and TorchElastic
  • All 42 environment tests pass locally
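
A pair of tests in the style described above could be written along these lines. This is only a sketch: `slurm_num_nodes` is a hypothetical stand-in for calling `num_nodes()` on a SLURM environment object, and the PR's real tests are not reproduced on this page.

```python
import os
from unittest import mock


def slurm_num_nodes() -> int:
    """Hypothetical stand-in for SLURMEnvironment().num_nodes()."""
    return int(os.environ.get("SLURM_NNODES", 1))


@mock.patch.dict(os.environ, {"SLURM_NNODES": "4"})
def test_num_nodes():
    # The value set in the patched environment is returned as an int.
    assert slurm_num_nodes() == 4


@mock.patch.dict(os.environ, {}, clear=True)
def test_num_nodes_default():
    # With the variable absent, the documented default of 1 applies.
    assert slurm_num_nodes() == 1
```

Patching `os.environ` with `mock.patch.dict` restores the original environment after each test, so the two cases do not leak state into each other.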

📚 Documentation preview 📚: https://pytorch-lightning--21638.org.readthedocs.build/en/21638/

Add an abstract num_nodes() method to the ClusterEnvironment base class
so users can query the number of nodes directly from the cluster
environment. Each concrete implementation derives the value from the
appropriate source:

- SLURMEnvironment: SLURM_NNODES env var
- TorchElasticEnvironment: GROUP_WORLD_SIZE env var
- LightningEnvironment: NUM_NODES env var
- KubeflowEnvironment: NUM_NODES env var (defaults to world_size)
- LSFEnvironment: unique hosts in rank file
- MPIEnvironment: unique hostnames via MPI gather
- XLAEnvironment: xr.host_count() or HOST_WORLD_SIZE env var

Closes #7361

github-actions Bot added the fabric (lightning.fabric.Fabric) label Apr 1, 2026

codecov Bot commented Apr 3, 2026

Codecov Report

❌ Patch coverage is 64.70588% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 48%. Comparing base (0a60211) to head (487aa22).
⚠️ Report is 1 commits behind head on master.

❗ There is a different number of reports uploaded between BASE (0a60211) and HEAD (487aa22). Click for more details.

HEAD has 210 uploads less than BASE
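| Flag | BASE (0a60211) | HEAD (487aa22) |
|---|---|---|
| cpu | 84 | 42 |
| python | 6 | 3 |
| python3.10 | 6 | 3 |
| lightning | 30 | 15 |
| python3.12 | 24 | 12 |
| python3.12.7 | 18 | 9 |
| python3.11 | 12 | 6 |
| python3.13 | 18 | 9 |
| pytorch2.1 | 6 | 0 |
| pytest-full | 42 | 0 |
| pytorch2.3 | 3 | 0 |
| pytorch2.8 | 6 | 0 |
| pytorch_lightning | 27 | 0 |
| pytorch2.5.1 | 3 | 0 |
| pytorch2.9 | 6 | 0 |
| pytorch2.4.1 | 3 | 0 |
| pytorch2.10 | 6 | 0 |
| pytorch2.6 | 3 | 0 |
| pytorch2.2.2 | 3 | 0 |
| pytorch2.7 | 3 | 0 |
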
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #21638     +/-   ##
=========================================
- Coverage      87%      48%    -39%     
=========================================
  Files         270      267      -3     
  Lines       23930    23904     -26     
=========================================
- Hits        20709    11414   -9295     
- Misses       3221    12490   +9269     

crawfordxx closed this by deleting the head repository May 21, 2026

Development

Successfully merging this pull request may close these issues.

Allow obtaining num_nodes from ClusterEnvironment
