Add default environment variables for PyTorch/TensorFlow distributed training #4243
Motivation
Distributed training with PyTorch and TensorFlow requires specific environment variables (e.g., `MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`) to be set correctly within each container. Setting these manually is error-prone and inconvenient, especially in containerized environments orchestrated via Docker or Kubernetes. Providing sensible defaults in our Docker images will improve usability, reduce setup friction, and streamline development and testing of distributed training workloads.
Required Features
- Define and export common distributed training environment variables by default in the PyTorch and TensorFlow Docker containers: `MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`.
- Use placeholder or dummy defaults that can be overridden by the user or orchestration system.
- Ensure compatibility with both PyTorch and TensorFlow distributed training APIs.
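A possible shape for this is a small entrypoint sketch that applies dummy defaults only when a variable is unset, so users and orchestration systems can still override them. The specific default values below are placeholder assumptions, not decided values:

```shell
#!/bin/sh
# Hypothetical entrypoint sketch: ":=" assigns the default only when the
# variable is unset or empty, so externally supplied values always win.
: "${MASTER_ADDR:=127.0.0.1}"   # placeholder single-node default
: "${MASTER_PORT:=29500}"       # commonly used torch.distributed port
: "${RANK:=0}"
: "${WORLD_SIZE:=1}"
export MASTER_ADDR MASTER_PORT RANK WORLD_SIZE

echo "MASTER_ADDR=${MASTER_ADDR} MASTER_PORT=${MASTER_PORT} RANK=${RANK} WORLD_SIZE=${WORLD_SIZE}"
# exec "$@"   # hand off to the actual training command
```

With `RANK=0` and `WORLD_SIZE=1`, both frameworks should treat a container launched without any overrides as a trivial single-process "cluster", which keeps local smoke tests working out of the box.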
Impact
- Docker images for PyTorch and TensorFlow used in the platform.
- Entry-point scripts that launch distributed training jobs.
- CI/CD pipelines that test distributed training behaviors.
- Potential update to user-facing documentation for training jobs.
Testing Scenarios
- ✅ Launch a container without manually setting env vars → container starts with default values.
- ✅ Launch a container with custom env vars → defaults are overridden correctly.
- ✅ Run a minimal distributed PyTorch job using the image → training proceeds as expected with default vars.
- ✅ Same for TensorFlow: distributed training works out-of-the-box with default vars.
- 🚫 Launching in a multi-node setup without overriding the defaults logs a warning (optional enhancement).
- 🧪 Validate Docker image build and environment variables in CI.
JIRA Issue: BA-1214