Add default environment variables for PyTorch/TensorFlow distributed training #4243

@rapsealk

Motivation  

Distributed training with PyTorch and TensorFlow requires specific environment variables (e.g., MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) to be set correctly within each container. Setting these manually is error-prone and inconvenient, especially in containerized environments orchestrated via Docker or Kubernetes. Providing sensible default values in our Docker images will improve usability, reduce setup friction, and streamline the development and testing process for distributed training workloads.

Required Features

  • Define and export common distributed training environment variables by default in Docker containers for PyTorch and TensorFlow:
    • MASTER_ADDR
    • MASTER_PORT
    • RANK
    • WORLD_SIZE
  • Use placeholder or dummy defaults that can be overridden by the user or orchestration system.
  • Ensure compatibility with both PyTorch and TensorFlow distributed training APIs.
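One way an entry-point script might satisfy the "overridable defaults" requirement is sketched below. The default values (`127.0.0.1`, `29500`, `0`, `1`) and the names `DEFAULT_DIST_ENV` and `apply_default_dist_env` are assumptions made for illustration; the issue does not specify them.

```python
import os

# Hypothetical single-node placeholders; the issue does not fix these values.
DEFAULT_DIST_ENV = {
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
    "RANK": "0",
    "WORLD_SIZE": "1",
}


def apply_default_dist_env(env=None):
    """Fill in missing variables without overwriting ones already set.

    `env` defaults to os.environ; passing a plain dict makes this testable.
    """
    target = os.environ if env is None else env
    for name, value in DEFAULT_DIST_ENV.items():
        # setdefault leaves values from the user or orchestrator untouched.
        target.setdefault(name, value)
    return target


if __name__ == "__main__":
    apply_default_dist_env()
    for name in DEFAULT_DIST_ENV:
        print(f"{name}={os.environ[name]}")
```

Because `setdefault` only writes missing keys, anything injected by `docker run -e` or a Kubernetes pod spec takes precedence. Note that TensorFlow's multi-worker strategies typically read the cluster layout from `TF_CONFIG` rather than from these four variables, so a TensorFlow image would probably need a separate mapping step that this sketch does not cover.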

Impact  

  • Docker images for PyTorch and TensorFlow used in the platform.
  • Entry-point scripts that launch distributed training jobs.
  • CI/CD pipelines that test distributed training behaviors.
  • Potential update to user-facing documentation for training jobs.

Testing Scenarios  

  1. ✅ Launch a container without manually setting env vars → container starts with default values.
  2. ✅ Launch a container with custom env vars → defaults are overridden correctly.
  3. ✅ Run a minimal distributed PyTorch job using the image → training proceeds as expected with default vars.
  4. ✅ Same for TensorFlow: distributed training works out-of-the-box with default vars.
  5. 🚫 Launching a multi-node job without overriding the defaults logs a warning (optional enhancement).
  6. 🧪 Validate Docker image build and environment variables in CI.
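Scenarios 1, 2, and 5 could be checked with something like the following sketch. The function names, the logger name, and the loopback heuristic are all hypothetical. In particular, treating "`WORLD_SIZE` > 1 with a loopback `MASTER_ADDR`" as suspicious is only a heuristic, since a single-node multi-process job legitimately uses loopback, which is why it warns rather than fails.

```python
import logging

logger = logging.getLogger("dist_env_check")  # hypothetical logger name

REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")
LOOPBACK_ADDRS = {"127.0.0.1", "localhost"}


def find_dist_env_problems(env):
    """Return a list of human-readable problems found in `env` (a mapping)."""
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        return [f"{name} is not set" for name in missing]
    try:
        rank = int(env["RANK"])
        world_size = int(env["WORLD_SIZE"])
    except ValueError:
        return ["RANK and WORLD_SIZE must be integers"]
    problems = []
    if not 0 <= rank < world_size:
        problems.append(f"RANK={rank} is outside [0, WORLD_SIZE={world_size})")
    # Heuristic only: placeholder address combined with more than one process.
    if world_size > 1 and env["MASTER_ADDR"] in LOOPBACK_ADDRS:
        problems.append("WORLD_SIZE > 1 but MASTER_ADDR is a loopback address")
    return problems


def warn_on_dist_env_problems(env):
    """Log each problem as a warning and return the list for assertions."""
    problems = find_dist_env_problems(env)
    for problem in problems:
        logger.warning("distributed env check: %s", problem)
    return problems
```

In CI, the same functions could be run against the environment captured from a freshly started container, for example by passing `dict(os.environ)`.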

JIRA Issue: BA-1214
