Add default environment variables for PyTorch/TensorFlow distributed training #4243
Motivation
Distributed training with PyTorch and TensorFlow requires specific environment variables (e.g., `MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`) to be set correctly within each container. Setting these manually is error-prone and inconvenient, especially in containerized environments orchestrated via Docker or Kubernetes. Providing sensible defaults in our Docker images will improve usability, reduce setup friction, and streamline development and testing of distributed training workloads.
Required Features
- Define and export common distributed training environment variables by default in the PyTorch and TensorFlow Docker containers: `MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`.
- Use placeholder or dummy defaults that can be overridden by the user or orchestration system.
- Ensure compatibility with both PyTorch and TensorFlow distributed training APIs.
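A possible shape for this is a small entrypoint sketch that applies dummy defaults only when a variable is unset, so users and orchestration systems can still override them. The specific default values below are placeholder assumptions, not decided values:

```shell
#!/bin/sh
# Hypothetical entrypoint sketch: ":=" assigns the default only when the
# variable is unset or empty, so externally supplied values always win.
: "${MASTER_ADDR:=127.0.0.1}"   # placeholder single-node default
: "${MASTER_PORT:=29500}"       # commonly used torch.distributed port
: "${RANK:=0}"
: "${WORLD_SIZE:=1}"
export MASTER_ADDR MASTER_PORT RANK WORLD_SIZE

echo "MASTER_ADDR=${MASTER_ADDR} MASTER_PORT=${MASTER_PORT} RANK=${RANK} WORLD_SIZE=${WORLD_SIZE}"
# exec "$@"   # hand off to the actual training command
```

With `RANK=0` and `WORLD_SIZE=1`, both frameworks should treat a container launched without any overrides as a trivial single-process "cluster", which keeps local smoke tests working out of the box.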
Impact
- Docker images for PyTorch and TensorFlow used in the platform.
- Entry-point scripts that launch distributed training jobs.
- CI/CD pipelines that test distributed training behaviors.
- Potential update to user-facing documentation for training jobs.
Testing Scenarios
- ✅ Launch a container without manually setting env vars → container starts with default values.
- ✅ Launch a container with custom env vars → defaults are overridden correctly.
- ✅ Run a minimal distributed PyTorch job using the image → training proceeds as expected with default vars.
- ✅ Same for TensorFlow: distributed training works out-of-the-box with default vars.
- 🚫 Launching in a multi-node setup without overriding the defaults logs a warning (optional enhancement).
- 🧪 Validate Docker image build and environment variables in CI.
JIRA Issue: BA-1214