pacslab/adapTrain

AdapTrain

AdapTrain logo

[This logo was generated using DALL·E 3 by OpenAI]

AdapTrain is a framework designed to optimize distributed training in heterogeneous environments. AdapTrain handles workload variations and cloud multi-tenancy effectively, achieving up to 8.2× faster convergence compared to existing methods.

The key features of AdapTrain are:

  • Dynamic Model Partitioning: Automatically adjusts model partitioning based on each worker's computational capacity to optimize resource utilization.
  • Reduced Synchronization Overhead: Minimizes idle time at synchronization points by balancing work so that all workers finish each training round at about the same time.
  • Robust to Variations: Performs reliably under workload variations, resource heterogeneity, and cloud multi-tenancy.
  • Accelerated Model Convergence: Demonstrates up to 8.2× faster convergence compared to state-of-the-art distributed training methods.

Setup

Prerequisites

  • A Kubernetes cluster up and running (can be local or cloud-based).
  • kubectl configured to interact with your cluster.
  • A machine with Docker installed for building the images.

Directory Structure

To set up AdapTrain, ensure the following directory structure is provided:

.
├── dataset
│   ├── test_x.npy
│   ├── test_y.npy
│   ├── train_x.npy
│   └── train_y.npy
└── configs
    ├── m_config.json
    ├── p_config.json
    └── d_config.json
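
Before building any images, it can help to confirm that the layout above is complete. The following is a minimal sketch, not part of AdapTrain itself: the helper name `missing_files` is hypothetical, and the file list is copied from the tree above.

```python
from pathlib import Path

# Files listed in the directory structure above.
EXPECTED_FILES = [
    "dataset/test_x.npy",
    "dataset/test_y.npy",
    "dataset/train_x.npy",
    "dataset/train_y.npy",
    "configs/m_config.json",
    "configs/p_config.json",
    "configs/d_config.json",
]


def missing_files(root="."):
    """Return the expected files that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Directory layout looks complete.")
```

Run it from the directory that contains dataset and configs; an empty result means every file named above is present.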

Requirements

Dataset

  • The dataset must be compatible with torch.utils.data.Dataset.
  • Ensure the .npy files are properly formatted and preprocessed as per the requirements of the model.

Configuration Files

  • Each configuration file must be valid JSON and follow the format shown in the examples below.

Model Configuration

{
  "num_epochs": 100,
  "batch_size": 128,
  "learning_rate": 0.01,
  "input_channels": 1,
  "layers": [
      {
          "type": "linear",
          "in_features": 4096,
          "out_features": 4096
      },
      // ... other layers ...
      {
          "type": "activation",
          "activation": "log_softmax",
          "dim": 1
      }
  ]
}
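
One mistake that is easy to make in the layers list is a size mismatch between consecutive linear layers. The sketch below checks for that under stated assumptions: it only understands the "linear" and "activation" types shown above, it assumes activations leave the feature size unchanged, and the second linear layer (4096 to 10) is a hypothetical stand-in for the elided layers. The function name `check_linear_chain` is also hypothetical and not part of AdapTrain.

```python
def check_linear_chain(layers):
    """Return descriptions of size mismatches between consecutive linear layers."""
    problems = []
    prev_out = None
    for index, layer in enumerate(layers):
        if layer.get("type") != "linear":
            # Assumption: non-linear entries (e.g. activations) keep the feature size.
            continue
        if prev_out is not None and layer["in_features"] != prev_out:
            problems.append(
                f"layer {index}: in_features={layer['in_features']} "
                f"but the previous linear layer outputs {prev_out}"
            )
        prev_out = layer["out_features"]
    return problems


# Example layer list; the 4096 -> 10 layer is hypothetical.
EXAMPLE_LAYERS = [
    {"type": "linear", "in_features": 4096, "out_features": 4096},
    {"type": "linear", "in_features": 4096, "out_features": 10},
    {"type": "activation", "activation": "log_softmax", "dim": 1},
]

print(check_linear_chain(EXAMPLE_LAYERS))  # prints []
```

Note that the `// ... other layers ...` line in the example above is an elision marker; strict JSON parsers such as Python's json module reject `//` comments, so it must not appear in the real file.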

Partitioning Configuration

{
  "repartition_iter": 100,
  "log_interval": 25
}
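
The README does not spell out the semantics of these two fields. The sketch below illustrates one plausible reading, stated here as an assumption: the model is repartitioned every repartition_iter training iterations, and progress is logged every log_interval iterations. The function `events_at` is hypothetical and only shows how the two intervals would interact under that reading.

```python
def events_at(iteration, repartition_iter=100, log_interval=25):
    """Return the events assumed to fire at a given 1-based training iteration."""
    events = []
    # Assumption: logging happens every `log_interval` iterations.
    if iteration > 0 and iteration % log_interval == 0:
        events.append("log")
    # Assumption: repartitioning happens every `repartition_iter` iterations.
    if iteration > 0 and iteration % repartition_iter == 0:
        events.append("repartition")
    return events


print(events_at(25))   # prints ['log']
print(events_at(100))  # prints ['log', 'repartition']
```

Under this reading, the example values give four log lines between consecutive repartitioning points.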

Deployment Configuration

{
  "num_workers": 4,
  "dist_backend": "gloo",
  "dist_url": "tcp://127.0.0.1:9000",
  "node_names": ["worker-1", "worker-2", "worker-3", "worker-4"],
  "namespace": "adaptrain"
} 
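
A quick consistency check on the deployment configuration can catch mistakes before anything is deployed. The sketch below is hypothetical and not part of AdapTrain. It assumes that every field shown above is required and that node_names should hold exactly one distinct entry per worker; the README implies this but does not state it.

```python
import json

# The example deployment configuration from above.
EXAMPLE_D_CONFIG = """
{
  "num_workers": 4,
  "dist_backend": "gloo",
  "dist_url": "tcp://127.0.0.1:9000",
  "node_names": ["worker-1", "worker-2", "worker-3", "worker-4"],
  "namespace": "adaptrain"
}
"""

REQUIRED_FIELDS = ("num_workers", "dist_backend", "dist_url", "node_names", "namespace")


def check_deployment_config(cfg):
    """Return a list of problems found in a parsed deployment configuration."""
    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in cfg]
    if problems:
        return problems
    # Assumption: one node name per worker.
    if len(cfg["node_names"]) != cfg["num_workers"]:
        problems.append(
            f"num_workers is {cfg['num_workers']} "
            f"but node_names has {len(cfg['node_names'])} entries"
        )
    if len(set(cfg["node_names"])) != len(cfg["node_names"]):
        problems.append("node_names contains duplicates")
    return problems


print(check_deployment_config(json.loads(EXAMPLE_D_CONFIG)))  # prints []
```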

Deployment

AdapTrain is designed to be deployed on a Kubernetes cluster. The deployment process involves building both the worker and controller Docker images. Follow these steps to deploy it correctly.

Build Docker Images

1. Build the Worker Docker Image

First, build the worker Docker image. This image contains the dependencies and code for the worker nodes in the distributed training setup. Run the following command from the root of the repository, tagging the image with your repository name so that it can be pushed in the next step:

docker build -t <your-repo>/adaptrain-worker:latest -f ./docker/worker.Dockerfile .

2. Push the Worker Docker Image

Once the worker image is built, push it to your desired Docker repository (e.g., Docker Hub):

docker push <your-repo>/adaptrain-worker:latest

3. Update the Controller Deployment Configuration

Before building the controller image, you need to specify the worker image in the controller's deployment configuration file (configs/d_config.json). This ensures that the controller can reference the correct worker image.

Modify the workers_image field in d_config.json to the name of the worker image you pushed in step 2:

{
  "num_workers": 4,
  "dist_backend": "gloo",
  "dist_url": "tcp://127.0.0.1:9000",
  "node_names": ["worker-1", "worker-2", "worker-3", "worker-4"],
  "namespace": "adaptrain",
  "workers_image": "genericdockerhub/adaptrain-worker:latest"
} 
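
If you prefer to script this edit, the field can be set with a few lines of Python. This is a minimal sketch, assuming the file holds strict JSON (Python's json module rejects `//` comments); the helper name `set_workers_image` is hypothetical.

```python
import json
from pathlib import Path


def set_workers_image(config_path, image):
    """Set the workers_image field of a deployment config and write it back."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["workers_image"] = image
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, `set_workers_image("configs/d_config.json", "<your-repo>/adaptrain-worker:latest")` writes the image name pushed in step 2, leaving the other fields unchanged.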

4. Build the Controller Docker Image

Now that the controller configuration is set, build the controller Docker image. As with the worker image, tag it with your repository name so that it can be pushed in the next step:

docker build -t <your-repo>/adaptrain-controller:latest -f ./docker/controller.Dockerfile .

5. Push the Controller Docker Image

Push the controller image to your repository (just like you did for the worker image):

docker push <your-repo>/adaptrain-controller:latest

Deploy AdapTrain on Kubernetes

Once both images are built and pushed, follow these steps to deploy AdapTrain on your Kubernetes cluster.

1. Create the Namespace

Create the namespace named in the namespace field of d_config.json. Make sure the name in ./manifests/namespace.yaml matches that field.

kubectl apply -f ./manifests/namespace.yaml

2. Create the Service Account

Create the service account that the controller uses to deploy the worker pods. Its permissions are granted by the Role and RoleBinding created in the next step. Apply the provided manifest with kubectl:

kubectl apply -f ./manifests/service-account.yaml

3. Create the Role and RoleBinding

Next, create a Role that allows the controller to manage pods within the namespace, and a RoleBinding that binds the service account to that Role:

kubectl apply -f ./manifests/role.yaml
kubectl apply -f ./manifests/rolebinding.yaml

4. Deploy the Controller

Deploy the controller on your Kubernetes cluster:

kubectl apply -f ./manifests/controller.yaml

Once the controller is deployed, it will automatically deploy the worker pods based on the configurations provided in d_config.json.
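
The four steps above apply five manifests in a fixed order. The sketch below collects them into one script under stated assumptions: the manifest file names are the ones used in the commands above, while the helper names `build_apply_commands` and `deploy` are hypothetical. With `dry_run=True` (the default) the commands are only printed, so the script can be inspected before anything touches the cluster.

```python
import shlex
import subprocess
from pathlib import Path

# Order matters: namespace first, then RBAC objects, then the controller.
MANIFEST_ORDER = [
    "namespace.yaml",
    "service-account.yaml",
    "role.yaml",
    "rolebinding.yaml",
    "controller.yaml",
]


def build_apply_commands(manifest_dir="./manifests"):
    """Return one `kubectl apply -f` command (as an argv list) per manifest."""
    return [
        ["kubectl", "apply", "-f", str(Path(manifest_dir) / name)]
        for name in MANIFEST_ORDER
    ]


def deploy(manifest_dir="./manifests", dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in build_apply_commands(manifest_dir):
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops at the first failing manifest.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    deploy(dry_run=True)
```

Call `deploy(dry_run=False)` to apply the manifests for real; this requires kubectl to be configured for the target cluster, as listed under Prerequisites.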

5. Start Training

No further action is needed. Once the controller is running, it deploys the workers, starts the distributed training, and manages synchronization between the workers.
