Slurm GPU Cluster on AWS with RHEL 10

This Ansible playbook automates the deployment of a Slurm cluster on AWS using three g6.2xlarge instances running RHEL 10, complete with NVIDIA GPU drivers and GPU container support.

Features

- AWS Infrastructure: Automated VPC, subnet, security group, and EC2 instance creation
- NVIDIA GPU Support: Full NVIDIA driver installation following Red Hat best practices
- Container Runtime: Podman with NVIDIA Container Toolkit for GPU-accelerated containers
- Slurm Cluster: Production-ready Slurm workload manager configured for GPU scheduling
- Cluster Topology: One controller node and two compute nodes with GPU resources
- Optional vLLM Models: Deploy 3 production LLMs (Phi-3, Mistral-7B, Llama-3.1-8B) with one command

Prerequisites

Local Requirements

Ansible (version 2.15 or higher)
```
pip install ansible
```

Ansible Collections

ansible-galaxy collection install -r requirements.yml

AWS CLI configured with appropriate credentials
```
aws configure
```
Python boto3 library
```
pip install boto3 botocore
```
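
Before running the playbook, it can help to confirm that the tools listed above are actually on your PATH. The following is a minimal sketch, not part of the repository; the `missing_tools` helper name is hypothetical, and the tool list simply mirrors this section.

```shell
#!/bin/sh
# Pre-flight check: report which of the local tools listed above are missing.
# The helper name and the tool list are illustrative assumptions.

# Print each named command that is not found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools ansible-playbook ansible-galaxy aws python3)
if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing"
else
    echo "All required tools found."
fi

# boto3 is a Python library, so check it by importing it.
if command -v python3 >/dev/null 2>&1; then
    if python3 -c 'import boto3, botocore' 2>/dev/null; then
        echo "boto3 and botocore are importable."
    else
        echo "boto3 or botocore is missing: pip install boto3 botocore"
    fi
fi
```

The script only reports what it finds; it does not install anything or validate your AWS credentials.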

Configuration

Update Variables

Edit group_vars/all.yml to customize your deployment:

# AWS Configuration
aws_region: us-east-1  # Change to your preferred region
key_name: "slurm-cluster-key"  # EC2 key pair name; the matching private key is expected locally at ~/.ssh/id_rsa

# Instance Configuration
instance_type: g6.2xlarge  # GPU instance type
instance_count: 3  # Number of instances

# NVIDIA Configuration
nvidia_driver_version: "580.82.07"  # Update if needed

# Slurm Configuration
slurm_version: "24.05.4"  # Update to latest stable version
slurm_cluster_name: "gpu-cluster"
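
To double-check the values you just edited before starting a long deployment, a small helper can read them back. This is a sketch that assumes the flat `key: value` layout shown above; `yaml_value` and the `VARS_FILE` variable are hypothetical names, and the helper is not a general YAML parser.

```shell
# Read a top-level "key: value" scalar from a YAML file.
# Handles only the flat form shown above (optional double quotes and a
# trailing "# comment"); values that contain "#" are not supported.
yaml_value() {
    key=$1
    file=$2
    sed -n "s/^${key}:[[:space:]]*//p" "$file" \
        | sed -e 's/[[:space:]]*#.*$//' -e 's/^"\(.*\)"$/\1/' \
        | head -n 1
}

# VARS_FILE is an assumed override; it defaults to the path used in this README.
vars_file=${VARS_FILE:-group_vars/all.yml}
if [ -f "$vars_file" ]; then
    echo "region:         $(yaml_value aws_region "$vars_file")"
    echo "instance type:  $(yaml_value instance_type "$vars_file")"
    echo "instance count: $(yaml_value instance_count "$vars_file")"
    echo "slurm version:  $(yaml_value slurm_version "$vars_file")"
else
    echo "No variables file at $vars_file; run this from the repository root."
fi
```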

Usage

Deploy the Cluster

Run the main playbook to create infrastructure and configure the cluster:

ansible-playbook playbook.yml

This will:

- Create AWS infrastructure (VPC, subnet, security group, instances)
- Install and configure NVIDIA drivers on all nodes
- Install NVIDIA Container Toolkit
- Build and install Slurm from source
- Configure Munge authentication
- Set up Slurm controller and compute nodes
- Configure GPU resource scheduling

Note: The full deployment takes approximately 30-45 minutes.

Verify the Cluster

After deployment completes, SSH to the controller node:

ssh -i ~/.ssh/slurm-cluster-key.pem ec2-user@<controller-public-ip>

Check cluster status:

# View cluster nodes
sinfo

# View detailed node information
scontrol show nodes

# Check GPU resources
sinfo -o "%N %G"

# Submit a test GPU job
srun --gres=gpu:1 nvidia-smi
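
The GPU resource check above can also be scripted so that a misconfigured node stands out. This is a minimal sketch: `nodes_without_gpu` is a hypothetical helper, and it assumes `sinfo` prints one node name and one GRES string per line when run with `-h` (no header) and `-N` (node-oriented output).

```shell
# Print nodes whose GRES column does not mention a GPU.
# Reads `sinfo -h -N -o "%N %G"` style output on stdin: "node gres" per line.
nodes_without_gpu() {
    awk '$2 !~ /gpu/ { print $1 }'
}

if command -v sinfo >/dev/null 2>&1; then
    no_gpu=$(sinfo -h -N -o "%N %G" | nodes_without_gpu)
    if [ -n "$no_gpu" ]; then
        printf 'Nodes reporting no GPU GRES:\n%s\n' "$no_gpu"
    else
        echo "Every listed node reports a GPU GRES."
    fi
else
    echo "sinfo not found; run this on the controller node."
fi
```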

Test GPU Container Support

On any compute node:

# Test NVIDIA driver
nvidia-smi

# Test GPU container support
sudo podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable docker.io/nvidia/cuda:13.0.0-base-ubi9 nvidia-smi

Deploy vLLM Models (Optional)

After the cluster is running, you can optionally deploy three LLMs with vLLM:

ansible-playbook deploy-vllm-models.yml

This installs and configures:

- Phi-3 Mini (3.8B): Fast, efficient small model
- Mistral 7B Instruct: General-purpose instruction model
- Llama 3.1 8B

See VLLM_DEPLOYMENT.md for complete documentation.

Quick test:

ssh -i ~/.ssh/slurm-cluster-key.pem ec2-user@<controller-ip>
cd ~/vllm-jobs
./vllm-manager.sh serve phi3  # Start serving Phi-3

Destroy the Cluster

When you're done, clean up all AWS resources:

ansible-playbook destroy.yml

Cluster Architecture

Node Configuration

Controller Node (slurm-node-0):
- Runs slurmctld (Slurm controller daemon)
- Runs slurmdbd (Slurm database daemon)
- Hosts MariaDB for accounting
- Manages job scheduling and cluster state
Compute Nodes (slurm-node-1, slurm-node-2):
- Run slurmd (Slurm compute daemon)
- Each has 1 NVIDIA L4 GPU (g6.2xlarge)
- 8 vCPUs and 32 GiB RAM per node
- Configured for GPU job execution

GPU Scheduling

The cluster is configured with:

- Generic Resource Scheduling (GRES) for GPUs
- Automatic GPU detection using NVML
- GPU partition (gpu) for GPU jobs
- Consumable resources tracking (cores, memory, GPUs)
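
For orientation, the four points above correspond roughly to the Slurm settings sketched below. This is an illustration under assumptions, not the contents of `templates/slurm.conf.j2` or `templates/gres.conf.j2`, which may differ; the node names and CPU count come from this README, and the fragments are written to a scratch directory so nothing on the cluster is changed.

```shell
# Write illustrative config fragments to a scratch directory for reading only.
# Nothing here touches /etc/slurm or the running cluster.
sketch_dir=$(mktemp -d)

cat > "$sketch_dir/slurm.conf.fragment" <<'EOF'
# Enable generic resource (GRES) scheduling for GPUs
GresTypes=gpu
# Track cores, memory, and GRES as consumable resources
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
# Compute nodes with one GPU each (names and sizes assumed from this README)
NodeName=slurm-node-[1-2] CPUs=8 Gres=gpu:1 State=UNKNOWN
# Partition for GPU jobs
PartitionName=gpu Nodes=slurm-node-[1-2] Default=YES State=UP
EOF

cat > "$sketch_dir/gres.conf.fragment" <<'EOF'
# Let slurmd discover GPUs through NVML instead of listing device files
AutoDetect=nvml
EOF

cat "$sketch_dir/slurm.conf.fragment" "$sketch_dir/gres.conf.fragment"
```

To see what the playbook actually deployed, compare against the rendered files on the controller rather than relying on this sketch.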

Submit GPU Jobs

# Interactive GPU job
srun --gres=gpu:1 --pty bash

# Batch GPU job
sbatch --gres=gpu:1 my_gpu_script.sh

# Multiple GPUs (if using larger instances)
sbatch --gres=gpu:2 my_multi_gpu_script.sh
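
The batch commands above expect a job script. The following sketch writes a minimal one named `my_gpu_script.sh` to match the example; the job name and output file pattern are arbitrary choices, and the `gpu` partition name is taken from the GPU Scheduling section.

```shell
# Write a minimal GPU batch script; submit it on the controller with:
#   sbatch my_gpu_script.sh
cat > my_gpu_script.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --output=gpu-test-%j.out

# Slurm normally sets CUDA_VISIBLE_DEVICES for the GPUs allocated to the job.
echo "Allocated GPUs: ${CUDA_VISIBLE_DEVICES:-none}"
nvidia-smi
EOF
chmod +x my_gpu_script.sh
```

Because the `#SBATCH` directives already request one GPU, the script can be submitted with a plain `sbatch my_gpu_script.sh`; options given on the command line take precedence over the directives. The `%j` in the output name expands to the job ID.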

Security Considerations

WARNING: For simplicity, this playbook creates a "wide open" security group that allows inbound traffic from any address (0.0.0.0/0). This is NOT RECOMMENDED FOR PRODUCTION.

For production use:

- Restrict security group rules to specific IP ranges
- Use VPN or bastion host for access
- Enable AWS Systems Manager Session Manager
- Implement proper IAM roles and policies
- Enable CloudWatch logging
- Use encrypted EBS volumes
- Configure firewall rules on instances
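
As one way to approach the first item, SSH access could be limited to a single address. The sketch below only prints the AWS CLI command rather than running it; `to_cidr32`, `SG_ID`, and `MY_IP` are hypothetical names, and the default values are placeholders (the address comes from a documentation-only range).

```shell
# Turn a single IPv4 address into a /32 CIDR, rejecting malformed input.
to_cidr32() {
    printf '%s\n' "$1" | awk -F. '
        NF != 4 { exit 1 }
        {
            for (i = 1; i <= 4; i++)
                if ($i !~ /^[0-9]+$/ || $i > 255) exit 1
            print $0 "/32"
        }'
}

# Hypothetical placeholders: replace with your security group ID and address.
SG_ID=${SG_ID:-sg-0123456789abcdef0}
MY_IP=${MY_IP:-203.0.113.10}

if cidr=$(to_cidr32 "$MY_IP"); then
    # Printed rather than executed so the sketch has no side effects.
    echo aws ec2 authorize-security-group-ingress \
        --group-id "$SG_ID" --protocol tcp --port 22 --cidr "$cidr"
else
    echo "Not a valid IPv4 address: $MY_IP" >&2
fi
```

Adding a narrow rule does not remove the existing 0.0.0.0/0 rule; that rule would also have to be revoked, or the playbook's security group definition changed, for the restriction to take effect.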

Troubleshooting

CDI Configuration Issues

If GPU containers stop working after a reboot, regenerate the CDI specification:

sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

Slurm Node Not Responding

# On controller
scontrol update nodename=slurm-node-1 state=resume

# Check node state
sinfo -Nel

Munge Authentication Errors

# Restart munge (run this on every node)
sudo systemctl restart munge

# Verify munge is working
munge -n | unmunge

Database Connection Issues

# On controller node
sudo systemctl status mariadb
sudo systemctl status slurmdbd

# Check database
mysql -u slurm -pslurmdbpass slurm_acct_db

Costs

Running this cluster will incur AWS costs:

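- g6.2xlarge instances: ~$0.75/hour each (3 instances = ~$2.25/hour); on-demand rates vary by region, so check current AWS pricing
- EBS storage: 100GB per instance, billed separately
- Data transfer: Minimal for cluster communication

Estimated instance cost: ~$54/day if running 24/7 (3 instances x ~$0.75/hour x 24 hours)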

Remember to destroy the cluster when not in use!
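
The estimate above is simple arithmetic: hourly rate times instance count times hours. The sketch below reproduces it so you can substitute your own rate or runtime; `cluster_cost` is a hypothetical helper, and the $0.75 rate is this README's approximation rather than a quoted AWS price. EBS and data transfer are not included.

```shell
# Estimated instance cost: hourly_rate * instance_count * hours.
cluster_cost() {
    awk -v rate="$1" -v count="$2" -v hours="$3" \
        'BEGIN { printf "%.2f\n", rate * count * hours }'
}

cluster_cost 0.75 3 1     # per hour:     2.25
cluster_cost 0.75 3 24    # per day:      54.00
cluster_cost 0.75 3 720   # 30-day month: 1620.00
```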

File Structure

.
├── ansible.cfg                      # Ansible configuration
├── playbook.yml                     # Main cluster deployment
├── deploy-vllm-models.yml          # Optional: Deploy LLM models
├── destroy.yml                      # Cleanup playbook
├── requirements.yml                 # Ansible collection requirements
├── setup.sh                         # Quick setup script
├── group_vars/
│   ├── all.yml                     # Cluster configuration
│   └── vllm.yml                    # vLLM model configuration
├── templates/
│   ├── slurm.conf.j2               # Slurm main configuration
│   ├── slurmdbd.conf.j2            # Slurm database configuration
│   ├── cgroup.conf.j2              # Cgroup configuration
│   ├── gres.conf.j2                # GPU resource configuration
│   ├── vllm_serve_model.sh.j2      # vLLM serving job template
│   ├── vllm_batch_inference.sh.j2  # vLLM batch job template
│   └── vllm_test_inference.py.j2   # vLLM test script
├── examples/                        # Example Slurm job scripts
│   ├── test_gpu.sh
│   ├── container_gpu.sh
│   ├── multi_job.sh
│   └── pytorch_test.sh
├── inventory/
│   └── hosts                        # Generated inventory file
└── Documentation/
    ├── README.md                    # This file
    ├── QUICKSTART.md                # Quick reference guide
    ├── VLLM_DEPLOYMENT.md          # vLLM deployment guide
    ├── PRE_DEPLOYMENT_CHECKLIST.md # Pre-flight checklist
    └── CONTRIBUTING.md              # Customization guide

redhat-et/slurm