
Primus-SaFE

Stability at Scale: AMD’s Full-Stack Platform for Large-Model Training



Overview

Primus-SaFE (Scalable AI Fault-tolerant Engine) is AMD's comprehensive, full-stack platform for addressing the critical challenges of multi-node AI training on AMD GPU clusters. Training large models demands stability and robust debugging capabilities at cluster scale, yet today's ROCm-based multi-node GPU deployments often rely on brittle scripts and disjointed tools to launch distributed jobs, monitor performance, and recover from failures.

Primus-SaFE transforms a collection of AMD GPU servers into a resilient, self-monitoring environment for next-generation model training by automating everything from cluster provisioning and intelligent job scheduling to real-time monitoring and hardware health validation. Running atop Kubernetes and integrated with the ROCm software stack, it provides:

  • Automated Cluster Deployment: Production-grade Kubernetes setup with optimized infrastructure
  • Intelligent Job Scheduling: Multi-priority queues, automatic failover, and topology-aware placement
  • Full-Stack Observability: Real-time metrics, interactive dashboards, and comprehensive telemetry
  • Preflight Validation: Rigorous health checks and performance benchmarking before workload deployment
  • Fault Tolerance: Automatic recovery mechanisms to minimize downtime and maximize goodput

🧩 Primus Product Matrix

| Module | Role | Key Features | Dependencies / Integration |
| --- | --- | --- | --- |
| Primus-LM | End-to-end training framework | Supports multiple training backends (Megatron, TorchTitan, etc.); provides high-performance, scalable distributed training; deeply integrates with Turbo and SaFE | Can invoke Primus-Turbo kernels and modules; runs on top of Primus-SaFE for stable scheduling |
| Primus-Turbo | High-performance operators & modules | Provides common LLM training operators (FlashAttention, GEMM, Collectives, GroupedGemm, etc.); modular design, directly pluggable into Primus-LM; optimized for different architectures and precisions | Built on AITER, CK, hipBLASLt, Triton, and other operator libraries; can be enabled via configuration inside Primus-LM |
| Primus-SaFE | Stability & platform layer | Cluster sanity checks and benchmarking; Kubernetes scheduling with topology awareness; fault tolerance; stability enhancements | Builds a training platform on the Kubernetes and Slurm ecosystems |



Key Features

πŸš€ High-Availability, High-Performance Infrastructure

  • Rapid Deployment: Automated Kubernetes cluster provisioning on bare-metal servers
  • Unified Storage: JuiceFS for high-throughput distributed file system with client-side caching
  • Secure Registry: Harbor for trusted container image management with RBAC
  • Scalable Gateway: Higress API gateway for high-concurrency user-facing services

🧠 Intelligent Scheduling and Resilience

  • Multi-Priority Queuing: Preemption support for urgent experiments
  • Automatic Failover: Resume training from checkpoints when nodes or GPUs fail
  • Topology-Aware Placement: Network-locality optimization for distributed training
  • Gang Scheduling: Co-schedule interdependent pods for distributed workloads
  • Health Validation: Preflight checks before scheduling production workloads

πŸ“Š Comprehensive Monitoring and Insight

  • Cluster-Wide Metrics: Real-time GPU, CPU, memory, network, and I/O telemetry
  • Job-Level Tracking: Training progress, throughput, loss curves, and checkpoint events
  • Interactive Dashboards: Grafana-based visualization with custom alerts
  • Root Cause Analysis: Correlated metrics and logs for quick issue diagnosis

βœ… Preflight Validation and Performance Consistency

  • Hardware Diagnostics: Verify GPUs, drivers, and network interfaces
  • Micro-Benchmarks: Standard AI operations to detect underperforming nodes
  • Trial Runs: Small-scale training jobs to validate end-to-end functionality
  • Performance Baselines: Ensure all nodes meet minimum performance standards

πŸ“ˆ Extreme Scalability

  • Tested on clusters ranging from a few GPUs to tens of thousands
  • Architecture designed to handle 100,000+ GPU accelerators
  • Scales from small lab setups to massive enterprise deployments

Architecture

Primus-SaFE's functionality is organized into four core modules that work together to provide a complete training platform.

Figure 1. Primus-SaFE Full-Stack Architecture

Quick Start

Follow these steps to deploy the complete Primus-SaFE platform on your AMD GPU cluster:

1. Clone the Repository

git clone https://github.com/AMD-AGI/Primus-SaFE.git
cd Primus-SaFE

2. Bootstrap the Kubernetes Cluster

cd Bootstrap

# Edit hosts.yaml to specify your cluster nodes and roles
vim hosts.yaml

# Run the bootstrap script to deploy Kubernetes
bash bootstrap.sh

This deploys a production-grade Kubernetes cluster with:

  • High-availability control plane and etcd
  • JuiceFS distributed storage
  • Harbor container registry
  • Higress API gateway

See Bootstrap/README.md for detailed configuration options.
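
For orientation, a minimal hosts.yaml in the Kubespray-style inventory format might look like the sketch below; the hostnames, addresses, and group layout are placeholders, and the authoritative schema is documented in Bootstrap/README.md.

# Illustrative Kubespray-style inventory; adapt names and IPs to your cluster
all:
  hosts:
    master1:
      ansible_host: 10.0.0.1
      ip: 10.0.0.1
    gpu-node1:
      ansible_host: 10.0.0.11
      ip: 10.0.0.11
  children:
    kube_control_plane:
      hosts:
        master1:
    etcd:
      hosts:
        master1:
    kube_node:
      hosts:
        gpu-node1: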

3. Deploy Observability with Primus-Lens

cd ../Lens/bootstrap

# Install Primus-Lens monitoring and logging components
bash install.sh

This installs:

  • VictoriaMetrics for time-series metrics storage
  • Grafana for dashboards and visualization
  • OpenSearch for log aggregation
  • Custom exporters for GPU, network, and workload metrics

See Lens/README.md for configuration details.
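
Once the pods are up, one quick way to reach the Grafana UI is a port-forward. The namespace and service names below are assumptions; verify them against the manifests that install.sh deploys.

# Assumed namespace/service names; check with: kubectl get svc -A | grep -i grafana
kubectl -n monitoring port-forward svc/grafana 3000:3000
# Then open http://localhost:3000 in a browser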

4. Install the Primus-SaFE Platform Layer

cd ../../SaFE/bootstrap

# Deploy Primus-SaFE stability and scheduling components
bash install.sh

This installs:

  • API Server for unified management interface
  • Job Manager for workload lifecycle management
  • Resource Manager for cluster and node management
  • Node Agent for health monitoring
  • Webhooks for request validation and modification

See SaFE/README.md for the detailed installation guide.
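
A quick post-install sanity check is to confirm the control components are running. The namespace below is an assumption; adjust it to wherever install.sh places the components.

# Namespace is illustrative; adjust to your deployment
kubectl -n primus-safe get pods
kubectl -n primus-safe get svc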

5. (Optional) Run Health Checks with Primus-Bench

cd ../../Bench

# Edit hosts.ini to specify nodes to benchmark
vim hosts.ini

# Run preflight checks and benchmarks
bash run_bare_metal.sh

This runs:

  • Node configuration and hardware checks
  • Network connectivity and bandwidth tests
  • I/O performance benchmarks
  • Computation-communication overlap tests

See Bench/README.md for different execution modes (bare-metal, SLURM, Kubernetes).
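
As a point of reference, hosts.ini follows the Ansible INI inventory convention used by the bare-metal mode; the group and variable names below are illustrative, and Bench/README.md describes the expected layout.

# Illustrative Ansible inventory; replace hostnames and IPs with your nodes
[all]
gpu-node1 ansible_host=10.0.0.11
gpu-node2 ansible_host=10.0.0.12

[all:vars]
ansible_user=root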


Core Modules

Primus-Bootstrap: Rapid Cluster Deployment

Location: Bootstrap/

Primus-Bootstrap automates the deployment of a production-grade Kubernetes cluster on bare-metal servers, provisioning key infrastructure components optimized for AI workloads.

Key Components:

  • Kubernetes Cluster: High-availability setup using Kubespray with redundant control-plane and etcd
  • JuiceFS Storage: Distributed file system with metadata/data separation and client-side caching
  • Harbor Registry: Secure container image management with RBAC and image scanning
  • Higress Gateway: High-concurrency API gateway with WebAssembly plugin support
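
As an illustration of how training workloads consume the shared file system, a pod can request JuiceFS-backed storage through an ordinary PersistentVolumeClaim. The StorageClass name below is an assumption; verify what Bootstrap actually provisions with kubectl get storageclass.

# Minimal sketch of a PVC backed by the JuiceFS CSI driver
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes:
    - ReadWriteMany            # shared across all training pods
  storageClassName: juicefs-sc # assumed name; confirm on your cluster
  resources:
    requests:
      storage: 500Gi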

Usage:

cd Bootstrap
vim hosts.yaml  # Configure your cluster nodes
bash bootstrap.sh

Learn More: Bootstrap/README.md


Primus-SaFE Core: Intelligent Job Scheduling and Fault Tolerance

Location: SaFE/

The Primus-SaFE Core extends Kubernetes with AI-specific scheduling and fault tolerance capabilities to maximize throughput and reliability for long-running training jobs.

Core Services:

  1. API Server: Unified interface for resource and workload management, user authentication, and SSH access
  2. Job Manager: Full lifecycle management of PyTorchJob, Job, and Deployment workloads, with intelligent scheduling, queuing, and automatic retry
  3. Resource Manager: Centralized management of clusters, nodes, workspaces, storage, and operations
  4. Webhooks: Kubernetes admission controller for request validation and resource modification
  5. Node Agent: Node-level monitoring, fault detection, and self-healing capabilities

Key Features:

  • Multi-priority queues with preemption
  • Automatic failover and checkpoint resume
  • Topology-aware placement (via custom scheduler plugins)
  • Preflight validation with Primus-Bench integration
  • Multi-tenant workspace isolation
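
To make the workload model concrete, below is a minimal sketch of a Kubeflow PyTorchJob of the kind the Job Manager supervises. The container image, training script, GPU count, and PriorityClass name are illustrative assumptions, not Primus-SaFE defaults.

# Hypothetical PyTorchJob; image, command, and priorityClassName are placeholders
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llm-pretrain-demo
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          priorityClassName: high-priority   # assumed queue/preemption tier
          containers:
            - name: pytorch
              image: rocm/pytorch:latest
              command: ["python", "train.py"]
              resources:
                limits:
                  amd.com/gpu: 8             # AMD device-plugin resource name
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          priorityClassName: high-priority
          containers:
            - name: pytorch
              image: rocm/pytorch:latest
              command: ["python", "train.py"]
              resources:
                limits:
                  amd.com/gpu: 8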

Usage:

cd SaFE/bootstrap
bash install.sh

Learn More: SaFE/README.md


Primus-Lens: Full-Stack Observability & Visualization

Location: Lens/

Primus-Lens provides comprehensive observability across infrastructure and training workloads with real-time metrics, logs, and interactive dashboards.

Components:

  1. Metrics Stack:

    • VictoriaMetrics cluster for high-performance time-series storage
    • Custom exporters: GPU resources, network statistics, node hardware, workloads
    • VMAgent for metrics collection and aggregation
  2. Logging Stack:

    • OpenSearch for distributed log storage and search
    • Fluent Bit for log collection and forwarding
    • Structured logging with correlation to metrics
  3. Visualization:

    • Grafana with pre-built dashboards for cluster, node, GPU, and job metrics
    • Custom alerts and notifications
    • Training toolkit integration for job-level telemetry
  4. Storage:

    • PostgreSQL for metadata and configuration storage

Key Metrics:

  • GPU utilization, memory, temperature, power
  • Network throughput, latency, packet loss, RDMA statistics
  • Storage I/O, filesystem performance
  • Training progress, throughput, loss curves, checkpoint timing
  • Pod lifecycle, resource allocation, scheduling latency
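
Because VictoriaMetrics speaks PromQL/MetricsQL, these metrics can be queried directly from Grafana or the vmselect API. The metric and label names below are hypothetical stand-ins; substitute whatever the installed exporters actually expose.

# Hypothetical metric names -- replace with those from your exporters
# Average GPU utilization per node over the last 5 minutes
avg by (node) (avg_over_time(gpu_utilization_percent[5m]))

# Flag nodes whose RDMA receive throughput fell below ~1 GB/s
min by (node) (rate(rdma_rx_bytes_total[5m])) < 1e9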

Usage:

cd Lens/bootstrap
bash install.sh

Configuration: Edit bootstrap/manifests/*.yaml.tpl templates before installation.


Primus-Bench: Node Health Checks and Performance Benchmarking

Location: Bench/

Primus-Bench ensures every node in the cluster meets baseline performance standards before allowing it to run production training workloads.

Test Categories:

  1. Preflight Checks:

    • SSH reachability and configuration
    • Environment variables and system settings
    • ROCm driver and GPU detection
    • Network connectivity and DNS resolution
  2. Hardware Diagnostics:

    • GPU functionality and memory tests
    • RDMA device availability and configuration
    • Network interface verification
  3. Performance Benchmarks:

    • I/O benchmarks (fio, IOR)
    • Computation-communication overlap tests
    • Kernel launch overhead measurements
    • RDMA bandwidth and latency tests
  4. Trial Training Runs:

    • Small-scale distributed training jobs
    • Multi-node communication validation
    • End-to-end training pipeline verification

Execution Modes:

  • Bare Metal: Ansible-driven execution across hosts
  • SLURM: Resource allocation and batch job execution
  • Kubernetes: PyTorchJob-based distributed testing

Usage:

cd Bench

# Bare metal mode
vim hosts.ini
bash run_bare_metal.sh

# SLURM mode
NNODES=4 PARTITION=gpu bash salloc_slurm.sh

# Kubernetes mode
kubectl apply -f kubernetes/pytorchjob.yaml

Learn More: Bench/README.md


Scheduler Plugins: Advanced Scheduling Capabilities

Location: Scheduler-Plugins/

Custom Kubernetes scheduler plugins that extend the default scheduler with advanced capabilities tailored for AI workloads.

Available Plugins:

  1. TopologyIPSort:
    • IP-based node sorting for network locality
    • Pod group support for gang scheduling
    • Priority-based queue sorting
    • Co-scheduling for distributed workloads
    • Kubeflow training workload integration

Extension Points:

  • Score Plugin: Node evaluation and scoring
  • Queue Sort Plugin: Custom pod ordering
  • Permit Plugin: Co-scheduling and admission control
  • Post Filter Plugin: Diagnostics and failure handling

Usage:

cd Scheduler-Plugins

# Install via Helm
helm install scheduler-plugins manifests/charts/scheduler-plugins/

# Use in pod specifications
spec:
  schedulerName: custom-scheduler
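
Under the hood, a plugin like TopologyIPSort is wired into a scheduler profile through a standard KubeSchedulerConfiguration. The sketch below assumes the plugin registers at the extension points listed above; the Helm chart generates the actual configuration.

# Assumed wiring for TopologyIPSort; the Helm chart produces the real config
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: custom-scheduler
    plugins:
      queueSort:
        enabled:
          - name: TopologyIPSort
        disabled:
          - name: "*"          # only one queue-sort plugin may be active
      score:
        enabled:
          - name: TopologyIPSort
      permit:
        enabled:
          - name: TopologyIPSort
      postFilter:
        enabled:
          - name: TopologyIPSort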

Learn More: Scheduler-Plugins/README.md


Roadmap

Primus-SaFE is under active development. Planned enhancements include:

  • Enhanced AMD Hardware Support:

    • MI450 series GPU integration
    • Latest ROCm stack compatibility
    • 400 Gbps AI NIC optimization
  • Advanced Fault Tolerance:

    • Process-level failover mechanisms
    • Asynchronous checkpointing
    • Redundant training processes
    • Graceful degradation strategies
  • Agentic Platform Automation:

    • Multi-agent systems (LangGraph, CrewAI) for cluster operations
    • Natural-language cluster management
    • Automated deployment and optimization
    • Self-healing and self-tuning capabilities
  • AI-Powered Operations:

    • Predictive failure detection
    • Automated performance optimization
    • Intelligent resource allocation

License

This project is licensed under the Apache License 2.0. See the LICENSE file for full details.
