@@ -7,16 +7,28 @@ It supports both **single-node** and **multi-node** training, and includes optio
 
 ## 📚 Table of Contents
 
-- [⚙️ Supported Backends](#️-supported-backends)
-- [🖥️ Single Node Training](#️-single-node-training)
-  - [Setup Docker](#setup-docker)
-  - [Setup Primus](#setup-primus)
-  - [Run Pretraining](#run-pretraining)
-- [🌐 Multi-node Training](#-multi-node-training)
-- [🚀 HipBLASLt Auto Tuning (Optional)](#-hipblaslt-auto-tuning-optional)
-- [✅ Supported Models](#-supported-models)
-- [🏃‍♂️ How to Run a Supported Model](#️-how-to-run-a-supported-model)
-- [☸️ Kubernetes Training Management](#️-kubernetes-training-management-run_k8s_pretrainsh)
+- [🧠 Pretraining with Primus](#-pretraining-with-primus)
+  - [📚 Table of Contents](#-table-of-contents)
+  - [⚙️ Supported Backends](#️-supported-backends)
+  - [🖥️ Single Node Training](#️-single-node-training)
+    - [Setup Docker](#setup-docker)
+    - [Setup Primus](#setup-primus)
+    - [Run Pretraining](#run-pretraining)
+      - [🚀 Quick Start Mode](#-quick-start-mode)
+      - [🧑‍🔧 Interactive Mode](#-interactive-mode)
+  - [🌐 Multi-node Training](#-multi-node-training)
+  - [🔧 hipBLASLt Auto Tuning](#-hipblaslt-auto-tuning)
+    - [Stage 1: Dump GEMM Shape](#stage-1-dump-gemm-shape)
+    - [Stage 2: Tune GEMM Kernel](#stage-2-tune-gemm-kernel)
+    - [Stage 3: Train with Tuned Kernel](#stage-3-train-with-tuned-kernel)
+  - [✅ Supported Models](#-supported-models)
+  - [🏃‍♂️ How to Run a Supported Model](#️-how-to-run-a-supported-model)
+  - [☸️ Kubernetes Training Management (`run_k8s_pretrain.sh`)](#️-kubernetes-training-management-run_k8s_pretrainsh)
+    - [Requirements](#requirements)
+    - [Usage](#usage)
+    - [⚙️ Commands](#️-commands)
+    - [⚙️ Create Command Options](#️-create-command-options)
+      - [Example](#example)
 
 ---
 
@@ -75,10 +87,10 @@ You do not need to enter the Docker container. Just set the config and run.
 
 ```bash
 # Example for megatron llama3.1_8B
-EXP=examples/megatron/configs/llama3.1_8B-pretrain.yaml bash ./examples/run_local_pretrain.sh
+EXP=examples/megatron/configs/MI300X/llama3.1_8B-pretrain.yaml bash ./examples/run_local_pretrain.sh
 
 # examples for torchtitan llama3.1_8B
-EXP=examples/torchtitan/configs/llama3.1_8B-pretrain.yaml bash ./examples/run_local_pretrain.sh
+EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-pretrain.yaml bash ./examples/run_local_pretrain.sh
 ```
 
 ---
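A small guard can catch a mistyped config path before the container spins up. This is a minimal sketch, assuming only the `EXP` variable and `run_local_pretrain.sh` entry point shown above; `CONFIG` is a hypothetical helper variable, not a Primus option:

```bash
# Minimal sketch: fail fast on a bad config path before launching quick-start.
# CONFIG is a hypothetical helper variable; only EXP is read by the script.
CONFIG=examples/megatron/configs/MI300X/llama3.1_8B-pretrain.yaml
if [[ -f "$CONFIG" ]]; then
  EXP="$CONFIG" bash ./examples/run_local_pretrain.sh
else
  echo "Config not found: $CONFIG" >&2
  exit 1
fi
```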
@@ -99,10 +111,10 @@ docker exec -it dev_primus bash
 cd Primus && pip install -r requirements.txt
 
 # Example for megatron llama3.1_8B
-EXP=examples/megatron/configs/llama3.1_8B-pretrain.yaml bash ./examples/run_pretrain.sh
+EXP=examples/megatron/configs/MI300X/llama3.1_8B-pretrain.yaml bash ./examples/run_pretrain.sh
 
 # examples for torchtitan llama3.1_8B
-EXP=examples/torchtitan/configs/llama3.1_8B-pretrain.yaml bash ./examples/run_pretrain.sh
+EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-pretrain.yaml bash ./examples/run_pretrain.sh
 
 ```
 
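Once the `dev_primus` container from the interactive steps is running, training can also be relaunched without re-attaching a shell. A sketch assuming the container name and working directory from the commands above:

```bash
# Sketch: rerun pretraining in the existing dev_primus container without
# opening an interactive shell first; paths match the interactive steps above.
docker exec dev_primus bash -lc \
  'cd Primus && EXP=examples/megatron/configs/MI300X/llama3.1_8B-pretrain.yaml bash ./examples/run_pretrain.sh'
```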
@@ -118,10 +130,10 @@ export DOCKER_IMAGE="docker.io/rocm/primus:v25.9_gfx942"
 export NNODES=8
 
 # Example for megatron llama3.1_8B
-EXP=examples/megatron/configs/llama3.1_8B-pretrain.yaml bash ./examples/run_slurm_pretrain.sh
+EXP=examples/megatron/configs/MI300X/llama3.1_8B-pretrain.yaml bash ./examples/run_slurm_pretrain.sh
 
 # examples for torchtitan llama3.1_8b
-EXP=examples/torchtitan/configs/llama3.1_8B-pretrain.yaml bash ./examples/run_slurm_pretrain.sh
+EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-pretrain.yaml bash ./examples/run_slurm_pretrain.sh
 ```
 
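To sweep several models on the same allocation, the launch line can be looped. A minimal sketch, assuming `EXP` and `NNODES` behave exactly as in the example above; the config names come from the Supported Models table below:

```bash
# Sketch: queue several pretraining runs back-to-back on one SLURM allocation.
export NNODES=8
for cfg in llama3.1_8B llama3.1_70B; do
  EXP=examples/megatron/configs/MI300X/${cfg}-pretrain.yaml \
    bash ./examples/run_slurm_pretrain.sh
done
```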
 ## 🔧 hipBLASLt Auto Tuning
@@ -144,7 +156,7 @@ It is recommended to reduce `train_iters` for faster shape generation.
 # ./output/tune_hipblaslt/${PRIMUS_MODEL}/gemm_shape
 
 export PRIMUS_HIPBLASLT_TUNING_STAGE=1
-export EXP=examples/megatron/configs/llama2_7B-pretrain.yaml
+export EXP=examples/megatron/configs/MI300X/llama2_7B-pretrain.yaml
 NNODES=1 bash ./examples/run_slurm_pretrain.sh
 ```
 
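Before moving on, it is worth confirming that shapes were actually dumped. This sketch relies only on the output path noted above; `PRIMUS_MODEL` is assumed to be resolved by the training scripts:

```bash
# Sketch: verify stage 1 wrote GEMM shape dumps before starting stage 2.
ls -lh "./output/tune_hipblaslt/${PRIMUS_MODEL}/gemm_shape" \
  || echo "No GEMM shapes found; re-run stage 1." >&2
```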
@@ -161,7 +173,7 @@ It typically takes 10–30 minutes depending on model size and shape complexity.
 # ./output/tune_hipblaslt/${PRIMUS_MODEL}/gemm_tune/tune_hipblas_gemm_results.txt
 
 export PRIMUS_HIPBLASLT_TUNING_STAGE=2
-export EXP=examples/megatron/configs/llama2_7B-pretrain.yaml
+export EXP=examples/megatron/configs/MI300X/llama2_7B-pretrain.yaml
 NNODES=1 bash ./examples/run_slurm_pretrain.sh
 ```
 
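A quick look at the results file confirms tuning completed. A sketch using only the results path noted above:

```bash
# Sketch: spot-check the tuned GEMM results emitted by stage 2.
RESULTS="./output/tune_hipblaslt/${PRIMUS_MODEL}/gemm_tune/tune_hipblas_gemm_results.txt"
if [[ -s "$RESULTS" ]]; then
  head "$RESULTS"
else
  echo "Tuning results missing or empty: $RESULTS" >&2
fi
```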
@@ -173,7 +185,7 @@ In this final stage, the tuned kernel is loaded for efficient training:
 
 ```bash
 export PRIMUS_HIPBLASLT_TUNING_STAGE=3
-export EXP=examples/megatron/configs/llama2_7B-pretrain.yaml
+export EXP=examples/megatron/configs/MI300X/llama2_7B-pretrain.yaml
 NNODES=1 bash ./examples/run_slurm_pretrain.sh
 ```
 
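Since the three stages differ only in `PRIMUS_HIPBLASLT_TUNING_STAGE`, they can be chained in one loop. A minimal sketch, assuming the same per-stage semantics documented above:

```bash
# Sketch: run dump -> tune -> train as one sequence for a single config.
set -e  # stop if any stage fails
export EXP=examples/megatron/configs/MI300X/llama2_7B-pretrain.yaml
for stage in 1 2 3; do
  PRIMUS_HIPBLASLT_TUNING_STAGE=$stage NNODES=1 \
    bash ./examples/run_slurm_pretrain.sh
done
```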
@@ -183,18 +195,18 @@ The following models are supported out of the box via provided configuration fil
 
 | Model | Huggingface Config | Megatron Config | TorchTitan Config |
 | ---------------- | ------------------ | --------------- | ----------------- |
-| llama2_7B | [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) | [llama2_7B-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/config/llama2_7B-pretrain.yaml) | |
-| llama2_70B | [meta-llama/Llama-2-70b-hf](https://huggingface.co/meta-llama/Llama-2-70b-hf) | [llama2_70B-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/configs/llama2_70B-pretrain.yaml) | |
-| llama3_8B | [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) | [llama3_8B-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/configs/llama3_8B-pretrain.yaml) | |
-| llama3_70B | [meta-llama/Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B) | [llama3_70B-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/configs/llama3_70B-pretrain.yaml) | |
-| llama3.1_8B | [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) | [llama3.1_8B-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/configs/llama3.1_8B-pretrain.yaml) | [llama3.1_8B-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/torchtitan/configs/llama3.1_8B-pretrain.yaml) |
-| llama3.1_70B | [meta-llama/Llama-3.1-70B](https://huggingface.co/meta-llama/Llama-3.1-70B) | [llama3.1_70B-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/configs/llama3.1_70B-pretrain.yaml) | [llama3.1_70B-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/torchtitan/configs/llama3.1_70B-pretrain.yaml) |
-| llama3.1_405B | [meta-llama/Llama-3.1-405B](https://huggingface.co/meta-llama/Llama-3.1-405B) | [llama3.1_405B-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/configs/llama3.1_405B-pretrain.yaml) | [llama3.1_405B-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/torchtitan/configs/llama3.1_405B-pretrain.yaml) |
-| deepseek_v2_lite | [deepseek-ai/DeepSeek-V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite) | [deepseek_v2_lite-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/configs/deepseek_v2_lite-pretrain.yaml) | |
-| deepseek_v2 | [deepseek-ai/DeepSeek-V2](https://huggingface.co/deepseek-ai/DeepSeek-V2) | [deepseek_v2-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/configs/deepseek_v2-pretrain.yaml) | |
-| deepseek_v3 | [deepseek-ai/DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3) | [deepseek_v3-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/configs/deepseek_v3-pretrain.yaml) | |
-| Mixtral-8x7B-v0.1 | [mistralai/Mixtral-8x7B-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) | [mixtral_8x7B_v0.1-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/configs/mixtral_8x7B_v0.1-pretrain.yaml) | |
-| Mixtral-8x22B-v0.1 | [mistralai/Mixtral-8x22B-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1) | [mixtral_8x22B_v0.1-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/configs/mixtral_8x22B_v0.1-pretrain.yaml) | |
+| llama2_7B | [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) | [llama2_7B-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/configs/MI300X/llama2_7B-pretrain.yaml) | |
+| llama2_70B | [meta-llama/Llama-2-70b-hf](https://huggingface.co/meta-llama/Llama-2-70b-hf) | [llama2_70B-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/configs/MI300X/llama2_70B-pretrain.yaml) | |
+| llama3_8B | [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) | [llama3_8B-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/configs/MI300X/llama3_8B-pretrain.yaml) | |
+| llama3_70B | [meta-llama/Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B) | [llama3_70B-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/configs/MI300X/llama3_70B-pretrain.yaml) | |
+| llama3.1_8B | [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) | [llama3.1_8B-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/configs/MI300X/llama3.1_8B-pretrain.yaml) | [llama3.1_8B-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/torchtitan/configs/MI300X/llama3.1_8B-pretrain.yaml) |
+| llama3.1_70B | [meta-llama/Llama-3.1-70B](https://huggingface.co/meta-llama/Llama-3.1-70B) | [llama3.1_70B-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/configs/MI300X/llama3.1_70B-pretrain.yaml) | [llama3.1_70B-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/torchtitan/configs/MI300X/llama3.1_70B-pretrain.yaml) |
+| llama3.1_405B | [meta-llama/Llama-3.1-405B](https://huggingface.co/meta-llama/Llama-3.1-405B) | [llama3.1_405B-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/configs/MI300X/llama3.1_405B-pretrain.yaml) | [llama3.1_405B-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/torchtitan/configs/MI300X/llama3.1_405B-pretrain.yaml) |
+| deepseek_v2_lite | [deepseek-ai/DeepSeek-V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite) | [deepseek_v2_lite-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/configs/MI300X/deepseek_v2_lite-pretrain.yaml) | |
+| deepseek_v2 | [deepseek-ai/DeepSeek-V2](https://huggingface.co/deepseek-ai/DeepSeek-V2) | [deepseek_v2-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/configs/MI300X/deepseek_v2-pretrain.yaml) | |
+| deepseek_v3 | [deepseek-ai/DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3) | [deepseek_v3-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/configs/MI300X/deepseek_v3-pretrain.yaml) | |
+| Mixtral-8x7B-v0.1 | [mistralai/Mixtral-8x7B-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) | [mixtral_8x7B_v0.1-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/configs/MI300X/mixtral_8x7B_v0.1-pretrain.yaml) | |
+| Mixtral-8x22B-v0.1 | [mistralai/Mixtral-8x22B-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1) | [mixtral_8x22B_v0.1-pretrain.yaml](https://github.com/AMD-AIG-AIMA/Primus/blob/main/examples/megatron/configs/MI300X/mixtral_8x22B_v0.1-pretrain.yaml) | |
 
 ---
 
@@ -203,15 +215,15 @@ The following models are supported out of the box via provided configuration fil
 Use the following command pattern to start training with a selected model configuration:
 
 ```bash
-EXP=examples/megatron/configs/<model_config> bash ./examples/run_local_pretrain.sh
+EXP=examples/megatron/configs/MI300X/<model_config> bash ./examples/run_local_pretrain.sh
 ```
 
 For example, to run the llama3.1_8B model quickly:
 
 ```bash
-EXP=examples/megatron/configs/llama3.1_8B-pretrain.yaml bash ./examples/run_local_pretrain.sh
+EXP=examples/megatron/configs/MI300X/llama3.1_8B-pretrain.yaml bash ./examples/run_local_pretrain.sh
 
-EXP=examples/torchtitan/configs/llama3.1_8B-pretrain.yaml bash ./examples/run_local_pretrain.sh
+EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-pretrain.yaml bash ./examples/run_local_pretrain.sh
 ```
 
 
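Because every supported config follows the `<model>-pretrain.yaml` naming scheme, the model name can be held in a shell variable. A sketch; `MODEL` below is a hypothetical helper, not a Primus option:

```bash
# Sketch: parameterize the model name instead of hard-coding the YAML path.
MODEL=llama3_8B  # hypothetical helper variable; only EXP is read by the script
EXP=examples/megatron/configs/MI300X/${MODEL}-pretrain.yaml \
  bash ./examples/run_local_pretrain.sh
```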
@@ -221,10 +233,10 @@ For multi-node training via SLURM, use:
 export NNODES=8
 
 # run megatron
-EXP=examples/megatron/configs/llama3.1_8B-pretrain.yaml bash ./examples/run_slurm_pretrain.sh
+EXP=examples/megatron/configs/MI300X/llama3.1_8B-pretrain.yaml bash ./examples/run_slurm_pretrain.sh
 
 # run torchtitan
-EXP=examples/torchtitan/configs/llama3.1_8B-pretrain.yaml bash ./examples/run_slurm_pretrain.sh
+EXP=examples/torchtitan/configs/MI300X/llama3.1_8B-pretrain.yaml bash ./examples/run_slurm_pretrain.sh
 ```
 
 ## ☸️ Kubernetes Training Management (`run_k8s_pretrain.sh`)
@@ -285,7 +297,7 @@ Create a training workload with 2 replicas and custom config:
 
 ```bash
 bash examples/run_k8s_pretrain.sh --url http://api.example.com create --replica 2 --cpu 96 --gpu 4 \
-  --exp examples/megatron/configs/llama2_7B-pretrain.yaml --data_path /mnt/data/train \
+  --exp examples/megatron/configs/MI300X/llama2_7B-pretrain.yaml --data_path /mnt/data/train \
   --image docker.io/custom/image:latest --hf_token myhf_token --workspace team-dev
 
 # result: