This repository accompanies the ACIT4630 group project report on SHREK-HRM (Self-Corrective Hierarchical Reasoning Model). test.py reproduces the SHREK-Large and SHREK-Small evaluation numbers reported in the paper.
- Simula Research Laboratory cluster (required: the code depends on `flash-attn` and `adam-atan2`, which require Linux and NVIDIA CUDA; it will not run on Windows or macOS)
- NVIDIA GPU with CUDA 12.6 (tested on GH200 via the `gh200q` partition)
- Python 3.10+
- ~20 GB free disk for the auto-downloaded checkpoints + datasets
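The requirements above can be checked before submitting a job. The following is a minimal sketch, not part of the repository: the function name `preflight_problems` and the constants are hypothetical, and it only covers the OS, Python version, and free disk space. It does not verify the GPU or the CUDA toolkit.

```python
import platform
import shutil
import sys

MIN_PYTHON = (3, 10)   # "Python 3.10+" from the requirements list
MIN_FREE_GB = 20       # approximate space for checkpoints + datasets


def preflight_problems(system, python_version, free_bytes):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if system != "Linux":
        problems.append(f"unsupported OS: {system} (Linux required)")
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append("Python 3.10+ required")
    if free_bytes < MIN_FREE_GB * 1024**3:
        problems.append(f"less than {MIN_FREE_GB} GB free disk")
    return problems


if __name__ == "__main__":
    # Check the machine and directory this script is run from.
    issues = preflight_problems(
        platform.system(), sys.version_info, shutil.disk_usage(".").free
    )
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Basic requirements look satisfied (GPU/CUDA not checked).")
```

Passing the system values in as arguments keeps the check easy to exercise on any machine, including one that does not meet the requirements.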
The model was trained on Simula's gh200q partition; run_test.sh is a portable SLURM wrapper that runs test.py on the same partition. Logs go into ./logs/ relative to the directory you submit from — no paths are hardcoded.
From your local machine:

    scp Project-Attachment-Group03.zip <user>@dnat.simula.no:~/
    ssh <user>@dnat.simula.no -p 60441

Then, on the cluster:

    unzip Project-Attachment-Group03.zip -d shrek-hrm
    cd shrek-hrm

    module load cuda12.6/toolkit/12.6.3
    pip install --upgrade pip
    pip install -r requirements.txt
    pip install flash-attn==2.8.3 --no-build-isolation

The `flash-attn` install takes ~10 minutes because it compiles from source. It must be installed separately because it requires `torch` to be present during the build.

    module load slurm/21.08.8
    mkdir -p logs
    sbatch run_test.sh

`sbatch` prints a job ID; note it down.

    squeue -u $USER

Once the job disappears from the queue (typically 10–15 min after it starts running), it is done. Queue wait time depends on cluster load.

    cat logs/test_py_<JOBID>.log

The script prints an accuracy table with one row per (model, task) pair, matching the format in the report.
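The wait-then-read steps above can also be automated. The sketch below is an illustration, not part of the repository: the helper names (`job_in_queue`, `log_path`, `wait_and_print`) are hypothetical, and it assumes `squeue -h -j <JOBID>` prints nothing once the job has left the queue. The log path follows the `logs/test_py_<JOBID>.log` pattern used above.

```python
import subprocess
import time
from pathlib import Path


def job_in_queue(job_id, run=subprocess.run):
    """True while `squeue` still lists the job."""
    result = run(
        ["squeue", "-h", "-j", str(job_id)],  # -h: no header, -j: job ID
        capture_output=True,
        text=True,
    )
    # Assumption: a non-zero exit code means the job is no longer known
    # to the scheduler, so it is treated as "not in the queue".
    return result.returncode == 0 and result.stdout.strip() != ""


def log_path(job_id, log_dir="logs"):
    """Log file location, following logs/test_py_<JOBID>.log."""
    return Path(log_dir) / f"test_py_{job_id}.log"


def wait_and_print(job_id, poll_seconds=60):
    """Poll the queue until the job is gone, then print its log."""
    while job_in_queue(job_id):
        time.sleep(poll_seconds)
    print(log_path(job_id).read_text())
```

Calling `wait_and_print(<JOBID>)` from the directory you submitted from would block until the job finishes and then print the accuracy table.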
- On first run, `test.py` automatically downloads the checkpoints and test datasets from HuggingFace into `model/` and `data/`. Subsequent runs skip the download.
- All paths and HuggingFace repo IDs live in `config.yaml`, so `test.py` itself runs unmodified. Edit `config.yaml` only if you need to point at a different checkpoint or dataset location.
- Total runtime is ~10–15 min on a single NVIDIA GH200 (or comparable GPU).
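The download-once behaviour described above can be pictured roughly as follows. This is a sketch under assumptions, not the actual logic in `test.py`: the helper names `needs_download` and `ensure_snapshot` are hypothetical, and treating "directory exists and is non-empty" as the skip condition is a guess. The only library call used is `snapshot_download` from `huggingface_hub`.

```python
from pathlib import Path


def needs_download(target_dir):
    """True if the directory is missing or empty."""
    path = Path(target_dir)
    return not path.is_dir() or not any(path.iterdir())


def ensure_snapshot(repo_id, target_dir, repo_type="model", fetch=None):
    """Download repo_id into target_dir unless it is already populated.

    Returns True if a download was triggered, False if it was skipped.
    """
    if not needs_download(target_dir):
        return False
    if fetch is None:
        # Imported lazily so the skip path works without huggingface_hub.
        from huggingface_hub import snapshot_download
        fetch = snapshot_download
    fetch(repo_id=repo_id, repo_type=repo_type, local_dir=str(target_dir))
    return True
```

With the repositories linked in this README, the calls would look like `ensure_snapshot("ThomasHeim/HRM-Reproduction-Checkpoints", "model")` and `ensure_snapshot("ThomasHeim/HRM-dataset", "data", repo_type="dataset")`. In the real script the repo IDs and target directories come from `config.yaml`.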
Training scripts are provided in source/SHREK-HRM/script/train/. Each script is a SLURM job configured for the Simula gh200q partition. To train SHREK-Large on Maze-Hard, for example:
    cd source/SHREK-HRM
    sbatch script/train/train_shrek_large_maze.sh

Available training scripts:

| Script | Model | Dataset |
|---|---|---|
| `train_shrek_large_maze.sh` | SHREK-Large | Maze-Hard |
| `train_shrek_tiny_maze.sh` | SHREK-Small | Maze-Hard |

Ablation study scripts are in source/SHREK-HRM/script/train/AblationStudy/ and cover all configurations reported in Table IV of the paper. All hyperparameters are set within the scripts (learning rate, batch size, epochs, etc.) and match the values reported in the paper. Training requires a single NVIDIA GH200 GPU and the datasets from dataset/data/.

    shrek-hrm/
    ├── source/              # Model code (SHREK-HRM, HRM, TRM)
    ├── flops/               # FLOPs measurement scripts
    ├── model/               # Auto-created on first test.py run (HuggingFace download)
    ├── data/                # Auto-created on first test.py run (HuggingFace download)
    ├── config.yaml          # Paths + settings consumed by test.py
    ├── test.py              # Evaluation entry point
    ├── run_test.sh          # SLURM wrapper for the Simula cluster
    ├── requirements.txt     # Pip dependencies

- Checkpoints: https://huggingface.co/ThomasHeim/HRM-Reproduction-Checkpoints
- Dataset: https://huggingface.co/datasets/ThomasHeim/HRM-dataset
Both are downloaded automatically by test.py — no manual setup required.