Minimal working example of a distributed Julia calculation across different Slurm nodes.
This project provides a template for running Julia-based parallel computations on remote clusters managed by Slurm. It automates code deployment, job submission, monitoring, and result synchronization, so you can focus on your scientific code in `src/` and let the infrastructure handle the rest.
- Project Structure
- Quick Start
- How It Works
- Advanced Usage
- Requirements
- Notes
- Troubleshooting & Common Errors
- License
- Automatic deployment: Sync your code to the remote cluster with a single command.
- Flexible cluster configuration: Easily switch between clusters using a YAML config.
- Automated job submission: Generates and submits Slurm job scripts tailored to each cluster.
- Parallel execution: Handles all Julia parallelization and Slurm array job details for you.
- Result synchronization: Automatically syncs results back to your local machine after job completion.
- Separation of concerns: You only need to write your scientific code in `src/` (especially `main.jl`); all cluster and parallelization logic is handled for you.
```
.
├── bin/           # Launch scripts (auto-generated and helpers)
├── config/        # Cluster configuration and prologue/epilogue scripts
├── data/          # Data/results (synced from cluster)
├── plots/         # Plotting scripts and related files
├── src/           # Your Julia source code (main.jl, functions.jl, etc.)
├── Makefile       # Main interface for deployment and job management
├── Project.toml   # Julia project dependencies
├── Manifest.toml  # Julia project manifest
└── README.md      # This file
```
Edit `config/clusters.yaml` to define your clusters. Each cluster entry should specify SSH info, Slurm options, and paths. Example:
```yaml
esbirro:
  short: es1
  host: esbirro.example.com
  path: /remote/home
  partition: compute
  ntasks_per_node: 4
  nodes: 2
  ntasks: 8
  cpus_per_task: 1
  mem_per_cpu: 2G
  mail_user: your@email.com
  mail_type: END,FAIL
```

Note that `short` is the handle through which passwordless SSH is configured; that is, you should be able to run `ssh es1` from your terminal without a password prompt.
- Place your main computation in `src/main.jl`.
- Add helper functions in `src/functions.jl` or other files as needed.
- Your code should be written as if running locally from `main.jl`; all parallelization and cluster setup is handled for you (see the sketch after this list).
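For concreteness, here is a minimal sketch of what `src/main.jl` could look like. It assumes the prologue has already attached the Slurm workers before this script runs; `simulate` and the output filename are hypothetical placeholders for your own computation:

```julia
# src/main.jl -- hypothetical sketch; the prologue is assumed to have
# already attached the Slurm workers before this script runs.
using Distributed

# Define the work function on every worker.
@everywhere function simulate(seed::Integer)
    sum(rand(10^6)) + seed   # placeholder computation
end

function main(arg::AbstractString)
    # pmap distributes the calls across all available workers.
    results = pmap(simulate, 1:nworkers())

    # Write into data/ so the results are synced back to your local machine.
    mkpath("data")
    open(joinpath("data", "results_$(arg).txt"), "w") do io
        println(io, results)
    end
end

main(ARGS[1])
```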
From your project root, run:
```
make ARG="<arg>" CLUSTER=<cluster> run
```

- Replace `<arg>` with the argument your `main.jl` expects.
- Replace `<cluster>` with the name of the cluster as defined in `config/clusters.yaml`.
Example:
```
make ARG="input1" CLUSTER=esbirro run
```

This will:
- Generate a Slurm launcher script tailored to the cluster.
- Deploy your code to the remote cluster.
- Submit the job to Slurm, handling all parallelization details.
- Print the job status in your terminal until it finishes.
- Sync the results from the cluster's `data/` directory back to your local `data/`.
After the job completes, results will be available in your local `data/` directory.
- Makefile: Orchestrates deployment, job script generation, job submission, and result synchronization.
- `bin/launcher.sh`: Auto-generated Slurm job script, customized for each cluster.
- `config/prologue.sh` / `prologue.jl`: Set up the environment and the Julia parallel workers on the cluster (see the sketch after this list).
- `src/main.jl`: Your main Julia entry point. All parallelization is handled for you; just write your computation as usual.
- Result Sync: After the job, results in `data/` on the cluster are synced back to your local `data/`.
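For illustration, the worker setup in the prologue typically boils down to a couple of lines; the following is a sketch assuming `SlurmClusterManager.jl` (the actual `config/prologue.jl` may do more, such as activating the project environment):

```julia
# config/prologue.jl -- sketch of the worker setup, not the actual script.
using Distributed, SlurmClusterManager

# Spawn one Julia worker per Slurm task in the current allocation.
addprocs(SlurmManager())
```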
- To deploy code without running a job: `make CLUSTER=<cluster> deploy`
- To sync results from the cluster manually: `make CLUSTER=<cluster> sync`
- To regenerate the launcher script: `make CLUSTER=<cluster> gen_launcher`
- Julia (with `Distributed.jl` and `SlurmClusterManager.jl` in your `Project.toml`; see the setup sketch below)
- `yq` (YAML processor) installed locally for Makefile parsing
- SSH access to your clusters
- Slurm installed on the remote cluster
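Since the repository ships `Project.toml` and `Manifest.toml`, a one-time instantiation from the project root should install the pinned dependencies (a sketch):

```julia
# One-time setup from the project root: install the recorded dependencies.
using Pkg
Pkg.activate(".")   # activate this project's environment
Pkg.instantiate()   # install everything recorded in Manifest.toml
```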
- All cluster-specific options (partitions, resources, etc.) are set in `config/clusters.yaml`.
- The workflow assumes your code is run via `src/main.jl`.
- You can add more clusters by extending `config/clusters.yaml`.
The workflow includes some explicit error checks to help you quickly diagnose configuration or environment issues:
Where: During `make deploy` or `make sync` (and any target that uses these).
Cause: The `CLUSTER` variable you provided does not match any entry in your `config/clusters.yaml` file, or the `short` field is missing for that cluster.
How to fix:
- Check that `CLUSTER` matches a top-level key in `config/clusters.yaml` and that the entry defines a `short` field.
Where: At the start of a Slurm job, in the generated `bin/launcher.sh` script.
Cause: The `config/prologue.sh` script failed (non-zero exit code). This script is responsible for setting up the environment on the cluster (e.g., loading modules, activating environments).
How to fix:
- Check the contents and logic of `config/prologue.sh`.
- Make sure all commands in the prologue succeed on the cluster (e.g., `module load julia` or similar).
- Check the job's output log for error messages from the prologue.
Where: During job monitoring, after job submission, in the `config/epilogue.sh` script.
Cause: The script exits with code 2 in the following cases:
- The job enters a terminal or error state (such as FAILED, CANCELLED, TIMEOUT, NODE_FAIL, OUT_OF_MEMORY, etc.) instead of RUNNING or PENDING.
- After running, if any job in the array is not in the COMPLETED state (e.g., failed, cancelled, or another non-successful state).
How to fix:
- Check the job's output and error logs in the `logs/` directory for details on why the job failed or did not complete.

These explicit errors are designed to fail fast and provide clear feedback if your configuration or environment is not set up correctly.
See LICENSE.md.
This project makes use of Distributed.jl and SlurmClusterManager.jl.