Minimal working example of a distributed Julia calculation across different Slurm nodes.
This project provides a template for running Julia-based parallel computations on remote clusters managed by Slurm. It automates code deployment, job submission, monitoring, and result synchronization, so you can focus on your scientific code in `src/` and let the infrastructure handle the rest.
- Project Structure
- Quick Start
- How It Works
- Advanced Usage
- Requirements
- Notes
- Troubleshooting & Common Errors
- License
- Automatic deployment: Sync your code to the remote cluster with a single command.
- Flexible cluster configuration: Easily switch between clusters using a YAML config.
- Automated job submission: Generates and submits Slurm job scripts tailored to each cluster.
- Parallel execution: Handles all Julia parallelization and Slurm array job details for you.
- Result synchronization: Automatically syncs results back to your local machine after job completion.
- Separation of concerns: You only need to write your scientific code in `src/` (especially `main.jl`); all cluster and parallelization logic is handled for you.
```
.
├── bin/           # Launch scripts (auto-generated and helpers)
├── config/        # Cluster configuration and prologue/epilogue scripts
├── data/          # Data/results (synced from cluster)
├── plots/         # Plotting scripts and related files
├── src/           # Your Julia source code (main.jl, functions.jl, etc.)
├── Makefile       # Main interface for deployment and job management
├── Project.toml   # Julia project dependencies
├── Manifest.toml  # Julia project manifest
└── README.md      # This file
```
Edit `config/clusters.yaml` to define your clusters. Each cluster entry should specify SSH info, Slurm options, and paths. Example:
```yaml
esbirro:
  short: es1
  host: esbirro.example.com
  path: /remote/home
  partition: compute
  ntasks_per_node: 4
  nodes: 2
  ntasks: 8
  cpus_per_task: 1
  mem_per_cpu: 2G
  mail_user: your@email.com
  mail_type: END,FAIL
```

Note that `short` is the handle through which passwordless SSH is configured; that is, you should be able to run `ssh es1` from your terminal without a password prompt.
- Place your main computation in `src/main.jl`.
- Add helper functions in `src/functions.jl` or other files as needed.
- Your code should be written as if running locally from `main.jl`; all parallelization and cluster setup is handled for you (see the sketch after this list).
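For concreteness, here is a minimal sketch of what `src/main.jl` could look like. It assumes the prologue has already attached the Slurm workers before this script runs; `simulate` and the output filename are hypothetical placeholders for your own computation:

```julia
# src/main.jl -- hypothetical sketch; the prologue is assumed to have
# already attached the Slurm workers before this script runs.
using Distributed

# Define the work function on every worker.
@everywhere function simulate(seed::Integer)
    sum(rand(10^6)) + seed   # placeholder computation
end

function main(arg::AbstractString)
    # pmap distributes the calls across all available workers.
    results = pmap(simulate, 1:nworkers())

    # Write into data/ so the results are synced back to your local machine.
    mkpath("data")
    open(joinpath("data", "results_$(arg).txt"), "w") do io
        println(io, results)
    end
end

main(ARGS[1])
```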
From your project root, run:
```
make ARG="<arg>" CLUSTER=<cluster> run
```

- Replace `<arg>` with the argument your `main.jl` expects.
- Replace `<cluster>` with the name of the cluster as defined in `config/clusters.yaml`.
Example:
```
make ARG="input1" CLUSTER=esbirro run
```

This will:
- Generate a Slurm launcher script tailored to the cluster.
- Deploy your code to the remote cluster.
- Submit the job to Slurm, handling all parallelization details.
- Print the job status in your terminal until it finishes.
- Sync the results from the cluster's `data/` directory back to your local `data/`.
After the job completes, results will be available in your local `data/` directory.
- Makefile: Orchestrates deployment, job script generation, job submission, and result synchronization.
- `bin/launcher.sh`: Auto-generated Slurm job script, customized for each cluster.
- `config/prologue.sh` / `prologue.jl`: Set up the environment and the Julia parallel workers on the cluster (see the sketch after this list).
- `src/main.jl`: Your main Julia entry point. All parallelization is handled for you; just write your computation as usual.
- Result Sync: After the job, results in `data/` on the cluster are synced back to your local `data/`.
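For illustration, the worker setup in the prologue typically boils down to a couple of lines; the following is a sketch assuming `SlurmClusterManager.jl` (the actual `config/prologue.jl` may do more, such as activating the project environment):

```julia
# config/prologue.jl -- sketch of the worker setup, not the actual script.
using Distributed, SlurmClusterManager

# Spawn one Julia worker per Slurm task in the current allocation.
addprocs(SlurmManager())
```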
- To deploy code without running a job: `make CLUSTER=<cluster> deploy`
- To sync results from the cluster manually: `make CLUSTER=<cluster> sync`
- To regenerate the launcher script: `make CLUSTER=<cluster> gen_launcher`
- Julia (with `Distributed.jl` and `SlurmClusterManager.jl` in your `Project.toml`; see the setup sketch below)
- `yq` (YAML processor) installed locally for Makefile parsing
- SSH access to your clusters
- Slurm installed on the remote cluster
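Since the repository ships `Project.toml` and `Manifest.toml`, a one-time instantiation from the project root should install the pinned dependencies (a sketch):

```julia
# One-time setup from the project root: install the recorded dependencies.
using Pkg
Pkg.activate(".")   # activate this project's environment
Pkg.instantiate()   # install everything recorded in Manifest.toml
```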
- All cluster-specific options (partitions, resources, etc.) are set in `config/clusters.yaml`.
- The workflow assumes your code is run via `src/main.jl`.
- You can add more clusters by extending `config/clusters.yaml`.
The workflow includes some explicit error checks to help you quickly diagnose configuration or environment issues:
Where: During `make deploy` or `make sync` (and any target that uses these).
Cause: The `CLUSTER` variable you provided does not match any entry in your `config/clusters.yaml` file, or the `short` field is missing for that cluster.
How to fix:
- Check that `CLUSTER` matches a top-level key in `config/clusters.yaml` and that the entry defines a `short` field.
Where: At the start of a Slurm job, in the generated `bin/launcher.sh` script.
Cause: The `config/prologue.sh` script failed (non-zero exit code). This script is responsible for setting up the environment on the cluster (e.g., loading modules, activating environments).
How to fix:
- Check the contents and logic of `config/prologue.sh`.
- Make sure all commands in the prologue succeed on the cluster (e.g., `module load julia` or similar).
- Check the job's output log for error messages from the prologue.
Where: During job monitoring, after job submission, in the `config/epilogue.sh` script.
Cause: The script exits with code 2 in the following cases:
- The job enters a terminal or error state (such as FAILED, CANCELLED, TIMEOUT, NODE_FAIL, OUT_OF_MEMORY, etc.) instead of RUNNING or PENDING.
- After running, if any job in the array is not in the COMPLETED state (e.g., failed, cancelled, or another non-successful state).
How to fix:
- Check the job's output and error logs in the `logs/` directory for details on why the job failed or did not complete.

These explicit errors are designed to fail fast and provide clear feedback if your configuration or environment is not set up correctly.
See LICENSE.md.
This project makes use of Distributed.jl and SlurmClusterManager.jl.