Apache Spark in Standalone mode on a SLURM-managed HPC system, with Spark Connect for R via sparklyr.
This repository contains scripts and documentation for deploying a multi-node Spark cluster on an HPC system and connecting to it from your local R environment.
- Standalone Spark cluster across multiple HPC nodes
- Automatic driver/worker resource allocation
- Automatic multi-executor sizing based on SLURM cores and memory
- Per-job configuration and scratch directories
- Spark Connect server listening on a fixed port (`15002` by default)
- Ready-to-use `sparklyr` connection from your laptop via SSH tunnel
- Optional port forwarding to the Spark Master and Application Web UIs for monitoring
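The multi-executor sizing mentioned above can be sketched as simple shell arithmetic over the SLURM allocation. The variable defaults and the fixed 4-core executor width below are illustrative assumptions, not the actual logic in `spark-start`:

```shell
# Hypothetical sketch of multi-executor sizing from SLURM variables.
CORES_PER_NODE=${SLURM_CPUS_ON_NODE:-16}      # cores SLURM granted on this node
MEM_PER_NODE_MB=${SLURM_MEM_PER_NODE:-64000}  # memory SLURM granted (MB)
CORES_PER_EXECUTOR=4                          # assumed fixed executor width

# How many executors fit on one node, and how much memory each gets
EXECUTORS_PER_NODE=$(( CORES_PER_NODE / CORES_PER_EXECUTOR ))
MEM_PER_EXECUTOR_MB=$(( MEM_PER_NODE_MB / EXECUTORS_PER_NODE ))

echo "executors per node:  ${EXECUTORS_PER_NODE}"
echo "memory per executor: ${MEM_PER_EXECUTOR_MB}M"
```

With a 16-core, 64 GB node this would yield 4 executors of 16000 MB each.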
- SLURM-managed HPC cluster
- SSH access to the HPC system
- Spark 3.4+ installed and available as a module (tested with 3.4.4)
- `sparklyr` (≥ 1.8.4) and `pysparklyr` (≥ 0.1.3) installed locally in R
- SLURM resource requests (`--nodes`, `--cpus-per-task`, `--mem`, `--time`)
- Calls `spark-start` to configure and launch the cluster
- Sources the per-job Spark environment
- Prints connection details and SSH tunnelling instructions
- Cleans up Spark services on job termination
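The batch launcher's shape might resemble the following sketch; the resource values and the `spark-env.sh` path are illustrative assumptions, not the repository's actual script:

```shell
#!/bin/bash
# Hypothetical SBATCH header -- values are placeholders; adjust for your site.
#SBATCH --job-name=spark-cluster
#SBATCH --nodes=3
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=02:00:00

spark-start                             # configure and launch the cluster
source "$SPARK_CONF_DIR/spark-env.sh"   # per-job Spark environment (assumed path)
```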
- Loads the Spark module
- Validates SLURM environment variables
- Creates job-specific Spark configuration and scratch directories
- Starts Spark Master, captures URLs, and distributes worker start scripts
- Reserves CPU/memory for the driver on its node
- Launches Spark Connect server on `0.0.0.0:15002`
- Makes connection details available via `spark-env.sh`
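Internally, the launch sequence described above might resemble the following sketch; the paths, port, and use of `srun` are assumptions for illustration (the real `spark-start` distributes worker start scripts itself), though `start-master.sh`, `start-worker.sh`, and `start-connect-server.sh` are genuine Spark 3.4 `sbin` scripts:

```shell
# Illustrative sketch only -- requires a SLURM allocation and a Spark module.
module load apps/spark/3.4.4

"$SPARK_HOME/sbin/start-master.sh"           # master on the first node
MASTER_URL="spark://$(hostname -f):7077"

# one worker per remaining node, started over SLURM
srun --nodes="$((SLURM_JOB_NUM_NODES - 1))" --ntasks-per-node=1 \
     "$SPARK_HOME/sbin/start-worker.sh" "$MASTER_URL" &

# Spark Connect server on the fixed port
"$SPARK_HOME/sbin/start-connect-server.sh" \
    --master "$MASTER_URL" \
    --conf spark.connect.grpc.binding.port=15002
```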
- Small R script to test Spark Connect
- Demonstrates connecting to Spark from a local R session
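Such a test script might look roughly like this; it is a sketch that assumes the SSH tunnel to port `15002` is already up, and `copy_to`/`sdf_nrow` support over Spark Connect depends on your `sparklyr`/`pysparklyr` versions:

```r
library(sparklyr)

sc <- spark_connect(
  master  = "sc://localhost:15002",
  method  = "spark_connect",
  version = "3.4.4"
)

# round-trip a small data frame through the cluster
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
print(sdf_nrow(mtcars_tbl))

spark_disconnect(sc)
```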
- Clone this repository to your HPC home directory:

  ```bash
  git clone https://github.com/lquayle88/spark_on_hpc.git
  cd spark_on_hpc
  ```

- Make `spark-start` executable and move it into your `~/bin` directory (so it's in your `PATH`):

  ```bash
  chmod +x spark-start
  mkdir -p ~/bin
  mv spark-start ~/bin/
  ```

- Verify it's available:

  ```bash
  which spark-start
  ```

  You should see something like:

  ```
  ${HOME}/bin/spark-start
  ```
- Submit the SLURM batch job:

  ```bash
  sbatch spark_cluster_launcher.sh
  ```

- Check the job output to find:
  - the Spark Master URL
  - the Spark Connect host and port
  - SSH tunnel commands for the Web UI and Spark Connect
- Forward the Spark Connect port from your local machine. The batch launcher prints the full SSH tunnel command dynamically, including the correct Spark Connect, Master UI, and Application UI ports:

  ```bash
  ssh -N \
    -L 15002:<spark_master_hostname>:15002 \
    -L <master_ui_port>:<spark_master_hostname>:<master_ui_port> \
    -L <app_ui_port>:<spark_master_hostname>:<app_ui_port> \
    <username>@<cluster.domain>
  ```

  Note: replace `<spark_master_hostname>`, `<username>`, and `<cluster.domain>` with your own HPC details.
Then open:
- Master UI (cluster overview): http://localhost:8080
- Application UI (multi-tab): http://localhost:4040
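Before connecting from R, you can sanity-check that the forwarded Spark Connect port is actually listening. This one-liner uses bash's `/dev/tcp` redirection and is purely a local convenience, not part of the repository:

```shell
# Quick local check (bash-only): is the forwarded Spark Connect port open?
if (exec 3<>/dev/tcp/localhost/15002) 2>/dev/null; then
  echo "tunnel up"
else
  echo "tunnel down"
fi
```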
- Connect from R using `sparklyr`:

  ```r
  library(sparklyr)

  sc <- spark_connect(
    master  = "sc://localhost:15002",
    method  = "spark_connect",
    version = "3.4.4"
  )
  ```

- When finished, disconnect from Spark in R:

  ```r
  spark_disconnect(sc)
  ```

  then cancel the SLURM job:
  ```bash
  scancel <jobid>
  ```

- The scripts assume Spark is installed as a module (`module load apps/spark/3.4.4`); adjust for your HPC environment.
- Port `15002` is fixed for Spark Connect; change it in both scripts if needed.
- The Web UIs are accessible via SSH tunnels to the Application and/or Master node's web ports.
- For best performance, request all CPU cores and memory on each node in SLURM before requesting resources from additional nodes.
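That advice can be expressed directly in the batch header; `--exclusive` and `--mem=0` are standard SLURM options (this fragment is illustrative, not the repository's script):

```shell
#SBATCH --nodes=2
#SBATCH --exclusive   # claim whole nodes
#SBATCH --mem=0       # request all available memory on each node
```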
MIT License - please see LICENSE for details.