I am trying to set up a small test project that uses batchtools on Slurm. The parent job exits from Slurm before all of the child jobs have completed. How can I solve this?
The main R script that submits the jobs and the associated configuration files are as follows:
run_batchtools_job.R
library(batchtools)
reg <- makeRegistry(file.dir =  "slurm_registry", seed = 5081, conf.file = "Scripts/batch_tools_test/.batchtools.conf.R")
my_fun <- function(x) {
  Sys.sleep(x)  
  return(x^2)
}
ids <- batchMap(fun = my_fun, x = 100:150, reg = reg)
done <- submitJobs(ids = ids, reg = reg, resources = list(partition = "small", walltime = 86400, memory = 1024, ntasks = 1))
waitForJobs(ids = ids, reg = reg) 
getStatus(ids = ids, reg = reg)    
final_res <- reduceResultsList(ids = ids, reg = reg)
print(class(final_res))
.batchtools.conf.R
cluster.functions <- makeClusterFunctionsSlurm(template = "Scripts/batch_tools_test/slurm_config.tmpl", 
                                               array.jobs = TRUE, 
                                               scheduler.latency = 60,
                                               fs.latency = 30)
max.concurrent.jobs <- 5
slurm_config.tmpl
#!/bin/bash
#SBATCH --job-name=<%= job.name %>
#SBATCH --output=<%= log.file %>
#SBATCH --ntasks=<%= resources$ntasks %>
#SBATCH --mem=<%= resources$memory %>MB
#SBATCH --partition=<%= resources$partition %>
module load  r/4.3.3
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
I submit the run_batchtools_job.R script to Slurm using the following sbatch script.
run_batchtools.sh
#!/bin/bash
#SBATCH --job-name=batchtools_test
#SBATCH --output=batchtools_test.log
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --mem=2G
#SBATCH --partition=small
# Load R
module load  r/4.3.3
# Run your R script
Rscript Scripts/batch_tools_test/run_batchtools_job.R
I observed that the batchtools_test job exits before all the child jobs spawned by submitJobs have finished. As a result, final_res contains no results.
When I checked getErrorMessages, several jobs were listed as 'not terminated'. However, when I manually checked the logs and the results in the registry directories, all jobs had completed as expected.
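As a rough sanity check on timing, the sketch below estimates a lower bound on how long the child jobs need and compares it with the parent job's limit. It assumes the only cost is the Sys.sleep(x) calls for x = 100:150, that exactly 5 jobs run at once (max.concurrent.jobs), and that the parent limit is the 3600 s from --time=01:00:00. Queue wait, R startup, and the 60 s scheduler latency are ignored, so real runs take longer. The variable names are illustrative only.

```shell
#!/bin/bash
# Lower bound on wall time for the child jobs (assumption: sleep time only).

# Sum of the sleep durations for x = 100..150, in seconds.
total=0
for x in $(seq 100 150); do
  total=$((total + x))
done

max_concurrent=5     # from max.concurrent.jobs in .batchtools.conf.R
parent_limit=3600    # from #SBATCH --time=01:00:00 in run_batchtools.sh

# Ceiling of total / max_concurrent: no schedule can finish faster than this.
min_wall=$(( (total + max_concurrent - 1) / max_concurrent ))

echo "total sleep time: ${total}s"
echo "lower bound with ${max_concurrent} concurrent jobs: ${min_wall}s"

if [ "$min_wall" -lt "$parent_limit" ]; then
  echo "sleep time alone fits within the parent limit of ${parent_limit}s"
else
  echo "sleep time alone exceeds the parent limit of ${parent_limit}s"
fi
```

Under these assumptions the total is 6375 s and the lower bound is 1275 s, so the sleep time by itself does not exhaust the one-hour parent limit. Queue wait and polling overhead are not captured here and could still matter.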
How can I overcome this issue?