Problems with the internal jobstep submission #225

Open
@jd3234

Software Versions

  • snakemake: v8.29.3
  • snakemake-executor-plugin-slurm 0.15.1 pypi_0 pypi
  • snakemake-executor-plugin-slurm-jobstep 0.2.1 pypi_0 pypi
  • slurm 23.11.3

Describe the bug
I often, though not reproducibly, get this error:

srun: error: JobId=125452 failed to distribute tasks (bind_type:threads,mask_cpu,one_thread) - this should never happen
srun: error: Task launch for StepId=125452.0 failed on node [NODENAME]: Unable to layout tasks on given cpus
srun: error: Application launch failed: Unable to layout tasks on given cpus
srun: Job step aborted

This error seems to appear when too many samples run in parallel for the same rule. When only a few samples run, the same rule may complete without it.

I have also seen it when requesting more CPUs via cpus_per_task than the tool then uses. For example, I accidentally requested 20 CPUs for a tool that does not support multithreading. When testing with fewer CPUs (5) it worked again, and with 1 CPU it always works.
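To avoid over-requesting CPUs by accident, the request can be capped at what each tool can actually use. The following is only a minimal sketch: `MAX_TOOL_THREADS`, its values, and `cpus_for_tool` are hypothetical names made up for illustration, not part of Snakemake or the SLURM plugins, and this does not address the srun failure itself.

```python
# Hypothetical lookup table: the maximum number of threads each tool can use.
# The values are assumptions for illustration only.
MAX_TOOL_THREADS = {
    "cd-hit-est": 16,
    "single_threaded_tool": 1,
}


def cpus_for_tool(tool, requested):
    """Return the number of CPUs to request for `tool`, capped at its limit."""
    if requested < 1:
        raise ValueError("requested must be >= 1")
    # Tools missing from the table are assumed to be single-threaded.
    limit = MAX_TOOL_THREADS.get(tool, 1)
    return min(requested, limit)
```

In a rule this could be used as a resource callable, for example `cpus_per_task = lambda wildcards, threads: cpus_for_tool("cd-hit-est", threads)`, assuming the rule also sets `threads`.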

Here is an example of a rule where I have seen the error:

rule clustering_1:
    name: TOOL + '.clustering_1'
    message: "Clustering reads to make a list of most abundant, representative reads. Rule {rule}."
    input:
        fasta = rules.read_filter.output.fasta
    output:
        clusterfasta = 'clustering_1/{sample}/{sample}_cluster_representatives.fasta',
        cluster = 'clustering_1/{sample}/{sample}_cluster_representatives.fasta.clstr'
    resources: 
        mem_mb = lambda wildcards, attempt: attempt * 2500,
        cpus_per_task = 16
    benchmark: 
        'benchmarks/clustering_1_{sample}.tsv'
    retries: 5
    container: os.environ['apptainer_container']
    params:
        clusteringidentity = IDENTITY,
        wordlen = WORDLEN,
        clustprog = '1' #accurate but slow mode
    log:
        cluster = 'logs/{sample}/{sample}_clustering1-report.log',
    shell:
       """
            # This is the actual clustering command
            cd-hit-est -T {resources.cpus_per_task} -i {input.fasta} -o {output.clusterfasta} -c {params.clusteringidentity} -n {params.wordlen} -d 0 -M {resources.mem_mb} -g {params.clustprog} > {log.cluster}
       """

Logs

host: [NODENAME]
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 1
Provided resources: mem_mb=2500, mem_mib=2385, disk_mb=1000, disk_mib=954, cpus_per_task=16
Select jobs to execute...
Execute 1 jobs...

[Tue Mar 11 14:06:39 2025]
Job 0: Clustering reads to make a list of most abundant, representative reads. Rule ont-amplicon_assembly.clustering_1.
Reason: Forced execution


         # This is the actual clustering command
         cd-hit-est -T 16 -i filtered/2000169_EB20S0G1077-0153_[string]_5401_5800_coverage_quality_size_filtered.fasta -o clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta -c 0.99 -n 10,11 -d 0 -M 2500 -g 1 > logs/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_clustering1-report.log
       
srun: error: JobId=125452 failed to distribute tasks (bind_type:threads,mask_cpu,one_thread) - this should never happen
srun: error: Task launch for StepId=125452.0 failed on node [NODENAME]: Unable to layout tasks on given cpus
srun: error: Application launch failed: Unable to layout tasks on given cpus
srun: Job step aborted
[Tue Mar 11 14:06:39 2025]
Error in rule ont-amplicon_assembly.clustering_1:
    jobid: 0
    input: filtered/2000169_EB20S0G1077-0153_[string]_5401_5800_coverage_quality_size_filtered.fasta
    output: clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta, clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta.clstr
    log: logs/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_clustering1-report.log (check log file(s) for error details)
    shell:
        
         # This is the actual clustering command
         cd-hit-est -T 16 -i filtered/2000169_EB20S0G1077-0153_[string]_5401_5800_coverage_quality_size_filtered.fasta -o clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta -c 0.99 -n 10,11 -d 0 -M 2500 -g 1 > logs/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_clustering1-report.log
       
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Trying to restart job 0.
Select jobs to execute...
Execute 1 jobs...

[Tue Mar 11 14:06:39 2025]
Job 0: Clustering reads to make a list of most abundant, representative reads. Rule ont-amplicon_assembly.clustering_1.
Reason: Forced execution


         # This is the actual clustering command
         cd-hit-est -T 16 -i filtered/2000169_EB20S0G1077-0153_[string]_5401_5800_coverage_quality_size_filtered.fasta -o clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta -c 0.99 -n 10,11 -d 0 -M 2500 -g 1 > logs/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_clustering1-report.log
       
srun: error: JobId=125452 failed to distribute tasks (bind_type:threads,mask_cpu,one_thread) - this should never happen
srun: error: Task launch for StepId=125452.1 failed on node [NODENAME]: Unable to layout tasks on given cpus
srun: error: Application launch failed: Unable to layout tasks on given cpus
srun: Job step aborted
[Tue Mar 11 14:06:39 2025]
Error in rule ont-amplicon_assembly.clustering_1:
    jobid: 0
    input: filtered/2000169_EB20S0G1077-0153_[string]_5401_5800_coverage_quality_size_filtered.fasta
    output: clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta, clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta.clstr
    log: logs/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_clustering1-report.log (check log file(s) for error details)
    shell:
        
         # This is the actual clustering command
         cd-hit-est -T 16 -i filtered/2000169_EB20S0G1077-0153_[string]_5401_5800_coverage_quality_size_filtered.fasta -o clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta -c 0.99 -n 10,11 -d 0 -M 2500 -g 1 > logs/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_clustering1-report.log
       
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Trying to restart job 0.
Select jobs to execute...
Execute 1 jobs...

[Tue Mar 11 14:06:39 2025]
Job 0: Clustering reads to make a list of most abundant, representative reads. Rule ont-amplicon_assembly.clustering_1.
Reason: Forced execution


         # This is the actual clustering command
         cd-hit-est -T 16 -i filtered/2000169_EB20S0G1077-0153_[string]_5401_5800_coverage_quality_size_filtered.fasta -o clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta -c 0.99 -n 10,11 -d 0 -M 2500 -g 1 > logs/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_clustering1-report.log
       
srun: error: JobId=125452 failed to distribute tasks (bind_type:threads,mask_cpu,one_thread) - this should never happen
srun: error: Task launch for StepId=125452.2 failed on node [NODENAME]: Unable to layout tasks on given cpus
srun: error: Application launch failed: Unable to layout tasks on given cpus
srun: Job step aborted
[Tue Mar 11 14:06:39 2025]
Error in rule ont-amplicon_assembly.clustering_1:
    jobid: 0
    input: filtered/2000169_EB20S0G1077-0153_[string]_5401_5800_coverage_quality_size_filtered.fasta
    output: clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta, clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta.clstr
    log: logs/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_clustering1-report.log (check log file(s) for error details)
    shell:
        
         # This is the actual clustering command
         cd-hit-est -T 16 -i filtered/2000169_EB20S0G1077-0153_[string]_5401_5800_coverage_quality_size_filtered.fasta -o clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta -c 0.99 -n 10,11 -d 0 -M 2500 -g 1 > logs/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_clustering1-report.log
       
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Trying to restart job 0.
Select jobs to execute...
Execute 1 jobs...

[Tue Mar 11 14:06:39 2025]
Job 0: Clustering reads to make a list of most abundant, representative reads. Rule ont-amplicon_assembly.clustering_1.
Reason: Forced execution


         # This is the actual clustering command
         cd-hit-est -T 16 -i filtered/2000169_EB20S0G1077-0153_[string]_5401_5800_coverage_quality_size_filtered.fasta -o clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta -c 0.99 -n 10,11 -d 0 -M 2500 -g 1 > logs/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_clustering1-report.log
       
srun: error: JobId=125452 failed to distribute tasks (bind_type:threads,mask_cpu,one_thread) - this should never happen
srun: error: Task launch for StepId=125452.3 failed on node [NODENAME]: Unable to layout tasks on given cpus
srun: error: Application launch failed: Unable to layout tasks on given cpus
srun: Job step aborted
[Tue Mar 11 14:06:39 2025]
Error in rule ont-amplicon_assembly.clustering_1:
    jobid: 0
    input: filtered/2000169_EB20S0G1077-0153_[string]_5401_5800_coverage_quality_size_filtered.fasta
    output: clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta, clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta.clstr
    log: logs/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_clustering1-report.log (check log file(s) for error details)
    shell:
        
         # This is the actual clustering command
         cd-hit-est -T 16 -i filtered/2000169_EB20S0G1077-0153_[string]_5401_5800_coverage_quality_size_filtered.fasta -o clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta -c 0.99 -n 10,11 -d 0 -M 2500 -g 1 > logs/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_clustering1-report.log
       
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Trying to restart job 0.
Select jobs to execute...
Execute 1 jobs...

[Tue Mar 11 14:06:39 2025]
Job 0: Clustering reads to make a list of most abundant, representative reads. Rule ont-amplicon_assembly.clustering_1.
Reason: Forced execution


         # This is the actual clustering command
         cd-hit-est -T 16 -i filtered/2000169_EB20S0G1077-0153_[string]_5401_5800_coverage_quality_size_filtered.fasta -o clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta -c 0.99 -n 10,11 -d 0 -M 2500 -g 1 > logs/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_clustering1-report.log
       
srun: error: JobId=125452 failed to distribute tasks (bind_type:threads,mask_cpu,one_thread) - this should never happen
srun: error: Task launch for StepId=125452.4 failed on node [NODENAME]: Unable to layout tasks on given cpus
srun: error: Application launch failed: Unable to layout tasks on given cpus
srun: Job step aborted
[Tue Mar 11 14:06:39 2025]
Error in rule ont-amplicon_assembly.clustering_1:
    jobid: 0
    input: filtered/2000169_EB20S0G1077-0153_[string]_5401_5800_coverage_quality_size_filtered.fasta
    output: clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta, clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta.clstr
    log: logs/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_clustering1-report.log (check log file(s) for error details)
    shell:
        
         # This is the actual clustering command
         cd-hit-est -T 16 -i filtered/2000169_EB20S0G1077-0153_[string]_5401_5800_coverage_quality_size_filtered.fasta -o clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta -c 0.99 -n 10,11 -d 0 -M 2500 -g 1 > logs/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_clustering1-report.log
       
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Trying to restart job 0.
Select jobs to execute...
Execute 1 jobs...

[Tue Mar 11 14:06:39 2025]
Job 0: Clustering reads to make a list of most abundant, representative reads. Rule ont-amplicon_assembly.clustering_1.
Reason: Forced execution


         # This is the actual clustering command
         cd-hit-est -T 16 -i filtered/2000169_EB20S0G1077-0153_[string]_5401_5800_coverage_quality_size_filtered.fasta -o clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta -c 0.99 -n 10,11 -d 0 -M 2500 -g 1 > logs/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_clustering1-report.log
       
srun: error: JobId=125452 failed to distribute tasks (bind_type:threads,mask_cpu,one_thread) - this should never happen
srun: error: Task launch for StepId=125452.5 failed on node [NODENAME]: Unable to layout tasks on given cpus
srun: error: Application launch failed: Unable to layout tasks on given cpus
srun: Job step aborted
[Tue Mar 11 14:06:39 2025]
Error in rule ont-amplicon_assembly.clustering_1:
    jobid: 0
    input: filtered/2000169_EB20S0G1077-0153_[string]_5401_5800_coverage_quality_size_filtered.fasta
    output: clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta, clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta.clstr
    log: logs/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_clustering1-report.log (check log file(s) for error details)
    shell:
        
         # This is the actual clustering command
         cd-hit-est -T 16 -i filtered/2000169_EB20S0G1077-0153_[string]_5401_5800_coverage_quality_size_filtered.fasta -o clustering_1/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_cluster_representatives.fasta -c 0.99 -n 10,11 -d 0 -M 2500 -g 1 > logs/2000169_EB20S0G1077-0153_[string]_5401_5800/2000169_EB20S0G1077-0153_[string]_5401_5800_clustering1-report.log
       
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Storing output in storage.
WorkflowError:
At least one job did not complete successfully.
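Since the same srun failure repeats for every retry (job steps .0 to .5 in the log above), a small script can help count how many steps failed per job across many log files. This is a rough sketch that assumes the srun message format stays exactly as shown in this log; `failed_steps` and `failures_per_job` are hypothetical helper names.

```python
import re
from collections import Counter

# Matches lines such as:
# srun: error: Task launch for StepId=125452.0 failed on node [NODENAME]: Unable to layout tasks on given cpus
STEP_FAILURE = re.compile(
    r"^srun: error: Task launch for StepId=(?P<job>\d+)\.(?P<step>\d+) "
    r"failed on node (?P<node>\S+): (?P<reason>.+)$"
)


def failed_steps(log_text):
    """Return (job_id, step_id, node, reason) for every failed task launch."""
    failures = []
    for line in log_text.splitlines():
        match = STEP_FAILURE.match(line.strip())
        if match:
            failures.append(
                (int(match["job"]), int(match["step"]), match["node"], match["reason"])
            )
    return failures


def failures_per_job(log_text):
    """Count failed job steps per SLURM job ID."""
    return Counter(job for job, _step, _node, _reason in failed_steps(log_text))
```

Reading a log file and passing its contents to `failures_per_job` gives one counter entry per SLURM job ID.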

Minimal example

Additional context
We have looked into the submission scripts and found that removing --executor slurm-jobstep --jobs 1 in

and returning an empty string solves the problem.

Why is slurm-jobstep used? Would it be possible to have an option/parameter to switch that behavior off?

We are currently onboarding Slurm and are still a little inexperienced. It is also possible that our Slurm cluster is not properly configured yet. In a couple of weeks we will upgrade Slurm to v24.11.
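To check whether the cluster configuration alone triggers the error, a plain job step with the same CPU request could be launched outside Snakemake, from inside an allocation obtained with `sbatch` or `salloc`. The sketch below only builds such a command line; `build_srun_check` and `as_shell` are hypothetical helpers, and the flags are a generic srun invocation, not necessarily what the jobstep plugin itself passes.

```python
import shlex


def build_srun_check(cpus_per_task, command=("hostname",)):
    """Build an srun command line for one task with `cpus_per_task` CPUs."""
    if cpus_per_task < 1:
        raise ValueError("cpus_per_task must be >= 1")
    return ["srun", "--ntasks=1", f"--cpus-per-task={cpus_per_task}", *command]


def as_shell(argv):
    """Render an argument list as a copy-pasteable shell command."""
    return shlex.join(argv)


if __name__ == "__main__":
    # Print the commands for the CPU counts mentioned in this report.
    for cpus in (1, 5, 16, 20):
        print(as_shell(build_srun_check(cpus)))
```

If the printed commands fail in the same way when run by hand, the problem is probably in the Slurm setup rather than in the plugin.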
