Description
I am using Snakemake version 8.25.5 and plugin version 1.1.0.
I am trying to figure out how to edit my Snakemake rule to match this sbatch command (which works correctly):
sbatch -A tgen-332000 -t 96:00:00 --nodes=1 -p gpu-a100 --ntasks=1 --gres=gpu:A100:2 --cpus-per-gpu 16 --mem 384000 dorado1.sh
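For reference, here is roughly how I understand those sbatch flags line up with the resource names the SLURM executor plugin recognizes (this mapping is my reading of the plugin docs, not verified on this cluster; note that 96:00:00 is 5760 minutes, while the rule below sets runtime=4320, i.e. 72 hours):

```shell
# Tentative flag-to-resource mapping (resource names per snakemake-executor-plugin-slurm docs):
#   -A tgen-332000        -> slurm_account="tgen-332000"
#   -t 96:00:00           -> runtime=5760            # runtime is in minutes
#   -p gpu-a100           -> slurm_partition="gpu-a100"
#   --nodes=1 --ntasks=1  -> nodes=1, tasks=1
#   --gres=gpu:A100:2     -> gpu=2, gpu_model="A100"
#   --cpus-per-gpu 16     -> cpus_per_gpu=16
#   --mem 384000          -> mem_mb=384000
```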
The rule I made has these resources:
resources:
    mem_mb=384000,
    gpu=2,
    gpu_model="a100",
    slurm_partition="gpu-a100",
    runtime=4320,
    cpus_per_gpu=16
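In case it is relevant: if the plugin's gpu/gpu_model translation does not produce the exact --gres string this cluster expects, one workaround would be to pass the flags verbatim through slurm_extra, which is a documented plugin resource (untested sketch; the nested quoting may need adjusting for your shell):

```
rule herro_all_gpu:
    ...
    resources:
        mem_mb=384000,
        slurm_partition="gpu-a100",
        runtime=4320,
        slurm_extra="'--gres=gpu:A100:2 --cpus-per-gpu=16'"
```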
However, whenever I ran "snakemake --profile profile", the jobs started on the "compute" partition even though I had requested the "gpu-a100" partition. Another oddity I noticed in the log file was that everything seemed to run twice:
host: g-h-1-8-07
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 1
Provided resources: mem_mb=384000, mem_mib=366211, disk_mb=772244, disk_mib=736470, gpu=2, cpus_per_gpu=16
Select jobs to execute...
Execute 1 jobs...
[Sat Mar 22 05:57:01 2025]
rule herro_all_gpu:
input: BJ/ONT/BJ.all.ONT.fastq, BJ/ONT/BJ.all.ONT.overlaps.paf
output: BJ/ONT/BJ.all.ONT.corrected.fasta
jobid: 0
reason: Forced execution
wildcards: sample=BJ
threads: 32
resources: mem_mb=384000, mem_mib=366211, disk_mb=772244, disk_mib=736470, tmpdir=<TBD>, slurm_account=tgen-332000, gpu=2, gpu_model=A100, slurm_partition=gpu-a100, runtime=4320, cpus_per_gpu=16
host: g-h-1-8-07
host: g-h-1-8-07
Building DAG of jobs...
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=384000, mem_mib=366211, disk_mb=772244, disk_mib=736470, gpu=2, cpus_per_gpu=16
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=384000, mem_mib=366211, disk_mb=772244, disk_mib=736470, gpu=2, cpus_per_gpu=16
Select jobs to execute...
Select jobs to execute...
Execute 1 jobs...
[Sat Mar 22 05:57:03 2025]
Execute 1 jobs...
localrule herro_all_gpu:
input: BJ/ONT/BJ.all.ONT.fastq, BJ/ONT/BJ.all.ONT.overlaps.paf
output: BJ/ONT/BJ.all.ONT.corrected.fasta
jobid: 0
reason: Forced execution
wildcards: sample=BJ
threads: 16
resources: mem_mb=384000, mem_mib=366211, disk_mb=772244, disk_mib=736470, tmpdir=/tmp, slurm_account=tgen-332000, gpu=2, gpu_model=A100, slurm_partition=gpu-a100, runtime=4320, cpus_per_gpu=16
[Sat Mar 22 05:57:03 2025]
localrule herro_all_gpu:
input: BJ/ONT/BJ.all.ONT.fastq, BJ/ONT/BJ.all.ONT.overlaps.paf
output: BJ/ONT/BJ.all.ONT.corrected.fasta
jobid: 0
reason: Forced execution
wildcards: sample=BJ
threads: 16
resources: mem_mb=384000, mem_mib=366211, disk_mb=772244, disk_mib=736470, tmpdir=/tmp, slurm_account=tgen-332000, gpu=2, gpu_model=A100, slurm_partition=gpu-a100, runtime=4320, cpus_per_gpu=16
[2025-03-22 05:57:15.892] [info] Running: "correct" "BJ/ONT/BJ.all.ONT.fastq" "--from-paf" "BJ/ONT/BJ.all.ONT.overlaps.paf"
[2025-03-22 05:57:15.892] [info] Running: "correct" "BJ/ONT/BJ.all.ONT.fastq" "--from-paf" "BJ/ONT/BJ.all.ONT.overlaps.paf"
[2025-03-22 05:57:15.944] [warning] Unknown certs location for current distribution. If you hit download issues, use the envvar `SSL_CERT_FILE` to specify the location manually.
[2025-03-22 05:57:15.945] [warning] Unknown certs location for current distribution. If you hit download issues, use the envvar `SSL_CERT_FILE` to specify the location manually.
[2025-03-22 05:57:16.000] [info] - downloading herro-v1 with httplib
[2025-03-22 05:57:16.000] [info] - downloading herro-v1 with httplib
[2025-03-22 05:57:16.110] [error] Failed to download herro-v1: SSL server verification failed
[2025-03-22 05:57:16.110] [info] - downloading herro-v1 with curl
[2025-03-22 05:57:16.110] [error] Failed to download herro-v1: SSL server verification failed
[2025-03-22 05:57:16.110] [info] - downloading herro-v1 with curl
% Total % Received % Xferd Average Speed Time Time Time Current
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
Dload Upload Total Spent Left Speed
^M 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0^M100 22.3M 100 22.3M 0 0 52.6M 0 --:--:-- --:--:-- --:--:-- 52.6M
^M 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0^M100 22.3M 100 22.3M 0 0 52.5M 0 --:--:-- --:--:-- --:--:-- 52.6M
[2025-03-22 05:57:17.454] [info] Using batch size 12 on device cuda:0 in inference thread 0.
[2025-03-22 05:57:17.455] [info] Using batch size 12 on device cuda:0 in inference thread 1.
[2025-03-22 05:57:17.455] [info] Using batch size 12 on device cuda:0 in inference thread 0.
[2025-03-22 05:57:17.455] [info] Using batch size 12 on device cuda:0 in inference thread 1.
[2025-03-22 05:57:17.499] [info] Starting
[2025-03-22 05:57:17.506] [info] Starting
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 18555518 ON g-h-1-8-07 CANCELLED AT 2025-03-22T06:01:31 ***
slurmstepd: error: *** STEP 18555518.0 ON g-h-1-8-07 CANCELLED AT 2025-03-22T06:01:31 ***
Will exit after finishing currently running jobs (scheduler).
Will exit after finishing currently running jobs (scheduler).
Perhaps there is a bug where requesting 2 GPUs causes the rule to run twice? Please let me know what advice you have. Thank you.
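For completeness, this is the shape of profile/config.yaml I am guessing at, since the profile itself isn't shown above (the keys are standard Snakemake 8 profile settings; your actual values may differ):

```yaml
# Hypothetical profile/config.yaml; adjust to match your real profile
executor: slurm
jobs: 1
default-resources:
  slurm_account: tgen-332000
```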