
Weird issues trying to run on GPU with v1.1.0 #246

@tbenavi1

Description


I am using Snakemake version 8.25.5 and plugin version 1.1.0.

I am trying to figure out how to write a Snakemake rule that matches this sbatch command (which works correctly):

sbatch -A tgen-332000 -t 96:00:00 --nodes=1 -p gpu-a100 --ntasks=1 --gres=gpu:A100:2 --cpus-per-gpu 16 --mem 384000 dorado1.sh

The rule I made has these resources:

  resources:
    mem_mb=384000,
    gpu=2,
    gpu_model="a100",
    slurm_partition="gpu-a100",
    runtime=4320,
    cpus_per_gpu=16
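
As a workaround I also considered passing the GPU request through to sbatch verbatim via `slurm_extra`, mirroring the working sbatch command above (I'm not sure this is the intended way to request a specific GPU model with this plugin version, so treat this as a sketch):

```
  resources:
    mem_mb=384000,
    runtime=4320,
    slurm_partition="gpu-a100",
    # Workaround sketch: forward the GPU flags to sbatch verbatim instead of
    # relying on gpu/gpu_model/cpus_per_gpu translation (assumed, not verified)
    slurm_extra="'--gres=gpu:A100:2 --cpus-per-gpu=16'"
```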

However, whenever I ran "snakemake --profile profile", the jobs started on the "compute" partition even though I had requested the "gpu-a100" partition. Another oddity I noticed in the log file was that everything seemed to be running twice:

host: g-h-1-8-07
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 1
Provided resources: mem_mb=384000, mem_mib=366211, disk_mb=772244, disk_mib=736470, gpu=2, cpus_per_gpu=16
Select jobs to execute...
Execute 1 jobs...

[Sat Mar 22 05:57:01 2025]
rule herro_all_gpu:
    input: BJ/ONT/BJ.all.ONT.fastq, BJ/ONT/BJ.all.ONT.overlaps.paf
    output: BJ/ONT/BJ.all.ONT.corrected.fasta
    jobid: 0
    reason: Forced execution
    wildcards: sample=BJ
    threads: 32
    resources: mem_mb=384000, mem_mib=366211, disk_mb=772244, disk_mib=736470, tmpdir=<TBD>, slurm_account=tgen-332000, gpu=2, gpu_model=A100, slurm_partition=gpu-a100, runtime=4320, cpus_per_gpu=16

host: g-h-1-8-07
host: g-h-1-8-07
Building DAG of jobs...
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=384000, mem_mib=366211, disk_mb=772244, disk_mib=736470, gpu=2, cpus_per_gpu=16
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=384000, mem_mib=366211, disk_mb=772244, disk_mib=736470, gpu=2, cpus_per_gpu=16
Select jobs to execute...
Select jobs to execute...
Execute 1 jobs...

[Sat Mar 22 05:57:03 2025]
Execute 1 jobs...
localrule herro_all_gpu:
    input: BJ/ONT/BJ.all.ONT.fastq, BJ/ONT/BJ.all.ONT.overlaps.paf
    output: BJ/ONT/BJ.all.ONT.corrected.fasta
    jobid: 0
    reason: Forced execution
    wildcards: sample=BJ
    threads: 16
    resources: mem_mb=384000, mem_mib=366211, disk_mb=772244, disk_mib=736470, tmpdir=/tmp, slurm_account=tgen-332000, gpu=2, gpu_model=A100, slurm_partition=gpu-a100, runtime=4320, cpus_per_gpu=16


[Sat Mar 22 05:57:03 2025]
localrule herro_all_gpu:
    input: BJ/ONT/BJ.all.ONT.fastq, BJ/ONT/BJ.all.ONT.overlaps.paf
    output: BJ/ONT/BJ.all.ONT.corrected.fasta
    jobid: 0
    reason: Forced execution
    wildcards: sample=BJ
    threads: 16
    resources: mem_mb=384000, mem_mib=366211, disk_mb=772244, disk_mib=736470, tmpdir=/tmp, slurm_account=tgen-332000, gpu=2, gpu_model=A100, slurm_partition=gpu-a100, runtime=4320, cpus_per_gpu=16

[2025-03-22 05:57:15.892] [info] Running: "correct" "BJ/ONT/BJ.all.ONT.fastq" "--from-paf" "BJ/ONT/BJ.all.ONT.overlaps.paf"
[2025-03-22 05:57:15.892] [info] Running: "correct" "BJ/ONT/BJ.all.ONT.fastq" "--from-paf" "BJ/ONT/BJ.all.ONT.overlaps.paf"
[2025-03-22 05:57:15.944] [warning] Unknown certs location for current distribution. If you hit download issues, use the envvar `SSL_CERT_FILE` to specify the location manually.
[2025-03-22 05:57:15.945] [warning] Unknown certs location for current distribution. If you hit download issues, use the envvar `SSL_CERT_FILE` to specify the location manually.
[2025-03-22 05:57:16.000] [info]  - downloading herro-v1 with httplib
[2025-03-22 05:57:16.000] [info]  - downloading herro-v1 with httplib
[2025-03-22 05:57:16.110] [error] Failed to download herro-v1: SSL server verification failed
[2025-03-22 05:57:16.110] [info]  - downloading herro-v1 with curl
[2025-03-22 05:57:16.110] [error] Failed to download herro-v1: SSL server verification failed
[2025-03-22 05:57:16.110] [info]  - downloading herro-v1 with curl
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
                                 Dload  Upload   Total   Spent    Left  Speed
^M  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0^M100 22.3M  100 22.3M    0     0  52.6M      0 --:--:-- --:--:-- --:--:-- 52.6M
^M  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0^M100 22.3M  100 22.3M    0     0  52.5M      0 --:--:-- --:--:-- --:--:-- 52.6M
[2025-03-22 05:57:17.454] [info] Using batch size 12 on device cuda:0 in inference thread 0.
[2025-03-22 05:57:17.455] [info] Using batch size 12 on device cuda:0 in inference thread 1.
[2025-03-22 05:57:17.455] [info] Using batch size 12 on device cuda:0 in inference thread 0.
[2025-03-22 05:57:17.455] [info] Using batch size 12 on device cuda:0 in inference thread 1.
[2025-03-22 05:57:17.499] [info] Starting
[2025-03-22 05:57:17.506] [info] Starting
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 18555518 ON g-h-1-8-07 CANCELLED AT 2025-03-22T06:01:31 ***
slurmstepd: error: *** STEP 18555518.0 ON g-h-1-8-07 CANCELLED AT 2025-03-22T06:01:31 ***
Will exit after finishing currently running jobs (scheduler).
Will exit after finishing currently running jobs (scheduler).

Perhaps there is a bug where requesting 2 GPUs causes the rule to be run twice? Please let me know what advice you have. Thank you.
