Slurm issue with different taskset options #1340

@ufukozkan

Hi all,

I am trying to run AWICM3-v3.3 on NHR@ZIB (formerly HLRN).

My branch is similar to 'origin/feat/awicm3_v3.3_and_v3.4_and_geomar'. We added the cdo and nco modules and changed the partition. We also modified the configs/components/xios/xios.yaml file, since the HDF5 libraries have different names on NHR@ZIB.

My problem is that when I set 'taskset : True', I get the error below:

Contents of hostfile_srun:
0-959 ./prog_fesom.sh
960-1151 ./prog_oifs.sh
1152-1152 ./prog_rnfmap.sh
1153-1156 ./prog_xios.sh

Submitting jobscript to batch system...
Output written by slurm:

cd /scratch/usr/hbkufuko/runtime/awicm3-v3.3//taskset-true/run_20000102-20000102/scripts/; sbatch taskset-true_compute_20000102-20000102.run
sbatch: error: --nodes is incompatible with --distribution=arbitrary
Exiting entire Python process!
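
For reference, the rank layout in hostfile_srun above can be checked with a few lines of Python. This is only a minimal sketch: the `count_tasks` helper is hypothetical (it is not part of esm_tools), and it assumes the file uses the usual `first-last program` layout shown in the log.

```python
# Hostfile contents copied from the log above.
HOSTFILE = """\
0-959 ./prog_fesom.sh
960-1151 ./prog_oifs.sh
1152-1152 ./prog_rnfmap.sh
1153-1156 ./prog_xios.sh
"""


def count_tasks(text):
    """Return {program: number of ranks} for a 'first-last program' hostfile.

    Hypothetical helper for illustration; comma-separated rank lists and
    single ranks are also handled.
    """
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        ranks, prog = line.split(None, 1)
        n = 0
        for part in ranks.split(","):
            if "-" in part:
                first, last = part.split("-")
                n += int(last) - int(first) + 1
            else:
                n += 1
        counts[prog] = counts.get(prog, 0) + n
    return counts


if __name__ == "__main__":
    counts = count_tasks(HOSTFILE)
    for prog, n in counts.items():
        print(f"{prog}: {n} tasks")
    print("total:", sum(counts.values()))
```

With the hostfile above this gives 960 tasks for FESOM, 192 for OpenIFS, 1 for rnfmap and 4 for XIOS, 1157 tasks in total.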

If I submit the job manually (command below), the model restarts without a problem.

sbatch run../scripts/...run
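
To see whether the two options named in the sbatch error are both set in the generated run script, the `#SBATCH` lines can be scanned as sketched below. The helper names and the example script text are hypothetical (the real run script is in the attachment, not reproduced here), and the sketch only looks at long `--option` directives inside the script; options passed through short flags such as `-N`, the sbatch command line, or environment variables are not covered.

```python
def sbatch_options(script_text):
    """Collect long-form options from '#SBATCH' lines as {name: value}.

    Hypothetical helper; a value given as a separate word
    ('--nodes 4') is recorded as an empty string.
    """
    opts = {}
    for line in script_text.splitlines():
        line = line.strip()
        if not line.startswith("#SBATCH"):
            continue
        for token in line[len("#SBATCH"):].split():
            if token.startswith("--"):
                name, _, value = token[2:].partition("=")
                opts[name] = value
    return opts


def has_nodes_arbitrary_conflict(opts):
    """True if both options named in the sbatch error message are present."""
    return "nodes" in opts and opts.get("distribution") == "arbitrary"


if __name__ == "__main__":
    # Made-up script header for illustration only, not the real run script.
    example = """#!/bin/bash
#SBATCH --partition=some_partition
#SBATCH --nodes=13
#SBATCH --distribution=arbitrary
"""
    opts = sbatch_options(example)
    print(opts)
    print("conflict:", has_nodes_arbitrary_conflict(opts))
```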

On the recommendation of @JanStreffing, I tried 'taskset : False' (there is also a related discussion from a year ago, #1148).

However, with this setting the job is submitted as a hetjob and never starts; its start time keeps shifting. When I kill the job and change the setting back to True, the job starts.

I have attached the runscript and the blogin YAML file. Please let me know if anything else is needed.

Kind regards,
Ufuk

blogin.txt

awicm3-v3.3-blogin-TL255L91-Arc01.txt

Labels: bug, slurm
