When I run dfilemaker on a single node allocated by Slurm, I can trigger an OOM while it writes content to files, and dfilemaker is killed. I'm not sure whether this is a flaw in dfilemaker or in something else (e.g. the Slurm config).
This was on mutt, on a node allocated with "salloc -N1"; the target file system was Lustre.
The mpifileutils version was:
* eb57445 (HEAD -> b-bad-option, olagit/b-bad-option) dfilemaker: remove duplicate longopts struct
* 999ecff dfilemaker: fail and stop execution on unrecognized option
and the command I ran was:
bash-4.4$ srun -n32 ~/projects/mfu-install/bin/dfilemaker --fill=alternate --depth=1-30 -nitems=10000-$((10*1000*1000)) --verbose
[2024-11-25T17:04:49] Creating 1429103 directories
[2024-11-25T17:04:59] Created 144290 directories (10%) in 10.056 secs (14348.591 dirs/sec) 90 secs left ...
[2024-11-25T17:05:09] Created 293797 directories (21%) in 20.114 secs (14606.766 dirs/sec) 78 secs left ...
...
[2024-11-25T17:07:41] Created 1278791 items (89%) in 70.013 secs (18264.982 items/sec) 8 secs left ...
[2024-11-25T17:07:51] Created 1425534 items (100%) in 80.010 secs (17817.046 items/sec) 0 secs left ...
[2024-11-25T17:07:52] Created 1429953 items (100%) in 81.207 secs (17608.759 items/sec) done
[2024-11-25T17:07:52] Writing content to files.
slurmstepd: error: Detected 1 oom_kill event in StepId=60053.4. Some of the step tasks have been OOM Killed.
srun: error: mutt11: task 10: Out Of Memory
srun: First task exited 30s ago
srun: StepId=60053.4 tasks 0-9,11-24,26-31: running
srun: StepId=60053.4 tasks 10,25: exited abnormally
srun: Terminating StepId=60053.4
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
slurmstepd: error: *** STEP 60053.4 ON mutt11 CANCELLED AT 2024-11-25T17:12:22 ***