dfilemaker triggering OOM while writing content #611

Open
@ofaaland

When run on a single node allocated by Slurm, I can trigger an OOM while dfilemaker is writing content to files, and dfilemaker is killed. I'm not sure whether this is a flaw in dfilemaker or a flaw in something else (e.g. the Slurm config).

The run below was on mutt, on a node allocated with "salloc -N1"; the target file system was Lustre.

The mpifileutils version was:

* eb57445 (HEAD -> b-bad-option, olagit/b-bad-option) dfilemaker: remove duplicate longopts struct
* 999ecff dfilemaker: fail and stop execution on unrecognized option

and the command run was:

bash-4.4$ srun -n32 ~/projects/mfu-install/bin/dfilemaker --fill=alternate --depth=1-30 -nitems=10000-$((10*1000*1000)) --verbose
[2024-11-25T17:04:49] Creating 1429103 directories
[2024-11-25T17:04:59] Created 144290 directories (10%) in 10.056 secs (14348.591 dirs/sec) 90 secs left ...
[2024-11-25T17:05:09] Created 293797 directories (21%) in 20.114 secs (14606.766 dirs/sec) 78 secs left ...
...
[2024-11-25T17:07:41] Created 1278791 items (89%) in 70.013 secs (18264.982 items/sec) 8 secs left ...
[2024-11-25T17:07:51] Created 1425534 items (100%) in 80.010 secs (17817.046 items/sec) 0 secs left ...
[2024-11-25T17:07:52] Created 1429953 items (100%) in 81.207 secs (17608.759 items/sec) done
[2024-11-25T17:07:52] Writing content to files.
slurmstepd: error: Detected 1 oom_kill event in StepId=60053.4. Some of the step tasks have been OOM Killed.
srun: error: mutt11: task 10: Out Of Memory
srun: First task exited 30s ago
srun: StepId=60053.4 tasks 0-9,11-24,26-31: running
srun: StepId=60053.4 tasks 10,25: exited abnormally
srun: Terminating StepId=60053.4
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
slurmstepd: error: *** STEP 60053.4 ON mutt11 CANCELLED AT 2024-11-25T17:12:22 ***
