Workarounds for Lustre I/O issues #4426


Open: BenWibking wants to merge 2 commits into development from lustre-workaround-io

Conversation

@BenWibking (Contributor) commented Apr 21, 2025

Summary

This adds workarounds for Lustre I/O write issues at scale (~128 nodes or more):

  • moving the flush that previously followed each FAB write to after the whole MultiFab has been written,
  • removing the flush after particle real data and particle int data (a flush is still done in amrex::NFilesIter), and
  • adding a ParallelDescriptor::Barrier() after writing each level (both for MultiFabs and particles).

This reduces plotfile write time on 1024 nodes on Frontier from 30+ minutes to 1 minute. It performs best with a stripe count of 1, a stripe size of 16M, and 1 file per node.
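Schematically, the reordered write path looks like the sketch below (a simplified illustration, not the literal VisMF/plotfile code; the function name and its argument are placeholders):

```cpp
// Simplified sketch of the reordered I/O pattern; names are placeholders.
#include <AMReX_ParallelDescriptor.H>
#include <fstream>

void WriteLevelData (std::ofstream& ofs /* one file per node, via NFilesIter */)
{
    // 1. Write every FAB of the MultiFab into the stream.
    //    (Previously the stream was flushed after each FAB; those per-FAB
    //     flushes are what hurt Lustre at large node counts.)
    // for (each FAB) { write the FAB's bytes to ofs; }

    // 2. Flush once, after the whole MultiFab has been written.
    ofs.flush();

    // 3. Synchronize all ranks after the level has been written, before
    //    moving on to the next level.
    amrex::ParallelDescriptor::Barrier();
}
```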

Additional background

Checklist

The proposed changes:

  • fix a bug or incorrect behavior in AMReX
  • add new capabilities to AMReX
  • change answers in the test suite to more than roundoff level
  • are likely to significantly affect the results of downstream AMReX users
  • include documentation in the code and/or rst files, if appropriate

@BenWibking force-pushed the lustre-workaround-io branch from 403631b to 2b010ce on April 21, 2025 12:53
@BenWibking marked this pull request as ready for review on April 21, 2025 13:11
@@ -1111,7 +1111,7 @@ VisMF::Write (const FabArray<FArrayBox>& mf,
nfi.Stream().flush();
delete [] allFabData;

} else { // ---- write fabs individually
} else { // ---- write fabs individually

Member:
The white space change is unnecessary and I think is incorrect.

}
Real const* fabdata = fab.dataPtr();
#ifdef AMREX_USE_GPU
#ifdef AMREX_USE_GPU

Member:
Seems unnecessary.

Contributor Author:
Will revert the spurious whitespace changes.

@WeiqunZhang (Member)

The flush calls were added in the past to avoid I/O issues on Titan. They might still be needed on some systems. So maybe we can make this (and the mpi_barrier) a runtime parameter. We could make not flushing the default. Or we could make the default different on different machines, for example if (Machine::name() == "olcf.frontier").

@BenWibking (Contributor Author)

The flush calls were added in the past to avoid I/O issues on Titan. They might still be needed on some systems. So maybe we can make this (and the mpi_barrier) a runtime parameter. We could make not flushing the default. Or we could make the default different on different machines, for example if (Machine::name() == "olcf.frontier").

Ok, I'd be happy with making it a runtime parameter. I will update the PR.
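A minimal sketch of how such a runtime parameter might be wired up, assuming a hypothetical ParmParse key vismf.flush_after_writes (the key name and default logic are illustrative, not the final implementation):

```cpp
// Hypothetical sketch: the key "vismf.flush_after_writes" and the default
// choice are illustrative, not the merged implementation.
#include <AMReX_ParmParse.H>

bool FlushAfterFabWrites ()
{
    // Proposed default: no per-FAB flush. A machine-specific default, e.g.
    // based on Machine::name() as suggested above, could be applied here.
    bool do_flush = false;
    amrex::ParmParse pp("vismf");
    pp.query("flush_after_writes", do_flush);  // runtime override from the inputs file
    return do_flush;
}
```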

@WeiqunZhang (Member) commented May 16, 2025

A WarpX user reported,

I tried this PR but it doesn’t seem to fix the issue. The simulations [on Frontier] ran for 1:30 h but it stalled on the checkpoint at the zeroth timestep.

@BenWibking (Contributor Author) commented May 16, 2025

A WarpX user reported,

I tried this PR but it doesn’t seem to fix the issue. The simulations [on Frontier] ran for 1:30 h but it stalled on the checkpoint at the zeroth timestep.

It may be necessary to manually add a ParallelDescriptor::Barrier() after each level's writes, immediately before this line:
https://github.com/BLAST-WarpX/warpx/blob/e7f688ebe9a1a111488309e9fee340887ce8ed50/Source/Diagnostics/FlushFormats/FlushFormatCheckpoint.cpp#L156

Without doing this at the analogous location in our code, I also see hangs on Frontier.
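
For reference, a minimal sketch of that workaround at the checkpoint-writer level (the loop structure is paraphrased, not copied from WarpX; only the Barrier() call is the suggested change):

```cpp
// Sketch only: the loop is paraphrased; the Barrier() is the suggested workaround.
#include <AMReX_ParallelDescriptor.H>

void WriteCheckpointLevels (int finest_level)
{
    for (int lev = 0; lev <= finest_level; ++lev)
    {
        // ... write this level's MultiFabs to the checkpoint directory ...

        // Workaround for Lustre hangs at scale: make every rank finish
        // writing this level before any rank starts on the next one.
        amrex::ParallelDescriptor::Barrier();
    }
}
```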

@BenWibking (Contributor Author)

Also, I set the striping on Frontier manually:

lfs setstripe -c 1 -S 16M $SLURM_SUBMIT_DIR

and also set the number of files to 1 per node:

warpx.field_io_nfiles = $NNODES
warpx.particle_io_nfiles = $NNODES

Sometimes, it still hangs in the particle writes every few checkpoints. I don't have a workaround for that.

@titoiride

The change now worked for me after applying

It may be necessary to manually add a ParallelDescriptor::Barrier() after each level's writes, immediately before this line: https://github.com/BLAST-WarpX/warpx/blob/e7f688ebe9a1a111488309e9fee340887ce8ed50/Source/Diagnostics/FlushFormats/FlushFormatCheckpoint.cpp#L156

Without doing this at the analogous location in our code, I also see hangs on Frontier.

and setting the striping correctly.
The simulation was a 5200-node run on Frontier and generated checkpoints in about 3 minutes. It didn't run for very long, so it did not run into the particle-write hangs.

@BenWibking (Contributor Author)

The change now worked for me after applying

It may be necessary to manually add a ParallelDescriptor::Barrier() after each level's writes, immediately before this line: https://github.com/BLAST-WarpX/warpx/blob/e7f688ebe9a1a111488309e9fee340887ce8ed50/Source/Diagnostics/FlushFormats/FlushFormatCheckpoint.cpp#L156
Without doing this at the analogous location in our code, I also see hangs on Frontier.

and setting the striping correctly. The simulation was a 5200-node run on Frontier and generated checkpoints in about 3 minutes. It didn't run for very long, so it did not run into the particle-write hangs.

That's great to hear. Maybe it now works without hanging after the most recent maintenance window.

What is the size of each of your checkpoints? I'm curious what effective write bandwidth you're seeing.

@titoiride

Uhm, I actually wonder if it was just a system fix. I tried the regular code too (the latest release, without these changes) with the same output, and checkpoints now work as long as I set the striping. The only caveat is that I did not go very far into the interaction, so I haven't really tested a very "mixed" situation (although the simulation is quite unbalanced at the beginning).
Each checkpoint is ~280 TB.
It may be worth trying again further into the interaction to test whether the output time increases significantly.
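(If each ~280 TB checkpoint was written in the roughly 3 minutes quoted above, that corresponds to an aggregate write bandwidth on the order of 1.5 TB/s.)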
