Workarounds for Lustre I/O issues #4426


Open: BenWibking wants to merge 2 commits into development from lustre-workaround-io

Conversation

@BenWibking (Contributor) commented Apr 21, 2025

Summary

This adds workarounds for Lustre I/O write issues at scale (~128 nodes or more):

  • moving the flush that previously followed each FAB write to after the whole MultiFab has been written,
  • removing the flush after particle real data and particle int data (a flush is still done in amrex::NFilesIter), and
  • adding a ParallelDescriptor::Barrier() after writing each level (both for MultiFabs and particles).

This reduces plotfile write time on 1024 nodes on Frontier from 30+ minutes to 1 minute. It performs best with a stripe count of 1, a stripe size of 16M, and 1 file per node.
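Schematically, the reordered write path looks like the sketch below (a simplified illustration, not the literal VisMF/plotfile code; the function name and its argument are placeholders):

```cpp
// Simplified sketch of the reordered I/O pattern; names are placeholders.
#include <AMReX_ParallelDescriptor.H>
#include <fstream>

void WriteLevelData (std::ofstream& ofs /* one file per node, via NFilesIter */)
{
    // 1. Write every FAB of the MultiFab into the stream.
    //    (Previously the stream was flushed after each FAB; those per-FAB
    //     flushes are what hurt Lustre at large node counts.)
    // for (each FAB) { write the FAB's bytes to ofs; }

    // 2. Flush once, after the whole MultiFab has been written.
    ofs.flush();

    // 3. Synchronize all ranks after the level has been written, before
    //    moving on to the next level.
    amrex::ParallelDescriptor::Barrier();
}
```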

Additional background

Checklist

The proposed changes:

  • fix a bug or incorrect behavior in AMReX
  • add new capabilities to AMReX
  • change answers in the test suite to more than roundoff level
  • are likely to significantly affect the results of downstream AMReX users
  • include documentation in the code and/or rst files, if appropriate

@BenWibking force-pushed the lustre-workaround-io branch from 403631b to 2b010ce on April 21, 2025 12:53
@BenWibking marked this pull request as ready for review on April 21, 2025 13:11
@@ -1111,7 +1111,7 @@ VisMF::Write (const FabArray<FArrayBox>& mf,
nfi.Stream().flush();
delete [] allFabData;

} else { // ---- write fabs individually
} else { // ---- write fabs individually

Member:
The white space change is unnecessary and I think is incorrect.

}
Real const* fabdata = fab.dataPtr();
#ifdef AMREX_USE_GPU
#ifdef AMREX_USE_GPU

Member:
Seems unnecessary.

Contributor Author:
Will revert the spurious whitespace changes.

@WeiqunZhang (Member)

The flush calls were added in the past to avoid I/O issues on Titan. They might still be needed on some systems. So maybe we can make this (and the mpi_barrier) a runtime parameter. We could make not flushing the default. Or we could make the default different on different machines, for example if (Machine::name() == "olcf.frontier").

@BenWibking (Contributor Author)

The flush calls were added in the past to avoid I/O issues on Titan. They might still be needed on some systems. So maybe we can make this (and the mpi_barrier) a runtime parameter. We could make not flushing the default. Or we could make the default different on different machines, for example if (Machine::name() == "olcf.frontier").

Ok, I'd be happy with making it a runtime parameter. I will update the PR.
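A minimal sketch of how such a runtime parameter might be wired up, assuming a hypothetical ParmParse key vismf.flush_after_writes (the key name and default logic are illustrative, not the final implementation):

```cpp
// Hypothetical sketch: the key "vismf.flush_after_writes" and the default
// choice are illustrative, not the merged implementation.
#include <AMReX_ParmParse.H>

bool FlushAfterFabWrites ()
{
    // Proposed default: no per-FAB flush. A machine-specific default, e.g.
    // based on Machine::name() as suggested above, could be applied here.
    bool do_flush = false;
    amrex::ParmParse pp("vismf");
    pp.query("flush_after_writes", do_flush);  // runtime override from the inputs file
    return do_flush;
}
```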

@WeiqunZhang (Member) commented May 16, 2025

A WarpX user reported,

I tried this PR but it doesn’t seem to fix the issue. The simulations [on Frontier] ran for 1:30 h but it stalled on the checkpoint at the zeroth timestep.

@BenWibking (Contributor Author) commented May 16, 2025

A WarpX user reported,

I tried this PR but it doesn’t seem to fix the issue. The simulations [on Frontier] ran for 1:30 h but it stalled on the checkpoint at the zeroth timestep.

It may be necessary to manually add a ParallelDescriptor::Barrier() after each level's writes, immediately before this line:
https://github.com/BLAST-WarpX/warpx/blob/e7f688ebe9a1a111488309e9fee340887ce8ed50/Source/Diagnostics/FlushFormats/FlushFormatCheckpoint.cpp#L156

Without doing this at the analogous location in our code, I also see hangs on Frontier.
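
For reference, a minimal sketch of that workaround at the checkpoint-writer level (the loop structure is paraphrased, not copied from WarpX; only the Barrier() call is the suggested change):

```cpp
// Sketch only: the loop is paraphrased; the Barrier() is the suggested workaround.
#include <AMReX_ParallelDescriptor.H>

void WriteCheckpointLevels (int finest_level)
{
    for (int lev = 0; lev <= finest_level; ++lev)
    {
        // ... write this level's MultiFabs to the checkpoint directory ...

        // Workaround for Lustre hangs at scale: make every rank finish
        // writing this level before any rank starts on the next one.
        amrex::ParallelDescriptor::Barrier();
    }
}
```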

@BenWibking (Contributor Author)

Also, I set the striping on Frontier manually:

lfs setstripe -c 1 -S 16M $SLURM_SUBMIT_DIR

and also set the number of files to 1 per node:

warpx.field_io_nfiles = $NNODES
warpx.particle_io_nfiles = $NNODES

Sometimes, it still hangs in the particle writes every few checkpoints. I don't have a workaround for that.

@titoiride

The change now worked for me after applying

It may be necessary to manually add a ParallelDescriptor::Barrier() after each level's writes, immediately before this line: https://github.com/BLAST-WarpX/warpx/blob/e7f688ebe9a1a111488309e9fee340887ce8ed50/Source/Diagnostics/FlushFormats/FlushFormatCheckpoint.cpp#L156

Without doing this at the analogous location in our code, I also see hangs on Frontier.

and setting the striping correctly.
The simulation was a 5200-node run on Frontier and generated checkpoints in about 3 minutes. It didn't run for very long, so it did not run into the particle-write hangs.

@BenWibking (Contributor Author)

The change now worked for me after applying

It may be necessary to manually add a ParallelDescriptor::Barrier() after each level's writes, immediately before this line: https://github.com/BLAST-WarpX/warpx/blob/e7f688ebe9a1a111488309e9fee340887ce8ed50/Source/Diagnostics/FlushFormats/FlushFormatCheckpoint.cpp#L156
Without doing this at the analogous location in our code, I also see hangs on Frontier.

and setting the striping correctly. The simulation was a 5200-node run on Frontier and generated checkpoints in about 3 minutes. It didn't run for very long, so it did not run into the particle-write hangs.

That's great to hear. Maybe it now works without hanging after the most recent maintenance window.

What is the size of each of your checkpoints? I'm curious what effective write bandwidth you're seeing.

@titoiride

Uhm, I actually wonder if it was just a system fix. I tried the regular code too (the latest release, without these changes) with the same output, and checkpoints now work as long as I set the striping. The only caveat is that I did not go very far into the interaction, so I haven't really tested a very "mixed" situation (although the simulation is quite unbalanced at the beginning).
Each checkpoint is ~280 TB.
It may be worth trying again further into the interaction to test whether the output time increases significantly.
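(If each ~280 TB checkpoint was written in the roughly 3 minutes quoted above, that corresponds to an aggregate write bandwidth on the order of 1.5 TB/s.)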
