
Optimizations for BP5 engine of ADIOS2: Use span-based API and avoid flushing before closing an Iteration #3200


Open
wants to merge 3 commits into base: development

Conversation


@franzpoeschel franzpoeschel commented Jun 24, 2022

Unlike previous engines in ADIOS2, BP5 can avoid copying data to internal buffers if one uses an optimized workflow. The following sequence of operations should be avoided:

engine.Put(dataPtr, …, adios2::Mode::Deferred);
// …
engine.PerformPuts(); // this provokes a copy to internal buffers
// …
engine.EndStep();     // without the PerformPuts(), BP5 could have used the user pointers directly here
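
For contrast, a minimal sketch of the workflow this PR steers towards (assuming the user-side buffers stay valid until the end of the step):

engine.Put(dataPtr, …, adios2::Mode::Deferred); // BP5 only records the pointer here
// … more deferred Puts, but no PerformPuts() …
engine.EndStep(); // data is consumed directly from the user buffers when the step closes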

This PR uses two optimizations to avoid that workflow:

  1. Use the Span API of openPMD-api/ADIOS2. Instead of allocating buffers inside WarpX, this asks the I/O backend to create a buffer that WarpX writes into directly. Data copies are avoided entirely because the data is created in the right place from the start (a sketch follows below this list).
    This optimization benefits all ADIOS2 engines that support the Span API (e.g. BP4); openPMD-api automatically falls back to the conventional path otherwise.
  2. In other places, the data either already exists in WarpX in the right format as static data, or needs to be copied into pinned memory first, which the Span API does not support. There, the "normal" storeChunk()/Put() API must be used, and the workflow shown above must be avoided. This is done by deleting instances of Series::flush() between RecordComponent::storeChunk() and Iteration::endStep().
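
A minimal sketch of the span-based path through openPMD-api (mesh_comp, chunk_offset, chunk_size and sourceData are placeholder names; the storeChunk<T>(offset, extent) overload returning a DynamicMemoryView is the span API available in recent openPMD-api releases):

// Ask the backend to allocate the buffer instead of providing one ourselves.
auto dynamicView = mesh_comp.storeChunk<amrex::Real>(chunk_offset, chunk_size);
auto span = dynamicView.currentBuffer(); // backend-owned memory
for (size_t i = 0; i < span.size(); ++i) {
    span[i] = sourceData[i]; // write directly into the backend buffer, no extra copy
}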

Notes:

  • To create a memory profile and check whether this PR makes a difference, heaptrack can be used: https://github.com/KDE/heaptrack
  • Optimization (2) is hard to maintain: nothing stops someone from adding a Series::flush() somewhere in these routines at a later point, thereby destroying the optimization again.
    Suggestion: add something like Series::promiseNoFlushesUntilEndstep() to openPMD-api that throws an error if a flush does happen anyway.
  • Another idea we had was to add a unique_ptr overload to RecordComponent::storeChunk(). That way the backend knows it is the sole owner of the data and can autonomously delay handing the data to ADIOS2 until directly before EndStep (a hypothetical call is sketched below). Semantically, this only applies to the pinned-memory case here in WarpX; the other places hold static data, not uniquely owned data.
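
A hypothetical sketch of such a storeChunk() call (this overload is only proposed in openPMD/openPMD-api#1294 and not part of the released API; buffer size and names are placeholders):

// Ownership moves into openPMD-api, which may postpone handing the
// buffer to ADIOS2 until directly before EndStep.
auto buffer = std::unique_ptr<amrex::Real[]>(new amrex::Real[local_box.numPts()]);
// … fill buffer, e.g. via an asynchronous device-to-host copy …
mesh_comp.storeChunk(std::move(buffer), chunk_offset, chunk_size);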

Question for Axel: does the following shared_ptr correctly handle freeing the pinned memory again? No custom deleter is given, so I'm wondering.

#ifdef AMREX_USE_GPU
                if (fab.arena()->isManaged() || fab.arena()->isDevice()) {
                    amrex::BaseFab<amrex::Real> foo(local_box, 1, amrex::The_Pinned_Arena());
                    std::shared_ptr<amrex::Real> data_pinned(foo.release());
                    amrex::Gpu::dtoh_memcpy_async(data_pinned.get(), fab.dataPtr(icomp), local_box.numPts()*sizeof(amrex::Real));
                    // intentionally delayed until before we .flush(): amrex::Gpu::streamSynchronize();
                    mesh_comp.storeChunk(data_pinned, chunk_offset, chunk_size);
                } else
#endif
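
For reference, if the pointer handed to the shared_ptr did not already carry an arena-aware deleter, one would have to attach it explicitly. A minimal sketch (assuming amrex::Arena::alloc()/free() are the right calls to obtain and return pinned memory; this is not a statement about what BaseFab::release() actually returns):

// Hypothetical: pinned-memory buffer whose shared_ptr explicitly returns the
// allocation to the pinned arena instead of calling delete on it.
auto *raw = static_cast<amrex::Real*>(
    amrex::The_Pinned_Arena()->alloc(local_box.numPts() * sizeof(amrex::Real)));
std::shared_ptr<amrex::Real> data_pinned(
    raw, [](amrex::Real *p) { amrex::The_Pinned_Arena()->free(p); });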

Closes #3133


franzpoeschel commented Jun 27, 2022

I've reverted the last commit (erasing instances of flush()) to check if that was what made the tests fail.
Also, I've begun working on the unique_ptr thing: openPMD/openPMD-api#1294

EDIT: Yep, looks like those flushes are (currently) needed. Removing them is what would let BP5 avoid those copies, but further restructuring might be needed on WarpX's side?

@ax3l ax3l self-assigned this Jun 27, 2022
@ax3l ax3l requested review from ax3l and guj June 27, 2022 15:00
@franzpoeschel

Also relevant for memory usage in BP5: specifying the right BufferChunkSize, see ComputationalRadiationPhysics/picongpu#4127

If you know that WarpX will run on systems with virtual memory, it's worth using 2 GB as the default. In the upcoming openPMD-api release, openPMD::json::merge() can be used to specify such defaults easily (a sketch follows below).
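
A minimal sketch of what that could look like (assuming openPMD-api's JSON configuration layout for ADIOS2 engine parameters, the json::merge() helper from the upcoming release exposed via openPMD/auxiliary/JSON.hpp, and placeholder names filename / userProvidedConfig; the byte value is a placeholder just below 2 GB):

#include <openPMD/auxiliary/JSON.hpp>
#include <openPMD/openPMD.hpp>

// WarpX-side defaults; a user-provided configuration takes precedence in the merge.
std::string const defaults =
    R"({"adios2": {"engine": {"parameters": {"BufferChunkSize": "2147381248"}}}})";
std::string const options = openPMD::json::merge(defaults, userProvidedConfig);
openPMD::Series series(filename, openPMD::Access::CREATE, options);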

@ax3l ax3l mentioned this pull request Jan 25, 2023