Fix Iteration::open(), needed for correct use of Span API #1794
Conversation
docs/source/details/mpi.rst (Outdated)

.. [4] We usually open iterations delayed on first access. This first access is usually the ``flush()`` call after a ``storeChunk``/``loadChunk`` operation. If the first access is non-collective, an explicit, collective ``Iteration::open()`` can be used to have the files already open.
   Alternatively, iterations might be accessed for the first time by immediate operations such as ``::availableChunks()``.
.. [5] The Span-based ``storeChunk`` API calls return a backend-allocated pointer, requiring flush operations. These API calls hence inherit the collective properties of flushing. Use store calls with zero-size extent if a rank does not contribute anything.
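To illustrate the two footnotes, here is a minimal sketch of the documented pattern in the openPMD-api C++ API: the iteration is opened collectively up front, after which the first data access (here a rank-local storeChunk plus flush) no longer needs to be collective. The file name, mesh names, and decomposition are made up for illustration, and an ADIOS2-capable parallel build is assumed.

```cpp
#include <mpi.h>
#include <openPMD/openPMD.hpp>

#include <cstdint>
#include <vector>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    using namespace openPMD;
    Series series("data_%T.bp", Access::CREATE, MPI_COMM_WORLD);

    Iteration it = series.iterations[0];
    // collective: every rank participates, so the backend files/engine
    // can be opened here instead of at the first flush()
    it.open();

    auto E_x = it.meshes["E"]["x"];
    std::uint64_t const localN = 10;
    std::uint64_t const globalN = static_cast<std::uint64_t>(size) * localN;
    E_x.resetDataset(Dataset(Datatype::DOUBLE, {globalN}));

    // the first data access may now be non-collective,
    // e.g. only even ranks contribute a chunk in this sketch
    if (rank % 2 == 0)
    {
        std::vector<double> chunk(localN, double(rank));
        E_x.storeChunk(
            chunk, {static_cast<std::uint64_t>(rank) * localN}, {localN});
        series.flush();
    }

    it.close(); // collective
    MPI_Finalize();
    return 0;
}
```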
cc @guj
Oh, but that has some severe implications for particle flushes and codes with dynamic load balancing, where one does have a non-equal number of calls to storeChunk.
- Fields: generally a non-equal number of calls per MPI rank; see "use span to allocate mesh data by default" (BLAST-WarpX/warpx#6123)
- Particles: some ranks write zero particles, some call multiple times; see "Optimizations for BP5 engine of ADIOS2: Use span-based API and avoid flushing before closing an Iteration" (BLAST-WarpX/warpx#3200)
This makes usage very complicated for codes like WarpX.
@franzpoeschel Maybe we can relax the constraint to keep storeChunk independent, and instead put more logic in resetDataset or so?
Alternative:
- Introduce RecordComponent::open() to do all necessary collective preparations
- Check that a SkeletonOnly flush does nothing collective
Discussed:
- The collective component only comes in for the first storeChunk (that means only particles would have a problem in most cases).
- There is some confusion in the table there between the MPI concepts: collective/independent (described) and synchronizing (not described); see "Doc: MPI Sync?" #1796.
Franz comment: the collective part might only be the ADIOS engine open -- we probably do not need to delay this at all and could do it collectively and blocking in the Series open...?
To double check whether Iteration::open() really opens the engine (x-ref).
Is collective access a requirement, or only a performance boost?
Okay, we can keep the current API:
- Iteration::open() must be used in parallel setups to ensure collective file access
- This was not working properly in random-access mode, now fixed
- The Span API can then be used non-collectively
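A minimal sketch of that agreed pattern, assuming a particle record component and a rank-local decomposition (offset/localN/globalN) computed elsewhere; species and record names are illustrative only. Iteration::open() is called collectively once, after which the span-based storeChunk can be issued independently per rank, or not at all on ranks without data.

```cpp
#include <mpi.h>
#include <openPMD/openPMD.hpp>

#include <cstddef>
#include <cstdint>

// hypothetical helper; names and decomposition are assumptions for illustration
void writePositions(
    openPMD::Series &series, std::uint64_t step, int rank,
    std::uint64_t offset, std::uint64_t localN, std::uint64_t globalN)
{
    using namespace openPMD;

    Iteration it = series.iterations[step];
    it.open(); // collective: every rank calls this once per iteration

    auto pos_x = it.particles["e"]["position"]["x"];
    pos_x.resetDataset(Dataset(Datatype::DOUBLE, {globalN}));

    // non-collective from here on: ranks holding no particles simply skip
    // the span-based store
    if (localN > 0)
    {
        auto view = pos_x.storeChunk<double>({offset}, {localN});
        auto buffer = view.currentBuffer(); // backend-allocated memory
        for (std::size_t i = 0; i < buffer.size(); ++i)
        {
            buffer.data()[i] = static_cast<double>(rank);
        }
    }

    it.close(); // collective
}
```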
Oh and:
- Due to the bug mentioned above, current releases require a workaround: each rank must contribute at least one storeChunk() operation, which may be empty.
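A sketch of that workaround for releases without this fix, following footnote [5] above; the helper name and fill value are made up. Ranks without data issue a zero-size storeChunk so that every rank still performs a first access.

```cpp
#include <openPMD/openPMD.hpp>

#include <cstddef>
#include <cstdint>

// workaround sketch: every rank calls storeChunk at least once; ranks without
// data use a zero-size extent so they still participate in the first access
void storeOrEmpty(
    openPMD::RecordComponent rc, std::uint64_t offset, std::uint64_t localN,
    double fillValue)
{
    if (localN > 0)
    {
        auto view = rc.storeChunk<double>({offset}, {localN});
        auto buffer = view.currentBuffer();
        for (std::size_t i = 0; i < buffer.size(); ++i)
        {
            buffer.data()[i] = fillValue;
        }
    }
    else
    {
        // empty contribution: zero-size chunk, nothing is written
        auto empty = rc.storeChunk<double>({0}, {0});
        static_cast<void>(empty);
    }
}
```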
This fix works for me.
> Oh and:
> - Due to the bug mentioned above, current releases require a workaround: each rank must contribute at least one storeChunk() operation, which may be empty.

So the Span API should require the next release version?
That's probably for the best, yeah. For older versions, you can alternatively include the workaround described above.
Force-pushed from dfc8cea to 58c08ea.
Seen on #1791. Fixes: Iteration::open() in random-access mode (it did not open files, leading to later hangups due to non-collective file opening).
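To show the scenario this fixes, here is a minimal read-side sketch, assuming a recent openPMD-api version that provides Access::READ_RANDOM_ACCESS and a file that contains an E/x mesh record: the collective Iteration::open() opens the files on all ranks, so a later rank-local loadChunk does not trigger a non-collective file open and hang.

```cpp
#include <mpi.h>
#include <openPMD/openPMD.hpp>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    using namespace openPMD;
    Series series("data_%T.bp", Access::READ_RANDOM_ACCESS, MPI_COMM_WORLD);

    // pick some iteration; with this fix, open() actually opens the files
    // on all ranks (collectively), so later access may be non-collective
    Iteration it = series.iterations.begin()->second;
    it.open();

    auto E_x = it.meshes["E"]["x"];
    if (rank == 0)
    {
        // rank-local read: without the fix, this first access would try to
        // open the files on rank 0 only while the other ranks wait elsewhere
        Extent extent = E_x.getExtent();
        Offset offset(extent.size(), 0);
        auto data = E_x.loadChunk<double>(offset, extent);
        series.flush();
        // data.get() now points at the full E/x dataset on rank 0
    }

    it.close(); // collective
    MPI_Finalize();
    return 0;
}
```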