Split storage optimization #3656
Conversation
Perhaps it would be natural to request your review, @pshriwise, but I actually do not know what to do.
Currently the lack of shared secondaries is a real killer in terms of performance. The excess memory usage is an issue but, in my experience, is usually not a deal breaker compared to the lack of ability to share secondaries between threads. In my experimenting with a shared secondary bank, it can often speed up weight window calculations by 100x or more. Changing the banking routines to compress duplicated sites like in this PR solves the (seemingly lesser) memory problem but makes the sharing of secondaries a lot more complicated, I think. Currently it will be pretty easy to sort secondaries by progeny and parent ID and then load balance across MPI ranks, but with this PR that logic gets much more intricate and tricky. Personally I don't think it's worth the added complexity, but maybe there are use cases people are hitting where the secondary bank memory really is a killer issue?
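To make that concrete, here is a rough sketch of the kind of bookkeeping I have in mind (the field and function names are illustrative assumptions, not OpenMC's actual implementation): with one flat site per secondary, making a shared bank deterministic is essentially a sort plus an even partition across ranks.

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Illustrative sketch only: a flat shared bank of secondaries can be ordered
// deterministically by (parent_id, progeny_id) and then handed out to MPI
// ranks as contiguous slices. All names here are assumptions.
struct BankedSecondary {
  int64_t parent_id;   // id of the history that created this secondary
  int64_t progeny_id;  // birth order within that history
  // ... remaining particle state ...
};

void sort_for_load_balance(std::vector<BankedSecondary>& bank)
{
  std::sort(bank.begin(), bank.end(),
            [](const BankedSecondary& a, const BankedSecondary& b) {
              return std::tie(a.parent_id, a.progeny_id) <
                     std::tie(b.parent_id, b.progeny_id);
            });
  // Rank r of n then takes the contiguous slice
  // [r * bank.size() / n, (r + 1) * bank.size() / n).
}
```

With the counter-compressed layout proposed here, a single banked entry stands for a variable number of particles, so simple slicing no longer maps one-to-one onto particle counts, which is what makes the load-balancing logic more intricate.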
Thanks for implementing this @vitmog! It's appreciated when people find an open issue to tackle and go for it. @jtramm, good point about the shared secondary bank! I may have created issue #3492 before our discussions on that. It would definitely take some extra consideration in the context of a shared secondary bank. I can see some potential advantages in sorting, but that doesn't seem to be a bottleneck. I can certainly see how load balancing would become more complicated! At the end of the day, I don't think it's worth complicating the initial shared secondary bank implementation without more motivation for memory/communication savings -- agreed. Thanks again @vitmog! This is nice to have in mind if we need to return to this issue. Let's wait for the shared secondary implementation to come in and we'll revisit. If, after some further stress testing, we find better motivation for supporting this layout, perhaps we can look into it more deeply.
@jtramm thank you for the comment! The copying is not a killer at the moment, but it could become one asymptotically, although that is unlikely. The intended point of this PR was a zero-cost improvement, which now seems not so zero-cost because it conflicts with your idea, with which I will allow myself to argue. Furthermore, you have mentioned such an interesting and relevant issue that I would like to give an extended commentary below.
I can imagine such a situation only in the case where the per-history computational time fluctuates strongly from one thread to another. That can happen in the following two cases:
For example, suppose we have a single possible history split event with some given probability. The example problem would be solved using an improved splitting scheme via multi-stage splits, i.e., for example, a 3-level scheme with the corresponding per-stage probability. Additionally, strictly speaking, splitting a particle history between threads breaks the policy of separating PRNG sub-sequences history-wise, i.e., each history must be sampled sequentially, not in parallel. Potentially, the mixing can produce diverse and complicated auto-correlation effects on the biased estimators, similar to those that arise from running over the period of a PRNG. In conclusion, the improvement can be obtained using advanced splitting and other variance reduction techniques, but not with in-history distributed computations. I would appreciate any objections and corrections if I am wrong.
Thank you for the response @pshriwise! I will change this PR's status to draft then; in any case, the tests are not passing. Please let me know if I am doing something wrong, because I am not yet familiar with GitHub conventions.
Yes, thanks @vitmog for proposing this optimization! This is a nice implementation for sure -- I would be on board with including it if there were a demonstrated use case where the memory savings became important, but since it has the potential to complicate other needed features, I would vote to keep this as a draft; it can always be revisited later on. As for the comments on a shared secondary bank -- those are all solid points!

On the topic of rare high-magnitude splitting events, can you explain more about how breaking the splitting up into several stages helps with load imbalance? For instance, if an extremely rare (but highly important) particle is sampled that is traveling down a long & narrow void streaming channel, then whether we split it multiple times over the length of the channel or just once at the beginning or end, you end up with the same total number of particles stemming from that history. Perhaps I am not understanding multi-stage splitting though (please fill me in if there are more details/examples there!). It's true also that you can fix this by running with more particles such that each thread samples about the same number of these rare particles (source biasing may also help), but ultimately, if it's a truly rare particle pathway (e.g., trying to resolve a narrow beam port through a thick bioshield), then you may need to sample billions of particles or more per thread in the entire simulation, spanning many nodes or tens of thousands of total threads. It would deliver major computational savings to be able to sample just enough of these rare events per simulation to deliver the required variance, rather than having to over-converge things just to provide load balancing.

Absolutely, this is a fantastic point! If you naively share secondaries between threads (and thus randomly change which PRNG stream handles which secondary based on what happens to occur in what order on a parallel system), then you get a total lack of reproducibility and, as you said, can also run into more subtle numerical issues from not respecting each particle's location within the global PRNG stream. However, there is actually a way to do a shared secondary bank that is reproducible (regardless of whether you run in serial or parallel) and that respects a deterministic global PRNG stream. Basically, you do the same thing as in the shared fission bank algorithm already in OpenMC, detailed in F. B. Brown and T. M. Sutton, "Reproducibility and Monte Carlo Eigenvalue Calculations," Transactions of the American Nuclear Society, vol. 65, p. 235 (1990), and break batches up into a series of secondary generations. This shared secondary approach is already in use by other MC codes like Celeritas.
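As a rough sketch of what that generation-based structure looks like (illustrative names only; this is not Celeritas or OpenMC code), a batch just becomes a loop over secondary generations, with the shared bank put into a deterministic order between generations:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Illustrative sketch: a batch is processed as a sequence of secondary
// generations. All threads transport the current generation and append any
// secondaries they create to a shared bank; that bank, sorted by a
// deterministic key, becomes the next generation, so the outcome does not
// depend on thread timing.
struct Particle {
  long history_id;  // deterministic key inherited from the parent history
  // ... remaining state ...
};

// Stand-in for the real transport routine, which would do the physics and
// push newly created secondaries into `bank`.
void transport(Particle& /*p*/, std::vector<Particle>& /*bank*/) {}

void run_batch(std::vector<Particle> generation)
{
  while (!generation.empty()) {
    std::vector<Particle> shared_secondaries;
    // In the real thing this loop is parallel, with per-thread buffers
    // merged into shared_secondaries afterwards.
    for (auto& p : generation) {
      transport(p, shared_secondaries);
    }
    std::sort(shared_secondaries.begin(), shared_secondaries.end(),
              [](const Particle& a, const Particle& b) {
                return a.history_id < b.history_id;
              });
    generation = std::move(shared_secondaries);
  }
}
```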
Thanks @jtramm for the appreciation! I would be glad if this PR were used in the future, but I hope it will at least be useful through this discussion anyway. Thank you also for the interest in my propositions! This is extremely important to me, given a permanent lack of professional communication. I will try to clarify below what I mentioned above.
As a rule of thumb, in your example the crucial point is to split before (!) the particle enters the void streaming channel, at the point where entering becomes likely, so that the rare entering event is made not so rare. This does not rule out the continuation of splitting inside the channel that you mention. And yes, the idea of increasing the number of split stages that I mentioned includes, among others, the scheme of splitting at every event. Moreover, we always have at least two technically separate stages of particle sampling, the direction and the free path, and we can split before each of them; we then have two splitting stages at each collision instead of, say, a single split after crossing a cell from some set. In principle, breaking the Markov chain into stages is unlimited, at least through the introduction of auxiliary variables (artificial abstract stages); it is an entirely creative matter. This is an important parameter of a splitting scheme too. It seems that a splitting technique satisfying the requirements above cannot be built with CADIS, because the flux ratios before the entrance are not related to the problem of simulating the passage through the channel. And yes, it likely requires defining the splitting rate function in the full 6-D phase space including direction, especially for gammas, rather than in the reduced location-energy phase space. Again, that cannot be done with CADIS, and it is problematic with ordinary FW-CADIS for the same reasons that one uses Monte Carlo codes rather than deterministic solvers. For these reasons, I have preferred to work on the development of automated adaptive single-stage pure Monte Carlo schemes. Testing and investigation of a simplified splitting scheme designed and implemented in a maintained code showed that variance reduction efficiency is in fact coupled with the stability of the batch computational time. That is, if we have a highly fluctuating batch time, then we likely also have a low-quality variance reduction scheme; conversely, the efficiency of a scheme implies its stability. Most likely, this can be substantiated theoretically.
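To illustrate the two per-collision split stages I mean, here is a pseudocode sketch (every routine is a hypothetical stand-in, not an existing implementation; only the structure is the point):

```cpp
// Hypothetical sketch of two per-collision splitting stages: one
// split/roulette opportunity before sampling the outgoing direction and
// another before sampling the free path.
struct Particle { /* phase-space state: position, direction, energy, weight */ };

double importance_before_direction(const Particle&) { return 1.0; }  // stand-in
double importance_before_free_path(const Particle&) { return 1.0; }  // stand-in
void split_or_roulette(Particle&, double /*importance_ratio*/) {}    // stand-in
void sample_outgoing_direction(Particle&) {}                         // stand-in
void sample_free_path_and_move(Particle&) {}                         // stand-in

void collision_with_two_split_stages(Particle& p)
{
  split_or_roulette(p, importance_before_direction(p));  // stage 1
  sample_outgoing_direction(p);

  split_or_roulette(p, importance_before_free_path(p));  // stage 2
  sample_free_path_and_move(p);
}
```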
I agree with everything you wrote, and I would like to clarify that I have no expectations from increasing sample sizes to balance threads, because that requires asymptotically infinite sizes for difficult problems; I mentioned it only for completeness as a theoretically available option. My way is variance reduction. On the topic of the PRNG, first of all I would like to thank you for the explanation and the references to the paper and the code! You are absolutely right about the equivalence of sharing split points and secondary particles. Yesterday evening I wrote down the first doubt that came to a naive mind, because I am not well familiar with parallel computation issues and solutions. I am happy to see that the problem has already been solved before; I will follow the references. Thanks!

Conclusion
As a result of rethinking this, I realized that while it is never possible to use an ideal variance reduction scheme, a shared splitting bank will always remain (more or less) relevant for parallel computations. I am sure it is possible to reduce this relevance in many cases. I will keep in mind the existence of shared splitting and secondary particle banks if I try to implement some splitting technique in OpenMC. I also now see that the stability of a computational scheme's execution time is an important characteristic for performance in parallel computations, and that the gains of a successful scheme are not limited to its theoretical variance reduction and implementation characteristics. That is very nice, because I had not even thought about it before! As for this small PR, I agree that it may be deferred because the current gains are not significant. Please feel free to contact me about this or other issues! Thank you for your time!

UPD 02.12.25
Description
Based on the problem raised by @pshriwise in #3492, and following the description of the accompanying technical aspects given there, this small PR implements an approach that stores a simple count of the remaining particles for each split event in the bank, instead of the current copying.
To store the counter value, a new field named `n_repl` has been added to the `SourceSite` struct; it is also reflected in the related MPI struct. This approach prevents the excess copying and storage of particle data `n_split - 2` times after each splitting event (when `n_split > 1`), which reduces both the computational cost (negligibly, but it is guaranteed not to increase) and the memory used (in proportion to how heavily the splitting/roulette technique is applied). The latter improves the stability of the entire computational scheme in extreme cases.

Fixes #3492
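For illustration, here is a minimal sketch of the banking idea under these assumptions (only the `SourceSite`/`n_repl`/`n_split` names come from this PR; the surrounding helper code is hypothetical and much simplified compared with the actual OpenMC routines):

```cpp
#include <cstdint>
#include <vector>

// Simplified sketch of the counter-based layout: a split event banks a single
// site carrying an n_repl counter instead of (n_split - 1) duplicated sites.
struct SourceSite {
  double xyz[3];
  double uvw[3];
  double E;
  double wgt;
  int64_t n_repl {1};  // how many identical particles this entry represents
};

void bank_split(std::vector<SourceSite>& bank, SourceSite site, int n_split)
{
  site.n_repl = n_split - 1;  // the particle being transported keeps running
  bank.push_back(site);
}

// Retrieval hands out one particle at a time and decrements the counter, so
// memory stays proportional to the number of split events rather than the
// number of copies.
bool pop_site(std::vector<SourceSite>& bank, SourceSite& out)
{
  if (bank.empty()) return false;
  out = bank.back();
  out.n_repl = 1;  // the popped particle is a single history
  if (--bank.back().n_repl == 0) bank.pop_back();
  return true;
}
```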
Checklist