
Conversation

@vitmog commented Nov 26, 2025

Description

Based on the problem raised by @pshriwise in #3492, and following the technical description given there, this small PR implements an approach that simply counts the remaining particles of each split event in the bank instead of copying them as is currently done.

To store the counter value, a new field named n_repl has been added to the SourceSite struct and reflected in the related MPI struct.

This approach avoids copying and storing the particle data n_split-2 extra times after each splitting event (when n_split > 1), which reduces both the computational cost (negligibly, but it is guaranteed not to increase) and the memory used (by as much as the splitting/roulette technique is exercised). The latter improves the stability of the whole computational scheme in extreme cases.

Fixes #3492 (Reduce split particle memory in secondary banks)
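
For illustration, a rough sketch of the idea (only the n_repl field name comes from this PR; the other members, the helper, and the exact convention for the stored count are placeholders, not the actual diff):

```cpp
#include <vector>

// Sketch only: a site produced by a split event carries a replication
// counter instead of being copied into the bank multiple times.
struct SourceSite {
  // ... position, direction, energy, weight, IDs, ...
  int n_repl; // remaining identical particles represented by this entry
};

// Banking a split event: one entry with a count instead of n_split - 1
// identical copies (the original particle continues as one of the branches).
inline void bank_split(std::vector<SourceSite>& bank, SourceSite site, int n_split)
{
  site.n_repl = n_split - 1; // one plausible convention for the stored count
  bank.push_back(site);
}

// When the bank is replayed, each entry is re-run while decrementing n_repl,
// so the same physics is reproduced without the duplicated storage. The new
// field also has to be added to the MPI datatype used to exchange
// SourceSite objects between ranks.
```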

Checklist

  • I have performed a self-review of my own code
  • I have run clang-format (version 15) on any C++ source files (if applicable)
  • I have followed the style guidelines for Python source files (if applicable)
  • I have made corresponding changes to the documentation (if applicable)
  • I have added tests that prove my fix is effective or that my feature works (if applicable)

@vitmog vitmog marked this pull request as ready for review November 26, 2025 14:11
@vitmog (Author) commented Nov 26, 2025

Perhaps it would be natural to request your review, @pshriwise, but I actually do not know what the proper procedure is.

@jtramm (Contributor) commented Nov 26, 2025

Currently the lack of shared secondaries is a real killer in terms of performance. The excess memory usage is an issue but, in my experience, is usually not a deal breaker compared to the lack of ability to share secondaries between threads. In my experimenting with a shared secondary bank, it can often speed up weight window calculations by 100x or more.

Changing the banking routines to compress duplicated sites like in this PR solves the (seemingly lesser) memory problem but makes the sharing of secondaries a lot more complicated I think. Currently it will be pretty easy to sort secondaries by progeny and parent ID and then load balance across MPI ranks, but with this PR that logic gets much more intricate and tricky. Personally I don't think it's worth the added complexity but maybe there are use cases people are hitting where the secondary bank memory really is a killer issue?

@pshriwise (Contributor) commented

Thanks for implementing this @vitmog! It's appreciated when people find an open issue to tackle and go for it.

@jtramm, good point about the shared secondary bank! I may have created issue #3492 before our discussions on that. It definitely would take some extra consideration in the context of a shared secondary bank. I can see some potential advantages in sorting, but that doesn't seem to be a bottleneck. I can certainly see how load balancing would become more complicated!

At the end of the day, I don't think it's worth complicating the initial shared secondary bank implementation without more motivation for memory/communication savings -- agreed.

Thanks again @vitmog! This is nice to have in mind if we need to return to this issue. Let's wait for the shared secondary implementation to come in and we'll revisit. If, after some further stress testing, we find better motivation for supporting this layout perhaps we can look into it more deeply.

@vitmog (Author) commented Nov 26, 2025

@jtramm thank you for the comment! The copying is not a killer of anything at the moment, but it can become one asymptotically, although that is unlikely. The intended point of this PR is just a cost-free improvement, which now seems not so cost-free because it conflicts with your idea, with which I will allow myself to argue. Furthermore, you mention such an interesting and timely issue that I would like to give an extended commentary below.

In my experimenting with a shared secondary bank, it can often speed up weight window calculations by 100x or more.

I can imagine such a situation only in the case when the per-sample computational time fluctuates strongly from one thread to another. That can happen in the following two cases:

  1. The applied splitting scheme is unsuccessful because high-rate splitting events are too rare;
  2. The chosen sample size is too small for the splitting scheme in use.

For example, suppose there is a single possible split event per history with a probability of 1E-6 [splits/history], a rate of 1E9 [branches/split], and a sample size of 1E4 [histories/sample]. In this case, with the average split rate of 1E-6 * 1E4 = 1E-2 [splits/sample], the rare split produces a run that is perhaps much longer (in proportion to the cost of 1E9 branches versus 1E4 histories) than a run of non-split histories. I suppose it is this rare long run that you would like to distribute to the idling threads, no?

The example problem would be solved by an improved splitting scheme using multi-stage splits, e.g., a 3-level scheme with a probability of 1E-2 and a rate of 1E3 branches/split at each level instead of the single-level one. In my experience, a successful splitting scheme always produces quite stable batch computational times, even for high-rate splitting such as in high-magnitude attenuation problems (1E-28 in this case). Also, with a sample size above 1E7 histories, this scheme becomes almost stable (but still far from optimal).
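
To make the fluctuation argument concrete, here is a small illustrative calculation (my own sketch, not OpenMC code): for a split that is triggered with probability p and produces m branches, the per-history branch count has mean p*m and relative standard deviation sqrt((1-p)/p), so making the triggering event two orders of magnitude more frequent reduces the spread of the triggering stage by two orders of magnitude.

```cpp
#include <cmath>
#include <cstdio>

// Per-history workload W = m branches with probability p, else 0:
//   E[W] = p*m,  Var[W] = p*(1-p)*m^2,  rel. std = sqrt((1-p)/p).
void report(const char* label, double p, double m)
{
  double mean = p * m;
  double rel_std = std::sqrt((1.0 - p) / p);
  std::printf("%s: mean = %.3g branches/history, rel. std = %.3g\n",
              label, mean, rel_std);
}

int main()
{
  report("single-level (p = 1e-6, m = 1e9)", 1e-6, 1e9);        // rel. std ~ 1000
  report("one stage of 3-level (p = 1e-2, m = 1e3)", 1e-2, 1e3); // rel. std ~ 10
  return 0;
}
```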

Additionally, strictly speaking, splitting a particle history between threads breaks the policy of separating the PRNG into history-wise sub-sequences, i.e., each history must be sampled sequentially, not in parallel. Potentially, the mixing can produce diverse and complicated auto-correlation effects that bias estimators, similar to those seen when a run exceeds the period of the PRNG.
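
To illustrate what I mean by history-wise sub-sequences, here is a generic sketch (not OpenMC's actual PRNG; the constants and the stride are example values): each history owns a fixed, deterministic segment of the master stream, which only stays valid if the whole history, including its splits, is sampled from that one segment.

```cpp
#include <cstdint>

// Example stride: the number of random samples reserved for each history.
constexpr uint64_t kStride = 152917;

// Minimal 64-bit LCG (Knuth's MMIX constants), for illustration only.
struct Lcg {
  uint64_t state;
  double next()
  {
    state = state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (state >> 11) * (1.0 / 9007199254740992.0); // 53-bit uniform in [0,1)
  }
};

// History i draws only from samples [i*kStride, (i+1)*kStride) of the master
// stream. A real code would use a logarithmic-time skip-ahead, not this loop.
Lcg stream_for_history(uint64_t master_seed, uint64_t history_index)
{
  Lcg rng {master_seed};
  for (uint64_t i = 0; i < history_index * kStride; ++i)
    rng.next();
  return rng;
}
```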

In conclusion, the improvement can be obtained with advanced splitting and other variance reduction techniques rather than with in-history distributed computation.

I would appreciate any objections and corrections, if I am wrong.

@vitmog (Author) commented Nov 26, 2025

Thank you for the response @pshriwise! I will change this PR's status to draft then; in any case, the tests are not passing. Please let me know if I am doing something wrong, because I am not yet familiar with GitHub conventions.

@vitmog vitmog marked this pull request as draft November 26, 2025 20:22
@jtramm (Contributor) commented Nov 26, 2025

Yes, thanks @vitmog for proposing this optimization! This is a nice implementation for sure -- I would be on board for including it if there were a demonstrated use case where the memory savings became important, but as there is potential for this complicating other needed features, I would vote to keep this as draft and it can always be revisited later on.

As for the comments on a shared secondary bank -- those are all solid points!

On the topic of rare high-magnitude splitting events, can you explain more on how breaking the splitting up into several stages helps with load imbalance? For instance, if an extremely rare (but highly important) particle is sampled that is traveling down a long & narrow void streaming channel, if we split it multiple times over the length of the channel or just once at the beginning or end, in any event, you end up with the same total number of particles at the end stemming from that history. Perhaps I am not understanding multi-stage splitting though (please fill me in if there are more details/examples there!).

It's true also that you can fix this by running with more particles such that each thread will sample about the same number of these rare particles (source biasing may also help this), but ultimately if it's a truly rare particle pathway (e.g., trying to resolve a narrow beam port through a thick bioshield), then you may need to sample billions of particles or more per thread in the entire simulation, spanning many nodes or tens of thousands of total threads. It would deliver major computation savings to just be able to sample enough of these rare events per simulation, so as to ensure variance is delivered, rather than having to over-converge things just to provide load balancing.

Additionally, strictly speaking, splitting a particle history between threads breaks the policy of separating the PRNG into history-wise sub-sequences, i.e., each history must be sampled sequentially, not in parallel. Potentially, the mixing can produce diverse and complicated auto-correlation effects that bias estimators, similar to those seen when a run exceeds the period of the PRNG.

Absolutely, this is a fantastic point! If naively sharing secondaries between threads (and thus randomly changing which PRNG stream is handling which secondary based on what happens to occur in what order on a parallel system), then you get a total lack of reproducibility, and as you said, can also run into more subtle numerical issues with not respecting each particle's location within the global PRNG stream.

However, there is actually a way to do a shared secondary bank that is reproducible (regardless of if running in serial or parallel) and that respects a deterministic global PRNG stream. Basically, you just do the same thing as in the shared fission bank algorithm that is already in OpenMC detailed here:

F. B. Brown and T. M. Sutton. “Reproducibility and Monte Carlo Eigenvalue Calculations.” Transactions of the American Nuclear Society, volume 65, p. 235 (1990).

and break batches up into a series of secondary generations. This shared secondary approach is already in use by other MC codes like Celeritas.
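
As a rough sketch of that idea (placeholders only, not the actual OpenMC or Celeritas implementation): each thread transports its share of the current generation, banks any secondaries into a shared container, and the bank is then sorted by a deterministic key so the next generation is processed in the same order regardless of threading.

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

struct Site {
  int64_t parent_id;  // history that produced this secondary
  int64_t progeny_id; // per-parent counter assigned at creation time
  // ... position, direction, energy, weight ...
};

// Stub standing in for the physics: transport one site and append any
// secondaries it creates to `out`.
void transport_history(const Site& /*s*/, std::vector<Site>& /*out*/) {}

void run_batch(std::vector<Site> generation)
{
  while (!generation.empty()) {
    std::vector<Site> next; // shared bank for the next secondary generation
#pragma omp parallel
    {
      std::vector<Site> local; // thread-local buffer
#pragma omp for schedule(dynamic)
      for (int64_t i = 0; i < static_cast<int64_t>(generation.size()); ++i)
        transport_history(generation[i], local);
#pragma omp critical
      next.insert(next.end(), local.begin(), local.end());
    }
    // Deterministic ordering: the result no longer depends on which thread
    // happened to bank which secondary first.
    std::sort(next.begin(), next.end(), [](const Site& a, const Site& b) {
      return std::tie(a.parent_id, a.progeny_id) <
             std::tie(b.parent_id, b.progeny_id);
    });
    generation = std::move(next);
  }
}
```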

@vitmog (Author) commented Nov 27, 2025

Thanks @jtramm for the appreciation! I would be glad if this PR is used in the future, but I hope it will be useful at least through this discussion anyway.

Thank you also for your interest in my propositions! It is extremely important to me given a permanent lack of professional communication. Below, I will try to clarify the points mentioned above.

On the topic of rare high-magnitude splitting events, can you explain more on how breaking the splitting up into several stages helps with load imbalance? For instance, if an extremely rare (but highly important) particle is sampled that is traveling down a long & narrow void streaming channel, if we split it multiple times over the length of the channel or just once at the beginning or end, in any event, you end up with the same total number of particles at the end stemming from that history. Perhaps I am not understanding multi-stage splitting though (please fill me in if there are more details/examples there!).

As a rule of thumb, in your example the crucial point is to split before (!) a particle enters the void streaming channel, once entering has become likely, so that the rare entering event is no longer so rare. This does not preclude the continued splitting inside the channel that you mention.

And yes, the idea of increasing the number of split stages that I mentioned includes, among others, splitting at any event. Moreover, we always have at least two technically separate stages of particle sampling, the direction and the free path, and we can split before each of them; that gives two splitting stages at each collision instead of, say, splitting only after crossing a cell from some set. In principle, breaking the Markov chain into stages is unlimited, at least via the introduction of auxiliary variables (artificial abstract stages); it is an entirely creative matter. This is an important parameter of a splitting scheme too.
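
Schematically, with hypothetical helper names (stage_importance, split_or_roulette, and the sample_* routines are placeholders, not OpenMC functions), two splitting stages per collision could look like this:

```cpp
struct Particle { /* position, direction, energy, weight, ... */ };

// Placeholders standing in for whatever splitting scheme is in use.
double stage_importance(const Particle&, int /*stage*/) { return 1.0; }
void split_or_roulette(Particle&, double /*ratio*/) {}
void sample_direction(Particle&) {}
void sample_free_path(Particle&) {}

void process_collision(Particle& p)
{
  // Stage 1: split/roulette before sampling the outgoing direction, so a
  // history that is likely to scatter toward an important region (e.g. the
  // entrance of a streaming channel) is multiplied before the rare event.
  split_or_roulette(p, stage_importance(p, 1));
  sample_direction(p);

  // Stage 2: split/roulette again before sampling the free path, now that
  // the direction is known, instead of splitting only after a surface or
  // cell crossing.
  split_or_roulette(p, stage_importance(p, 2));
  sample_free_path(p);
}
```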

It seems that a splitting technique satisfying the requirements above cannot be built with CADIS, because the flux ratios before the entrance are not related to the problem of simulating passage through the channel. And yes, it likely requires defining the splitting rate function in the full 6-D phase space including direction, especially for gammas, rather than in the reduced location-energy phase space. Again, that cannot be done with CADIS and is problematic with ordinary FW-CADIS, for the same reasons one uses Monte Carlo codes rather than deterministic solvers.

For these reasons, I have preferred to work on the development of automated adaptive single-stage pure Monte Carlo schemes. Testing and investigation of a simplified splitting scheme designed and implemented in a maintained code showed that variance reduction efficiency is closely coupled with the stability of the batch computational time. That is, if the batch time fluctuates strongly, then we likely also have a low-quality variance reduction scheme; conversely, the efficiency of a scheme implies its stability. Likely, this can be substantiated theoretically.

It's true also that you can fix this by running with more particles such that each thread will sample about the same number of these rare particles (source biasing may also help this), but ultimately if it's a truly rare particle pathway (e.g., trying to resolve a narrow beam port through a thick bioshield), then you may need to sample billions of particles or more per thread in the entire simulation, spanning many nodes or tens of thousands of total threads. It would deliver major computation savings to just be able to sample enough of these rare events per simulation, so as to ensure variance is delivered, rather than having to over-converge things just to provide load balancing.

I agree with everything you wrote, and I would like to clarify that I do not expect increasing the sample size to balance the threads, because that requires asymptotically infinite sizes for difficult problems. I mentioned it only for completeness, as a theoretically available option. My way is variance reduction.

On the PRNG topic, first of all thank you for the explanation and the references to the paper and the code! You are absolutely right about the equivalence of sharing split points and secondary particles. Yesterday evening I wrote down my first naive doubt, because I am not well familiar with parallel computation issues and their solutions. I am happy to see that the problem has already been solved; I will follow the references. Thanks!

Conclusion

As a result of rethinking, I realize that since it is never possible to use an ideal variance reduction scheme, a shared splitting bank will always remain (more or less) relevant for parallel computations. I am sure this relevance can be reduced in many cases. I will keep in mind the existence of shared splitting and secondary-particle banks if I try to implement some splitting technique in OpenMC.

I also now understand that the stability of a computational scheme's execution time is an important characteristic related to performance in parallel computations, and that the gains of a successful scheme are not limited to its theoretical variance reduction and implementation characteristics. That is very nice, because I had not even thought about it before!

As for this small PR, I agree that it may be deferred because the current gains are not significant. Please feel free to contact me about this or other issues!

Thank you for your time!

UPD 02.12.25
Balancing via shared secondaries (splits) rather than simply shared histories can become necessary when a large number of threads is combined with a computational time of only a few histories' length. In difficult problems, optimally branched histories may have unbounded duration while still remaining stable; thus only a small number of histories may fit into, say, several computational minutes or even seconds. Such conditions can be especially relevant for iterative series of calculations in optimization problems or in coupled time-dependent neutron-thermohydraulics calculations, I guess. I will keep the shared secondaries approach in mind in case of further work on OpenMC.
