
Conversation

@RGates94
Collaborator

This branch adds OpenACC directives, intended to increase the performance of DCON and related workflows by utilizing GPU clusters.

More details will be added as work progresses.

@logan-nc logan-nc self-assigned this Feb 11, 2020
@stephethier
Collaborator

stephethier commented Feb 11, 2020

@RGates94 It's better to disable the OpenMP directives with an #ifdef statement instead of removing them completely. How about something like:

  #ifdef GPU
  !$acc ...
  #else 
  !$omp ...
  #endif  

Actually, we could also use #ifdef _OPENACC. It gets defined automatically when compiling with the OpenACC compiler option for both PGI and gfortran.
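For reference, a minimal sketch of that variant (the specific loop directives here are illustrative, not taken from the actual code):

      #ifdef _OPENACC
      !$acc parallel loop
      #else
      !$omp parallel do
      #endif
      do i=1,n
         ...
      enddo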

@logan-nc
Contributor

Hi @RGates94 and @stephethier. Would you be ready to present any progress at the DCON users roundtable in 2 weeks? Or maybe the one after that? Just let me know whenever you feel you have something to show.

In the meantime, I'll note this replaces #97 (which, in turn, was meant to be an extension of #82). @RGates94 can find nice summary flowcharts of the codes in those PR discussions.

@logan-nc
Contributor

I'll also record @parkjk's recommendations for getting a fast kinetic run to use for debugging here so everyone can just refer to it whenever we have another PR or issue.

In equil.in: equil_control,

psilow=0.3 (rather than 0.01)
psihigh=0.5 (rather than 0.992)
mpsi=64 (instead of 128. This controls the number of grid points in addition to the domain size; should be OK.)

And make sure it actually stops and ends at those psi limits by setting, in dcon.in: dcon_control,

sas_flag=f (instead of t)

Then kinetic DCON ran within 18 seconds (with 9 OMP threads) using both the GPEC-1.3 release version and a develop branch in my area: see /p/gpec/users/jpark/runs/gpec_DIIID_kinetic_example.

The poloidal harmonic numbers can also be reduced in dcon.in: dcon_control

delta_mlow=2 (instead of 8)
delta_mhigh=4 (instead of 8)

and the loop over bounce harmonic numbers in pentrc.in: pent_input

nl=1

You may also want to try increasing the tolerance of the kinetic (or lambda) integration in pentrc.in: pent_control

atol_xlmda = 1e-6 (rather than 1e-9)
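Collecting the changes above in one place, as namelist fragments (assuming standard Fortran namelist syntax; all other inputs unchanged):

      ! equil.in
      &equil_control
        psilow=0.3
        psihigh=0.5
        mpsi=64
      /

      ! dcon.in
      &dcon_control
        sas_flag=f
        delta_mlow=2
        delta_mhigh=4
      /

      ! pentrc.in
      &pent_input
        nl=1
      /
      &pent_control
        atol_xlmda=1e-6
      /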

@stephethier
Collaborator

@logan-nc Thanks Nik. 18 seconds is good enough. It becomes difficult to measure speedups when the run is too short (< 1 sec).

@RGates94
Collaborator Author

@stephethier I will definitely change the way I am disabling the OMP directives to what you are suggesting.

@logan-nc I'm not sure whether I will have something to present for the next roundtable; however, I am very confident that something will be ready for the one after that. Most of the work so far has been in preparing the codebase to use OpenACC directives. While the code is now compiling with OpenACC enabled, tests so far don't indicate any speedups compared to runs with the same compiler (although gfortran seems to run around 14% faster than Intel overall).

The largest current roadblock is compiler-specific bugs in pgfortran, as the Nvidia profiler only works with executables produced by this compiler.

@stephethier
Collaborator

@logan-nc (and @RGates94 ) Nik, the "tpsi" function in torque.F90 has a lot of allocation and deallocation:

        allocate(dbfun(0:mthsurf),dxfun(0:mthsurf))
        call spline_alloc(tspl,mthsurf,5)
        ...
        deallocate(dbfun,dxfun)
        ...

This is not good for GPU. We can either allocate everything in advance, or bring the nested loops inside of "tpsi". There are some performance implications for both but I'm trying to figure out which one would be the easiest to implement.

@logan-nc
Contributor

Thanks for the update @RGates94. Just keep recording progress here and we can decide when there is sufficient material to warrant a devoted agenda item. In the meantime, feel free to join the meetings and raise any minor questions or observations as you go.

For the allocations in tpsi: this is simply my pythonic programming naivety at work. The allocations you call out are always the same size (mthsurf) and are not large (no real memory reason to deallocate). Please move them outside the function as torque-module-wide variables.

Note, mthsurf is set by the inputs module when it uses the dcon_interface to read input files, OR by dcon when it calls set_eq to set things in memory without going through the file interface. You might consider adding a set_torque_vars or something like that to organize all the allocations you pull out of tpsi. You just need to make sure it gets called prior to its use in both DCON and stand-alone PENTRC.
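A rough sketch of that refactor (the module layout, declared types, and the initialization guard are assumptions; dbfun, dxfun, mthsurf, and set_torque_vars come from the discussion above):

      module torque
        ...
        real, dimension(:), allocatable :: dbfun, dxfun
        ...
      contains
        subroutine set_torque_vars
          ! call once after mthsurf is known, before the first tpsi call
          if(.not. allocated(dbfun)) allocate(dbfun(0:mthsurf),dxfun(0:mthsurf))
        end subroutine set_torque_vars
        ...
      end module torque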

@logan-nc logan-nc changed the title Open acc Adds OpenACC for optional GPU utilization Mar 4, 2020
@logan-nc
Contributor

logan-nc commented Mar 4, 2020

In reply to an email from @stephethier:

The "tpsi" function is much too large and contains too many calls to other functions and subroutines, as well as memory allocations and deallocations. Robert has been diving right in and has found out that the call to lsode is where the code spends most of its time. The problem is that the lsode1 user-provided subroutine itself calls lsode!! (lsode2 in energy.f90). The integrated function in lsode2 is "xintgrnd", which doesn't seem overly complicated. So my question is: do you really need to use lsode to integrate that energy function? Could we use something simpler, like a high resolution Simpson method? That would probably help the GPU port.

Indeed the tpsi function is large. At each radial grid point, there are complicated, embedded pitch angle and energy integrations (see Eq. (19) in Ref. [1]). I had spent some time trying to figure out how to do embedded integrals with lsode by pausing the outer one, storing its info, doing the inner one, resuming the outer, and so on, but was defeated by the lack of documentation/tutorials. So eventually I just took the hacky approach and copied out all the parts of lsode that use shared variables to create separate lsode1 and lsode2 routines that don't mess with each other's namespaces. Or so I thought....
What do you mean when you say "the lsode1 user-provided subroutine itself calls lsode!!"?

These integrations can involve very sharp resonances. I tried analytically estimating where they would be and using a smart-packed grid early on, but it was not accurate / never as good as LSODE.

Here are some examples of the energy integrand:
[image]

Here are some of the pitch:
[image]
You can see both can be very sharp if certain plasma conditions are met (balances of ω_E, ω_D, and ν).

I am 100% open to changing the integrator. I only used LSODE because it was what DCON used and a quick googling (back when I was very new to fortran) confirmed it was still one of the "best" (at least "tried and true") dynamic integrators around. That seemed good for doing lots of integrals that might or might not be very sharp. Do you have something better?

[1] N. C. Logan, J.-K. Park, K. Kim, Z. Wang, and J. W. Berkery, “Neoclassical toroidal viscosity in perturbed equilibria with general tokamak geometry,” Phys. Plasmas, vol. 20, no. 12, p. 122507, Dec. 2013.

@logan-nc
Contributor

logan-nc commented Mar 4, 2020

@stephethier @RGates94 I had a thought.
When developing this code, I had occasion to benchmark it in the zero-collisionality case. In this case, the resonant spikes shown above become true delta functions. To integrate in this case, I had to add the ximag parameter to use complex integration and avoid the poles:
[image]

I never checked this explicitly, but I guess if we used ximag > 0 (the default is 0) then the integrands could all be smooth, and we might be able to dramatically reduce the number of steps used in each integration... or even use a different technique. Think about it at least. You can find the relevant code by searching for ximag in energy.f90.
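Schematically, the regularization shifts the resonant denominator off the real axis:

      ∫ f(x)/(x − x₀) dx  →  ∫ f(x)/(x − x₀ − i·ximag) dx

so for ximag > 0 the magnitude of the resonant factor is bounded by 1/ximag and the integrand stays smooth, at the cost of a small systematic shift that vanishes as ximag → 0. (This is a sketch of the idea, not the exact form used in energy.f90.)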

@RGates94 RGates94 marked this pull request as ready for review March 13, 2020 18:29
@RGates94
Collaborator Author

Since today is my last day at the lab, I thought it would be beneficial to share an update on current progress and my thoughts on where to continue from here.

After I got OpenACC working and checked performance, I found there were no significant speedups, so I looked into where most of the time was being spent in the program. I found that the vast majority of the time was spent in a call to lsode1 in pentrc/pitch.f90. Therefore, on the suggestion of @stephethier, I began replacing LSODE with CVODE. Since this call to lsode1 took lintgrnd as a parameter, and lintgrnd called xintgrl_lsode, which calls lsode2, which takes the xintgrnd subroutine as a parameter, I started by replacing xintgrnd with a wrapper over a C function.

To do this, I first added a module that calls into C shared objects, then added a subroutine newxintgrnd which wraps a C function with the same behavior. The next steps will be to go up this call stack, replacing each Fortran subroutine with a C function with the same behavior. This will allow the use of CVODE in place of the lsode calls. I expect that this process will allow for significant speedups; moreover, moving this portion of the code into C opens up access to the larger C and CUDA ecosystems, along with the ecosystems of other languages that expose a C ABI.
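For whoever continues this, a rough sketch of the interop pattern via iso_c_binding (the C function name and argument list here are hypothetical, loosely following LSODE's f(neq, t, y, ydot) convention):

      module c_integrand
        use iso_c_binding, only: c_int, c_double
        implicit none
        interface
          ! hypothetical C implementation of the integrand
          subroutine c_xintgrnd(neq, t, y, ydot) bind(c, name="c_xintgrnd")
            import :: c_int, c_double
            integer(c_int), value :: neq
            real(c_double), value :: t
            real(c_double), intent(in) :: y(*)
            real(c_double), intent(out) :: ydot(*)
          end subroutine c_xintgrnd
        end interface
      end module c_integrand

A Fortran wrapper like newxintgrnd can then simply forward its arguments to c_xintgrnd.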

One other area I would have liked to tackle, given more time, is the warnings that occur in various modules. Many of these are simply formatting or legacy issues; however, some are serious and are very likely causing undefined behavior. For example, this warning:

vacuum_math.f:348:11:

  346 |       do 115  l=2,n
      |                                                                        2
  347 |       ia=x(l-1)
  348 |       ib=x(l)
      |           1
Warning: Array reference at (1) out of bounds (2 > 1) in loop beginning at (2)

could cause correctness errors in other parts of the code, and it could be triggered by changes that are themselves correct.

I have appreciated my time working here, and I wish everyone the best of luck in continuing to improve this code. If you have any questions for me before I leave, please let me know, and I will do my best to answer them.

@logan-nc
Contributor

👍 Thanks for the wrap-up @RGates94. I appreciate the work you put into this and wish you all the best in your new position!

@logan-nc
Contributor

@stephethier, I noticed when merging the codes that STRIDE uses ZVODE. In DCON & PENTRC we are using LSODE to integrate complex functions by splitting the real and imaginary parts and doubling the number of equations. This is a non-intuitive "trick" of the LSODE routine that Alan Glasser used decades ago and got propagated from there. Do you think perhaps switching PENTRC to ZVODE would speed things up?
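For reference, the trick works as follows: for n complex ODEs dz_k/dt = F_k(z, t) with z_k = u_k + i·v_k, LSODE is given the 2n real equations

      du_k/dt = Re F_k(u + iv, t)
      dv_k/dt = Im F_k(u + iv, t)

whereas ZVODE integrates the n complex equations directly.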

@logan-nc
Contributor

logan-nc commented Mar 13, 2020

@RGates94 do you have any numbers associated with

a vast majority of the time was spent on a call to lsode1 in pentrc/pitch.f90

I expect it is 90%+? I would then expect most of that time to be spent in lsode2 in energy.f90... is that correct?

In my experience, LSODE spends a lot of unnecessary time at the start, when the integral is low. I think it is figuring out what is safe based on relative tolerances. I wonder if there are balances of atol and rtol we can use to prevent this, or better ways of pre-normalizing the problem (an absolute tolerance atol=0.001 impacts int(x) differently than int(x/10000)).
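For reference, LSODE's local error test weights each component as

      ewt(i) = rtol*|y(i)| + atol

so a pure absolute tolerance is not scale-invariant (int(x) and int(x/10000) see very different effective tolerances), while a pure relative tolerance is scale-invariant but struggles when y(i) passes through zero, as it does at the start of an integral.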

@stephethier
Collaborator

Hi Nik @logan-nc
It might be worth trying. We may still end up with the same number of operations, but it would simplify the code. What's the purpose of "STRIDE"? I remember that you mentioned it as a faster version of dcon or rdcon (???), but I don't know the details.

@logan-nc
Contributor

Citations can always be found on the GPEC website.

The abstract from [Glasser, Phys. Plasmas 2018] is,

Active feedback control of ideal MHD stability in a tokamak requires rapid plasma stability analysis. Toward this end, we reformulate the δW stability method with a Hamilton-Jacobi theory, elucidating analytical and numerical features of the generic tokamak ideal MHD stability problem. The plasma response matrix is demonstrated to be the solution of an ideal MHD matrix Riccati differential equation. Since Riccati equations are prevalent in the control theory literature, such a shift in perspective brings to bear a range of numerical methods that are well-suited to the robust, fast solution of control problems. We discuss the usefulness of Riccati techniques in solving the stiff ordinary differential equations often encountered in ideal MHD stability analyses—for example, in tokamak edge and stellarator physics. We demonstrate the applicability of such methods to an existing 2D ideal MHD stability code—DCON [A. H. Glasser, Phys. Plasmas 23, 072505 (2016)]—enabling its parallel operation in near real-time, with wall-clock time ≪1s. Such speed may help enable active feedback ideal MHD stability control, especially in tokamak plasmas whose ideal MHD equilibria evolve with inductive timescale τ≳ 1s—as in ITER.

Basically, it speeds things up by solving the same equation using a different numerical approach that lends itself well to parallelization (calculate a bunch of transition matrices in distinct regions). Another way to look at it is: It is just DCON with the lsode call replaced 😛.

@logan-nc
Contributor

I should clarify: STRIDE does NOT contain the kinetic physics that is taking up all the time in the DCON runs we are trying to speed up. It only reproduces the ideal DCON (kin_flag=f).

We form the kinetic matrices before starting the ODE solve, so even if we used something like STRIDE it would not actually help speed up the kinetic time sink, just the ~6 s integration after all the kinetic matrices are formed.

@RGates94
Collaborator Author

RGates94 commented Mar 13, 2020

I expect it is 90%+? I would then expect most of that time to be spent in lsode2 in energy.f90... is that correct?

The exact amount of time spent on the lsode1 call depends on the exact parameters used for the run, in the DIIID_kinetic example it is approximately 90%, but I don't have a simple way to give exact numbers at the moment.

Regarding how much time is spent in the lsode2 call, I am not sure exactly; however, I strongly suspect it takes up nearly all of the time.

@matt-pharr matt-pharr mentioned this pull request Jun 17, 2025