
Conversation

@RGates94
Collaborator

This branch adds OpenACC directives, intended to increase the performance of DCON and related workflows by utilizing GPU clusters.

More details will be added as work progresses.

@logan-nc logan-nc self-assigned this Feb 11, 2020
@stephethier
Collaborator

stephethier commented Feb 11, 2020

@RGates94 It's better to disable the OpenMP directives with an #ifdef statement instead of removing them completely. How about something like:

  #ifdef GPU
  !$acc ...
  #else 
  !$omp ...
  #endif  

Actually, we could also use #ifdef _OPENACC. It gets defined automatically when compiling with the OpenACC compiler option for both PGI and gfortran.
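For reference, a minimal sketch of that variant (the specific loop directives here are illustrative, not taken from the actual code):

      #ifdef _OPENACC
      !$acc parallel loop
      #else
      !$omp parallel do
      #endif
      do i=1,n
         ...
      enddo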

@logan-nc
Contributor

Hi @RGates94 and @stephethier. Would you be ready to present any progress at the DCON users roundtable in 2 weeks? Or maybe the one after that? Just let me know whenever you feel you have something to show.

In the meantime, I'll note this replaces #97 (which, in turn, was meant to be an extension of #82). @RGates94 can find nice summary flowcharts of the codes in those PR discussions.

@logan-nc
Contributor

I'll also record @parkjk's recommendations for getting a fast kinetic run to use for debugging here so everyone can just refer to it whenever we have another PR or issue.

In equil.in: equil_control,

psilow=0.3 (rather than 0.01)
psihigh=0.5 (rather than 0.992)
mpsi=64 (instead of 128. This controls the number of grid points in addition to the domain size; should be OK.)

And make sure it actually stops and ends at those psi limits by setting, in dcon.in: dcon_control,

sas_flag=f (instead of t)

Then kinetic DCON ran within 18 seconds (with 9 OMP threads) using both the GPEC-1.3 release version and a develop branch in my area: see /p/gpec/users/jpark/runs/gpec_DIIID_kinetic_example.

The poloidal harmonic numbers can also be reduced in dcon.in: dcon_control

delta_mlow=2 (instead of 8)
delta_mhigh=4 (instead of 8)

and the loop over bounce harmonic numbers in pentrc.in: pent_input

nl=1

You may also want to try increasing the tolerance of the kinetic (or lambda) integration in pentrc.in: pent_control

atol_xlmda = 1e-6 (rather than 1e-9)
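Collecting the changes above in one place, as namelist fragments (assuming standard Fortran namelist syntax; all other inputs unchanged):

      ! equil.in
      &equil_control
        psilow=0.3
        psihigh=0.5
        mpsi=64
      /

      ! dcon.in
      &dcon_control
        sas_flag=f
        delta_mlow=2
        delta_mhigh=4
      /

      ! pentrc.in
      &pent_input
        nl=1
      /
      &pent_control
        atol_xlmda=1e-6
      /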

@stephethier
Collaborator

@logan-nc Thanks Nik. 18 seconds is good enough. It becomes difficult to measure speedups when the run is too short (< 1 sec).

@RGates94
Collaborator Author

@stephethier I will definitely change the way I am disabling the OMP directives to what you are suggesting.

@logan-nc I'm not sure whether I will have something to present for the next roundtable; however, I am very confident that something will be ready for the one after that. Most of the work so far has been in preparing the codebase to use OpenACC directives. While the code is now compiling with OpenACC enabled, tests so far don't indicate any speedups compared to runs with the same compiler (although gfortran seems to run around 14% faster than Intel overall).

The largest current roadblock is compiler-specific bugs in pgfortran, as the Nvidia profiler only works with executables produced by this compiler.

@stephethier
Collaborator

@logan-nc (and @RGates94 ) Nik, the "tpsi" function in torque.F90 has a lot of allocation and deallocation:

        allocate(dbfun(0:mthsurf),dxfun(0:mthsurf))
        call spline_alloc(tspl,mthsurf,5)
        ...
        deallocate(dbfun,dxfun)
        ...

This is not good for GPU. We can either allocate everything in advance, or bring the nested loops inside of "tpsi". There are some performance implications for both but I'm trying to figure out which one would be the easiest to implement.

@logan-nc
Contributor

Thanks for the update @RGates94. Just keep recording progress here and we can decide when there is sufficient material to warrant a devoted agenda item. In the meantime, feel free to join the meetings and raise any minor questions or observations as you go.

For the allocations in tpsi: this is simply my pythonic programming naivety at work. The allocations you call out are always the same size (mthsurf) and are not large (no real memory reason to deallocate). Please move them outside the function as torque-module-wide variables.

Note, mthsurf is set by the inputs module when it uses the dcon_interface to read input files, OR by dcon when it calls set_eq to set things in memory without going through the file interface. You might consider adding a set_torque_vars or something like that to organize all the allocations you pull out of tpsi. You just need to make sure it gets called prior to its use in both DCON and stand-alone PENTRC.
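A rough sketch of that refactor (the module layout, declared types, and the initialization guard are assumptions; dbfun, dxfun, mthsurf, and set_torque_vars come from the discussion above):

      module torque
        ...
        real, dimension(:), allocatable :: dbfun, dxfun
        ...
      contains
        subroutine set_torque_vars
          ! call once after mthsurf is known, before the first tpsi call
          if(.not. allocated(dbfun)) allocate(dbfun(0:mthsurf),dxfun(0:mthsurf))
        end subroutine set_torque_vars
        ...
      end module torque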

@logan-nc logan-nc changed the title Open acc Adds OpenACC for optional GPU utilization Mar 4, 2020
@logan-nc
Contributor

logan-nc commented Mar 4, 2020

In reply to an email from @stephethier:

The "tpsi" function is much too large and contains too many calls to other functions and subroutines, as well as memory allocations and deallocations. Robert has been diving right in and has found out that the call to lsode is where the code spends most of its time. The problem is that the lsode1 user-provided subroutine itself calls lsode!! (lsode2 in energy.f90). The integrated function in lsode2 is "xintgrnd", which doesn't seem overly complicated. So my question is: do you really need to use lsode to integrate that energy function? Could we use something simpler, like a high resolution Simpson method? That would probably help the GPU port.

Indeed the tpsi function is large. At each radial grid point, there are complicated, embedded pitch angle and energy integrations (see Eq. (19) in Ref. [1]). I had spent some time trying to figure out how to do embedded integrals with lsode by pausing the outer one, storing its info, doing the inner one, resuming the outer, and so on, but was defeated by the lack of documentation/tutorials. So eventually I just took the hacky approach and copied out all the parts of lsode that use shared variables to create separate lsode1 and lsode2 routines that don't mess with each other's namespaces. Or so I thought....
What do you mean when you say "the lsode1 user-provided subroutine itself calls lsode!!"?

These integrations can involve very sharp resonances. I tried analytically estimating where they would be and using a smart-packed grid early on, but it was not accurate / never as good as LSODE.

Here are some examples of the energy integrand:
[image]

Here are some of the pitch:
[image]
You can see both can be very sharp if certain plasma conditions are met (balances of ω_E, ω_D, and ν).

I am 100% open to changing the integrator. I only used LSODE because it was what DCON used and a quick googling (back when I was very new to fortran) confirmed it was still one of the "best" (at least "tried and true") dynamic integrators around. That seemed good for doing lots of integrals that might or might not be very sharp. Do you have something better?

[1] N. C. Logan, J.-K. Park, K. Kim, Z. Wang, and J. W. Berkery, “Neoclassical toroidal viscosity in perturbed equilibria with general tokamak geometry,” Phys. Plasmas, vol. 20, no. 12, p. 122507, Dec. 2013.

@logan-nc
Contributor

logan-nc commented Mar 4, 2020

@stephethier @RGates94 I had a thought.
When developing this code, I had occasion to benchmark it in the zero-collisionality case. In this case, the resonant spikes shown above become true delta functions. To integrate in this case, I had to add the ximag parameter to use complex integration and avoid the poles:
[image]

I never checked this explicitly, but I guess if we used ximag > 0 (the default is 0) then the integrands could all be smooth, and we might be able to dramatically reduce the number of steps used in each integration... or even use a different technique. Think about it at least. You can find the relevant code by searching for ximag in energy.f90.
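Schematically, the regularization shifts the resonant denominator off the real axis:

      ∫ f(x)/(x − x₀) dx  →  ∫ f(x)/(x − x₀ − i·ximag) dx

so for ximag > 0 the magnitude of the resonant factor is bounded by 1/ximag and the integrand stays smooth, at the cost of a small systematic shift that vanishes as ximag → 0. (This is a sketch of the idea, not the exact form used in energy.f90.)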

@RGates94 RGates94 marked this pull request as ready for review March 13, 2020 18:29
@RGates94
Collaborator Author

Since today is my last day at the lab, I thought it would be beneficial to share an update on current progress and my thoughts on where to continue from here.

After I got OpenACC working and checked performance, I found there were no significant speedups, so I looked into where most of the time was being spent in the program. I found that the vast majority of the time was spent in a call to lsode1 in pentrc/pitch.f90. Therefore, on the suggestion of @stephethier, I began replacing LSODE with CVODE. Since this call to lsode1 took lintgrnd as a parameter, and lintgrnd called xintgrl_lsode, which calls lsode2, which takes the xintgrnd subroutine as a parameter, I started by replacing xintgrnd with a wrapper over a C function.

To do this, I first added a module that calls into C shared objects, then added a subroutine newxintgrnd which wraps a C function with the same behavior. The next steps will be to go up this call stack, replacing each Fortran subroutine with a C function with the same behavior. This will allow the use of CVODE in place of the lsode calls. I expect that this process will allow for significant speedups; moreover, moving this portion of the code into C opens up access to the larger C and CUDA ecosystems, along with the ecosystems of other languages that expose a C ABI.
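For whoever continues this, a rough sketch of the interop pattern via iso_c_binding (the C function name and argument list here are hypothetical, loosely following LSODE's f(neq, t, y, ydot) convention):

      module c_integrand
        use iso_c_binding, only: c_int, c_double
        implicit none
        interface
          ! hypothetical C implementation of the integrand
          subroutine c_xintgrnd(neq, t, y, ydot) bind(c, name="c_xintgrnd")
            import :: c_int, c_double
            integer(c_int), value :: neq
            real(c_double), value :: t
            real(c_double), intent(in) :: y(*)
            real(c_double), intent(out) :: ydot(*)
          end subroutine c_xintgrnd
        end interface
      end module c_integrand

A Fortran wrapper like newxintgrnd can then simply forward its arguments to c_xintgrnd.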

One other area I would have liked to tackle, given more time, is the warnings that occur in various modules. Many of these are simply formatting or legacy issues; however, some are serious and are very likely causing undefined behavior. For example, this warning:

vacuum_math.f:348:11:

  346 |       do 115  l=2,n
      |                                                                        2
  347 |       ia=x(l-1)
  348 |       ib=x(l)
      |           1
Warning: Array reference at (1) out of bounds (2 > 1) in loop beginning at (2)

could cause correctness errors in other parts of the code, and it could be triggered by changes that are themselves correct.

I have appreciated my time working here, and I wish everyone the best of luck in continuing to improve this code. If you have any questions for me before I leave, please let me know, and I will do my best to answer them.

@logan-nc
Contributor

👍 Thanks for the wrap-up @RGates94. I appreciate the work you put into this and wish you all the best in your new position!

@logan-nc
Contributor

@stephethier, I noticed when merging the codes that STRIDE uses ZVODE. In DCON & PENTRC we are using LSODE to integrate complex functions by splitting the real and imaginary parts and doubling the number of equations. This is a non-intuitive "trick" of the LSODE routine that Alan Glasser used decades ago and got propagated from there. Do you think perhaps switching PENTRC to ZVODE would speed things up?
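For reference, the trick works as follows: for n complex ODEs dz_k/dt = F_k(z, t) with z_k = u_k + i·v_k, LSODE is given the 2n real equations

      du_k/dt = Re F_k(u + iv, t)
      dv_k/dt = Im F_k(u + iv, t)

whereas ZVODE integrates the n complex equations directly.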

@logan-nc
Contributor

logan-nc commented Mar 13, 2020

@RGates94 do you have any numbers associated with

a vast majority of the time was spent on a call to lsode1 in pentrc/pitch.f90

I expect it is 90%+? I would then expect most of that time to be spent in lsode2 in energy.f90... is that correct?

In my experience, LSODE spends a lot of unnecessary time at the start, when the integral is low. I think it is figuring out what is safe based on relative tolerances. I wonder if there are balances of atol and rtol we can use to prevent this, or better ways of pre-normalizing the problem (an absolute tolerance atol=0.001 impacts int(x) differently than int(x/10000)).
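For reference, LSODE's local error test weights each component as

      ewt(i) = rtol*|y(i)| + atol

so a pure absolute tolerance is not scale-invariant (int(x) and int(x/10000) see very different effective tolerances), while a pure relative tolerance is scale-invariant but struggles when y(i) passes through zero, as it does at the start of an integral.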

@stephethier
Collaborator

Hi Nik @logan-nc
It might be worth trying. We may still end up with the same number of operations, but it would simplify the code. What's the purpose of "STRIDE"? I remember that you mentioned it as a faster version of dcon or rdcon (???), but I don't know the details.

@logan-nc
Contributor

Citations can always be found on the GPEC website.

The abstract from [Glasser, Phys. Plasmas 2018] is,

Active feedback control of ideal MHD stability in a tokamak requires rapid plasma stability analysis. Toward this end, we reformulate the δW stability method with a Hamilton-Jacobi theory, elucidating analytical and numerical features of the generic tokamak ideal MHD stability problem. The plasma response matrix is demonstrated to be the solution of an ideal MHD matrix Riccati differential equation. Since Riccati equations are prevalent in the control theory literature, such a shift in perspective brings to bear a range of numerical methods that are well-suited to the robust, fast solution of control problems. We discuss the usefulness of Riccati techniques in solving the stiff ordinary differential equations often encountered in ideal MHD stability analyses—for example, in tokamak edge and stellarator physics. We demonstrate the applicability of such methods to an existing 2D ideal MHD stability code—DCON [A. H. Glasser, Phys. Plasmas 23, 072505 (2016)]—enabling its parallel operation in near real-time, with wall-clock time ≪1s. Such speed may help enable active feedback ideal MHD stability control, especially in tokamak plasmas whose ideal MHD equilibria evolve with inductive timescale τ≳ 1s—as in ITER.

Basically, it speeds things up by solving the same equation using a different numerical approach that lends itself well to parallelization (calculate a bunch of transition matrices in distinct regions). Another way to look at it is: It is just DCON with the lsode call replaced 😛.

@logan-nc
Contributor

I should clarify: STRIDE does NOT contain the kinetic physics that is taking up all the time in the DCON runs we are trying to speed up. It only reproduces the ideal DCON (kin_flag=f).

We form the kinetic matrices before starting the ODE solve, so even if we used something like STRIDE it would not actually help speed up the kinetic time sink, just the ~6 s integration after all the kinetic matrices are formed.

@RGates94
Collaborator Author

RGates94 commented Mar 13, 2020

I expect it is 90%+? I would then expect most of that time to be spent in lsode2 in energy.f90... is that correct?

The exact amount of time spent on the lsode1 call depends on the exact parameters used for the run, in the DIIID_kinetic example it is approximately 90%, but I don't have a simple way to give exact numbers at the moment.

Regarding how much time is spent in the lsode2 call, I am not sure exactly; however, I strongly suspect it takes up nearly all of the time.

@matt-pharr matt-pharr mentioned this pull request Jun 17, 2025