-
Notifications
You must be signed in to change notification settings - Fork 340
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable executing mpas_atm_get_bdy_tend on GPUs #1277
base: develop
Are you sure you want to change the base?
Conversation
return_tend(:,:) = tend_scalars(idx,:,:) | ||
!$acc parallel default(present) | ||
if (associated(tend)) then | ||
!$acc loop vector collapse(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how much of a performance gain there will be, but you may also want to try acc loop gang vector collapse(2)
. That's also more consistent with what we've been using elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are the semantics of adding gang
to a loop vector
directive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a timer around the parallel region and ran the regional test case, before and after adding the gang
specifier. The timing was significantly improved using gang
timer_name total calls
mpas_atm_get_bdy_tend [compute] 3.06004 162 (without gang)
mpas_atm_get_bdy_tend [compute] 0.01475 162 (with gang)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! and good to know!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I changed the loop directive to be loop gang vector collapse
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks and could you also do the same for the loop vector collapse
in the else condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doh! Thanks for catching that!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks and could you also do the same for the
loop vector collapse
in the else condition.
@abishekg7 I've added the gang
directive to the loop in the else condition.
end do | ||
end do | ||
else | ||
idx = idx_ptr !don't use integer pointers in OpenACC code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment could be rephrased to be more specific, i.e. using pointers to refer to scalars (in loop bounds, or other indices) runs into problems alongside the default(present)
clause in the parallel regions, as this clause doesn't implicitly copy non-scalar/pointer variables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow. Are you saying a default(present)
statement ensures scalars referenced in the parallel region are copied from the cpu, but that scalar pointers are not? That is, the pointer will be copied, but not what the pointer points to? And the explicit assignment ensures the scalar pointed to by the pointer gets copied over?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if you'd just used an !$acc parallel
without the default(present)
, the compiler implicitly copies both non-scalar arrays and scalars to the device. When we add the default(present)
here, then it only copies over scalars, and array copies to device are left up to us. In the latter scenario, the scalar pointers are treated as non-scalars/arrays and aren't copied over to the device. And then we get a runtime error about the scalar pointer not present on the device. This was at least from my (limited) experience.
Now dereferencing the pointer is good because the compiler correctly copies the dereferenced scalar onto the device.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is, the pointer will be copied, but not what the pointer points to?
I'm not a 100% sure about this part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abishekg7
So how does this sound:
! Ensure the integer pointed to by idx_ptr is copied to the gpu device
idx = idx_ptr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the comment and initialize the idx
integer immediately after calling mpas_pool_get_dimension
#define MPAS_ACC_TIMER_STOP(X) | ||
#endif | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have a style convention here re. only 1 new line, instead of 2.
real (kind=RKIND), dimension(:,:), pointer :: tend | ||
real (kind=RKIND), dimension(:,:,:), pointer :: tend_scalars | ||
integer :: ierr | ||
integer :: idx, ierr, i, j |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mgduda ierr
seems to be unused here. Can it be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I think it would be fine to remove ierr
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ierr
is removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than the comments above, I'm getting bit identical results with the limited area case.
64c7be9
to
7aa5e8e
Compare
…PUs. Note this commit adds "mpas_atm_get_bdy_tend [ACC_data_xfer]" timers to time the data transfers done in mpas_atm_get_bdy_tend, but there is no timer for the actual computation done in mpas_atm_get_bdy_tend.
7aa5e8e
to
8b3a3d4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
This PR enables executing the
mpas_atm_get_bdy_tend
subroutine on GPUs.This is accomplished using OpenACC directives.
Tested with a regional test case.
Baseline results were obtained from building the develop branch with:
make -j32 nvhpc CORE=atmosphere PRECISION=single OPENACC=true
Then the changes in this PR were made and compiled in the same way.
Comparing the results stored in the restart.*.nc file showed no changes.