Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable executing mpas_atm_get_bdy_tend on GPUs #1277

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

jim-p-w
Copy link
Contributor

@jim-p-w jim-p-w commented Jan 24, 2025

This PR enables executing the mpas_atm_get_bdy_tend subroutine on GPUs.
This is accomplished using OpenACC directives.

Tested with a regional test case.
Baseline results were obtained from building the develop branch with:
make -j32 nvhpc CORE=atmosphere PRECISION=single OPENACC=true
Then the changes in this PR were made and compiled in the same way.

Comparing the results stored in the restart.*.nc file showed no changes.

@mgduda mgduda added Atmosphere OpenACC Work related to OpenACC acceleration of code labels Jan 24, 2025
@mgduda mgduda requested review from mgduda and abishekg7 January 24, 2025 23:00
return_tend(:,:) = tend_scalars(idx,:,:)
!$acc parallel default(present)
if (associated(tend)) then
!$acc loop vector collapse(2)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how much of a performance gain there will be, but you may also want to try acc loop gang vector collapse(2). That's also more consistent with what we've been using elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are the semantics of adding gang to a loop vector directive?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a timer around the parallel region and ran the regional test case, before and after adding the gang specifier. The timing was significantly improved using gang

timer_name                                         total       calls
mpas_atm_get_bdy_tend [compute]                    3.06004       162 (without gang)
mpas_atm_get_bdy_tend [compute]                    0.01475       162 (with gang)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! and good to know!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I changed the loop directive to be loop gang vector collapse

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks and could you also do the same for the loop vector collapse in the else condition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh! Thanks for catching that!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks and could you also do the same for the loop vector collapse in the else condition.

@abishekg7 I've added the gang directive to the loop in the else condition.

end do
end do
else
idx = idx_ptr !don't use integer pointers in OpenACC code
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment could be rephrased to be more specific, i.e. using pointers to refer to scalars (in loop bounds, or other indices) runs into problems alongside the default(present) clause in the parallel regions, as this clause doesn't implicitly copy non-scalar/pointer variables.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow. Are you saying a default(present) statement ensures scalars referenced in the parallel region are copied from the cpu, but that scalar pointers are not? That is, the pointer will be copied, but not what the pointer points to? And the explicit assignment ensures the scalar pointed to by the pointer gets copied over?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if you'd just used an !$acc parallel without the default(present), the compiler implicitly copies both non-scalar arrays and scalars to the device. When we add the default(present) here, then it only copies over scalars, and array copies to device are left up to us. In the latter scenario, the scalar pointers are treated as non-scalars/arrays and aren't copied over to the device. And then we get a runtime error about the scalar pointer not present on the device. This was at least from my (limited) experience.

Now dereferencing the pointer is good because the compiler correctly copies the dereferenced scalar onto the device.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is, the pointer will be copied, but not what the pointer points to?

I'm not a 100% sure about this part.

Copy link
Contributor Author

@jim-p-w jim-p-w Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abishekg7
So how does this sound:

! Ensure the integer pointed to by idx_ptr is copied to the gpu device
idx = idx_ptr

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sounds good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the comment and initialize the idx integer immediately after calling mpas_pool_get_dimension

#define MPAS_ACC_TIMER_STOP(X)
#endif


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have a style convention here re. only 1 new line, instead of 2.

real (kind=RKIND), dimension(:,:), pointer :: tend
real (kind=RKIND), dimension(:,:,:), pointer :: tend_scalars
integer :: ierr
integer :: idx, ierr, i, j
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgduda ierr seems to be unused here. Can it be removed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I think it would be fine to remove ierr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ierr is removed.

Copy link
Collaborator

@abishekg7 abishekg7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the comments above, I'm getting bit identical results with the limited area case.

@jim-p-w jim-p-w force-pushed the atmosphere/mpas_atm_get_bdy_tend branch 3 times, most recently from 64c7be9 to 7aa5e8e Compare January 31, 2025 20:27
…PUs.

Note this commit adds "mpas_atm_get_bdy_tend [ACC_data_xfer]" timers to
time the data transfers done in mpas_atm_get_bdy_tend, but there is no
timer for the actual computation done in mpas_atm_get_bdy_tend.
@jim-p-w jim-p-w force-pushed the atmosphere/mpas_atm_get_bdy_tend branch from 7aa5e8e to 8b3a3d4 Compare February 3, 2025 18:01
Copy link
Collaborator

@abishekg7 abishekg7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Atmosphere OpenACC Work related to OpenACC acceleration of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants