Skip to content

Implement parallelization for hydrogen bond list setup in gfnff_hbset #1240

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

Thomas3R
Copy link
Contributor

As mentioned in #1239, the HB setup benefits from the parallelization implemented here.
I use local copies of the variables from TNeigh and TGFFNeighbourList types so the constants in the construct (nlist_nhb1, nlist_nhb2, neigh_numctr, neigh_nTrans, hbthr1, hbthr2) can be used as firstprivate.
The !$omp atomic capture ensures that each thread works on a different/unique entry (index nlist_nhb1 or nlist_nhb2) of the respective lists.
I confirmed improved scaling on the X23 benchmark and a larger peptide crystal. However, gfnff_hbset0 takes significantly longer and should be parallelized next.

@Thomas3R Thomas3R requested a review from marvinfriede March 27, 2025 16:18
Comment on lines 821 to 822
rih=NORM2(xyz(:,nh)-(xyz(:,i)+neigh%transVec(:,iTri)))**2
rjh=NORM2(xyz(:,nh)-(xyz(:,j)+neigh%transVec(:,iTrj)))**2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rih=NORM2(xyz(:,nh)-(xyz(:,i)+neigh%transVec(:,iTri)))**2
rjh=NORM2(xyz(:,nh)-(xyz(:,j)+neigh%transVec(:,iTrj)))**2
rih = sum((xyz(:, nh) - xyz(:, i) - neigh%transVec(:, iTri))**2)
rjh = sum((xyz(:, nh) - xyz(:, j) + neigh%transVec(:, iTrj))**2)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Taking the square root to square it again is unnecessary. However, I think what we want here is the intrinsic dot_product(vec, vec) with vec=xyz(:, nh) - (xyz(:, i) + neigh%transVec(:, iTri)). Do you agree?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Albkat just mentioned that dot_product is a old Fortran function, and if one wants to use the dot product here BLAS level 1 functions should be used. However, I think just using sum and the square does the job.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the sum and square led to a change in energy, probably for numerical reasons. Using mctc_dot (which is the BLAS level 1 function) works as expected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second line has a plus instead of a minus beforethe translation vector. That is probably the source of the error.

iTrDum=neigh%fTrSum(neigh%iTrNeg(iTri),iTrj)
! iTrDum=-1 is valid here because we are only interested if ij is a bond (ijnonbond)
! However, iTrDum is not and should not be used as an index without excluding -1 value
if((iTrDum.le.neigh_nTrans.and.iTrDum.gt.0).or.iTrDum.eq.-1) then ! cycle invalid values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please convert these ifs to early returns instead to make the code more readable?
I.e., invert the condition and use continue.

Copy link
Contributor Author

@Thomas3R Thomas3R Mar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am not mistaken, continue is a "do-nothing" statement in Fortran. The previously implemented cycle statement is not allowed in OpenMP parallel do regions. Therefore, I changed it to the less readable if statement.

Copy link
Member

@marvinfriede marvinfriede Mar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, I meant cycle. However, I do see why it should not be allowed. Actually, we use that rather often (example). You are only not allowed to break out of a loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are absolutely right! Somehow my compiler told me I could not use cycle at the time, but it seems it has changed its mind about the current version of the code. I reversed it to the original version as requested.

if(iTri.le.neigh_numctr) then ! nh is not shifted so bpair works without adjustment
if(neigh%bpair(i,nh,iTri).eq.1.and.ijnonbond) then
!$omp atomic capture
nlist_nhb2 = nlist%nhb2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you introduce private variables instead of updating the shared variables under atomic protection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not an expert here, but so far, all approaches using only

!$omp atomic
nlist%nhb2 = nlist%nhb2 + 1

led to a wrong energy (mostly NaN). To my understanding, using a private variable in an atomic capture environment is the safest way to go here. The atomic capture ensures that each thread gets a unique index and the updates do not interfere with each other, and it requires a private variable. Also, the speed-up is reasonable for my test cases, and this implementation is working as expected.

!$omp parallel do default(none) private(i, j, k, nh, ix, iTri, iTrj, iTrDum, rab, &
!$omp rih, rjh,ijnonbond,free ) &
!$omp firstprivate (nlist_nhb1, nlist_nhb2, neigh_numctr, neigh_nTrans, hbthr1, hbthr2) &
!$omp shared(nlist, topo, neigh, xyz)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe schedule(dynamic) improves performance here. The loop work could be different due to the conditions in the nested loops.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will test the influence of the schedule in the implementation in gfnff_hbset0. The parallelization in gfnff_hbset has little impact in comparison to gfnff_hbset0, so it is easier for me to check there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use collapse(3) with evaluation of i/j variables in inner loops to increase OpenMP iteration space. Playing with second arg of dynamic scheduling may also help: I usually use 32 or 64.

@Thomas3R Thomas3R marked this pull request as draft March 28, 2025 14:54
Comment on lines +801 to +802
do iTri=1, neigh_nTrans ! go through i shifts
do iTrj=1, neigh_nTrans ! go through j shifts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice naming of iterators. :|

Comment on lines +849 to +853
nlist%hblist2(1,nlist_nhb2)=j
nlist%hblist2(2,nlist_nhb2)=i
nlist%hblist2(3,nlist_nhb2)=nh
nlist%hblist2(4,nlist_nhb2)=iTrj
nlist%hblist2(5,nlist_nhb2)=iTri
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you may have data race with nlist_nhb2 variable. On x86/x86_64 it works fine, but AArch64 can say "Hi" for you :)


! for nxb list only i is not shifted
nlist%nxb =0
do ix=1,topo%natxbAB
i =topo%xbatABl(1,ix) ! A
j =topo%xbatABl(2,ix) ! B
iTrj=topo%xbatABl(4,ix) ! iTrB
if(iTrj.gt.neigh%nTrans.or.iTrj.lt.-1.or.iTrj.eq.0) cycle ! cycle nonsense
if(iTrj.gt.neigh_nTrans.or.iTrj.lt.-1.or.iTrj.eq.0) cycle ! cycle nonsense
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was it changed?

Comment on lines +824 to +826
rih = mctc_dot(vec_ih, vec_ih) ! square of distance iH
vec_jh = xyz(:,nh)-(xyz(:,j)+neigh%transVec(:,iTrj))
rjh = mctc_dot(vec_jh, vec_jh) ! square of distance jH
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to use dot_product intrinsic (which can be optimized by compiler) rather than using external procedure mctc_dot

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants