Implement parallelization for hydrogen bond list setup in gfnff_hbset #1240
Conversation
src/gfnff/gfnff_ini2.f90
Outdated
rih=NORM2(xyz(:,nh)-(xyz(:,i)+neigh%transVec(:,iTri)))**2
rjh=NORM2(xyz(:,nh)-(xyz(:,j)+neigh%transVec(:,iTrj)))**2
Suggested change:
-rih=NORM2(xyz(:,nh)-(xyz(:,i)+neigh%transVec(:,iTri)))**2
-rjh=NORM2(xyz(:,nh)-(xyz(:,j)+neigh%transVec(:,iTrj)))**2
+rih = sum((xyz(:, nh) - xyz(:, i) - neigh%transVec(:, iTri))**2)
+rjh = sum((xyz(:, nh) - xyz(:, j) + neigh%transVec(:, iTrj))**2)
Good catch! Taking the square root to square it again is unnecessary. However, I think what we want here is the intrinsic dot_product(vec, vec) with vec = xyz(:, nh) - (xyz(:, i) + neigh%transVec(:, iTri)). Do you agree?
@Albkat just mentioned that dot_product is an old Fortran function, and if one wants to use a dot product here, BLAS level-1 functions should be used. However, I think just using the sum of the squares does the job.
Using the sum of squares led to a change in energy, probably for numerical reasons. Using mctc_dot (the BLAS level-1 dot product) works as expected.
The second line has a plus instead of a minus before the translation vector. That is probably the source of the error.
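For illustration, a minimal standalone sketch (with a hypothetical vector v standing in for xyz(:,nh)-(xyz(:,i)+neigh%transVec(:,iTri)), not the PR code) showing that the three variants discussed in this thread agree up to rounding:

program sqdist_demo
  implicit none
  real(8) :: v(3), r2a, r2b, r2c
  v = [1.0d0, 2.0d0, 3.0d0]     ! hypothetical stand-in for the H...i vector
  r2a = norm2(v)**2             ! original: takes a square root only to square it again
  r2b = dot_product(v, v)       ! intrinsic dot product, no square root
  r2c = sum(v**2)               ! sum-of-squares form from the suggestion
  print *, r2a, r2b, r2c        ! all three print 14.0 (up to rounding in the sqrt variant)
end program sqdist_demo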
src/gfnff/gfnff_ini2.f90
Outdated
iTrDum=neigh%fTrSum(neigh%iTrNeg(iTri),iTrj)
! iTrDum=-1 is valid here because we are only interested if ij is a bond (ijnonbond)
! However, iTrDum is not and should not be used as an index without excluding -1 value
if((iTrDum.le.neigh_nTrans.and.iTrDum.gt.0).or.iTrDum.eq.-1) then ! cycle invalid values
Can you please convert these ifs to early returns instead to make the code more readable? I.e., invert the condition and use continue.
If I am not mistaken, continue is a "do-nothing" statement in Fortran. The previously implemented cycle statement is not allowed in OpenMP parallel do regions. Therefore, I changed it to the less readable if statement.
Oh yes, I meant cycle. However, I don't see why it should not be allowed. Actually, we use it rather often (example). You are only not allowed to break out of a loop.
You are absolutely right! Somehow my compiler told me I could not use cycle at the time, but it seems it has changed its mind about the current version of the code. I reverted to the original version as requested.
src/gfnff/gfnff_ini2.f90
Outdated
if(iTri.le.neigh_numctr) then ! nh is not shifted so bpair works without adjustment
  if(neigh%bpair(i,nh,iTri).eq.1.and.ijnonbond) then
    !$omp atomic capture
    nlist_nhb2 = nlist%nhb2
Why did you introduce private variables instead of updating the shared variables under atomic protection?
I am not an expert here, but so far, all approaches using only

!$omp atomic
nlist%nhb2 = nlist%nhb2 + 1

led to a wrong energy (mostly NaN). To my understanding, using a private variable in an atomic capture construct is the safest way to go here. The atomic capture ensures that each thread gets a unique index and that the updates do not interfere with each other, and it requires a private variable. Also, the speed-up is reasonable for my test cases, and this implementation works as expected.
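A minimal self-contained sketch of this capture pattern (hypothetical names cnt, my_idx, and list; not the PR code):

program capture_demo
  implicit none
  integer :: i, cnt, my_idx
  integer :: list(1000)
  cnt = 0
  !$omp parallel do private(my_idx) shared(cnt, list)
  do i = 1, 1000
    !$omp atomic capture
    cnt = cnt + 1       ! update the shared counter ...
    my_idx = cnt        ! ... and capture its new value in the same atomic step
    !$omp end atomic
    list(my_idx) = i    ! each thread writes to its own unique slot
  end do
  !$omp end parallel do
  print *, cnt          ! 1000: every iteration received a distinct index
end program capture_demo

The capture form differs from a plain !$omp atomic update in that the incremented value is read back within the same protected operation, which is what makes the reserved index unique per thread.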
!$omp parallel do default(none) private(i, j, k, nh, ix, iTri, iTrj, iTrDum, rab, &
!$omp rih, rjh, ijnonbond, free) &
!$omp firstprivate(nlist_nhb1, nlist_nhb2, neigh_numctr, neigh_nTrans, hbthr1, hbthr2) &
!$omp shared(nlist, topo, neigh, xyz)
Maybe schedule(dynamic) improves performance here. The work per iteration could differ because of the conditions in the nested loops.
I will test the influence of the schedule in the implementation in gfnff_hbset0. The parallelization in gfnff_hbset has little impact in comparison to gfnff_hbset0, so it is easier for me to check there.
Better to use collapse(3) with evaluation of the i/j variables in the inner loops to increase the OpenMP iteration space. Playing with the second argument of dynamic scheduling may also help: I usually use 32 or 64.
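For orientation, a toy example (placeholder loop body and hypothetical bound n, not the PR code; chunk size 64 as suggested) of combining collapse(3) with dynamic scheduling:

program collapse_demo
  implicit none
  integer :: iTri, iTrj, k, n, total
  n = 8
  total = 0
  ! collapse(3) merges the three perfectly nested loops into one iteration
  ! space of n**3; dynamic scheduling with chunk size 64 balances uneven work
  !$omp parallel do collapse(3) schedule(dynamic, 64) reduction(+:total)
  do iTri = 1, n
    do iTrj = 1, n
      do k = 1, n
        total = total + 1   ! placeholder for the real per-pair work
      end do
    end do
  end do
  !$omp end parallel do
  print *, total            ! n**3 = 512
end program collapse_demo

Note that collapse requires the loops to be perfectly nested, which is why the comment above suggests moving the evaluation of the i/j variables into the innermost loop.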
do iTri=1, neigh_nTrans ! go through i shifts
  do iTrj=1, neigh_nTrans ! go through j shifts
Nice naming of iterators. :|
nlist%hblist2(1,nlist_nhb2)=j
nlist%hblist2(2,nlist_nhb2)=i
nlist%hblist2(3,nlist_nhb2)=nh
nlist%hblist2(4,nlist_nhb2)=iTrj
nlist%hblist2(5,nlist_nhb2)=iTri
Here you may have a data race with the nlist_nhb2 variable. On x86/x86_64 it works fine, but AArch64 can say "Hi" to you :)
! for nxb list only i is not shifted
nlist%nxb =0
do ix=1,topo%natxbAB
  i =topo%xbatABl(1,ix) ! A
  j =topo%xbatABl(2,ix) ! B
  iTrj=topo%xbatABl(4,ix) ! iTrB
-  if(iTrj.gt.neigh%nTrans.or.iTrj.lt.-1.or.iTrj.eq.0) cycle ! cycle nonsense
+  if(iTrj.gt.neigh_nTrans.or.iTrj.lt.-1.or.iTrj.eq.0) cycle ! cycle nonsense
Why was it changed?
vec_ih = xyz(:,nh)-(xyz(:,i)+neigh%transVec(:,iTri))
rih = mctc_dot(vec_ih, vec_ih) ! square of distance iH
vec_jh = xyz(:,nh)-(xyz(:,j)+neigh%transVec(:,iTrj))
rjh = mctc_dot(vec_jh, vec_jh) ! square of distance jH
It is better to use the dot_product intrinsic (which can be optimized by the compiler) rather than the external procedure mctc_dot.
As mentioned in #1239, the HB setup benefits from the parallelization implemented here. I use local copies of the variables from the TNeigh and TGFFNeighbourList types so that the constants in the construct (nlist_nhb1, nlist_nhb2, neigh_numctr, neigh_nTrans, hbthr1, hbthr2) can be used as firstprivate. The !$omp atomic capture construct ensures that each thread works on a different/unique entry (index nlist_nhb1 or nlist_nhb2) of the respective lists. I confirmed improved scaling on the X23 benchmark and a larger peptide crystal. However, gfnff_hbset0 takes significantly longer and should be parallelized next.
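A minimal sketch of the local-copy pattern described here (toy type and names, not the PR code): OpenMP data-sharing clauses cannot list derived-type components such as neigh%nTrans directly, so a scalar copy is taken first and passed as firstprivate:

program firstprivate_demo
  implicit none
  type :: t_counts
    integer :: nTrans = 4
  end type t_counts
  type(t_counts) :: neigh
  integer :: neigh_nTrans, i, hits
  neigh_nTrans = neigh%nTrans   ! scalar copy of the derived-type component
  hits = 0
  !$omp parallel do firstprivate(neigh_nTrans) reduction(+:hits)
  do i = 1, 100
    if (i <= neigh_nTrans) hits = hits + 1
  end do
  !$omp end parallel do
  print *, hits   ! prints 4
end program firstprivate_demo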