As noted here: #146 (comment)
If you write a PyKokkos kernel that uses a team barrier synchronization (and probably other related hierarchical parallelism features), you can get a hard segfault when OMP_NUM_THREADS=1 is set in your environment.
Kokkos core arguably has an excuse for not behaving well here, since the code is already compiled by the time it runs. Still, if we have ahead-of-compile-time knowledge of the number of threads that will be available, I wonder if we should do something more useful than segfaulting by default.
I checked that deleting the barrier syncs isn't sufficient to make the segfault go away, so something broader about the hierarchical kernel is likely to blame.
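One possible mitigation, as a minimal sketch: read OMP_NUM_THREADS before dispatching a hierarchical kernel and fail with a readable error instead of letting the compiled code segfault. `check_omp_threads` below is a hypothetical helper, not part of the PyKokkos API.

```python
import os

def check_omp_threads(required_team_size: int) -> None:
    # Hypothetical pre-launch guard (not PyKokkos API): if OpenMP is
    # restricted to fewer threads than the kernel's team size, raise a
    # clear error instead of letting the compiled kernel segfault.
    raw = os.environ.get("OMP_NUM_THREADS")
    if raw is not None and int(raw) < required_team_size:
        raise RuntimeError(
            f"OMP_NUM_THREADS={raw} is smaller than the team size "
            f"({required_team_size}) this hierarchical kernel needs; "
            "refusing to launch."
        )
```

Something like `check_omp_threads(4)` could run right before the `pk.parallel_for` on the team policy.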
Copy of the crashing workunit below the fold, in case it gets mutated a lot in the matching PR:
Details
@pk.workunit
def dgemm_impl_tiled_no_view_c(team_member: pk.TeamMember,
                               k_a: int,
                               alpha: float,
                               view_a: pk.View2D[pk.double],
                               view_b: pk.View2D[pk.double],
                               out: pk.View2D[pk.double]):
    printf("tiled workunit checkpoint 1\n")
    # early attempt at tiled matrix multiplication in PyKokkos
    # for now, let's assume a 2x2 tiling arrangement and
    # that `view_a`, `view_b`, and `out` views are all 4 x 4 matrices
    tile_size: int = 4  # this is really just the team size...
    width: int = 4
    # start off by getting a global thread id
    global_tid: int = team_member.league_rank() * team_member.team_size() + team_member.team_rank()
    printf("tiled workunit checkpoint 2 for thread id: %d\n", global_tid)
    # TODO: I have no idea how to get 2D scratch memory views?
    scratch_mem_a: pk.ScratchView1D[float] = pk.ScratchView1D(team_member.team_scratch(0), tile_size)
    scratch_mem_b: pk.ScratchView1D[float] = pk.ScratchView1D(team_member.team_scratch(0), tile_size)
    printf("tiled workunit checkpoint 3 for thread id: %d\n", global_tid)
    # in a 4 x 4 matrix with 2 x 2 tiling the leagues
    # and teams have matching row/col assignment approaches
    bx: int = team_member.league_rank() // 2
    by: int = 0
    if team_member.league_rank() % 2 != 0:
        by = 1
    tx: int = team_member.team_rank() // 2
    ty: int = 0
    if team_member.team_rank() % 2 != 0:
        ty = 1
    tmp: float = 0
    col: int = by * 2 + ty
    row: int = bx * 2 + tx
    printf("tiled workunit checkpoint 4 for thread id: %d\n", global_tid)
    # these variables are a bit silly--can we not get
    # 2D scratch memory indexing?
    a_index: int = 0
    b_index: int = 0
    for i in range(out.extent(1) // 2):
        scratch_mem_a[team_member.team_rank()] = view_a[row][i * 2 + ty]
        scratch_mem_b[team_member.team_rank()] = view_b[i * 2 + tx][col]
        printf("tiled workunit checkpoint 5 for thread id: %d\n", global_tid)
        team_member.team_barrier()
        printf("tiled workunit checkpoint 6 for thread id: %d\n", global_tid)
        for k in range(2):
            a_index = k + ((team_member.team_rank() // 2) * 2)
            b_index = ty + (k * 2)
            tmp += scratch_mem_a[a_index] * scratch_mem_b[b_index]
        team_member.team_barrier()
        printf("tiled workunit checkpoint 7 for thread id: %d\n", global_tid)
    printf("tiled workunit checkpoint 8 for thread id: %d\n", global_tid)
    out[row][col] = tmp
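For sanity-checking the index arithmetic off-device, here is a sequential plain-Python emulation of the same 2x2 tiling scheme, where barriers become loop phases. `tiled_dgemm_emulation` is an illustrative helper I wrote for this issue, not PyKokkos code; it assumes the 4 x 4 / team-size-4 setup hard-coded in the workunit.

```python
def tiled_dgemm_emulation(A, B):
    # Sequential emulation of the 2x2-tiled workunit above: 4 leagues x
    # 4 team members, one output element per (league, member) pair.
    # Each team barrier is modeled as a separate phase over all ranks.
    n = 4
    out = [[0.0] * n for _ in range(n)]
    for league in range(4):
        bx, by = league // 2, league % 2
        scratch_a = [0.0] * 4
        scratch_b = [0.0] * 4
        tmp = [0.0] * 4  # per-member accumulator
        for i in range(n // 2):
            # phase 1: every member fills its scratch slot (pre-barrier)
            for rank in range(4):
                tx, ty = rank // 2, rank % 2
                row, col = bx * 2 + tx, by * 2 + ty
                scratch_a[rank] = A[row][i * 2 + ty]
                scratch_b[rank] = B[i * 2 + tx][col]
            # phase 2: every member accumulates from scratch (post-barrier)
            for rank in range(4):
                tx, ty = rank // 2, rank % 2
                for k in range(2):
                    tmp[rank] += scratch_a[tx * 2 + k] * scratch_b[ty + k * 2]
        for rank in range(4):
            tx, ty = rank // 2, rank % 2
            out[bx * 2 + tx][by * 2 + ty] = tmp[rank]
    return out
```

Working through the scratch indices this way suggests the arithmetic itself does reproduce `A @ B` for 4 x 4 inputs, which points the finger at the runtime/thread configuration rather than the tiling logic.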