BUG: handle segfault with hierarchical kernel when team size is larger than available threads #185

@tylerjereddy

Description

As noted here: #146 (comment)

If you write a PyKokkos kernel that uses team barrier synchronization (and probably other related hierarchical parallelism features), you can get a hard segfault when OMP_NUM_THREADS=1 is set in the environment and the kernel requests a larger team size.

While Kokkos core arguably shouldn't behave this badly either, the kernel is already compiled by the time we launch it. Since we have ahead-of-compile-time knowledge of the number of threads that will be available, I wonder if we should do something more useful than segfaulting by default.
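One hedged option (the function name and placement are hypothetical, not existing PyKokkos API): validate the requested team size against the threads the OpenMP backend will actually get, and raise a Python exception before dispatching to the compiled module. A minimal stdlib-only sketch:

```python
import os

def check_team_size(requested_team_size: int) -> None:
    # Hypothetical pre-launch guard, sketched for this issue: compare the
    # requested team size against the OpenMP thread count. OMP_NUM_THREADS
    # wins if set; otherwise assume one thread per core.
    env = os.environ.get("OMP_NUM_THREADS")
    available = int(env) if env else (os.cpu_count() or 1)
    if requested_team_size > available:
        raise ValueError(
            f"team size {requested_team_size} exceeds the {available} "
            f"available OpenMP thread(s); the compiled hierarchical "
            f"kernel would segfault"
        )
```

Something like pk.parallel_for could run such a check on the TeamPolicy before handing control to the compiled kernel, turning the hard crash into a catchable error.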

I checked that deleting the barrier syncs isn't sufficient to make the segfault go away, so something broader about the hierarchical kernel is likely to blame.

A copy of the crashing workunit is below the fold, in case it gets mutated heavily in the matching PR:

@pk.workunit
def dgemm_impl_tiled_no_view_c(team_member: pk.TeamMember,
                               k_a: int,
                               alpha: float,
                               view_a: pk.View2D[pk.double],
                               view_b: pk.View2D[pk.double],
                               out: pk.View2D[pk.double]):
    printf("tiled workunit checkpoint 1\n")
    # early attempt at tiled matrix multiplication in PyKokkos

    # for now, let's assume a 2x2 tiling arrangement and
    # that `view_a`, `view_b`, and `out` views are all 4 x 4 matrices
    tile_size: int = 4 # this is really just the team size...
    width: int = 4

    # start off by getting a global thread id
    global_tid: int = team_member.league_rank() * team_member.team_size() + team_member.team_rank()
    printf("tiled workunit checkpoint 2 for thread id: %d\n", global_tid)

    # TODO: I have no idea how to get 2D scratch memory views?
    scratch_mem_a: pk.ScratchView1D[float] = pk.ScratchView1D(team_member.team_scratch(0), tile_size)
    scratch_mem_b: pk.ScratchView1D[float] = pk.ScratchView1D(team_member.team_scratch(0), tile_size)
    printf("tiled workunit checkpoint 3 for thread id: %d\n", global_tid)
    # in a 4 x 4 matrix with 2 x 2 tiling the leagues
    # and teams have matching row/col assignment approaches
    bx: int = team_member.league_rank() // 2
    by: int = 0
    if team_member.league_rank() % 2 != 0:
        by = 1
    tx: int = team_member.team_rank() // 2
    ty: int = 0
    if team_member.team_rank() % 2 != 0:
        ty = 1
    tmp: float = 0
    col: int = by * 2 + ty
    row: int = bx * 2 + tx
    printf("tiled workunit checkpoint 4 for thread id: %d\n", global_tid)

    # these variables are a bit silly--can we not get
    # 2D scratch memory indexing?
    a_index: int = 0
    b_index: int = 0

    for i in range(out.extent(1) // 2):
        scratch_mem_a[team_member.team_rank()] = view_a[row][i * 2 + ty]
        scratch_mem_b[team_member.team_rank()] = view_b[i * 2 + tx][col]
        printf("tiled workunit checkpoint 5 for thread id: %d\n", global_tid)
        team_member.team_barrier()
        printf("tiled workunit checkpoint 6 for thread id: %d\n", global_tid)

        for k in range(2):
            a_index = k + ((team_member.team_rank() // 2) * 2)
            b_index = ty + (k * 2)
            tmp += scratch_mem_a[a_index] * scratch_mem_b[b_index]
            team_member.team_barrier()
            printf("tiled workunit checkpoint 7 for thread id: %d\n", global_tid)

    printf("tiled workunit checkpoint 8 for thread id: %d\n", global_tid)
    out[row][col] = tmp
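For what it's worth, the index arithmetic above can be checked outside of Kokkos entirely. The pure-Python sketch below (written for illustration; it uses no PyKokkos API) re-enacts the 4x4 matrix / 2x2 tile scheme with 4 leagues of 4 team ranks, simulating the team barrier by filling both scratch buffers for every rank before any rank reads them:

```python
def tiled_matmul_4x4(A, B):
    # Each (league, rank) pair owns one element of the 4x4 output,
    # mirroring the workunit's row/col assignment.
    out = [[0.0] * 4 for _ in range(4)]
    for league in range(4):
        bx, by = league // 2, league % 2
        tmp = [0.0] * 4              # per-rank accumulator
        for i in range(2):           # outer tile loop: extent(1) // 2
            scratch_a = [0.0] * 4
            scratch_b = [0.0] * 4
            # cooperative loads; completing them all first plays the
            # role of team_barrier() in the workunit
            for rank in range(4):
                tx, ty = rank // 2, rank % 2
                row, col = bx * 2 + tx, by * 2 + ty
                scratch_a[rank] = A[row][i * 2 + ty]
                scratch_b[rank] = B[i * 2 + tx][col]
            for rank in range(4):
                tx, ty = rank // 2, rank % 2
                for k in range(2):
                    # same flattened-2D indexing as the workunit
                    a_index = k + tx * 2
                    b_index = ty + k * 2
                    tmp[rank] += scratch_a[a_index] * scratch_b[b_index]
        for rank in range(4):
            tx, ty = rank // 2, rank % 2
            out[bx * 2 + tx][by * 2 + ty] = tmp[rank]
    return out
```

Running this against a naive triple-loop multiply confirms the tiling and flattened scratch indexing compute a correct 4x4 product, which supports the suspicion that the crash is about thread availability rather than the index math.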

Metadata

Labels: bug (Something isn't working)