BUG: no guardrails on shared/scratch alloc requests #183

@tylerjereddy

Modifying the tiled DGEMM kernel code in gh-146 as shown below can lead to a segfault. While I realize the C++ Kokkos docs do advise checking the size of the shared memory caches before allocating them, this isn't really a Pythonic experience, so we may need some kind of (arguably default-on) mode that auto-queries the size of, e.g., the L1 (and so on) cache and refuses to compile a kernel that requests more scratch memory than is available.
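As a rough illustration of the kind of check meant here, the sketch below compares a requested scratch allocation against a capacity in bytes and raises a Python exception instead of letting the request go through. Everything in it is hypothetical: `ScratchRequestError` and `validate_scratch_request` are not part of pykokkos, and the 48 KiB capacity is a made-up number, since how the real limit would be queried depends on the backend.

```python
class ScratchRequestError(MemoryError):
    """Raised when a scratch request cannot fit in the available scratch space."""


def validate_scratch_request(num_elements: int, itemsize: int, capacity_bytes: int) -> int:
    """Return the requested size in bytes, or raise if it exceeds the capacity.

    Hypothetical helper: capacity_bytes is supplied by the caller because this
    sketch does not assume any particular way of querying the hardware.
    """
    if num_elements < 0 or itemsize <= 0 or capacity_bytes < 0:
        raise ValueError("sizes must be non-negative and itemsize positive")
    requested = num_elements * itemsize
    if requested > capacity_bytes:
        raise ScratchRequestError(
            f"scratch request of {requested} bytes exceeds "
            f"the available {capacity_bytes} bytes"
        )
    return requested


# Illustrative numbers only (not measured on any device).
tile_size = 2
capacity = 48 * 1024      # assumed scratch capacity in bytes
double_size = 8           # bytes per 64-bit float

validate_scratch_request(tile_size, double_size, capacity)  # fits

try:
    validate_scratch_request(tile_size * 100000, double_size, capacity)
except ScratchRequestError as exc:
    print(f"refused: {exc}")
```

The point of the sketch is only that an oversized request surfaces as an ordinary, catchable Python error with a readable message rather than a crash in the interpreter.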

The argument in favor of default-on is similar to that for Cython--you need to explicitly opt out of helpful guardrails like bounds checking to get the full-blown performance (i.e., you develop with guardrails on, then deploy to production/releases with, e.g., decorators that disable the guardrails).
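One possible shape for that opt-out, loosely modeled on how Cython lets you switch off bounds checking per function, is sketched below. The names `no_guardrails`, `guardrails_enabled`, and `launch` are invented for this example and are not pykokkos API; the "kernel" is a plain Python function standing in for a real workunit.

```python
_GUARDRAILS_ATTR = "_guardrails_enabled"


def no_guardrails(func):
    """Hypothetical opt-out decorator: mark a kernel as skipping the checks."""
    setattr(func, _GUARDRAILS_ATTR, False)
    return func


def guardrails_enabled(func) -> bool:
    """Checks are on unless the kernel was explicitly marked otherwise."""
    return getattr(func, _GUARDRAILS_ATTR, True)


def launch(func, scratch_bytes: int, capacity_bytes: int):
    """Hypothetical launcher: validate the scratch request only when guarded."""
    if guardrails_enabled(func) and scratch_bytes > capacity_bytes:
        raise MemoryError(
            f"{func.__name__}: scratch request of {scratch_bytes} bytes "
            f"exceeds the available {capacity_bytes} bytes"
        )
    return func(scratch_bytes)


def dev_kernel(scratch_bytes: int) -> str:
    # Stand-in for a real kernel; checks are on by default.
    return f"ran with {scratch_bytes} bytes of scratch"


@no_guardrails
def release_kernel(scratch_bytes: int) -> str:
    # Stand-in for a kernel whose author has opted out of the checks.
    return f"ran unchecked with {scratch_bytes} bytes of scratch"
```

With this layout, `launch(dev_kernel, ...)` refuses an oversized request, while `launch(release_kernel, ...)` skips the check entirely and leaves responsibility with the caller, which mirrors the develop-with-guardrails, deploy-without workflow.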

--- a/pykokkos/linalg/workunits.py
+++ b/pykokkos/linalg/workunits.py
@@ -46,7 +46,7 @@ def dgemm_impl_tiled_no_view_c(team_member: pk.TeamMember,
     global_tid: int = team_member.league_rank() * team_member.team_size() + team_member.team_rank()
 
     # TODO: I have no idea how to get 2D scratch memory views?
-    scratch_mem_a: pk.ScratchView1D[float] = pk.ScratchView1D(team_member.team_scratch(0), tile_size)
+    scratch_mem_a: pk.ScratchView1D[float] = pk.ScratchView1D(team_member.team_scratch(0), tile_size * 100000)
     scratch_mem_b: pk.ScratchView1D[float] = pk.ScratchView1D(team_member.team_scratch(0), tile_size)
     # in a 4 x 4 matrix with 2 x 2 tiling the leagues
     # and teams have matching row/col assignment approaches

Output from the test run with this change:

tests/test_linalg.py ........Fatal Python error: Fatal Python error: Fatal Python error: Segmentation faultFatal Python error: Fatal Python error: Fatal Python error: Fatal Python error: Fatal Python error: Fatal Python error: Fatal Python error: Fatal Python error: 

Segmentation faultSegmentation faultSegmentation faultThread 0xSegmentation fault00007fd9779c81c0 (most recent call first):
Segmentation fault (core dumped)

I wonder if the CI segfault we see over in the matching PR is related to some kind of prohibition on using L1 cache in the virtual machine?
