diff --git a/book/src/puzzle_23/elementwise.md b/book/src/puzzle_23/elementwise.md
index d9935c65..ed7abdff 100644
--- a/book/src/puzzle_23/elementwise.md
+++ b/book/src/puzzle_23/elementwise.md
@@ -90,7 +90,7 @@ This performs element-wise addition across the entire SIMD vector (if supported)
 ### 5. **SIMD storing**
 
 ```mojo
-output.store[simd_width](idx, 0, result) # Store 4 results at once (GPU-dependent)
+output.store[simd_width](Index(idx), result) # Store 4 results at once (GPU-dependent)
 ```
 
 Writes the entire SIMD vector back to memory in one operation.
@@ -237,7 +237,7 @@ idx = indices[0] # Linear index: 0, 4, 8, 12...
 a_simd = a.aligned_load[simd_width](Index(idx)) # Load: [a[0:4], a[4:8], a[8:12]...] (4 elements per load)
 b_simd = b.aligned_load[simd_width](Index(idx)) # Load: [b[0:4], b[4:8], b[8:12]...] (4 elements per load)
 ret = a_simd + b_simd # SIMD: 4 additions in parallel (GPU-dependent)
-output.store[simd_width](Index(global_start), ret) # Store: 4 results simultaneously (GPU-dependent)
+output.store[simd_width](Index(idx), ret) # Store: 4 results simultaneously (GPU-dependent)
 ```
 
 **Execution Hierarchy Visualization:**
diff --git a/book/src/puzzle_23/gpu-thread-vs-simd.md b/book/src/puzzle_23/gpu-thread-vs-simd.md
index 119980f1..17d19666 100644
--- a/book/src/puzzle_23/gpu-thread-vs-simd.md
+++ b/book/src/puzzle_23/gpu-thread-vs-simd.md
@@ -38,10 +38,10 @@ Each GPU thread can process multiple data elements simultaneously using **SIMD (
 
 ```mojo
 # Within one GPU thread:
-a_simd = a.load[simd_width](idx, 0) # Load 4 floats simultaneously
-b_simd = b.load[simd_width](idx, 0) # Load 4 floats simultaneously
+a_simd = a.load[simd_width](Index(idx)) # Load 4 floats simultaneously
+b_simd = b.load[simd_width](Index(idx)) # Load 4 floats simultaneously
 result = a_simd + b_simd # Add 4 pairs simultaneously
-output.store[simd_width](idx, 0, result) # Store 4 results simultaneously
+output.store[simd_width](Index(idx), result) # Store 4 results simultaneously
 ```
 
 ## Pattern comparison and thread-to-work mapping
diff --git a/book/src/puzzle_23/tile.md b/book/src/puzzle_23/tile.md
index 58cbd981..2a26d174 100644
--- a/book/src/puzzle_23/tile.md
+++ b/book/src/puzzle_23/tile.md
@@ -84,10 +84,10 @@ This `@parameter` loop unrolls at compile-time for optimal performance.
 ### 4. **SIMD operations within tile elements**
 
 ```mojo
-a_vec = a_tile.load[simd_width](i, 0) # Load from position i in tile
-b_vec = b_tile.load[simd_width](i, 0) # Load from position i in tile
+a_vec = a_tile.load[simd_width](Index(i)) # Load from position i in tile
+b_vec = b_tile.load[simd_width](Index(i)) # Load from position i in tile
 result = a_vec + b_vec # SIMD addition (GPU-dependent width)
-out_tile.store[simd_width](i, 0, result) # Store to position i in tile
+out_tile.store[simd_width](Index(i), result) # Store to position i in tile
 ```
 
 ### 5. **Thread configuration difference**
@@ -232,10 +232,10 @@ Tile 31 (thread 31): [992, 993, ..., 1023] ← Elements 992-1023
 ```mojo
 @parameter
 for i in range(tile_size):
-    a_vec = a_tile.load[simd_width](i, 0)
-    b_vec = b_tile.load[simd_width](i, 0)
+    a_vec = a_tile.load[simd_width](Index(i))
+    b_vec = b_tile.load[simd_width](Index(i))
     ret = a_vec + b_vec
-    out_tile.store[simd_width](i, 0, ret)
+    out_tile.store[simd_width](Index(i), ret)
 ```
 
 **Why sequential processing?**
diff --git a/book/src/puzzle_23/vectorize.md b/book/src/puzzle_23/vectorize.md
index 41543057..5b2fd5fd 100644
--- a/book/src/puzzle_23/vectorize.md
+++ b/book/src/puzzle_23/vectorize.md
@@ -68,8 +68,8 @@ This calculates the exact global position for each SIMD vector within the chunk.
 ### 3. **Direct tensor access**
 
 ```mojo
-a_vec = a.load[simd_width](global_start, 0) # Load from global tensor
-output.store[simd_width](global_start, 0, ret) # Store to global tensor
+a_vec = a.aligned_load[simd_width](Index(global_start)) # Load from global tensor
+output.store[simd_width](Index(global_start), ret) # Store to global tensor
 ```
 
 Note: Access the original tensors, not the tile views.
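
For reviewers, a consolidated sketch of the call-shape migration this patch applies across all four chapters. Only the load/store lines are taken verbatim from the hunks above; the `Index` import path and the surrounding names (`a`, `b`, `output`, `idx`, `result`, `simd_width`) are assumptions standing in for each chapter's own definitions, not part of the patch.

```mojo
# Illustrative sketch only, not part of the patch.
# Assumption: `Index` is the stdlib factory for index lists.
from utils import Index

# Before this patch: coordinates passed as separate (row, col) arguments.
#   a_simd = a.load[simd_width](idx, 0)
#   output.store[simd_width](idx, 0, result)

# After this patch: a single Index(...) coordinate per SIMD access.
a_simd = a.load[simd_width](Index(idx))       # load simd_width elements starting at idx
b_simd = b.load[simd_width](Index(idx))       # load the second operand from the same position
result = a_simd + b_simd                      # element-wise SIMD addition
output.store[simd_width](Index(idx), result)  # store simd_width results back at idx
```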