4 changes: 2 additions & 2 deletions book/src/puzzle_23/elementwise.md
@@ -90,7 +90,7 @@ This performs element-wise addition across the entire SIMD vector (if supported)
### 5. **SIMD storing**

```mojo
-output.store[simd_width](idx, 0, result)  # Store 4 results at once (GPU-dependent)
+output.store[simd_width](Index(idx), result)  # Store 4 results at once (GPU-dependent)
```

Writes the entire SIMD vector back to memory in one operation.
@@ -237,7 +237,7 @@ idx = indices[0]  # Linear index: 0, 4, 8, 12...
a_simd = a.aligned_load[simd_width](Index(idx)) # Load: [a[0:4], a[4:8], a[8:12]...] (4 elements per load)
b_simd = b.aligned_load[simd_width](Index(idx)) # Load: [b[0:4], b[4:8], b[8:12]...] (4 elements per load)
ret = a_simd + b_simd # SIMD: 4 additions in parallel (GPU-dependent)
-output.store[simd_width](Index(global_start), ret)  # Store: 4 results simultaneously (GPU-dependent)
+output.store[simd_width](Index(idx), ret)  # Store: 4 results simultaneously (GPU-dependent)
```
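The execution pattern in this hunk can be modeled in plain Python (a sketch only: a sequential loop stands in for parallel GPU threads, list slices stand in for SIMD vectors, and `simd_width = 4` is assumed since the real width is GPU-dependent):

```python
# Model: each loop iteration plays one GPU thread; each thread handles
# simd_width consecutive elements with a single load/add/store.
simd_width = 4
size = 16
a = [float(i) for i in range(size)]
b = [float(10 * i) for i in range(size)]
output = [0.0] * size

for tid in range(size // simd_width):      # on a GPU these run concurrently
    idx = tid * simd_width                 # linear index: 0, 4, 8, 12...
    a_simd = a[idx:idx + simd_width]       # "load" a[0:4], a[4:8], ...
    b_simd = b[idx:idx + simd_width]       # "load" b[0:4], b[4:8], ...
    ret = [x + y for x, y in zip(a_simd, b_simd)]  # 4 additions per step
    output[idx:idx + simd_width] = ret     # "store" 4 results at idx

print(output[:4])  # [0.0, 11.0, 22.0, 33.0]
```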

**Execution Hierarchy Visualization:**
6 changes: 3 additions & 3 deletions book/src/puzzle_23/gpu-thread-vs-simd.md
@@ -38,10 +38,10 @@ Each GPU thread can process multiple data elements simultaneously using **SIMD (

```mojo
# Within one GPU thread:
-a_simd = a.load[simd_width](idx, 0)  # Load 4 floats simultaneously
-b_simd = b.load[simd_width](idx, 0)  # Load 4 floats simultaneously
+a_simd = a.load[simd_width](Index(idx))  # Load 4 floats simultaneously
+b_simd = b.load[simd_width](Index(idx))  # Load 4 floats simultaneously
result = a_simd + b_simd # Add 4 pairs simultaneously
-output.store[simd_width](idx, 0, result)  # Store 4 results simultaneously
+output.store[simd_width](Index(idx), result)  # Store 4 results simultaneously
```
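One thread's share of the work can be sketched in plain Python (list slices stand in for SIMD registers; `simd_width = 4`, `idx = 8`, and the input values are made-up example choices):

```python
# One "thread's" work: a single SIMD-width load/add/store, no loop needed.
simd_width = 4
idx = 8                                   # this thread's linear index
a = [float(i) for i in range(16)]
b = [float(i * i) for i in range(16)]
output = [0.0] * 16

a_simd = a[idx:idx + simd_width]          # load 4 floats at once: a[8:12]
b_simd = b[idx:idx + simd_width]          # load 4 floats at once: b[8:12]
result = [x + y for x, y in zip(a_simd, b_simd)]  # add 4 pairs in one step
output[idx:idx + simd_width] = result     # store 4 results at once

print(result)  # [72.0, 90.0, 110.0, 132.0]
```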

## Pattern comparison and thread-to-work mapping
12 changes: 6 additions & 6 deletions book/src/puzzle_23/tile.md
@@ -84,10 +84,10 @@ This `@parameter` loop unrolls at compile-time for optimal performance.
### 4. **SIMD operations within tile elements**

```mojo
-a_vec = a_tile.load[simd_width](i, 0)  # Load from position i in tile
-b_vec = b_tile.load[simd_width](i, 0)  # Load from position i in tile
+a_vec = a_tile.load[simd_width](Index(i))  # Load from position i in tile
+b_vec = b_tile.load[simd_width](Index(i))  # Load from position i in tile
result = a_vec + b_vec # SIMD addition (GPU-dependent width)
-out_tile.store[simd_width](i, 0, result)  # Store to position i in tile
+out_tile.store[simd_width](Index(i), result)  # Store to position i in tile
```

### 5. **Thread configuration difference**
@@ -232,10 +232,10 @@ Tile 31 (thread 31): [992, 993, ..., 1023] ← Elements 992-1023
```mojo
@parameter
for i in range(tile_size):
-a_vec = a_tile.load[simd_width](i, 0)
-b_vec = b_tile.load[simd_width](i, 0)
+a_vec = a_tile.load[simd_width](Index(i))
+b_vec = b_tile.load[simd_width](Index(i))
ret = a_vec + b_vec
-out_tile.store[simd_width](i, 0, ret)
+out_tile.store[simd_width](Index(i), ret)
```
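The tiling idea can be sketched in plain Python (the tile views and `@parameter` unrolling are Mojo-specific; here plain slices and loops stand in, and stepping over the tile in `simd_width` chunks is an assumption about how the unrolled loop covers it):

```python
# Sketch: each "thread" owns one contiguous tile and walks it sequentially,
# processing simd_width elements per step (lists stand in for SIMD vectors).
size, tile_size, simd_width = 1024, 32, 4
a = list(range(size))
b = list(range(size))
out = [0] * size

for thread_id in range(size // tile_size):     # 32 threads, one tile each
    start = thread_id * tile_size              # tile 31 -> elements 992..1023
    a_tile = a[start:start + tile_size]        # "tile view" of the input
    b_tile = b[start:start + tile_size]
    out_tile = [0] * tile_size
    for i in range(0, tile_size, simd_width):  # sequential within the tile
        a_vec = a_tile[i:i + simd_width]
        b_vec = b_tile[i:i + simd_width]
        out_tile[i:i + simd_width] = [x + y for x, y in zip(a_vec, b_vec)]
    out[start:start + tile_size] = out_tile    # write the whole tile back

print(out[992], out[1023])  # 1984 2046
```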

**Why sequential processing?**
4 changes: 2 additions & 2 deletions book/src/puzzle_23/vectorize.md
@@ -68,8 +68,8 @@ This calculates the exact global position for each SIMD vector within the chunk.
### 3. **Direct tensor access**

```mojo
-a_vec = a.load[simd_width](global_start, 0)  # Load from global tensor
-output.store[simd_width](global_start, 0, ret)  # Store to global tensor
+a_vec = a.aligned_load[simd_width](Index(global_start))  # Load from global tensor
+output.store[simd_width](Index(global_start), ret)  # Store to global tensor
```

Note: Access the original tensors, not the tile views.
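The global-position arithmetic this hunk refers to can be sketched numerically (Python for illustration; `chunk_size = 16` and `thread_id = 2` are made-up example values, not the book's):

```python
# Sketch: where each SIMD vector inside one thread's chunk lands globally.
simd_width = 4
chunk_size = 16                       # assumed elements per thread's chunk
thread_id = 2                         # an example thread

chunk_start = thread_id * chunk_size  # this thread's chunk begins at 32
starts = [chunk_start + v * simd_width          # one entry per SIMD vector
          for v in range(chunk_size // simd_width)]
# Loads and stores then index the ORIGINAL tensors at these positions
# (not the tile views), as the note above stresses.
print(starts)  # [32, 36, 40, 44]
```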