Commit 43fce11

Update for the 01162026 nightly. (#203)
1 parent 5487783 · commit 43fce11

32 files changed: +98 -98 lines changed


book/src/puzzle_17/puzzle_17.md

Lines changed: 1 addition & 1 deletion
@@ -129,7 +129,7 @@ Let's break down how this works in the larger context:
 1. **Python side (<a href="{{#include ../_includes/repo_url.md}}/blob/main/problems/p17/p17.py" class="filename">problems/p17/p17.py</a>)**:
    - Creates NumPy arrays for input and kernel
    - Calls `conv_1d()` function which wraps our operation in MAX Graph
-   - Converts NumPy arrays to [MAX driver](https://docs.modular.com/max/api/python/driver) Tensors with `Tensor.from_numpy(input).to(device)`
+   - Converts NumPy arrays to [MAX driver](https://docs.modular.com/max/api/python/driver) Buffers with `Buffer.from_numpy(input).to(device)`
    - Loads the custom operation package with `custom_extensions=[mojo_kernels]`

 2. **Graph building**:

book/src/puzzle_23/elementwise.md

Lines changed: 6 additions & 6 deletions
@@ -73,8 +73,8 @@ This `idx` represents the **starting position** for a SIMD vector, not a single
 ### 3. **SIMD loading pattern**

 ```mojo
-a_simd = a.aligned_load[simd_width](idx, 0) # Load 4 consecutive floats (GPU-dependent)
-b_simd = b.aligned_load[simd_width](idx, 0) # Load 4 consecutive floats (GPU-dependent)
+a_simd = a.aligned_load[simd_width](Index(idx)) # Load 4 consecutive floats (GPU-dependent)
+b_simd = b.aligned_load[simd_width](Index(idx)) # Load 4 consecutive floats (GPU-dependent)
 ```

 The second parameter `0` is the dimension offset (always 0 for 1D vectors). This loads a **vectorized chunk** of data in a single operation. The exact number of elements loaded depends on your GPU's SIMD capabilities.
@@ -234,10 +234,10 @@ fn add[simd_width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:

 ```mojo
 idx = indices[0] # Linear index: 0, 4, 8, 12... (GPU-dependent spacing)
-a_simd = a.aligned_load[simd_width](idx, 0) # Load: [a[0:4], a[4:8], a[8:12]...] (4 elements per load)
-b_simd = b.aligned_load[simd_width](idx, 0) # Load: [b[0:4], b[4:8], b[8:12]...] (4 elements per load)
+a_simd = a.aligned_load[simd_width](Index(idx)) # Load: [a[0:4], a[4:8], a[8:12]...] (4 elements per load)
+b_simd = b.aligned_load[simd_width](Index(idx)) # Load: [b[0:4], b[4:8], b[8:12]...] (4 elements per load)
 ret = a_simd + b_simd # SIMD: 4 additions in parallel (GPU-dependent)
-output.aligned_store[simd_width](idx, 0, ret) # Store: 4 results simultaneously (GPU-dependent)
+output.store[simd_width](Index(global_start), ret) # Store: 4 results simultaneously (GPU-dependent)
 ```

 **Execution Hierarchy Visualization:**
@@ -266,7 +266,7 @@ GPU Architecture:
 ### 4. **Memory access pattern analysis**

 ```mojo
-a.aligned_load[simd_width](idx, 0) // Coalesced memory access
+a.aligned_load[simd_width](Index(idx)) // Coalesced memory access
 ```

 **Memory Coalescing Benefits:**
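Stepping back from the individual hunks: the changed calls assemble into the nested `add` function named in the second hunk's header. The following is a hedged sketch rather than verbatim repo code: `a`, `b`, and `output` are the tensors captured from the enclosing scope in the puzzle, the decorators and the enclosing function are omitted, and the `utils.index` import path is an assumption.

```mojo
from utils.index import Index, IndexList

# Sketch of the capturing closure; `a`, `b`, and `output` come from the enclosing scope.
fn add[simd_width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:
    var idx = indices[0]                                 # first element of this thread's SIMD chunk
    var a_simd = a.aligned_load[simd_width](Index(idx))  # vectorized load from `a`
    var b_simd = b.aligned_load[simd_width](Index(idx))  # vectorized load from `b`
    var ret = a_simd + b_simd                            # one SIMD add covers the whole chunk
    output.store[simd_width](Index(idx), ret)            # vectorized store (the second hunk
                                                         # names this index `global_start`)
```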

book/src/puzzle_24/puzzle_24.md

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ GPU Block (e.g., 256 threads)

 ### **Warp operations available in Mojo**

-Learn the core warp primitives from `gpu.warp`:
+Learn the core warp primitives from `gpu.primitives.warp`:

 1. **`sum(value)`**: Sum all values across warp lanes
 2. **`shuffle_idx(value, lane)`**: Get value from specific lane

book/src/puzzle_24/warp_extra.md

Lines changed: 2 additions & 2 deletions
@@ -45,12 +45,12 @@
 ### ✅ Perfect for warps
 ```mojo
 # Reduction operations
-from gpu.warp import sum, max
+from gpu.primitives.warp import sum, max
 var total = sum(partial_values)
 var maximum = max(partial_values)

 # Communication patterns
-from gpu.warp import shuffle_idx, prefix_sum
+from gpu.primitives.warp import shuffle_idx, prefix_sum
 var broadcast = shuffle_idx(my_value, 0)
 var running_sum = prefix_sum(my_value)
 ```
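To see the same primitives in a kernel-shaped context, here is a hedged sketch that is not from the repo: it assumes the `gpu.primitives.warp` path introduced in this diff, a GPU kernel calling context, and a hypothetical `out_ptr` destination.

```mojo
from memory import UnsafePointer
from gpu import lane_id
from gpu.primitives.warp import sum

# One warp reduces its per-lane partials; lane 0 alone publishes the total.
fn store_warp_total(partial: Float32, out_ptr: UnsafePointer[Float32]):
    var total = sum(partial)   # every lane contributes; no explicit barrier needed
    if lane_id() == 0:         # a single lane writes the reduced value
        out_ptr[0] = total
```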

book/src/puzzle_24/warp_simt.md

Lines changed: 5 additions & 5 deletions
@@ -9,7 +9,7 @@ A **warp** is a group of 32 (or 64) GPU threads that execute **the same instruct
 **Simple example:**

 ```mojo
-from gpu.warp import sum
+from gpu.primitives.warp import sum
 # All 32 threads in the warp execute this simultaneously:
 var my_value = input[my_thread_id] # Each gets different data
 var warp_total = sum(my_value) # All contribute to one sum
@@ -49,7 +49,7 @@ var result = a + b # Add 8 pairs simultaneously

 ```mojo
 # Thread-based code that becomes vector operations
-from gpu.warp import sum
+from gpu.primitives.warp import sum

 var my_data = input[thread_id] # Each thread gets its element
 var partial = my_data * coefficient # All threads compute simultaneously
@@ -149,7 +149,7 @@ Each thread within a warp has a **lane ID** from 0 to `WARP_SIZE-1`:

 ```mojo
 from gpu import lane_id
-from gpu.warp import WARP_SIZE
+from gpu.primitives.warp import WARP_SIZE

 # Within a kernel function:
 my_lane = lane_id() # Returns 0-31 (NVIDIA/RDNA) or 0-63 (CDNA)
@@ -170,7 +170,7 @@ barrier() # Explicit synchronization required
 var total = shared[0] + shared[1] + ... + shared[WARP_SIZE] # Sum reduction

 # 2. Warp approach:
-from gpu.warp import sum
+from gpu.primitives.warp import sum

 var total = sum(partial_result) # Implicit synchronization!
 ```
@@ -282,7 +282,7 @@ else:
 ### NVIDIA vs AMD warp sizes

 ```mojo
-from gpu.warp import WARP_SIZE
+from gpu.primitives.warp import WARP_SIZE

 # NVIDIA GPUs: WARP_SIZE = 32
 # AMD RDNA GPUs: WARP_SIZE = 32 (wavefront32 mode)

book/src/puzzle_24/warp_sum.md

Lines changed: 3 additions & 3 deletions
@@ -290,18 +290,18 @@ else:
 total = warp_sum(partial_product)

 if lane_id() == 0:
-    output.store[1](idx // WARP_SIZE, 0, total)
+    output.store[1](Index(idx // WARP_SIZE), total)
 ```

-**Storage pattern:** `output.store[1](idx // WARP_SIZE, 0, total)` stores 1 element at position `(idx // WARP_SIZE, 0)` in the output tensor.
+**Storage pattern:** `output.store[1](Index(idx // WARP_SIZE), 0, total)` stores 1 element at position `(idx // WARP_SIZE, 0)` in the output tensor.

 **Same warp logic:** `warp_sum()` and lane 0 writing work identically in functional approach.

 ### 4. **Available functions from imports**

 ```mojo
 from gpu import lane_id
-from gpu.warp import sum as warp_sum, WARP_SIZE
+from gpu.primitives.warp import sum as warp_sum, WARP_SIZE

 # Inside your function:
 my_lane = lane_id() # 0 to WARP_SIZE-1

book/src/puzzle_25/puzzle_25.md

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@ Lane 0 ──broadcast──> All lanes (0, 1, 2, ..., 31)

 ### **Warp communication operations in Mojo**

-Learn the core communication primitives from `gpu.warp`:
+Learn the core communication primitives from `gpu.primitives.warp`:

 1. **[`shuffle_down(value, offset)`](https://docs.modular.com/mojo/stdlib/gpu/warp/shuffle_down)**: Get value from lane at higher index (neighbor access)
 2. **[`broadcast(value)`](https://docs.modular.com/mojo/stdlib/gpu/warp/broadcast)**: Share lane 0's value with all other lanes (one-to-many)
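The shown hunk only lists the primitives, so here is a hedged sketch that is not from the repo: it assumes the `gpu.primitives.warp` path from this diff, a kernel context where every lane holds `my_value`, and a purely illustrative combination of the two results.

```mojo
from gpu.primitives.warp import broadcast, shuffle_down

# Neighbor access plus one-to-all sharing, per the two operations listed above.
fn neighbor_and_base(my_value: Float32) -> Float32:
    var next_value = shuffle_down(my_value, 1)  # lane i reads lane i+1's value
    var base = broadcast(my_value)              # every lane receives lane 0's value
    return (next_value - my_value) + base       # illustrative combination only
```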

book/src/puzzle_26/puzzle_26.md

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ Output: [1, 3, 6, 10, 15, 21, 28, 36, ...] (inclusive scan)

 ### **Advanced warp operations in Mojo**

-Learn the sophisticated communication primitives from `gpu.warp`:
+Learn the sophisticated communication primitives from `gpu.primitives.warp`:

 1. **[`shuffle_xor(value, mask)`](https://docs.modular.com/mojo/stdlib/gpu/warp/shuffle_xor)**: XOR-based butterfly communication for tree algorithms
 2. **[`prefix_sum(value)`](https://docs.modular.com/mojo/stdlib/gpu/warp/prefix_sum)**: Hardware-accelerated parallel scan operations
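As a hedged illustration of the butterfly pattern named above (not from the repo; the offsets assume a 32-lane warp, so a 64-lane CDNA warp would need one extra step with offset 32):

```mojo
from gpu.primitives.warp import shuffle_xor

# XOR-based butterfly reduction: after the last step every lane holds the warp-wide sum.
fn butterfly_sum(value: Float32) -> Float32:
    var total = value
    total += shuffle_xor(total, 16)  # pair lanes whose IDs differ in bit 4
    total += shuffle_xor(total, 8)   # bit 3
    total += shuffle_xor(total, 4)   # bit 2
    total += shuffle_xor(total, 2)   # bit 1
    total += shuffle_xor(total, 1)   # bit 0
    return total
```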

book/src/puzzle_27/block_broadcast.md

Lines changed: 3 additions & 3 deletions
@@ -1,8 +1,8 @@
 # block.broadcast() Vector Normalization

-Implement vector mean normalization by combining [block.sum](https://docs.modular.com/mojo/stdlib/gpu/block/sum) and [block.broadcast](https://docs.modular.com/mojo/stdlib/gpu/block/broadcast) operations to demonstrate the complete block-level communication workflow. Each thread will contribute to computing the mean, then receive the broadcast mean to normalize its element, showcasing how block operations work together to solve real parallel algorithms.
+Implement vector mean normalization by combining [block.sum](https://docs.modular.com/mojo/stdlib/gpu/primitives/block/sum) and [block.broadcast](https://docs.modular.com/mojo/stdlib/gpu/primitives/block/broadcast) operations to demonstrate the complete block-level communication workflow. Each thread will contribute to computing the mean, then receive the broadcast mean to normalize its element, showcasing how block operations work together to solve real parallel algorithms.

-**Key insight:** _The [block.broadcast()](https://docs.modular.com/mojo/stdlib/gpu/block/broadcast) operation enables one-to-all communication, completing the fundamental block communication patterns: reduction (all→one), scan (all→each), and broadcast (one→all)._
+**Key insight:** _The [block.broadcast()](https://docs.modular.com/mojo/stdlib/gpu/primitives/block/broadcast) operation enables one-to-all communication, completing the fundamental block communication patterns: reduction (all→one), scan (all→each), and broadcast (one→all)._

 ## Key concepts
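Since this hunk only updates links, here is a hedged CPU-side illustration of the workflow the paragraph above describes, using made-up data and plain loops rather than the `gpu.primitives.block` API:

```mojo
from collections import List

def main():
    var data = List[Float32](2.0, 4.0, 6.0, 8.0)
    var total: Float32 = 0.0
    for i in range(len(data)):
        total += data[i]                   # "block.sum" step: every element contributes
    var mean = total / Float32(len(data))  # one place computes the mean
    for i in range(len(data)):
        data[i] = data[i] - mean           # "block.broadcast" step: everyone uses the mean
        print(data[i])                     # prints -3.0, -1.0, 1.0, 3.0
```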

@@ -126,7 +126,7 @@ if local_i == 0:

 **Why thread 0?** Consistent with `block.sum()` pattern where thread 0 receives the result.

-### 4. **[block.broadcast()](https://docs.modular.com/mojo/stdlib/gpu/block/broadcast) API concepts**
+### 4. **[block.broadcast()](https://docs.modular.com/mojo/stdlib/gpu/primitives/block/broadcast) API concepts**

 Study the function signature - it needs:
book/src/puzzle_27/block_prefix_sum.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,8 @@
11
# block.prefix_sum() Parallel Histogram Binning
22

3-
This puzzle implements parallel histogram binning using block-level [block.prefix_sum](https://docs.modular.com/mojo/stdlib/gpu/block/prefix_sum) operations for advanced parallel filtering and extraction. Each thread determines its element's target bin, then applies `block.prefix_sum()` to compute write positions for extracting elements from a specific bin, showing how prefix sum enables sophisticated parallel partitioning beyond simple reductions.
3+
This puzzle implements parallel histogram binning using block-level [block.prefix_sum](https://docs.modular.com/mojo/stdlib/gpu/primitives/block/prefix_sum) operations for advanced parallel filtering and extraction. Each thread determines its element's target bin, then applies `block.prefix_sum()` to compute write positions for extracting elements from a specific bin, showing how prefix sum enables sophisticated parallel partitioning beyond simple reductions.
44

5-
**Key insight:** _The [block.prefix_sum()](https://docs.modular.com/mojo/stdlib/gpu/block/prefix_sum) operation provides parallel filtering and extraction by computing cumulative write positions for matching elements across all threads in a block._
5+
**Key insight:** _The [block.prefix_sum()](https://docs.modular.com/mojo/stdlib/gpu/primitives/block/prefix_sum) operation provides parallel filtering and extraction by computing cumulative write positions for matching elements across all threads in a block._
66

77
## Key concepts
88
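The hunk above only updates links, so here is a hedged CPU-side illustration (made-up flags, a plain loop rather than the `gpu.primitives.block` API) of how a prefix sum over match flags produces compacted write positions:

```mojo
from collections import List

def main():
    var flags = List[Int](1, 0, 1, 1, 0)   # 1 = element falls in the target bin
    var running = 0
    for i in range(len(flags)):
        if flags[i] == 1:
            print("element", i, "writes to position", running)
        running += flags[i]                # the running (prefix) sum advances only on matches
    # prints: element 0 -> 0, element 2 -> 1, element 3 -> 2
```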
