Commit 43fce11

Update for the 01162026 nightly. (#203)
1 parent 5487783 · commit 43fce11

32 files changed: +98 -98 lines changed


book/src/puzzle_17/puzzle_17.md

Lines changed: 1 addition & 1 deletion
@@ -129,7 +129,7 @@ Let's break down how this works in the larger context:
 1. **Python side (<a href="{{#include ../_includes/repo_url.md}}/blob/main/problems/p17/p17.py" class="filename">problems/p17/p17.py</a>)**:
    - Creates NumPy arrays for input and kernel
    - Calls `conv_1d()` function which wraps our operation in MAX Graph
-   - Converts NumPy arrays to [MAX driver](https://docs.modular.com/max/api/python/driver) Tensors with `Tensor.from_numpy(input).to(device)`
+   - Converts NumPy arrays to [MAX driver](https://docs.modular.com/max/api/python/driver) Buffers with `Buffer.from_numpy(input).to(device)`
    - Loads the custom operation package with `custom_extensions=[mojo_kernels]`

 2. **Graph building**:

book/src/puzzle_23/elementwise.md

Lines changed: 6 additions & 6 deletions
@@ -73,8 +73,8 @@ This `idx` represents the **starting position** for a SIMD vector, not a single
 ### 3. **SIMD loading pattern**

 ```mojo
-a_simd = a.aligned_load[simd_width](idx, 0) # Load 4 consecutive floats (GPU-dependent)
-b_simd = b.aligned_load[simd_width](idx, 0) # Load 4 consecutive floats (GPU-dependent)
+a_simd = a.aligned_load[simd_width](Index(idx)) # Load 4 consecutive floats (GPU-dependent)
+b_simd = b.aligned_load[simd_width](Index(idx)) # Load 4 consecutive floats (GPU-dependent)
 ```

 The second parameter `0` is the dimension offset (always 0 for 1D vectors). This loads a **vectorized chunk** of data in a single operation. The exact number of elements loaded depends on your GPU's SIMD capabilities.
@@ -234,10 +234,10 @@ fn add[simd_width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:

 ```mojo
 idx = indices[0] # Linear index: 0, 4, 8, 12... (GPU-dependent spacing)
-a_simd = a.aligned_load[simd_width](idx, 0) # Load: [a[0:4], a[4:8], a[8:12]...] (4 elements per load)
-b_simd = b.aligned_load[simd_width](idx, 0) # Load: [b[0:4], b[4:8], b[8:12]...] (4 elements per load)
+a_simd = a.aligned_load[simd_width](Index(idx)) # Load: [a[0:4], a[4:8], a[8:12]...] (4 elements per load)
+b_simd = b.aligned_load[simd_width](Index(idx)) # Load: [b[0:4], b[4:8], b[8:12]...] (4 elements per load)
 ret = a_simd + b_simd # SIMD: 4 additions in parallel (GPU-dependent)
-output.aligned_store[simd_width](idx, 0, ret) # Store: 4 results simultaneously (GPU-dependent)
+output.store[simd_width](Index(global_start), ret) # Store: 4 results simultaneously (GPU-dependent)
 ```

 **Execution Hierarchy Visualization:**
@@ -266,7 +266,7 @@ GPU Architecture:
 ### 4. **Memory access pattern analysis**

 ```mojo
-a.aligned_load[simd_width](idx, 0) // Coalesced memory access
+a.aligned_load[simd_width](Index(idx)) // Coalesced memory access
 ```

 **Memory Coalescing Benefits:**
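Stepping back from the individual hunks: the changed calls assemble into the nested `add` function named in the second hunk's header. The following is a hedged sketch rather than verbatim repo code: `a`, `b`, and `output` are the tensors captured from the enclosing scope in the puzzle, the decorators and the enclosing function are omitted, and the `utils.index` import path is an assumption.

```mojo
from utils.index import Index, IndexList

# Sketch of the capturing closure; `a`, `b`, and `output` come from the enclosing scope.
fn add[simd_width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:
    var idx = indices[0]                                 # first element of this thread's SIMD chunk
    var a_simd = a.aligned_load[simd_width](Index(idx))  # vectorized load from `a`
    var b_simd = b.aligned_load[simd_width](Index(idx))  # vectorized load from `b`
    var ret = a_simd + b_simd                            # one SIMD add covers the whole chunk
    output.store[simd_width](Index(idx), ret)            # vectorized store (the second hunk
                                                         # names this index `global_start`)
```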

book/src/puzzle_24/puzzle_24.md

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ GPU Block (e.g., 256 threads)

 ### **Warp operations available in Mojo**

-Learn the core warp primitives from `gpu.warp`:
+Learn the core warp primitives from `gpu.primitives.warp`:

 1. **`sum(value)`**: Sum all values across warp lanes
 2. **`shuffle_idx(value, lane)`**: Get value from specific lane

book/src/puzzle_24/warp_extra.md

Lines changed: 2 additions & 2 deletions
@@ -45,12 +45,12 @@
 ### ✅ Perfect for warps
 ```mojo
 # Reduction operations
-from gpu.warp import sum, max
+from gpu.primitives.warp import sum, max
 var total = sum(partial_values)
 var maximum = max(partial_values)

 # Communication patterns
-from gpu.warp import shuffle_idx, prefix_sum
+from gpu.primitives.warp import shuffle_idx, prefix_sum
 var broadcast = shuffle_idx(my_value, 0)
 var running_sum = prefix_sum(my_value)
 ```
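To see the same primitives in a kernel-shaped context, here is a hedged sketch that is not from the repo: it assumes the `gpu.primitives.warp` path introduced in this diff, a GPU kernel calling context, and a hypothetical `out_ptr` destination.

```mojo
from memory import UnsafePointer
from gpu import lane_id
from gpu.primitives.warp import sum

# One warp reduces its per-lane partials; lane 0 alone publishes the total.
fn store_warp_total(partial: Float32, out_ptr: UnsafePointer[Float32]):
    var total = sum(partial)   # every lane contributes; no explicit barrier needed
    if lane_id() == 0:         # a single lane writes the reduced value
        out_ptr[0] = total
```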

book/src/puzzle_24/warp_simt.md

Lines changed: 5 additions & 5 deletions
@@ -9,7 +9,7 @@ A **warp** is a group of 32 (or 64) GPU threads that execute **the same instruct
 **Simple example:**

 ```mojo
-from gpu.warp import sum
+from gpu.primitives.warp import sum
 # All 32 threads in the warp execute this simultaneously:
 var my_value = input[my_thread_id] # Each gets different data
 var warp_total = sum(my_value) # All contribute to one sum
@@ -49,7 +49,7 @@ var result = a + b # Add 8 pairs simultaneously

 ```mojo
 # Thread-based code that becomes vector operations
-from gpu.warp import sum
+from gpu.primitives.warp import sum

 var my_data = input[thread_id] # Each thread gets its element
 var partial = my_data * coefficient # All threads compute simultaneously
@@ -149,7 +149,7 @@ Each thread within a warp has a **lane ID** from 0 to `WARP_SIZE-1`:

 ```mojo
 from gpu import lane_id
-from gpu.warp import WARP_SIZE
+from gpu.primitives.warp import WARP_SIZE

 # Within a kernel function:
 my_lane = lane_id() # Returns 0-31 (NVIDIA/RDNA) or 0-63 (CDNA)
@@ -170,7 +170,7 @@ barrier() # Explicit synchronization required
 var total = shared[0] + shared[1] + ... + shared[WARP_SIZE] # Sum reduction

 # 2. Warp approach:
-from gpu.warp import sum
+from gpu.primitives.warp import sum

 var total = sum(partial_result) # Implicit synchronization!
 ```
@@ -282,7 +282,7 @@ else:
 ### NVIDIA vs AMD warp sizes

 ```mojo
-from gpu.warp import WARP_SIZE
+from gpu.primitives.warp import WARP_SIZE

 # NVIDIA GPUs: WARP_SIZE = 32
 # AMD RDNA GPUs: WARP_SIZE = 32 (wavefront32 mode)

book/src/puzzle_24/warp_sum.md

Lines changed: 3 additions & 3 deletions
@@ -290,18 +290,18 @@ else:
 total = warp_sum(partial_product)

 if lane_id() == 0:
-    output.store[1](idx // WARP_SIZE, 0, total)
+    output.store[1](Index(idx // WARP_SIZE), total)
 ```

-**Storage pattern:** `output.store[1](idx // WARP_SIZE, 0, total)` stores 1 element at position `(idx // WARP_SIZE, 0)` in the output tensor.
+**Storage pattern:** `output.store[1](Index(idx // WARP_SIZE), 0, total)` stores 1 element at position `(idx // WARP_SIZE, 0)` in the output tensor.

 **Same warp logic:** `warp_sum()` and lane 0 writing work identically in functional approach.

 ### 4. **Available functions from imports**

 ```mojo
 from gpu import lane_id
-from gpu.warp import sum as warp_sum, WARP_SIZE
+from gpu.primitives.warp import sum as warp_sum, WARP_SIZE

 # Inside your function:
 my_lane = lane_id() # 0 to WARP_SIZE-1

book/src/puzzle_25/puzzle_25.md

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@ Lane 0 ──broadcast──> All lanes (0, 1, 2, ..., 31)

 ### **Warp communication operations in Mojo**

-Learn the core communication primitives from `gpu.warp`:
+Learn the core communication primitives from `gpu.primitives.warp`:

 1. **[`shuffle_down(value, offset)`](https://docs.modular.com/mojo/stdlib/gpu/warp/shuffle_down)**: Get value from lane at higher index (neighbor access)
 2. **[`broadcast(value)`](https://docs.modular.com/mojo/stdlib/gpu/warp/broadcast)**: Share lane 0's value with all other lanes (one-to-many)
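The shown hunk only lists the primitives, so here is a hedged sketch that is not from the repo: it assumes the `gpu.primitives.warp` path from this diff, a kernel context where every lane holds `my_value`, and a purely illustrative combination of the two results.

```mojo
from gpu.primitives.warp import broadcast, shuffle_down

# Neighbor access plus one-to-all sharing, per the two operations listed above.
fn neighbor_and_base(my_value: Float32) -> Float32:
    var next_value = shuffle_down(my_value, 1)  # lane i reads lane i+1's value
    var base = broadcast(my_value)              # every lane receives lane 0's value
    return (next_value - my_value) + base       # illustrative combination only
```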

book/src/puzzle_26/puzzle_26.md

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ Output: [1, 3, 6, 10, 15, 21, 28, 36, ...] (inclusive scan)

 ### **Advanced warp operations in Mojo**

-Learn the sophisticated communication primitives from `gpu.warp`:
+Learn the sophisticated communication primitives from `gpu.primitives.warp`:

 1. **[`shuffle_xor(value, mask)`](https://docs.modular.com/mojo/stdlib/gpu/warp/shuffle_xor)**: XOR-based butterfly communication for tree algorithms
 2. **[`prefix_sum(value)`](https://docs.modular.com/mojo/stdlib/gpu/warp/prefix_sum)**: Hardware-accelerated parallel scan operations
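As a hedged illustration of the butterfly pattern named above (not from the repo; the offsets assume a 32-lane warp, so a 64-lane CDNA warp would need one extra step with offset 32):

```mojo
from gpu.primitives.warp import shuffle_xor

# XOR-based butterfly reduction: after the last step every lane holds the warp-wide sum.
fn butterfly_sum(value: Float32) -> Float32:
    var total = value
    total += shuffle_xor(total, 16)  # pair lanes whose IDs differ in bit 4
    total += shuffle_xor(total, 8)   # bit 3
    total += shuffle_xor(total, 4)   # bit 2
    total += shuffle_xor(total, 2)   # bit 1
    total += shuffle_xor(total, 1)   # bit 0
    return total
```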

book/src/puzzle_27/block_broadcast.md

Lines changed: 3 additions & 3 deletions
@@ -1,8 +1,8 @@
 # block.broadcast() Vector Normalization

-Implement vector mean normalization by combining [block.sum](https://docs.modular.com/mojo/stdlib/gpu/block/sum) and [block.broadcast](https://docs.modular.com/mojo/stdlib/gpu/block/broadcast) operations to demonstrate the complete block-level communication workflow. Each thread will contribute to computing the mean, then receive the broadcast mean to normalize its element, showcasing how block operations work together to solve real parallel algorithms.
+Implement vector mean normalization by combining [block.sum](https://docs.modular.com/mojo/stdlib/gpu/primitives/block/sum) and [block.broadcast](https://docs.modular.com/mojo/stdlib/gpu/primitives/block/broadcast) operations to demonstrate the complete block-level communication workflow. Each thread will contribute to computing the mean, then receive the broadcast mean to normalize its element, showcasing how block operations work together to solve real parallel algorithms.

-**Key insight:** _The [block.broadcast()](https://docs.modular.com/mojo/stdlib/gpu/block/broadcast) operation enables one-to-all communication, completing the fundamental block communication patterns: reduction (all→one), scan (all→each), and broadcast (one→all)._
+**Key insight:** _The [block.broadcast()](https://docs.modular.com/mojo/stdlib/gpu/primitives/block/broadcast) operation enables one-to-all communication, completing the fundamental block communication patterns: reduction (all→one), scan (all→each), and broadcast (one→all)._

 ## Key concepts
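Since this hunk only updates links, here is a hedged CPU-side illustration of the workflow the paragraph above describes, using made-up data and plain loops rather than the `gpu.primitives.block` API:

```mojo
from collections import List

def main():
    var data = List[Float32](2.0, 4.0, 6.0, 8.0)
    var total: Float32 = 0.0
    for i in range(len(data)):
        total += data[i]                   # "block.sum" step: every element contributes
    var mean = total / Float32(len(data))  # one place computes the mean
    for i in range(len(data)):
        data[i] = data[i] - mean           # "block.broadcast" step: everyone uses the mean
        print(data[i])                     # prints -3.0, -1.0, 1.0, 3.0
```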

@@ -126,7 +126,7 @@ if local_i == 0:

 **Why thread 0?** Consistent with `block.sum()` pattern where thread 0 receives the result.

-### 4. **[block.broadcast()](https://docs.modular.com/mojo/stdlib/gpu/block/broadcast) API concepts**
+### 4. **[block.broadcast()](https://docs.modular.com/mojo/stdlib/gpu/primitives/block/broadcast) API concepts**

 Study the function signature - it needs:
book/src/puzzle_27/block_prefix_sum.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,8 @@
11
# block.prefix_sum() Parallel Histogram Binning
22

3-
This puzzle implements parallel histogram binning using block-level [block.prefix_sum](https://docs.modular.com/mojo/stdlib/gpu/block/prefix_sum) operations for advanced parallel filtering and extraction. Each thread determines its element's target bin, then applies `block.prefix_sum()` to compute write positions for extracting elements from a specific bin, showing how prefix sum enables sophisticated parallel partitioning beyond simple reductions.
3+
This puzzle implements parallel histogram binning using block-level [block.prefix_sum](https://docs.modular.com/mojo/stdlib/gpu/primitives/block/prefix_sum) operations for advanced parallel filtering and extraction. Each thread determines its element's target bin, then applies `block.prefix_sum()` to compute write positions for extracting elements from a specific bin, showing how prefix sum enables sophisticated parallel partitioning beyond simple reductions.
44

5-
**Key insight:** _The [block.prefix_sum()](https://docs.modular.com/mojo/stdlib/gpu/block/prefix_sum) operation provides parallel filtering and extraction by computing cumulative write positions for matching elements across all threads in a block._
5+
**Key insight:** _The [block.prefix_sum()](https://docs.modular.com/mojo/stdlib/gpu/primitives/block/prefix_sum) operation provides parallel filtering and extraction by computing cumulative write positions for matching elements across all threads in a block._
66

77
## Key concepts
88
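The hunk above only updates links, so here is a hedged CPU-side illustration (made-up flags, a plain loop rather than the `gpu.primitives.block` API) of how a prefix sum over match flags produces compacted write positions:

```mojo
from collections import List

def main():
    var flags = List[Int](1, 0, 1, 1, 0)   # 1 = element falls in the target bin
    var running = 0
    for i in range(len(flags)):
        if flags[i] == 1:
            print("element", i, "writes to position", running)
        running += flags[i]                # the running (prefix) sum advances only on matches
    # prints: element 0 -> 0, element 2 -> 1, element 3 -> 2
```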
