Tutorials/Accelerated Python: Fixes and improvements to memory spaces, asynchrony, and kernel authoring notebooks #148
Merged
brycelelbach merged 48 commits into main on Mar 9, 2026
❌ Commit Signature Check Failed
Found 1 unsigned commit(s). All commits must be signed.
/ok to test e9e650d
❌ Link Check Failed
Broken links were detected in this PR. Please check the workflow run logs for details on which links are broken.
To test links locally: ./brev/test-links.bash .
1a8ecc0 to bc7da88
/ok to test eab9896
eab9896 to 7f7e7a4
… plot formatting, add title and dataset size display.
…ctness check to copy kernel launch function.
… dataset size in megabytes instead of bytes. Made-with: Cursor
…cell before profiling copy_blocked kernel. Add a verification cell that runs the script immediately after the %%writefile cell, matching the pattern used in the book histogram notebook. Made-with: Cursor
…age of cupyx.profiler.profile and NVTX.
…ion header, and text updates to solution notebooks. These changes were made to the exercise notebooks in 2f5c4fc but were not applied to the corresponding solution notebooks. Made-with: Cursor
…y kernel scripts to print problem size and dtype. Made-with: Cursor
…power iteration. The savetxt checkpoint I/O was removed from the 05 memory spaces notebooks in 2f5c4fc. This I/O is needed to set up the narrative for Notebook 06 (Asynchrony), whose baseline is the synchronous device-to-host copy + file write pattern introduced here. Made-with: Cursor
…ance check before cp.asarray(). The isinstance(A, np.ndarray) guard added in 2f5c4fc is unnecessary because cp.asarray() already handles both cases: it copies a host array to the GPU, and is a no-op when the array is already on the GPU. Teaching users to call cp.asarray() unconditionally is the intended lesson. Made-with: Cursor
…ercise and generate_device_exercise back to estimate_device and generate_device. The _exercise suffix was added in 2f5c4fc but breaks the naming symmetry with estimate_host and generate_host. The host/device naming convention is cleaner and mirrors the pattern used in the Notebook 06 (Asynchrony) notebooks. Made-with: Cursor
… to avoid unnecessary computation. Made-with: Cursor
…e the same input matrix for host and device. Different matrices converge at different rates, so it's only valid to benchmark on the same inputs. Use A_host for both host and device benchmarks instead of comparing A_host (CPU) against A_device (GPU-generated). Made-with: Cursor
…econds with mean ± relative stdev format.
- 05 Memory Spaces: Use cupyx.profiler.benchmark with mean/stdev/runs format.
- 06 Asynchrony: Use time.perf_counter for single-run timing in ms.
- 40/41 Kernel Authoring: Convert benchmark output from seconds to ms.
- Rename timing variable from D to T across all notebooks.
Made-with: Cursor
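As a rough illustration of the "mean ± relative stdev" reporting in milliseconds described above, here is a minimal sketch using only the standard library. The helper name `format_timing` and the sample values are hypothetical; the notebooks themselves take their samples from cupyx.profiler.benchmark, which is not reproduced here.

```python
import statistics


def format_timing(samples_s):
    """Format timings given in seconds as 'mean ms ± relative stdev (n runs)'.

    Hypothetical helper for illustration; not taken from the notebooks.
    """
    mean_s = statistics.fmean(samples_s)
    # The sample standard deviation needs at least two runs; report 0 otherwise.
    stdev_s = statistics.stdev(samples_s) if len(samples_s) > 1 else 0.0
    # Relative stdev is the stdev expressed as a percentage of the mean.
    rel_percent = stdev_s / mean_s * 100 if mean_s else 0.0
    return f"{mean_s * 1e3:.3f} ms ± {rel_percent:.1f}% ({len(samples_s)} runs)"


# Made-up timings in seconds; T follows the renamed timing variable.
T = [0.010, 0.012, 0.011]
summary = format_timing(T)
print(summary)
```

Reporting the spread relative to the mean makes runs of very different durations easier to compare at a glance than an absolute stdev would.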
… own cell, fix capitalization, restore eigvals timing.
- Split expensive np.linalg.eigvals call into a dedicated cell timed with time.perf_counter.
- Report eigvals timing alongside host/device benchmarks.
- Capitalize print labels consistently (Power Iteration, Relative Error).
- Use "Timing Host"/"Timing Device" instead of "Timing CPU"/"Timing GPU".
Made-with: Cursor
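A dedicated cell that times a single np.linalg.eigvals call with time.perf_counter might look like the sketch below. The matrix size, seed, and variable names are assumptions chosen to keep the example fast; they are not taken from the notebooks.

```python
import time

import numpy as np

# Small hypothetical input; the notebooks use their own matrix.
rng = np.random.default_rng(0)
A_host = rng.random((64, 64))

# Single-run wall-clock timing, converted from seconds to milliseconds.
start = time.perf_counter()
eigenvalues = np.linalg.eigvals(A_host)
T_eigvals_ms = (time.perf_counter() - start) * 1e3

print(f"Timing eigvals: {T_eigvals_ms:.3f} ms")
```

Keeping the expensive call in its own cell means it runs once, and its timing can be reported next to the host and device benchmarks without being re-executed.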
…tions with accurate step ranges and per-step regions. Made-with: Cursor
…g from sections 3 and 4. Restore the original style from before 2f5c4fc: just call the functions, print the estimates, and show both matrices side by side. Benchmarking belongs in section 5 where cupyx.profiler.benchmark is used properly. Made-with: Cursor
…t cupy.linalg.eigvals not being implemented. Made-with: Cursor
…CPU wall-clock times for both host and device. The benchmarking cell was using .gpu_times[0] for the device benchmark but .cpu_times for the host benchmark, which is an apples-to-oranges comparison. The GPU time measures only device execution, excluding kernel launch overhead, synchronization, and other CPU-side costs. The CPU time (wall-clock) is the end-to-end time, which is the fair metric for both. This was introduced in 0c365ed. Made-with: Cursor
…ck" and normalize notebook formatting. Made-with: Cursor
…adata in histogram SOLUTION. Add missing required nbformat properties (name, metadata, execution_count) to cell outputs that were causing nbconvert validation failures in CI. Made-with: Cursor
…CPU benchmarking into separate cells.
…r iteration code as exercise starting point and remove commented-out code. Exercise cells now start from the prior iteration (e.g. NumPy code for CuPy porting exercises, NVTX-annotated baseline for async exercise) so students can focus on the meaningful changes. Verification, comparison, and benchmarking cells run directly instead of requiring students to uncomment code. Made-with: Cursor
…ynchronize() instead of cp.cuda.Device().synchronize(). Made-with: Cursor
… benchmarking from estimate_host call. Made-with: Cursor
…ercise notebook. Made-with: Cursor
…ype annotations and NumPy array assertions to estimate_host/estimate_device. Add `-> np.ndarray` return type annotations to `estimate_host` and `estimate_device` across the 05 and 06 notebooks (exercise + solution). Add assertions verifying the return value is a NumPy array at warmup call sites (06 notebooks) and after the initial call (05 notebooks). Remove redundant `synchronize()` calls after warmup since the assertion already forces synchronization. Fix `estimate_device` in the 05 solution to return `cp.asnumpy(result)` instead of `result.item()` to match the declared return type. Made-with: Cursor
…th blank lines between items. Made-with: Cursor
…ment Setup subsection, merge into parent Setup section. Made-with: Cursor
…() calls and instructions in favor of cp.asnumpy(). Since estimate_host/estimate_device now return np.ndarray, the .item() calls on their results are unnecessary. Remove all .item() calls from code and all instructions/comments that teach students to use .item() for host-device transfers. Students should learn only cp.asarray() and cp.asnumpy() for moving data between host and device. Made-with: Cursor
…ema validation. Apply automated notebook formatting from brev/test-notebook-format.py --fix after rebasing on main. Adds missing metadata fields (accelerator, colab, language_info, toc_visible), normalizes nbformat_minor, removes extra widget metadata, and fixes missing outputs/execution_count fields in 03 NumPy to CuPy SOLUTION notebook. Made-with: Cursor
1146f96 to 59890e1
…NumPy scalars from matmul inner products. The `@` operator on two 1-D arrays returns a bare `numpy.float64` scalar, not a 0-D `np.ndarray`. Broaden isinstance checks to accept `np.generic` (the base class for all NumPy scalar types) alongside `np.ndarray`. Made-with: Cursor
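The behavior this commit describes can be checked with NumPy alone. The following sketch uses made-up vectors and a hypothetical helper name, `is_numpy_value`, to show the broadened check.

```python
import numpy as np

x = np.ones(4)
y = np.arange(4.0)

# The @ operator on two 1-D arrays yields a NumPy scalar, not an ndarray.
inner = x @ y


def is_numpy_value(value):
    """Accept both arrays and NumPy scalars (hypothetical helper)."""
    return isinstance(value, (np.ndarray, np.generic))


print(type(inner).__name__, is_numpy_value(inner), is_numpy_value(x))
```

A check against `np.ndarray` alone would reject `inner`, which is why the assertions were widened to include `np.generic`.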
Made-with: Cursor
…ze from 2 GB to 100 MB. The CPU sequential_math benchmark with (1000, 500, 500) arrays (~2 GB) exceeded the 600s per-cell test timeout due to cupyx.profiler.benchmark running n_repeat=10 plus n_warmup=10 iterations of expensive CPU-side array operations. Reduce to (50, 500, 500) (~100 MB) to keep CPU benchmarks well within timeout while still demonstrating GPU speedup. Made-with: Cursor
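The sizes quoted in this commit are consistent with 8-byte elements, as the back-of-the-envelope sketch below shows. The float64 element size is an assumption; the commit message does not state the dtype.

```python
import math


def array_nbytes(shape, itemsize=8):
    """Bytes needed for a dense array; itemsize=8 assumes float64."""
    return math.prod(shape) * itemsize


before = array_nbytes((1000, 500, 500))  # original benchmark shape
after = array_nbytes((50, 500, 500))     # reduced benchmark shape

# Warmup and timed repetitions both execute the benchmarked function.
n_warmup, n_repeat = 10, 10
total_calls = n_warmup + n_repeat

print(f"before: {before / 1e9:.1f} GB, after: {after / 1e6:.0f} MB")
print(f"calls per benchmark: {total_calls}")
```

Shrinking the leading dimension by a factor of 20 cuts the data size by the same factor, while each benchmark still runs 20 calls in total.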
…g container ports. Set HTTP_PORT=8081 on the ncu service so it listens on 8081 directly, making the port mapping 8081:8081 instead of 8081:8080. Re-export HTTP_PORT in entrypoint-nsight.bash and refactor the optional variable handling into a loop. Made-with: Cursor
Standardize all notebook titles to "## Topic - Focus" format and enforce consistent heading hierarchy (## title, ### sections, #### subsections). Solution notebooks use "- SOLUTION" suffix. Fixes across 37 notebooks. Made-with: Cursor
…ab setup cells. Move all non-%%writefile imports to setup cells. Add Google Colab install cells with sentinel file, progress prints, and quoted packages. Use module-qualified names (cupyx as cpx, cuda.core as cc, cv2, urllib.request). Remove unnecessary numba_config PYNVJITLINK and cuDF Colab installs. Replace manual timeit benchmarking with cpx.profiler.benchmark in CCCL notebook. Made-with: Cursor
❌ Notebook Format Check Failed
One or more Jupyter notebooks have format or metadata issues. Please check the workflow run logs for details on which notebooks have issues.
How to fix:
# Check all tutorials
python3 brev/test-notebook-format.py
# Check a specific tutorial
python3 brev/test-notebook-format.py <tutorial-name>
# Auto-fix all issues
python3 brev/test-notebook-format.py --fix
# Auto-fix a specific tutorial
python3 brev/test-notebook-format.py <tutorial-name> --fix
…tlib import. Use `cuda.core.experimental` instead of `cuda.core` for Device, Program, LaunchConfig, and launch APIs. Add missing `import matplotlib.pyplot as plt` in the NumPy-to-CuPy solution notebook's visualization cell. Made-with: Cursor
a854351 to ceeda3b
…from current container build. Made-with: Cursor
Rewrite brev/test-notebook-format.py to check byte-for-byte canonical format (nbformat.write output: 1-space indent, sorted keys, cell IDs) instead of only checking metadata values. Fix --fix to actually resolve all issues including malformed stream outputs and missing cell IDs. Reformat all 84 non-canonical notebooks. Made-with: Cursor
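A byte-for-byte canonical check of this kind can be approximated with the standard library. This is only a sketch under stated assumptions: it imitates the 1-space indent and sorted keys with `json.dumps` rather than calling nbformat.write, it ignores cell IDs, and the function names are hypothetical. It is not the logic of brev/test-notebook-format.py.

```python
import json


def canonical_text(notebook):
    """Serialize with 1-space indent and sorted keys, plus a trailing newline.

    Approximates the canonical layout using only the json module.
    """
    return json.dumps(notebook, indent=1, sort_keys=True, ensure_ascii=False) + "\n"


def is_canonical(raw_text):
    """True when re-serializing the parsed text reproduces it exactly."""
    return canonical_text(json.loads(raw_text)) == raw_text


# Tiny made-up notebook-like document for illustration.
nb = {"nbformat_minor": 5, "nbformat": 4, "metadata": {}, "cells": []}
print(is_canonical(json.dumps(nb, indent=2)))  # different indent, unsorted keys
print(is_canonical(canonical_text(nb)))
```

Comparing the serialized text rather than individual metadata values catches any formatting drift, which matches the motivation given for the rewrite.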
Set up pre-commit with hooks for notebook format (pre-commit), Git LFS tracking (pre-commit), commit signatures (pre-push), and link checking (manual). Both notebook-format and git-lfs scripts now accept individual file paths so pre-commit only checks staged files, not the entire repo. Made-with: Cursor
Replace inconsistent "--- N. ALL CAPS ---" and "--- Step N: Title ---" comment styles with consistent "Step N.) Title" format. Made-with: Cursor
dev-start and dev-shell default to --mount (bind-mount local repo); dev-test defaults to --no-mount (image content). Centralize volume management in dev-common.bash:setup_docker_volume(). Forward extra args from dev-test through to each tutorial's test.bash; bare words like "03" become pytest -k filters for notebook tests. Made-with: Cursor
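The dev scripts are written in bash and their exact logic is not shown here, so the following is only a hypothetical Python sketch of how bare words could be turned into a pytest -k filter while option-like arguments pass through unchanged. The function name and the choice to join several bare words with "or" are assumptions.

```python
def build_pytest_args(extra_args):
    """Translate forwarded args into pytest args (hypothetical sketch).

    Arguments starting with '-' are passed through untouched; bare words
    such as '03' are combined into a single -k expression.
    """
    passthrough = [arg for arg in extra_args if arg.startswith("-")]
    filters = [arg for arg in extra_args if not arg.startswith("-")]
    args = list(passthrough)
    if filters:
        # Assumption: several bare words select tests matching any of them.
        args += ["-k", " or ".join(filters)]
    return args


print(build_pytest_args(["03"]))
print(build_pytest_args(["-x", "03", "05"]))
```

One limitation of this sketch is that an option taking a separate value, such as a flag followed by a number, would have its value mistaken for a bare word.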
… cuda-cccl to 0.4.5. Notebook 07 passed CuPy arrays directly to cuda.core launch(), which only accepts raw device pointers; restore .data.ptr usage. Upgrade cuda-cccl from 0.4.3 to 0.4.5 to fix cuda.compute and cuda.coop import failures caused by incompatibility with cuda-python 13.1.1. Made-with: Cursor
…changes. system.driver_version and system.num_devices were replaced by system.get_driver_version() and system.get_num_devices() in cuda-core 0.3.x. Made-with: Cursor
Summary
Fixes and improvements across Accelerated Python tutorial notebooks, infrastructure, and dev tooling.
Memory Spaces (Power Iteration)
- … `time.time()` benchmarking from sections 3 and 4.
- … `eigvals` into its own cell, fix capitalization, and restore eigvals timing.
- … `cupy.linalg.eigvals` not being implemented.
- … `isinstance` check before `cp.asarray()`.
- … `estimate_device_exercise`/`generate_device_exercise` back to `estimate_device`/`generate_device`.

Asynchrony (Power Iteration)

Cross-Cutting (Memory Spaces + Asynchrony)
- … `cupyx.profiler.profile` and NVTX.
- … `.item()` calls and instructions in favor of `cp.asnumpy()`.
- … `estimate_host`/`estimate_device`.

Kernel Authoring (Copy)
- … `copy_blocked` kernel.

Kernel Authoring (Book Histogram)

NumPy to CuPy

cuda.core (Devices, Streams, and Memory)
- … `.data.ptr`) to `launch()` instead of CuPy arrays.

CCCL (Customizing Algorithms)
- … `cp.cuda.get_current_stream().synchronize()` instead of `cp.cuda.Device().synchronize()`.

cuDF

Cross-Cutting (All Notebooks)

Infrastructure / Dev Tooling
- … `cuda-cccl` from 0.4.3 to 0.4.5 to fix `cuda.compute`/`cuda.coop` import failures with `cuda-python` 13.1.1.
- … `test_cuda_python` for `cuda.core` API changes (`system.get_driver_version()`, `system.get_num_devices()`).
- … `--mount`/`--no-mount` flag and test arg forwarding to dev scripts.
- … `HTTP_PORT` to set ncu listen port instead of remapping container ports.
- … `.vscode` to `.gitignore`.