Tutorials/Accelerated Python: Fixes and improvements to memory spaces, asynchrony, and kernel authoring notebooks #148

Merged
brycelelbach merged 48 commits into main from fix/accelerated-python-memory-space-asynchrony-and-kernel-notebooks
Mar 9, 2026

@brycelelbach (Collaborator) commented Mar 3, 2026

Summary

Fixes and improvements across Accelerated Python tutorial notebooks, infrastructure, and dev tooling.

Memory Spaces (Power Iteration)

  • Fix benchmark to compare CPU wall-clock times for both host and device, and revert to using the same input matrix for both.
  • Remove inline time.time() benchmarking from sections 3 and 4.
  • Separate eigvals into its own cell, fix capitalization, and restore eigvals timing.
  • Remove outdated note about cupy.linalg.eigvals not being implemented.
  • Remove unnecessary isinstance check before cp.asarray().
  • Rename estimate_device_exercise / generate_device_exercise back to estimate_device / generate_device.
  • Re-add checkpoint I/O to power iteration.
  • Fix assertions to accept NumPy scalars from matmul inner products.

Asynchrony (Power Iteration)

  • Fix compute step NVTX annotations with accurate step ranges and per-step regions.
  • Limit warmup calls to 1 step to avoid unnecessary computation.
  • Resolve merge conflict in exercise notebook.
  • Remove redundant 1.1 Environment Setup subsection, merge into parent Setup section.
  • Fix markdown bullet lists with blank lines between items.

Cross-Cutting (Memory Spaces + Asynchrony)

  • Unify benchmark reporting to use milliseconds with mean ± relative stdev format.
  • Fix typos in example usage of cupyx.profiler.profile and NVTX.
  • Remove .item() calls and instructions in favor of cp.asnumpy().
  • Add return type annotations and NumPy array assertions to estimate_host/estimate_device.
  • Provide prior iteration code as exercise starting point and remove commented-out code.

Kernel Authoring (Copy)

  • Rename "output" CLI arg to "check" and normalize notebook formatting.
  • Add output mode to copy kernel scripts to print problem size and dtype.
  • Add configurable correctness check to the copy kernel launch function.
  • Add correctness check cell before profiling copy_blocked kernel.

Kernel Authoring (Book Histogram)

  • Improve plot formatting with titles and dataset size display.
  • Display dataset size in megabytes instead of bytes.
  • Fix invalid notebook output metadata in histogram SOLUTION.

NumPy to CuPy

  • Fix benchmark to compare CPU wall-clock times for both host and device.
  • Split initialization and CPU benchmarking into separate cells.
  • Reduce benchmark array size from 2 GB to 100 MB.
  • Unify step comment style in notebook 03.

cuda.core (Devices, Streams, and Memory)

  • Fix cuda.core import and missing matplotlib import.
  • Fix notebook 07 SOLUTION to pass raw device pointers (.data.ptr) to launch() instead of CuPy arrays.

CCCL (Customizing Algorithms)

  • Use cp.cuda.get_current_stream().synchronize() instead of cp.cuda.Device().synchronize().

cuDF

  • Fix kernelspec to use RAPIDS kernel.

Cross-Cutting (All Notebooks)

  • Consolidate imports and standardize Colab setup cells.
  • Unify notebook titles and heading levels.
  • Apply title, TOC, section header, and text updates to solution notebooks.
  • Normalize notebook metadata and fix schema validation.

Infrastructure / Dev Tooling

  • Add pre-commit hooks for lint checks (notebook canonical format, LFS binary tracking, commit signatures).
  • Enforce canonical 1-space indented notebook format to prevent editor reformatting diffs.
  • Pin pip dependencies to exact versions from current container build.
  • Upgrade cuda-cccl from 0.4.3 to 0.4.5 to fix cuda.compute/cuda.coop import failures with cuda-python 13.1.1.
  • Fix test_cuda_python for cuda.core API changes (system.get_driver_version(), system.get_num_devices()).
  • Add --mount/--no-mount flag and test arg forwarding to dev scripts.
  • Use HTTP_PORT to set ncu listen port instead of remapping container ports.
  • Silence sysctl profiling permission commands in entrypoint.
  • Add .vscode to .gitignore.

@copy-pr-bot (bot) commented Mar 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions (bot, Contributor) commented Mar 3, 2026

❌ Commit Signature Check Failed

Found 1 unsigned commit(s):

🔗 View workflow run logs

  • f4add4b: Tutorials/Accelerated Python/Kernel Authoring: Add configurable correctness check to copy kernel launch function. (unsigned)

All commits must be signed

How to fix:

  1. Configure commit signing (if not already done):

    # For GPG signing
    git config --global commit.gpgsign true
    
    # Or for SSH signing (Git 2.34+)
    git config --global gpg.format ssh
    git config --global user.signingkey ~/.ssh/id_ed25519.pub

  2. Re-sign your commits:

    git rebase -i origin/main --exec "git commit --amend --no-edit -S"
    git push --force-with-lease

📚 GitHub documentation on signing commits

@brycelelbach (Collaborator, Author)

/ok to test e9e650d

@github-actions (bot, Contributor) commented Mar 3, 2026

❌ Link Check Failed

Broken links were detected in this PR.

Please check the workflow run logs for details on which links are broken.

Common fixes:

  1. Typo in URL - Check for spelling mistakes in the link
  2. Outdated link - The page may have moved or been deleted
  3. Relative path issue - Ensure relative links use the correct path
  4. External site down - If the external site is temporarily down, you can add it to brev/.lycheeignore

To test links locally:

./brev/test-links.bash .

📚 Lychee documentation

@brycelelbach force-pushed the fix/accelerated-python-memory-space-asynchrony-and-kernel-notebooks branch from 1a8ecc0 to bc7da88 on March 3, 2026 20:48
@brycelelbach (Collaborator, Author)

/ok to test eab9896

@brycelelbach force-pushed the fix/accelerated-python-memory-space-asynchrony-and-kernel-notebooks branch from eab9896 to 7f7e7a4 on March 3, 2026 22:48
… plot formatting, add title and dataset size display.
…ctness check to copy kernel launch function.
… dataset size in megabytes instead of bytes.

Made-with: Cursor
…cell before profiling copy_blocked kernel.

Add a verification cell that runs the script immediately after the %%writefile cell, matching the pattern used in the book histogram notebook.

Made-with: Cursor
…ion header, and text updates to solution notebooks.

These changes were made to the exercise notebooks in 2f5c4fc but
were not applied to the corresponding solution notebooks.

Made-with: Cursor
…y kernel scripts to print problem size and dtype.

Made-with: Cursor
…power iteration.

The savetxt checkpoint I/O was removed from the 05 memory spaces
notebooks in 2f5c4fc. This I/O is needed to set up the narrative for
Notebook 06 (Asynchrony), whose baseline is the synchronous
device-to-host copy + file write pattern introduced here.

Made-with: Cursor
…ance check before cp.asarray().

The isinstance(A, np.ndarray) guard added in 2f5c4fc is unnecessary
because cp.asarray() already handles both cases: it copies a host
array to the GPU, and is a no-op when the array is already on the
GPU. Teaching users to call cp.asarray() unconditionally is the
intended lesson.

Made-with: Cursor
…ercise and generate_device_exercise back to estimate_device and generate_device.

The _exercise suffix was added in 2f5c4fc but breaks the naming
symmetry with estimate_host and generate_host. The host/device naming
convention is cleaner and mirrors the pattern used in the Notebook 06
(Asynchrony) notebooks.

Made-with: Cursor
… to avoid unnecessary computation.

Made-with: Cursor
…e the same input matrix for host and device.

Different matrices converge at different rates, so it's only valid to
benchmark on the same inputs. Use A_host for both host and device benchmarks
instead of comparing A_host (CPU) against A_device (GPU-generated).

Made-with: Cursor
…econds with mean ± relative stdev format.

- 05 Memory Spaces: Use cupyx.profiler.benchmark with mean/stdev/runs format.
- 06 Asynchrony: Use time.perf_counter for single-run timing in ms.
- 40/41 Kernel Authoring: Convert benchmark output from seconds to ms.
- Rename timing variable from D to T across all notebooks.

Made-with: Cursor
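
For illustration, the "milliseconds with mean ± relative stdev" report described in the commit above could look roughly like the sketch below. It is a CPU-only stand-in built on `time.perf_counter`, not the notebooks' code: the helper names `format_timing` and `time_runs` and the exact output layout are assumptions, and the notebooks themselves use `cupyx.profiler.benchmark` for notebook 05.

```python
import statistics
import time


def format_timing(samples_s):
    """Format timings given in seconds as 'mean ms ± relative stdev %'."""
    mean = statistics.mean(samples_s)
    # Sample standard deviation needs at least two data points.
    stdev = statistics.stdev(samples_s) if len(samples_s) > 1 else 0.0
    rel = (stdev / mean) * 100 if mean > 0 else 0.0
    return f"{mean * 1e3:.3f} ms ± {rel:.2f}% (mean ± relative stdev of {len(samples_s)} runs)"


def time_runs(fn, n_repeat=5):
    """Call fn n_repeat times and return the wall-clock duration of each call."""
    samples = []
    for _ in range(n_repeat):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


# T follows the timing-variable name mentioned in the commit message.
T = time_runs(lambda: sum(i * i for i in range(10_000)))
print(format_timing(T))
```

Reporting the spread relative to the mean makes host and device results comparable even when their absolute times differ by orders of magnitude.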
… own cell, fix capitalization, restore eigvals timing.

- Split expensive np.linalg.eigvals call into a dedicated cell timed with
  time.perf_counter.
- Report eigvals timing alongside host/device benchmarks.
- Capitalize print labels consistently (Power Iteration, Relative Error).
- Use "Timing Host"/"Timing Device" instead of "Timing CPU"/"Timing GPU".

Made-with: Cursor
…tions with accurate step ranges and per-step regions.

Made-with: Cursor
…g from sections 3 and 4.

Restore the original style from before 2f5c4fc: just call the functions, print
the estimates, and show both matrices side by side. Benchmarking belongs in
section 5 where cupyx.profiler.benchmark is used properly.

Made-with: Cursor
…t cupy.linalg.eigvals not being implemented.

Made-with: Cursor
…CPU wall-clock times for both host and device.

The benchmarking cell was using .gpu_times[0] for the device benchmark
but .cpu_times for the host benchmark, which is an apples-to-oranges
comparison. The GPU time measures only device execution, excluding kernel
launch overhead, synchronization, and other CPU-side costs. The CPU time
(wall-clock) is the end-to-end time, which is the fair metric for both.
This was introduced in 0c365ed.

Made-with: Cursor
…ck" and normalize notebook formatting.

Made-with: Cursor
…adata in histogram SOLUTION.

Add missing required nbformat properties (name, metadata, execution_count)
to cell outputs that were causing nbconvert validation failures in CI.

Made-with: Cursor
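
As a rough sketch of the kind of problem fixed in the commit above, the snippet below reports which required properties are missing from a cell output. The `REQUIRED` table reflects my reading of the nbformat v4 schema and the helper name `missing_output_fields` is hypothetical; the CI validation itself goes through nbconvert, not this code.

```python
# Required keys per output type (besides "output_type"), assuming nbformat v4.
REQUIRED = {
    "stream": {"name", "text"},
    "display_data": {"data", "metadata"},
    "execute_result": {"data", "metadata", "execution_count"},
    "error": {"ename", "evalue", "traceback"},
}


def missing_output_fields(output: dict) -> set:
    """Return the required keys that are absent from a single cell output."""
    required = REQUIRED.get(output.get("output_type"), set())
    return required - set(output)


# A stream output without "name" and an execute_result with only "data".
stream = {"output_type": "stream", "text": "hello\n"}
result = {"output_type": "execute_result", "data": {"text/plain": "42"}}

print(sorted(missing_output_fields(stream)))
print(sorted(missing_output_fields(result)))
```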
…r iteration code as exercise starting point and remove commented-out code.

Exercise cells now start from the prior iteration (e.g. NumPy code for CuPy
porting exercises, NVTX-annotated baseline for async exercise) so students
can focus on the meaningful changes. Verification, comparison, and
benchmarking cells run directly instead of requiring students to uncomment
code.

Made-with: Cursor
…ynchronize() instead of cp.cuda.Device().synchronize().

Made-with: Cursor
… benchmarking from estimate_host call.

Made-with: Cursor
…ype annotations and NumPy array assertions to estimate_host/estimate_device.

Add `-> np.ndarray` return type annotations to `estimate_host` and
`estimate_device` across the 05 and 06 notebooks (exercise + solution).
Add assertions verifying the return value is a NumPy array at warmup
call sites (06 notebooks) and after the initial call (05 notebooks).
Remove redundant `synchronize()` calls after warmup since the assertion
already forces synchronization. Fix `estimate_device` in the 05 solution
to return `cp.asnumpy(result)` instead of `result.item()` to match the
declared return type.

Made-with: Cursor
…th blank lines between items.

Made-with: Cursor
…ment Setup subsection, merge into parent Setup section.

Made-with: Cursor
…() calls and instructions in favor of cp.asnumpy().

Since estimate_host/estimate_device now return np.ndarray, the .item()
calls on their results are unnecessary. Remove all .item() calls from
code and all instructions/comments that teach students to use .item()
for host-device transfers. Students should learn only cp.asarray() and
cp.asnumpy() for moving data between host and device.

Made-with: Cursor
…ema validation.

Apply automated notebook formatting from brev/test-notebook-format.py --fix
after rebasing on main. Adds missing metadata fields (accelerator, colab,
language_info, toc_visible), normalizes nbformat_minor, removes extra
widget metadata, and fixes missing outputs/execution_count fields in
03 NumPy to CuPy SOLUTION notebook.

Made-with: Cursor
@brycelelbach force-pushed the fix/accelerated-python-memory-space-asynchrony-and-kernel-notebooks branch from 1146f96 to 59890e1 on March 6, 2026 18:23
…NumPy scalars from matmul inner products.

The `@` operator on two 1-D arrays returns a bare `numpy.float64` scalar,
not a 0-D `np.ndarray`. Broaden isinstance checks to accept `np.generic`
(the base class for all NumPy scalar types) alongside `np.ndarray`.

Made-with: Cursor
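
The behavior described in the commit above can be reproduced with plain NumPy. This is a minimal sketch, not the notebooks' assertion code; the helper name `is_numpy_value` is hypothetical.

```python
import numpy as np


def is_numpy_value(x) -> bool:
    """Return True for NumPy arrays and NumPy scalars (np.generic subclasses)."""
    return isinstance(x, (np.ndarray, np.generic))


x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# The inner product of two 1-D arrays is a NumPy scalar, not a 0-D array.
inner = x @ y

assert not isinstance(inner, np.ndarray)  # an ndarray-only check would fail here
assert isinstance(inner, np.generic)
assert is_numpy_value(inner)
assert is_numpy_value(np.asarray(inner))  # a 0-D ndarray is accepted as well
```

Checking against `np.generic` still rejects plain Python floats, so the assertion keeps verifying that the value came from NumPy.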
…ze from 2 GB to 100 MB.

The CPU sequential_math benchmark with (1000, 500, 500) arrays (~2 GB)
exceeded the 600s per-cell test timeout due to cupyx.profiler.benchmark
running n_repeat=10 plus n_warmup=10 iterations of expensive CPU-side
array operations. Reduce to (50, 500, 500) (~100 MB) to keep CPU
benchmarks well within timeout while still demonstrating GPU speedup.

Made-with: Cursor
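
The sizes quoted in the commit above follow from simple arithmetic. The sketch below assumes 8-byte (float64) elements, which the commit message does not state explicitly but which matches the ~2 GB and ~100 MB figures; `array_nbytes` is a hypothetical helper.

```python
import math


def array_nbytes(shape, itemsize=8):
    """Bytes needed for a dense array; itemsize=8 assumes float64 elements."""
    return math.prod(shape) * itemsize


before = array_nbytes((1000, 500, 500))
after = array_nbytes((50, 500, 500))

print(f"before: {before / 1e9:.1f} GB")
print(f"after:  {after / 1e6:.0f} MB")
print(f"reduction: {before // after}x")
```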
…g container ports.

Set HTTP_PORT=8081 on the ncu service so it listens on 8081 directly,
making the port mapping 8081:8081 instead of 8081:8080. Re-export
HTTP_PORT in entrypoint-nsight.bash and refactor the optional variable
handling into a loop.

Made-with: Cursor
Standardize all notebook titles to "## Topic - Focus" format and enforce
consistent heading hierarchy (## title, ### sections, #### subsections).
Solution notebooks use "- SOLUTION" suffix. Fixes across 37 notebooks.

Made-with: Cursor
…ab setup cells.

Move all non-%%writefile imports to setup cells. Add Google Colab install
cells with sentinel file, progress prints, and quoted packages. Use
module-qualified names (cupyx as cpx, cuda.core as cc, cv2, urllib.request).
Remove unnecessary numba_config PYNVJITLINK and cuDF Colab installs.
Replace manual timeit benchmarking with cpx.profiler.benchmark in CCCL
notebook.

Made-with: Cursor
@github-actions (bot, Contributor) commented Mar 7, 2026

❌ Notebook Format Check Failed

One or more Jupyter notebooks have format or metadata issues.

Please check the workflow run logs for details on which notebooks have issues.

What is checked:

  1. Schema integrity — notebooks must be valid according to the Jupyter notebook JSON schema
  2. Metadata conformance — notebooks must have the standard metadata block (accelerator, colab, kernelspec, language_info)
  3. Clean outputs — non-SOLUTION notebooks must have outputs, execution counts, and execution timing metadata cleared

How to fix:

# Check all tutorials
python3 brev/test-notebook-format.py

# Check a specific tutorial
python3 brev/test-notebook-format.py <tutorial-name>

# Auto-fix all issues
python3 brev/test-notebook-format.py --fix

# Auto-fix a specific tutorial
python3 brev/test-notebook-format.py <tutorial-name> --fix

…tlib import.

Use `cuda.core.experimental` instead of `cuda.core` for Device, Program,
LaunchConfig, and launch APIs. Add missing `import matplotlib.pyplot as plt`
in the NumPy-to-CuPy solution notebook's visualization cell.

Made-with: Cursor
@brycelelbach force-pushed the fix/accelerated-python-memory-space-asynchrony-and-kernel-notebooks branch from a854351 to ceeda3b on March 7, 2026 19:43
…from current container build.

Made-with: Cursor
Rewrite brev/test-notebook-format.py to check byte-for-byte canonical
format (nbformat.write output: 1-space indent, sorted keys, cell IDs)
instead of only checking metadata values. Fix --fix to actually resolve
all issues including malformed stream outputs and missing cell IDs.
Reformat all 84 non-canonical notebooks.

Made-with: Cursor
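
A byte-for-byte canonical check of the kind described in the commit above might be sketched as follows. This uses only the standard `json` module as a stand-in for `nbformat.write`, on the assumption that the two produce the same 1-space-indented, key-sorted layout with a trailing newline. The real logic lives in brev/test-notebook-format.py and also handles cell IDs and malformed outputs, which this sketch does not; `canonical_bytes` and `is_canonical` are hypothetical names.

```python
import json


def canonical_bytes(nb: dict) -> bytes:
    """Serialize a notebook dict with 1-space indent, sorted keys, trailing newline."""
    text = json.dumps(nb, indent=1, sort_keys=True, ensure_ascii=False)
    return (text + "\n").encode("utf-8")


def is_canonical(raw: bytes) -> bool:
    """True when re-serializing the parsed notebook reproduces raw exactly."""
    return canonical_bytes(json.loads(raw.decode("utf-8"))) == raw


nb = {"nbformat": 4, "nbformat_minor": 5, "metadata": {}, "cells": []}
good = canonical_bytes(nb)
bad = json.dumps(nb, indent=2).encode("utf-8")  # editor-style 2-space indent

print(is_canonical(good), is_canonical(bad))
```

Comparing bytes rather than parsed content is what catches editor reformatting: two files can hold identical JSON yet still differ in indentation or key order.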
Set up pre-commit with hooks for notebook format (pre-commit), Git LFS
tracking (pre-commit), commit signatures (pre-push), and link checking
(manual). Both notebook-format and git-lfs scripts now accept individual
file paths so pre-commit only checks staged files, not the entire repo.

Made-with: Cursor
Replace inconsistent "--- N. ALL CAPS ---" and "--- Step N: Title ---"
comment styles with consistent "Step N.) Title" format.

Made-with: Cursor
dev-start and dev-shell default to --mount (bind-mount local repo);
dev-test defaults to --no-mount (image content). Centralize volume
management in dev-common.bash:setup_docker_volume(). Forward extra
args from dev-test through to each tutorial's test.bash; bare words
like "03" become pytest -k filters for notebook tests.

Made-with: Cursor
… cuda-cccl to 0.4.5.

Notebook 07 passed CuPy arrays directly to cuda.core launch(), which
only accepts raw device pointers; restore .data.ptr usage. Upgrade
cuda-cccl from 0.4.3 to 0.4.5 to fix cuda.compute and cuda.coop
import failures caused by incompatibility with cuda-python 13.1.1.

Made-with: Cursor
…changes.

system.driver_version and system.num_devices were replaced by
system.get_driver_version() and system.get_num_devices() in
cuda-core 0.3.x.

Made-with: Cursor
@brycelelbach brycelelbach merged commit b0e5e03 into main Mar 9, 2026
31 checks passed
@brycelelbach brycelelbach deleted the fix/accelerated-python-memory-space-asynchrony-and-kernel-notebooks branch March 9, 2026 16:09