|
60 | 60 | "\n", |
61 | 61 | "Let's compare the performance of creating a large 3D array (approx. 100 MB in size) on the CPU versus the GPU.\n", |
62 | 62 | "\n", |
63 | | - "We will use `np.ones` for the CPU and `cp.ones` for the GPU.\n" |
| 63 | + "We will use `np.ones()` for the CPU and `cp.ones()` for the GPU.\n" |
64 | 64 | ] |
65 | 65 | }, |
66 | 66 | { |
|
92 | 92 | "source": [ |
93 | 93 | "We can see here that creating this array on the GPU is much faster than doing so on the CPU!\n", |
94 | 94 | "\n", |
95 | | - "**About `cupyx.profiler.benchmark`:**\n", |
| 95 | + "**About `cupyx.profiler.benchmark()`:**\n", |
96 | 96 | "\n", |
97 | | - "We use CuPy's built-in `benchmark` utility for timing GPU operations. This is important because GPU operations are **asynchronous** - when you call a CuPy function, the CPU places a task in the GPU's \"to-do list\" (stream) and immediately moves on without waiting.\n", |
| 97 | + "We use CuPy's built-in `benchmark()` utility for timing GPU operations. This is important because GPU operations are **asynchronous** - when you call a CuPy function, the CPU places a task in the GPU's \"to-do list\" (stream) and immediately moves on without waiting.\n", |
98 | 98 | "\n", |
99 | | - "The `benchmark` function handles all the complexity of proper GPU timing for us:\n", |
| 99 | + "The `benchmark()` function handles all the complexity of proper GPU timing for us:\n", |
100 | 100 | "- It automatically synchronizes GPU streams to get accurate measurements.\n", |
101 | 101 | "- It runs warm-up iterations to avoid cold-start overhead.\n", |
102 | 102 | "- It reports both CPU wall-clock times (`cpu_times`) and GPU kernel times (`gpu_times`). We use `cpu_times` for all comparisons because it measures end-to-end wall-clock time, giving a fair apples-to-apples comparison between CPU and GPU code.\n", |
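The asynchronous dispatch described in this cell can be sketched without a GPU at all. The toy `AsyncDevice` class below is a hypothetical stand-in (not part of CuPy) that models a GPU stream with a background worker thread: `submit()` returns immediately, so naive wall-clock timing only measures the cost of queueing, and you must `synchronize()` before reading the clock — exactly the pitfall `cupyx.profiler.benchmark()` handles for you.

```python
import queue
import threading
import time

class AsyncDevice:
    """Toy model of a GPU stream: tasks are queued and executed in the
    background, so submit() returns before the work is actually done."""

    def __init__(self):
        self.tasks = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def _run(self):
        while True:
            task = self.tasks.get()
            task()
            self.tasks.task_done()

    def submit(self, task):
        # Returns immediately, like an asynchronous CuPy kernel launch.
        self.tasks.put(task)

    def synchronize(self):
        # Blocks until every queued task has finished -- the analogue of
        # synchronizing a CUDA stream before stopping a timer.
        self.tasks.join()

device = AsyncDevice()

# Naive timing: measures only the cost of placing the task in the queue.
start = time.perf_counter()
device.submit(lambda: time.sleep(0.2))  # a "kernel" that takes 0.2 s
naive = time.perf_counter() - start

# Correct timing: synchronize before reading the clock.
device.synchronize()
total = time.perf_counter() - start

print(f"naive: {naive:.4f} s, synchronized: {total:.4f} s")
```

The naive measurement comes out near zero even though the "kernel" takes 0.2 s, which is why un-synchronized GPU timings can look implausibly fast.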
|
286 | 286 | "\n", |
287 | 287 | "A key feature of CuPy is that many **NumPy functions work on CuPy arrays without changing your code**.\n", |
288 | 288 | "\n", |
289 | | - "When you pass a CuPy GPU array (`x_gpu`) into a NumPy function that supports the `__array_function__` protocol (e.g., `np.linalg.svd`), NumPy detects the CuPy input and **delegates the operation to CuPy’s own implementation**, which runs on the GPU.\n", |
| 289 | + "When you pass a CuPy GPU array (`x_gpu`) into a NumPy function that supports the `__array_function__` protocol (e.g., `np.linalg.svd()`), NumPy detects the CuPy input and **delegates the operation to CuPy’s own implementation**, which runs on the GPU.\n", |
290 | 290 | "\n", |
291 | 291 | "This allows you to write code using standard `np.*` syntax and have it run on either CPU or GPU seamlessly - **as long as CuPy implements an override for that function.**\n", |
292 | 292 | "\n", |
293 | 293 | "One common source of hidden performance penalties is **implicit transfers between CPU and GPU**. In some cases, CuPy guards against this: for example, when NumPy tries to convert a `cupy.ndarray` into a `numpy.ndarray` via the `__array__` protocol (e.g. `np.asarray(gpu_array)`), CuPy raises a `TypeError` instead of silently copying data to the host. \n", |
294 | 294 | "\n", |
295 | | - "However, CuPy **does** perform implicit GPU → CPU transfers in other cases, such as printing a GPU array, converting to a Python scalar (e.g. `float`, `.item`), or evaluating a GPU scalar in a boolean context. We will explore these implicit transfers in a later notebook." |
| 295 | + "However, CuPy **does** perform implicit GPU → CPU transfers in other cases, such as printing a GPU array, converting to a Python scalar (e.g. `float`, `.item()`), or evaluating a GPU scalar in a boolean context. We will explore these implicit transfers in a later notebook." |
296 | 296 | ] |
297 | 297 | }, |
298 | 298 | { |
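Both behaviors described in this cell — `__array_function__` delegation and the `__array__` guard against implicit copies — can be demonstrated with a minimal array-like class. `DeviceArray` below is a hypothetical toy type (not CuPy) that hooks into the same two NumPy protocols that `cupy.ndarray` uses:

```python
import numpy as np

class DeviceArray:
    """Toy array type that mimics how cupy.ndarray hooks into NumPy."""

    def __init__(self, data):
        self._data = list(data)  # pretend this data lives on a device

    def __array_function__(self, func, types, args, kwargs):
        # NumPy calls this instead of its own implementation whenever a
        # DeviceArray appears among the arguments of a supported np.* function.
        if func is np.sum:
            return sum(self._data)  # our "device-side" implementation
        return NotImplemented

    def __array__(self, dtype=None, copy=None):
        # Mimic CuPy's guard against implicit device -> host transfers.
        raise TypeError("implicit conversion to a NumPy array is not allowed")

x = DeviceArray([1, 2, 3])
print(np.sum(x))       # dispatched to DeviceArray.__array_function__ -> 6

try:
    np.asarray(x)      # attempts conversion via __array__, which raises
except TypeError as e:
    print("TypeError:", e)
```

The same mechanism is why `np.linalg.svd(x_gpu)` runs on the GPU: NumPy sees the CuPy type and hands the call to CuPy's implementation rather than converting the data.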
|
369 | 369 | "2. Change the setup line to `xp = cp` (GPU Mode). Run it again.\n", |
370 | 370 | "3. Observe how the exact same logic runs significantly faster on the GPU with CuPy while keeping the familiar NumPy semantics.\n", |
371 | 371 | "\n", |
372 | | - "Note: We use `cupyx.profiler.benchmark` for timing, which automatically handles GPU synchronization." |
| 372 | + "Note: We use `cupyx.profiler.benchmark()` for timing, which automatically handles GPU synchronization." |
373 | 373 | ] |
374 | 374 | }, |
375 | 375 | { |
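The `xp = np` / `xp = cp` switch described in the steps above can be sketched as follows. This runnable version stays in CPU mode (`xp = np`) so it works without a GPU; the `normalize` function is a hypothetical example, chosen only because it exercises the shared `xp.*` API:

```python
import numpy as np

# On a machine with a CUDA GPU you would instead write:
#   import cupy as cp
#   xp = cp
xp = np  # CPU mode

def normalize(a):
    """Array-library-agnostic: works for NumPy and CuPy inputs alike,
    because it only calls functions from the shared xp.* namespace."""
    return (a - xp.mean(a)) / xp.std(a)

x = xp.linspace(0.0, 2.0 * xp.pi, 1_000)
y = normalize(xp.sin(x))
print(float(xp.mean(y)), float(xp.std(y)))  # ~0.0 and ~1.0
```

Flipping the single `xp = ...` line moves every array and every operation in the function onto the GPU, with no other code changes.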
|
421 | 421 | "id": "077b7589", |
422 | 422 | "metadata": {}, |
423 | 423 | "source": [ |
424 | | - "**TODO: When working with CuPy arrays, try changing `xp.testing.assert_allclose` to `np.testing.assert_allclose`. What happens and why?**" |
| 424 | + "**TODO: When working with CuPy arrays, try changing `xp.testing.assert_allclose()` to `np.testing.assert_allclose()`. What happens and why?**" |
425 | 425 | ] |
426 | 426 | }, |
427 | 427 | { |
|
436 | 436 | "\n", |
437 | 437 | "**TODO:** \n", |
438 | 438 | "1) **Generate Data:** Create a NumPy array (`y_cpu`) and a CuPy array (`y_gpu`) representing $\\sin(x)$ from $0$ to $2\\pi$ with `50,000,000` points.\n", |
439 | | - "2) **Benchmark CPU and GPU:** Use `benchmark` from `cupyx.profiler` to measure both `np.sort` and `cp.sort`." |
| 439 | + "2) **Benchmark CPU and GPU:** Use `benchmark()` from `cupyx.profiler` to measure both `np.sort()` and `cp.sort()`." |
440 | 440 | ] |
441 | 441 | }, |
442 | 442 | { |
|
462 | 462 | "# Step 2.) Benchmark NumPy (CPU)\n", |
463 | 463 | "print(\"Benchmarking NumPy Sort (this may take a few seconds)...\")\n", |
464 | 464 | "# TODO: Use cpx.profiler.benchmark(function, (args,), n_repeat=5, n_warmup=1)\n", |
465 | | - "# Hint: Pass the function `np.sort` and the argument `(y_cpu,)`\n", |
| 465 | + "# Hint: Pass the function `np.sort()` and the argument `(y_cpu,)`\n", |
466 | 466 | "# Note: The comma in (y_cpu,) is required to make it a tuple!\n", |
467 | 467 | "\n", |
468 | 468 | "\n", |
469 | 469 | "# Step 3.) Benchmark CuPy (GPU)\n", |
470 | 470 | "print(\"Benchmarking CuPy Sort...\")\n", |
471 | 471 | "# TODO: Use cpx.profiler.benchmark(function, (args,), n_repeat=5, n_warmup=1)\n", |
472 | | - "# Hint: Pass the function `cp.sort` and the argument `(y_gpu,)`\n", |
| 472 | + "# Hint: Pass the function `cp.sort()` and the argument `(y_gpu,)`\n", |
473 | 473 | "# Note: The comma in (y_gpu,) is required to make it a tuple!" |
474 | 474 | ] |
475 | 475 | }, |
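For the CPU half of this exercise, a plain `time.perf_counter()` measurement is a reasonable sanity check, since NumPy calls are synchronous. The sketch below uses a smaller array (5 million points rather than the exercise's 50 million) to keep the runtime short; on the GPU side you would still need `cpx.profiler.benchmark()`, because `time.perf_counter()` alone cannot time asynchronous CuPy calls:

```python
import time
import numpy as np

# Smaller than the exercise's 50,000,000 points, to keep this quick.
n = 5_000_000
x = np.linspace(0.0, 2.0 * np.pi, n)
y_cpu = np.sin(x)

start = time.perf_counter()
y_sorted = np.sort(y_cpu)  # synchronous, so wall-clock timing is valid
elapsed = time.perf_counter() - start

print(f"np.sort on {n:,} points: {elapsed:.3f} s")
```

This CPU number is the baseline the benchmarked `cp.sort()` timing gets compared against.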
|
480 | 480 | "id": "qnAvEk5QFAA8" |
481 | 481 | }, |
482 | 482 | "source": [ |
483 | | - "**EXTRA CREDIT: Benchmark with different array sizes and find the size at which CuPy and NumPy take the same amount of time. Try to extract the timing data from `cupyx.profiler.benchmark`'s return value and customize how the output is displayed. You could even make a graph.**" |
| 483 | + "**EXTRA CREDIT: Benchmark with different array sizes and find the size at which CuPy and NumPy take the same amount of time. Try to extract the timing data from `cupyx.profiler.benchmark()`'s return value and customize how the output is displayed. You could even make a graph.**" |
484 | 484 | ] |
485 | 485 | }, |
486 | 486 | { |
|