|
60 | 60 | "\n", |
61 | 61 | "Let's compare the performance of creating a large 3D array (approx. 100 MB in size) on the CPU versus the GPU.\n", |
62 | 62 | "\n", |
63 | | - "We will use `np.ones` for the CPU and `cp.ones` for the GPU.\n" |
| 63 | + "We will use `np.ones()` for the CPU and `cp.ones()` for the GPU.\n" |
64 | 64 | ] |
65 | 65 | }, |
66 | 66 | { |
|
92 | 92 | "source": [ |
93 | 93 | "We can see here that creating this array on the GPU is much faster than doing so on the CPU!\n", |
94 | 94 | "\n", |
95 | | - "**About `cupyx.profiler.benchmark`:**\n", |
| 95 | + "**About `cupyx.profiler.benchmark()`:**\n", |
96 | 96 | "\n", |
97 | | - "We use CuPy's built-in `benchmark` utility for timing GPU operations. This is important because GPU operations are **asynchronous** - when you call a CuPy function, the CPU places a task in the GPU's \"to-do list\" (stream) and immediately moves on without waiting.\n", |
| 97 | + "We use CuPy's built-in `benchmark()` utility for timing GPU operations. This is important because GPU operations are **asynchronous** - when you call a CuPy function, the CPU places a task in the GPU's \"to-do list\" (stream) and immediately moves on without waiting.\n", |
98 | 98 | "\n", |
99 | | - "The `benchmark` function handles all the complexity of proper GPU timing for us:\n", |
| 99 | + "The `benchmark()` function handles all the complexity of proper GPU timing for us:\n", |
100 | 100 | "- It automatically synchronizes GPU streams to get accurate measurements.\n", |
101 | 101 | "- It runs warm-up iterations to avoid cold-start overhead.\n", |
102 | 102 | "- It reports both CPU wall-clock times (`cpu_times`) and GPU kernel times (`gpu_times`). We use `cpu_times` for all comparisons because it measures end-to-end wall-clock time, giving a fair apples-to-apples comparison between CPU and GPU code.\n", |
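The asynchronous dispatch described in this cell can be sketched without a GPU at all. The toy `AsyncDevice` class below is a hypothetical stand-in (not part of CuPy) that models a GPU stream with a background worker thread: `submit()` returns immediately, so naive wall-clock timing only measures the cost of queueing, and you must `synchronize()` before reading the clock — exactly the pitfall `cupyx.profiler.benchmark()` handles for you.

```python
import queue
import threading
import time

class AsyncDevice:
    """Toy model of a GPU stream: tasks are queued and executed in the
    background, so submit() returns before the work is actually done."""

    def __init__(self):
        self.tasks = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def _run(self):
        while True:
            task = self.tasks.get()
            task()
            self.tasks.task_done()

    def submit(self, task):
        # Returns immediately, like an asynchronous CuPy kernel launch.
        self.tasks.put(task)

    def synchronize(self):
        # Blocks until every queued task has finished -- the analogue of
        # synchronizing a CUDA stream before stopping a timer.
        self.tasks.join()

device = AsyncDevice()

# Naive timing: measures only the cost of placing the task in the queue.
start = time.perf_counter()
device.submit(lambda: time.sleep(0.2))  # a "kernel" that takes 0.2 s
naive = time.perf_counter() - start

# Correct timing: synchronize before reading the clock.
device.synchronize()
total = time.perf_counter() - start

print(f"naive: {naive:.4f} s, synchronized: {total:.4f} s")
```

The naive measurement comes out near zero even though the "kernel" takes 0.2 s, which is why un-synchronized GPU timings can look implausibly fast.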
|
286 | 286 | "\n", |
287 | 287 | "A key feature of CuPy is that many **NumPy functions work on CuPy arrays without changing your code**.\n", |
288 | 288 | "\n", |
289 | | - "When you pass a CuPy GPU array (`x_gpu`) into a NumPy function that supports the `__array_function__` protocol (e.g., `np.linalg.svd`), NumPy detects the CuPy input and **delegates the operation to CuPy’s own implementation**, which runs on the GPU.\n", |
| 289 | + "When you pass a CuPy GPU array (`x_gpu`) into a NumPy function that supports the `__array_function__` protocol (e.g., `np.linalg.svd()`), NumPy detects the CuPy input and **delegates the operation to CuPy’s own implementation**, which runs on the GPU.\n", |
290 | 290 | "\n", |
291 | 291 | "This allows you to write code using standard `np.*` syntax and have it run on either CPU or GPU seamlessly - **as long as CuPy implements an override for that function.**\n", |
292 | 292 | "\n", |
293 | 293 | "One common source of hidden performance penalties is **implicit transfers between CPU and GPU**. In some cases, CuPy guards against this: for example, when NumPy tries to convert a `cupy.ndarray` into a `numpy.ndarray` via the `__array__` protocol (e.g. `np.asarray(gpu_array)`), CuPy raises a `TypeError` instead of silently copying data to the host. \n", |
294 | 294 | "\n", |
295 | | - "However, CuPy **does** perform implicit GPU → CPU transfers in other cases, such as printing a GPU array, converting to a Python scalar (e.g. `float`, `.item`), or evaluating a GPU scalar in a boolean context. We will explore these implicit transfers in a later notebook." |
| 295 | + "However, CuPy **does** perform implicit GPU → CPU transfers in other cases, such as printing a GPU array, converting to a Python scalar (e.g. `float`, `.item()`), or evaluating a GPU scalar in a boolean context. We will explore these implicit transfers in a later notebook." |
296 | 296 | ] |
297 | 297 | }, |
298 | 298 | { |
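Both behaviors described in this cell — `__array_function__` delegation and the `__array__` guard against implicit copies — can be demonstrated with a minimal array-like class. `DeviceArray` below is a hypothetical toy type (not CuPy) that hooks into the same two NumPy protocols that `cupy.ndarray` uses:

```python
import numpy as np

class DeviceArray:
    """Toy array type that mimics how cupy.ndarray hooks into NumPy."""

    def __init__(self, data):
        self._data = list(data)  # pretend this data lives on a device

    def __array_function__(self, func, types, args, kwargs):
        # NumPy calls this instead of its own implementation whenever a
        # DeviceArray appears among the arguments of a supported np.* function.
        if func is np.sum:
            return sum(self._data)  # our "device-side" implementation
        return NotImplemented

    def __array__(self, dtype=None, copy=None):
        # Mimic CuPy's guard against implicit device -> host transfers.
        raise TypeError("implicit conversion to a NumPy array is not allowed")

x = DeviceArray([1, 2, 3])
print(np.sum(x))       # dispatched to DeviceArray.__array_function__ -> 6

try:
    np.asarray(x)      # attempts conversion via __array__, which raises
except TypeError as e:
    print("TypeError:", e)
```

The same mechanism is why `np.linalg.svd(x_gpu)` runs on the GPU: NumPy sees the CuPy type and hands the call to CuPy's implementation rather than converting the data.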
|
369 | 369 | "2. Change the setup line to `xp = cp` (GPU Mode). Run it again.\n", |
370 | 370 | "3. Observe how the exact same logic runs significantly faster on the GPU with CuPy while keeping the familiar NumPy semantics.\n", |
371 | 371 | "\n", |
372 | | - "Note: We use `cupyx.profiler.benchmark` for timing, which automatically handles GPU synchronization." |
| 372 | + "Note: We use `cupyx.profiler.benchmark()` for timing, which automatically handles GPU synchronization." |
373 | 373 | ] |
374 | 374 | }, |
375 | 375 | { |
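The `xp = np` / `xp = cp` switch described in the steps above can be sketched as follows. This runnable version stays in CPU mode (`xp = np`) so it works without a GPU; the `normalize` function is a hypothetical example, chosen only because it exercises the shared `xp.*` API:

```python
import numpy as np

# On a machine with a CUDA GPU you would instead write:
#   import cupy as cp
#   xp = cp
xp = np  # CPU mode

def normalize(a):
    """Array-library-agnostic: works for NumPy and CuPy inputs alike,
    because it only calls functions from the shared xp.* namespace."""
    return (a - xp.mean(a)) / xp.std(a)

x = xp.linspace(0.0, 2.0 * xp.pi, 1_000)
y = normalize(xp.sin(x))
print(float(xp.mean(y)), float(xp.std(y)))  # ~0.0 and ~1.0
```

Flipping the single `xp = ...` line moves every array and every operation in the function onto the GPU, with no other code changes.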
|
421 | 421 | "id": "077b7589", |
422 | 422 | "metadata": {}, |
423 | 423 | "source": [ |
424 | | - "**TODO: When working with CuPy arrays, try changing `xp.testing.assert_allclose` to `np.testing.assert_allclose`. What happens and why?**" |
| 424 | + "**TODO: When working with CuPy arrays, try changing `xp.testing.assert_allclose()` to `np.testing.assert_allclose()`. What happens and why?**" |
425 | 425 | ] |
426 | 426 | }, |
427 | 427 | { |
|
436 | 436 | "\n", |
437 | 437 | "**TODO:** \n", |
438 | 438 | "1) **Generate Data:** Create a NumPy array (`y_cpu`) and a CuPy array (`y_gpu`) representing $\\sin(x)$ from $0$ to $2\\pi$ with `50,000,000` points.\n", |
439 | | - "2) **Benchmark CPU and GPU:** Use `benchmark` from `cupyx.profiler` to measure both `np.sort` and `cp.sort`." |
| 439 | + "2) **Benchmark CPU and GPU:** Use `benchmark()` from `cupyx.profiler` to measure both `np.sort()` and `cp.sort()`." |
440 | 440 | ] |
441 | 441 | }, |
442 | 442 | { |
|
462 | 462 | "# Step 2.) Benchmark NumPy (CPU)\n", |
463 | 463 | "print(\"Benchmarking NumPy Sort (this may take a few seconds)...\")\n", |
464 | 464 | "# TODO: Use cpx.profiler.benchmark(function, (args,), n_repeat=5, n_warmup=1)\n", |
465 | | - "# Hint: Pass the function `np.sort` and the argument `(y_cpu,)`\n", |
| 465 | + "# Hint: Pass the function `np.sort()` and the argument `(y_cpu,)`\n", |
466 | 466 | "# Note: The comma in (y_cpu,) is required to make it a tuple!\n", |
467 | 467 | "\n", |
468 | 468 | "\n", |
469 | 469 | "# Step 3.) Benchmark CuPy (GPU)\n", |
470 | 470 | "print(\"Benchmarking CuPy Sort...\")\n", |
471 | 471 | "# TODO: Use cpx.profiler.benchmark(function, (args,), n_repeat=5, n_warmup=1)\n", |
472 | | - "# Hint: Pass the function `cp.sort` and the argument `(y_gpu,)`\n", |
| 472 | + "# Hint: Pass the function `cp.sort()` and the argument `(y_gpu,)`\n", |
473 | 473 | "# Note: The comma in (y_gpu,) is required to make it a tuple!" |
474 | 474 | ] |
475 | 475 | }, |
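For the CPU half of this exercise, a plain `time.perf_counter()` measurement is a reasonable sanity check, since NumPy calls are synchronous. The sketch below uses a smaller array (5 million points rather than the exercise's 50 million) to keep the runtime short; on the GPU side you would still need `cpx.profiler.benchmark()`, because `time.perf_counter()` alone cannot time asynchronous CuPy calls:

```python
import time
import numpy as np

# Smaller than the exercise's 50,000,000 points, to keep this quick.
n = 5_000_000
x = np.linspace(0.0, 2.0 * np.pi, n)
y_cpu = np.sin(x)

start = time.perf_counter()
y_sorted = np.sort(y_cpu)  # synchronous, so wall-clock timing is valid
elapsed = time.perf_counter() - start

print(f"np.sort on {n:,} points: {elapsed:.3f} s")
```

This CPU number is the baseline the benchmarked `cp.sort()` timing gets compared against.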
|
480 | 480 | "id": "qnAvEk5QFAA8" |
481 | 481 | }, |
482 | 482 | "source": [ |
483 | | - "**EXTRA CREDIT: Benchmark with different array sizes and find the size at which CuPy and NumPy take the same amount of time. Try to extract the timing data from `cupyx.profiler.benchmark`'s return value and customize how the output is displayed. You could even make a graph.**" |
| 483 | + "**EXTRA CREDIT: Benchmark with different array sizes and find the size at which CuPy and NumPy take the same amount of time. Try to extract the timing data from `cupyx.profiler.benchmark()`'s return value and customize how the output is displayed. You could even make a graph.**" |
484 | 484 | ] |
485 | 485 | }, |
486 | 486 | { |
|