GPU Deployment Guide

Agenda

Deployment
- NVIDIA Brev
- GPU Software Environment Fundamentals
- Python packages that use CUDA
- Monitoring/debugging tools
- Other platforms

Deployment

This tutorial discusses, in general terms, how to get your own GPUs in the cloud. To dig into the details, we will launch a VM through the NVIDIA Brev portal.

Getting set up with Brev

Sign in to or register an account at https://brev.nvidia.com
Ensure you are a member of an organization
- One should be created for you when you register, but if not it will say "undefined" in the top right
- If you don't have one you can create a new one and give it a name
Apply credits to your organization
- Navigate to Billing
- Select "Redeem Code"

Launching a Brev VM

Under "GPUs" select "New Instance"
Choose a GPU type that costs <$1/hour (e.g. an L4)
Choose GCP as the provider
- There is a known issue with running this material on AWS instances; other providers are untested
Give your VM a name
Press Deploy
Wait for the VM to be "Running" and the software environment to finish "Building"

Connecting to your VM

Once your VM is deployed, follow the Brev access instructions provided for your instance. The connection instructions will vary depending on your operating system. For example, on macOS you would:

Install the brev CLI
- brew install brevdev/homebrew-brev/brev
Log in to your account (copy the command from the access page)
- brev login --token ****
Connect via SSH
- brev ls to list your VMs
- brev shell <your vm name> to connect via SSH

For Linux and Windows instructions check the brev-cli install documentation

Exploring our GPU Software Environment

Let's start by exploring our VM to see what software we got out of the box.

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"

We can check our GPU information by running nvidia-smi.

$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.158.01             Driver Version: 570.158.01     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:00:03.0 Off |                    0 |
| N/A   47C    P8             13W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
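If you would rather read these values from a script than from the table, nvidia-smi can also emit CSV. The sketch below assumes the `--query-gpu` and `--format=csv,noheader,nounits` options and the field names `name`, `driver_version` and `memory.total`. The helper names `parse_gpu_csv` and `query_gpus` are made up for illustration.

```python
import csv
import io
import subprocess

FIELDS = ["name", "driver_version", "memory.total"]


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV output (no header, no units) into a list of dicts."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(FIELDS, (cell.strip() for cell in row))) for row in rows if row]


def query_gpus():
    """Run nvidia-smi and parse its output. Requires the NVIDIA driver."""
    cmd = [
        "nvidia-smi",
        f"--query-gpu={','.join(FIELDS)}",
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


if __name__ == "__main__":
    # Sample line shaped like what an L4 instance might report
    sample = "NVIDIA L4, 570.158.01, 23034\n"
    print(parse_gpu_csv(sample))
```

On the VM you would call `query_gpus()`; `parse_gpu_csv` is kept separate so the parsing can be tried on a machine without a GPU.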

Let's install something

$ which pip
/usr/bin/pip

If you have pip, we can try to install something like cupy

pip install cupy-cuda12x

python3   # Start Python interpreter

import cupy as cp
x_gpu = cp.array([1, 2, 3])
x2 = x_gpu**2

Traceback (most recent call last):
  File "cupy_backends/cuda/_softlink.pyx", line 25, in cupy_backends.cuda._softlink.SoftLink.__init__
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libnvrtc.so.12: cannot open shared object file: No such file or directory

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "cupy/_core/core.pyx", line 1448, in cupy._core.core._ndarray_base.__pow__
  File "cupy/_core/core.pyx", line 1799, in cupy._core.core._ndarray_base.__array_ufunc__
  File "cupy/_core/_kernel.pyx", line 1374, in cupy._core._kernel.ufunc.__call__
  File "cupy/_core/_kernel.pyx", line 1401, in cupy._core._kernel.ufunc._get_ufunc_kernel
  File "cupy/_core/_kernel.pyx", line 1082, in cupy._core._kernel._get_ufunc_kernel
  File "cupy/_core/_kernel.pyx", line 94, in cupy._core._kernel._get_simple_elementwise_kernel
  File "cupy/_core/_kernel.pyx", line 82, in cupy._core._kernel._get_simple_elementwise_kernel_from_code
  File "cupy/_core/core.pyx", line 2375, in cupy._core.core.compile_with_cache
  File "cupy/_core/core.pyx", line 2320, in cupy._core.core.assemble_cupy_compiler_options
  File "cupy_backends/cuda/libs/nvrtc.pyx", line 57, in cupy_backends.cuda.libs.nvrtc.getVersion
  File "cupy_backends/cuda/libs/_cnvrtc.pxi", line 72, in cupy_backends.cuda.libs.nvrtc.initialize
  File "cupy_backends/cuda/libs/_cnvrtc.pxi", line 75, in cupy_backends.cuda.libs.nvrtc._initialize
  File "cupy_backends/cuda/libs/_cnvrtc.pxi", line 153, in cupy_backends.cuda.libs.nvrtc._get_softlink
  File "cupy_backends/cuda/_softlink.pyx", line 32, in cupy_backends.cuda._softlink.SoftLink.__init__
RuntimeError: CuPy failed to load libnvrtc.so.12: OSError: libnvrtc.so.12: cannot open shared object file: No such file or directory

What does this error mean?

This error indicates that CuPy cannot find one of the CUDA libraries it needs. It is looking for libnvrtc.so.12 (NVRTC, the runtime compilation library for CUDA 12), but the library is either not installed or not on the system's library path.

NVRTC is used to JIT (just-in-time) compile CUDA code at runtime. When we run x_gpu**2, CuPy needs NVRTC to dynamically compile a GPU kernel for this operation.

We need the core CUDA libraries in order to run any CUDA code. Often these will be installed at the system level in /usr/local/cuda. Let's check that:

ls -ld /usr/local/cuda*
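The same check can be done from Python, which is handy when you want to confirm what the interpreter itself can see. This is a minimal sketch: the `lib64` and `targets/*/lib` subdirectories are an assumption about how a CUDA Toolkit install is typically laid out, and `find_nvrtc` is a hypothetical helper name.

```python
import ctypes.util
import glob


def find_nvrtc(cuda_glob="/usr/local/cuda*"):
    """Return candidate locations of the NVRTC shared library."""
    candidates = []
    # Ask the dynamic linker first (covers the system library path)
    name = ctypes.util.find_library("nvrtc")
    if name:
        candidates.append(name)
    # Then look inside any CUDA Toolkit installs matching the glob
    for pattern in ("lib64/libnvrtc.so*", "targets/*/lib/libnvrtc.so*"):
        candidates.extend(sorted(glob.glob(f"{cuda_glob}/{pattern}")))
    return candidates


if __name__ == "__main__":
    found = find_nvrtc()
    print(found if found else "libnvrtc not found; the CUDA Toolkit is probably missing")
```

An empty result is consistent with the CuPy error above: there is no libnvrtc for CuPy to load.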

If these are missing we need to decide how to get those dependencies. The way we do this is different depending on whether we want to use pip/uv or conda/pixi for our Python package manager.

Python Software environments

At the moment, when you install CuPy with pip, the package has dependencies on CUDA libraries that aren't available on PyPI. CuPy expects to find these CUDA libraries already installed on your system at /usr/local/cuda or in the system library path. This is why we need to install the CUDA Toolkit separately using the system package manager.

Note

For cupy this will change in the upcoming release. For cudf and cuml this is not an issue. Here we are illustrating how to troubleshoot in case you run into this type of error.

Pip

If we want to install our packages with pip we need to install the CUDA core libraries at the system level to be safe. We can do this on Ubuntu with apt.

Make sure to select the package that matches your system architecture (x86_64, ARM64, etc.) and your specific OS distribution and version. The example below shows the installation for Ubuntu 22.04 on x86_64 (what we have in our Brev instance). For other distributions and architectures, consult the NVIDIA CUDA Installation Guide for Linux.

# Add the NVIDIA apt repo
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

# Install the CUDA Toolkit (specify the CUDA version that matches your driver - check nvidia-smi)
sudo apt-get -y install cuda-toolkit-12-8
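The "CUDA Version" shown in the nvidia-smi header is the newest CUDA version the installed driver supports, so it is a reasonable starting point for picking a toolkit package. The sketch below extracts that field and builds a package name following the `cuda-toolkit-X-Y` pattern used in the command above. Both function names are hypothetical, and whether a given package exists in the apt repo still needs checking.

```python
import re


def cuda_version_from_smi(header):
    """Pull the 'CUDA Version: X.Y' field out of nvidia-smi's header text."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", header)
    return match.group(1) if match else None


def toolkit_package(cuda_version):
    """Map a CUDA version such as '12.8' to an apt package name."""
    match = re.fullmatch(r"(\d+)\.(\d+)", cuda_version.strip())
    if match is None:
        raise ValueError(f"unexpected CUDA version: {cuda_version!r}")
    major, minor = match.groups()
    return f"cuda-toolkit-{major}-{minor}"


if __name__ == "__main__":
    header = "| NVIDIA-SMI 570.158.01   Driver Version: 570.158.01   CUDA Version: 12.8 |"
    version = cuda_version_from_smi(header)
    print(toolkit_package(version))
```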

Now if we try to run our snippet of code again, we see:

>>> import cupy as cp
>>> x_gpu = cp.array([1, 2, 3])
>>> x2 = x_gpu**2
>>> x2
array([1, 4, 9])

Installing more packages

Now that we have our CUDA libraries we can install Python libraries with corresponding versions.

Important

We need to include the CUDA version in the package name due to limitations in the Python packaging spec. See the wheelnext project for plans to solve this in the long term. There is an experimental build of uv that supports wheel variants today.
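To make the naming concrete, here is a small sketch that derives a wheel name from a CUDA version. The suffix styles are taken only from the package names that appear in this guide (`cupy-cuda12x`, `cudf-cu12`); the `cuml` entry assumes cuML follows the same scheme as cuDF, and other projects may use different conventions. `wheel_name` is a hypothetical helper.

```python
# Suffix conventions inferred from the package names used in this guide;
# treat them as assumptions for any other CUDA major version or project.
SUFFIX_STYLES = {
    "cupy": "-cuda{major}x",  # e.g. cupy-cuda12x
    "cudf": "-cu{major}",     # e.g. cudf-cu12
    "cuml": "-cu{major}",     # assumed to match cudf
}


def wheel_name(package, cuda_version):
    """Build the pip package name for a CUDA version such as '12.8'."""
    major = cuda_version.split(".")[0]
    return package + SUFFIX_STYLES[package].format(major=major)


if __name__ == "__main__":
    for pkg in ("cupy", "cudf"):
        print(wheel_name(pkg, "12.8"))
```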

Note

For some packages we need to use a custom index because RAPIDS packages tend to be too large to upload to PyPI. While we can work with PyPI to increase those limits, we can also run our own index and handle the cost of serving those packages. You can check the RAPIDS installation selector to see which packages need the extra index. CUDA packages are large because GPU machine code varies between GPU models in a way that CPU machine code generally does not. To work around this, CUDA packages are built for all common GPUs and the results are bundled together. Further improvements in packaging could help with this in the future.

As of the 25.10 release neither cuDF nor cuML need the extra index. Let's install cudf and do a simple operation.

pip install cudf-cu12

python3  # Start Python interpreter

Then we can import cudf and allocate some GPU memory

import cudf
s = cudf.Series([1, 2, 3, None, 4])
s.apply(lambda x: x+1)

What About uv?

Installation

Install uv following the Astral documentation:

curl -LsSf https://astral.sh/uv/install.sh | sh

Note

You'll need to source your .bashrc to make uv available in your current shell:

source ~/.bashrc

Creating a Test Environment

Let's create a separate directory to experiment with uv. We'll set up a Python 3.12 environment and install cudf:

mkdir sandbox
cd sandbox
uv venv --python 3.12
source .venv/bin/activate
uv pip install "cudf-cu12==25.10.*"

Testing the Installation

Launch the Python interpreter and test with some cuDF code:

import cudf

# Create a cuDF DataFrame
data = {'col1': [1, 2, 3, 4], 'col2': [10, 20, 30, 40]}
df = cudf.DataFrame(data)

# Perform an operation on a DataFrame column
df['col3'] = df['col1'] * df['col2']
df

Important Limitation

When installing nightly or pre-release versions of packages, uv has an all-or-nothing strategy. It requires more explicit configuration when working with nightlies or pre-releases, and failing to do so can generate version conflicts and installation errors that are less common with pip. For more information, see the uv pre-release compatibility documentation.

Let's deactivate this environment and return to our home directory so we can explore some more options

deactivate
cd

Conda

When installing libraries with conda, each individual CUDA library can be installed as a conda package, so we don't need to ensure that any of the CUDA libraries already exist in /usr/local/cuda.

If you prefer to use conda then we need to install it first.

curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"

bash Miniforge3-$(uname)-$(uname -m).sh  # Follow the prompts and choose yes to update your shell profile to automatically initialize conda

Note

You'll need to source your .bashrc to make conda available in your current shell:

source ~/.bashrc

Then we can create a new conda environment with python and cudf.

conda create -n rapids -c rapidsai -c conda-forge -c nvidia cudf python=3.13 'cuda-version>=12.0,<=12.9'

conda activate rapids

Note

You may notice this is much simpler than the pip installation. This is for two reasons:

We don't need to install the CUDA toolkit at the system level because each individual CUDA library is available as a conda package. So cudf can depend on them directly and install the ones it needs.
Conda supports virtual packages which allow the solver to discover additional information about the system such as the CUDA version and then pull in the correct package build for your system.
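You can see what the solver has detected by inspecting conda's own report. The sketch below assumes that `conda info --json` includes a `virtual_pkgs` list whose entries look like `[name, version, build]`, and that the CUDA entry is named `__cuda`; check the output on your own machine, since the exact shape may differ between conda versions. The function names are hypothetical.

```python
import json
import subprocess


def cuda_virtual_package(info):
    """Return the version of the __cuda virtual package, or None if absent."""
    for entry in info.get("virtual_pkgs", []):
        name, version = entry[0], entry[1]
        if name == "__cuda":
            return version
    return None


def conda_info():
    """Run `conda info --json` and decode the result. Requires conda on PATH."""
    out = subprocess.run(
        ["conda", "info", "--json"], capture_output=True, text=True, check=True
    ).stdout
    return json.loads(out)


if __name__ == "__main__":
    # Hand-written sample in the assumed shape, not real conda output
    sample = {"virtual_pkgs": [["__glibc", "2.35", "0"], ["__cuda", "12.8", "0"]]}
    print(cuda_virtual_package(sample))
```

On the VM, `cuda_virtual_package(conda_info())` would report the CUDA version conda is solving against.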

Why are these CUDA packages available on conda-forge but not PyPI?

Historically, Python could only package source distributions (which compile at install time), but NVIDIA doesn't distribute CUDA Toolkit source code. Conda was created as a binary package manager that can package any compiled code. While Python wheels now support binary distributions for pip, they are relatively new and it takes time for the ecosystem to catch up. Conda also provides quality of life improvements like virtual packages (exposing the driver CUDA version to the dependency solver) and optional package constraints, making it currently more mature for complex GPU dependencies.

Then we can import cudf and allocate some GPU memory

import cudf
s = cudf.Series(['a', 'aa', 'b'])
s.apply(lambda x: len(x))

Monitoring and debugging tools

When working with GPUs you need to get visibility into what the device is doing. We can get a whole range of information with nvidia-smi.

# Show high level GPU information
nvidia-smi

# List GPUs
nvidia-smi -L

# Dump detailed information
nvidia-smi -q

NVML

Below nvidia-smi sits NVML (the NVIDIA Management Library), a C library for querying low-level information from the GPU. There are Python bindings if you want to access this data yourself.

pip install nvidia-ml-py  # Package name doesn't match library name. You import it with `import pynvml`

Here are some simple examples of using pynvml:

import pynvml

# Initialize NVML
pynvml.nvmlInit()

# Get the number of GPUs
num_gpus = pynvml.nvmlDeviceGetCount()
print(f"Number of GPUs: {num_gpus}")

# Get a handle to the first GPU
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

# Get the GPU name
gpu_name = pynvml.nvmlDeviceGetName(gpu)
print(f"GPU Name: {gpu_name}")

# Get memory information (convert bytes to GB)
mem_info = pynvml.nvmlDeviceGetMemoryInfo(gpu)
print(f"Memory Used: {mem_info.used / 1e9:.2f} GB")
print(f"Memory Free: {mem_info.free / 1e9:.2f} GB")
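The example above never releases NVML and fails outright on a machine without a driver. A slightly more defensive variant, as a sketch: `sample_gpu` is a hypothetical helper that returns `None` when pynvml or the driver is unavailable, also reads utilization through `nvmlDeviceGetUtilizationRates`, and calls `nvmlShutdown` when it is done.

```python
def sample_gpu(index=0):
    """Return a snapshot of one GPU, or None if NVML is unavailable."""
    try:
        import pynvml

        pynvml.nvmlInit()
    except Exception:
        # pynvml is not installed, or there is no NVIDIA driver on this machine
        return None
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return {
            "gpu_util_percent": util.gpu,
            "memory_used_gb": mem.used / 1e9,
            "memory_total_gb": mem.total / 1e9,
        }
    finally:
        # Release NVML resources once we are done
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    print(sample_gpu())
```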

You can learn more about using the pynvml library in this notebook on the Accelerated Computing Hub.

Jupyter Lab NVDashboard

If you are a fan of Jupyter Lab you can view metrics directly in the interface with jupyterlab-nvdashboard.

Note

Our Brev VM has Jupyter running for us in the system python environment via systemd. We can install NVDashboard in here but we need to ensure we are installing it into the right Python.

# Ensure we are using the base Python
conda deactivate  # If you installed conda deactivate it (you may need to run this more than once)
deactivate
which python3  # Should be /usr/bin/python3

# Install NVDashboard
pip install jupyterlab_nvdashboard
# alternatively /usr/bin/python3 -m pip install jupyterlab_nvdashboard

# Restart jupyter
sudo systemctl restart jupyter

Head back to the Brev dashboard in your browser and click the "Open Notebook" button in the top right corner. You will be asked to authenticate with Brev again to access your notebook.

Note

We've created a few additional Python environments with uv and conda. As a stretch exercise see if you can use ipykernel to register them in Jupyter.

Hint: You will need to follow the Kernels for different environments instructions

nvtop

There are also many great third-party tools out there for inspecting your GPU. One such project is nvtop, a CLI tool for viewing GPU stats.

# Install with apt
sudo apt install nvtop

# Start nvtop
nvtop

Let's create a simple Python script to keep the GPU busy so we can monitor it with nvtop:

import cupy as cp

arr = cp.arange(1, 50_000_000)

while True:
    _ = arr**2

To monitor this with nvtop:

Start running the Python script above
Press Ctrl+Z to suspend the process
Type bg to send it to the background
Run nvtop to monitor GPU activity
Observe the GPU memory usage and utilization percentages in nvtop
Press q to quit nvtop
Type fg to bring your Python process back to the foreground
Press Ctrl+C to stop the Python script
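If you would rather not manage a background job, one alternative is a loop that stops on its own. This sketch takes the array module as a parameter (`xp`, a hypothetical name), so passing `cupy` loads the GPU as in the script above, while passing `numpy` lets you dry-run the loop on the CPU.

```python
import time


def keep_busy(xp, seconds, size=50_000_000):
    """Square an array repeatedly for roughly `seconds`; return the loop count."""
    arr = xp.arange(1, size)
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        _ = arr**2
        iterations += 1
    return iterations


# Example (requires a GPU and CuPy):
#   import cupy as cp
#   keep_busy(cp, seconds=60)
```

Run it in one terminal and start nvtop in a second SSH session to watch utilization while the loop is active.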

cudf.pandas profilers

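Some tools and libraries have built-in profiling tools. For example, the cudf.pandas plugin allows you to profile your code from within Jupyter. The %%cudf.pandas.profile cell magic must be the first line of its own cell, after the extension has been loaded in an earlier cell.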

%load_ext cudf.pandas
import pandas as pd

%%cudf.pandas.profile

small_df = pd.DataFrame({"a": ["0", "1", "2"], "b": ["x", "y", "z"]})
small_df = pd.concat([small_df, small_df])

small_df.min(axis=0)
small_df.min(axis=1)

counts = small_df.groupby("a").b.count()

Nsight Systems and nsys

NVIDIA produces debugging tools which allow you to view low level traces from the GPU kernel execution to find performance bottlenecks.

Typically Python users will run their code with nsys to produce a report, and then open it in Nsight as a local viewer.

As with many debugging tools, we launch Python through nsys rather than running it directly. This will run your code and then output a trace file which you can download and explore locally.

Let's create a script named my_script.py

import cudf.pandas
cudf.pandas.install()

import pandas as pd

small_df = pd.DataFrame({"a": ["0", "1", "2"], "b": ["x", "y", "z"]})
small_df = pd.concat([small_df, small_df])

small_df.min(axis=0)
small_df.min(axis=1)

counts = small_df.groupby("a").b.count()

Now we run the script with nsys

Note

You will need to run this command with sudo in order to access low level GPU metrics. However, this can disrupt which Python environment you have activated. We recommend you run this with either the uv or conda environment activated and use the full path to python in your environment. You can find this easily with which python.

sudo nsys profile \
  --trace cuda,osrt,nvtx \
  --gpu-metrics-devices all \
  --cuda-memory-usage true \
  --force-overwrite true \
  --output profile_run_v1 \
  $(which python) my_script.py
# Will create profile_run_v1.nsys-rep
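Because sudo can change which Python is found, it can help to build the command from inside the environment you want to profile. The sketch below reuses the flags from the command above and fills in the interpreter from `sys.executable`; `nsys_command` is a hypothetical helper, and it only prints the command rather than running it.

```python
import shlex
import sys


def nsys_command(script, output="profile_run_v1", python=sys.executable):
    """Build the nsys invocation from this guide with an explicit interpreter."""
    return [
        "sudo", "nsys", "profile",
        "--trace", "cuda,osrt,nvtx",
        "--gpu-metrics-devices", "all",
        "--cuda-memory-usage", "true",
        "--force-overwrite", "true",
        "--output", output,
        python, script,
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line for the active environment
    print(shlex.join(nsys_command("my_script.py")))
```

Run this with the uv or conda environment activated, then paste the printed command into your shell.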

To visualize the file, we can download it and open it in Nsight Systems.

How do I do all this on "foo" platform?

Now that we've experimented with all of these tools, libraries and debuggers on an Ubuntu VM, the next thing to figure out is how to apply this to your own world. It's likely that you have some opinionated set of hardware/software/platform that you need to use. Perhaps your employer provides you with access to Databricks, Coiled or Snowflake. Or maybe you have cloud access and you use services such as AWS SageMaker, Azure Machine Learning or Google Cloud Vertex AI. Or maybe you have an existing machine or cluster somewhere.

However you get access to GPUs it inevitably falls to you to close the gap between the software provided and the software you need. In our Brev example we got Ubuntu with the NVIDIA driver, but nothing else. On platforms like Snowflake you will get some version of CUDA Toolkit and a few libraries out of the box, but you'll need to figure out how to add the additional things you need.

In RAPIDS we endeavour to document the most commonly used platforms and how to get from their out of the box offering to a fully functional RAPIDS environment.

If you're using something that we haven't documented, you can walk through the various levels we've covered and figure out what you have and what you need. Hopefully you now have the ability to get started anywhere. If you think you're using a platform that we should document, then open an issue.

Previous iterations of this tutorial:

This guide was adapted from: EuroSciPy 2025 | Share Link
