This guide is adapted from garg-aayush's tutorial and updated for Ubuntu 24.04. Note: Sections marked with ö indicate content that has been directly adapted from the original tutorial with minimal modifications.
This article focuses on installing multiple versions of CUDA and cuDNN, and on managing different CUDA versions with Environment Modules. Before diving into the main content, though, consider whether your needs can be met with Docker or Conda. In my case, I needed a specific version of PyTorch, and using the Docker image they provide was much more convenient than installing CUDA myself. I decided to keep using Environment Modules to manage CUDA simply because I don't like leaving things unfinished, and Docker and Environment Modules are not mutually exclusive anyway. If you really do need Environment Modules to manage CUDA, set up a backup and restore solution such as Timeshift before you start; a backup saves you from reinstalling the system if something breaks.
sudo add-apt-repository ppa:graphics-drivers/ppa
ubuntu-drivers devices
It is best to let Ubuntu autodetect and install a compatible NVIDIA driver:
sudo ubuntu-drivers install
Note: Please restart your system after installing the NVIDIA driver. Ideally, you should then be able to get GPU state and stats using nvidia-smi. You can also check which driver package Ubuntu recommends with:
nvidia-detector
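As a quick sanity check after the reboot, the following should run without errors (the exact output depends on your GPU and driver version):
nvidia-smi  # reports the driver version and the status of the detected GPU(s)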
This part mainly follows the installation method from the official website. Visit https://developer.nvidia.com/cuda-toolkit-archive and select your desired CUDA version; clicking on it brings up the official guide. Taking 11.8 as an example, after selecting your architecture and installer type, you'll see the installation instructions:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda-repo-ubuntu2204-11-8-local_11.8.0-520.61.05-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-11-8-local_11.8.0-520.61.05-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-11-8-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda
However, if you are using 24.04, following this guide directly will lead to two issues.
The first issue is that CUDA 11.8 depends on the libtinfo5 library. If you try to install directly, you'll encounter the following error message:
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
nsight-systems-2023.3.3 : Depends: libtinfo5 but it is not installable
E: Unable to correct problems, you have held broken packages.
The solution is to modify the sources list and add the repository from an older Ubuntu release that still ships libtinfo5:
sudo nano /etc/apt/sources.list.d/ubuntu.sources
Append to the end of the file:
Types: deb
URIs: http://old-releases.ubuntu.com/ubuntu/
Suites: lunar
Components: universe
Signed-By: /usr/share/keyrings/ubuntu-archive-keyring.gpg
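After saving the file, refresh the package index so apt can pick up the new repository; you can then confirm that libtinfo5 has become installable:
sudo apt-get update
apt-cache policy libtinfo5  # should now show a candidate version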
The second issue is driver-related: even after libtinfo5 is available, the installation can still fail. I am using Ubuntu 24.04 with kernel 6.14.0-27-generic. When installing the meta-package cuda from a CUDA 11.8 local repo (built for Ubuntu 22.04), apt will try to remove the previously installed 575 driver and install the corresponding 520 driver. However, the 520 driver does not support my current kernel version, resulting in the error:
Error! Bad return status for module build on kernel: 6.14.0-27-generic
The solution is simple: since it is the cuda meta-package that pulls in the drivers, we can just install the toolkit directly. So for the last step in the official guide, we change sudo apt-get -y install cuda to sudo apt-get install cuda-toolkit-11-8.
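In other words, the final step for CUDA 11.8 becomes:
sudo apt-get install cuda-toolkit-11-8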
For the same reason, when installing other CUDA versions, follow the main commands from the official documentation and just replace cuda with cuda-toolkit-xx-x in the final step.
After the installation, you should be able to see the corresponding CUDA version directory under /usr/local/.
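A quick way to confirm (cuda-12.1 below stands in for whatever other versions you have installed):
ls /usr/local  # e.g. cuda-11.8  cuda-12.1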
Download the cuDNN version corresponding to your CUDA from https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/. Note that newer versions are listed at the bottom; I recommend downloading the latest one.
After downloading the corresponding package using wget, use tar to decompress:
tar -xvf cudnn-<***>.tar.xz
Then, simply copy the corresponding folders to the cuda toolkit directory:
cd cudnn-<***>-archive/
sudo cp include/cudnn*.h /usr/local/cuda-**.*/include
sudo cp lib/libcudnn* /usr/local/cuda-**.*/lib64
Note: In newer versions of cuDNN, the location of libcudnn has changed from lib64 to lib. The exact path may vary, so please check using the ls command.
Then make these files readable by all users:
sudo chmod a+r /usr/local/cuda-**.*/include/cudnn*.h /usr/local/cuda-**.*/lib64/libcudnn*
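To double-check that the headers were copied correctly, you can read the version macros out of the header file (this assumes cuDNN 8 or newer, where the version numbers live in cudnn_version.h):
cat /usr/local/cuda-**.*/include/cudnn_version.h | grep CUDNN_MAJOR -A 2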
I find this manual installation method a bit odd, but I haven't found a better solution yet. Still waiting for guidance from experts.
First, install environment-modules:
sudo apt-get update
sudo apt-get install environment-modules
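If the module command is not found in your current shell right after installation, open a new terminal, or source the init script that the package installs (the path below is what it is on my system, so double-check it):
source /etc/profile.d/modules.sh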
Check the installation:
module avail  # check for available modules
# module list shows currently loaded modules; don't confuse the two
You should see something like:
dot module-git module-info modules null use.own
The module names shown above can actually be found in the /usr/share/modules/modulefiles/ directory. Here we'll create a cuda folder inside it to store our own modulefiles.
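For example, the folder can be created with:
sudo mkdir -p /usr/share/modules/modulefiles/cuda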
Create module files corresponding to CUDA versions:
sudo vim /usr/share/modules/modulefiles/cuda/**.* # for example 11.8
Using 11.8 as an example:
#%Module1.0
##
## cuda 11.8 modulefile
##
proc ModulesHelp { } {
    global version
    puts stderr "\tSets up environment for CUDA $version\n"
}

module-whatis "sets up environment for CUDA 11.8"

if { [ is-loaded cuda/12.1 ] } {
    module unload cuda/12.1
}

set version 11.8
set root /usr/local/cuda-11.8

setenv CUDA_HOME $root
prepend-path PATH $root/bin
prepend-path LD_LIBRARY_PATH $root/extras/CUPTI/lib64
prepend-path LD_LIBRARY_PATH $root/lib64

conflict cuda
We can see there's an if statement that automatically unloads the CUDA 12.1 environment if it is loaded. I think this part can be removed, since the conflict cuda declaration at the bottom already handles conflicts; as long as we remember to unload the current environment before loading another one, it should be fine. Otherwise, with three or more environments, would we need an if statement for every other version in every modulefile?
After you have defined modulefiles for each CUDA version, you can use module avail to check which modules are currently available, module load cuda/**** to load a specific module, and module unload cuda/*** (or simply module unload cuda) to clear the environment.
After loading the corresponding environment, you can use nvcc --version to check the CUDA version.
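A typical session then looks something like this (module names assume the modulefiles created above):
module load cuda/11.8
nvcc --version    # should report release 11.8
module unload cuda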
Although for me success only counts once the corresponding PyTorch version is verified to work, I haven't reached that step yet. If I run into any difficulties along the way, I will update this tutorial. That said, I might just give up and use Docker instead.
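For reference, the PyTorch check I have in mind is a one-liner along these lines (it assumes PyTorch is already installed in the active Python environment):
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
If the last value prints True, PyTorch can see the GPU.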