Commit d69f2eb

pzelasko and claude committed
docs: overhaul installation instructions around uv + bring-your-own versions
Harmonize and correct installation docs across README, CLAUDE.md, and the Sphinx install page, and fix stale package-metadata URLs.

- Lead with uv + cu13 as the recommended install; pip is a documented fallback.
- Emphasize bring-your-own Python (>=3.10) / PyTorch (>=2.6) / CUDA: nemo-toolkit only pins torch>=2.6, so a pre-installed PyTorch is kept, not replaced.
- Frame the uv.lock/container combo (Python 3.13, PyTorch 2.12, CUDA 12.6/13.2) as the actively-supported stack, not a hard requirement.
- Document the compiled / compiled-a100 extras (source-built GPU kernels for SpeechLM2 / Automodel: Transformer Engine, FlashAttention, Mamba, grouped-GEMM, DeepEP), including the H100+ vs A100 split and that they build via the Dockerfile.
- Fix broken commands: GPU pip install now shows the required --extra-index-url; test/docs are PEP 735 groups (--group), not extras.
- Correct the Python floor (3.10), torch version (2.12), and clone URL (NVIDIA-NeMo/NeMo); add an NGC container placeholder pending the image.
- Update stale repo URLs to NVIDIA-NeMo/NeMo in pyproject.toml and package_info.py.

Validated installability in Docker (py3.10/3.11/3.12; preinstalled torch 2.6/2.8/official cu124 kept; default + cu13 GPU paths resolve and import).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent d947ef7 commit d69f2eb

5 files changed

Lines changed: 170 additions & 66 deletions

CLAUDE.md

Lines changed: 2 additions & 6 deletions
@@ -8,13 +8,9 @@ NeMo Speech — toolkit for training/deploying speech models (ASR, TTS, Speech L
 
 ## Build & Install
 
-```bash
-pip install -e '.[all]' # Full dev install
-pip install -e '.[asr]' # ASR only
-pip install -e '.[test]' # With test deps
-```
+See the canonical installation guide — [`docs/source/starthere/install.rst`](docs/source/starthere/install.rst) (published at https://docs.nvidia.com/nemo/speech/nightly/) — for the uv, pip (bring-your-own Python/PyTorch/CUDA), Docker, and optional `compiled` (SpeechLM2/Automodel) install paths.
 
-Requires Python 3.10+, PyTorch 2.6+.
+Dev quickstart: `uv sync --extra all --extra cu13` (Python 3.10+, PyTorch 2.6+; `test`/`docs` are `--group`s, not extras).
 
 ## Code Style

README.md

Lines changed: 49 additions & 6 deletions
@@ -49,9 +49,13 @@ For technical documentation, please see the
 
 ## Requirements
 
-- Python 3.12 or above
-- Pytorch 2.6 or above
-- NVIDIA GPU (if you intend to do model training)
+NeMo Speech works with the **Python, PyTorch, and CUDA versions of your choosing**:
+
+- Python 3.10 or above
+- PyTorch 2.6 or above
+- NVIDIA GPU + CUDA (required for training; recommended for inference)
+
+If you already have a Python/PyTorch/CUDA stack, NeMo Speech installs on top of it **without replacing it** — the `nemo-toolkit` package only requires `torch>=2.6`, so your existing PyTorch build is kept (see the install options below). The versions pinned in `uv.lock` and shipped in the official container — Python 3.13, PyTorch 2.12, CUDA 12.6/13.2 — are simply the combination we actively test and support. They make setup turnkey and reproducible, but they are **not** a hard requirement.
 
 As of [Pytorch 2.6](https://docs.pytorch.org/docs/stable/notes/serialization.html#torch-load-with-weights-only-true),
 `torch.load` defaults to using `weights_only=True`. Some model checkpoints may require using `weights_only=False`.
@@ -68,9 +72,48 @@ can have the risk of arbitrary code execution.
 
 ## Install NeMo Speech
 
-NeMo Speech is installable via pip: `pip install 'nemo-toolkit[all]'`
-To install with extra dependencies for CUDA 12.x or 13.x, use `pip install 'nemo-toolkit[all,cu12]'`
-or `pip install 'nemo-toolkit[all,cu13]'` respectively.
+The recommended way to install NeMo Speech is from source with [uv](https://docs.astral.sh/uv/), which reproduces our actively-tested stack from the committed `uv.lock`. If you need different Python/PyTorch/CUDA versions, NeMo also installs over your existing environment via pip — see the [pip fallback](#from-pypi-with-pip-fallback--bring-your-own-versions) below.
+
+### From source with uv (recommended)
+
+```bash
+git clone https://github.com/NVIDIA-NeMo/NeMo.git
+cd NeMo
+uv sync --extra all --extra cu13 # CUDA 13.x (recommended) — use --extra cu12 for CUDA 12.x
+```
+
+This installs our supported stack (Python 3.13, PyTorch 2.12, CUDA 13.2) into `.venv/` with NeMo editable. Add `--group test` for the test suite or `--group docs` to build the docs; run tools via `uv run <cmd>` or activate with `source .venv/bin/activate`. On Linux, `cu12` and `cu13` are mutually exclusive — pass exactly one (`cu13` is the default).
+
+> **SpeechLM2 / Automodel:** the Automodel backend runs **without** any compiled dependencies. It can *optionally* benefit from dedicated accelerated backends (Transformer Engine, FlashAttention, Mamba, grouped-GEMM/MoE, DeepEP) for better performance — these source-built kernels come from the `compiled` (Hopper/Blackwell) or `compiled-a100` (A100) extras, built by `docker/Dockerfile` (`GPU_TARGET=h100plus` / `a100`). See the [installation guide](https://docs.nvidia.com/nemo/speech/nightly/) for the full list and build details.
+
+### Docker (turnkey, our supported stack)
+
+> **NGC container:** _Coming soon — the pull command for the prebuilt NeMo Speech container image will be published here._
+
+To build the container from source (CUDA 13 / H100+ by default):
+
+```bash
+git clone https://github.com/NVIDIA-NeMo/NeMo.git
+cd NeMo
+docker buildx build -f docker/Dockerfile -t nemo-speech .
+```
+
+See the header of [`docker/Dockerfile`](docker/Dockerfile) for CUDA 12 / A100 build arguments (`BASE_IMAGE`, `GPU_TARGET`).
+
+### From PyPI with pip (fallback — bring your own versions)
+
+Prefer your own Python/PyTorch/CUDA? `nemo-toolkit` only requires `torch>=2.6`, so install your PyTorch first (any version ≥ 2.6 for your CUDA — see the [PyTorch install matrix](https://pytorch.org/get-started/locally/)), then add NeMo and it **keeps your build**:
+
+```bash
+pip install nemo_toolkit[asr,tts] # also: [asr,tts,audio], [speechlm2], etc.
+```
+
+To have pip install our pinned PyTorch build instead, add the CUDA extra and the matching wheel index (pip does not read uv's index configuration, so `--extra-index-url` is required):
+
+```bash
+pip install 'nemo-toolkit[asr,tts,cu13]' --extra-index-url https://download.pytorch.org/whl/cu132 # CUDA 13.x
+pip install 'nemo-toolkit[asr,tts,cu12]' --extra-index-url https://download.pytorch.org/whl/cu126 # CUDA 12.x
+```
 
 ## Contribute to NeMo

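Review note on the README hunk: the bring-your-own claim rests on `nemo-toolkit` requiring only `torch>=2.6`. A minimal sketch of how a user might check an existing environment against that floor before running the pip fallback follows. The helper names (`parse_major_minor`, `meets_torch_floor`, `installed_torch_version`) are hypothetical and not part of NeMo, and the version parsing is deliberately simplified to the leading `major.minor` pair.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# Floor quoted in the README hunk above; an assumption to adjust if the requirement changes.
TORCH_FLOOR = (2, 6)


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '2.8.0+cu126'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_torch_floor(version: str, floor: Tuple[int, int] = TORCH_FLOOR) -> bool:
    """Compare numerically, so that 2.12 correctly ranks above 2.6."""
    return parse_major_minor(version) >= floor


def installed_torch_version() -> Optional[str]:
    """Return the installed torch distribution version, or None if torch is absent."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_torch_version()
    if found is None:
        print("torch is not installed")
    elif meets_torch_floor(found):
        print(f"torch {found} meets the >=2.6 floor")
    else:
        print(f"torch {found} is below the >=2.6 floor")
```

The numeric tuple comparison matters here: comparing the strings `"2.12"` and `"2.6"` lexically would wrongly rank the pinned PyTorch 2.12 below the floor.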
docs/source/starthere/install.rst

Lines changed: 115 additions & 50 deletions
@@ -8,60 +8,51 @@ This page covers how to install NVIDIA NeMo for speech AI tasks (ASR, TTS, speak
 Prerequisites
 -------------
 
-Before installing NeMo, ensure you have:
+NeMo Speech works with the **Python, PyTorch, and CUDA versions of your choosing**:
 
-#. **Python** 3.12 or above
-#. **PyTorch** 2.7+ (install **before** NeMo so CUDA wheels match your GPU driver)
-#. **NVIDIA GPU** (required for training; CPU-only inference is possible but slow)
+#. **Python** 3.10 or above
+#. **PyTorch** 2.6 or above
+#. **NVIDIA GPU + CUDA** (required for training; CPU-only inference is possible but slow)
 
-Recommended installation order
-------------------------------
+.. admonition:: Bring your own Python / PyTorch / CUDA
+    :class: important
 
-Install dependencies in this order when setting up a **local GPU** environment:
+    The recommended install path is uv (below), which gives you our actively-tested stack. But NeMo Speech can also install *on top of* an existing environment: the ``nemo-toolkit`` package only requires ``torch>=2.6``, so if you already have a Python, PyTorch, and CUDA stack, your pre-installed PyTorch is **kept, not replaced** (see :ref:`the pip fallback <install-from-pypi>`).
 
-#. Create and activate a Python environment.
-#. Install a **CUDA toolkit** (or rely on a driver + PyTorch bundle that matches your CUDA major version).
-#. Install **PyTorch** (and torchvision if you need it) from the index that matches your CUDA build.
-#. Install **NeMo** (from PyPI or editable source) **with the extras** for the collections you need (``asr``, ``tts``, etc.).
+    The versions pinned in ``uv.lock`` and shipped in the official container — **Python 3.13, PyTorch 2.12, CUDA 12.6/13.2** — are simply the combination we actively test and support. They make setup turnkey and reproducible, but they are **not** a hard requirement.
 
-Putting PyTorch in place first avoids mismatched CUDA runtimes and makes NeMo’s optional GPU-dependent packages resolve correctly.
+.. note::
 
-**Example (conda + pip, CUDA 13.0 PyTorch wheels):**
+    As of `PyTorch 2.6 <https://docs.pytorch.org/docs/stable/notes/serialization.html#torch-load-with-weights-only-true>`_, ``torch.load`` defaults to ``weights_only=True``. Some checkpoints require ``weights_only=False``; in that case set ``TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1`` before loading, and only with trusted files (loading untrusted files with full pickle support risks arbitrary code execution).
 
-.. code-block:: bash
-
-    # 1) New environment (adjust Python version if your platform requires it)
-    conda create -n nemo python=3.12 -y
-    conda activate nemo
-
-    # 2) CUDA toolkit from conda (optional if you already have a compatible toolkit via the driver)
-    conda install nvidia::cuda-toolkit
+.. _install-from-source:
 
-    # 3) PyTorch built for CUDA 13.x — change cu130 / URL if you use cu124 or CPU-only
-    pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130
+Install from Source with uv (recommended)
+------------------------------------------
 
-    # 4) NeMo: use extras for ASR/TTS/etc. For a clone of the repo, use editable install (see below)
-    pip install nemo_toolkit[asr,tts]
+The recommended way to install NeMo Speech is from source with `uv <https://docs.astral.sh/uv/>`_, which reproduces our actively-tested stack from the committed ``uv.lock``:
 
-Adjust the PyTorch ``--index-url`` (e.g. ``cu124``, ``cu121``, or CPU) to match `PyTorch’s install matrix <https://pytorch.org/get-started/locally/>`_ and your NVIDIA driver.
+.. code-block:: bash
 
-Install from PyPI
------------------
+    git clone https://github.com/NVIDIA-NeMo/NeMo.git
+    cd NeMo
 
-The quickest way to install NeMo is via pip. Install only the collections you need:
+    # CUDA 13.x (recommended). Use --extra cu12 for CUDA 12.x. uv resolves the
+    # matching PyTorch CUDA wheel automatically from the pinned indexes.
+    uv sync --extra all --extra cu13
 
-.. code-block:: bash
+    # Optional: add the test suite tooling, or the docs build dependencies
+    # uv sync --extra all --extra cu13 --group test
+    # uv sync --group docs
 
-    # Install ASR and TTS (most common)
-    pip install nemo_toolkit[asr,tts]
+``uv sync`` creates a virtual environment in ``.venv/`` with NeMo installed in editable mode, matching our supported stack (Python 3.13, PyTorch 2.12, CUDA 13.2 by default). Run commands with ``uv run <cmd>`` or activate the environment with ``source .venv/bin/activate``.
 
-    # Install everything speech-related
-    pip install nemo_toolkit[asr,tts,audio]
+On Linux, pass exactly one of ``--extra cu13`` (recommended) or ``--extra cu12`` — they are mutually exclusive. If you omit both, uv installs the generic PyPI PyTorch wheel instead of NVIDIA's CUDA-matched build.
 
-Available extras:
+Available collection extras (combine with one CUDA extra above):
 
 .. list-table::
-    :widths: 15 85
+    :widths: 18 82
     :header-rows: 1
 
     * - Extra
@@ -72,32 +63,106 @@ Available extras:
       - Text-to-Speech models, vocoders, and audio codecs
     * - ``audio``
       - Audio processing models (enhancement, separation)
+    * - ``speechlm2``
+      - Speech language models (includes NeMo Automodel)
+    * - ``all``
+      - All of the collections above
+    * - ``cu12`` / ``cu13``
+      - Our pinned CUDA 12.x / 13.x PyTorch build (Linux; pick at most one)
 
-.. _install-from-source:
+.. note::
 
-Install from Source
--------------------
+    ``test`` and ``docs`` are dependency *groups* (PEP 735), not extras. Install them with ``--group`` (e.g. ``uv sync --group test``) — the bracket form ``.[test]`` does not work.
+
+.. _install-compiled-extras:
+
+Optional compiled dependencies for SpeechLM2 / Automodel (``compiled`` / ``compiled-a100``)
+-------------------------------------------------------------------------------------------
+
+The Automodel backend used for SpeechLM2 **does not require any compiled dependencies — it runs without them.** The ``compiled`` and ``compiled-a100`` extras are an *optional* performance add-on: when their source-built GPU kernels are installed, Automodel can route to dedicated accelerated backends (FP8 Transformer kernels via Transformer Engine, FlashAttention, Mamba/state-space layers, and Mixture-of-Experts ops). They contain:
+
+.. list-table::
+    :widths: 30 70
+    :header-rows: 1
+
+    * - Package
+      - Purpose
+    * - ``transformer-engine``
+      - NVIDIA Transformer Engine — FP8 and accelerated Transformer kernels
+    * - ``flash-attn``
+      - FlashAttention attention kernels
+    * - ``mamba-ssm`` + ``causal-conv1d``
+      - Mamba / state-space-model kernels (hybrid Mamba architectures)
+    * - ``nv-grouped-gemm``
+      - Grouped GEMM kernels for Mixture-of-Experts (MoE) layers
+    * - ``deep_ep`` (DeepEP)
+      - Expert-parallel communication kernels for MoE (``compiled`` only — see below)
+    * - ``onnx-ir`` + ``onnxscript``
+      - Pinned ONNX export tooling
+
+Choose the variant that matches your GPU (the two are mutually exclusive):
+
+* ``compiled`` — Hopper/Blackwell and newer (SM90/SM100/SM120, e.g. H100/H200/B200). Includes DeepEP.
+* ``compiled-a100`` — Ampere A100 (SM80). Omits DeepEP, which requires a separately-built, patched version on A100.
+
+.. warning::
 
-For the latest development version or if you plan to contribute, clone the repository and install in editable mode.
+    These packages **build from source** and need a full CUDA build environment — build tools, matching ``TORCH_CUDA_ARCH_LIST`` / ``NVTE_CUDA_ARCHS`` flags, ``--no-build-isolation``, and (for ``compiled``) extra manual build steps that the Dockerfile performs (e.g. flash-attn-4 and DeepEP patches). The supported, reproducible way to get them is the container build, which sets all of this up for you:
 
-The ``test`` extra pulls in **pytest and tooling for the test suite**. It does **not** install NeMo collection dependencies (ASR, TTS, audio, etc.). Add those extras explicitly or imports like ``nemo.collections.asr`` will fail.
+    .. code-block:: bash
+
+        # Hopper/Blackwell (default GPU_TARGET=h100plus → compiled)
+        docker buildx build -f docker/Dockerfile -t nemo-speech .
+
+        # Ampere A100 (GPU_TARGET=a100 → compiled-a100)
+        docker buildx build -f docker/Dockerfile \
+        --build-arg BASE_IMAGE=nvcr.io/nvidia/cuda-dl-base:25.06-cuda12.9-devel-ubuntu24.04 \
+        --build-arg GPU_TARGET=a100 -t nemo-speech .
+
+    A bare ``uv sync --extra all --extra cu13 --extra compiled`` outside this environment will likely fail to compile.
+
+Using Docker (turnkey, our supported stack)
+--------------------------------------------
+
+.. note::
+
+    **NGC container:** *Coming soon — the pull command for the prebuilt NeMo Speech container image will be published here.*
+
+To build the container from source, use the provided ``docker/Dockerfile`` (CUDA 13 / H100+ by default):
 
 .. code-block:: bash
 
-    git clone https://github.com/NVIDIA/NeMo.git
+    git clone https://github.com/NVIDIA-NeMo/NeMo.git
     cd NeMo
+    docker buildx build -f docker/Dockerfile -t nemo-speech .
 
-    # After PyTorch is installed (see Recommended installation order above):
-    # Collections you need for development (required for nemo.collections.* imports)
-    pip install -e '.[asr,tts]'
+See the header of ``docker/Dockerfile`` for CUDA 12 / A100 build arguments (``BASE_IMAGE``, ``GPU_TARGET``).
 
-    # Optional: add test to run pytest with NeMo’s dev test dependencies
-    # pip install -e '.[asr,tts,test]'
+.. _install-from-pypi:
 
-Using Docker
-------------
+Install from PyPI with pip (fallback — bring your own versions)
+---------------------------------------------------------------
+
+Prefer your own Python/PyTorch/CUDA? Install your preferred PyTorch first (any version ≥ 2.6, built for your CUDA — see `PyTorch's install matrix <https://pytorch.org/get-started/locally/>`_), then add NeMo with the collections you need. Because ``nemo-toolkit`` only requires ``torch>=2.6``, your pre-installed PyTorch is kept, not replaced:
+
+.. code-block:: bash
+
+    # 1) Your choice of PyTorch (example: CUDA 12.6 build). Skip if you already have one.
+    pip install torch --index-url https://download.pytorch.org/whl/cu126
+
+    # 2) NeMo — your PyTorch above is kept
+    pip install nemo_toolkit[asr,tts] # also: [asr,tts,audio], [speechlm2], etc.
+
+To have pip install our pinned PyTorch build instead, add the matching CUDA extra **and** the PyTorch wheel index. pip does not read uv's index configuration, so the ``--extra-index-url`` is required:
+
+.. code-block:: bash
+
+    pip install 'nemo-toolkit[asr,tts,cu13]' --extra-index-url https://download.pytorch.org/whl/cu132 # CUDA 13.x
+    pip install 'nemo-toolkit[asr,tts,cu12]' --extra-index-url https://download.pytorch.org/whl/cu126 # CUDA 12.x
+
+.. tip::
 
-NVIDIA provides Docker containers with NeMo pre-installed. Check the `NeMo GitHub releases <https://github.com/NVIDIA/NeMo/releases>`_ for the latest container tags.
+    Prefer a conda environment? Create and activate one (``conda create -n nemo python=3.10 -y && conda activate nemo``), then run the same ``uv`` or ``pip`` commands above inside it. NeMo Speech does not require a separate conda CUDA toolkit or a manual ``torchvision`` install.
 
 Verify Installation
 -------------------

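Review note on the install page hunks: the new text states three rules that are easy to get wrong on the command line. `test` and `docs` are PEP 735 groups passed with `--group`, `cu12` and `cu13` are mutually exclusive, and so are `compiled` and `compiled-a100`. The sketch below encodes those rules in a small, hypothetical helper (`uv_sync_args` is not part of NeMo or uv); the name sets are copied from the diff and are assumptions about the project's `pyproject.toml`, which is not shown in full here.

```python
from typing import Iterable, List

# Names taken from the install page diff; the project's real metadata may differ.
COLLECTION_EXTRAS = {"asr", "tts", "audio", "speechlm2", "all"}
CUDA_EXTRAS = {"cu12", "cu13"}                    # mutually exclusive on Linux
COMPILED_EXTRAS = {"compiled", "compiled-a100"}   # mutually exclusive GPU variants
DEPENDENCY_GROUPS = {"test", "docs"}              # PEP 735 groups, passed with --group

# Dockerfile GPU_TARGET values and the compiled extra each one selects, per the diff.
GPU_TARGET_TO_EXTRA = {"h100plus": "compiled", "a100": "compiled-a100"}


def uv_sync_args(extras: Iterable[str], groups: Iterable[str] = ()) -> List[str]:
    """Build an argument list for `uv sync`, rejecting invalid combinations."""
    extras, groups = list(extras), list(groups)
    known_extras = COLLECTION_EXTRAS | CUDA_EXTRAS | COMPILED_EXTRAS

    for name in extras:
        if name in DEPENDENCY_GROUPS:
            raise ValueError(f"{name!r} is a dependency group; use --group, not --extra")
        if name not in known_extras:
            raise ValueError(f"unknown extra: {name!r}")
    for name in groups:
        if name not in DEPENDENCY_GROUPS:
            raise ValueError(f"unknown dependency group: {name!r}")

    # Enforce "pick at most one" within each exclusive family.
    for exclusive in (CUDA_EXTRAS, COMPILED_EXTRAS):
        chosen = sorted(exclusive.intersection(extras))
        if len(chosen) > 1:
            raise ValueError(f"pick at most one of {sorted(exclusive)}, got {chosen}")

    args = ["uv", "sync"]
    for name in extras:
        args += ["--extra", name]
    for name in groups:
        args += ["--group", name]
    return args


if __name__ == "__main__":
    print(" ".join(uv_sync_args(["all", "cu13"], groups=["test"])))
```

Calling `uv_sync_args(["all", "cu13"], groups=["test"])` yields the argument list for the commented-out test command in the hunk above, while passing `"test"` as an extra or combining `cu12` with `cu13` raises `ValueError`.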
nemo/package_info.py

Lines changed: 2 additions & 2 deletions
@@ -28,8 +28,8 @@
 __contact_names__ = "NVIDIA"
 __contact_emails__ = "nemo-toolkit@nvidia.com"
 __homepage__ = "https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/"
-__repository_url__ = "https://github.com/nvidia/nemo"
-__download_url__ = "https://github.com/NVIDIA/NeMo/releases"
+__repository_url__ = "https://github.com/NVIDIA-NeMo/NeMo"
+__download_url__ = "https://github.com/NVIDIA-NeMo/NeMo/releases"
 __description__ = "NeMo - a toolkit for Conversational AI"
 __license__ = "Apache2"
 __keywords__ = "deep learning, machine learning, gpu, NLP, NeMo, nvidia, pytorch, torch, tts, speech, language"

pyproject.toml

Lines changed: 2 additions & 2 deletions
@@ -355,8 +355,8 @@ py-modules = ["nemo"]
 nemo_speechlm = "nemo.collections.speechlm2.vllm.salm:register"
 
 [project.urls]
-Download = "https://github.com/NVIDIA/NeMo/releases"
-Homepage = "https://github.com/nvidia/nemo"
+Download = "https://github.com/NVIDIA-NeMo/NeMo/releases"
+Homepage = "https://github.com/NVIDIA-NeMo/NeMo"
 
 [tool.isort]
 profile = "black" # black-compatible

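Review note on the metadata hunks: both files carried the old `NVIDIA/NeMo` URL in two different letter cases, which is the kind of drift a plain case-sensitive search misses. The sketch below shows one way to find and rewrite such URLs; the function names are hypothetical, the regular expression is an assumption about what counts as stale, and it is not tooling from the repository.

```python
import re
from typing import List, Tuple

CANONICAL_REPO_URL = "https://github.com/NVIDIA-NeMo/NeMo"

# Matches github.com/NVIDIA/NeMo in any letter case. The current URL does not match
# because "NVIDIA-NeMo" has a hyphen where this pattern expects "/". The lookahead
# skips sibling repositories whose names merely start with "NeMo".
STALE_REPO_URL = re.compile(r"https://github\.com/nvidia/nemo(?![\w-])", re.IGNORECASE)


def find_stale_urls(text: str) -> List[Tuple[int, str]]:
    """Return (line number, matched URL) pairs for every stale repository URL."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in STALE_REPO_URL.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


def rewrite_stale_urls(text: str) -> str:
    """Replace stale repository URLs with the canonical one, keeping any path suffix."""
    return STALE_REPO_URL.sub(CANONICAL_REPO_URL, text)


if __name__ == "__main__":
    sample = (
        'Download = "https://github.com/NVIDIA/NeMo/releases"\n'
        'Homepage = "https://github.com/nvidia/nemo"\n'
    )
    for lineno, url in find_stale_urls(sample):
        print(f"line {lineno}: {url}")
    print(rewrite_stale_urls(sample), end="")
```

Running the rewrite a second time leaves the text unchanged, since the canonical URL no longer matches the pattern.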