docs: extend install consistency sweep + clarify A100 CUDA support

pzelasko · claude · pzelasko · commit dee9c272d0a3 · 2026-06-08T15:20:07.000-07:00
Incorporate the useful parts of a parallel install-docs review and apply a
broader consistency pass:

- Distinguish uv sync --locked (exact supported baseline; add --python 3.13)
  from uv pip / pip (bring-your-own), with a warning not to use uv sync --locked
  for BYO. Offer uv pip alongside pip for the fallback path.
- Clarify A100: works with BOTH CUDA 12 and CUDA 13 — CUDA 13 (default base
  image) recommended, CUDA 12 base offered only as a convenience.
- Broaden PyTorch targets to CPU/CUDA/ROCm/Apple Silicon; note cu12/cu13 also
  add the matching CUDA Python deps (cuda-python, numba-cuda).
- Route scattered pages to the canonical install guide via :ref:`installation`
  (g2p, magpietts-finetuning, nemo_forced_aligner) and modernize index.rst /
  speechlm2/intro.rst snippets; add a docker run example and a lighter
  import-only verify step.
- Align docs build with CI (uv sync --locked --group docs; uv run make linkcheck);
  prune the now-fixed nemo_forced_aligner entry from the broken-links list.
- Normalize stale install references in the model-card template, NFA tool docs,
  and runtime error messages (nemo-toolkit name; NVIDIA-NeMo/NeMo clone URL).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -45,7 +45,7 @@ Markers: `unit`, `integration`, `system`, `pleasefixme` (broken — skip), `skip
 Sphinx-based docs live in `docs/source/`. Build with:
 
 ```bash
-uv sync --group docs                                 # one-time setup
+uv sync --locked --group docs                        # one-time setup (matches CI)
 uv run make -C docs clean html                       # full rebuild
 uv run make -C docs html                             # incremental rebuild
 ```
diff --git a/README.md b/README.md
@@ -52,7 +52,7 @@ For technical documentation, please see the
 NeMo Speech works with the **Python, PyTorch, and CUDA versions of your choosing**:
 
 - Python 3.10 or above
-- PyTorch 2.6 or above
+- PyTorch 2.6 or above (CPU, CUDA, ROCm, or Apple Silicon build — your choice)
 - NVIDIA GPU + CUDA (required for training; recommended for inference)
 
 If you already have a Python/PyTorch/CUDA stack, NeMo Speech installs on top of it **without replacing it** — the `nemo-toolkit` package only requires `torch>=2.6`, so your existing PyTorch build is kept (see the install options below). The versions pinned in `uv.lock` and shipped in the official container — Python 3.13, PyTorch 2.12, CUDA 12.6/13.2 — are simply the combination we actively test and support. They make setup turnkey and reproducible, but they are **not** a hard requirement.
@@ -82,7 +82,7 @@ cd NeMo
 uv sync --extra all --extra cu13     # CUDA 13.x (recommended) — use --extra cu12 for CUDA 12.x
 ```
 
-This installs our supported stack (Python 3.13, PyTorch 2.12, CUDA 13.2) into `.venv/` with NeMo editable. Add `--group test` for the test suite or `--group docs` to build the docs; run tools via `uv run <cmd>` or activate with `source .venv/bin/activate`. On Linux, `cu12` and `cu13` are mutually exclusive — pass exactly one (`cu13` is the default).
+This installs our supported stack (Python 3.13, PyTorch 2.12, CUDA 13.2) into `.venv/` with NeMo editable. Add `--group test` for the test suite or `--group docs` to build the docs; run tools via `uv run <cmd>` or activate with `source .venv/bin/activate`. On Linux, `cu12` and `cu13` are mutually exclusive — pass exactly one (`cu13` is the default). For the **exact** container baseline, add `--locked --python 3.13` (the path the Dockerfile and CI use).
 
 > **SpeechLM2 / Automodel:** the Automodel backend runs **without** any compiled dependencies. It can *optionally* benefit from dedicated accelerated backends (Transformer Engine, FlashAttention, Mamba, grouped-GEMM/MoE, DeepEP) for better performance — these source-built kernels come from the `compiled` (Hopper/Blackwell) or `compiled-a100` (A100) extras, built by `docker/Dockerfile` (`GPU_TARGET=h100plus` / `a100`). See the [installation guide](https://docs.nvidia.com/nemo/speech/nightly/) for the full list and build details.
 
@@ -95,20 +95,23 @@ To build the container from source (CUDA 13 / H100+ by default):
 ```bash
 git clone https://github.com/NVIDIA-NeMo/NeMo.git
 cd NeMo
-docker buildx build -f docker/Dockerfile -t nemo-speech .
+docker buildx build -f docker/Dockerfile -t nemo-speech .          # CUDA 13 / H100+ (default)
+docker run --rm -it --gpus all -v "$PWD:/workspace" nemo-speech bash
 ```
 
-See the header of [`docker/Dockerfile`](docker/Dockerfile) for CUDA 12 / A100 build arguments (`BASE_IMAGE`, `GPU_TARGET`).
+For A100, set `GPU_TARGET=a100` — A100 works with **both CUDA 12 and CUDA 13** (CUDA 13, the default base image, is recommended; the CUDA 12 base is a convenience). See the header of [`docker/Dockerfile`](docker/Dockerfile) for all build arguments (`BASE_IMAGE`, `GPU_TARGET`).
 
 ### From PyPI with pip (fallback — bring your own versions)
 
-Prefer your own Python/PyTorch/CUDA? `nemo-toolkit` only requires `torch>=2.6`, so install your PyTorch first (any version ≥ 2.6 for your CUDA — see the [PyTorch install matrix](https://pytorch.org/get-started/locally/)), then add NeMo and it **keeps your build**:
+Prefer your own Python/PyTorch/CUDA? `nemo-toolkit` only requires `torch>=2.6`, so install your PyTorch first (any version ≥ 2.6 for your CPU/CUDA/ROCm/Apple Silicon target — see the [PyTorch install matrix](https://pytorch.org/get-started/locally/)), then add NeMo and it **keeps your build**. `uv pip` (uv's fast, pip-compatible installer) works like `pip`:
 
 ```bash
-pip install nemo_toolkit[asr,tts]      # also: [asr,tts,audio], [speechlm2], etc.
+uv pip install 'nemo-toolkit[asr,tts]'   # or plain: pip install 'nemo-toolkit[asr,tts]'
 ```
 
-To have pip install our pinned PyTorch build instead, add the CUDA extra and the matching wheel index (pip does not read uv's index configuration, so `--extra-index-url` is required):
+> ⚠️ Do **not** use `uv sync --locked` for a bring-your-own stack — it applies `uv.lock` and replaces your Python/PyTorch/CUDA with the supported baseline. Use `uv pip`/`pip` here; reserve `uv sync --locked` for reproducing our stack.
+
+To instead pull *our* pinned PyTorch build, add the CUDA extra and the matching wheel index (pip/uv pip do not read uv's project index config, so `--extra-index-url` is required):
 
 ```bash
 pip install 'nemo-toolkit[asr,tts,cu13]' --extra-index-url https://download.pytorch.org/whl/cu132   # CUDA 13.x
diff --git a/docs/README.md b/docs/README.md
@@ -2,12 +2,10 @@
 
 ## Building the Documentation
 
-1. Create and activate a virtual environment.
-
-1. Install the documentation dependencies:
+1. Install the documentation dependencies into the locked `uv` environment:
 
    ```console
-   $ uv sync --group docs
+   $ uv sync --locked --group docs
    ```
 
 1. Build the documentation:
@@ -21,7 +19,7 @@
 1. Build the documentation, as described in the preceding section, but use the following command:
 
    ```shell
-   make -C docs clean linkcheck
+   uv run make -C docs clean linkcheck
    ```
 
 1. Run the link-checking script:
diff --git a/docs/source/broken_links_needing_review..json b/docs/source/broken_links_needing_review..json
@@ -6,14 +6,6 @@
   "uri": "https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/optimizer/optimizer.py#L793",
   "info": "Anchor 'L793' not found"
 }
-{
-  "filename": "tools/nemo_forced_aligner.rst",
-  "lineno": 22,
-  "status": "broken",
-  "code": 0,
-  "uri": "https://github.com/NVIDIA/NeMo#installation",
-  "info": "Anchor 'installation' not found"
-}
 {
   "filename": "checkpoints/intro.rst",
   "lineno": 28,
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -57,11 +57,11 @@ What is NeMo?
 - **Scalable training** — multi-GPU/multi-node via PyTorch Lightning with mixed-precision support
 - **Simple configuration** — YAML-based experiment configs with `Hydra <https://hydra.cc/>`__
 
-Get started in 30 seconds:
+Get started (install the PyTorch build for your platform first):
 
 .. code-block:: bash
 
-   pip install nemo_toolkit[asr,tts]
+   uv pip install 'nemo-toolkit[asr,tts]'
 
 .. code-block:: python
 
diff --git a/docs/source/speechlm2/intro.rst b/docs/source/speechlm2/intro.rst
@@ -5,7 +5,9 @@ SpeechLM2
    The SpeechLM2 collection is still in active development and the code is likely to keep changing.
 
 .. note::
-   Install with ``pip install nemo-toolkit[speechlm2]`` to get all required dependencies including NeMo Automodel.
+   Install your chosen compatible PyTorch stack first, then install SpeechLM2 with
+   ``uv pip install 'nemo-toolkit[speechlm2]'`` (or, from a source checkout, ``uv pip install -e '.[speechlm2]'``)
+   to get all required dependencies including NeMo Automodel. See :ref:`installation` for details.
 
 SpeechLM2 refers to a collection that augments pre-trained Large Language Models (LLMs) with speech understanding and generation capabilities.
 
diff --git a/docs/source/starthere/install.rst b/docs/source/starthere/install.rst
@@ -11,8 +11,9 @@ Prerequisites
 NeMo Speech works with the **Python, PyTorch, and CUDA versions of your choosing**:
 
 #. **Python** 3.10 or above
-#. **PyTorch** 2.6 or above
+#. **PyTorch** 2.6 or above, for your chosen target (CPU, CUDA, ROCm, or Apple Silicon)
 #. **NVIDIA GPU + CUDA** (required for training; CPU-only inference is possible but slow)
+#. **uv** for the fastest source/PyPI workflow (``pip`` also works in a prepared environment)
 
 .. admonition:: Bring your own Python / PyTorch / CUDA
    :class: important
@@ -45,7 +46,7 @@ The recommended way to install NeMo Speech is from source with `uv <https://docs
    # uv sync --extra all --extra cu13 --group test
    # uv sync --group docs
 
-``uv sync`` creates a virtual environment in ``.venv/`` with NeMo installed in editable mode, matching our supported stack (Python 3.13, PyTorch 2.12, CUDA 13.2 by default). Run commands with ``uv run <cmd>`` or activate the environment with ``source .venv/bin/activate``.
+``uv sync`` creates a virtual environment in ``.venv/`` with NeMo installed in editable mode, matching our supported stack (Python 3.13, PyTorch 2.12, CUDA 13.2 by default). Run commands with ``uv run <cmd>`` or activate the environment with ``source .venv/bin/activate``. For the **exact** container baseline, add ``--locked --python 3.13`` (i.e. ``uv sync --locked --python 3.13 --extra all --extra cu13``) — this is the path the Dockerfile and CI use.
 
 On Linux, pass exactly one of ``--extra cu13`` (recommended) or ``--extra cu12`` — they are mutually exclusive. If you omit both, uv installs the generic PyPI PyTorch wheel instead of NVIDIA's CUDA-matched build.
 
@@ -68,7 +69,7 @@ Available collection extras (combine with one CUDA extra above):
    * - ``all``
      - All of the collections above
    * - ``cu12`` / ``cu13``
-     - Our pinned CUDA 12.x / 13.x PyTorch build (Linux; pick at most one)
+     - Our pinned CUDA 12.x / 13.x PyTorch build **plus** the matching CUDA Python deps (``cuda-python``, ``numba-cuda``). Linux; pick at most one.
 
 .. note::
 
@@ -134,26 +135,46 @@ To build the container from source, use the provided ``docker/Dockerfile`` (CUDA
 
    git clone https://github.com/NVIDIA-NeMo/NeMo.git
    cd NeMo
-   docker buildx build -f docker/Dockerfile -t nemo-speech .
+   docker buildx build -f docker/Dockerfile -t nemo-speech .          # CUDA 13 / H100+ (default)
+   docker run --rm -it --gpus all -v "$PWD:/workspace" nemo-speech bash
 
-See the header of ``docker/Dockerfile`` for CUDA 12 / A100 build arguments (``BASE_IMAGE``, ``GPU_TARGET``).
+For A100, set ``GPU_TARGET=a100``. A100 works with **both CUDA 12 and CUDA 13** — CUDA 13 (the default base image) is recommended; the CUDA 12 base is offered only as a convenience:
+
+.. code-block:: bash
+
+   # A100 on CUDA 13 (recommended) — uses the default CUDA 13 base image
+   docker buildx build -f docker/Dockerfile --build-arg GPU_TARGET=a100 -t nemo-speech:a100 .
+
+   # A100 on CUDA 12 (convenience)
+   docker buildx build -f docker/Dockerfile \
+     --build-arg BASE_IMAGE=nvcr.io/nvidia/cuda-dl-base:25.06-cuda12.9-devel-ubuntu24.04 \
+     --build-arg GPU_TARGET=a100 -t nemo-speech:a100-cu12 .
+
+See the header of ``docker/Dockerfile`` for all build arguments (``BASE_IMAGE``, ``GPU_TARGET``).
 
 .. _install-from-pypi:
 
 Install from PyPI with pip (fallback — bring your own versions)
 ---------------------------------------------------------------
 
-Prefer your own Python/PyTorch/CUDA? Install your preferred PyTorch first (any version ≥ 2.6, built for your CUDA — see `PyTorch's install matrix <https://pytorch.org/get-started/locally/>`_), then add NeMo with the collections you need. Because ``nemo-toolkit`` only requires ``torch>=2.6``, your pre-installed PyTorch is kept, not replaced:
+Prefer your own Python/PyTorch/CUDA? Install your preferred PyTorch first (any version ≥ 2.6 for your CPU/CUDA/ROCm/Apple Silicon target — see `PyTorch's install matrix <https://pytorch.org/get-started/locally/>`_), then add NeMo. Because ``nemo-toolkit`` only requires ``torch>=2.6``, your pre-installed PyTorch is kept, not replaced. ``uv pip`` (uv's fast, pip-compatible installer) works just like ``pip``:
 
 .. code-block:: bash
 
+   uv venv --python 3.12          # any Python >= 3.10 your PyTorch supports — or use your own env
+   source .venv/bin/activate
+
    # 1) Your choice of PyTorch (example: CUDA 12.6 build). Skip if you already have one.
-   pip install torch --index-url https://download.pytorch.org/whl/cu126
+   uv pip install torch --index-url https://download.pytorch.org/whl/cu126
+
+   # 2) NeMo — your PyTorch above is kept (plain `pip install` works identically)
+   uv pip install 'nemo-toolkit[asr,tts]'        # also: [asr,tts,audio], [speechlm2], etc.
 
-   # 2) NeMo — your PyTorch above is kept
-   pip install nemo_toolkit[asr,tts]        # also: [asr,tts,audio], [speechlm2], etc.
+.. warning::
+
+   Do **not** use ``uv sync --locked`` for a bring-your-own stack — it intentionally applies ``uv.lock`` and replaces your Python/PyTorch/CUDA with the supported container baseline. Use ``uv pip`` (or ``pip``) here; reserve ``uv sync --locked`` for reproducing the supported stack (above).
 
-To have pip install our pinned PyTorch build instead, add the matching CUDA extra **and** the PyTorch wheel index. pip does not read uv's index configuration, so the ``--extra-index-url`` is required:
+To instead have the installer pull *our* pinned PyTorch build, add the matching CUDA extra **and** the PyTorch wheel index (``pip`` / ``uv pip`` do not read uv's project index config, so ``--extra-index-url`` is required):
 
 .. code-block:: bash
 
@@ -167,16 +188,19 @@ To have pip install our pinned PyTorch build instead, add the matching CUDA extr
 Verify Installation
 -------------------
 
-After installing, verify that NeMo is working:
+After installing, verify that the chosen collection imports:
+
+.. code-block:: bash
+
+   python -c "import nemo.collections.asr as nemo_asr; print('NeMo ASR installed')"
+
+If you installed with ``uv sync`` and have not activated ``.venv``, run the check through ``uv run python``. To also exercise a model download:
 
 .. code-block:: python
 
    import nemo.collections.asr as nemo_asr
-   print("NeMo ASR installed successfully!")
-
-   # Quick test: load a pretrained model
    model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
-   print(f"Model loaded: {model.__class__.__name__}")
+   print(f"Loaded: {model.__class__.__name__}")
 
 What's Next?
 ------------
diff --git a/docs/source/tools/nemo_forced_aligner.rst b/docs/source/tools/nemo_forced_aligner.rst
@@ -19,7 +19,7 @@ Demos & Tutorials
 Quickstart
 ----------
 
-1. Install `NeMo <https://github.com/NVIDIA/NeMo#installation>`__.
+1. Install NeMo with the ASR collection. See :ref:`installation`.
 2. Prepare a NeMo-style manifest containing the paths of audio files you would like to process, and (optionally) their text.
 3. Run NFA's ``align.py`` script with the desired config, e.g.:
 
diff --git a/docs/source/tts/g2p.rst b/docs/source/tts/g2p.rst
@@ -126,7 +126,7 @@ Using this unknown token forces a G2P model to produce the same masking token as
 Requirements
 ------------
 
-G2P requires the NeMo ASR collection to be installed (``pip install nemo_toolkit[asr]``).
+G2P requires the NeMo ASR collection to be installed. See :ref:`installation` and include the ``asr`` extra.
 
 
 References
diff --git a/docs/source/tts/magpietts-finetuning.rst b/docs/source/tts/magpietts-finetuning.rst
@@ -20,7 +20,7 @@ Before finetuning, you will need:
 - A pretrained Magpie-TTS checkpoint (``pretrained.ckpt`` or ``pretrained.nemo``). Public checkpoints (``https://huggingface.co/nvidia/magpie_tts_multilingual_357m``) are available on Hugging Face.
 - The audio codec model (``https://huggingface.co/nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps``), available on Hugging Face alongside the TTS checkpoint.
 - A prepared dataset. For faster finetuning audio codec tokens must be pre-extracted from your audio files. See the *Dataset Preparation* section below.
-- NeMo installed from source or via the NeMo container. See the `NeMo GitHub page <https://github.com/NVIDIA/NeMo>`_ for installation instructions.
+- NeMo installed from source or with the local Dockerfile. See :ref:`installation` for installation instructions.
 
 
 Dataset Preparation
diff --git a/nemo/agents/voice_agent/pipecat/services/nemo/stt.py b/nemo/agents/voice_agent/pipecat/services/nemo/stt.py
@@ -50,7 +50,7 @@
 
 except ModuleNotFoundError as e:
     logger.error(f"Exception: {e}")
-    logger.error('In order to use NVIDIA NeMo STT, you need to `pip install "nemo_toolkit[all]"`.')
+    logger.error('In order to use NVIDIA NeMo STT, you need to `pip install "nemo-toolkit[all]"`.')
     raise Exception(f"Missing module: {e}")
 
 
diff --git a/nemo/collections/speechlm2/vllm/salm/audio.py b/nemo/collections/speechlm2/vllm/salm/audio.py
@@ -95,7 +95,7 @@ def _load_nemo_perception(perception_cfg: dict) -> nn.Module:
         from nemo.collections.speechlm2.modules import AudioPerceptionModule
     except ImportError as e:
         raise ImportError(
-            "NeMo is required for the audio encoder. " "Install with: pip install nemo_toolkit[asr]"
+            "NeMo is required for the audio encoder. " "Install with: pip install 'nemo-toolkit[asr]'"
         ) from e
 
     cfg = DictConfig(perception_cfg)
diff --git a/nemo/collections/speechlm2/vllm/salm/model.py b/nemo/collections/speechlm2/vllm/salm/model.py
@@ -28,7 +28,7 @@
 granite-4.0-micro escape hatch).
 
 Requires NeMo toolkit for the audio encoder:
-    pip install nemo_toolkit[asr]
+    pip install 'nemo-toolkit[asr]'
 """
 
 from collections.abc import Iterable
diff --git a/nemo/core/config/templates/model_card.py b/nemo/core/config/templates/model_card.py
@@ -36,9 +36,9 @@
 
 ## NVIDIA NeMo: Training
 
-To train, fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed latest Pytorch version.
+To train, fine-tune, or experiment with the model, install the PyTorch build for your platform first, then install [NVIDIA NeMo](https://docs.nvidia.com/nemo/speech/nightly/starthere/install.html) with the extras you need.
 ```
-pip install nemo_toolkit['all']
+pip install 'nemo-toolkit[all]'
 ``` 
 
 ## How to Use this Model
diff --git a/tools/nemo_forced_aligner/README.md b/tools/nemo_forced_aligner/README.md
@@ -12,7 +12,7 @@ NFA is a tool for generating token-, word- and segment-level timestamps of speec
 
 
 ## Quickstart
-1. Install [NeMo](https://github.com/NVIDIA/NeMo#installation).
+1. Install [NeMo](https://docs.nvidia.com/nemo/speech/nightly/starthere/install.html) with the ASR collection.
 2. Prepare a NeMo-style manifest containing the paths of audio files you would like to process, and (optionally) their text.
 3. Run NFA's `align.py` script with the desired config, e.g.:
     ``` bash
diff --git a/tools/nemo_forced_aligner/align.py b/tools/nemo_forced_aligner/align.py
@@ -48,9 +48,9 @@
     raise ImportError(
         "Missing required dependency for NFA. "
         "Install NeMo with NFA utilities support:\n"
-        "  pip install 'nemo_toolkit[all]>=2.5.0'\n"
+        "  pip install 'nemo-toolkit[all]>=2.5.0'\n"
         "Or install the latest development version:\n"
-        "  pip install git+https://github.com/NVIDIA/NeMo.git"
+        "  pip install git+https://github.com/NVIDIA-NeMo/NeMo.git"
     )
 """
 Align the utterances in manifest_filepath. 
diff --git a/tools/nemo_forced_aligner/align_eou.py b/tools/nemo_forced_aligner/align_eou.py
@@ -53,9 +53,9 @@
     raise ImportError(
         "Missing required dependency for NFA. "
         "Install NeMo with NFA utilities support:\n"
-        "  pip install 'nemo_toolkit[all]>=2.5.0'\n"
+        "  pip install 'nemo-toolkit[all]>=2.5.0'\n"
         "Or install the latest development version:\n"
-        "  pip install git+https://github.com/NVIDIA/NeMo.git"
+        "  pip install git+https://github.com/NVIDIA-NeMo/NeMo.git"
     )
 
 """
diff --git a/tools/nemo_forced_aligner/requirements.txt b/tools/nemo_forced_aligner/requirements.txt
@@ -1,3 +1,3 @@
-nemo_toolkit[all]
+nemo-toolkit[all]
 prettyprinter # for testing
 pytest # for testing