Changed the documentation getting started structure #15460
Conversation
Force-pushed from 7c3f90f to 9c0fd83.
| - Recommended Model
| - Why
| * - Get the best accuracy on English
| - `Parakeet-TDT-0.6B V2 <https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2>`_
Should be Canary-Qwen-2.5B,
we can recommend Parakeet-TDT V2 / V3 as very fast offline alternatives to Canary models with almost SOTA accuracy
| - `Canary-1B V2 <https://huggingface.co/nvidia/canary-1b-v2>`_
| - Supports 25 EU languages + translation between them. AED decoder.
| * - Fast multilingual inference
| - `Canary-1B Flash <https://huggingface.co/nvidia/canary-1b-flash>`_
I don't think we need to highlight 1B-Flash now that we have v2.
| - `Canary-1B Flash <https://huggingface.co/nvidia/canary-1b-flash>`_
| - Optimized for speed while maintaining multilingual quality.
| * - Stream audio in real-time
| - Cache-aware Streaming FastConformer
Feature Nemotron-Speech directly?
| * - I want to...
| - Recommended Model
| - Why
| * - Determine who spoke when
What about Streaming Sortformer?
| - Full-duplex model that both understands and generates speech.
| Decision Flowchart
Revise according to the above comments.
| - Audio-aware chatbots, speech translation

| Encoder Architectures
| trainer.devices=8

| Manifest Files
TODO: maybe extend this section to cover "Supported data formats" instead (future PR)
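For context, a NeMo-style ASR manifest is a JSON-lines file with one sample per line. A minimal sketch below; the field names `audio_filepath`, `duration`, and `text` follow the common NeMo ASR convention, but verify them against the dataset class you actually use:

```python
import io
import json

# One JSON object per line; audio_filepath / duration / text are the usual
# NeMo ASR manifest fields (check the docs of your specific dataset class).
samples = [
    {"audio_filepath": "audio/utt1.wav", "duration": 3.2, "text": "hello world"},
    {"audio_filepath": "audio/utt2.wav", "duration": 1.7, "text": "good morning"},
]

def write_manifest(fh, samples):
    """Write samples as JSON lines to an open text file handle."""
    for s in samples:
        fh.write(json.dumps(s) + "\n")

def read_manifest(lines):
    """Parse JSON lines back into dicts, skipping blank lines."""
    return [json.loads(ln) for ln in lines if ln.strip()]

buf = io.StringIO()
write_manifest(buf, samples)
loaded = read_manifest(buf.getvalue().splitlines())
```

A "Supported data formats" section could then build on this round-trip example when it lands in the future PR.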
Resolve the conflicts before continuing.
Force-pushed from 7a20f73 to 212c0e2.
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
…entation getting started structure Signed-off-by: Ssofja <sofiakostandian@gmail.com>
Co-authored-by: Piotr Żelasko <pzelasko@nvidia.com> Signed-off-by: Ssofja <78349198+Ssofja@users.noreply.github.com> Signed-off-by: Ssofja <sofiakostandian@gmail.com>
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
Force-pushed from 212c0e2 to d8b8a8f.
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
Force-pushed from 9c2673b to 7d566f2.
…ig and ChunkState Signed-off-by: Ssofja <sofiakostandian@gmail.com>
Force-pushed from 3b55b96 to 139baf9.
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
docs/source/starthere/install.rst (outdated)
| pip install nemo_toolkit[asr,tts]

| # Install everything speech-related
| pip install nemo_toolkit[asr,tts,audio,common]
I think common is not needed? Does it add anything not there already?
Can we also add a "Development installation": git clone NeMo; pip install -e .[test]
docs/source/starthere/install.rst (outdated)
| - Text-to-Speech models, vocoders, and audio codecs
| * - ``audio``
| - Audio processing models (enhancement, separation)
| * - ``common``
Remove common, I'll remove it from deps too.
| git clone https://github.com/NVIDIA/NeMo.git
| cd NeMo
| pip install -e '.[asr,tts]'
Suggested change:
- pip install -e '.[asr,tts]'
+ pip install -e '.[test]'
| # Load models
| spec_gen = nemo_tts.models.FastPitchModel.from_pretrained("tts_en_fastpitch")
| vocoder = nemo_tts.models.HifiGanModel.from_pretrained("tts_en_hifigan")
Let's feature Magpie TTS here instead CC @blisc
| @@ -0,0 +1,119 @@
| .. _ten-minutes:

| 10 Minutes to NeMo Speech
I like the "10 Minutes" idea, but as-is this section only covers inference while the title suggests a more comprehensive overview. Can we rename it to something like "NeMo Speech Inference in 5 Minutes"?
| NeMo models are PyTorch modules that also integrate with `PyTorch Lightning <https://lightning.ai/>`__ for training and `Hydra <https://hydra.cc/>`__ + `OmegaConf <https://omegaconf.readthedocs.io/>`__ for configuration.

| Configuration with YAML
"Configuration with YAML" shouldn't go in the "Key concepts in Speech AI" section.
I think this and the next sections belong in a separate major section called "Overview of NeMo Speech".
| - Why
| * - Get the best accuracy on English
| - `Parakeet-TDT-0.6B V2 <https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2>`_
| - #1 on the `OpenASR Leaderboard <https://huggingface.co/spaces/hf-audio/open_asr_leaderboard>`_. TDT decoder provides accurate timestamps.
Even Canary-Qwen is no longer #1, let's just use "Top of the ..."
| NeMo offers many pretrained speech models. This guide helps you pick the right one for your use case.

| ASR: Which Model Should I Use?
Added in comments. @Ssofja could you add parakeet-v3 --> Most performant multilingual ASR
Added in the new commit.
| - `Multitalker Parakeet Streaming <https://huggingface.co/nvidia/multitalker-parakeet-streaming-0.6b-v1>`_
| - Handles overlapping speech in real-time with speaker-adapted decoding.

| TTS: Which Model Should I Use?
| - Audio Codec
| - Neural audio codec for tokenizing audio. Used by MagpieTTS internally.

| Speaker Tasks: Which Model Should I Use?
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
| **Channels** — Many models use mono input, but some support **multi-channel** audio (e.g. for spatial or multi-mic setups). See the model and preprocessor documentation for your use case.

| .. code-block:: bash
| **Preprocessing** — NeMo models typically include a **preprocessor** (e.g. resampling, stereo→mono, mel-spectrogram) in the pipeline. You don't have to resample or convert channels offline unless you're building a custom dataset or bypassing the default preprocessor.
Not sure if resampling and stereo->mono is true -- in fact most models expect the user to provide the audio already resampled and converted to mono?
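If offline conversion does turn out to be the user's responsibility, a minimal sketch of stereo→mono mixdown and resampling with NumPy is below. This is illustrative only; a real pipeline should use a proper polyphase or sinc resampler (e.g. torchaudio or soxr) rather than linear interpolation:

```python
import numpy as np

def to_mono(x):
    """Mix down by averaging channels: (n_samples, n_channels) -> (n_samples,)."""
    return x if x.ndim == 1 else x.mean(axis=1)

def resample_linear(x, sr_in, sr_out):
    """Naive linear-interpolation resampler (illustration only;
    use a proper resampler for real audio to avoid aliasing)."""
    n_out = int(round(len(x) * sr_out / sr_in))
    t_in = np.arange(len(x)) / sr_in
    t_out = np.arange(n_out) / sr_out
    return np.interp(t_out, t_in, x)

stereo = np.random.randn(44100, 2).astype(np.float32)  # 1 s of 44.1 kHz stereo
mono16k = resample_linear(to_mono(stereo), 44100, 16000)  # 1 s of 16 kHz mono
```

Whichever way the docs land, stating explicitly whether the preprocessor handles this would resolve the reviewer's question.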
| The original architecture from `Gulati et al. (2020) <https://arxiv.org/abs/2005.08100>`_ that combines self-attention with convolutions for both global and local patterns.

| **FastConformer**
| A faster variant of Conformer with 8× subsampling and optimized attention. NeMo's default choice for ASR; recommended for new projects.
Link to the FastConformer paper: https://arxiv.org/abs/2305.05084
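To make the 8× subsampling concrete, here is a rough back-of-the-envelope calculation, assuming a 10 ms mel hop (exact frame counts depend on convolution padding and windowing, which this sketch ignores):

```python
import math

def encoder_frames(duration_s, hop_ms=10.0, subsampling=8):
    """Approximate encoder output frame count; ignores conv padding edge effects."""
    mel_frames = int(duration_s * 1000 / hop_ms)
    return math.ceil(mel_frames / subsampling)

# 10 s of audio -> 1000 mel frames -> ~125 encoder frames of ~80 ms each
n = encoder_frames(10.0)
```

The ~80 ms per encoder frame (vs ~40 ms for Conformer's 4× subsampling) is where FastConformer's speedup on long sequences comes from.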
Co-authored-by: Piotr Żelasko <pzelasko@nvidia.com> Signed-off-by: Ssofja <78349198+Ssofja@users.noreply.github.com>
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
blisc left a comment:
Quickly skimmed it, and it looks good.
Issues observed:
Suggested setup:

    conda create -n nemo python=3.12 -y && conda activate nemo
    conda install nvidia::cuda-toolkit
    pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130
    pip install -e .
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
Force-pushed from 9793777 to 481a31c.
Important
The "Update branch" button must only be pressed on very rare occasions. An outdated branch never blocks the merge of a PR.
Please reach out to the automation team before pressing that button.
What does this PR do ?
This PR restructures the Getting Started section of the NeMo documentation.
Collection: [Note which collection this PR will affect]
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines list specific people who can review PRs to various areas.
Additional Information