Changed the documentation getting started structure #15460

Open
Ssofja wants to merge 15 commits into main from documentation-ref-getting_started

@Ssofja
Collaborator

@Ssofja Ssofja commented Mar 3, 2026

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch never blocks the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?

This PR restructures the Getting Started part of the NeMo documentation.
Collection: [Note which collection this PR will affect]

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

@Ssofja Ssofja requested a review from pzelasko March 3, 2026 14:19
@Ssofja Ssofja force-pushed the documentation-ref-getting_started branch from 7c3f90f to 9c0fd83 Compare March 3, 2026 20:41
- Recommended Model
- Why
* - Get the best accuracy on English
- `Parakeet-TDT-0.6B V2 <https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2>`_
Collaborator

Should be Canary-Qwen-2.5B.
We can recommend Parakeet-TDT V2 / V3 as very fast offline alternatives to Canary models, with almost SOTA accuracy.

- `Canary-1B V2 <https://huggingface.co/nvidia/canary-1b-v2>`_
- Supports 25 EU languages + translation between them. AED decoder.
* - Fast multilingual inference
- `Canary-1B Flash <https://huggingface.co/nvidia/canary-1b-flash>`_
Collaborator

I don't think we need to highlight 1B-Flash now that we have v2

- `Canary-1B Flash <https://huggingface.co/nvidia/canary-1b-flash>`_
- Optimized for speed while maintaining multilingual quality.
* - Stream audio in real-time
- Cache-aware Streaming FastConformer
Collaborator

Feature Nemotron-Speech directly?

* - I want to...
- Recommended Model
- Why
* - Determine who spoke when
Collaborator

What about Streaming Sortformer?

- Full-duplex model that both understands and generates speech.


Decision Flowchart
Collaborator

Revise according to above comments

Collaborator

@pzelasko left a comment

Looking good!

- Audio-aware chatbots, speech translation


Encoder Architectures
Collaborator

Transformer?

trainer.devices=8


Manifest Files
Collaborator

TODO: maybe extend this section to cover "Supported data formats" instead (future PR)
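
For readers following this thread, a minimal sketch of what the "Manifest Files" section refers to: a manifest in the JSON Lines layout, with one JSON object per utterance. The field names `audio_filepath`, `duration`, and `text` follow the layout commonly used for NeMo ASR manifests; the file names, durations, and transcripts are hypothetical, and the helper functions are illustrative rather than part of NeMo.

```python
import json
import os
import tempfile


def write_manifest(entries, path):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")


def read_manifest(path):
    """Read the manifest back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical utterances, for illustration only.
entries = [
    {"audio_filepath": "audio/utt_0001.wav", "duration": 3.42, "text": "hello world"},
    {"audio_filepath": "audio/utt_0002.wav", "duration": 1.87, "text": "good morning"},
]

manifest_path = os.path.join(tempfile.mkdtemp(), "train_manifest.json")
write_manifest(entries, manifest_path)
loaded = read_manifest(manifest_path)
print(f"{len(loaded)} entries written to {manifest_path}")
```

One object per line keeps the file appendable and lets a loader process it entry by entry instead of parsing one large JSON document.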

@pzelasko
Collaborator

pzelasko commented Mar 4, 2026

Resolve the conflicts before continuing

@Ssofja Ssofja force-pushed the documentation-ref-getting_started branch from 7a20f73 to 212c0e2 Compare March 9, 2026 11:36
Ssofja and others added 4 commits March 9, 2026 15:37
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
…entation getting started structure

Signed-off-by: Ssofja <sofiakostandian@gmail.com>
Co-authored-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Ssofja <78349198+Ssofja@users.noreply.github.com>
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
@Ssofja Ssofja force-pushed the documentation-ref-getting_started branch from 212c0e2 to d8b8a8f Compare March 9, 2026 11:37
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
@Ssofja Ssofja force-pushed the documentation-ref-getting_started branch from 9c2673b to 7d566f2 Compare March 9, 2026 11:48
@github-actions github-actions bot added the TTS label Mar 9, 2026
…ig and ChunkState

Signed-off-by: Ssofja <sofiakostandian@gmail.com>
@Ssofja Ssofja force-pushed the documentation-ref-getting_started branch from 3b55b96 to 139baf9 Compare March 10, 2026 10:41
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
@github-actions github-actions bot removed the TTS label Mar 10, 2026
pip install nemo_toolkit[asr,tts]

# Install everything speech-related
pip install nemo_toolkit[asr,tts,audio,common]
Collaborator

I think common is not needed? Does it add anything not there already?

Can we also add "Development installation" git clone nemo; pip install -e .[test]

- Text-to-Speech models, vocoders, and audio codecs
* - ``audio``
- Audio processing models (enhancement, separation)
* - ``common``
Collaborator

Remove common, I'll remove it from deps too


git clone https://github.com/NVIDIA/NeMo.git
cd NeMo
pip install -e '.[asr,tts]'
Collaborator

Suggested change
pip install -e '.[asr,tts]'
pip install -e '.[test]'


# Load models
spec_gen = nemo_tts.models.FastPitchModel.from_pretrained("tts_en_fastpitch")
vocoder = nemo_tts.models.HifiGanModel.from_pretrained("tts_en_hifigan")
Collaborator

Let's feature Magpie TTS here instead CC @blisc

@@ -0,0 +1,119 @@
.. _ten-minutes:

10 Minutes to NeMo Speech
Collaborator

I like the "10 Minutes" idea, but as-is this section only reflects inference, while the title suggests a more comprehensive overview. Can we rename it to something like "NeMo Speech Inference in 5 Minutes"?

NeMo models are PyTorch modules that also integrate with `PyTorch Lightning <https://lightning.ai/>`__ for training and `Hydra <https://hydra.cc/>`__ + `OmegaConf <https://omegaconf.readthedocs.io/>`__ for configuration.


Configuration with YAML
Collaborator

"Configuration with YAML" shouldn't go to "Key concepts in Speech AI" section

Collaborator

I think this and the next sections belong to a separate major section called "Overview of NeMo Speech"
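
As background for the "Configuration with YAML" discussion, here is a rough sketch of how a dotted command-line override such as `trainer.devices=8` maps onto a nested configuration. This is not Hydra's or OmegaConf's implementation; `apply_override` is a hypothetical helper, and the `max_epochs` key and its value are assumptions made up for the example.

```python
import copy


def apply_override(config, override):
    """Return a copy of `config` with one `a.b.c=value` override applied."""
    key_path, _, raw_value = override.partition("=")
    keys = key_path.split(".")

    result = copy.deepcopy(config)  # leave the caller's config untouched
    node = result
    for key in keys[:-1]:
        # Walk down the nested dicts, creating levels that do not exist yet.
        node = node.setdefault(key, {})

    # Keep integers as integers; everything else stays a string in this sketch.
    try:
        value = int(raw_value)
    except ValueError:
        value = raw_value
    node[keys[-1]] = value
    return result


# Hypothetical base configuration, as it might be loaded from a YAML file.
base = {"trainer": {"devices": 1, "max_epochs": 10}}
updated = apply_override(base, "trainer.devices=8")
print(updated)
```

The real tools add type checking, interpolation, and config composition on top of this basic idea.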

- Why
* - Get the best accuracy on English
- `Parakeet-TDT-0.6B V2 <https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2>`_
- #1 on the `OpenASR Leaderboard <https://huggingface.co/spaces/hf-audio/open_asr_leaderboard>`_. TDT decoder provides accurate timestamps.
Collaborator

Even Canary-Qwen is no longer #1, let's just use "Top of the ..."


NeMo offers many pretrained speech models. This guide helps you pick the right one for your use case.

ASR: Which Model Should I Use?
Collaborator

CC @nithinraok please review

Member

Added in comments. @Ssofja could you add parakeet-v3 --> Most performant multilingual ASR

Collaborator Author

added in the new commit

- `Multitalker Parakeet Streaming <https://huggingface.co/nvidia/multitalker-parakeet-streaming-0.6b-v1>`_
- Handles overlapping speech in real-time with speaker-adapted decoding.

TTS: Which Model Should I Use?
Collaborator

CC @blisc please review

- Audio Codec
- Neural audio codec for tokenizing audio. Used by MagpieTTS internally.

Speaker Tasks: Which Model Should I Use?
Collaborator

CC @tango4j please review

Ssofja added 2 commits March 18, 2026 16:45
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
**Channels** — Many models use mono input, but some support **multi-channel** audio (e.g. for spatial or multi-mic setups). See the model and preprocessor documentation for your use case.

.. code-block:: bash
**Preprocessing** — NeMo models typically include a **preprocessor** (e.g. resampling, stereo→mono, mel-spectrogram) in the pipeline. You don't have to resample or convert channels offline unless you're building a custom dataset or bypassing the default preprocessor.
Collaborator

Not sure if resampling and stereo->mono is true -- in fact most models expect the user to provide the audio already resampled and converted to mono?
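
If users are indeed expected to supply mono audio at the model's sample rate, the offline preparation could look roughly like the sketch below. It uses only the standard library; the 16 kHz target is an assumed rate, and the naive linear interpolation (no anti-aliasing filter) is for illustration only. A real pipeline would more likely use a dedicated audio tool or library.

```python
def downmix_to_mono(frames):
    """Average the channels of each frame: [(left, right), ...] -> [mono, ...]."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation (illustrative; no low-pass filtering)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate        # fractional position in the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)   # clamp at the last input sample
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# Hypothetical input: 10 ms of stereo audio at 44.1 kHz.
stereo = [(0.0, 1.0)] * 441
mono = downmix_to_mono(stereo)
mono_16k = resample_linear(mono, src_rate=44100, dst_rate=16000)
print(len(mono), len(mono_16k))
```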

The original architecture from `Gulati et al. (2020) <https://arxiv.org/abs/2005.08100>`_ that combines self-attention with convolutions for both global and local patterns.

**FastConformer**
A faster variant of Conformer with 8× subsampling and optimized attention. NeMo's default choice for ASR; recommended for new projects.
Collaborator

Link to FastConformer paper https://arxiv.org/abs/2305.05084

Co-authored-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Ssofja <78349198+Ssofja@users.noreply.github.com>
Ssofja and others added 2 commits March 19, 2026 13:38
Co-authored-by: Piotr Żelasko <pzelasko@nvidia.com>
Signed-off-by: Ssofja <78349198+Ssofja@users.noreply.github.com>
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
Collaborator

@blisc left a comment

Quickly skimmed it, and it looks good

@nithinraok
Member

Issues observed:

  • In asr/results.html#parakeet → the model class is incorrectly shown as "Language".
  • In asr/results.html#parakeet → some models are listed more than once.
  • checkpoints/intro.html → is this page necessary, or can it be merged with other checkpoint-related documentation?
  • starthere/choosing_a_model.html#asr-which-model-should-i-use → add parakeet-v3.
  • starthere/install.html#install-from-source → this does not install the required NeMo collections when using [test].
  • starthere/install.html#installation → include a recommended installation sequence. Installing PyTorch and CUDA Toolkit before NeMo is typically recommended.

Suggested setup:

conda create -n nemo python=3.12 -y && conda activate nemo
conda install nvidia::cuda-toolkit
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130
pip install -e .
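
After a sequence like the one above, a quick sanity check can confirm that the expected packages are visible to the interpreter. The sketch below is a hypothetical helper, not part of NeMo: it only checks whether the top-level modules `torch` and `nemo` can be located, without importing them, so it also runs on a machine where neither is installed.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "nemo")):
    """Report the Python version and whether each package can be found."""
    report = {"python": "{}.{}".format(*sys.version_info[:2])}
    for name in packages:
        # find_spec returns None when a top-level module is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


report = check_environment()
for key, value in report.items():
    print(f"{key}: {value}")
```

This does not verify that CUDA is usable; it only tells you whether the install steps put the packages on the import path.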

@github-actions github-actions bot added the ASR label Mar 20, 2026
Ssofja added 2 commits March 21, 2026 00:41
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
@Ssofja Ssofja force-pushed the documentation-ref-getting_started branch from 9793777 to 481a31c Compare March 20, 2026 20:42
4 participants