Changed the documentation getting started structure #15460
Conversation
Force-pushed from 7c3f90f to 9c0fd83.
| - Recommended Model
| - Why
| * - Get the best accuracy on English
| - `Parakeet-TDT-0.6B V2 <https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2>`_
Should be Canary-Qwen-2.5B,
we can recommend Parakeet-TDT V2 / V3 as very fast offline alternatives to Canary models with almost SOTA accuracy
| - `Canary-1B V2 <https://huggingface.co/nvidia/canary-1b-v2>`_
| - Supports 25 EU languages + translation between them. AED decoder.
| * - Fast multilingual inference
| - `Canary-1B Flash <https://huggingface.co/nvidia/canary-1b-flash>`_
I don't think we need to highlight 1B-Flash now that we have v2.
| - `Canary-1B Flash <https://huggingface.co/nvidia/canary-1b-flash>`_
| - Optimized for speed while maintaining multilingual quality.
| * - Stream audio in real-time
| - Cache-aware Streaming FastConformer
Feature Nemotron-Speech directly?
| * - I want to...
| - Recommended Model
| - Why
| * - Determine who spoke when
What about Streaming Sortformer?
| - Full-duplex model that both understands and generates speech.
| Decision Flowchart
Revise according to the above comments.
| - Audio-aware chatbots, speech translation

| Encoder Architectures
| trainer.devices=8

| Manifest Files
TODO: maybe extend this section to cover "Supported data formats" instead (future PR)
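For context, a NeMo-style ASR manifest is a JSON-lines file with one sample per line. A minimal sketch below; the field names `audio_filepath`, `duration`, and `text` follow the common NeMo ASR convention, but verify them against the dataset class you actually use:

```python
import io
import json

# One JSON object per line; audio_filepath / duration / text are the usual
# NeMo ASR manifest fields (check the docs of your specific dataset class).
samples = [
    {"audio_filepath": "audio/utt1.wav", "duration": 3.2, "text": "hello world"},
    {"audio_filepath": "audio/utt2.wav", "duration": 1.7, "text": "good morning"},
]

def write_manifest(fh, samples):
    """Write samples as JSON lines to an open text file handle."""
    for s in samples:
        fh.write(json.dumps(s) + "\n")

def read_manifest(lines):
    """Parse JSON lines back into dicts, skipping blank lines."""
    return [json.loads(ln) for ln in lines if ln.strip()]

buf = io.StringIO()
write_manifest(buf, samples)
loaded = read_manifest(buf.getvalue().splitlines())
```

A "Supported data formats" section could then build on this round-trip example when it lands in the future PR.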
Resolve the conflicts before continuing.
Force-pushed from 7a20f73 to 212c0e2.
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
…entation getting started structure Signed-off-by: Ssofja <sofiakostandian@gmail.com>
Co-authored-by: Piotr Żelasko <pzelasko@nvidia.com> Signed-off-by: Ssofja <78349198+Ssofja@users.noreply.github.com> Signed-off-by: Ssofja <sofiakostandian@gmail.com>
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
Force-pushed from 212c0e2 to d8b8a8f.
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
Force-pushed from 9c2673b to 7d566f2.
…ig and ChunkState Signed-off-by: Ssofja <sofiakostandian@gmail.com>
Force-pushed from 3b55b96 to 139baf9.
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
docs/source/starthere/install.rst (outdated)
| pip install nemo_toolkit[asr,tts]

| # Install everything speech-related
| pip install nemo_toolkit[asr,tts,audio,common]
I think common is not needed? Does it add anything not there already?
Can we also add a "Development installation": git clone NeMo; pip install -e .[test]
docs/source/starthere/install.rst (outdated)
| - Text-to-Speech models, vocoders, and audio codecs
| * - ``audio``
| - Audio processing models (enhancement, separation)
| * - ``common``
Remove common, I'll remove it from deps too.
| git clone https://github.com/NVIDIA/NeMo.git
| cd NeMo
| pip install -e '.[asr,tts]'
Suggested change:
- pip install -e '.[asr,tts]'
+ pip install -e '.[test]'
| # Load models
| spec_gen = nemo_tts.models.FastPitchModel.from_pretrained("tts_en_fastpitch")
| vocoder = nemo_tts.models.HifiGanModel.from_pretrained("tts_en_hifigan")
Let's feature Magpie TTS here instead CC @blisc
| @@ -0,0 +1,119 @@
| .. _ten-minutes:

| 10 Minutes to NeMo Speech
I like the "10 Minutes" idea, but as-is this section only covers inference while the title suggests a more comprehensive overview. Can we rename it to something like "NeMo Speech Inference in 5 Minutes"?
| NeMo models are PyTorch modules that also integrate with `PyTorch Lightning <https://lightning.ai/>`__ for training and `Hydra <https://hydra.cc/>`__ + `OmegaConf <https://omegaconf.readthedocs.io/>`__ for configuration.

| Configuration with YAML
"Configuration with YAML" shouldn't go in the "Key concepts in Speech AI" section.
I think this and the next sections belong in a separate major section called "Overview of NeMo Speech".
| - Why
| * - Get the best accuracy on English
| - `Parakeet-TDT-0.6B V2 <https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2>`_
| - #1 on the `OpenASR Leaderboard <https://huggingface.co/spaces/hf-audio/open_asr_leaderboard>`_. TDT decoder provides accurate timestamps.
Even Canary-Qwen is no longer #1, let's just use "Top of the ..."
| NeMo offers many pretrained speech models. This guide helps you pick the right one for your use case.

| ASR: Which Model Should I Use?
Added in comments. @Ssofja could you add parakeet-v3 --> Most performant multilingual ASR
Added in the new commit.
| - `Multitalker Parakeet Streaming <https://huggingface.co/nvidia/multitalker-parakeet-streaming-0.6b-v1>`_
| - Handles overlapping speech in real-time with speaker-adapted decoding.

| TTS: Which Model Should I Use?
| - Audio Codec
| - Neural audio codec for tokenizing audio. Used by MagpieTTS internally.

| Speaker Tasks: Which Model Should I Use?
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
| **Channels** — Many models use mono input, but some support **multi-channel** audio (e.g. for spatial or multi-mic setups). See the model and preprocessor documentation for your use case.

| .. code-block:: bash
| **Preprocessing** — NeMo models typically include a **preprocessor** (e.g. resampling, stereo→mono, mel-spectrogram) in the pipeline. You don't have to resample or convert channels offline unless you're building a custom dataset or bypassing the default preprocessor.
Not sure if resampling and stereo->mono is true -- in fact most models expect the user to provide the audio already resampled and converted to mono?
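If offline conversion does turn out to be the user's responsibility, a minimal sketch of stereo→mono mixdown and resampling with NumPy is below. This is illustrative only; a real pipeline should use a proper polyphase or sinc resampler (e.g. torchaudio or soxr) rather than linear interpolation:

```python
import numpy as np

def to_mono(x):
    """Mix down by averaging channels: (n_samples, n_channels) -> (n_samples,)."""
    return x if x.ndim == 1 else x.mean(axis=1)

def resample_linear(x, sr_in, sr_out):
    """Naive linear-interpolation resampler (illustration only;
    use a proper resampler for real audio to avoid aliasing)."""
    n_out = int(round(len(x) * sr_out / sr_in))
    t_in = np.arange(len(x)) / sr_in
    t_out = np.arange(n_out) / sr_out
    return np.interp(t_out, t_in, x)

stereo = np.random.randn(44100, 2).astype(np.float32)  # 1 s of 44.1 kHz stereo
mono16k = resample_linear(to_mono(stereo), 44100, 16000)  # 1 s of 16 kHz mono
```

Whichever way the docs land, stating explicitly whether the preprocessor handles this would resolve the reviewer's question.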
| The original architecture from `Gulati et al. (2020) <https://arxiv.org/abs/2005.08100>`_ that combines self-attention with convolutions for both global and local patterns.

| **FastConformer**
| A faster variant of Conformer with 8× subsampling and optimized attention. NeMo's default choice for ASR; recommended for new projects.
Link to the FastConformer paper: https://arxiv.org/abs/2305.05084
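To make the 8× subsampling concrete, here is a rough back-of-the-envelope calculation, assuming a 10 ms mel hop (exact frame counts depend on convolution padding and windowing, which this sketch ignores):

```python
import math

def encoder_frames(duration_s, hop_ms=10.0, subsampling=8):
    """Approximate encoder output frame count; ignores conv padding edge effects."""
    mel_frames = int(duration_s * 1000 / hop_ms)
    return math.ceil(mel_frames / subsampling)

# 10 s of audio -> 1000 mel frames -> ~125 encoder frames of ~80 ms each
n = encoder_frames(10.0)
```

The ~80 ms per encoder frame (vs ~40 ms for Conformer's 4× subsampling) is where FastConformer's speedup on long sequences comes from.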
Co-authored-by: Piotr Żelasko <pzelasko@nvidia.com> Signed-off-by: Ssofja <78349198+Ssofja@users.noreply.github.com>
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
blisc left a comment:
Quickly skimmed it, and it looks good.
Issues observed:
Suggested setup:

    conda create -n nemo python=3.12 -y && conda activate nemo
    conda install nvidia::cuda-toolkit
    pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130
    pip install -e .
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
Force-pushed from 9793777 to 481a31c.
Important
The "Update branch" button must only be pressed on very rare occasions. An outdated branch never blocks the merge of a PR.
Please reach out to the automation team before pressing that button.
What does this PR do ?
This PR restructures the Getting Started section of the NeMo documentation.
Collection: [Note which collection this PR will affect]
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines list specific people who can review PRs to various areas.
Additional Information