- Upload
  - The user has to upload their audio file (`.wav` files only) to the website
  - Drag and drop is an upcoming feature
  - Other audio file types will be supported soon
- Validation
  - Once the user uploads a file, the front end validates it
  - If the file type is invalid:
    - The front end will NOT accept the file and will make the user upload their file again
  - If the file type is valid:
    - The front end will allow the user to click the `Transcribe` button
- Process
  - Once the file passes validation and the user clicks the `Transcribe` button, the front end sends a `POST` HTTP request to the back end via the back-end server (http://127.0.0.1:8000)
  - The server then routes the request to the API at `/api/file-asr/upload`
  - Next, the file is processed by the `FastConformer` model, which outputs the transcribed text
  - Finally, the server returns the output to the front end (a sketch of the endpoint follows below)
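A minimal sketch of what the upload endpoint could look like on the FastAPI side. The route path and model family come from the flow above; the pretrained model name, temp-file handling, and response shape are assumptions, and the real handler in `app/main.py` may differ:

```python
# Hypothetical sketch of the /api/file-asr/upload route (not the project's actual code).
import tempfile

import nemo.collections.asr as nemo_asr
from fastapi import FastAPI, HTTPException, UploadFile

app = FastAPI()
# Placeholder English FastConformer checkpoint; the project loads its Thai model instead.
asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_ctc_large")

@app.post("/api/file-asr/upload")
async def upload(file: UploadFile):
    # Mirror the front-end validation: accept .wav only
    if not file.filename or not file.filename.endswith(".wav"):
        raise HTTPException(status_code=400, detail="Only .wav files are accepted")
    # Persist the upload to a temp file so the model can read it from disk
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name
    result = asr_model.transcribe([tmp_path])[0]
    # Newer NeMo versions return Hypothesis objects; older ones return plain strings
    return {"transcription": getattr(result, "text", result)}
```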
- Install NeMo Framework:
  1.1 Follow the official document's steps, but change the conda environment name from `nemo` to `thai-asr`
    - doc: https://docs.nvidia.com/nemo-framework/user-guide/latest/installation.html
  1.2 Update the PyTorch version to `pytorch=2.7.1`
    - doc:
- Install Node Package Manager (`npm`)
- If you're using Windows OS, install Windows Subsystem for Linux (WSL)

USE THIS COMMAND: `. run.sh` (make sure you're in the root folder)

OR
- For front end:
  1.1 Go into the next-frontend directory using `cd next-frontend`
  1.2 Install npm dependencies and packages using `npm i`
  1.3 Start the development server using `npm run dev`
- For back end:
  2.1 Go into the fastapi-backend directory using `cd fastapi-backend`
  2.2 Activate the Conda environment using `conda activate thai-asr`
  2.3 Start the development server using `fastapi dev app/main.py`
- `.wav` file type: 16000 Hz, mono-channel
- If the sound file isn't 16 kHz mono-channel, you have to convert it before using `.transcribe()` or `.transcribe_generator()` (a conversion sketch follows below)
- Consumes GPU resources
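A minimal conversion sketch, assuming `librosa` and `soundfile` are available (neither is prescribed by this project; any resampler works):

```python
# Hypothetical helper: convert an arbitrary .wav to 16 kHz mono before
# passing it to .transcribe() / .transcribe_generator().
import librosa
import soundfile as sf

def convert_to_16k_mono(in_path: str, out_path: str) -> str:
    # librosa resamples to sr=16000 and downmixes to mono on load
    audio, sr = librosa.load(in_path, sr=16000, mono=True)
    sf.write(out_path, audio, sr)
    return out_path
```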
- Silero VAD supports 8000 Hz and 16000 Hz sampling rates.
- Accepts only mono-channel audio
- Consumes CPU resources
DeepWiki: https://deepwiki.com/snakers4/silero-vad/4-basic-usage
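The basic usage pattern from the Silero VAD README (via `pip install silero-vad`; the filename is a placeholder):

```python
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

model = load_silero_vad()
wav = read_audio("audio.wav", sampling_rate=16000)  # 8000 or 16000 Hz, mono
# Returns speech segments as {'start': ..., 'end': ...} dicts (in seconds here)
speech_timestamps = get_speech_timestamps(wav, model, return_seconds=True)
print(speech_timestamps)
```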
- In the front end, we created a single page with dynamic components, so both the `Long Form` and `Streaming` features share state and mostly use the same components.
An error occurred because installing PyTorch per the NeMo Framework document gives version `pytorch=2.2.0`.
Upgrade to `pytorch=2.7.1` to resolve the error.
Bug resolved!: DO NOT use the `config.batch_size = ...` assignment
Bug resolved!:
Bad code: `transcripts_result = list(asr_model.transcribe_generator(segment_tensors, config))[0]`
Fixed code: `transcripts_result = list(asr_model.transcribe_generator(segment_tensors, config))` (remove the `[0]`)
Since we use `batch_size = 4` (the default), the generator yields a list for every 4 audio chunks transcribed (see the sketch below).
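A sketch of the same fix written as a loop, assuming each yield of `transcribe_generator` is one batch of results (variable names follow the snippet above):

```python
# Each iteration yields the transcriptions for one batch (up to config.batch_size
# items), so flatten the per-batch lists into a single result list.
transcripts_result = []
for batch in asr_model.transcribe_generator(segment_tensors, config):
    transcripts_result.extend(batch)
```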
Bug resolved!:
Error: E0728 17:12:02.357453 169 streaming_asr_ensemble.cc:1337] Caught exception: Number of output frames or output tokens do not match expected values. Frames expected 201 got 101. Tokens expected 3001 got 3001
Cause:
- `Frames expected 201 got 101` -> expected frame count mismatch: the model was trained to expect `ms_per_timestep=80`, not `40`
Fix:
- When building the `.rmir` file from the `.riva` model, use the `--ms_per_timestep=80 \` flag so the model matches the sound file
Github Official: https://github.com/snakers4/silero-vad
Github Colab Example: https://github.com/snakers4/silero-vad/blob/master/silero-vad.ipynb
For streaming, use the `VADIterator` class (sketch below):
Github: https://github.com/snakers4/silero-vad/blob/94811cbe1207ec24bc0f5370b895364b8934936f/src/silero_vad/utils_vad.py#L398
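A streaming sketch with `VADIterator`, following the repo's examples; the 512-sample window is the model's expected chunk size at 16 kHz, and the filename is a placeholder:

```python
from silero_vad import load_silero_vad, read_audio, VADIterator

SAMPLING_RATE = 16000
model = load_silero_vad()
vad_iterator = VADIterator(model, sampling_rate=SAMPLING_RATE)

wav = read_audio("audio.wav", sampling_rate=SAMPLING_RATE)
window = 512  # samples per step at 16 kHz (use 256 at 8 kHz)
for i in range(0, len(wav) - window + 1, window):
    speech_dict = vad_iterator(wav[i : i + window], return_seconds=True)
    if speech_dict:  # emitted only at speech start/end boundaries
        print(speech_dict)
vad_iterator.reset_states()  # reset internal state between audio streams
```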
`SAMPLING_RATE` is how many times per second an analog signal is sampled when converting it to a digital signal for storing and processing sound. It's measured in Hz or kHz. Typical headphones nowadays use 48 kHz or 44.1 kHz, which is 48,000 or 44,100 samples taken per second.
- A higher `SAMPLING_RATE` means more of the sound's detail is captured and the result is more accurate
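Given the 16 kHz mono requirement above, a quick way to check a file's sampling rate and channel count (a sketch with `soundfile`; the filename is a placeholder):

```python
import soundfile as sf

info = sf.info("audio.wav")
print(info.samplerate, info.channels, info.duration)  # e.g. 16000 1 12.5
```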
- There is no need to pass a `file_path` to `.transcribe()`. You can use numpy arrays (read from the `.wav` with `soundfile`) to transcribe the sound, BUT you have to follow these steps (see the sketch after the table below): https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/results.html#transcribing-inference
- Use `.transcribe_generator(list(.wav tensor_audio), config)` instead of `.transcribe()`. Why?: it can handle multiple segments instead of looping over each one
- When transcribing multiple samples simultaneously, changing the model's decoding strategy from `greedy` (default) to `greedy_batch` gives better performance:
| Feature | greedy | greedy_batch |
|---|---|---|
| 🧠 Decoding Method | Processes one sample at a time | Processes multiple samples in parallel |
| 🔄 Efficiency | Lower — suitable for debugging/testing | Higher — better for inference speed |
| 📦 Batch Support | No | Yes |
| 🪄 Implementation | Simpler logic | Requires optimized batching mechanisms |
| 📊 Use Case | Step-by-step analysis, prototyping | Production-level transcription |
| 🎯 Output Shape | Output per individual input | Batched output matching input batch |
| 🛠 Compatibility | Any input shape | Requires consistent input lengths or padding |
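A sketch of the numpy-array path described above (per the linked NeMo docs); the model name and filename are placeholders:

```python
import soundfile as sf
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_ctc_large")

# Read the .wav as a float32 numpy array; it must already be 16 kHz mono
audio, sample_rate = sf.read("audio.wav", dtype="float32")
assert sample_rate == 16000

outputs = asr_model.transcribe([audio])
print(outputs[0])
```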
- To change the ASR model config
  - Check what can be set using:

    ```python
    decoding_cfg = asr_model.cfg.decoding
    print(decoding_cfg)
    ```

  - If you want to change the config (e.g. `strategy`):

    ```python
    decoding_cfg.strategy = "greedy_batch"  # change "strategy" or any other field you want
    asr_model.change_decoding_strategy(decoding_cfg)  # apply the config change to the model
    ```

    or you can use:

    ```python
    asr_model.change_decoding_strategy(decoding_cfg={"strategy": "greedy_batch"})
    ```
Using local Docker is better for:
- a single laptop
- lower overhead
- development/testing