* more text cleanups
* nits
* minor text changes
* rm troubleshooting
* update
* minor text fixes and reorganization
* silence unnecessary logs
* refactor out one level of async nesting
---------
Co-authored-by: Charles Frye <[email protected]>
# [Parakeet](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html#parakeet) is the name of a family of ASR models built using [NVIDIA's NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html).
#
# This example demonstrates the use of Parakeet ASR models for real-time speech-to-text on Modal.
#
# This example uses the `nvidia/parakeet-tdt-0.6b-v2` model, which, as of May 13, 2025, sits at the
# top of Hugging Face's [ASR leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard).
# - stream a .wav file from a URL (optional, default is "Dream Within a Dream" by Edgar Allan Poe).
#
# ```bash
# modal run 06_gpu_and_ml/audio-to-text/parakeet.py --audio-url="https://github.com/voxserv/audio_quality_testing_samples/raw/refs/heads/master/mono_44100/156550__acclivity__a-dream-within-a-dream.wav"
# ```
#
# You should see output like the following:
#
# ```bash
# 🎤 Starting Transcription
# A Dream Within A Dream Edgar Allan Poe
# take this kiss upon the brow, And in parting from you now, Thus much let me avow You are not wrong who deem That my days have been a dream.
# ...
# ```
#
# Running a web service you can hit from any browser isn't any harder -- Modal handles the deployment of both the frontend and backend in a single App!
    .entrypoint([])  # silence chatty logs from the container on start
    .add_local_dir(  # these files change fastest, so make this the last layer
        Path(__file__).parent / "frontend",
        remote_path="/frontend",
    )
)
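# For context, here's a hedged sketch of how a chain like the one above
# typically fits into a full Modal image definition. The base image and the
# `pip_install` line are illustrative assumptions, not necessarily the exact
# dependencies this example installs:
#
# ```python
# from pathlib import Path
#
# import modal
#
# image = (
#     modal.Image.debian_slim(python_version="3.12")  # assumed base image
#     .pip_install("fastapi")  # illustrative dependency only
#     .entrypoint([])  # silence chatty logs from the container on start
#     .add_local_dir(  # local files change fastest: make this the last layer
#         Path(__file__).parent / "frontend",
#         remote_path="/frontend",
#     )
# )
# ```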
# ## Implementing real-time audio transcription on Modal
#
# Now we're ready to implement transcription. We wrap inference in a [`modal.Cls`](https://modal.com/docs/guide/lifecycle-functions) that
# ensures models are loaded and then moved to the GPU once when a new container starts.
#
# A couple of notes about this code (a brief sketch follows the list):
# - The `transcribe` method takes bytes of audio data and returns the transcribed text.
# - The `web` method creates a FastAPI app using [`modal.asgi_app`](https://modal.com/docs/reference/modal.asgi_app#modalasgi_app) that serves a
# [WebSocket](https://modal.com/docs/guide/webhooks#websockets) endpoint for real-time audio transcription and a browser frontend for transcribing audio from your microphone.
# - The `run_with_queue` method takes a [`modal.Queue`](https://modal.com/docs/reference/modal.Queue) and passes audio data and transcriptions between our local machine and the GPU container.
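#
# Here's a condensed sketch of that class structure, with method bodies
# elided. The App name, GPU type, and model-loading call are assumptions for
# illustration, not a definitive copy of this example's implementation:
#
# ```python
# import modal
#
# app = modal.App("parakeet-sketch")  # hypothetical App name
#
#
# @app.cls(gpu="A10G")  # GPU choice is an assumption for this sketch
# class Parakeet:
#     @modal.enter()
#     def load(self):
#         import nemo.collections.asr as nemo_asr
#
#         # load the model once per container, at startup, not per request
#         self.model = nemo_asr.models.ASRModel.from_pretrained(
#             model_name="nvidia/parakeet-tdt-0.6b-v2"
#         )
#
#     def transcribe(self, audio_bytes: bytes) -> str:
#         ...  # run inference on a chunk of audio, return the text
#
#     @modal.asgi_app()
#     def web(self):
#         ...  # FastAPI app: WebSocket endpoint plus browser frontend
#
#     @modal.method()
#     def run_with_queue(self, q: modal.Queue):
#         ...  # read audio from the queue, write transcriptions back
# ```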
#
# Parakeet tries really hard to transcribe everything to English!
# Hence it tends to output utterances like "Yeah" or "Mm-hmm" when it runs on silent audio.
# We pre-process the incoming audio in the server using `pydub`'s silence detection,
# ensuring that we don't pass silence into our model.
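#
# A minimal sketch of that pre-processing step, assuming 16 kHz mono 16-bit
# PCM chunks; the threshold and window values here are illustrative, not the
# exact settings this example uses:
#
# ```python
# from pydub import AudioSegment, silence
#
#
# def has_speech(chunk: bytes, frame_rate: int = 16_000) -> bool:
#     segment = AudioSegment(
#         data=chunk, sample_width=2, frame_rate=frame_rate, channels=1
#     )
#     # detect_nonsilent returns [start_ms, end_ms] spans louder than
#     # silence_thresh (in dBFS); an empty list means the chunk is silence
#     nonsilent = silence.detect_nonsilent(
#         segment, min_silence_len=500, silence_thresh=-45
#     )
#     return len(nonsilent) > 0
# ```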
END_OF_STREAM = (
    b"END_OF_STREAM_8f13d09"  # byte sequence indicating a stream is finished
)
# ## Running transcription from a local Python client
#
# Next, let's test the model with a [`local_entrypoint`](https://modal.com/docs/reference/modal.App#local_entrypoint) that streams audio data to the server and prints
# out the transcriptions to our terminal as they arrive.
#
# Instead of using the WebSocket endpoint like the browser frontend,
# we'll use a [`modal.Queue`](https://modal.com/docs/reference/modal.Queue)
# to pass audio data and transcriptions between our local machine and the GPU container.
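#
# A minimal sketch of that client loop, assuming the `Parakeet` class and the
# `END_OF_STREAM` sentinel defined above; the partition names and the
# `.spawn()` call pattern are illustrative:
#
# ```python
# @app.local_entrypoint()
# def main():
#     audio_bytes = ...  # download or read a .wav file here
#
#     with modal.Queue.ephemeral() as q:
#         # start transcription on the GPU container without blocking
#         Parakeet().run_with_queue.spawn(q)
#
#         # stream the audio to the server on one queue partition...
#         q.put(audio_bytes, partition="audio")
#         q.put(END_OF_STREAM, partition="audio")
#
#         # ...and print transcriptions from another as they arrive
#         while (text := q.get(partition="transcription")) != END_OF_STREAM:
#             print(text)
# ```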