Commit 1d80e4f

Parakeet text cleanups (#1193)

* more text cleanups
* nits
* minor text changes
* rm troubleshooting
* update
* minor text fixes and reorganization
* silence unnecessary logs
* refactor out one level of async nesting

Co-authored-by: Charles Frye <[email protected]>

1 parent 33fb820 commit 1d80e4f

File tree

1 file changed: +120 -95 lines

06_gpu_and_ml/audio-to-text/parakeet.py

Lines changed: 120 additions & 95 deletions
@@ -1,66 +1,73 @@
-# # Real time audio transcription using Parakeet 🦜
+# # Real-time audio transcription using Parakeet
 
-# [Parakeet](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html#parakeet) is the name of a family of ASR models built using [NVIDIA's NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html).
-# We'll show you how to use Parakeet for real-time audio transcription,
-# with a simple Python client and a GPU server you can spin up easily in Modal.
+# This example demonstrates the use of Parakeet ASR models for real-time speech-to-text on Modal.
 
-# This example uses the `nvidia/parakeet-tdt-0.6b-v2` model, which, as of May 13, 2025, sits at the
-# top of Hugging Face's [ASR leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard).
+# [Parakeet](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html#parakeet)
+# is the name of a family of ASR models built using [NVIDIA's NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html).
+# We'll show you how to use Parakeet for real-time audio transcription on Modal GPUs,
+# with simple Python and browser clients.
 
-# To run this example either:
+# This example uses the `nvidia/parakeet-tdt-0.6b-v2` model which, as of June 2025, sits at the
+# top of Hugging Face's [Open ASR leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard).
+
+# To try out transcription from your terminal,
+# provide a URL for a `.wav` file to `modal run`:
 
-# - run the browser/microphone frontend, or
-# ```bash
-# modal serve 06_gpu_and_ml/audio-to-text/parakeet.py
-# ```
-# - stream a .wav file from a URL (optional, default is "Dream Within a Dream" by Edgar Allan Poe).
 # ```bash
 # modal run 06_gpu_and_ml/audio-to-text/parakeet.py --audio-url="https://github.com/voxserv/audio_quality_testing_samples/raw/refs/heads/master/mono_44100/156550__acclivity__a-dream-within-a-dream.wav"
 # ```
 
-# See [Troubleshooting](https://modal.com/docs/examples/parakeet#client) at the bottom if you run into issues.
-
-# Here's what your final output might look like:
+# You should see output like the following:
 
 # ```bash
-# 🌐 Downloading audio file...
-# 🎧 Downloaded 6331478 bytes
-# ☀️ Waking up model, this may take a few seconds on cold start...
-# 📝 Transcription: A Dream Within A Dream Edgar Allan Poe
-# 📝 Transcription:
-# 📝 Transcription: take this kiss upon the brow, And in parting from you now, Thus much let me avow You are not wrong who deem That my days have been a dream.
+# 🎤 Starting Transcription
+# A Dream Within A Dream Edgar Allan Poe
+# take this kiss upon the brow, And in parting from you now, Thus much let me avow You are not wrong who deem That my days have been a dream.
 # ...
 # ```
 
+# Running a web service you can hit from any browser isn't any harder -- Modal handles the deployment of both the frontend and backend in a single App!
+# Just run
+
+# ```bash
+# modal serve 06_gpu_and_ml/audio-to-text/parakeet.py
+# ```
+
+# and go to the link printed in your terminal.
+
+# The full frontend code can be found [here](https://github.com/modal-labs/modal-examples/tree/main/06_gpu_and_ml/audio-to-text/frontend).
+
 # ## Setup
+
 import asyncio
 import os
+import sys
 from pathlib import Path
 
 import modal
 
-os.environ["MODAL_LOGLEVEL"] = "INFO"
-app_name = "parakeet-websocket"
+app = modal.App("example-parakeet")
 
-app = modal.App(app_name)
-SILENCE_THRESHOLD = -45
-SILENCE_MIN_LENGTH_MSEC = 1000
-END_OF_STREAM = b"END_OF_STREAM"
 # ## Volume for caching model weights
+
 # We use a [Modal Volume](https://modal.com/docs/guide/volumes) to cache the model weights.
 # This allows us to avoid downloading the model weights every time we start a new instance.
 
+# For more on storing models on Modal, see [this guide](https://modal.com/docs/guide/model-weights).
+
 model_cache = modal.Volume.from_name("parakeet-model-cache", create_if_missing=True)
+
 # ## Configuring dependencies
-# The model runs remotely inside a [custom container](https://modal.com/docs/guide/custom-container). We can define the environment
-# and install our Python dependencies in that container's `Image`.
 
-# For inference, we recommend using the official NVIDIA CUDA Docker images from Docker Hub.
+# The model runs remotely inside a container on Modal. We can define the environment
+# and install our Python dependencies in that container's [`Image`](https://modal.com/docs/guide/images).
+
+# For finicky setups like NeMo's, we recommend using the official NVIDIA CUDA Docker images from Docker Hub.
 # You'll need to install Python and pip with the `add_python` option because the image
 # doesn't have these by default.
 
 # Additionally, we install `ffmpeg` for handling audio data and `fastapi` to create a web
-# server for our websocket.
+# server for our WebSocket.
 
 image = (
     modal.Image.from_registry(
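Note: the hunk above cuts off at the start of the `image` definition. As a rough sketch of the pattern the comments describe, a CUDA base image from Docker Hub with Python added via `add_python`, plus system and Python packages; the tag and pins below are illustrative assumptions, not necessarily this example's exact choices:

```python
import modal

# Hypothetical sketch of the image-building pattern described above.
sketch_image = (
    modal.Image.from_registry(
        "nvidia/cuda:12.8.0-cudnn-devel-ubuntu22.04",  # assumed base tag
        add_python="3.12",  # the CUDA base image ships without Python or pip
    )
    .apt_install("ffmpeg")  # system package for audio decoding
    .pip_install("fastapi==0.115.12", "pydub==0.25.1")  # illustrative pins
)
```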
@@ -82,48 +89,61 @@
         "nemo_toolkit[asr]==2.3.0",
         "cuda-python==12.8.0",
         "fastapi==0.115.12",
-        "numpy==1.26.4",  # downgrading numpy to avoid issues with CUDA
+        "numpy<2",
         "pydub==0.25.1",
     )
-    .entrypoint([])
-    .add_local_dir(
-        os.path.join(Path(__file__).parent.resolve(), "frontend"),
+    .entrypoint([])  # silence chatty logs by container on start
+    .add_local_dir(  # changes fastest, so make this the last layer
+        Path(__file__).parent / "frontend",
         remote_path="/frontend",
     )
 )
 
 # ## Implementing real-time audio transcription on Modal
 
-# Now we're ready to implement the transcription model. We wrap inference in a [modal.Cls](https://modal.com/docs/guide/lifecycle-functions) that
-# ensures models are loaded and then moved to the GPU once when a new container starts. Couple of notes:
+# Now we're ready to implement transcription. We wrap inference in a [`modal.Cls`](https://modal.com/docs/guide/lifecycle-functions) that
+# ensures models are loaded and then moved to the GPU once when a new container starts.
 
-# - The `load` method loads the model at start, instead of during inference, using [`modal.enter()`](https://modal.com/docs/reference/modal.enter#modalenter).
-# - The `transcribe` method takes bytes of audio data, and returns the transcribed text.
+# A couple of notes about this code:
+# - The `transcribe` method takes bytes of audio data and returns the transcribed text.
 # - The `web` method creates a FastAPI app using [`modal.asgi_app`](https://modal.com/docs/reference/modal.asgi_app#modalasgi_app) that serves a
 # [WebSocket](https://modal.com/docs/guide/webhooks#websockets) endpoint for real-time audio transcription and a browser frontend for transcribing audio from your microphone.
+# - The `run_with_queue` method takes a [`modal.Queue`](https://modal.com/docs/reference/modal.Queue) and passes audio data and transcriptions between our local machine and the GPU container.
 
 # Parakeet tries really hard to transcribe everything to English!
 # Hence it tends to output utterances like "Yeah" or "Mm-hmm" when it runs on silent audio.
-# We can pre-process the incoming audio in the server by using `pydub`'s silence detection,
-# ensuring that we only pass audio with speech to our model.
+# We pre-process the incoming audio in the server using `pydub`'s silence detection,
+# ensuring that we don't pass silence into our model.
+
+END_OF_STREAM = (
+    b"END_OF_STREAM_8f13d09"  # byte sequence indicating a stream is finished
+)
 
 
 @app.cls(volumes={"/cache": model_cache}, gpu="a10g", image=image)
 @modal.concurrent(max_inputs=14, target_inputs=10)
 class Parakeet:
     @modal.enter()
     def load(self):
+        import logging
+
         import nemo.collections.asr as nemo_asr
 
+        # silence chatty logs from nemo
+        logging.getLogger("nemo_logger").setLevel(logging.CRITICAL)
+
         self.model = nemo_asr.models.ASRModel.from_pretrained(
             model_name="nvidia/parakeet-tdt-0.6b-v2"
         )
 
-    async def transcribe(self, audio_bytes: bytes) -> str:
+    def transcribe(self, audio_bytes: bytes) -> str:
         import numpy as np
 
         audio_data = np.frombuffer(audio_bytes, dtype=np.int16).astype(np.float32)
-        output = self.model.transcribe([audio_data])
+
+        with NoStdStreams():  # hide output, see https://github.com/NVIDIA/NeMo/discussions/3281#discussioncomment-2251217
+            output = self.model.transcribe([audio_data])
+
        return output[0].text
 
     @modal.asgi_app()
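The silence pre-processing described in the comments above relies on `pydub`'s `silence.detect_silence`. A minimal, standalone sketch of that call follows; the input segment and thresholds here are illustrative, and the example's own values appear in `handle_audio_chunk` further down:

```python
from pydub import AudioSegment, silence

# One second of 16 kHz, 16-bit mono silence as stand-in input data.
seg = AudioSegment(
    data=b"\x00\x00" * 16_000,
    sample_width=2,
    frame_rate=16_000,
    channels=1,
)

# Returns [start_ms, end_ms] pairs for every window quieter than the threshold.
windows = silence.detect_silence(seg, min_silence_len=1000, silence_thresh=-45)
print(windows)  # e.g. [[0, 1000]] for an all-silent segment
```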
@@ -139,7 +159,7 @@ def web(self):
         async def status():
             return Response(status_code=200)
 
-        # server frontend
+        # serve frontend
         @web_app.get("/")
         async def index():
             return HTMLResponse(content=open("/frontend/index.html").read())
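The WebSocket route itself is not visible in this hunk. For orientation, here is a minimal sketch of a FastAPI WebSocket endpoint that accepts audio bytes and sends back text, a simplified stand-in rather than the example's actual handler:

```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

web_app = FastAPI()


@web_app.websocket("/ws")
async def transcribe_ws(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            chunk = await ws.receive_bytes()  # raw PCM audio from the browser
            text = f"({len(chunk)} bytes received)"  # stand-in for real transcription
            await ws.send_text(text)
    except WebSocketDisconnect:
        pass
```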
@@ -201,7 +221,13 @@ async def run_with_queue(self, q: modal.Queue):
             print(f"Error handling queue: {type(e)}: {e}")
             return
 
-    async def handle_audio_chunk(self, chunk: bytes, audio_segment):
+    async def handle_audio_chunk(
+        self,
+        chunk: bytes,
+        audio_segment,
+        silence_thresh=-45,  # dB
+        min_silence_len=1000,  # ms
+    ):
         from pydub import AudioSegment, silence
 
         new_audio_segment = AudioSegment(
@@ -210,118 +236,103 @@ async def handle_audio_chunk(self, chunk: bytes, audio_segment):
             sample_width=2,
             frame_rate=TARGET_SAMPLE_RATE,
         )
+
         # append the new audio segment to the existing audio segment
         audio_segment += new_audio_segment
 
+        # detect windows of silence
         silent_windows = silence.detect_silence(
             audio_segment,
-            min_silence_len=SILENCE_MIN_LENGTH_MSEC,
-            silence_thresh=SILENCE_THRESHOLD,
+            min_silence_len=min_silence_len,
+            silence_thresh=silence_thresh,
         )
 
         # if there are no silent windows, continue
         if len(silent_windows) == 0:
             return audio_segment, None
+
         # get the last silent window because
         # we want to transcribe until the final pause
         last_window = silent_windows[-1]
+
         # if the entire audio segment is silent, reset the audio segment
         if last_window[0] == 0 and last_window[1] == len(audio_segment):
             audio_segment = AudioSegment.empty()
             return audio_segment, None
+
         # get the segment to transcribe: beginning until last pause
         segment_to_transcribe = audio_segment[: last_window[1]]
+
         # remove the segment to transcribe from the audio segment
         audio_segment = audio_segment[last_window[1] :]
         try:
-            text = await self.transcribe(segment_to_transcribe.raw_data)
+            text = self.transcribe(segment_to_transcribe.raw_data)
             return audio_segment, text
         except Exception as e:
             print("❌ Transcription error:", e)
             raise e
 
 
-# ## Client
+# ## Running transcription from a local Python client
+
 # Next, let's test the model with a [`local_entrypoint`](https://modal.com/docs/reference/modal.App#local_entrypoint) that streams audio data to the server and prints
-# out the transcriptions to our terminal in real-time.
+# out the transcriptions to our terminal as they arrive.
 
-# Instead of using the WebSocket endpoint like the frontend,
+# Instead of using the WebSocket endpoint like the browser frontend,
 # we'll use a [`modal.Queue`](https://modal.com/docs/reference/modal.Queue)
 # to pass audio data and transcriptions between our local machine and the GPU container.
 
 AUDIO_URL = "https://github.com/voxserv/audio_quality_testing_samples/raw/refs/heads/master/mono_44100/156550__acclivity__a-dream-within-a-dream.wav"
-TARGET_SAMPLE_RATE = 16000
-CHUNK_SIZE = 16000  # send one second of audio at a time
+TARGET_SAMPLE_RATE = 16_000
+CHUNK_SIZE = 16_000  # send one second of audio at a time
 
 
 @app.local_entrypoint()
-def main(audio_url: str = AUDIO_URL):
+async def main(audio_url: str = AUDIO_URL):
     from urllib.request import urlopen
 
-    print("🌐 Downloading audio file...")
+    print(f"🌐 Downloading audio file from {audio_url}")
     audio_bytes = urlopen(audio_url).read()
     print(f"🎧 Downloaded {len(audio_bytes)} bytes")
 
     audio_data = preprocess_audio(audio_bytes)
 
-    print("☀️ Waking up model, this may take a few seconds on cold start...")
-    try:
-        asyncio.run(run(audio_data))
-        print("✅ Transcription complete!")
-    except KeyboardInterrupt:
-        print("\n🛑 Stopped by user.")
+    print("🎤 Starting Transcription")
+    with modal.Queue.ephemeral() as q:
+        Parakeet().run_with_queue.spawn(q)
+        send = asyncio.create_task(send_audio(q, audio_data))
+        recv = asyncio.create_task(receive_text(q))
+        await asyncio.gather(send, recv)
+    print("✅ Transcription complete!")
+
 
+# Below are the two functions that coordinate streaming audio and receiving transcriptions.
 
-# Below are the three main functions that coordinate streaming audio and receiving transcriptions.
-#
-# `send_audio` transmits chunks of audio data and then pauses to approximate streaming
-# speech at a natural rate. That said, we set it to faster
-# than real-time to compensate for network latency. Plus, we're not
-# trying to wait forever for this to finish.
+# `send_audio` transmits chunks of audio data with a slight delay,
+# as though it was being streamed from a live source, like a microphone.
+# `receive_text` waits for transcribed text to arrive and prints it.
 
 
 async def send_audio(q, audio_bytes):
     for chunk in chunk_audio(audio_bytes, CHUNK_SIZE):
         await q.put.aio(chunk, partition="audio")
-        await asyncio.sleep(
-            CHUNK_SIZE / TARGET_SAMPLE_RATE / 8
-        )  # simulate real-time pacing
+        await asyncio.sleep(CHUNK_SIZE / TARGET_SAMPLE_RATE / 8)
     await q.put.aio(END_OF_STREAM, partition="audio")
 
 
-# `receive_transcriptions` is straightforward.
-# It just waits for a transcription and prints it after a small delay to avoid colliding with the print statements
-# from the GPU container.
-
-
-async def receive_transcriptions(q):
+async def receive_text(q):
     while True:
         message = await q.get.aio(partition="transcription")
         if message == END_OF_STREAM:
             break
-        await asyncio.sleep(1.00)  # add a delay to avoid stdout collision
-        print(f"📝 Transcription: {message}")
 
+        print(message)
 
-# We take full advantage of Modal's asynchronous capabilities here. In `run`, we spawn our function call
-# so it doesn't block, and then we create and wait on the send and receive tasks.
-
-
-async def run(audio_bytes):
-    with modal.Queue.ephemeral() as q:
-        Parakeet().run_with_queue.spawn(q)
-        send_task = asyncio.create_task(send_audio(q, audio_bytes))
-        receive_task = asyncio.create_task(receive_transcriptions(q))
-        await asyncio.gather(send_task, receive_task)
-
-
-# ## Troubleshooting
-# - Make sure you have the latest version of the Modal CLI installed.
-# - The server takes a few seconds to start up on cold start. If your local client times out, try
-# restarting the client.
 
 # ## Addenda
-# Helper functions for converting audio to Parakeet's input format and iterating over audio chunks.
+
+# The remainder of the code in this example is boilerplate,
+# mostly for handling Parakeet's input format.
 
 
 def preprocess_audio(audio_bytes: bytes) -> bytes:
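The client side of this hunk boils down to one pattern: open an ephemeral `modal.Queue`, `spawn` the remote method with it, and use a separate partition for each direction of traffic. A stripped-down sketch under those assumptions; helper names like `stream_bytes` and `drain_text` are illustrative, not part of the example:

```python
import asyncio

import modal

END_OF_STREAM = b"END_OF_STREAM_8f13d09"  # sentinel, as in the example


async def stream_bytes(q: modal.Queue, payload: bytes, chunk_size: int = 16_000):
    # Producer: push fixed-size chunks onto the "audio" partition, then the sentinel.
    for i in range(0, len(payload), chunk_size):
        await q.put.aio(payload[i : i + chunk_size], partition="audio")
    await q.put.aio(END_OF_STREAM, partition="audio")


async def drain_text(q: modal.Queue):
    # Consumer: read transcriptions until the sentinel comes back.
    while True:
        message = await q.get.aio(partition="transcription")
        if message == END_OF_STREAM:
            break
        print(message)


# usage sketch, mirroring the local entrypoint above:
# with modal.Queue.ephemeral() as q:
#     Parakeet().run_with_queue.spawn(q)  # remote side reads "audio", writes "transcription"
#     await asyncio.gather(stream_bytes(q, data), drain_text(q))
```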
@@ -383,3 +394,17 @@ def preprocess_audio(audio_bytes: bytes) -> bytes:
 def chunk_audio(data: bytes, chunk_size: int):
     for i in range(0, len(data), chunk_size):
         yield data[i : i + chunk_size]
+
+
+class NoStdStreams(object):
+    def __init__(self):
+        self.devnull = open(os.devnull, "w")
+
+    def __enter__(self):
+        self._stdout, self._stderr = sys.stdout, sys.stderr
+        self._stdout.flush(), self._stderr.flush()
+        sys.stdout, sys.stderr = self.devnull, self.devnull
+
+    def __exit__(self, exc_type, exc_value, traceback):
+        sys.stdout, sys.stderr = self._stdout, self._stderr
+        self.devnull.close()
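`NoStdStreams`, added at the end of the file, is a plain context manager that temporarily points `sys.stdout` and `sys.stderr` at `os.devnull` so NeMo's console output doesn't drown the transcriptions. The standard library can achieve roughly the same effect; here is a small sketch of an alternative built on `contextlib` (not what this commit uses):

```python
import contextlib
import os


def quiet_call(fn, *args, **kwargs):
    # Swallow stdout and stderr for the duration of one noisy call.
    with open(os.devnull, "w") as devnull:
        with contextlib.redirect_stdout(devnull), contextlib.redirect_stderr(devnull):
            return fn(*args, **kwargs)


# usage: result = quiet_call(print, "this line is suppressed")
```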
