Parakeet text cleanups #1193

Merged on May 28, 2025 (8 commits)
Changes from 2 commits
30 changes: 15 additions & 15 deletions 06_gpu_and_ml/audio-to-text/parakeet.py
@@ -7,30 +7,33 @@
# This example uses the `nvidia/parakeet-tdt-0.6b-v2` model, which, as of May 13, 2025, sits at the
# top of Hugging Face's [ASR leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard).

# To run this example either:
# To run this example, either:

# - run the browser/microphone frontend, or
# - Run the browser/microphone frontend. Modal handles the deployment of both the frontend and backend in a single app! You should see a browser window pop up - make sure you allow access to your microphone. The full frontend code can be found [here](https://github.com/modal-labs/modal-examples/tree/main/06_gpu_and_ml/audio-to-text/frontend).
Collaborator

  1. Do we need the callout to it being a single app?

  2. For me the browser window does not pop up automatically... I have to click the link in the terminal. Does it automatically open for you?

Contributor Author

  1. I always think it's so cool but defer to you guys
  2. Oh good catch; will fix

# ```bash
# modal serve 06_gpu_and_ml/audio-to-text/parakeet.py
# ```
# - stream a .wav file from a URL (optional, default is "Dream Within a Dream" by Edgar Allan Poe).
# - Or, stream a `.wav` file directly from a URL to simulate real-time transcription in your terminal:
# ```bash
# modal run 06_gpu_and_ml/audio-to-text/parakeet.py --audio-url="https://github.com/voxserv/audio_quality_testing_samples/raw/refs/heads/master/mono_44100/156550__acclivity__a-dream-within-a-dream.wav"
# ```

# See [Troubleshooting](https://modal.com/docs/examples/parakeet#client) at the bottom if you run into issues.

# Here's what your final output might look like:
# You should see output like the following in your terminal:

# ```bash
# 🌐 Downloading audio file...
# 🎧 Downloaded 6331478 bytes
# ☀️ Waking up model, this may take a few seconds on cold start...
# 📝 Transcription: A Dream Within A Dream Edgar Allan Poe
# 📝 Transcription:
# 📝 Transcription: take this kiss upon the brow, And in parting from you now, Thus much let me avow You are not wrong who deem That my days have been a dream.
# 📝 Transcription: Take this kiss upon the brow,
Collaborator

It doesn't actually break up the lines like this in the output... do we want it to be the actual output or this better looking one?

Contributor Author

I kinda like the better looking one but defer to you guys if that's dishonest 😅

Contributor Author

Movie magic yk

Collaborator

@charlesfrye what's the Modal-Frye Style Guide say here?

Collaborator

I think splitting on punctuation with newlines in the code is a good idea! I'd like for the output to be real.

Collaborator

@charlesfrye I feel like coding that is a can of worms. Just breaking on `,` or `.` could have false positives. Right now, line breaks are based on speech breaks. Using a poem as the example is fun, but it complicates this a bit.
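
To make the false-positive concern concrete, here is a minimal sketch of the naive approach under discussion. The function name `split_on_punctuation` and the sample sentences are hypothetical and are not part of `parakeet.py`; the example only illustrates why breaking after every `,` or `.` works on the poem but misfires on ordinary prose.

```python
import re


def split_on_punctuation(text: str) -> list[str]:
    """Naively break a transcript after every comma or period.

    Hypothetical helper for illustration only; not code from the PR.
    """
    # Split on whitespace that directly follows a comma or a period.
    parts = re.split(r"(?<=[,.])\s+", text)
    return [part for part in parts if part]


poem = "Take this kiss upon the brow, And in parting from you now, Thus much let me avow."
prose = "Dr. Smith paid $1,000.50 for apples, pears, and plums."

if __name__ == "__main__":
    # The poem splits cleanly into one verse per line.
    for line in split_on_punctuation(poem):
        print(line)
    print("---")
    # The prose is broken after the abbreviation "Dr." and inside the list.
    for line in split_on_punctuation(prose):
        print(line)
```

On the poem this yields one verse per line, but the prose sentence is split after "Dr." and between the list items, which is the kind of false positive the comment above describes.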

# 📝 Transcription: And in parting from you now,
# 📝 Transcription: Thus much let me avow,
# 📝 Transcription: You are not wrong who deem
# 📝 Transcription: That my days have been a dream.
# ...
# ```
# See [Troubleshooting](https://modal.com/docs/examples/parakeet#client) at the bottom if you run into issues.
Collaborator

I don't think we need the Troubleshooting section anymore.



# ## Setup
import asyncio
@@ -40,9 +43,8 @@
import modal

os.environ["MODAL_LOGLEVEL"] = "INFO"
app_name = "parakeet-websocket"

app = modal.App(app_name)
app = modal.App("parakeet-websocket")
SILENCE_THRESHOLD = -45
SILENCE_MIN_LENGTH_MSEC = 1000
END_OF_STREAM = b"END_OF_STREAM"
@@ -101,6 +103,7 @@
# - The `transcribe` method takes bytes of audio data, and returns the transcribed text.
# - The `web` method creates a FastAPI app using [`modal.asgi_app`](https://modal.com/docs/reference/modal.asgi_app#modalasgi_app) that serves a
# [WebSocket](https://modal.com/docs/guide/webhooks#websockets) endpoint for real-time audio transcription and a browser frontend for transcribing audio from your microphone.
# - The `run_with_queue` method takes a [`modal.Queue`](https://modal.com/docs/reference/modal.Queue) and passes audio data and transcriptions between our local machine and the GPU container.

# Parakeet tries really hard to transcribe everything to English!
# Hence it tends to output utterances like "Yeah" or "Mm-hmm" when it runs on silent audio.
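
As a rough illustration of how a threshold like `SILENCE_THRESHOLD = -45` could be used to keep silent audio away from the model, the sketch below measures the loudness of a chunk of raw audio. It assumes the threshold is expressed in dBFS and that the audio is 16-bit signed PCM; both are assumptions, and the helpers `dbfs` and `is_silent` are hypothetical names, not the functions used in `parakeet.py`.

```python
import array
import math

SILENCE_THRESHOLD = -45  # value from the diff; the dBFS unit is an assumption


def dbfs(pcm_bytes: bytes) -> float:
    """Return the RMS loudness of 16-bit signed PCM audio in dBFS."""
    samples = array.array("h")  # "h" = signed 16-bit, native byte order
    samples.frombytes(pcm_bytes)
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")
    # 32768 is the full-scale amplitude for 16-bit audio.
    return 20 * math.log10(rms / 32768)


def is_silent(pcm_bytes: bytes, threshold: float = SILENCE_THRESHOLD) -> bool:
    """Treat a chunk as silent when its loudness falls below the threshold."""
    return dbfs(pcm_bytes) < threshold


if __name__ == "__main__":
    quiet = array.array("h", [10] * 1600).tobytes()
    loud = array.array("h", [8000] * 1600).tobytes()
    print(is_silent(quiet), is_silent(loud))
```

Chunks flagged as silent could then be skipped rather than transcribed, which would avoid the spurious "Yeah" or "Mm-hmm" outputs.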
@@ -275,9 +278,7 @@ def main(audio_url: str = AUDIO_URL):
# Below are the three main functions that coordinate streaming audio and receiving transcriptions.
#
# `send_audio` transmits chunks of audio data and then pauses to approximate streaming
# speech at a natural rate. That said, we set it to faster
Contributor Author

thought this was a bit too honest 😅

Collaborator

I would say the prose is too casual.

Collaborator

I'll resist defending the tone of this sentence. But it might be worth mentioning that we set it to faster than realtime just so people understand why we divide the wait time by 8 (i.e. wait for 1/8th the duration of the chunk we just sent).

Collaborator

Probably okay to not mention it. It's not crucial or the focus here.
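
For readers following this thread, the pacing being discussed can be sketched roughly as follows: after sending a chunk, wait for one eighth of that chunk's playback duration. This is a simplified stand-in, not the PR's code. It uses a local `asyncio.Queue` instead of `modal.Queue`, and the sample rate (16 kHz), sample width (16-bit mono), and the names `pause_for_chunk`, `send_audio_sketch`, and `demo` are assumptions made for illustration.

```python
import asyncio

END_OF_STREAM = b"END_OF_STREAM"  # sentinel value taken from the diff
SPEEDUP = 8  # from the thread: wait 1/8th of each chunk's duration


def pause_for_chunk(
    num_bytes: int,
    sample_rate: int = 16000,  # assumed, not confirmed by the diff
    bytes_per_sample: int = 2,  # assumed 16-bit mono audio
    speedup: float = SPEEDUP,
) -> float:
    """Seconds to wait after sending a chunk of the given size."""
    chunk_duration_sec = num_bytes / (sample_rate * bytes_per_sample)
    return chunk_duration_sec / speedup


async def send_audio_sketch(queue: asyncio.Queue, chunks: list[bytes]) -> None:
    """Send chunks faster than real time, then signal the end of the stream."""
    for chunk in chunks:
        await queue.put(chunk)
        await asyncio.sleep(pause_for_chunk(len(chunk)))
    await queue.put(END_OF_STREAM)


async def demo() -> list[bytes]:
    """Send three tiny chunks and return everything that reached the queue."""
    queue: asyncio.Queue = asyncio.Queue()
    await send_audio_sketch(queue, [b"\x00" * 320] * 3)
    received = []
    while not queue.empty():
        received.append(queue.get_nowait())
    return received


if __name__ == "__main__":
    print(pause_for_chunk(32000))  # one second of audio under these assumptions
    print(len(asyncio.run(demo())))
```

Under these assumptions, a one-second chunk (32,000 bytes) is followed by a 0.125-second pause, so the whole file streams in about an eighth of its playback time.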

# than real-time to compensate for network latency. Plus, we're not
# trying to wait forever for this to finish.
# speech at a natural rate.


async def send_audio(q, audio_bytes):
@@ -289,8 +290,7 @@ async def send_audio(q, audio_bytes):
    await q.put.aio(END_OF_STREAM, partition="audio")


# `receive_transcriptions` is straightforward.
# It just waits for a transcription and prints it after a small delay to avoid colliding with the print statements
# `receive_transcriptions` waits for a transcription and prints it after a small delay to avoid colliding with the print statements
# from the GPU container.

