Deploying speech-to-speech on an endpoint #2363
base: main
Conversation
Co-authored-by: Diego Maniloff <[email protected]>
Co-authored-by: Derek <[email protected]>
```
docker push andito/speech-to-speech:latest
```

With the Docker image built and pushed, it’s ready to be used in the Hugging Face Inference Endpoint. By using this pre-built image, the endpoint can launch faster and run more efficiently, as all dependencies and data are pre-packaged within the image.
is it true that the endpoint launches faster? why? it would be good to back up the 'more efficiently' claim if we can.
If you don't do this, then when you launch the endpoint it has to install the dependencies, which makes start-up take longer. If your instance has a disk, the cost is mostly paid on the first run, but the startup script will still re-check the dependencies every time it runs.
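To illustrate the point, here is a minimal Dockerfile sketch (file names and the helper script are assumptions, not the exact Dockerfile from this post): dependencies and model weights are baked in at *build* time, so the endpoint skips those steps at boot.

```dockerfile
# Sketch only: requirements.txt, download_models.py and server.py are
# assumed names, not the exact files from this post.
FROM python:3.11-slim

WORKDIR /app

# Install dependencies once, when the image is built.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Pre-download model weights into the image so the endpoint does not
# have to fetch them on start-up (hypothetical helper script).
COPY download_models.py .
RUN python download_models.py

COPY . .
CMD ["python", "server.py"]
```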
- e.g. `speech-to-speech-demo`
- Keep it lower-case and short
- Choose your preferred Cloud and Hardware - We used `AWS` `GPU` `L4`
- It's only `$0.80` an hour and is big enough to handle the models
is there anything we can say in terms of guidance for how to select 'big enough' hardware?
perhaps something high-level re: which part of the pipeline drives most of the workload?
Sure! The LLM is the part that makes you want to look at bigger instances here.
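For reference, the same endpoint can also be created programmatically with `huggingface_hub`'s `create_inference_endpoint`. A sketch, not the exact setup from the post: the region, the `instance_type`/`instance_size` strings for the L4, and the `custom_image` fields are assumptions to verify against the Inference Endpoints UI.

```python
# Sketch: instance strings, region, and custom_image fields are
# assumptions -- check the exact values in the Endpoints UI.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "speech-to-speech-demo",               # lower-case, short name
    repository="andito/speech-to-speech",  # assumption: backing repository
    framework="custom",
    task="custom",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",                    # assumption: pick your region
    type="protected",
    instance_size="x1",                    # assumption for the L4
    instance_type="nvidia-l4",             # assumption: the ~$0.80/h L4
    custom_image={
        "health_route": "/health",         # assumption
        "url": "andito/speech-to-speech:latest",  # the image pushed above
    },
)
endpoint.wait()  # block until the endpoint is running
```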
Co-authored-by: Diego Maniloff <[email protected]>
Co-authored-by: Diego Maniloff <[email protected]>
Make sure to add an entry in _blog.yaml
ref: https://github.com/huggingface/blog?tab=readme-ov-file#how-to-write-an-article-
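For example, an entry in `_blog.yaml` generally looks something like this (the slug, thumbnail path, and date below are placeholders, not the final values):

```yaml
# Sketch of a _blog.yaml entry; slug, thumbnail, and date are
# placeholders to fill in before merging.
- local: speech-to-speech-endpoint
  title: "Deploying speech-to-speech on an endpoint"
  author: andito
  thumbnail: /blog/assets/speech-to-speech/thumbnail.png
  date: TBD
  tags:
    - audio
    - speech-to-speech
    - inference
    - inference-endpoints
```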
3. Receive the audio responses from the server
4. Playback the audio responses

The audio is recorded in the `audio_input_callback` method, which simply submits all chunks to a queue. It is then sent to the server with the `send_audio` method. If there is no audio to send, we still submit an empty array in order to receive a response from the server. The responses from the server are handled by the `on_message` method we saw earlier in the blog. The playback of the audio responses is then handled by the `audio_output_callback` method. Here we only need to ensure that the audio is in the range we expect (we don't want to destroy someone's eardrums because of a faulty packet!) and that the size of the output array is what the playback library expects.
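A minimal sketch of that client loop, assuming `sounddevice` for audio I/O; the websocket wiring (`ws`) and the message framing are assumptions, not the blog's exact client code.

```python
# Sketch: mirrors the callbacks described above; `ws` and the 16 kHz
# mono format are assumptions, not the post's exact client.
import queue

import numpy as np
import sounddevice as sd

audio_in: queue.Queue = queue.Queue()   # recorded chunks, oldest first
audio_out: queue.Queue = queue.Queue()  # decoded server responses

def audio_input_callback(indata, frames, time, status):
    # Submit every recorded chunk to the input queue.
    audio_in.put(indata.copy())

def send_audio(ws):
    # Send the next recorded chunk; if there is nothing to send,
    # submit an empty array so the server still sends a response.
    try:
        chunk = audio_in.get(timeout=0.1)
    except queue.Empty:
        chunk = np.array([], dtype=np.int16)
    ws.send(chunk.tobytes())

def audio_output_callback(outdata, frames, time, status):
    # Play back server responses: clamp samples to the expected range
    # (protect eardrums!) and match the buffer size the library expects.
    try:
        chunk = audio_out.get_nowait().astype(np.float32)
    except queue.Empty:
        chunk = np.zeros(frames, dtype=np.float32)
    chunk = np.clip(chunk, -1.0, 1.0)
    outdata[:] = np.resize(chunk, frames).reshape(-1, 1)

# Wire the callbacks into the audio streams (mono, 16 kHz assumed).
in_stream = sd.InputStream(samplerate=16000, channels=1,
                           callback=audio_input_callback)
out_stream = sd.OutputStream(samplerate=16000, channels=1,
                             callback=audio_output_callback)
```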
Maybe a short conclusion and links again to all the useful repos. Could also let users know to open a discussion on a repo if they run into issues or have questions
Done!
- audio
- speech-to-speech
- inference
- inference-endpoints
@andimarafioti @dmaniloff , are there any other tags you can think of?
Full list here: https://huggingface.slack.com/archives/C01BWJU0YKW/p1724308338706409?thread_ts=1724308331.622869&cid=C01BWJU0YKW
How to deploy a complex application on an Inference Endpoint: we created a custom Docker container and a custom handler.
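For readers who haven't written one before, custom handlers for Inference Endpoints generally follow the `EndpointHandler` convention in a `handler.py`. A sketch only; the pipeline wiring below is an assumption, not the exact handler from this post.

```python
# handler.py -- sketch of the EndpointHandler convention; the
# speech-to-speech pipeline wiring is an assumption, not this
# post's exact handler.
class EndpointHandler:
    def __init__(self, path: str = ""):
        # Load models once at start-up; `path` points at the repository.
        self.pipeline = load_pipeline(path)  # hypothetical helper

    def __call__(self, data: dict) -> dict:
        # data["inputs"] carries the request payload (here: audio).
        audio = data["inputs"]
        return {"outputs": self.pipeline(audio)}
```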