Deploying speech-to-speech on an endpoint #2363

Merged: 29 commits merged into main from s2s_endpoint on Oct 22, 2024
Conversation

@andimarafioti (Member) commented on Sep 24, 2024

How to deploy a complex application on an Inference Endpoint: we created a custom Docker container and a custom handler.

```bash
docker push andito/speech-to-speech:latest
```

With the Docker image built and pushed, it’s ready to be used in the Hugging Face Inference Endpoint. By using this pre-built image, the endpoint can launch faster and run more efficiently, as all dependencies and data are pre-packaged within the image.
Contributor:
is it true that the endpoint launches faster? why? it would be good to back up the 'more efficiently' claim if we can.

Member Author:

If you don't do this, the endpoint has to install the dependencies when it launches, which makes start-up take longer. If your instance has a disk, the cost may be mostly on the first run, but the startup script will still recheck the dependencies every time it runs.
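To make this concrete, here is a hypothetical Dockerfile sketch, not the project's actual Dockerfile; the base image, `requirements.txt`, and `handler.py` entrypoint are assumptions. It shows dependencies being baked into the image at build time rather than installed at endpoint start-up:

```dockerfile
# Hypothetical sketch, not the repository's actual Dockerfile.
# Everything the handler needs is installed at build time, so the endpoint
# does not have to install dependencies when it starts.
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip libsndfile1 && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install Python dependencies first so this layer is cached between builds.
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

# Optionally pre-download model weights into the image so the first request is fast.
# RUN python3 -c "from huggingface_hub import snapshot_download; snapshot_download('<model-id>')"

# Copy the application code and the custom handler last.
COPY . .
ENTRYPOINT ["python3", "handler.py"]
```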

- e.g. `speech-to-speech-demo`
- Keep it lower-case and short
- Choose your preferred Cloud and Hardware - We used `AWS` `GPU` `L4`
- It's only `$0.80` an hour and is big enough to handle the models
Contributor:
is there anything we can say in terms of guidance for how to select 'big enough' hardware?

Contributor:
perhaps something high-level re: which part of the pipeline drives most of the workload?

Member Author:
Sure! The LLM is the part that makes you want to look at bigger instances here.
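For illustration, here is a minimal sketch of creating such an endpoint programmatically with `huggingface_hub`'s `create_inference_endpoint`, pointing it at the pre-built image. The repository, region, task, health route, and instance identifiers below are assumptions for the sketch, not the exact configuration from the post:

```python
# Sketch only: values marked "assumption" are illustrative, not from the post.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "speech-to-speech-demo",                 # endpoint name suggested above
    repository="andito/speech-to-speech",    # assumption: placeholder Hub repository
    framework="pytorch",                     # assumption
    task="custom",                           # assumption: custom container and handler
    accelerator="gpu",
    vendor="aws",                            # AWS, as chosen above
    region="us-east-1",                      # assumption: any supported region works
    type="protected",
    instance_size="x1",                      # assumption
    instance_type="nvidia-l4",               # the L4 GPU mentioned above
    custom_image={
        "url": "andito/speech-to-speech:latest",  # the image pushed earlier
        "health_route": "/health",                # assumption: depends on the container's server
    },
)

endpoint.wait()        # block until the endpoint reports it is running
print(endpoint.url)    # base URL the client connects to
```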

@andimarafioti self-assigned this Sep 26, 2024
@andimarafioti marked this pull request as ready for review September 27, 2024 09:15
@Vaibhavs10 (Member) left a comment

3. Receive the audio responses from the server
4. Play back the audio responses

The audio is recorded in the `audio_input_callback` method, which simply submits all chunks to a queue. It is then sent to the server with the `send_audio` method. If there is no audio to send, we still submit an empty array in order to receive a response from the server. The responses from the server are handled by the `on_message` method we saw earlier in the blog. The playback of the audio responses is handled by the `audio_output_callback` method. Here we only need to ensure that the audio is in the range we expect (we don't want to destroy someone's eardrums because of a faulty package!) and that the size of the output array is what the playback library expects.
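As a rough sketch of that flow (assuming `sounddevice`-style callbacks and a `websocket-client`-style `on_message`; the queue names, dtypes, and the `ws` object are illustrative, not the repository's actual code):

```python
import queue

import numpy as np

send_queue: "queue.Queue[np.ndarray]" = queue.Queue()  # microphone chunks waiting to be sent
recv_queue: "queue.Queue[np.ndarray]" = queue.Queue()  # server responses waiting for playback


def audio_input_callback(indata, frames, time, status):
    """Recording callback: submit every captured chunk to the send queue."""
    send_queue.put(indata.copy())


def send_audio(ws):
    """Send the next chunk; if there is nothing to send, send an empty array
    so the server still produces a response."""
    try:
        chunk = send_queue.get(timeout=0.01)
    except queue.Empty:
        chunk = np.array([], dtype=np.int16)
    ws.send(chunk.tobytes())


def on_message(ws, message):
    """Handle a server response: decode the raw bytes into audio samples."""
    recv_queue.put(np.frombuffer(message, dtype=np.int16))


def audio_output_callback(outdata, frames, time, status):
    """Playback callback: keep samples in the expected range and match the
    output size the playback library asks for."""
    try:
        chunk = recv_queue.get_nowait()
    except queue.Empty:
        chunk = np.zeros(frames, dtype=np.int16)
    chunk = np.clip(chunk, -32768, 32767)  # guard against out-of-range samples
    if len(chunk) < frames:                # pad or trim to exactly `frames` samples
        chunk = np.pad(chunk, (0, frames - len(chunk)))
    else:
        chunk = chunk[:frames]
    outdata[:] = chunk.reshape(-1, 1)      # mono output, shape (frames, 1)
```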
Contributor:
Maybe a short conclusion and links again to all the useful repos. Could also let users know to open a discussion on a repo if they run into issues or have questions

@datavistics (Contributor):

Make sure to add an entry in _blog.yaml

ref: https://github.com/huggingface/blog?tab=readme-ov-file#how-to-write-an-article-

Done!

Comment on lines +4708 to +4840
- audio
- speech-to-speech
- inference
- inference-endpoints
Member Author:
Those are great, thank you!

@andimarafioti merged commit c8df5eb into main Oct 22, 2024
1 check passed
@andimarafioti deleted the s2s_endpoint branch October 22, 2024 03:25
title: "Deploying Speech-to-Speech on Hugging Face Inference Endpoints with a Custom Docker Container"
author: andito
thumbnail: /blog/assets/s2s_endpoint/thumbnail.png
date: October 1, 2024
Member:
I reckon this should be updated to October 22, 2024 @andimarafioti
