Deploying speech-to-speech on an endpoint #2363

Open · wants to merge 29 commits into base `main`
Conversation

@andimarafioti (Member) commented Sep 24, 2024

How to deploy a complex application on an Inference Endpoint. We created a custom Docker container and a custom handler.

Review thread on `s2s_endpoint.md`:
```
docker push andito/speech-to-speech:latest
```

With the Docker image built and pushed, it’s ready to be used in the Hugging Face Inference Endpoint. By using this pre-built image, the endpoint can launch faster and run more efficiently, as all dependencies and data are pre-packaged within the image.
Member

is it true that the endpoint launches faster? why? it would be good to back up the 'more efficiently' claim if we can.

Member Author

If you don't do this, then when you launch the endpoint it has to install the dependencies, which makes start-up take longer. If your instance has a disk, the cost is mostly paid on the first run, but the startup script will still recheck the dependencies every time it runs.

- e.g. `speech-to-speech-demo`
- Keep it lower-case and short
- Choose your preferred Cloud and Hardware - We used `AWS` `GPU` `L4`
- It's only `$0.80` an hour and is big enough to handle the models
Member

is there anything we can say in terms of guidance for how to select 'big enough' hardware?

Member

perhaps something high-level re: which part of the pipeline drives most of the workload?

Member Author

Sure! The LLM is the part that makes you want to look at bigger instances here.
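
For a rough sense of scale, a back-of-the-envelope check like the one below is one way to reason about "big enough". The numbers here are illustrative assumptions, not measurements from the blog:

```python
# Illustrative sizing check: all numbers are assumptions, not from the blog.
def llm_vram_gb(params_billions: float, bytes_per_param: float = 2.0,
                overhead: float = 1.2) -> float:
    """Rough weight memory in GB, with ~20% headroom for KV cache and activations."""
    return params_billions * bytes_per_param * overhead

# e.g. an 8B-parameter model in fp16 needs roughly 19 GB,
# which fits in the L4's 24 GB alongside the smaller STT/TTS models.
print(f"{llm_vram_gb(8.0):.1f} GB")  # -> 19.2 GB
```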

@andimarafioti self-assigned this Sep 26, 2024
@andimarafioti marked this pull request as ready for review September 27, 2024 09:15
Co-authored-by: Diego Maniloff <[email protected]>
@Vaibhavs10 (Member) left a comment

3. Receive the audio responses from the server
4. Play back the audio responses

The audio is recorded in the `audio_input_callback` method, which simply submits all chunks to a queue. It is then sent to the server with the `send_audio` method; if there is no audio to send, we still submit an empty array in order to receive a response from the server. The responses from the server are handled by the `on_message` method we saw earlier in the blog. Playback of the audio responses is then handled by the `audio_output_callback` method. Here we only need to ensure that the audio is in the range we expect (we don't want to destroy someone's eardrums because of a faulty packet!) and that the size of the output array is what the playback library expects.
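
To make that concrete, here is a minimal sketch of what those two callbacks could look like. This is an illustration rather than the blog's actual code: it assumes `sounddevice`-style callbacks, mono `int16` audio, and the queue names `send_queue`/`recv_queue` are invented for the example:

```python
import queue

import numpy as np

send_queue = queue.Queue()  # microphone chunks waiting to be sent
recv_queue = queue.Queue()  # server responses waiting for playback

def audio_input_callback(indata, frames, time, status):
    # Recording side: push every captured chunk onto the send queue;
    # a separate thread drains it and calls send_audio().
    send_queue.put(indata.copy())

def audio_output_callback(outdata, frames, time, status):
    # Playback side: fall back to silence if no response has arrived yet.
    try:
        chunk = recv_queue.get_nowait().reshape(-1)
    except queue.Empty:
        chunk = np.zeros(frames, dtype=np.int16)
    # Clamp to the int16 range so a faulty packet can't blast the speakers...
    chunk = np.clip(chunk, -32768, 32767).astype(np.int16)
    # ...and pad or trim so the buffer matches what the playback library expects.
    if chunk.shape[0] < frames:
        chunk = np.pad(chunk, (0, frames - chunk.shape[0]))
    outdata[:] = chunk[:frames].reshape(outdata.shape)
```

The silence fallback mirrors the empty-array trick on the send path: both keep the stream flowing even when one side has nothing to say.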
Contributor
Maybe a short conclusion and links again to all the useful repos. Could also let users know to open a discussion on a repo if they run into issues or have questions

@datavistics (Contributor)

Make sure to add an entry in _blog.yaml

ref: https://github.com/huggingface/blog?tab=readme-ov-file#how-to-write-an-article-

Done!

Comment on lines +4708 to +4712 in `_blog.yaml`:

```yaml
- audio
- speech-to-speech
- inference
- inference-endpoints
```