Deploying speech-to-speech on an endpoint #2363

Open · wants to merge 29 commits into base `main`
Conversation

@andimarafioti (Member) commented Sep 24, 2024

How to deploy a complex application on an Inference Endpoint. We created a custom Docker container and a custom handler.

Review thread on `s2s_endpoint.md`:
```
docker push andito/speech-to-speech:latest
```

With the Docker image built and pushed, it’s ready to be used in the Hugging Face Inference Endpoint. By using this pre-built image, the endpoint can launch faster and run more efficiently, as all dependencies and data are pre-packaged within the image.
Member

is it true that the endpoint launches faster? why? it would be good to back up the 'more efficiently' claim if we can.

Member Author

If you don't do this, then when you launch the endpoint it has to install the dependencies, which makes start-up take longer. If your instance has a disk, the cost is mostly paid on the first run, but the startup script will still recheck the dependencies every time it runs.

- e.g. `speech-to-speech-demo`
- Keep it lower-case and short
- Choose your preferred Cloud and Hardware - We used `AWS` `GPU` `L4`
- It's only `$0.80` an hour and is big enough to handle the models
Member

is there anything we can say in terms of guidance for how to select 'big enough' hardware?

Member

perhaps something high-level re: which part of the pipeline drives most of the workload?

Member Author

Sure! The LLM is the part that makes you want to look at bigger instances here.
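
For a rough sense of scale, a back-of-the-envelope check like the one below is one way to reason about "big enough". The numbers here are illustrative assumptions, not measurements from the blog:

```python
# Illustrative sizing check: all numbers are assumptions, not from the blog.
def llm_vram_gb(params_billions: float, bytes_per_param: float = 2.0,
                overhead: float = 1.2) -> float:
    """Rough weight memory in GB, with ~20% headroom for KV cache and activations."""
    return params_billions * bytes_per_param * overhead

# e.g. an 8B-parameter model in fp16 needs roughly 19 GB,
# which fits in the L4's 24 GB alongside the smaller STT/TTS models.
print(f"{llm_vram_gb(8.0):.1f} GB")  # -> 19.2 GB
```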

@andimarafioti self-assigned this Sep 26, 2024
@andimarafioti marked this pull request as ready for review September 27, 2024 09:15
Co-authored-by: Diego Maniloff <[email protected]>
@Vaibhavs10 (Member) left a comment

3. Receive the audio responses from the server
4. Play back the audio responses

The audio is recorded in the `audio_input_callback` method, which simply submits all chunks to a queue. It is then sent to the server with the `send_audio` method; if there is no audio to send, we still submit an empty array in order to receive a response from the server. The responses from the server are handled by the `on_message` method we saw earlier in the blog. Playback of the audio responses is then handled by the `audio_output_callback` method. Here we only need to ensure that the audio is in the range we expect (we don't want to destroy someone's eardrums because of a faulty packet!) and that the size of the output array is what the playback library expects.
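
To make that concrete, here is a minimal sketch of what those two callbacks could look like. This is an illustration rather than the blog's actual code: it assumes `sounddevice`-style callbacks, mono `int16` audio, and the queue names `send_queue`/`recv_queue` are invented for the example:

```python
import queue

import numpy as np

send_queue = queue.Queue()  # microphone chunks waiting to be sent
recv_queue = queue.Queue()  # server responses waiting for playback

def audio_input_callback(indata, frames, time, status):
    # Recording side: push every captured chunk onto the send queue;
    # a separate thread drains it and calls send_audio().
    send_queue.put(indata.copy())

def audio_output_callback(outdata, frames, time, status):
    # Playback side: fall back to silence if no response has arrived yet.
    try:
        chunk = recv_queue.get_nowait().reshape(-1)
    except queue.Empty:
        chunk = np.zeros(frames, dtype=np.int16)
    # Clamp to the int16 range so a faulty packet can't blast the speakers...
    chunk = np.clip(chunk, -32768, 32767).astype(np.int16)
    # ...and pad or trim so the buffer matches what the playback library expects.
    if chunk.shape[0] < frames:
        chunk = np.pad(chunk, (0, frames - chunk.shape[0]))
    outdata[:] = chunk[:frames].reshape(outdata.shape)
```

The silence fallback mirrors the empty-array trick on the send path: both keep the stream flowing even when one side has nothing to say.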
Contributor
Maybe a short conclusion and links again to all the useful repos. Could also let users know to open a discussion on a repo if they run into issues or have questions

@datavistics (Contributor)

Make sure to add an entry in _blog.yaml

ref: https://github.com/huggingface/blog?tab=readme-ov-file#how-to-write-an-article-

Done!

Comment on lines +4708 to +4712 in `_blog.yaml`:

```yaml
- audio
- speech-to-speech
- inference
- inference-endpoints
```