Commit c4db64c

mjh1 and rickstaa authored
Add Image to Text docs (#681)

docs(ai): add Image to Text pipeline docs

This commit adds the image-to-text pipeline docs and updates the API reference.

Co-authored-by: Rick Staa <[email protected]>

1 parent bb2dd42 · commit c4db64c

File tree

6 files changed: +162 −16 lines

ai/api-reference/image-to-text.mdx (new file, +21 lines)

---
openapi: post /image-to-text
---

<Info>
  The default Gateway used in this guide is the public
  [Livepeer.cloud](https://www.livepeer.cloud/) Gateway. It is free to use but
  not intended for production-ready applications. For production-ready
  applications, consider using the [Livepeer Studio](https://livepeer.studio/)
  Gateway, which requires an API token. Alternatively, you can set up your own
  Gateway node or partner with one via the `ai-video` channel on
  [Discord](https://discord.gg/livepeer).
</Info>

<Note>
  Please note that the exact parameters, default values, and responses may vary
  between models. For more information on model-specific parameters, please
  refer to the respective model documentation available in the
  [image-to-text pipeline](/ai/pipelines/image-to-text). Not all parameters
  might be available for a given model.
</Note>

ai/orchestrators/models-config.mdx (+14 −14 lines; the changed lines only strip trailing whitespace, so the resulting text is shown)

@@ -74,15 +74,15 @@ currently **recommended** models and their respective prices.

  Optional flags to enhance performance (details below).
</ParamField>
<ParamField path="url" type="string" optional="true">
  Optional URL and port where the model container or custom container manager software is running.
  [See External Containers](#external-containers)
</ParamField>
<ParamField path="token" type="string">
  Optional token required to interact with the model container or custom container manager software.
  [See External Containers](#external-containers)
</ParamField>
<ParamField path="capacity" type="integer">
  Optional capacity of the model. This is the number of inference tasks the model can handle at the same time. This defaults to 1.
  [See External Containers](#external-containers)
</ParamField>

@@ -131,30 +131,30 @@ are available:

<Warning>
  This feature is intended for advanced users. Incorrect setup can lead to a
  lower orchestrator score and reduced fees. If external containers are used,
  it is the Orchestrator's responsibility to ensure the correct container with
  the correct endpoints is running behind the specified `url`.
</Warning>

External containers can range from a single model container stacked on top of the managed model containers, to an auto-scaling GPU cluster behind a load balancer, or anything in between. Orchestrators can use external containers to extend the models served, or to fully replace the AI Worker's managed model containers, which use the [Docker client Go library](https://pkg.go.dev/github.com/docker/docker/client) to start and stop containers specified at startup of the AI Worker.

External containers can be used by specifying the `url`, `capacity`, and `token` fields in the model configuration. The only requirement is that the specified `url` responds to the AI Worker the same way the managed containers would (including HTTP error codes). As long as the container management software acts as a pass-through to the model container, you can use any container management software to implement custom management of the runner containers, including [Kubernetes](https://kubernetes.io/), [Podman](https://podman.io/), [Docker Swarm](https://docs.docker.com/engine/swarm/), [Nomad](https://www.nomadproject.io/), or custom scripts that manage container lifecycles based on request volume.

- The `url` set will be used to confirm a model container is running at startup of the AI Worker using the `/health` endpoint. After startup, inference requests will be forwarded to the `url` the same as they are to the managed containers.
- The `capacity` should be set to the maximum number of requests that can be processed concurrently for the pipeline/model ID (default is 1). If auto-scaling containers, take care that startup time is fast when setting `warm: true`, because slow response times will negatively impact your selection by Gateways for future requests.
- The `token` field is used to secure the model container `url` from unauthorized access; using it is strongly suggested if the containers are exposed to external networks.
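Putting the three fields together, an external-container entry in the Orchestrator's model configuration file (commonly `aiModels.json`) might look like the sketch below. The pipeline, model, address, token, and capacity values are illustrative assumptions; only the field names come from the docs above:

```
[
  {
    "pipeline": "image-to-text",
    "model_id": "Salesforce/blip-image-captioning-large",
    "url": "http://10.0.0.5:9000",
    "token": "my-secret-token",
    "capacity": 4
  }
]
```

Here the AI Worker would probe `http://10.0.0.5:9000/health` at startup and then forward up to 4 concurrent inference requests to that address, sending the token with each request.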

ai/pipelines/image-to-text.mdx (new file, +112 lines)

---
title: Image-to-Text
---

## Overview

The `image-to-text` pipeline converts images into text captions. This pipeline
is powered by the latest models in the HuggingFace
[image-to-text](https://huggingface.co/models?pipeline_tag=image-to-text)
pipeline.

## Models

### Warm Models

The current warm model requested for the `image-to-text` pipeline is:

- [Salesforce/blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large)

<Tip>
  For faster responses with different
  [image-to-text](https://huggingface.co/models?pipeline_tag=image-to-text)
  models, ask Orchestrators to load them on their GPUs via the `ai-video`
  channel in the [Discord Server](https://discord.gg/livepeer).
</Tip>

### On-Demand Models

The following models have been tested and verified for the `image-to-text`
pipeline:

<Note>
  If a specific model you wish to use is not listed, please submit a
  [feature request](https://github.com/livepeer/ai-worker/issues/new?assignees=&labels=enhancement%2Cmodel&projects=&template=model_request.yml)
  on GitHub to get the model verified and added to the list.
</Note>

{/* prettier-ignore */}
<Accordion title="Tested and Verified Models">
- [Salesforce/blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large)
</Accordion>

## Basic Usage Instructions

<Tip>
  For a detailed understanding of the `image-to-text` endpoint and to
  experiment with the API, see the
  [Livepeer AI API Reference](/ai/api-reference/image-to-text).
</Tip>

To create an image caption using the `image-to-text` pipeline, submit a `POST`
request to the Gateway's `image-to-text` API endpoint:

```bash
curl -X POST "https://<GATEWAY_IP>/image-to-text" \
    -F model_id=Salesforce/blip-image-captioning-large \
    -F image=@<PATH_TO_FILE>
```

In this command:

- `<GATEWAY_IP>` should be replaced with your AI Gateway's IP address.
- `model_id` is the image-captioning model to use.
- `image` is the path to the image file to be captioned.

<Note>Maximum request size: 50 MB</Note>

For additional optional parameters, refer to the
[Livepeer AI API Reference](/ai/api-reference/image-to-text).
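The `curl` request above can also be issued from a script. The following Python sketch builds the same `multipart/form-data` request using only the standard library; the gateway host, file handling, and the shape of the JSON response are illustrative assumptions, not part of the documented API:

```python
import json
import uuid
import urllib.request

def encode_multipart(fields: dict, file_field: str,
                     filename: str, file_bytes: bytes):
    """Encode plain form fields plus one file the way `curl -F` does."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (f'--{boundary}\r\n'
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f'{value}\r\n').encode()
        )
    parts.append(
        (f'--{boundary}\r\n'
         f'Content-Disposition: form-data; name="{file_field}"; '
         f'filename="{filename}"\r\n'
         f'Content-Type: application/octet-stream\r\n\r\n').encode()
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

def caption_image(gateway: str, image_path: str,
                  model_id: str = "Salesforce/blip-image-captioning-large"):
    """POST an image to the Gateway's /image-to-text endpoint."""
    with open(image_path, "rb") as f:
        body, content_type = encode_multipart(
            {"model_id": model_id}, "image", image_path, f.read())
    req = urllib.request.Request(
        f"https://{gateway}/image-to-text", data=body,
        headers={"Content-Type": content_type}, method="POST")
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)  # caption payload; exact schema varies by model
```

If the Gateway requires an API token (e.g. Livepeer Studio), an `Authorization: Bearer <token>` header would typically be added to the request as well.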
## Orchestrator Configuration

To configure your Orchestrator to serve the `image-to-text` pipeline, refer to
the [Orchestrator Configuration](/ai/orchestrators/get-started) guide.

### System Requirements

The following system requirements are recommended for optimal performance:

- [NVIDIA GPU](https://developer.nvidia.com/cuda-gpus) with **at least 4GB** of
  VRAM.

## Recommended Pipeline Pricing

<Note>
  We are planning to simplify the pricing in the future so orchestrators can
  set one AI price per compute unit and have the system automatically scale
  based on the model's compute requirements.
</Note>

The pricing for the `image-to-text` pipeline is based on competitor pricing.
However, we strongly encourage orchestrators to set their own pricing based on
their costs and requirements. Setting a competitive price will help attract
more jobs, as Gateways can set their maximum price for a job. The current
recommended pricing for this pipeline is `2.5e-10 USD` per **input pixel**
(`height * width`).
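At the recommended rate, the cost of a request scales linearly with the input resolution. A quick sanity check (the 1024×576 example resolution is ours, not from the docs):

```python
# Recommended price: 2.5e-10 USD per input pixel (height * width).
PRICE_PER_PIXEL_USD = 2.5e-10

def request_cost_usd(width: int, height: int) -> float:
    """Cost of captioning a single image at the recommended rate."""
    return width * height * PRICE_PER_PIXEL_USD

# A 1024x576 input has 589,824 pixels, costing about 0.000147 USD.
cost = request_cost_usd(1024, 576)
```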
## API Reference

<Card
  title="API Reference"
  icon="rectangle-terminal"
  href="/ai/api-reference/image-to-text"
>
  Explore the `image-to-text` endpoint and experiment with the API in the
  Livepeer AI API Reference.
</Card>

ai/pipelines/overview.mdx (+7 lines)

```diff
@@ -89,4 +89,11 @@ pipelines:
   >
     The text-to-speech pipeline generates high-quality, natural sounding speech in the style of a given speaker (gender, pitch, speaking style, etc).
   </Card>
+  <Card
+    title="Image-to-Text"
+    icon="message-dots"
+    href="/ai/pipelines/image-to-text"
+  >
+    The image-to-text pipeline generates captions for input images, with an optional prompt to guide the process.
+  </Card>
 </CardGroup>
```
(new file, +4 lines; file path not captured)

---
title: "Image To Text"
openapi: "POST /api/beta/generate/image-to-text"
---

mint.json (+4 −2 lines)

```diff
@@ -539,7 +539,8 @@
         "ai/pipelines/segment-anything-2",
         "ai/pipelines/text-to-image",
         "ai/pipelines/text-to-speech",
-        "ai/pipelines/upscale"
+        "ai/pipelines/upscale",
+        "ai/pipelines/image-to-text"
       ]
     },
     {
@@ -605,7 +606,8 @@
         "ai/api-reference/image-to-video",
         "ai/api-reference/segment-anything-2",
         "ai/api-reference/upscale",
-        "ai/api-reference/text-to-speech"
+        "ai/api-reference/text-to-speech",
+        "ai/api-reference/image-to-text"
       ]
     }
   ]
```

0 commit comments