
Commit 026c8e3

mjh1 authored and rickstaa committed
docs(ai): add Image to Text pipeline docs
This commit adds the image-to-text pipeline docs and updates the API reference.
1 parent bb2dd42 commit 026c8e3

File tree

6 files changed: +145 -16 lines changed


ai/api-reference/image-to-text.mdx

+21
@@ -0,0 +1,21 @@
+---
+openapi: post /image-to-text
+---
+
+<Info>
+The default Gateway used in this guide is the public
+[Livepeer.cloud](https://www.livepeer.cloud/) Gateway. It is free to use but
+not intended for production-ready applications. For production-ready
+applications, consider using the [Livepeer Studio](https://livepeer.studio/)
+Gateway, which requires an API token. Alternatively, you can set up your own
+Gateway node or partner with one via the `ai-video` channel on
+[Discord](https://discord.gg/livepeer).
+</Info>
+
+<Note>
+Please note that the exact parameters, default values, and responses may vary
+between models. For more information on model-specific parameters, please
+refer to the respective model documentation available in the [image-to-text
+pipeline](/ai/pipelines/image-to-text). Not all parameters might be available
+for a given model.
+</Note>
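
For reference, the request below sketches how this endpoint could be called through the Livepeer Studio Gateway mentioned in the `<Info>` block. The `/api/beta/generate/image-to-text` path is taken from the other API reference page added in this commit; combining it with the `livepeer.studio` host and using a `Bearer` authorization header are assumptions to confirm against the Studio documentation.

```bash
# Hedged sketch: image-to-text via the Livepeer Studio Gateway (requires an API token).
# The exact host and auth-header scheme are assumptions; confirm them in the Studio docs.
curl -X POST "https://livepeer.studio/api/beta/generate/image-to-text" \
  -H "Authorization: Bearer $LIVEPEER_STUDIO_API_TOKEN" \
  -F model_id=Salesforce/blip-image-captioning-large \
  -F image=@example.png
```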

ai/orchestrators/models-config.mdx

+14 -14 (the removed and added lines read identically here, so the changes appear to be whitespace-only; each hunk is therefore shown once, without +/- markers)

@@ -74,15 +74,15 @@ currently **recommended** models and their respective prices.
Optional flags to enhance performance (details below).
</ParamField>
<ParamField path="url" type="string" optional="true">
Optional URL and port where the model container or custom container manager software is running.
[See External Containers](#external-containers)
</ParamField>
<ParamField path="token" type="string">
Optional token required to interact with the model container or custom container manager software.
[See External Containers](#external-containers)
</ParamField>
<ParamField path="capacity" type="integer">
Optional capacity of the model. This is the number of inference tasks the model can handle at the same time. This defaults to 1.
[See External Containers](#external-containers)
</ParamField>

@@ -131,30 +131,30 @@ are available:

<Warning>
This feature is intended for advanced users. Incorrect setup can lead to a
lower orchestrator score and reduced fees. If external containers are used,
it is the Orchestrator's responsibility to ensure the correct container with
the correct endpoints is running behind the specified `url`.
</Warning>

External containers can range from a single model container stacked on top of the managed model containers
to an auto-scaling GPU cluster behind a load balancer, or anything in between. Orchestrators
can use external containers to extend the models served, or to fully replace the AI Worker managed model containers
(which are started and stopped with the [Docker client Go library](https://pkg.go.dev/github.com/docker/docker/client)
at AI Worker startup).

External containers can be used by specifying the `url`, `capacity` and `token` fields in the
model configuration. The only requirement is that the specified `url` responds to the AI Worker in the same way
the managed containers would (including HTTP error codes). As long as the container management software
acts as a pass-through to the model container, you can use any container management software to implement custom
management of the runner containers, including [Kubernetes](https://kubernetes.io/), [Podman](https://podman.io/),
[Docker Swarm](https://docs.docker.com/engine/swarm/), [Nomad](https://www.nomadproject.io/), or custom scripts to
manage container lifecycles based on request volume.


- The `url` is used at AI Worker startup to confirm a model container is running, via the `/health` endpoint.
  After startup, inference requests are forwarded to the `url` just as they are to the managed containers.
- The `capacity` should be set to the maximum number of requests that can be processed concurrently for the pipeline/model ID (default is 1).
  If you auto-scale containers, make sure startup time is fast when setting `warm: true`, because slow response times will
  negatively impact your selection by Gateways for future requests.
- The `token` field is used to secure the model container `url` from unauthorized access and is strongly
  recommended if the containers are exposed to external networks.
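
To make the external-container contract above concrete, here is a hedged sketch of a pre-flight check an Orchestrator might run before pointing the AI Worker at an external `url`. The `/health` endpoint and the role of the `token` come from the text above; the example address and the use of an `Authorization` header to carry the token are assumptions about one particular setup.

```bash
# Hedged sketch: verify an external model container before AI Worker startup.
# EXTERNAL_URL and the Authorization-header scheme are assumptions; adapt to your setup.
EXTERNAL_URL="http://10.0.0.5:9000"        # hypothetical container manager address (the `url` field)
EXTERNAL_TOKEN="replace-with-your-token"   # value of the `token` field

# The AI Worker hits /health at startup to confirm the container is running;
# this reproduces that check manually.
if curl -fsS -H "Authorization: Bearer ${EXTERNAL_TOKEN}" "${EXTERNAL_URL}/health" > /dev/null; then
  echo "External container is reachable and healthy."
else
  echo "External container did not answer /health; the AI Worker's startup check would fail." >&2
fi
```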

ai/pipelines/image-to-text.mdx

+95

@@ -0,0 +1,95 @@
+---
+title: Image-to-Text
+---
+
+## Overview
+
+The `image-to-text` pipeline converts images into text captions. This pipeline is powered by the latest models in the HuggingFace [image-to-text](https://huggingface.co/models?pipeline_tag=image-to-text) pipeline.
+
+<div align="center">
+
+</div>
+
+## Models
+
+### Warm Models
+
+The model currently kept warm for the `image-to-text` pipeline is:
+
+- [Salesforce/blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large)
+
+<Tip>
+For faster responses with other
+[image-to-text](https://huggingface.co/models?pipeline_tag=image-to-text)
+models, ask Orchestrators to load them on their GPUs via the `ai-video`
+channel on [Discord](https://discord.gg/livepeer).
+</Tip>
+
+### On-Demand Models
+
+The following models have been tested and verified for the `image-to-text`
+pipeline:
+
+<Note>
+If a specific model you wish to use is not listed, please submit a [feature
+request](https://github.com/livepeer/ai-worker/issues/new?assignees=&labels=enhancement%2Cmodel&projects=&template=model_request.yml)
+on GitHub to get the model verified and added to the list.
+</Note>
+
+{/* prettier-ignore */}
+<Accordion title="Tested and Verified Models">
+- [Salesforce/blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large)
+</Accordion>
+
+## Basic Usage Instructions
+
+<Tip>
+For a detailed understanding of the `image-to-text` endpoint and to experiment
+with the API, see the [Livepeer AI API
+Reference](/ai/api-reference/image-to-text).
+</Tip>
+
+To create an image caption using the `image-to-text` pipeline, submit a
+`POST` request to the Gateway's `image-to-text` API endpoint:
+
+```bash
+curl -X POST "https://<GATEWAY_IP>/image-to-text" \
+  -F model_id=Salesforce/blip-image-captioning-large \
+  -F image=@<PATH_TO_FILE>
+```
+
+In this command:
+
+- `<GATEWAY_IP>` should be replaced with your AI Gateway's IP address.
+- `model_id` is the image-to-text model to use.
+- `image` is the path to the image file to be captioned.
+
+<Note>
+Maximum request size: 50 MB
+</Note>
+
+For additional optional parameters, refer to the
+[Livepeer AI API Reference](/ai/api-reference/image-to-text).
+
+## Orchestrator Configuration
+
+To configure your Orchestrator to serve the `image-to-text` pipeline, refer to
+the [Orchestrator Configuration](/ai/orchestrators/get-started) guide.
+
+### System Requirements
+
+The following system requirements are recommended for optimal performance:
+
+- [NVIDIA GPU](https://developer.nvidia.com/cuda-gpus) with **at least 12GB** of
+  VRAM.
+
+## API Reference
+
+<Card
+  title="API Reference"
+  icon="rectangle-terminal"
+  href="/ai/api-reference/image-to-text"
+>
+  Explore the `image-to-text` endpoint and experiment with the API in the
+  Livepeer AI API Reference.
+</Card>
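
As a usage note on the guide above: the Gateway presumably returns the caption as JSON, so response handling could look like the sketch below. The optional `prompt` form field (mentioned in the pipeline overview card added in this commit) and the `text` response key are assumptions to confirm against the API reference before relying on them.

```bash
# Hedged sketch: request a caption and print it, assuming the response carries a "text" field.
# "prompt" is the optional guiding prompt mentioned in the pipeline overview; verify the exact
# parameter and response field names in the API reference.
curl -s -X POST "https://<GATEWAY_IP>/image-to-text" \
  -F model_id=Salesforce/blip-image-captioning-large \
  -F prompt="a photo of" \
  -F image=@<PATH_TO_FILE> \
  | jq -r '.text'
```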

ai/pipelines/overview.mdx

+7

@@ -89,4 +89,11 @@ pipelines:
>
The text-to-speech pipeline generates high-quality, natural sounding speech in the style of a given speaker (gender, pitch, speaking style, etc).
</Card>
+<Card
+  title="Image-to-Text"
+  icon="message-dots"
+  href="/ai/pipelines/image-to-text"
+>
+  The image-to-text pipeline generates captions for input images, with an optional prompt to guide the process.
+</Card>
</CardGroup>
+4

@@ -0,0 +1,4 @@
+---
+title: "Image To Text"
+openapi: "POST /api/beta/generate/image-to-text"
+---

mint.json

+4 -2

@@ -539,7 +539,8 @@
"ai/pipelines/segment-anything-2",
"ai/pipelines/text-to-image",
"ai/pipelines/text-to-speech",
-"ai/pipelines/upscale"
+"ai/pipelines/upscale",
+"ai/pipelines/image-to-text"
]
},
{

@@ -605,7 +606,8 @@
"ai/api-reference/image-to-video",
"ai/api-reference/segment-anything-2",
"ai/api-reference/upscale",
-"ai/api-reference/text-to-speech"
+"ai/api-reference/text-to-speech",
+"ai/api-reference/image-to-text"
]
}
]
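
A quick local sanity check one might run on the navigation change above, assuming `jq` is installed and `mint.json` is in the working directory:

```bash
# Should print the two entries added above:
#   ai/pipelines/image-to-text
#   ai/api-reference/image-to-text
jq -r '.. | strings | select(endswith("image-to-text"))' mint.json
```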
