Add Image to Text docs

mjh1 · mjh1 · commit b45add56348a · 2024-11-06T11:40:43.000Z
diff --git a/ai/api-reference/image-to-text.mdx b/ai/api-reference/image-to-text.mdx
@@ -0,0 +1,21 @@
+---
+openapi: post /image-to-text
+---
+
+<Info>
+  The default Gateway used in this guide is the public
+  [Livepeer.cloud](https://www.livepeer.cloud/) Gateway. It is free to use but
+  not intended for production-ready applications. For production-ready
+  applications, consider using the [Livepeer Studio](https://livepeer.studio/)
+  Gateway, which requires an API token. Alternatively, you can set up your own
+  Gateway node or partner with one via the `ai-video` channel on
+  [Discord](https://discord.gg/livepeer).
+</Info>
+
+<Note>
+  Please note that the exact parameters, default values, and responses may vary
+  between models. For more information on model-specific parameters, please
+  refer to the respective model documentation available in the [image-to-text
+  pipeline](/ai/pipelines/image-to-text). Not all parameters might be available
+  for a given model.
+</Note>
diff --git a/ai/orchestrators/models-config.mdx b/ai/orchestrators/models-config.mdx
@@ -56,6 +56,11 @@ currently **recommended** models and their respective prices.
     "price_per_unit": 11,
     "pixels_per_unit": 1e2,
     "currency": "USD",
+  },
+  {
+    "pipeline": "image-to-text",
+    "model_id": "Salesforce/blip-image-captioning-large",
+    "price_per_unit": 4768371
   }
 ]
 ```
diff --git a/ai/pipelines/image-to-text.mdx b/ai/pipelines/image-to-text.mdx
@@ -0,0 +1,95 @@
+---
+title: Image-to-Text
+---
+
+## Overview
+
+The `image-to-text` pipeline converts images into text captions. This pipeline is powered by the latest models in the Hugging Face [image-to-text](https://huggingface.co/models?pipeline_tag=image-to-text) pipeline.
+
+<div align="center">
+
+</div>
+
+## Models
+
+### Warm Models
+
+The current warm model requested for the `image-to-text` pipeline is:
+
+- [Salesforce/blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large)
+
+<Tip>
+    For faster responses with other
+    [image-to-text](https://huggingface.co/models?pipeline_tag=image-to-text)
+    models, ask Orchestrators to load them on their GPUs via the `ai-video`
+    channel on the Livepeer [Discord server](https://discord.gg/livepeer).
+</Tip>
+
+### On-Demand Models
+
+The following models have been tested and verified for the `image-to-text`
+pipeline:
+
+<Note>
+    If a specific model you wish to use is not listed, please submit a [feature
+    request](https://github.com/livepeer/ai-worker/issues/new?assignees=&labels=enhancement%2Cmodel&projects=&template=model_request.yml)
+    on GitHub to get the model verified and added to the list.
+</Note>
+
+{/* prettier-ignore */}
+<Accordion title="Tested and Verified Models">
+    - [Salesforce/blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large)
+</Accordion>
+
+## Basic Usage Instructions
+
+<Tip>
+    For a detailed understanding of the `image-to-text` endpoint and to experiment
+    with the API, see the [Livepeer AI API
+    Reference](/ai/api-reference/image-to-text).
+</Tip>
+
+To create an image caption using the `image-to-text` pipeline, submit a
+`POST` request to the Gateway's `image-to-text` API endpoint:
+
+```bash
+curl -X POST "https://<GATEWAY_IP>/image-to-text" \
+    -F model_id=Salesforce/blip-image-captioning-large \
+    -F image=@<PATH_TO_FILE>
+```
+
+In this command:
+
+- `<GATEWAY_IP>` should be replaced with your AI Gateway's IP address.
+- `model_id` is the image-to-text model to use.
+- `image` is the path to the image file to be captioned.
+
+<Note>
+    Maximum request size: 50 MB
+</Note>
+
+For additional optional parameters, refer to the
+[Livepeer AI API Reference](/ai/api-reference/image-to-text).
+
+## Orchestrator Configuration
+
+To configure your Orchestrator to serve the `image-to-text` pipeline, refer to
+the [Orchestrator Configuration](/ai/orchestrators/get-started) guide.
+
+### System Requirements
+
+The following system requirements are recommended for optimal performance:
+
+- [NVIDIA GPU](https://developer.nvidia.com/cuda-gpus) with **at least 12GB** of
+VRAM.
+
+## API Reference
+
+<Card
+    title="API Reference"
+    icon="rectangle-terminal"
+    href="/ai/api-reference/image-to-text"
+>
+    Explore the `image-to-text` endpoint and experiment with the API in the
+    Livepeer AI API Reference.
+</Card>
diff --git a/ai/pipelines/overview.mdx b/ai/pipelines/overview.mdx
@@ -89,4 +89,11 @@ pipelines:
   >
     The text-to-speech pipeline generates high-quality, natural sounding speech in the style of a given speaker (gender, pitch, speaking style, etc).
   </Card>
+  <Card
+    title="Image-to-Text"
+    icon="message-dots"
+    href="/ai/pipelines/image-to-text"
+  >
+    The image-to-text pipeline generates captions for input images, with an optional prompt to guide the process.
+  </Card>
 </CardGroup>
diff --git a/api-reference/generate/image-to-text.mdx b/api-reference/generate/image-to-text.mdx
@@ -0,0 +1,4 @@
+---
+title: "Image To Text"
+openapi: "POST /api/beta/generate/image-to-text"
+---
diff --git a/mint.json b/mint.json
@@ -539,7 +539,8 @@
             "ai/pipelines/segment-anything-2",
             "ai/pipelines/text-to-image",
             "ai/pipelines/text-to-speech",
-            "ai/pipelines/upscale"
+            "ai/pipelines/upscale",
+            "ai/pipelines/image-to-text"
           ]
         },
         {
@@ -604,7 +605,8 @@
             "ai/api-reference/image-to-video",
             "ai/api-reference/segment-anything-2",
             "ai/api-reference/upscale",
-            "ai/api-reference/text-to-speech"
+            "ai/api-reference/text-to-speech",
+            "ai/api-reference/image-to-text"
           ]
         }
       ]
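
For reference, the `curl` request added in `ai/pipelines/image-to-text.mdx` could be issued from Python roughly as follows. This is a minimal sketch using only the standard library and is not part of the commit: the function name `build_image_to_text_request`, the `https://gateway.example` URL, and the reading of the 50 MB limit as 50 × 1024 × 1024 bytes are assumptions. It builds the same multipart form (`model_id` and `image` fields) and leaves sending to the caller.

```python
import urllib.request
import uuid
from pathlib import Path

# Assumption: the documented "50 MB" limit is read as 50 MiB; adjust if needed.
MAX_REQUEST_BYTES = 50 * 1024 * 1024
DEFAULT_MODEL = "Salesforce/blip-image-captioning-large"


def build_image_to_text_request(gateway_url, image_path, model_id=DEFAULT_MODEL):
    """Build (but do not send) a multipart POST to <gateway_url>/image-to-text."""
    image_path = Path(image_path)
    image_bytes = image_path.read_bytes()
    boundary = uuid.uuid4().hex

    # Form field: model_id
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model_id"\r\n\r\n'
        f"{model_id}\r\n"
    ).encode()

    # Form field: image (file upload)
    image_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="image"; filename="{image_path.name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()

    closing = f"\r\n--{boundary}--\r\n".encode()
    body = model_part + image_header + image_bytes + closing

    # Fail early instead of letting the Gateway reject an oversized request.
    if len(body) > MAX_REQUEST_BYTES:
        raise ValueError(f"request body is {len(body)} bytes, above the limit")

    return urllib.request.Request(
        url=f"{gateway_url.rstrip('/')}/image-to-text",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Usage (hypothetical Gateway URL; uncomment to actually send the request):
# req = build_image_to_text_request("https://gateway.example", "cat.png")
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```

Separating request construction from sending keeps the size check and form layout easy to inspect before any network call is made.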
