docs/concepts/multimodal.md (75 additions & 14 deletions)
@@ -5,7 +5,18 @@ description: Learn how the Image and Audio class in Instructor enables seamless

 # Multimodal

-Instructor supports multimodal interactions by providing helper classes that are automatically converted to the correct format for different providers, allowing you to work with both text and images in your prompts and responses. This functionality is implemented in the `multimodal.py` module and provides a seamless way to handle images alongside text for various AI models.
+> We've provided a few different sample files for you to use to test out these new features. All examples below use these files.
+>
+> - (Image) : An image of some blueberry plants [image.jpg](https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/image.jpg)
+> - (Audio) : A Recording of the Original Gettysburg Address : [gettysburg.wav](https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/gettysburg.wav)
+> - (PDF) : A sample PDF file which contains a fake invoice [invoice.pdf](https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/invoice.pdf)
+> Instructor provides a unified, provider-agnostic interface for working with multimodal inputs like images and PDFs.
+
+Instructor provides a unified, provider-agnostic interface for working with multimodal inputs like images, PDFs, and audio files. With Instructor's multimodal objects, you can easily load media from URLs, local files, or base64 strings using a consistent API that works across different AI providers (OpenAI, Anthropic, Mistral, etc.).
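
To make the idea of a consistent loading API concrete, here is a minimal, standard-library-only sketch. It is not Instructor's implementation: `MediaSource` and its fields are hypothetical names chosen for illustration, and only the constructor names `from_url`, `from_path`, and `from_base64` mirror the ones the Instructor docs mention. The sketch also assumes that `from_url` merely records the URL rather than downloading it.

```python
import base64
import mimetypes
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class MediaSource:
    """Hypothetical container (not Instructor's class): one shape for every input kind."""

    media_type: str
    data: Optional[str] = None  # base64 text, when the content is held inline
    url: Optional[str] = None   # remote location, when the content is only referenced

    @classmethod
    def from_path(cls, path: str) -> "MediaSource":
        # Guess the MIME type from the file extension, then read and encode the file.
        media_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
        encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
        return cls(media_type=media_type, data=encoded)

    @classmethod
    def from_base64(cls, data: str, media_type: str) -> "MediaSource":
        # Validate eagerly so a malformed string fails here, not at request time.
        base64.b64decode(data, validate=True)
        return cls(media_type=media_type, data=data)

    @classmethod
    def from_url(cls, url: str) -> "MediaSource":
        # Assumption of this sketch: keep the URL as a reference, download nothing.
        media_type = mimetypes.guess_type(url)[0] or "application/octet-stream"
        return cls(media_type=media_type, url=url)


# Usage: a local file and its base64 string normalize to the same value.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jpg"
    sample.write_bytes(b"not a real image")  # placeholder bytes, not a valid JPEG
    from_file = MediaSource.from_path(str(sample))

from_string = MediaSource.from_base64(from_file.data, media_type="image/jpeg")
from_remote = MediaSource.from_url(
    "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/image.jpg"
)
```

Because every constructor returns the same type, code further down the pipeline does not need to know where the media came from.
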
+
+Instructor handles all the provider-specific formatting requirements behind the scenes, ensuring your code remains clean and future-proof as provider APIs evolve.
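
As a rough illustration of what "provider-specific formatting" involves, the sketch below turns the same base64 image into two different content parts. The helper names `to_openai_part` and `to_anthropic_part` are hypothetical, and the payload shapes are assumptions based on the providers' public chat and messages APIs; this is not Instructor's internal code.

```python
import base64


def to_openai_part(data: str, media_type: str) -> dict:
    """Assumed OpenAI-style image part: the base64 text wrapped in a data URL."""
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{media_type};base64,{data}"},
    }


def to_anthropic_part(data: str, media_type: str) -> dict:
    """Assumed Anthropic-style image block: base64 text inside a typed source object."""
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }


# The same encoded bytes produce two structurally different payloads.
encoded = base64.b64encode(b"placeholder image bytes").decode("ascii")
openai_part = to_openai_part(encoded, "image/jpeg")
anthropic_part = to_anthropic_part(encoded, "image/jpeg")
```

Hiding this branching behind a single multimodal object is what lets the calling code stay the same when you switch providers.
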
+
+Let's see how to use the Image, Audio and PDF classes.
 {"description":"A tray filled with several blueberry muffins, with one muffin prominently in the foreground. The muffins have a golden-brown top and are surrounded by a beige paper liners. Some muffins are partially visible, and fresh blueberries are scattered around the tray.", "objects": ["muffins", "blueberries", "tray", "paper liners"], "colors": ["golden-brown", "blue", "beige"], "text": null}
 """
+```

 With autodetect_images=True, you can directly provide URLs or file paths

 By leveraging Instructor's multimodal capabilities, you can focus on building your application logic without worrying about the intricacies of each provider's image handling format. This not only saves development time but also makes your code more maintainable and adaptable to future changes in AI provider APIs.

 ### Anthropic Prompt Caching
+

 Instructor supports Anthropic prompt caching with images. To activate prompt caching, you can pass image content as a dictionary of the form
 and set `autodetect_images=True`, or flag it within a constructor such as `instructor.Image.from_path("path/to/image.jpg", cache_control=True)`. For example:
 {"description":"A tray of freshly baked blueberry muffins with golden-brown tops in paper liners.", "objects":["muffins","blueberries","tray","paper liners"], "colors":["golden-brown","blue","beige"], "text":null}
 """

+```
+

 ## `Audio`

 The `Audio` class represents an audio file that can be loaded from a URL or file path. Currently only OpenAI supports audio input. You can create an instance using the `from_path` and `from_url` methods. The `Audio` class will automatically convert the audio to base64-encoded data and include it in the API request.
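
To show what base64-encoding an audio file for a request can look like, here is a small standard-library sketch. `make_silent_wav` and `to_input_audio_part` are hypothetical helpers, and the `input_audio` payload shape is an assumption based on OpenAI's public chat API, not a description of what the `Audio` class does internally. The WAV is generated in memory so the example needs no sample file.

```python
import base64
import io
import wave


def make_silent_wav(seconds: float = 0.1, sample_rate: int = 8000) -> bytes:
    """Build a tiny mono 16-bit WAV in memory so the example needs no file on disk."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


def to_input_audio_part(audio_bytes: bytes, audio_format: str = "wav") -> dict:
    """Assumed OpenAI-style audio part; hypothetical helper, not Instructor's code."""
    return {
        "type": "input_audio",
        "input_audio": {
            "data": base64.b64encode(audio_bytes).decode("ascii"),
            "format": audio_format,
        },
    }


wav_bytes = make_silent_wav()
audio_part = to_input_audio_part(wav_bytes)
```

Decoding `audio_part["input_audio"]["data"]` gives back the original WAV bytes, which is a quick way to check that nothing was lost in the encoding step.
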
docs/integrations/anthropic.md (167 additions & 0 deletions)
@@ -91,6 +91,173 @@ except Exception as e:
     print(f"Unexpected error: {e}")
 ```

+## Multimodal
+
+> We've provided a few different sample files for you to use to test out these new features. All examples below use these files.
+>
+> - (Image) : An image of some blueberry plants [image.jpg](https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/image.jpg)
+> - (PDF) : A sample PDF file which contains a fake invoice [invoice.pdf](https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/invoice.pdf)
+
+Instructor provides a unified, provider-agnostic interface for working with multimodal inputs like images, PDFs, and audio files. With Instructor's multimodal objects, you can easily load media from URLs, local files, or base64 strings using a consistent API that works across different AI providers (OpenAI, Anthropic, Mistral, etc.).
+
+Instructor handles all the provider-specific formatting requirements behind the scenes, ensuring your code remains clean and future-proof as provider APIs evolve.
+
+Let's see how to use the Image and PDF classes.
+
+### Image
+
+> For a more in-depth walkthrough of the Image component, check out the [docs here](../concepts/multimodal.md)
+
+Instructor makes it easy to analyse and extract semantic information from images using Anthropic's Claude models. [Click here](https://docs.anthropic.com/en/docs/about-claude/models/all-models) to check if the model you'd like to use has vision capabilities.
+
+Let's see an example below with the sample image above where we'll load it in using our `from_url` method.
+
+Note that we support local files and base64 strings too with the `from_path` and the `from_base64` class methods.
+
+```python
+from instructor.multimodal import Image
+from pydantic import BaseModel, Field
+import instructor
+from anthropic import Anthropic
+
+
+class ImageDescription(BaseModel):
+    objects: list[str] = Field(..., description="The objects in the image")
+    scene: str = Field(..., description="The scene of the image")
+    colors: list[str] = Field(..., description="The colors in the image")
+If you'd like to cache the PDF and use it across multiple different requests, we support that with the `PdfWithCacheControl` class which we can see below.
+
+```python
+from instructor.multimodal import PdfWithCacheControl