Commit ba4c225

ivanleomk, jxnl, and ellipsis-dev[bot] authored

feat: add mistral PDF support (#1459)

Co-authored-by: Jason Liu <jxnl@users.noreply.github.com>
Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

1 parent be7821e commit ba4c225

File tree: 11 files changed, +1169 -101 lines


docs/concepts/multimodal.md

Lines changed: 75 additions & 14 deletions
@@ -5,7 +5,18 @@ description: Learn how the Image and Audio class in Instructor enables seamless
# Multimodal

> We've provided a few different sample files for you to use to test out these new features. All examples below use these files.
>
> - (Image): An image of some blueberry plants: [image.jpg](https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/image.jpg)
> - (Audio): A recording of the original Gettysburg Address: [gettysburg.wav](https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/gettysburg.wav)
> - (PDF): A sample PDF file containing a fake invoice: [invoice.pdf](https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/invoice.pdf)

Instructor provides a unified, provider-agnostic interface for working with multimodal inputs like images, PDFs, and audio files. With Instructor's multimodal objects, you can easily load media from URLs, local files, or base64 strings using a consistent API that works across different AI providers (OpenAI, Anthropic, Mistral, etc.).

Instructor handles all the provider-specific formatting requirements behind the scenes, ensuring your code remains clean and future-proof as provider APIs evolve.

Let's see how to use the `Image`, `Audio` and `PDF` classes.
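
`Image` and `PDF` expose `from_url`, `from_path`, `from_base64` and `autodetect` constructors, while `Audio` supports `from_url` and `from_path`. A minimal sketch of that shared loading API (the local path below is a hypothetical placeholder, not a file shipped with this commit):

```python
# A minimal sketch of the shared loading API; "downloads/invoice.pdf" is a
# hypothetical local copy of the sample invoice.
from instructor.multimodal import Audio, Image, PDF

image = Image.from_url(
    "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/image.jpg"
)
audio = Audio.from_url(
    "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/gettysburg.wav"
)
pdf = PDF.from_path("downloads/invoice.pdf")

# Image and PDF can also detect the source type (URL, path, or base64) for you:
detected = Image.autodetect(
    "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/image.jpg"
)
```

Each object is then passed directly inside a message's `content` list, as the examples below show.
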
## `Image`

@@ -72,22 +83,23 @@ print(response.model_dump_json())
"""
{"description":"A tray filled with several blueberry muffins, with one muffin prominently in the foreground. The muffins have a golden-brown top and are surrounded by beige paper liners. Some muffins are partially visible, and fresh blueberries are scattered around the tray.", "objects": ["muffins", "blueberries", "tray", "paper liners"], "colors": ["golden-brown", "blue", "beige"], "text": null}
"""
```

With `autodetect_images=True`, you can directly provide URLs or file paths:

```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=ImageAnalyzer,
    messages=[
        {
            "role": "user",
            "content": [
                "What is in these two images?",
                "https://static01.nyt.com/images/2017/04/14/dining/14COOKING-RITZ-MUFFINS/14COOKING-RITZ-MUFFINS-jumbo.jpg",
                "muffin.jpg",  # Using the file we downloaded in the previous example
            ],
        }
    ],
    autodetect_images=True,
)

print(response.model_dump_json())
```

@@ -98,10 +110,13 @@ print(response.model_dump_json())
By leveraging Instructor's multimodal capabilities, you can focus on building your application logic without worrying about the intricacies of each provider's image handling format. This not only saves development time but also makes your code more maintainable and adaptable to future changes in AI provider APIs.

### Anthropic Prompt Caching

Instructor supports Anthropic prompt caching with images. To activate prompt caching, you can pass image content as a dictionary of the form

```python
{"type": "image", "source": <path_or_url_or_base64_encoding>, "cache_control": True}
```

and set `autodetect_images=True`, or flag it within a constructor such as `instructor.Image.from_path("path/to/image.jpg", cache_control=True)`. For example:

```python
@@ -165,6 +180,8 @@ print(response.model_dump_json())
{"description":"A tray of freshly baked blueberry muffins with golden-brown tops in paper liners.", "objects":["muffins","blueberries","tray","paper liners"], "colors":["golden-brown","blue","beige"], "text":null}
"""
```

## `Audio`

The `Audio` class represents an audio file that can be loaded from a URL or file path. It provides methods to create `Audio` instances, though currently only OpenAI supports audio inputs. You can create an instance using the `from_path` and `from_url` methods, and the `Audio` class will automatically convert the file to a base64-encoded string and include it in the API request.

@@ -221,3 +238,47 @@ resp = client.chat.completions.create(
print(resp)
#> name='Jason' age=20
```

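The docs' full `Audio` example is elided between these hunks; a minimal sketch consistent with the output shown above might look like the following (the `gpt-4o-audio-preview` model, the `modalities` parameter, and the `output.wav` file are assumptions, not code from this diff):

```python
# A sketch, not this commit's exact example: extract structured data
# from a spoken recording using an audio-capable OpenAI chat model.
from openai import OpenAI
from pydantic import BaseModel
import instructor
from instructor.multimodal import Audio

client = instructor.from_openai(OpenAI())


class Person(BaseModel):
    name: str
    age: int


resp = client.chat.completions.create(
    model="gpt-4o-audio-preview",  # assumption: an audio-capable OpenAI model
    response_model=Person,
    modalities=["text"],  # assumption: request a text-only response
    messages=[
        {
            "role": "user",
            "content": [
                "Extract the speaker's name and age",
                Audio.from_path("output.wav"),  # hypothetical local recording
            ],
        }
    ],
)
print(resp)
```
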
## `PDF`

The `PDF` class represents a PDF file that can be loaded from a URL or file path.

It provides methods to create `PDF` instances and is currently supported for the OpenAI, Mistral, GenAI and Anthropic client integrations.

### Usage

```python
from openai import OpenAI
import instructor
from pydantic import BaseModel
from instructor.multimodal import PDF

# Set up the client
url = "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/invoice.pdf"
client = instructor.from_openai(OpenAI())


# Create a model for analyzing PDFs
class Invoice(BaseModel):
    total: float
    items: list[str]


# Load and analyze a PDF
response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Invoice,
    messages=[
        {
            "role": "user",
            "content": [
                "Analyze this document",
                PDF.from_url(url),
            ],
        }
    ],
)

print(response)
# > Total = 220, items = ['English Tea', 'Tofu']
```
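
This commit's headline feature is Mistral PDF support; a minimal sketch of the same invoice extraction against Mistral follows (the `from_mistral` setup and the `mistral-small-latest` model choice are assumptions, not code from this diff):

```python
# A sketch, assuming a document-capable Mistral model; not an example
# shipped in this commit.
import instructor
from instructor.multimodal import PDF
from mistralai import Mistral
from pydantic import BaseModel


class Invoice(BaseModel):
    total: float
    items: list[str]


url = "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/invoice.pdf"
client = instructor.from_mistral(Mistral(api_key="..."))  # key placeholder

response = client.chat.completions.create(
    model="mistral-small-latest",  # assumption: a Mistral model with PDF support
    response_model=Invoice,
    messages=[
        {
            "role": "user",
            "content": [
                "Analyze this document",
                PDF.from_url(url),
            ],
        }
    ],
)
print(response)
```

Note that the `PDF` object itself is unchanged from the OpenAI example; only the client construction differs, which is the point of the provider-agnostic interface.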

docs/integrations/anthropic.md

Lines changed: 167 additions & 0 deletions
@@ -91,6 +91,173 @@ except Exception as e:
    print(f"Unexpected error: {e}")
```

## Multimodal

> We've provided a few different sample files for you to use to test out these new features. All examples below use these files.
>
> - (Image): An image of some blueberry plants: [image.jpg](https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/image.jpg)
> - (PDF): A sample PDF file containing a fake invoice: [invoice.pdf](https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/invoice.pdf)

Instructor provides a unified, provider-agnostic interface for working with multimodal inputs like images, PDFs, and audio files. With Instructor's multimodal objects, you can easily load media from URLs, local files, or base64 strings using a consistent API that works across different AI providers (OpenAI, Anthropic, Mistral, etc.).

Instructor handles all the provider-specific formatting requirements behind the scenes, ensuring your code remains clean and future-proof as provider APIs evolve.

Let's see how to use the `Image` and `PDF` classes.

### Image

> For a more in-depth walkthrough of the Image component, check out the [docs here](../concepts/multimodal.md)

Instructor makes it easy to analyse and extract semantic information from images using Anthropic's Claude models. [Click here](https://docs.anthropic.com/en/docs/about-claude/models/all-models) to check if the model you'd like to use has vision capabilities.

Let's see an example below with the sample image above, which we'll load in using our `from_url` method.

Note that we also support local files and base64 strings with the `from_path` and `from_base64` class methods.

```python
from instructor.multimodal import Image
from pydantic import BaseModel, Field
import instructor
from anthropic import Anthropic


class ImageDescription(BaseModel):
    objects: list[str] = Field(..., description="The objects in the image")
    scene: str = Field(..., description="The scene of the image")
    colors: list[str] = Field(..., description="The colors in the image")


client = instructor.from_anthropic(Anthropic())
url = "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/image.jpg"
# Multiple ways to load an image:
response = client.chat.completions.create(
    model="claude-3-5-sonnet-20240620",
    response_model=ImageDescription,
    max_tokens=1000,
    messages=[
        {
            "role": "user",
            "content": [
                "What is in this image?",
                # Option 1: Direct URL with autodetection
                Image.from_url(url),
                # Option 2: Local file
                # Image.from_path("path/to/local/image.jpg")
                # Option 3: Base64 string
                # Image.from_base64("base64_encoded_string_here")
                # Option 4: Autodetect
                # Image.autodetect(<url|path|base64>)
            ],
        },
    ],
)

print(response)
# Example output:
# ImageDescription(
#     objects=['blueberries', 'leaves'],
#     scene='A blueberry bush with clusters of ripe blueberries and some unripe ones against a cloudy sky',
#     colors=['green', 'blue', 'purple', 'white']
# )
```

### PDF

Instructor makes it easy to analyse and extract semantic information from PDFs using Anthropic's Claude line of models.

Let's see an example below with the sample PDF above, which we'll load in using our `from_url` method.

Note that we also support local files and base64 strings with the `from_path` and `from_base64` class methods.

```python
from instructor.multimodal import PDF
from pydantic import BaseModel
import instructor
from anthropic import Anthropic


class Receipt(BaseModel):
    total: int
    items: list[str]


client = instructor.from_anthropic(Anthropic())
url = "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/invoice.pdf"
# Multiple ways to load a PDF:
response = client.chat.completions.create(
    model="claude-3-5-sonnet-20240620",
    response_model=Receipt,
    max_tokens=1000,
    messages=[
        {
            "role": "user",
            "content": [
                "Extract out the total and line items from the invoice",
                # Option 1: Direct URL
                PDF.from_url(url),
                # Option 2: Local file
                # PDF.from_path("path/to/local/invoice.pdf"),
                # Option 3: Base64 string
                # PDF.from_base64("base64_encoded_string_here")
                # Option 4: Autodetect
                # PDF.autodetect(<url|path|base64>)
            ],
        },
    ],
)

print(response)
# > Receipt(total=220, items=['English Tea', 'Tofu'])
```

If you'd like to cache the PDF and use it across multiple different requests, we support that with the `PdfWithCacheControl` class, shown below.

```python
from instructor.multimodal import PdfWithCacheControl
from pydantic import BaseModel
import instructor
from anthropic import Anthropic


class Receipt(BaseModel):
    total: int
    items: list[str]


client = instructor.from_anthropic(Anthropic())
url = "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/invoice.pdf"
# Multiple ways to load a PDF:
response, completion = client.chat.completions.create_with_completion(
    model="claude-3-5-sonnet-20240620",
    response_model=Receipt,
    max_tokens=1000,
    messages=[
        {
            "role": "user",
            "content": [
                "Extract out the total and line items from the invoice",
                # Option 1: Direct URL
                PdfWithCacheControl.from_url(url),
                # Option 2: Local file
                # PDF.from_path("path/to/local/invoice.pdf"),
                # Option 3: Base64 string
                # PDF.from_base64("base64_encoded_string_here")
                # Option 4: Autodetect
                # PDF.autodetect(<url|path|base64>)
            ],
        },
    ],
)

assert (
    completion.usage.cache_creation_input_tokens > 0
    or completion.usage.cache_read_input_tokens > 0
)
print(response)
# > Receipt(total=220, items=['English Tea', 'Tofu'])
```

## Streaming Support

Instructor has two main ways that you can use to stream responses out
