Some LLMs are now capable of understanding audio, video, image and document content.
!!! info
    Some models do not support image input. Please check the model's documentation to confirm whether it supports image input.
If you have a direct URL for the image, you can use [`ImageUrl`][pydantic_ai.ImageUrl]:

```python
from pydantic_ai import Agent, ImageUrl

agent = Agent(model='openai:gpt-4o')
result = agent.run_sync(
    [
        'What company is this logo from?',
        ImageUrl(url='https://iili.io/3Hs4FMg.png'),
    ]
)
print(result.output)
#> This is the logo for Pydantic, a data validation and settings management library in Python.
```
If you have the image locally, you can also use [`BinaryContent`][pydantic_ai.BinaryContent]:

```python
import httpx

from pydantic_ai import Agent, BinaryContent

image_response = httpx.get('https://iili.io/3Hs4FMg.png')  # Pydantic logo

agent = Agent(model='openai:gpt-4o')
result = agent.run_sync(
    [
        'What company is this logo from?',
        BinaryContent(data=image_response.content, media_type='image/png'),  # (1)!
    ]
)
print(result.output)
#> This is the logo for Pydantic, a data validation and settings management library in Python.
```

1. To ensure the example is runnable we download this image from the web, but you can also use `Path.read_bytes()` to read a local file's contents.
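Unlike `ImageUrl`, `BinaryContent` has no response header to infer a media type from, so you must supply one yourself. When handling arbitrary local files, the standard-library `mimetypes` module can guess it from the file extension — a minimal sketch, where `logo.png` is a hypothetical filename:

```python
import mimetypes

# Guess the media type from the file extension; fall back to a generic
# binary type when the extension is unrecognized.
media_type, _encoding = mimetypes.guess_type('logo.png')
print(media_type or 'application/octet-stream')
#> image/png
```

The guess is based purely on the extension, not the file's contents, so it is only as reliable as your file names.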
!!! info
    Some models do not support audio input. Please check the model's documentation to confirm whether it supports audio input.

You can provide audio input using either [`AudioUrl`][pydantic_ai.AudioUrl] or [`BinaryContent`][pydantic_ai.BinaryContent]. The process is analogous to the examples above.
!!! info
    Some models do not support video input. Please check the model's documentation to confirm whether it supports video input.

You can provide video input using either [`VideoUrl`][pydantic_ai.VideoUrl] or [`BinaryContent`][pydantic_ai.BinaryContent]. The process is analogous to the examples above.
!!! info
    Some models do not support document input. Please check the model's documentation to confirm whether it supports document input.

!!! warning
    When using Gemini models, the document content will always be sent as binary data, regardless of whether you use [`DocumentUrl`][pydantic_ai.DocumentUrl] or [`BinaryContent`][pydantic_ai.BinaryContent]. This is due to differences in how Vertex AI and Google AI handle document inputs.

    For more details, see [this discussion](https://discuss.ai.google.dev/t/i-am-using-google-generative-ai-model-gemini-1-5-pro-for-image-analysis-but-getting-error/34866/4).

    If you are unsatisfied with this behavior, please let us know by opening an issue on [GitHub](https://github.com/pydantic/pydantic-ai/issues).

You can provide document input using either [`DocumentUrl`][pydantic_ai.DocumentUrl] or [`BinaryContent`][pydantic_ai.BinaryContent]. The process is similar to the examples above.
If you have a direct URL for the document, you can use [`DocumentUrl`][pydantic_ai.DocumentUrl]:

```python
from pydantic_ai import Agent, DocumentUrl

agent = Agent(model='anthropic:claude-3-sonnet')
result = agent.run_sync(
    [
        'What is the main content of this document?',
        DocumentUrl(url='https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf'),
    ]
)
print(result.output)
#> This document is the technical report introducing Gemini 1.5, Google's latest large language model...
```
The supported document formats vary by model.
You can also use [`BinaryContent`][pydantic_ai.BinaryContent] to pass document data directly:

```python
from pathlib import Path

from pydantic_ai import Agent, BinaryContent

pdf_path = Path('document.pdf')
agent = Agent(model='anthropic:claude-3-sonnet')
result = agent.run_sync(
    [
        'What is the main content of this document?',
        BinaryContent(data=pdf_path.read_bytes(), media_type='application/pdf'),
    ]
)
print(result.output)
#> The document discusses...
```