Description
Checked other resources
- I added a very descriptive title to this issue.
- I searched the LangChain documentation with the integrated search.
- I used the GitHub search to find a similar question and didn't find it.
- I am sure that this is a bug in LangChain rather than my code.
- The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
Example Code
from dotenv import load_dotenv
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.document_loaders.parsers.images import LLMImageBlobParser
from langchain_aws.chat_models import ChatBedrock


def main():
    # Include Bedrock credentials
    load_dotenv()

    # Ingest document
    # Note you can download this file from: https://documents1.worldbank.org/curated/en/099101824180532047/pdf/BOSIB13bdde89d07f1b3711dd8e86adb477.pdf
    fp = "./data/world-bank-report-example.pdf"
    prompt = (
        "You are an assistant tasked with describing images for retrieval. "
        "1. These descriptions will be embedded and used to retrieve the raw image. "
        "Give a concise description of the image that is well optimized for retrieval\n"
        "2. extract all the text from the image. "
        "Do not exclude any content from the page.\n"
        "Format your answer in markdown without explanatory text "
        "and without markdown delimiter ``` at the beginning. "
    )

    # 1) Load and parse documents
    llm_img_parser = ChatBedrock(
        model_id="anthropic.claude-3-sonnet-20240229-v1:0",
        model_kwargs=dict(temperature=0.1),
    )
    img_parser = LLMImageBlobParser(
        model=llm_img_parser,
        prompt=prompt,
    )
    loader = PyMuPDFLoader(
        file_path=fp,
        mode="page",
        extract_images=True,
        images_parser=img_parser,
        extract_tables="markdown",
        images_inner_format="text",
    )

    docs = []
    docs_lazy = loader.lazy_load()
    for doc in docs_lazy:
        print(f"Processing doc {doc}")
        docs.append(doc)

    print(docs[0].page_content[:100])
    print(docs[0].metadata)


if __name__ == "__main__":
    main()
Error Message and Stack Trace (if applicable)
Error raised by bedrock service
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/langchain_aws/llms/bedrock.py", line 956, in _prepare_input_and_invoke
    response = self.client.invoke_model(**request_options)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/botocore/client.py", line 570, in _api_call
    return self._make_api_call(operation_name, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/botocore/context.py", line 124, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/botocore/client.py", line 1031, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.ValidationException: An error occurred (ValidationException) when calling the InvokeModel operation: messages.0.content.0.image.source: image exceeds 5 MB maximum: 8033316 bytes > 5242880 bytes
Description
- I want to use LLMImageBlobParser to extract descriptions from the images in a PDF, in addition to the text extracted by the PyMuPDF document loader.
- I am using an Anthropic model on Bedrock to parse these images. I am aware that the maximum size for an image passed to Anthropic models is 5 MB.
- The PDF parsers do not take this size limit into account when they extract images. There should be a helper function that resizes images to a maximum size (in MB) that the user knows and can specify up front, for instance when constructing the PyMuPDF document loader.
- This is where the images are created, and where the size-reduction helper should go:
langchain/libs/community/langchain_community/document_loaders/parsers/pdf.py
Lines 1090 to 1104 in 1103bdf
- Another option would be to modify this part of LLMImageBlobParser instead, which is probably more modular:
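
For illustration only (this is not the code referenced in the links above, and not an existing LangChain API): a minimal sketch of the kind of size-reduction helper proposed here. The names `shrink_to_limit`, `png_encoder`, and `ANTHROPIC_MAX_IMAGE_BYTES` are hypothetical. The sketch assumes the image is available as a Pillow image object and simply re-encodes it at smaller and smaller scales until the payload fits. I have not verified whether Bedrock applies the limit to the raw or to the base64-encoded bytes, so a lower `max_bytes` may be needed in practice.

```python
import io
from typing import Callable, Tuple

# 5 MiB = 5242880 bytes, the limit quoted in the ValidationException above.
ANTHROPIC_MAX_IMAGE_BYTES = 5 * 1024 * 1024


def shrink_to_limit(
    encode: Callable[[float], bytes],
    max_bytes: int = ANTHROPIC_MAX_IMAGE_BYTES,
    step: float = 0.8,
    min_scale: float = 0.05,
) -> Tuple[bytes, float]:
    """Re-encode at decreasing scales until the payload fits in max_bytes.

    `encode(scale)` must return the encoded image at that linear scale
    (1.0 = original size). Returns the bytes and the scale that was used.
    """
    scale = 1.0
    while scale >= min_scale:
        data = encode(scale)
        if len(data) <= max_bytes:
            return data, scale
        scale *= step  # shrink both sides by the same factor and retry
    raise ValueError(
        f"image does not fit in {max_bytes} bytes at any scale down to {min_scale}"
    )


def png_encoder(img) -> Callable[[float], bytes]:
    """Build an encode(scale) callable for a Pillow image (hypothetical helper).

    `img` is assumed to expose width, height, resize() and save(),
    as PIL.Image.Image does.
    """

    def encode(scale: float) -> bytes:
        if scale < 1.0:
            size = (
                max(1, round(img.width * scale)),
                max(1, round(img.height * scale)),
            )
            resized = img.resize(size)
        else:
            resized = img
        buf = io.BytesIO()
        resized.save(buf, format="PNG")
        return buf.getvalue()

    return encode
```

With Pillow installed, usage would be along the lines of `data, scale = shrink_to_limit(png_encoder(Image.open(path)))`. Keeping the retry loop separate from the encoder means the same loop could wrap whatever encoding step the parser already performs, and the limit stays a user-supplied parameter rather than a value hard-coded for one provider.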
System Info
System Information
OS: Linux
OS Version: #1 SMP Fri Mar 29 23:14:13 UTC 2024
Python Version: 3.12.9 (main, Feb 25 2025, 02:40:13) [GCC 12.2.0]
Package Information
langchain_core: 0.3.46
langchain: 0.3.21
langchain_community: 0.3.20
langsmith: 0.3.18
langchain_aws: 0.2.16
langchain_text_splitters: 0.3.7
Optional packages not installed
langserve
Other Dependencies
aiohttp<4.0.0,>=3.8.3: Installed. No version info available.
async-timeout<5.0.0,>=4.0.0;: Installed. No version info available.
boto3: 1.37.13
dataclasses-json<0.7,>=0.5.7: Installed. No version info available.
httpx: 0.28.1
httpx-sse<1.0.0,>=0.4.0: Installed. No version info available.
jsonpatch<2.0,>=1.33: Installed. No version info available.
langchain-anthropic;: Installed. No version info available.
langchain-aws;: Installed. No version info available.
langchain-azure-ai;: Installed. No version info available.
langchain-cohere;: Installed. No version info available.
langchain-community;: Installed. No version info available.
langchain-core<1.0.0,>=0.3.45: Installed. No version info available.
langchain-deepseek;: Installed. No version info available.
langchain-fireworks;: Installed. No version info available.
langchain-google-genai;: Installed. No version info available.
langchain-google-vertexai;: Installed. No version info available.
langchain-groq;: Installed. No version info available.
langchain-huggingface;: Installed. No version info available.
langchain-mistralai;: Installed. No version info available.
langchain-ollama;: Installed. No version info available.
langchain-openai;: Installed. No version info available.
langchain-text-splitters<1.0.0,>=0.3.7: Installed. No version info available.
langchain-together;: Installed. No version info available.
langchain-xai;: Installed. No version info available.
langchain<1.0.0,>=0.3.21: Installed. No version info available.
langsmith-pyo3: Installed. No version info available.
langsmith<0.4,>=0.1.125: Installed. No version info available.
langsmith<0.4,>=0.1.17: Installed. No version info available.
numpy: 2.2.4
numpy<3,>=1.26.2: Installed. No version info available.
openai-agents: Installed. No version info available.
opentelemetry-api: Installed. No version info available.
opentelemetry-exporter-otlp-proto-http: Installed. No version info available.
opentelemetry-sdk: Installed. No version info available.
orjson: 3.10.15
packaging: 24.2
packaging<25,>=23.2: Installed. No version info available.
pydantic: 2.10.6
pydantic-settings<3.0.0,>=2.4.0: Installed. No version info available.
pydantic<3.0.0,>=2.5.2;: Installed. No version info available.
pydantic<3.0.0,>=2.7.4: Installed. No version info available.
pydantic<3.0.0,>=2.7.4;: Installed. No version info available.
pytest: Installed. No version info available.
PyYAML>=5.3: Installed. No version info available.
requests: 2.32.3
requests-toolbelt: 1.0.0
requests<3,>=2: Installed. No version info available.
rich: Installed. No version info available.
SQLAlchemy<3,>=1.4: Installed. No version info available.
tenacity!=8.4.0,<10,>=8.1.0: Installed. No version info available.
tenacity!=8.4.0,<10.0.0,>=8.1.0: Installed. No version info available.
typing-extensions>=4.7: Installed. No version info available.
zstandard: 0.23.0