ZeroxPDFLoader is not compatible with GenericLoader #30455

Open
@pprados

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain_community.document_loaders.parsers import ZeroxPDFParser

Error Message and Stack Trace (if applicable)

No response

Description

The typical requirements for RAG projects are generally as follows:

  • Import PDF files into a vector database
  • From a directory structure
  • Be able to update the files
  • Without re-importing everything
  • Oh, and don't forget to remove files that are no longer present from the vector database
  • Since the PDF format isn’t great, we also have some files in Word format
  • It’s not just 10 sample documents, but 50,000 with 20 pages each, evolving daily
  • The files are, of course, stored in cloud storage
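To make the update and removal requirements concrete, here is a minimal, library-free sketch of the bookkeeping involved: fingerprint each file, compare against what was recorded at the last import, and derive what to add, update, or delete. The names `SyncPlan`, `content_hash`, and `plan_sync` are hypothetical and exist only for this illustration; this is not how LangChain implements its record manager.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPlan:
    """Work needed to bring the index in line with the files in storage."""
    to_add: list[str]
    to_update: list[str]
    to_delete: list[str]


def content_hash(data: bytes) -> str:
    """Fingerprint a file's bytes so unchanged files can be skipped."""
    return hashlib.sha256(data).hexdigest()


def plan_sync(current: dict[str, bytes], indexed: dict[str, str]) -> SyncPlan:
    """Compare current files (path -> bytes) with recorded hashes (path -> hash)."""
    to_add, to_update = [], []
    for path, data in sorted(current.items()):
        if path not in indexed:
            to_add.append(path)          # never imported before
        elif indexed[path] != content_hash(data):
            to_update.append(path)       # content changed since last import
    # Anything recorded but no longer present must leave the vector database.
    to_delete = sorted(set(indexed) - set(current))
    return SyncPlan(to_add, to_update, to_delete)


# Hypothetical state: a.pdf is unchanged, b.pdf was edited,
# c.docx is new, and old.pdf was deleted from storage.
indexed = {
    "a.pdf": content_hash(b"v1"),
    "b.pdf": content_hash(b"v1"),
    "old.pdf": content_hash(b"v1"),
}
current = {"a.pdf": b"v1", "b.pdf": b"v2", "c.docx": b"v1"}
plan = plan_sync(current, indexed)
```

Only `b.pdf` and `c.docx` need to be parsed and embedded here, which is what makes daily updates over tens of thousands of files affordable.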

In my opinion, the best approach to handle this using LangChain is with code similar to this:

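from langchain.indexes import index
from langchain_community.document_loaders.blob_loaders import FileSystemBlobLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers.generic import MimeTypeBasedParser
from langchain_community.document_loaders.parsers.msword import MsWordParser
from langchain_community.document_loaders.parsers.txt import TextParser

vector_store = ...
record_manager = ...
loader = GenericLoader(
    blob_loader=FileSystemBlobLoader(  # Or CloudBlobLoader
        path="mydata/",
        glob="**/*",
        show_progress=True,
    ),
    # Pick a parser per file based on its MIME type.
    blob_parser=MimeTypeBasedParser(
        handlers={
            "application/pdf": ZeroxPDFParser(),  # IMPOSSIBLE: no such parser exists
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
                MsWordParser(),
        },
        fallback_parser=TextParser(),
    ),
)
index(
    loader.lazy_load(),
    record_manager,
    vector_store,
    batch_size=100,
)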

For this to work, each Loader must also be available as a standalone "Parser" that can be passed to MimeTypeBasedParser.
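The dispatch idea behind a MIME-type-based parser can be sketched with the standard library alone. This is an illustration of the pattern, not LangChain's implementation: `parse_pdf`, `parse_docx`, `parse_text`, and `pick_parser` are hypothetical stand-ins, and the MIME type is guessed from the file extension rather than from the blob itself.

```python
import mimetypes
from typing import Callable

Parser = Callable[[bytes], str]

DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
# Register .docx explicitly so the lookup does not depend on the platform's MIME table.
mimetypes.add_type(DOCX, ".docx")


def parse_pdf(data: bytes) -> str:
    """Placeholder for a real PDF parser."""
    return f"pdf ({len(data)} bytes)"


def parse_docx(data: bytes) -> str:
    """Placeholder for a real Word parser."""
    return f"docx ({len(data)} bytes)"


def parse_text(data: bytes) -> str:
    """Fallback: treat the bytes as UTF-8 text."""
    return data.decode("utf-8", errors="replace")


HANDLERS: dict[str, Parser] = {"application/pdf": parse_pdf, DOCX: parse_docx}


def pick_parser(path: str) -> Parser:
    """Return the handler registered for the file's MIME type, else the fallback."""
    mime, _encoding = mimetypes.guess_type(path)
    return HANDLERS.get(mime, parse_text)
```

The registry only accepts parser objects, which is why a format whose logic exists only inside a Loader cannot be plugged in.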

ZeroxPDFLoader has several limitations:

  • It does not provide a parser.
  • It does not support image conversions.

A PR solves this.
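The loader/parser split being asked for can be sketched as follows. This is not the code from the PR and not LangChain's classes: `Blob`, `Document`, `PagePDFParser`, and `PDFLoader` are simplified, hypothetical stand-ins, and the "PDF extraction" is faked by splitting on form feeds. The point is only that the format-specific logic lives in the parser, and the loader becomes a thin wrapper around it.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterator, Protocol


@dataclass
class Blob:
    """Raw bytes plus the path they came from."""
    path: str
    data: bytes


@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)


class BlobParser(Protocol):
    def lazy_parse(self, blob: Blob) -> Iterator[Document]: ...


class PagePDFParser:
    """Hypothetical parser: all format-specific logic lives here."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        # Stand-in for real extraction: treat form feeds as page breaks.
        for number, page in enumerate(blob.data.decode("utf-8").split("\f")):
            yield Document(page, {"source": blob.path, "page": number})


class PDFLoader:
    """Thin loader: reads one file and delegates to the parser."""

    def __init__(self, path: str, parser: BlobParser) -> None:
        self.path = path
        self.parser = parser

    def lazy_load(self) -> Iterator[Document]:
        blob = Blob(self.path, Path(self.path).read_bytes())
        yield from self.parser.lazy_parse(blob)


# The same parser works through the loader...
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "demo.pdf"
    sample.write_bytes(b"page one\fpage two")
    via_loader = list(PDFLoader(str(sample), PagePDFParser()).lazy_load())

# ...or directly on a blob, which is what a generic, MIME-dispatching loader needs.
via_parser = list(PagePDFParser().lazy_parse(Blob("demo.pdf", b"page one\fpage two")))
```

With this shape, existing callers of the loader keep working, while the parser can be registered next to others in a MIME-type handler table.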

System Info

System Information

OS: Linux
OS Version: #19~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Feb 17 11:51:52 UTC 2
Python Version: 3.12.9 | packaged by Anaconda, Inc. | (main, Feb 6 2025, 18:56:27) [GCC 11.2.0]

Package Information

langchain_core: 0.3.45
langchain: 0.3.20
langchain_community: 0.3.19
langsmith: 0.3.8
langchain_openai: 0.3.8
langchain_tests: 0.3.11
langchain_text_splitters: 0.3.6

Optional packages not installed

langserve

Other Dependencies

aiohttp<4.0.0,>=3.8.3: Installed. No version info available.
async-timeout<5.0.0,>=4.0.0;: Installed. No version info available.
dataclasses-json<0.7,>=0.5.7: Installed. No version info available.
httpx: 0.28.1
httpx-sse<1.0.0,>=0.4.0: Installed. No version info available.
httpx<1,>=0.25.0: Installed. No version info available.
jsonpatch<2.0,>=1.33: Installed. No version info available.
langchain-anthropic;: Installed. No version info available.
langchain-aws;: Installed. No version info available.
langchain-cohere;: Installed. No version info available.
langchain-community;: Installed. No version info available.
langchain-core<1.0.0,>=0.3.34: Installed. No version info available.
langchain-core<1.0.0,>=0.3.41: Installed. No version info available.
langchain-core<1.0.0,>=0.3.42: Installed. No version info available.
langchain-deepseek;: Installed. No version info available.
langchain-fireworks;: Installed. No version info available.
langchain-google-genai;: Installed. No version info available.
langchain-google-vertexai;: Installed. No version info available.
langchain-groq;: Installed. No version info available.
langchain-huggingface;: Installed. No version info available.
langchain-mistralai;: Installed. No version info available.
langchain-ollama;: Installed. No version info available.
langchain-openai;: Installed. No version info available.
langchain-text-splitters<1.0.0,>=0.3.6: Installed. No version info available.
langchain-together;: Installed. No version info available.
langchain-xai;: Installed. No version info available.
langchain<1.0.0,>=0.3.20: Installed. No version info available.
langsmith-pyo3: Installed. No version info available.
langsmith<0.4,>=0.1.125: Installed. No version info available.
langsmith<0.4,>=0.1.17: Installed. No version info available.
numpy<2.0.0,>=1.24.0;: Installed. No version info available.
numpy<3,>=1.26.2: Installed. No version info available.
numpy<3,>=1.26.2;: Installed. No version info available.
openai<2.0.0,>=1.58.1: Installed. No version info available.
orjson: 3.10.15
packaging<25,>=23.2: Installed. No version info available.
pydantic: 2.10.6
pydantic-settings<3.0.0,>=2.4.0: Installed. No version info available.
pydantic<3.0.0,>=2.5.2;: Installed. No version info available.
pydantic<3.0.0,>=2.7.4: Installed. No version info available.
pydantic<3.0.0,>=2.7.4;: Installed. No version info available.
pytest: 7.4.4
pytest-asyncio<1,>=0.20: Installed. No version info available.
pytest-socket<1,>=0.6.0: Installed. No version info available.
pytest<9,>=7: Installed. No version info available.
PyYAML>=5.3: Installed. No version info available.
requests: 2.32.3
requests-toolbelt: 1.0.0
requests<3,>=2: Installed. No version info available.
rich: 12.6.0
SQLAlchemy<3,>=1.4: Installed. No version info available.
syrupy<5,>=4: Installed. No version info available.
tenacity!=8.4.0,<10,>=8.1.0: Installed. No version info available.
tenacity!=8.4.0,<10.0.0,>=8.1.0: Installed. No version info available.
tiktoken<1,>=0.7: Installed. No version info available.
typing-extensions>=4.7: Installed. No version info available.
zstandard: 0.23.0


Labels: 🤖:bug (Related to a bug, vulnerability, unexpected error with an existing feature)
