PDFPlumberLoader is not compatible with GenericLoader #30454

Open
@pprados

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain_community.document_loaders.parsers import PDFPlumberParser

Error Message and Stack Trace (if applicable)

No response

Description

The typical requirements for RAG projects are generally as follows:

  • Import PDF files into a vector database
  • From a directory structure
  • Be able to update the files
  • Without re-importing everything
  • Oh, and don't forget to remove files that are no longer present from the vector database
  • Since the PDF format isn’t great, we also have some files in Word format
  • It’s not just 10 sample documents, but 50,000 with 20 pages each, evolving daily
  • The files are, of course, stored in cloud storage
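The update and removal requirements above amount to comparing what was indexed last time with what is present now. A minimal standard-library sketch of that idea follows; it is not LangChain's record manager, and `plan_sync` along with its dictionaries are hypothetical names used only for illustration.

```python
import hashlib


def plan_sync(previous: dict[str, str], current: dict[str, bytes]):
    """Compare stored content hashes with the files currently present.

    previous: path -> content hash recorded at the last run (hypothetical store)
    current:  path -> raw file content found during this run
    """
    current_hashes = {
        path: hashlib.sha256(data).hexdigest() for path, data in current.items()
    }
    # New files: present now, unknown at the last run.
    to_add = sorted(p for p in current_hashes if p not in previous)
    # Changed files: known path, different content hash.
    to_update = sorted(
        p for p, h in current_hashes.items() if p in previous and previous[p] != h
    )
    # Removed files: recorded at the last run, no longer present.
    to_delete = sorted(p for p in previous if p not in current_hashes)
    return to_add, to_update, to_delete, current_hashes


# First run: nothing recorded yet, so every file is new.
old_files = {"a.pdf": b"v1", "b.docx": b"v1", "c.pdf": b"v1"}
_, _, _, recorded = plan_sync({}, old_files)

# Second run: b.docx changed, c.pdf disappeared, d.pdf appeared.
new_files = {"a.pdf": b"v1", "b.docx": b"v2", "d.pdf": b"v1"}
to_add, to_update, to_delete, _ = plan_sync(recorded, new_files)
print(to_add, to_update, to_delete)
```

Only the changed and new files need to be parsed and embedded again, which is what makes a daily refresh of a large corpus affordable.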

In my opinion, the best approach to handle this using LangChain is with code similar to this:

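from langchain.indexes import index
from langchain_community.document_loaders import FileSystemBlobLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import PDFPlumberParser
from langchain_community.document_loaders.parsers.generic import MimeTypeBasedParser
from langchain_community.document_loaders.parsers.msword import MsWordParser
from langchain_community.document_loaders.parsers.txt import TextParser

vector_store = ...  # placeholder: the target vector store
record_manager = ...  # placeholder: tracks what has already been indexed
loader = GenericLoader(
    blob_loader=FileSystemBlobLoader(  # Or CloudBlobLoader
        path="mydata/",
        glob="**/*",
        show_progress=True,
    ),
    # Pick a parser per MIME type; anything unrecognized is read as plain text.
    blob_parser=MimeTypeBasedParser(
        handlers={
            "application/pdf": PDFPlumberParser(),  # IMPOSSIBLE
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
                MsWordParser(),
        },
        fallback_parser=TextParser(),
    ),
)
# Stream documents into the vector store in batches.
index(
    loader.lazy_load(),
    record_manager,
    vector_store,
    batch_size=100,
)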

For this to work, each Loader must also be available in a "Parser" version that can be passed as a handler.
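To illustrate why a separate parser object is needed, here is a simplified standard-library sketch of MIME-type dispatch. It is a stand-in for the idea only, not LangChain's MimeTypeBasedParser; `parse_by_mime_type`, `parse_text`, and `parse_pdf_stub` are hypothetical names.

```python
import mimetypes
from typing import Callable, Iterator

# Hypothetical stand-in for a parser: turns raw bytes into text chunks.
Parser = Callable[[bytes], Iterator[str]]


def parse_text(data: bytes) -> Iterator[str]:
    """Fallback parser: decode the bytes as UTF-8 text."""
    yield data.decode("utf-8", errors="replace")


def parse_pdf_stub(data: bytes) -> Iterator[str]:
    """Stub: a real PDF parser would extract page text here."""
    yield f"<pdf: {len(data)} bytes>"


def parse_by_mime_type(
    path: str, data: bytes, handlers: dict[str, Parser], fallback: Parser
) -> Iterator[str]:
    """Guess the MIME type from the file name and delegate to the matching parser."""
    mime_type, _ = mimetypes.guess_type(path)
    parser = handlers.get(mime_type, fallback)
    yield from parser(data)


handlers = {"application/pdf": parse_pdf_stub}
pdf_chunks = list(parse_by_mime_type("report.pdf", b"%PDF-1.7", handlers, parse_text))
txt_chunks = list(parse_by_mime_type("notes.txt", b"hello", handlers, parse_text))
print(pdf_chunks, txt_chunks)
```

The dispatcher never opens files itself; it only receives bytes and a name. A loader that bundles file access and parsing in one class cannot be plugged into the handlers table, which is the incompatibility this issue describes.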

PDFPlumberLoader has several limitations:

  • It does not provide a parser
  • It uses load() instead of lazy_load()
  • It does not handle tables
  • It does not support image conversions
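The load() versus lazy_load() point matters at the scale described earlier. The sketch below uses plain Python functions, not the LangChain classes, to show the difference; the function bodies and the `parsed_pages` tracker are illustrative assumptions.

```python
from typing import Iterator

parsed_pages: list[str] = []  # records which pages have actually been parsed


def lazy_load(pages: list[str]) -> Iterator[str]:
    """Yield one parsed page at a time; work happens only when a page is requested."""
    for page in pages:
        parsed_pages.append(page)
        yield page.upper()  # stand-in for real parsing work


def load(pages: list[str]) -> list[str]:
    """Parse everything up front and hold the full result in memory."""
    return list(lazy_load(pages))


pages = ["page 1", "page 2", "page 3"]

# Lazy: asking for the first result parses only the first page.
stream = lazy_load(pages)
first = next(stream)
parsed_after_first = list(parsed_pages)

# Eager: every page is parsed before anything is returned.
parsed_pages.clear()
everything = load(pages)
parsed_after_load = list(parsed_pages)

print(first, parsed_after_first, len(everything), parsed_after_load)
```

A consumer that works in batches, such as the index() call above with batch_size=100, can start writing to the vector store after the first batch when the source is lazy, instead of waiting for the whole corpus to be parsed.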

This PR resolves these limitations.

System Info

System Information

OS: Linux
OS Version: #19~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Feb 17 11:51:52 UTC 2
Python Version: 3.12.9 | packaged by Anaconda, Inc. | (main, Feb 6 2025, 18:56:27) [GCC 11.2.0]

Package Information

langchain_core: 0.3.45
langchain: 0.3.20
langchain_community: 0.3.19
langsmith: 0.3.8
langchain_openai: 0.3.8
langchain_tests: 0.3.11
langchain_text_splitters: 0.3.6

Optional packages not installed

langserve

Other Dependencies

aiohttp<4.0.0,>=3.8.3: Installed. No version info available.
async-timeout<5.0.0,>=4.0.0;: Installed. No version info available.
dataclasses-json<0.7,>=0.5.7: Installed. No version info available.
httpx: 0.28.1
httpx-sse<1.0.0,>=0.4.0: Installed. No version info available.
httpx<1,>=0.25.0: Installed. No version info available.
jsonpatch<2.0,>=1.33: Installed. No version info available.
langchain-anthropic;: Installed. No version info available.
langchain-aws;: Installed. No version info available.
langchain-cohere;: Installed. No version info available.
langchain-community;: Installed. No version info available.
langchain-core<1.0.0,>=0.3.34: Installed. No version info available.
langchain-core<1.0.0,>=0.3.41: Installed. No version info available.
langchain-core<1.0.0,>=0.3.42: Installed. No version info available.
langchain-deepseek;: Installed. No version info available.
langchain-fireworks;: Installed. No version info available.
langchain-google-genai;: Installed. No version info available.
langchain-google-vertexai;: Installed. No version info available.
langchain-groq;: Installed. No version info available.
langchain-huggingface;: Installed. No version info available.
langchain-mistralai;: Installed. No version info available.
langchain-ollama;: Installed. No version info available.
langchain-openai;: Installed. No version info available.
langchain-text-splitters<1.0.0,>=0.3.6: Installed. No version info available.
langchain-together;: Installed. No version info available.
langchain-xai;: Installed. No version info available.
langchain<1.0.0,>=0.3.20: Installed. No version info available.
langsmith-pyo3: Installed. No version info available.
langsmith<0.4,>=0.1.125: Installed. No version info available.
langsmith<0.4,>=0.1.17: Installed. No version info available.
numpy<2.0.0,>=1.24.0;: Installed. No version info available.
numpy<3,>=1.26.2: Installed. No version info available.
numpy<3,>=1.26.2;: Installed. No version info available.
openai<2.0.0,>=1.58.1: Installed. No version info available.
orjson: 3.10.15
packaging<25,>=23.2: Installed. No version info available.
pydantic: 2.10.6
pydantic-settings<3.0.0,>=2.4.0: Installed. No version info available.
pydantic<3.0.0,>=2.5.2;: Installed. No version info available.
pydantic<3.0.0,>=2.7.4: Installed. No version info available.
pydantic<3.0.0,>=2.7.4;: Installed. No version info available.
pytest: 7.4.4
pytest-asyncio<1,>=0.20: Installed. No version info available.
pytest-socket<1,>=0.6.0: Installed. No version info available.
pytest<9,>=7: Installed. No version info available.
PyYAML>=5.3: Installed. No version info available.
requests: 2.32.3
requests-toolbelt: 1.0.0
requests<3,>=2: Installed. No version info available.
rich: 12.6.0
SQLAlchemy<3,>=1.4: Installed. No version info available.
syrupy<5,>=4: Installed. No version info available.
tenacity!=8.4.0,<10,>=8.1.0: Installed. No version info available.
tenacity!=8.4.0,<10.0.0,>=8.1.0: Installed. No version info available.
tiktoken<1,>=0.7: Installed. No version info available.
typing-extensions>=4.7: Installed. No version info available.
zstandard: 0.23.0
