Checked other resources
- I added a very descriptive title to this issue.
- I searched the LangChain documentation with the integrated search.
- I used the GitHub search to find a similar question and didn't find it.
- I am sure that this is a bug in LangChain rather than my code.
- The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
Example Code
from langchain_community.document_loaders.parsers import PDFPlumberParser
Error Message and Stack Trace (if applicable)
No response
Description
The typical requirements for RAG projects are generally as follows:
- Import PDF files into a vector database
- From a directory structure
- Be able to update the files
- Without re-importing everything
- Oh, and don't forget to remove files that are no longer present from the vector database
- Since the PDF format isn’t great, we also have some files in Word format
- It’s not just 10 sample documents, but 50,000 with 20 pages each, evolving daily
- The files are, of course, stored in cloud storage
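The update and removal requirements above amount to a sync between the directory and what was recorded at the last import. A minimal standard-library sketch of that bookkeeping follows; `content_hash`, `plan_sync`, and the sample file names are hypothetical and only illustrate the idea, they are not LangChain's implementation:

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Fingerprint a file's content so changes can be detected cheaply."""
    return hashlib.sha256(data).hexdigest()


def plan_sync(current: dict[str, bytes], recorded: dict[str, str]) -> dict[str, list[str]]:
    """Compare files on disk with the hashes recorded by the previous run.

    current  maps source path -> file bytes found now
    recorded maps source path -> hash stored at the last import
    """
    plan: dict[str, list[str]] = {"add": [], "update": [], "delete": [], "skip": []}
    for path, data in current.items():
        if path not in recorded:
            plan["add"].append(path)      # new file
        elif recorded[path] != content_hash(data):
            plan["update"].append(path)   # content changed
        else:
            plan["skip"].append(path)     # unchanged, no re-import
    # Anything recorded but no longer on disk must leave the vector store
    plan["delete"] = [p for p in recorded if p not in current]
    return {key: sorted(paths) for key, paths in plan.items()}


files = {"a.pdf": b"v2", "b.docx": b"same", "d.pdf": b"new"}
recorded = {
    "a.pdf": content_hash(b"v1"),
    "b.docx": content_hash(b"same"),
    "c.pdf": content_hash(b"gone"),
}
plan = plan_sync(files, recorded)
```

In LangChain this bookkeeping is the job of the record manager passed to `index()`; as far as I know, deletions only happen when a `cleanup` mode is given to `index()`.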
In my opinion, the best approach to handle this using LangChain is with code similar to this:
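from langchain.indexes import index
from langchain_community.document_loaders.blob_loaders import FileSystemBlobLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import PDFPlumberParser
from langchain_community.document_loaders.parsers.generic import MimeTypeBasedParser
from langchain_community.document_loaders.parsers.msword import MsWordParser
from langchain_community.document_loaders.parsers.txt import TextParser

vector_store = ...    # any vector store
record_manager = ...  # any record manager

loader = GenericLoader(
    blob_loader=FileSystemBlobLoader(  # Or CloudBlobLoader
        path="mydata/",
        glob="**/*",
        show_progress=True,
    ),
    # Pick a parser per file based on its MIME type
    blob_parser=MimeTypeBasedParser(
        handlers={
            "application/pdf": PDFPlumberParser(),  # IMPOSSIBLE
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
                MsWordParser(),
        },
        fallback_parser=TextParser(),
    ),
)
index(
    loader.lazy_load(),
    record_manager,
    vector_store,
    batch_size=100,
)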
For this to work, each Loader must also be available as a standalone Parser that can be plugged into `MimeTypeBasedParser`.
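To show why a standalone parser per format matters, here is a minimal standard-library sketch of MIME-type dispatch. It only illustrates the idea behind a handler table with a fallback; `pick_parser`, `parse_pdf_stub`, and `parse_text` are hypothetical names, not LangChain APIs:

```python
import mimetypes
from typing import Callable, Iterator, Optional

# A "parser" here is any callable that turns raw bytes into text chunks
Parser = Callable[[bytes], Iterator[str]]


def parse_text(data: bytes) -> Iterator[str]:
    """Fallback: treat the content as UTF-8 text."""
    yield data.decode("utf-8", errors="replace")


def parse_pdf_stub(data: bytes) -> Iterator[str]:
    """Placeholder for a real PDF parser, which would yield one chunk per page."""
    yield f"<pdf, {len(data)} bytes>"


def pick_parser(path: str, handlers: dict[Optional[str], Parser], fallback: Parser) -> Parser:
    """Guess the MIME type from the file name and return the matching parser."""
    mime, _encoding = mimetypes.guess_type(path)
    return handlers.get(mime, fallback)


handlers: dict[Optional[str], Parser] = {"application/pdf": parse_pdf_stub}

pdf_parser = pick_parser("report.pdf", handlers, parse_text)
txt_parser = pick_parser("notes.txt", handlers, parse_text)
```

A format that only ships as a Loader (file path in, documents out) cannot be registered in such a table, which is the gap described here.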
PDFPlumber has several limitations:
- It does not provide a parser
- Uses `load()` instead of `lazy_load()`
- Does not handle tables
- Does not support image conversions
This PR resolves this.
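On the `load()` versus `lazy_load()` point: with tens of thousands of files, an eager loader has to hold every parsed document in memory before indexing can start, while a lazy one can feed the indexer batch by batch. A small standard-library sketch of the difference, where the functions below are simplified stand-ins and not the LangChain methods themselves:

```python
from typing import Iterable, Iterator


def load(paths: list[str]) -> list[str]:
    """Eager: parses everything and returns the full list at once."""
    return [f"parsed:{p}" for p in paths]


def lazy_load(paths: Iterable[str]) -> Iterator[str]:
    """Lazy: parses one item at a time, only when the consumer asks for it."""
    for p in paths:
        yield f"parsed:{p}"


def batched(docs: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Group a stream of documents into lists of at most batch_size items."""
    batch: list[str] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter batch
        yield batch


paths = [f"doc{i}.pdf" for i in range(5)]
batches = list(batched(lazy_load(paths), batch_size=2))
```

With the lazy variant, at most one batch of parsed documents needs to exist at a time, which is what makes a `batch_size` argument on the indexing side useful.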
System Info
System Information
OS: Linux
OS Version: #19~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Feb 17 11:51:52 UTC 2
Python Version: 3.12.9 | packaged by Anaconda, Inc. | (main, Feb 6 2025, 18:56:27) [GCC 11.2.0]
Package Information
langchain_core: 0.3.45
langchain: 0.3.20
langchain_community: 0.3.19
langsmith: 0.3.8
langchain_openai: 0.3.8
langchain_tests: 0.3.11
langchain_text_splitters: 0.3.6
Optional packages not installed
langserve
Other Dependencies
aiohttp<4.0.0,>=3.8.3: Installed. No version info available.
async-timeout<5.0.0,>=4.0.0;: Installed. No version info available.
dataclasses-json<0.7,>=0.5.7: Installed. No version info available.
httpx: 0.28.1
httpx-sse<1.0.0,>=0.4.0: Installed. No version info available.
httpx<1,>=0.25.0: Installed. No version info available.
jsonpatch<2.0,>=1.33: Installed. No version info available.
langchain-anthropic;: Installed. No version info available.
langchain-aws;: Installed. No version info available.
langchain-cohere;: Installed. No version info available.
langchain-community;: Installed. No version info available.
langchain-core<1.0.0,>=0.3.34: Installed. No version info available.
langchain-core<1.0.0,>=0.3.41: Installed. No version info available.
langchain-core<1.0.0,>=0.3.42: Installed. No version info available.
langchain-deepseek;: Installed. No version info available.
langchain-fireworks;: Installed. No version info available.
langchain-google-genai;: Installed. No version info available.
langchain-google-vertexai;: Installed. No version info available.
langchain-groq;: Installed. No version info available.
langchain-huggingface;: Installed. No version info available.
langchain-mistralai;: Installed. No version info available.
langchain-ollama;: Installed. No version info available.
langchain-openai;: Installed. No version info available.
langchain-text-splitters<1.0.0,>=0.3.6: Installed. No version info available.
langchain-together;: Installed. No version info available.
langchain-xai;: Installed. No version info available.
langchain<1.0.0,>=0.3.20: Installed. No version info available.
langsmith-pyo3: Installed. No version info available.
langsmith<0.4,>=0.1.125: Installed. No version info available.
langsmith<0.4,>=0.1.17: Installed. No version info available.
numpy<2.0.0,>=1.24.0;: Installed. No version info available.
numpy<3,>=1.26.2: Installed. No version info available.
numpy<3,>=1.26.2;: Installed. No version info available.
openai<2.0.0,>=1.58.1: Installed. No version info available.
orjson: 3.10.15
packaging<25,>=23.2: Installed. No version info available.
pydantic: 2.10.6
pydantic-settings<3.0.0,>=2.4.0: Installed. No version info available.
pydantic<3.0.0,>=2.5.2;: Installed. No version info available.
pydantic<3.0.0,>=2.7.4: Installed. No version info available.
pydantic<3.0.0,>=2.7.4;: Installed. No version info available.
pytest: 7.4.4
pytest-asyncio<1,>=0.20: Installed. No version info available.
pytest-socket<1,>=0.6.0: Installed. No version info available.
pytest<9,>=7: Installed. No version info available.
PyYAML>=5.3: Installed. No version info available.
requests: 2.32.3
requests-toolbelt: 1.0.0
requests<3,>=2: Installed. No version info available.
rich: 12.6.0
SQLAlchemy<3,>=1.4: Installed. No version info available.
syrupy<5,>=4: Installed. No version info available.
tenacity!=8.4.0,<10,>=8.1.0: Installed. No version info available.
tenacity!=8.4.0,<10.0.0,>=8.1.0: Installed. No version info available.
tiktoken<1,>=0.7: Installed. No version info available.
typing-extensions>=4.7: Installed. No version info available.
zstandard: 0.23.0