Description
Describe the bug
When running the Streamlit application on Windows, the application crashes immediately after a PDF file is uploaded. The process terminates with exit code 3221225477 (0xC0000005), which is the Windows access violation status. The crash occurs during the creation of the vector database.
- Operating System: Windows
- Python Environment: Conda
- App Version: v2.1.0
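For anyone checking the exit code mentioned above, here is a minimal sketch that converts the decimal value to the hexadecimal form Windows documentation normally uses. The variable names are illustrative only and are not part of the project.

```python
# The decimal exit code reported when the app crashes.
exit_code = 3221225477

# Windows status codes are usually written as 8-digit hexadecimal values.
as_hex = f"0x{exit_code:08X}"
print(as_hex)  # prints 0xC0000005

# 0xC0000005 is the NTSTATUS value STATUS_ACCESS_VIOLATION.
STATUS_ACCESS_VIOLATION = 0xC0000005
is_access_violation = exit_code == STATUS_ACCESS_VIOLATION
print(is_access_violation)  # prints True
```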
To Reproduce
Steps to reproduce the behavior:
- Set up and install the project on a Windows machine using a Conda environment.
- Launch the application using python run.py.
- In the Streamlit UI, use the file uploader to select and upload any PDF document.
- Observe that the application crashes as it begins to process the file (the "Processing uploaded PDF..." spinner appears).
Additional context
The error occurs within the create_vector_db function in src/app/main.py, specifically when loader.load() is called. The default loader, UnstructuredPDFLoader, seems to have underlying dependency issues on Windows that lead to a memory access violation.
Commenting out the pdfplumber section (used for rendering page images) did not solve the issue, isolating the problem to the PDF loading and processing part handled by UnstructuredPDFLoader.
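One way to confirm the isolation described above is to run each loader in a separate Python process, outside Streamlit, and compare exit codes. The sketch below is only an assumption about how that check could look: `run_isolated`, `try_loader`, and `LOADER_SNIPPET` are hypothetical helpers, not code from the project. A child process is used so that a native crash in the loader does not take down the calling script.

```python
import subprocess
import sys

# Hypothetical snippet run in the child process: import the named loader from
# langchain_community, load the PDF given as argv[1], and print the page count.
LOADER_SNIPPET = (
    "import sys\n"
    "from langchain_community.document_loaders import {loader}\n"
    "docs = {loader}(sys.argv[1]).load()\n"
    "print(len(docs))\n"
)


def run_isolated(code: str, *args: str) -> int:
    """Run `code` in a fresh interpreter and return that process's exit code."""
    completed = subprocess.run([sys.executable, "-c", code, *args])
    return completed.returncode


def try_loader(loader_name: str, pdf_path: str) -> int:
    """Load `pdf_path` with the named loader in a child process."""
    return run_isolated(LOADER_SNIPPET.format(loader=loader_name), pdf_path)


def compare_loaders(pdf_path: str) -> dict:
    """Return the exit code observed for each of the two loaders."""
    names = ("UnstructuredPDFLoader", "PyMuPDFLoader")
    return {name: try_loader(name, pdf_path) for name in names}
```

Calling `compare_loaders` with the path to any local PDF should show whether only the `UnstructuredPDFLoader` run ends with the exit code reported in this issue.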
Proposed Solution
The issue can be resolved by replacing UnstructuredPDFLoader with PyMuPDFLoader, which appears to be more stable on Windows.
The required changes in src/app/main.py are:
- Change the import statement:
- From:
from langchain_community.document_loaders import UnstructuredPDFLoader
- To:
from langchain_community.document_loaders import PyMuPDFLoader
- Update the loader instantiation within the create_vector_db function:
- From:
loader = UnstructuredPDFLoader(path)
- To:
loader = PyMuPDFLoader(path)
This change resolves the crash and allows the application to process PDFs correctly on Windows. PyMuPDFLoader depends on the pymupdf package, so it must be installed (for example, with pip install pymupdf) for this solution to work.
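If maintainers would rather keep UnstructuredPDFLoader on other platforms, a platform-conditional variant is also possible. This is a sketch of an alternative and not the change proposed above: `pick_loader_name` and `make_loader` are hypothetical helpers, and the imports are resolved lazily so that only the selected loader's dependencies are needed at call time.

```python
import importlib
import sys


def pick_loader_name(platform: str = sys.platform) -> str:
    """Choose a loader class name based on the platform string."""
    # sys.platform is "win32" on Windows, including 64-bit builds.
    if platform.startswith("win"):
        return "PyMuPDFLoader"
    return "UnstructuredPDFLoader"


def make_loader(path: str, platform: str = sys.platform):
    """Instantiate the selected loader for `path` (hypothetical helper)."""
    # Import lazily so this module loads even if langchain_community is absent.
    module = importlib.import_module("langchain_community.document_loaders")
    loader_cls = getattr(module, pick_loader_name(platform))
    return loader_cls(path)
```

Inside create_vector_db, the instantiation would then read `loader = make_loader(path)`. Replacing the loader unconditionally, as proposed above, is simpler and keeps behavior identical across platforms.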
Thank you for this great project! I hope this report helps improve its cross-platform compatibility.