Improved performance and lower memory usage during PDF indexing #24

Merged
bclavie merged 3 commits into AnswerDotAI:main on Sep 23, 2024

Conversation

velaia
Contributor

@velaia velaia commented Sep 19, 2024

This is an updated version of PR #19. Besides running pdftoppm in parallel across CPU cores, the page images are now buffered on disk using tempfile instead of being held in memory. For large PDFs I measured significantly lower memory usage during indexing (8 GB instead of 16 GB).

More context under #19
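For readers who have not opened the diff, here is a minimal sketch of the approach described above. It is not the code from this PR: the helper names `conversion_kwargs` and `iter_page_images` are hypothetical, and deriving the thread count from `os.cpu_count()` is an assumption based on the commit message. The `thread_count`, `output_folder` and `paths_only` parameters are those of pdf2image's `convert_from_path`.

```python
import os
import tempfile
from typing import Iterator, Optional


def conversion_kwargs(output_folder: str, thread_count: Optional[int] = None) -> dict:
    """Build keyword arguments for pdf2image.convert_from_path (hypothetical helper)."""
    return {
        # Assumption: use one pdftoppm worker per CPU core instead of a fixed 4.
        "thread_count": thread_count or os.cpu_count() or 1,
        # Page images are written to this folder instead of being kept in RAM.
        "output_folder": output_folder,
        # Return file paths rather than loaded PIL images.
        "paths_only": True,
    }


def iter_page_images(pdf_path: str) -> Iterator["Image.Image"]:
    """Yield one page image at a time so only a single page is in memory."""
    # Imported lazily so the helper above works without pdf2image/poppler installed.
    from pdf2image import convert_from_path
    from PIL import Image

    with tempfile.TemporaryDirectory() as tmp_dir:
        page_paths = convert_from_path(pdf_path, **conversion_kwargs(tmp_dir))
        for page_path in page_paths:
            with Image.open(page_path) as img:
                img.load()  # read pixel data before the file handle is closed
                yield img
```

Each yielded image is only valid until the next iteration, and the temporary directory is deleted once the generator is exhausted, so a consumer should process or copy each page before advancing.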

velaia and others added 3 commits September 19, 2024 23:29
… @ 2.20GHz this way than fixed thread_count=4

Also added paths_only option to convert_from_path which can significantly reduce memory consumption for large PDFs
Contributor

@bclavie bclavie left a comment

Perfect, thank you! I was a bit out of it last week and remember seeing the previous PR and thinking the only tweak would be using a tempfile, and you'd added it before I even had time to review haha.

@bclavie bclavie merged commit 34fb0cf into AnswerDotAI:main Sep 23, 2024
1 check passed
@velaia
Contributor Author

velaia commented Sep 23, 2024

Great. Thank you! 😃
