Problem Statement
Currently, it is not possible to search for text inside PDFs and images.
Proposed Solution
Text from PDFs and images would need to be extracted and stored somewhere for convenient retrieval.
If I am not mistaken, Nextcloud uses Elasticsearch, which offers an experimental Rust client and could be used for that here as well.
For extracting the text from PDFs, I think there are two options: reading the text directly when the PDF is computer-generated and already contains readable text, or using OCR when it does not (scanned documents and images).
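The choice between the two extraction paths could be made per page. Below is a minimal sketch of that decision, not a definitive implementation. It assumes that some PDF library has already tried to read the embedded text and that an OCR step is available as a closure; the names `page_text`, `TextSource`, and the threshold `MIN_EMBEDDED_CHARS` are hypothetical and only serve the illustration.

```rust
/// Where the text of a page came from.
#[derive(Debug, PartialEq)]
enum TextSource {
    /// Text layer that was already present in the PDF.
    Embedded,
    /// Text recognized from a rendered image of the page.
    Ocr,
}

/// Hypothetical threshold: pages with fewer visible characters than this
/// are treated as scanned and sent to OCR. The value is an assumption.
const MIN_EMBEDDED_CHARS: usize = 16;

/// Returns the embedded text if it looks usable, otherwise runs the OCR
/// fallback. `run_ocr` is only called when it is actually needed.
fn page_text<F>(embedded: Option<&str>, run_ocr: F) -> (String, TextSource)
where
    F: FnOnce() -> String,
{
    if let Some(text) = embedded {
        // Count non-whitespace characters so blank text layers do not pass.
        let visible = text.chars().filter(|c| !c.is_whitespace()).count();
        if visible >= MIN_EMBEDDED_CHARS {
            return (text.to_string(), TextSource::Embedded);
        }
    }
    (run_ocr(), TextSource::Ocr)
}
```

Keeping the OCR step behind a closure means the expensive path runs only for pages without a usable text layer, and the OCR backend can be swapped without touching this logic.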
At first glance, this project looks like it could be helpful. Although it cannot extract text directly from a PDF, converting the PDF pages to PNG first could be a workaround.
User Impact
The user could search for keywords and find all documents they appear in, as well as the page and location of each match.
An important question is how to structure the data and how to store, retrieve, and display it efficiently.
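One possible shape for that data is an inverted index that maps each word to the places where it occurs. The sketch below uses an in-memory `HashMap` purely as a stand-in for whatever store is chosen in the end (for example Elasticsearch); the type names `Hit` and `TextIndex` and all field names are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// One occurrence of a word. All fields are illustrative assumptions.
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    document_id: u64,
    page: u32,
    /// Bounding box on the page: (x, y, width, height).
    bbox: (f32, f32, f32, f32),
}

/// In-memory inverted index: lowercased word -> all of its occurrences.
#[derive(Default)]
struct TextIndex {
    postings: HashMap<String, Vec<Hit>>,
}

impl TextIndex {
    /// Records one occurrence of `word`; matching is case-insensitive.
    fn add_word(&mut self, word: &str, hit: Hit) {
        self.postings
            .entry(word.to_lowercase())
            .or_default()
            .push(hit);
    }

    /// Returns every occurrence of `keyword`, or an empty slice if none.
    fn search(&self, keyword: &str) -> &[Hit] {
        match self.postings.get(&keyword.to_lowercase()) {
            Some(hits) => hits.as_slice(),
            None => &[],
        }
    }
}
```

Storing the page and bounding box with every occurrence is what would let the UI jump to the right page and highlight the match, at the cost of a larger index than a plain word-to-document mapping.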
I could take a deeper look into actually implementing this, though time-wise it would probably be a while before I get to it.