
[FEATURE] OCR and Full-Text-Search #446

Description

@user12257

Problem Statement

Currently, it is not possible to search for text inside PDFs and images.

Proposed Solution

Text would need to be extracted from PDFs and images and stored somewhere for convenient retrieval.

If I am not mistaken, Nextcloud uses Elasticsearch for this. Elasticsearch offers an experimental Rust client and could be used here as well.

Extracting the text from PDFs should be doable in one of two ways: either by reading the text directly from a computer-generated PDF that already contains readable text, or by using OCR.
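
The two extraction paths could be combined as a simple fallback. The following is only a sketch: `page_text`, `TextSource`, and the two extractor callbacks are hypothetical names, and a real implementation would back the callbacks with a PDF parser and an OCR engine of the project's choosing.

```rust
/// Where the text of a page came from.
#[derive(Debug, PartialEq)]
enum TextSource {
    Embedded,
    Ocr,
}

/// Returns the text for one page, preferring the embedded text layer.
///
/// `embedded` and `ocr` are hypothetical extractor callbacks (assumptions,
/// not part of any existing API).
fn page_text<E, O>(page: &[u8], embedded: E, ocr: O) -> Option<(TextSource, String)>
where
    E: Fn(&[u8]) -> Option<String>,
    O: Fn(&[u8]) -> Option<String>,
{
    // A computer-generated PDF usually carries a usable text layer.
    if let Some(text) = embedded(page) {
        if !text.trim().is_empty() {
            return Some((TextSource::Embedded, text));
        }
    }
    // Scanned pages and images have no text layer, so fall back to OCR.
    ocr(page)
        .filter(|text| !text.trim().is_empty())
        .map(|text| (TextSource::Ocr, text))
}
```

Trying the embedded text first avoids running OCR on documents that do not need it, which is typically the more expensive step.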

At first glance, this project looks like it could be helpful. Although it cannot extract text directly from PDFs, converting the PDF to PNG first could be a workaround.
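
One possible way to do the PDF-to-PNG step is to shell out to an external rasterizer. The sketch below assumes Poppler's `pdftoppm` tool is installed and on the PATH, which is my assumption and not something the project above requires. `rasterize_command` is a hypothetical helper name.

```rust
use std::path::Path;
use std::process::Command;

/// Builds the command that rasterizes each page of `pdf` into PNG files
/// whose names start with `out_prefix` followed by a page number.
/// Assumes Poppler's `pdftoppm` is available on PATH.
fn rasterize_command(pdf: &Path, out_prefix: &Path, dpi: u32) -> Command {
    let mut cmd = Command::new("pdftoppm");
    cmd.arg("-png") // write PNG instead of the default PPM
        .arg("-r")
        .arg(dpi.to_string()) // resolution in DPI
        .arg(pdf)
        .arg(out_prefix);
    cmd
}
```

Calling `.status()` on the returned command would run the conversion. The resulting PNG files could then be passed to the OCR step one page at a time, which also keeps track of the page number for each piece of text.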

User Impact

The user could search for keywords and find all documents they appear in, as well as the page and location of each match.

An important question is how to structure the data and how to store, retrieve, and display it efficiently.
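
As a starting point for that discussion, here is a minimal in-memory sketch of one possible structure: an inverted index from keyword to every place it occurs. All names (`Hit`, `TextIndex`, the pixel-based bounding box) are hypothetical, and a search backend such as Elasticsearch would replace the `HashMap` in practice.

```rust
use std::collections::HashMap;

/// One occurrence of a word: which document, which page, and where on the page.
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    doc_id: u64,
    page: u32,
    /// Bounding box in page pixels: (x, y, width, height).
    bbox: (u32, u32, u32, u32),
}

/// Inverted index: lowercased word -> all places it occurs.
#[derive(Default)]
struct TextIndex {
    postings: HashMap<String, Vec<Hit>>,
}

impl TextIndex {
    /// Records one recognized word together with its position.
    fn add_word(&mut self, word: &str, hit: Hit) {
        self.postings.entry(word.to_lowercase()).or_default().push(hit);
    }

    /// Returns every occurrence of `keyword`, ignoring case.
    fn search(&self, keyword: &str) -> &[Hit] {
        self.postings
            .get(&keyword.to_lowercase())
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }
}
```

Storing the page and bounding box with each hit is what would allow the UI to show not only which documents match but also where the match sits on the page.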

I could take a deeper look into actually implementing this, though, time-wise, it would probably be a while before I get to it.

Labels: enhancement (New feature or request)
