Conversation

@aberenguel
Collaborator

Solves #2555

@lfcnassif
Member

lfcnassif commented Jun 10, 2025

Thank you @aberenguel!

I took a quick look at the code. I haven't tested it yet, but I have just two questions:

  1. When opening the case from another machine, are OCR results reused by TextViewer?
  2. When creating a report from another machine, are OCR results reused by processing?

@lfcnassif
Member

lfcnassif commented Jun 10, 2025

I suggest replacing the temporary folder cache with a case folder cache option, since the first will be case-only anyway and the second can solve the two questions above. By the way, the global option is very interesting, but I think it should also address at least question 1.

@aberenguel
Collaborator Author

  1. When opening the case from another machine, are OCR results reused by TextViewer?
    With the current implementation, no.
  2. When creating a report from another machine, are OCR results reused by processing?
    With the current implementation, no.

I have been doing some tests to address the questions above.
If we want the benefits of a global cache and, at the same time, to address those requirements when opening the case, I see two possibilities:

  1. Add a second cache in OCRParser to store OCR results internally (in the case/iped/text folder)
  2. Change the code to store OCR results in .txt files inside the case/iped/text folder. Apparently this was implemented before.

In my preliminary test implementations, the second option seems much simpler.
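Both options above amount to a content-keyed OCR store living in the case folder, so opening the case elsewhere finds the results. As a minimal illustrative sketch (in Python with the stdlib sqlite3 module; IPED itself is Java, and the function and table names here are hypothetical, not IPED's):

```python
import hashlib
import sqlite3

def open_ocr_store(path):
    # One SQLite file per case; WAL mode is friendlier to concurrent
    # readers and to recovery after a crashed writer.
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE IF NOT EXISTS ocr (hash TEXT PRIMARY KEY, text TEXT)")
    return conn

def get_or_compute(conn, image_bytes, ocr_fn):
    # Key by content hash so identical images across cases share results.
    key = hashlib.sha256(image_bytes).hexdigest()
    row = conn.execute("SELECT text FROM ocr WHERE hash = ?", (key,)).fetchone()
    if row is not None:
        return row[0]           # cache hit: OCR result reused
    text = ocr_fn(image_bytes)  # cache miss: run the OCR engine
    conn.execute("INSERT OR REPLACE INTO ocr VALUES (?, ?)", (key, text))
    conn.commit()
    return text
```

With the store inside the case folder, TextViewer and report generation on another machine would hit the same rows instead of re-running OCR.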

@wladimirleite
Member

  1. When opening the case from another machine, are OCR results reused by TextViewer?
  2. When creating a report from another machine, are OCR results reused by processing?

@lfcnassif, that wouldn't be an issue with audio transcription, would it?

Another approach could be storing OCR text in a metadata field (like "ocrText"), similar to how it's done for audio transcription. Would that prevent the problems you mentioned?

Some users have already questioned why both extracted text and OCR output are presented as "parsed text." The latter isn't a direct extraction and can contain errors, which is why the audio transcription approach seems more logical to me.

@aberenguel
Collaborator Author

Taking advantage of this discussion: what is the reason for using OCR as a parser and not as a task?

@lfcnassif
Member

lfcnassif commented Jun 10, 2025

Add a second cache in OCRParser to store OCR results internally (folder case/iped/text)

I also thought about this solution.

Change the code to store OCR results in .txt files inside the case/iped/text folder. Apparently this was implemented before.

I don't like the idea of again generating tons of small files that take a long time to copy or delete and waste physical disk space; that's why we changed the code to store results in a database.

@lfcnassif, that wouldn't be an issue with audio transcription, would it?

No, because we store it as metadata.

Another approach could be storing OCR text in a metadata field (like "ocrText"), similar to how it's done for audio transcription. Would that prevent the problems you mentioned?

Yes, it would prevent them, but I'm not sure it is the right path from a semantic perspective. OCR text seems more related to file content than to metadata; it can be very large for PDFs with hundreds of pages, and storing a large text as metadata may not be good. I admit I have already considered it, but I'm not convinced. I also wouldn't like to duplicate results...

I put audio transcription results in metadata for the sake of simplicity.

Taking advantage of this discussion: what is the reason for using OCR as a parser and not as a task?

A long time ago we had just parsers; Tasks didn't exist. I have also thought about converting it to a Task, but an OCRParser allows OCRing embedded images even if they are not expanded/extracted, like images inside DOCs, PPTs, PDFs, etc., and that seems good to me.

@lfcnassif
Member

lfcnassif commented Jun 10, 2025

Also, see this discussion about metadata vs. properties: #1195

@wladimirleite
Member

wladimirleite commented Jun 10, 2025

(...) it can be very big for PDFs with hundreds of pages and storing a large text as metadata may not be good.

Some information about this, comparing audio transcription text and OCR'ed text, from all cases processed this year available in our network share with OCR and audio transcription enabled (~100 cases):

| | OCR | Audio Transcription |
| --- | --- | --- |
| Items with empty text per case | 298,819 | 17,979 |
| Items with non-empty text per case | 61,638 | 11,431 |
| Average characters per case | 18,579,966 | 4,486,644 |
| Average characters per item (non-empty only) | 301 | 392 |
| Max. characters per case | 221,046,819 | 91,433,333 |
| Max. characters per item | 4,327,819 | 121,997 |

Indeed there are more OCR'ed characters, around 4x the number of transcribed characters.
There was a huge item with ~4 million OCR'ed characters, but on average (considering non-empty texts only) audios have a bit more characters than OCR'ed items (392 vs. 301).

I also wouldn't like to duplicate results...

Why would it be duplicated?

@lfcnassif
Member

I also wouldn't like to duplicate results...

Why would it be duplicated?

At one point I also thought about saving OCR both as text content and as metadata, but I don't like the idea.

@aberenguel
Collaborator Author

I tried several approaches to store OCR results.
I experimented with EhCache, MapDB, and H2 MVStore. All of them presented issues when enableExternalParsing = true was set, due to multiple processes accessing the store file simultaneously. They also had problems when the processing crashed, requiring integrity validations to be skipped.

As a result, I reverted to using SQLite for storing OCR results, as it appears to be more robust for this purpose.

Regarding caching with EhCache, I enabled persistent caching only in the main processing instance, since the cache cannot be opened by multiple processes. If the cache becomes corrupted, it is deleted and recreated. In all other cases (App, ForkServer), only in-memory caching was enabled.
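The selection logic described above (persistent cache only in the main instance, in-memory elsewhere, delete-and-recreate on corruption) can be sketched roughly like this; this is a Python illustration with hypothetical names, not IPED's actual Java/EhCache code:

```python
import os
import shutil

def open_cache(role, cache_dir, open_persistent, open_in_memory):
    # Only the main processing instance may hold the persistent cache,
    # since it cannot be opened by multiple processes at once.
    if role != "main":
        return open_in_memory()   # App, ForkServer: in-memory only
    try:
        return open_persistent(cache_dir)
    except Exception:
        # Corrupted cache: delete it and recreate from scratch.
        shutil.rmtree(cache_dir, ignore_errors=True)
        os.makedirs(cache_dir, exist_ok=True)
        return open_persistent(cache_dir)
```

The key property is that a corrupted on-disk cache never aborts processing; the worst case is losing cached results and re-running OCR.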

@lfcnassif
Member

lfcnassif commented Jun 12, 2025

I tried several approaches to store OCR results.
I experimented with EhCache, MapDB, H2 MVStore. All of them presented issues when enableExternalParsing = true was set, due to multiple processes accessing the store file simultaneously. They also encountered problems when the processing crashed, requiring integrity validations to be skipped.

Thanks @aberenguel for all your tests; enableExternalParsing = true and --continue are important scenarios that must keep working.

@aberenguel
Collaborator Author

aberenguel commented Jun 13, 2025

@lfcnassif, I added support for Redis as an optional fallback cache loader. It solves the case where external parsers are enabled and still need a persistent cache. See CacheConfig.txt for how it works.
An interesting scenario is that we can install Redis on a server and share the cache with all users in the office.
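The fallback-loader idea is a two-tier read-through lookup: local cache first, then the shared Redis instance, then compute. A minimal sketch (Python, with a plain dict-backed class standing in for a Redis client so the example is self-contained; all names here are hypothetical):

```python
class DictBackend:
    """Stand-in for a Redis client (same get/set shape as redis-py strings)."""
    def __init__(self):
        self.d = {}
    def get(self, k):
        return self.d.get(k)
    def set(self, k, v):
        self.d[k] = v

def cached_ocr(key, local, remote, compute):
    if key in local:
        return local[key]        # fast path: this process already has it
    text = remote.get(key)       # shared office-wide cache hit?
    if text is None:
        text = compute(key)      # miss everywhere: run OCR
        remote.set(key, text)    # publish so other users can reuse it
    local[key] = text
    return text
```

With a real redis-py client in place of DictBackend, a second workstation in the office would get remote hits for everything the first one already OCR'ed.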

About the disk cache stored in the case folder, I'm looking for a way to obtain the case folder value inside CacheConfig.java. Do you have any idea?

@aberenguel
Collaborator Author

I added support for this cache implementation in FacesRecognitionTask.

@lfcnassif
Member

About the disk cache stored in the case folder, I'm looking for a way to obtain the case folder value inside CacheConfig.java. Do you have any idea?

Sorry for my delay, @aberenguel; it seems you've already found a solution.
