Conversation

@aberenguel
Collaborator

Solves #2555

@lfcnassif
Member

lfcnassif commented Jun 10, 2025

Thank you @aberenguel!

I took a quick look at the code. I haven't tested it yet, but I have just two questions:

  1. When opening the case from another machine, are OCR results reused by TextViewer?
  2. When creating a report from another machine, are OCR results reused by processing?

@lfcnassif
Member

lfcnassif commented Jun 10, 2025

I suggest replacing the temporary folder cache with a case folder cache option, since the first will be case-only anyway and the second can solve the two questions above. By the way, the global option is very interesting, but I think it should also address at least question 1.

@aberenguel
Collaborator Author

  1. When opening the case from another machine, are OCR results reused by TextViewer?
    With the current implementation, no.
  2. When creating a report from another machine, are OCR results reused by processing?
    With the current implementation, no.

I have been doing some tests to address the questions above.
If we want the benefits of a global cache and, at the same time, to address those requirements when opening the case, I see two possibilities:

  1. Add a second cache in OCRParser to store OCR results internally (in the case/iped/text folder)
  2. Change the code to store OCR results in .txt files inside the case/iped/text folder. Apparently this was implemented before.

In my preliminary test implementations, the second option seems much simpler.
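Both options above amount to a content-keyed OCR store living in the case folder, so opening the case elsewhere finds the results. As a minimal illustrative sketch (in Python with the stdlib sqlite3 module; IPED itself is Java, and the function and table names here are hypothetical, not IPED's):

```python
import hashlib
import sqlite3

def open_ocr_store(path):
    # One SQLite file per case; WAL mode is friendlier to concurrent
    # readers and to recovery after a crashed writer.
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE IF NOT EXISTS ocr (hash TEXT PRIMARY KEY, text TEXT)")
    return conn

def get_or_compute(conn, image_bytes, ocr_fn):
    # Key by content hash so identical images across cases share results.
    key = hashlib.sha256(image_bytes).hexdigest()
    row = conn.execute("SELECT text FROM ocr WHERE hash = ?", (key,)).fetchone()
    if row is not None:
        return row[0]           # cache hit: OCR result reused
    text = ocr_fn(image_bytes)  # cache miss: run the OCR engine
    conn.execute("INSERT OR REPLACE INTO ocr VALUES (?, ?)", (key, text))
    conn.commit()
    return text
```

With the store inside the case folder, TextViewer and report generation on another machine would hit the same rows instead of re-running OCR.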

@wladimirleite
Member

  1. When opening the case from another machine, are OCR results reused by TextViewer?
  2. When creating a report from another machine, are OCR results reused by processing?

@lfcnassif, that wouldn't be an issue with audio transcription, would it?

Another approach could be storing OCR text in a metadata field (like "ocrText"), similar to how it's done for audio transcription. Would that prevent the problems you mentioned?

Some users have already questioned why both extracted text and OCR output are presented as "parsed text." The latter isn't a direct extraction and can contain errors, which is why the audio transcription approach seems more logical to me.

@aberenguel
Collaborator Author

Taking advantage of this discussion: what is the reason for using OCR as a parser and not as a task?

@lfcnassif
Member

lfcnassif commented Jun 10, 2025

Add a second cache in OCRParser to store OCR results internally (folder case/iped/text)

I also thought about this solution.

Change the code to store OCR results in .txt files inside the case/iped/text folder. Apparently this was implemented before.

I don't like the idea of again generating tons of small files that take a long time to copy or delete and waste physical disk space; that's why we changed the code to store results in a database.

@lfcnassif, that wouldn't be an issue with audio transcription, would it?

No, because we store it as metadata.

Another approach could be storing OCR text in a metadata field (like "ocrText"), similar to how it's done for audio transcription. Would that prevent the problems you mentioned?

Yes, it would prevent them, but I'm not sure it is the right path from a semantic perspective. OCR text seems more related to file content than to metadata; it can be very large for PDFs with hundreds of pages, and storing a large text as metadata may not be good. I admit I have already considered it, but I'm not convinced. I also wouldn't like to duplicate results...

I put audio transcription results in metadata for the sake of simplicity.

Taking advantage of this discussion: what is the reason for using OCR as a parser and not as a task?

A long time ago we had just parsers; Tasks didn't exist. I have also thought about converting it to a Task, but an OCRParser allows OCRing embedded images even if they are not expanded/extracted, like images inside DOCs, PPTs, PDFs, etc., and that seems good to me.

@lfcnassif
Member

lfcnassif commented Jun 10, 2025

Also, see this discussion about metadata vs. properties: #1195

@wladimirleite
Member

wladimirleite commented Jun 10, 2025

(...) it can be very big for PDFs with hundreds of pages and storing a large text as metadata may not be good.

Some information about this, comparing audio transcription text and OCR'ed text, from all cases processed this year available in our network share with OCR and audio transcription enabled (~100 cases):

| | OCR | Audio Transcription |
| --- | --- | --- |
| Items with empty text per case | 298,819 | 17,979 |
| Items with non-empty text per case | 61,638 | 11,431 |
| Average characters per case | 18,579,966 | 4,486,644 |
| Average characters per item (non-empty only) | 301 | 392 |
| Max. characters per case | 221,046,819 | 91,433,333 |
| Max. characters per item | 4,327,819 | 121,997 |

Indeed there are more OCR'ed characters, around 4x the number of transcribed characters.
There was a huge item with ~4 million OCR'ed characters, but on average (considering non-empty texts only) audios have a bit more characters than OCR'ed items (392 vs. 301).

I also wouldn't like to duplicate results...

Why would it be duplicated?

@lfcnassif
Member

I also wouldn't like to duplicate results...

Why would it be duplicated?

At one point I also thought about saving OCR both as text content and as metadata, but I don't like the idea.

@aberenguel
Collaborator Author

I tried several approaches to store OCR results.
I experimented with EhCache, MapDB, and H2 MVStore. All of them presented issues when enableExternalParsing = true was set, due to multiple processes accessing the store file simultaneously. They also had problems when the processing crashed, requiring integrity validations to be skipped.

As a result, I reverted to using SQLite for storing OCR results, as it appears to be more robust for this purpose.

Regarding caching with EhCache, I enabled persistent caching only in the main processing instance, since the cache cannot be opened by multiple processes. If the cache becomes corrupted, it is deleted and recreated. In all other cases (App, ForkServer), only in-memory caching was enabled.
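The selection logic described above (persistent cache only in the main instance, in-memory elsewhere, delete-and-recreate on corruption) can be sketched roughly like this; this is a Python illustration with hypothetical names, not IPED's actual Java/EhCache code:

```python
import os
import shutil

def open_cache(role, cache_dir, open_persistent, open_in_memory):
    # Only the main processing instance may hold the persistent cache,
    # since it cannot be opened by multiple processes at once.
    if role != "main":
        return open_in_memory()   # App, ForkServer: in-memory only
    try:
        return open_persistent(cache_dir)
    except Exception:
        # Corrupted cache: delete it and recreate from scratch.
        shutil.rmtree(cache_dir, ignore_errors=True)
        os.makedirs(cache_dir, exist_ok=True)
        return open_persistent(cache_dir)
```

The key property is that a corrupted on-disk cache never aborts processing; the worst case is losing cached results and re-running OCR.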

@lfcnassif
Member

lfcnassif commented Jun 12, 2025

I tried several approaches to store OCR results.
I experimented with EhCache, MapDB, H2 MVStore. All of them presented issues when enableExternalParsing = true was set, due to multiple processes accessing the store file simultaneously. They also encountered problems when the processing crashed, requiring integrity validations to be skipped.

Thanks @aberenguel for all your tests; enableExternalParsing = true and --continue are important scenarios that must keep working.

@aberenguel
Collaborator Author

aberenguel commented Jun 13, 2025

@lfcnassif, I added support for Redis as an optional fallback cache loader. It solves the case where external parsers are enabled and still need a persistent cache. See CacheConfig.txt for how it works.
An interesting scenario is that we can install Redis on a server and share the cache with all users in the office.
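The fallback-loader idea is a two-tier read-through lookup: local cache first, then the shared Redis instance, then compute. A minimal sketch (Python, with a plain dict-backed class standing in for a Redis client so the example is self-contained; all names here are hypothetical):

```python
class DictBackend:
    """Stand-in for a Redis client (same get/set shape as redis-py strings)."""
    def __init__(self):
        self.d = {}
    def get(self, k):
        return self.d.get(k)
    def set(self, k, v):
        self.d[k] = v

def cached_ocr(key, local, remote, compute):
    if key in local:
        return local[key]        # fast path: this process already has it
    text = remote.get(key)       # shared office-wide cache hit?
    if text is None:
        text = compute(key)      # miss everywhere: run OCR
        remote.set(key, text)    # publish so other users can reuse it
    local[key] = text
    return text
```

With a real redis-py client in place of DictBackend, a second workstation in the office would get remote hits for everything the first one already OCR'ed.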

About the disk cache stored in the case folder, I'm looking for a way to obtain the case folder value inside CacheConfig.java. Do you have any idea?

@aberenguel
Collaborator Author

I added support for this cache implementation in FacesRecognitionTask.

@lfcnassif
Member

About the disk cache stored in the case folder, I'm looking for a way to obtain the case folder value inside CacheConfig.java. Do you have any idea?

Sorry for my delay, @aberenguel; it seems you've already found a solution.
