#2555: Use EhCache for heap and persistent cache #2556
Thank you @aberenguel! I took a quick look at the code and haven't tested it yet, but I have just 2 questions:
|
|
I suggest replacing the temporary folder cache with a case folder cache option, since the former will be case-only anyway and the latter can solve the 2 questions above. By the way, the global option is very interesting, but I think it should also address at least question 1. |
I have been doing some tests in order to address the questions above.
In my provisional test implementations, the second option seems to be much simpler. |
@lfcnassif, that wouldn't be an issue with audio transcription, would it? Another approach could be storing OCR text in a metadata field (like "ocrText"), similar to how it's done for audio transcription. Would that prevent the problems you mentioned? Some users have already questioned why both extracted text and OCR output are presented as "parsed text." The latter isn't a direct extraction and can contain errors, which is why the audio transcription approach seems more logical to me. |
|
Taking advantage of the question, what is the reason for using OCR as a parser and not as a task? |
I also thought about this solution.
I don't like the idea of once again generating tons of small files that take a long time to copy or delete and waste physical disk space; that's why we changed the code to store results in a database.
No, because we store it as metadata.
Yes, it would prevent them, but I'm not sure if it is the right path from a semantic perspective. OCR text seems to me more related to file content than to metadata. It can be very big for PDFs with hundreds of pages, and storing a large text as metadata may not be good. I admit I have already thought about it, but I'm not sure if it is the right path. I also wouldn't like to duplicate results... I put audio transcription results in metadata for the sake of simplicity.
A long time ago, we had just parsers and Tasks didn't exist. I have also thought about converting it to a Task, but an OCRParser allows us to OCR embedded images even if they are not expanded/extracted, like images inside DOCs, PPTs, PDFs, etc., and that seems good to me. |
|
Also, see this discussion about metadata vs. properties: #1195 |
Some information about this, comparing audio transcription text and OCR'ed text, from all cases processed this year available in our network share, with OCR and audio transcription enabled (~100 cases):
Indeed there are more OCR'ed characters, around 4x the number of transcribed characters.
Why would it be duplicated? |
At one point I also considered saving OCR both as text content and as metadata, but I don't like the idea. |
|
I tried several approaches to store OCR results. As a result, I reverted to using SQLite for storing OCR results, as it appears to be more robust for this purpose. Regarding caching with EhCache, I enabled persistent caching only in the main processing instance, since the cache cannot be opened by multiple processes. If the cache becomes corrupted, it is deleted and recreated. In all other cases (App, ForkServer), only in-memory caching was enabled. |
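For readers following along, the delete-and-recreate behaviour described above could look roughly like the sketch below. It uses only the JDK so that it stays self-contained. `CacheBootstrap`, `StoreOpener` and `openOrRecreate` are hypothetical names, not code from this PR, and the opener is a stand-in for whatever actually builds the EhCache persistent store.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper; names are illustrative and not taken from the PR. */
final class CacheBootstrap {

    /** Opens a persistent store rooted at a directory; throws if the store is unreadable. */
    interface StoreOpener<T> {
        T open(Path dir) throws IOException;
    }

    /**
     * Tries to open the persistent store. If opening fails (for example because
     * the on-disk data is corrupted), the directory is wiped and opening is
     * retried exactly once. A second failure propagates to the caller.
     */
    static <T> T openOrRecreate(Path dir, StoreOpener<T> opener) throws IOException {
        try {
            return opener.open(dir);
        } catch (IOException | RuntimeException firstFailure) {
            deleteRecursively(dir);
            Files.createDirectories(dir);
            return opener.open(dir);
        }
    }

    /** Deletes a directory tree, children first. Does nothing if the path is absent. */
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> deepestFirst;
        try (Stream<Path> paths = Files.walk(root)) {
            // Reverse lexicographic order puts children before their parent directory.
            deepestFirst = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : deepestFirst) {
            Files.delete(p);
        }
    }
}
```

Under this assumption, only the main processing instance would call `openOrRecreate`; the other processes (App, ForkServer) would skip it and build a heap-only cache.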
Thanks @aberenguel for all your tests, |
|
@lfcnassif, I added support for Redis as an optional fallback cache loader. It solved the case where external parsers are enabled and need to use a persistent cache. You can see
About the disk cache stored in the case folder, I'm looking for a way to obtain the case folder value inside CacheConfig.java. Do you have any idea? |
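To make the fallback-loader idea concrete, here is a minimal sketch of a two-level lookup: a per-process in-memory map first, then an optional external store, then the actual computation. It is not the PR's implementation. `FallbackCache` and `ExternalStore` are hypothetical names, and `ExternalStore` only stands in for a Redis client so the example runs on the JDK alone.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical two-level lookup; the external store stands in for something like Redis. */
final class FallbackCache<K, V> {

    /** Minimal view of a shared store; a real one would wrap a Redis client. */
    interface ExternalStore<K, V> {
        Optional<V> get(K key);
        void put(K key, V value);
    }

    private final Map<K, V> local = new ConcurrentHashMap<>();
    private final ExternalStore<K, V> fallback; // null when no external store is configured

    FallbackCache(ExternalStore<K, V> fallback) {
        this.fallback = fallback;
    }

    /** Looks in the local map, then the external store, and only then computes the value. */
    V get(K key, Function<K, V> compute) {
        V cached = local.get(key);
        if (cached != null) {
            return cached;
        }
        V value = (fallback != null) ? fallback.get(key).orElse(null) : null;
        if (value == null) {
            value = Objects.requireNonNull(compute.apply(key), "compute returned null");
            if (fallback != null) {
                fallback.put(key, value); // share the result with other processes
            }
        }
        local.put(key, value);
        return value;
    }
}
```

The point of the design is that separate processes, such as external parsers, each hold their own `FallbackCache` but can still reuse results through the shared store, without opening the same on-disk cache concurrently.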
|
I added support for this cache implementation in FacesRecognitionTask. |
Sorry for my delay, @aberenguel, it seems you've already found a solution. |
Solves #2555