Add support for password protected documents #2241
base: master
This also changes the way the Tika parser is instantiated: it is no longer a static class.
This is a WIP, as I'd like to add support for multiple passwords so we can try several candidates, brute-force style, when a directory contains many files with different passwords.
I think we should implement a `PasswordProvider` interface which could get the password from many possible providers. The idea is to define `PasswordProvider#getPassword(String path)`, which is responsible for providing the password for a given file. The simplest one would be `MemoryPasswordProvider`. The easiest one would be `DiskPasswordProvider`. And maybe an `ElasticsearchPasswordProvider`.
Related to #1916.
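A minimal sketch of what the proposed interface and its in-memory variant could look like. The PR only names `PasswordProvider#getPassword(String path)` and `MemoryPasswordProvider`; the `Optional<String>` return type, the map-based constructor, and the example paths are assumptions made for illustration, not the author's design.

```java
import java.util.Map;
import java.util.Optional;

// Proposed in the PR description: resolve the password for a given file path.
// The Optional return type is an assumption; the PR does not specify one.
interface PasswordProvider {
    Optional<String> getPassword(String path);
}

// Hypothetical in-memory variant: passwords are supplied up front as path -> password.
final class MemoryPasswordProvider implements PasswordProvider {
    private final Map<String, String> passwordsByPath;

    MemoryPasswordProvider(Map<String, String> passwordsByPath) {
        // Defensive, immutable copy so later changes by the caller have no effect.
        this.passwordsByPath = Map.copyOf(passwordsByPath);
    }

    @Override
    public Optional<String> getPassword(String path) {
        // Empty when no password is registered for this path.
        return Optional.ofNullable(passwordsByPath.get(path));
    }
}

final class PasswordProviderSketch {
    public static void main(String[] args) {
        PasswordProvider provider =
                new MemoryPasswordProvider(Map.of("/docs/report.pdf", "s3cret"));
        System.out.println(provider.getPassword("/docs/report.pdf").isPresent());
        System.out.println(provider.getPassword("/docs/other.pdf").isPresent());
    }
}
```

A `DiskPasswordProvider` or `ElasticsearchPasswordProvider` would implement the same single method and differ only in where the lookup happens.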
WriteOutContentHandler handler = new WriteOutContentHandler(indexedChars);
try (stream) {
    // Set the password if any
    context.set(PasswordProvider.class, new StandardPasswordProvider(password));
Bug: PasswordProvider persists across multiple document parsing operations
The ParseContext is created once in the TikaInstance constructor and reused across all parsing operations. When a password is provided, the PasswordProvider is set in this shared context but never removed after parsing. Subsequent parsing operations on the same TikaDocParser instance therefore still see the previous password, so a later encrypted document may be opened with a password that was never supplied for it.
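The leak described above can be reproduced without Tika. In the sketch below, `SketchContext` and `SketchPasswordProvider` are hypothetical stand-ins for Tika's `ParseContext` and the password provider stored in it, and the parsers only report which password they would use. It assumes the provider is set only when a password is supplied; one possible fix, shown in `PerCallContextParser`, is to build a fresh context for each parse.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Tika's ParseContext: a typed key/value bag.
final class SketchContext {
    private final Map<Class<?>, Object> values = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        values.put(key, value);
    }

    <T> T get(Class<T> key) {
        return key.cast(values.get(key));
    }
}

// Hypothetical stand-in for the password provider stored in the context.
interface SketchPasswordProvider {
    String getPassword();
}

// Mirrors the reported problem: one context, created once and reused.
final class SharedContextParser {
    private final SketchContext context = new SketchContext();

    // Returns the password that would be used for this document, or null.
    String passwordUsedFor(String password) {
        if (password != null) {
            SketchPasswordProvider provider = () -> password;
            context.set(SketchPasswordProvider.class, provider);
        }
        SketchPasswordProvider current = context.get(SketchPasswordProvider.class);
        return current == null ? null : current.getPassword();
    }
}

// One possible fix: a fresh context per parse, so nothing carries over.
final class PerCallContextParser {
    String passwordUsedFor(String password) {
        SketchContext context = new SketchContext();
        if (password != null) {
            SketchPasswordProvider provider = () -> password;
            context.set(SketchPasswordProvider.class, provider);
        }
        SketchPasswordProvider current = context.get(SketchPasswordProvider.class);
        return current == null ? null : current.getPassword();
    }
}
```

With the shared context, a call without a password after a call with `"secret"` still reports `"secret"`; with the per-call context it reports `null`.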
 logger.info(" --> Launching test [{}]", currentTestName);
 currentTestResourceDir = testResourceTarget.resolve(currentTestName);
-String url = getUrl("samples", currentTestName);
+String url = getUrl("samples", sampleDirName);
Bug: Test resources directory mismatch with sample directory
In the new copyTestResources(String sampleDirName) method, the sampleDirName parameter is used to locate the source files via getUrl("samples", sampleDirName), but the destination directory is still set using currentTestName instead of sampleDirName. This causes a mismatch where test files from the sampleDirName directory are copied to a different directory named after the test method. When a subclass overrides this method to use a different sample directory (e.g., "ocr"), the resources will be copied to the wrong location.
public void copyTestResources() throws IOException {
    copyTestResources("ocr");
}
Bug: Missing @Before annotation on overridden setup method
The copyTestResources() method overrides the parent's @Before annotated method from AbstractITCase but is missing the @Before annotation. In JUnit 4, the @Before annotation is not inherited when overriding, so this method will never be called before tests run. This means the "ocr" test resources won't be copied to the test directory, causing all OCR tests to fail because they won't find the expected sample files.
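The underlying Java rule can be checked with plain reflection. In this sketch `SetupHook` is a hypothetical stand-in for `org.junit.Before`, used so the example needs no JUnit dependency, and the class names are illustrative. It shows only that a method annotation does not carry over to an overriding method unless it is repeated there.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Hypothetical stand-in for org.junit.Before.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface SetupHook {}

class BaseCase {
    @SetupHook
    public void copyTestResources() {}
}

// Overrides the hook without repeating the annotation.
class OverrideWithoutAnnotation extends BaseCase {
    @Override
    public void copyTestResources() {}
}

// Overrides the hook and repeats the annotation.
class OverrideWithAnnotation extends BaseCase {
    @SetupHook
    @Override
    public void copyTestResources() {}
}

final class AnnotationInheritanceSketch {
    // True if the copyTestResources() method resolved for this class carries @SetupHook.
    static boolean hasSetupHook(Class<?> type) {
        try {
            Method method = type.getMethod("copyTestResources");
            return method.isAnnotationPresent(SetupHook.class);
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Here `hasSetupHook(OverrideWithoutAnnotation.class)` is false while the other two classes report true, which is why repeating `@Before` on the overriding method is the usual remedy in JUnit 4.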
Note
Adds password handling for encrypted documents via REST (form/header/query) and refactors Tika parsing to an instance-based API with password support.
- Add `password` support to `POST /_document` (multipart and JSON 3rd-party), accepted via form, header, or query params.
- `DocumentApi` now uses an instance `TikaDocParser`; `enrichDoc` accepts a password and passes it to parsing.
- Refactor `TikaDocParser` and `TikaInstance` from static to instance-based; inject `FsSettings`.
- `PasswordProvider`; handle encrypted docs (no content if the password is missing or wrong).
- `FsParserAbstract` holds a `TikaDocParser` instance and uses it for extraction.
- `pdf`/`docx`; add assertions on extracted `content`.

Written by Cursor Bugbot for commit 05b66f4.