@dadoonet dadoonet commented Nov 25, 2025

This also changes the way the Tika parser is instantiated. It is no longer a static class.

This is a WIP, as I'd like to add support for multiple passwords so we can try several options, brute-force style, in case the directory contains many files with different passwords.

I think we should implement a `PasswordProvider` interface which could get the password from several possible providers.

The idea is to define `PasswordProvider#getPassword(String path)`, which is responsible for providing the password for a given file.

The simplest one would be `MemoryPasswordProvider`.
Another easy one would be `DiskPasswordProvider`.
And maybe an `ElasticsearchPasswordProvider`.

Related to #1916.
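
To make the proposal above more concrete, here is a minimal sketch of what the interface and the in-memory provider could look like. Only the names `PasswordProvider`, `getPassword(String path)` and `MemoryPasswordProvider` come from the description. The map-backed storage and the choice to return `null` for an unknown path are assumptions made for illustration, not the PR's actual code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical interface following the signature proposed in the description. */
interface PasswordProvider {
    /** Returns the password for the given file path, or null if none is known (assumed contract). */
    String getPassword(String path);
}

/** Hypothetical in-memory provider backed by a path-to-password map. */
final class MemoryPasswordProvider implements PasswordProvider {
    private final Map<String, String> passwords;

    MemoryPasswordProvider(Map<String, String> passwords) {
        // Defensive copy so later changes to the caller's map do not affect lookups.
        this.passwords = new HashMap<>(passwords);
    }

    @Override
    public String getPassword(String path) {
        return passwords.get(path);
    }
}
```

A disk-backed or Elasticsearch-backed provider would implement the same single method, so the crawler would only depend on the interface.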


Note

Adds password handling for encrypted documents via REST (form/header/query) and refactors Tika parsing to an instance-based API with password support.

  • REST API:
    • Add password support to POST /_document (multipart and JSON 3rd-party), accepted via form, header, or query params.
    • DocumentApi now uses an instance TikaDocParser; enrichDoc accepts a password and passes it to parsing.
  • Tika Parsing:
    • Refactor TikaDocParser and TikaInstance from static to instance-based; inject FsSettings.
    • Implement password-aware parsing using Tika PasswordProvider; handle encrypted docs (no content if missing/wrong password).
    • Minor: route language detection via instance and adjust logging.
  • Core Crawler:
    • FsParserAbstract holds a TikaDocParser instance and uses it for extraction.
  • Tests:
    • Update REST and unit/integration tests to pass passwords for protected pdf/docx; add assertions on extracted content.
    • Restructure OCR ITs; utility method to copy resources by sample dir.
  • Docs:
    • REST docs: add “Document password” section with usage examples; minor formatting tweak.
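
Since the summary says the password can arrive via form, header, or query parameter, some rule has to decide which value wins when more than one is present. The sketch below shows one possible rule. `PasswordResolver` is a hypothetical helper, and the precedence order (form, then header, then query) is an assumption that the summary does not state.

```java
import java.util.Optional;
import java.util.stream.Stream;

/** Hypothetical helper; not part of the PR. */
final class PasswordResolver {
    private PasswordResolver() {}

    /**
     * Returns the first non-blank candidate, checked in the order
     * form, header, query (assumed precedence).
     */
    static Optional<String> resolve(String formValue, String headerValue, String queryValue) {
        return Stream.of(formValue, headerValue, queryValue)
                .filter(value -> value != null && !value.isBlank())
                .findFirst();
    }
}
```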

Written by Cursor Bugbot for commit 05b66f4.

@dadoonet dadoonet self-assigned this Nov 25, 2025
@dadoonet dadoonet added the `new` (For new features or options) and `component:extractor` (For Tika, XML and JSON parsers) labels Nov 25, 2025
WriteOutContentHandler handler = new WriteOutContentHandler(indexedChars);
try (stream) {
// Set the password if any
context.set(PasswordProvider.class, new StandardPasswordProvider(password));

Bug: PasswordProvider persists across multiple document parsing operations

The ParseContext is created once in the TikaInstance constructor and reused across all parsing operations. When a password is provided, the PasswordProvider is set in this shared context but never cleaned up after parsing. This means subsequent parsing operations with the same TikaDocParser instance will retain the previous password, so an encrypted document could be decrypted with a password that was never supplied for it, unintentionally bypassing its password protection.
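
The leak described here can be reproduced without Tika. The sketch below uses `SimpleContext` and `Password` as stand-ins for Tika's `ParseContext` and a password provider. They are not the real classes, and the sketch assumes the password is only set when one is supplied. It contrasts a parser that reuses one context with one that builds a fresh context per call.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal stand-in for a parse context: a typed key/value bag. Not Tika's class. */
final class SimpleContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) { entries.put(key, value); }

    <T> T get(Class<T> key) { return key.cast(entries.get(key)); }
}

/** Stand-in for the password holder stored in the context. */
final class Password {
    final String value;
    Password(String value) { this.value = value; }
}

/** Reuses one context for every call, so a password set once stays visible. */
final class SharedContextParser {
    private final SimpleContext context = new SimpleContext();

    String passwordSeenWhenParsing(String password) {
        if (password != null) {
            context.set(Password.class, new Password(password));
        }
        Password seen = context.get(Password.class);
        return seen == null ? null : seen.value;
    }
}

/** Builds a fresh context per call, so nothing carries over between documents. */
final class PerCallContextParser {
    String passwordSeenWhenParsing(String password) {
        SimpleContext context = new SimpleContext();
        if (password != null) {
            context.set(Password.class, new Password(password));
        }
        Password seen = context.get(Password.class);
        return seen == null ? null : seen.value;
    }
}
```

With the shared variant, parsing one document with a password and then another without one still exposes the first password to the second call. The per-call variant returns `null` for the second call. Clearing the entry in a `finally` block would be another way to get the same isolation.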


logger.info(" --> Launching test [{}]", currentTestName);
currentTestResourceDir = testResourceTarget.resolve(currentTestName);
- String url = getUrl("samples", currentTestName);
+ String url = getUrl("samples", sampleDirName);

Bug: Test resources directory mismatch with sample directory

In the new copyTestResources(String sampleDirName) method, the sampleDirName parameter is used to locate the source files via getUrl("samples", sampleDirName), but the destination directory is still set using currentTestName instead of sampleDirName. This causes a mismatch where test files from the sampleDirName directory are copied to a different directory named after the test method. When a subclass overrides this method to use a different sample directory (e.g., "ocr"), the resources will be copied to the wrong location.
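
One way to avoid the mismatch is to derive both the source and the destination from the same name. The sketch below is a standalone illustration under that assumption. `TestResources.copySampleDir` and its parameters are hypothetical and do not reproduce the project's `copyTestResources` or `getUrl` helpers. The PR may equally decide to keep a per-test destination and adjust the callers instead.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

/** Hypothetical helper; not the project's actual test utility. */
final class TestResources {
    private TestResources() {}

    /** Copies the regular files of samplesRoot/sampleDirName into targetRoot/sampleDirName. */
    static Path copySampleDir(Path samplesRoot, Path targetRoot, String sampleDirName) {
        // Source and destination are resolved from the same name, so they cannot drift apart.
        Path source = samplesRoot.resolve(sampleDirName);
        Path destination = targetRoot.resolve(sampleDirName);
        try {
            Files.createDirectories(destination);
            try (Stream<Path> files = Files.list(source)) {
                for (Path file : (Iterable<Path>) files::iterator) {
                    if (Files.isRegularFile(file)) {
                        Files.copy(file, destination.resolve(file.getFileName()),
                                StandardCopyOption.REPLACE_EXISTING);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return destination;
    }
}
```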


@dadoonet dadoonet linked an issue Nov 25, 2025 that may be closed by this pull request

public void copyTestResources() throws IOException {
copyTestResources("ocr");
}

Bug: Missing @Before annotation on overridden setup method

The copyTestResources() method overrides the parent's @Before annotated method from AbstractITCase but is missing the @Before annotation. In JUnit 4, the @Before annotation is not inherited when overriding, so this method will never be called before tests run. This means the "ocr" test resources won't be copied to the test directory, causing all OCR tests to fail because they won't find the expected sample files.
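
The Java-level part of this claim can be checked with reflection: an annotation on a parent method is not present on an overriding method unless it is declared again. The sketch below uses a stand-in `SetupBefore` annotation and hypothetical class names. It does not use or exercise JUnit, so it says nothing about whether a particular runner still reaches the override through the parent's annotated method.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

/** Stand-in for a setup annotation; this is NOT org.junit.Before. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface SetupBefore {}

class BaseCase {
    @SetupBefore
    public void copyTestResources() {}
}

/** Overrides the setup method without repeating the annotation. */
class OcrCaseWithoutAnnotation extends BaseCase {
    @Override
    public void copyTestResources() {}
}

/** Overrides the setup method and declares the annotation again. */
class OcrCaseWithAnnotation extends BaseCase {
    @SetupBefore
    @Override
    public void copyTestResources() {}
}

final class AnnotationCheck {
    private AnnotationCheck() {}

    /** Reports whether the type's own copyTestResources() declaration carries the annotation. */
    static boolean declaresSetup(Class<?> type) {
        try {
            Method method = type.getDeclaredMethod("copyTestResources");
            return method.isAnnotationPresent(SetupBefore.class);
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Here `declaresSetup` is true for `BaseCase` and `OcrCaseWithAnnotation` and false for `OcrCaseWithoutAnnotation`. Repeating the annotation on the override removes any dependence on how the runner discovers setup methods.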



sonarqubecloud bot commented Dec 1, 2025


Development

Successfully merging this pull request may close these issues.

Add support for password protected documents