Skip to content

Add support for password protected documents #1916

@lengoyvaerts

Description

@lengoyvaerts

This is a feature request as a result of this Elastic Discuss thread, which mentions #229

When trying to send a password protected document to the FSCrawler REST API, I'm getting the following exception:

08:24:49,504 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [document-with-password.docx] org.apache.tika.exception.EncryptedDocumentException: Unable to process: document is encrypted at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:274) ~[tika-parser-microsoft-module-2.9.1.jar:2.9.1] at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:183) ~[tika-parser-microsoft-module-2.9.1.jar:2.9.1] at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-core-2.9.1.jar:2.9.1] at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-core-2.9.1.jar:2.9.1] at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203) ~[tika-core-2.9.1.jar:2.9.1] at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:197) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?] at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:98) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?] at fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi.uploadToDocumentService(DocumentApi.java:205) ~[fscrawler-rest-2.10-SNAPSHOT.jar:?] at fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi.addDocument(DocumentApi.java:94) ~[fscrawler-rest-2.10-SNAPSHOT.jar:?] at jdk.internal.reflect.GeneratedMethodAccessor54.invoke(Unknown Source) ~[?:?] at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?] at java.base/java.lang.reflect.Method.invoke(Method.java:566) ~[?:?] at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52) ~[jersey-server-3.1.5.jar:?] at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:146) [jersey-server-3.1.5.jar:?] at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:189) [jersey-server-3.1.5.jar:?] at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:219) [jersey-server-3.1.5.jar:?] at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:93) [jersey-server-3.1.5.jar:?] at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478) [jersey-server-3.1.5.jar:?] at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400) [jersey-server-3.1.5.jar:?] at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81) [jersey-server-3.1.5.jar:?] at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:261) [jersey-server-3.1.5.jar:?] at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248) [jersey-common-3.1.5.jar:?] at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244) [jersey-common-3.1.5.jar:?] at org.glassfish.jersey.internal.Errors.process(Errors.java:292) [jersey-common-3.1.5.jar:?] at org.glassfish.jersey.internal.Errors.process(Errors.java:274) [jersey-common-3.1.5.jar:?] at org.glassfish.jersey.internal.Errors.process(Errors.java:244) [jersey-common-3.1.5.jar:?] at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265) [jersey-common-3.1.5.jar:?] at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:240) [jersey-server-3.1.5.jar:?] at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:697) [jersey-server-3.1.5.jar:?] at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:367) [jersey-container-grizzly2-http-3.1.5.jar:?] at org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:190) [grizzly-http-server-4.0.1.jar:4.0.1] at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:535) [grizzly-framework-4.0.1.jar:4.0.1] at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:515) [grizzly-framework-4.0.1.jar:4.0.1] at java.base/java.lang.Thread.run(Thread.java:834) [?:?] 08:24:49,505 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation

Would it be possible to support a feature to allow password protected documents to be crawled? When using the crawler to crawl a directory, it could be something like a .password file as suggested in the Discuss thread. For the REST API, it could be an extra parameter in the request, e.g: -F "password=my-password" (could also be Base64 encoded maybe)

I would think Tika supports this, given the documentation for the PasswordProvider class

Metadata

Metadata

Assignees

Labels

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions