Description
There are a number of problems with the Tika integration. Rather than splitting them across several tickets, I think it makes sense to document them all together here.
TL;DR
There's a two-line fix for all the problems described below. In your buildout:
[supervisor]
supervisord-environment = SOLR_ENABLE_REMOTE_STREAMING=true,SOLR_ENABLE_STREAM_BODY=true,SOLR_OPTS="-Dsolr.allowPaths=${instance:blob-storage}"
1. use_tika=False uses Tika
The use_tika setting with title "Use Tika" is misleadingly named.
Collective.solr always uses Tika when indexing File or Image content
1.1. If use_tika is True, the blob of the File is sent to the Solr extract handler inline, within the HTTP POST. The Solr documentation calls this mode Solr Cell and discourages it for production use.
1.2. If use_tika is False, the POST to the Solr extract handler contains just the path to the blob, not the actual blob contents.
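To make the difference between 1.1 and 1.2 concrete, here is a minimal sketch of the two request shapes. It is not collective.solr's actual code: the function names, the handler URL and the default content type are assumptions for illustration, and `stream.file` is the parameter used in the reproduction URL further down. Whether collective.solr sends the path in the query string or in the POST body is not shown in this issue.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed handler URL; host, port and core name depend on your setup.
EXTRACT_URL = "http://localhost:8983/solr/plone/update/extract"


def inline_request(blob_bytes, content_type="application/pdf"):
    """use_tika = True: the blob itself is the body of the POST."""
    return Request(
        EXTRACT_URL,
        data=blob_bytes,
        headers={"Content-Type": content_type},
        method="POST",
    )


def path_request(blob_path):
    """use_tika = False: only the blob's filesystem path is sent."""
    # safe="/" keeps the slashes of the path unescaped in the URL.
    query = urlencode({"stream.file": blob_path}, safe="/")
    return Request(EXTRACT_URL + "?" + query, method="POST")
```

Both functions only build the request object; nothing is sent until it is passed to `urllib.request.urlopen`. In the second case Solr has to read the file itself, which is why the streaming flags and the security policy discussed below come into play.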
Renaming this label to "Use Tika without direct blobstorage access" would perhaps better reflect what's going on here, especially given the filesystem security policy issue detailed below.
Only searchable NamedBlobFile fields don't use Tika
The only scenario in which Tika really is not used is when a content object has a NamedBlobFile field that is marked as plone.app.dexterity.textindexer.searchable. Such a field gets indexed via portal_transforms without hitting the Tika integration. See also this discussion. I was wondering whether it makes sense to try converting a blob with portal_transforms first, and to feed it to Tika only when that fails?
Certainly it makes sense to drop some comments or documentation somewhere about this variant.
2. remote streaming is disabled by default
Starting with Solr 9.3, remote streaming has been disabled by default.
You will have to explicitly enable the extraction module.
The fix for that is given in this StackOverflow answer. You have to set both SOLR_ENABLE_REMOTE_STREAMING and SOLR_ENABLE_STREAM_BODY.
If you don't do that, trying to index a binary content object will error out with a 500 traceback from Solr. We should probably inspect that traceback and log a more helpful error message here, to avoid other users having to reconstruct my findings from scratch.
3. invalid tika default field
If you have use_tika = True, you'll get
Client exception => org.apache.solr.common.SolrException: undefined field: "content"
at org.apache.solr.schema.IndexSchema.getField(IndexSchema.java:1423)
Changing tika_default_field back from content to SearchableText fixes this. Apparently #316 has been obsoleted and either the provided schema needs to be amended, or the code needs to change in other ways.
4. solr security policy by default disallows blobstorage access
If use_tika = False, indexing a file points Solr to the blobstorage, at which point Solr throws 500 Exception => java.security.AccessControlException: access denied ("java.io.FilePermission". You'll think: hang on, the file permissions are world-readable along the whole path to my blobstorage, what's up? The answer is the Java security policy.
For this case also, we should probably inspect that traceback and log a more helpful error message here, to avoid other users having to reconstruct my findings from scratch.
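A more helpful error message could start from something like the sketch below. It is only an illustration, not existing collective.solr code: the function name and the hint wording are made up, and the markers are the two error strings quoted in this issue. The exact message Solr returns when remote streaming is disabled is not quoted here, so that case is left out rather than guessed.

```python
# Substrings taken from the error messages quoted in this issue,
# mapped to hints. The hint wording is illustrative only.
KNOWN_SOLR_ERRORS = [
    (
        "java.security.AccessControlException",
        "Solr's Java security policy denies access to the blobstorage; "
        "check -Dsolr.allowPaths.",
    ),
    (
        "undefined field",
        "The Tika default field is missing from the Solr schema; "
        "check tika_default_field.",
    ),
]


def explain_solr_error(response_body):
    """Return a hint for a known Solr error body, or None if unrecognised."""
    for marker, hint in KNOWN_SOLR_ERRORS:
        if marker in response_body:
            return hint
    return None
```

The hint would be logged next to the original traceback, not instead of it, so that unrecognised errors stay fully visible.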
Solr ships with its own security policy, in /parts/solr/server/etc/security.policy. There is no need to muck around with hardcoding changes there: since Solr 9.4, solrAllowPaths can be set as a configuration option.
reproducing
Instead of whipping up a full Plone+Solr integration, you can easily reproduce these errors against Solr directly. Suppose you have a Solr server up and running, with a core called plone, and a simple PDF file in /path/to/testdocument.pdf. You can then open the following URL in your browser to verify whether the extraction handler is working and has access to the file:
http://localhost:8983/solr/plone/update/extract?extractFormat=text&extractOnly=true&wt=xml&stream.file=/path/to/testdocument.pdf
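The same check can be scripted. The sketch below builds exactly the URL above and fetches it. The function names are hypothetical, and the base URL assumes the local server and the plone core described in this section.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def extract_url(pdf_path, base="http://localhost:8983/solr/plone"):
    """Build the extractOnly URL for a file on the Solr host."""
    params = {
        "extractFormat": "text",
        "extractOnly": "true",
        "wt": "xml",
        "stream.file": pdf_path,
    }
    # safe="/" keeps the slashes of the path readable in the URL.
    return base + "/update/extract?" + urlencode(params, safe="/")


def check_extraction(pdf_path):
    """Fetch the URL; urlopen raises HTTPError on a 500 from Solr."""
    with urlopen(extract_url(pdf_path)) as response:
        return response.read().decode("utf-8")
```

Calling `check_extraction("/path/to/testdocument.pdf")` against a misconfigured Solr should raise `urllib.error.HTTPError`, whose body contains the traceback discussed in the sections above.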
summary
For collective.solr indexing of File and Image content to work, the solr process needs to start with the proper flags set in the environment, as in: SOLR_ENABLE_REMOTE_STREAMING=true SOLR_ENABLE_STREAM_BODY=true SOLR_OPTS="-Dsolr.allowPaths=/path/to/my.buildout/var/blobstorage" bin/solr-foreground.
To configure that via buildout:
[supervisor]
supervisord-environment = SOLR_ENABLE_REMOTE_STREAMING=true,SOLR_ENABLE_STREAM_BODY=true,SOLR_OPTS="-Dsolr.allowPaths=${instance:blob-storage}"
Note that this disables security features that were added for good reasons. With these options enabled, anybody who has network access to Solr also has access to all of the blobs stored in Plone, bypassing the Zope access controls completely. Also note that -Dsolr.allowPaths=/path/to/my.buildout/var/blobstorage will give full write/delete access to the blobstorage, when we actually only need read access.
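To verify that a Solr process was actually started with the settings from this summary, a small check along the following lines may help. It is a sketch under assumptions: the function name is hypothetical, and it only inspects an environment mapping for the three settings named above. It does not talk to Solr.

```python
import os

REQUIRED_FLAGS = ("SOLR_ENABLE_REMOTE_STREAMING", "SOLR_ENABLE_STREAM_BODY")


def missing_solr_settings(env):
    """List the settings from the summary that are absent from env."""
    missing = [name for name in REQUIRED_FLAGS if env.get(name) != "true"]
    if "-Dsolr.allowPaths=" not in env.get("SOLR_OPTS", ""):
        missing.append("SOLR_OPTS (-Dsolr.allowPaths)")
    return missing


if __name__ == "__main__":
    # Run in the same environment that starts Solr.
    for name in missing_solr_settings(os.environ):
        print("missing or not enabled:", name)
```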