Tika integration is broken #385

@gyst

There's a bunch of problems with the Tika integration. Rather than splitting them across several tickets, I think it makes sense to document them all together here.

TL;DR

There's a two-line fix for all the problems described below. In your buildout:

[supervisor]
supervisord-environment = SOLR_ENABLE_REMOTE_STREAMING=true,SOLR_ENABLE_STREAM_BODY=true,SOLR_OPTS="-Dsolr.allowPaths=${instance:blob-storage}"

1. use_tika=False uses Tika

The use_tika setting with title "Use Tika" is misleadingly named.

Collective.solr always uses Tika when indexing File or Image content

1.1. If use_tika is True, the blob of the File is sent to the Solr extract handler inline, within the HTTP POST. In the Solr documentation this mode is called Solr Cell and it's discouraged for production use.
1.2. If use_tika is False, the POST to the Solr extract handler contains just the path to the blob, not the actual blob contents.

Renaming this label to "Use Tika without direct blobstorage access" would perhaps better reflect what's going on here, especially given the filesystem security policy issue detailed below.
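To make the difference between the two modes concrete, here is a minimal sketch of the two request shapes. It is not the collective.solr implementation: `build_extract_request` and `EXTRACT_URL` are hypothetical names, and the host, port and core name are assumptions borrowed from the reproduction URL further down.

```python
from pathlib import Path
from urllib.parse import urlencode

# Assumed endpoint; adjust host, port and core name to your setup.
EXTRACT_URL = "http://localhost:8983/solr/plone/update/extract"


def build_extract_request(blob_path, inline):
    """Return (url, body) for one of the two extraction modes (hypothetical helper)."""
    params = {"extractFormat": "text"}
    if inline:
        # Mode 1.1 (use_tika=True): the blob bytes travel in the POST body.
        body = Path(blob_path).read_bytes()
    else:
        # Mode 1.2 (use_tika=False): only the path is sent. Solr opens the
        # file itself, so it needs filesystem access to the blobstorage.
        params["stream.file"] = str(blob_path)
        body = None
    return EXTRACT_URL + "?" + urlencode(params, safe="/"), body
```

Either way the request ends up at the extract handler, which is why both settings involve Tika.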

Only searchable NamedBlobFile fields don't use Tika

The only scenario in which Tika really is not used is a NamedBlobFile field on a content object that is marked as plone.app.dexterity.textindexer.searchable. Such a field gets indexed via portal_transforms without hitting the Tika integration. See also this discussion. I was wondering whether it makes sense to try converting a blob with portal_transforms first, and only feed it to Tika when that fails?

Certainly it makes sense to drop some comments or documentation somewhere about this variant.

2. remote streaming is disabled by default

Starting with Solr 9.3, remote streaming has been disabled by default.

You will have to explicitly enable the extraction module.

The fix for that is given in this StackOverflow answer. You have to set both SOLR_ENABLE_REMOTE_STREAMING and SOLR_ENABLE_STREAM_BODY.

If you don't do that, trying to index a binary content object will error out with a 500 traceback from Solr. We should probably inspect that traceback and log a more helpful error message here, to avoid other users having to reconstruct my findings from scratch.

3. invalid tika default field

If you have use_tika = True, you'll get

Client exception => org.apache.solr.common.SolrException: undefined field: "content"
	at org.apache.solr.schema.IndexSchema.getField(IndexSchema.java:1423)

Changing tika_default_field back from content to SearchableText fixes this. Apparently #316 has been obsoleted and either the provided schema needs to be amended, or the code needs to change in other ways.

4. solr security policy by default disallows blobstorage access

If use_tika = False, indexing a file will point Solr to the blobstorage, on which Solr throws 500 Exception => java.security.AccessControlException: access denied ("java.io.FilePermission". You'll think: hang on, the file permissions are world-readable for the whole path to my blobstorage, what's up? Well, the Java security policy.

For this case also, we should probably inspect that traceback and log a more helpful error message here, to avoid other users having to reconstruct my findings from scratch.
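A more helpful error message could be produced along these lines. This is only a sketch: `explain_solr_error`, `HINTS`, `FALLBACK` and the logger name are hypothetical, and the only markers it matches are the two error strings quoted in this issue. Anything else falls through to a generic hint about the streaming flags.

```python
import logging

logger = logging.getLogger("solr.extract.hints")  # hypothetical logger name

# Markers are substrings of the Solr errors quoted in this issue.
HINTS = (
    (
        'undefined field: "content"',
        "tika_default_field names a field that is missing from the Solr "
        "schema; try setting it to SearchableText.",
    ),
    (
        "java.security.AccessControlException",
        "Solr is not allowed to read the blobstorage; start Solr with "
        "-Dsolr.allowPaths pointing at it.",
    ),
)

FALLBACK = (
    "Unrecognised Solr extraction error; check that "
    "SOLR_ENABLE_REMOTE_STREAMING and SOLR_ENABLE_STREAM_BODY are set."
)


def explain_solr_error(response_text):
    """Log and return a hint for the first known marker in a Solr error body."""
    for marker, hint in HINTS:
        if marker in response_text:
            logger.error(hint)
            return hint
    logger.error(FALLBACK)
    return FALLBACK
```

The fallback is deliberately vague, because the exact text Solr returns when remote streaming is disabled is not quoted in this issue.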

Solr ships with its own security policy, in /parts/solr/server/etc/security.policy. There is no need to muck around with hardcoding changes there: since Solr 9.4, solr.allowPaths can be set as a configuration option.

reproducing

Instead of whipping up a full Plone+Solr integration, you can easily reproduce these errors against Solr directly. Suppose you have a Solr server up and running with a core called plone, and a simple PDF file in /path/to/testdocument.pdf. Open the following URL in your browser to verify that the extraction handler is working and has access to the file:
http://localhost:8983/solr/plone/update/extract?extractFormat=text&extractOnly=true&wt=xml&stream.file=/path/to/testdocument.pdf
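The same check can be scripted. The sketch below builds exactly the URL shown above and fetches it. `extract_only_url` and `check_extraction` are hypothetical helper names, and the default base URL and core name are simply the ones from this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def extract_only_url(pdf_path, base="http://localhost:8983/solr", core="plone"):
    """Build the extractOnly URL for a file on the Solr server's filesystem."""
    params = {
        "extractFormat": "text",
        "extractOnly": "true",
        "wt": "xml",
        "stream.file": pdf_path,
    }
    # safe="/" keeps the slashes of the path readable in the query string.
    return f"{base}/{core}/update/extract?" + urlencode(params, safe="/")


def check_extraction(pdf_path):
    """Fetch the URL; urlopen raises urllib.error.HTTPError on a 4xx/5xx reply."""
    with urlopen(extract_only_url(pdf_path), timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `check_extraction("/path/to/testdocument.pdf")` against a misconfigured Solr should raise an `HTTPError` whose body contains the traceback described in the sections above.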

summary

For collective.solr indexing of File and Image content to work, the Solr process needs to start with the proper flags set in the environment, as in: SOLR_ENABLE_REMOTE_STREAMING=true SOLR_ENABLE_STREAM_BODY=true SOLR_OPTS="-Dsolr.allowPaths=/path/to/my.buildout/var/blobstorage" bin/solr-foreground.

To configure that via buildout:

[supervisor]
supervisord-environment = SOLR_ENABLE_REMOTE_STREAMING=true,SOLR_ENABLE_STREAM_BODY=true,SOLR_OPTS="-Dsolr.allowPaths=${instance:blob-storage}"

Note that this disables security features that were added for good reasons. With these options enabled, anybody who has network access to Solr also has access to all of the blobs stored in Plone, bypassing the Zope access controls completely. Also note that -Dsolr.allowPaths=/path/to/my.buildout/var/blobstorage will give full write/delete access to the blobstorage, when we actually only need read access.
