Tika integration is broken

There are several problems with the Tika integration. Rather than splitting them across several tickets, I think it makes sense to document them all together here.

## TL;DR

There's a two-line fix for all the problems described below. In your buildout:
```
[supervisor]
supervisord-environment = SOLR_ENABLE_REMOTE_STREAMING=true,SOLR_ENABLE_STREAM_BODY=true,SOLR_OPTS="-Dsolr.allowPaths=${instance:blob-storage}"
```

## 1. `use_tika=False` uses Tika

[The `use_tika` setting with title "Use Tika"](https://github.com/collective/collective.solr/blob/1d65570fd9270d465317469796f17a7b351bd4d5/src/collective/solr/interfaces.py#L387) is misleadingly named.

collective.solr always uses Tika when indexing File or Image content:

1.1. If `use_tika` is `True`, the blob of the File is sent to the Solr extract handler [inlined within the HTTP POST](https://github.com/collective/collective.solr/blob/1d65570fd9270d465317469796f17a7b351bd4d5/src/collective/solr/indexer.py#L167). In the Solr documentation this mode is called Solr Cell and it's [discouraged for production use](https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-with-tika.html#solr-cell-performance-implications).
1.2. If `use_tika` is `False`, the POST to the Solr extract handler will just [contain the path to the blob](https://github.com/collective/collective.solr/blob/1d65570fd9270d465317469796f17a7b351bd4d5/src/collective/solr/indexer.py#L176), not the actual blob contents.
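
To make the difference between the two modes concrete, here is a simplified sketch. It is not the actual collective.solr code: `build_extract_request` is a hypothetical helper, and the only Solr parameters used are `extractFormat` and `stream.file`, which also appear in the reproduction URL further down.

```python
from typing import Dict, Optional, Tuple


def build_extract_request(
    use_tika: bool, blob: bytes, blob_path: str
) -> Tuple[Dict[str, str], Optional[bytes]]:
    """Return (query parameters, request body) for the Solr extract handler.

    Hypothetical helper for illustration only; collective.solr's real
    request contains more parameters than shown here.
    """
    params = {"extractFormat": "text"}
    if use_tika:
        # Mode 1.1: the file contents travel inside the HTTP POST body.
        return params, blob
    # Mode 1.2: only the filesystem path is sent; Solr reads the file itself,
    # so Solr needs remote streaming enabled and read access to that path.
    params["stream.file"] = blob_path
    return params, None
```

Either way the request ends up at the extract handler, which is why Tika is involved in both cases.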

Renaming the title of this setting to "Use Tika without direct blobstorage access" would perhaps better reflect what's going on here, especially given the filesystem security policy issue detailed below.

### Only searchable NamedBlobFile fields don't use Tika
The only scenario in which Tika really is not used is when you have a `NamedBlobFile` field on a content object that is marked as `plone.app.dexterity.textindexer.searchable`. Such a field gets indexed via portal_transforms without hitting the Tika integration. See also [this discussion](https://community.plone.org/t/plone-6-installation-does-not-index-pdf-files/17056/10). I was wondering whether it would make sense to try converting a blob with portal_transforms first, and to feed it to Tika only when that fails.

At the very least, this variant should be mentioned in code comments or documentation somewhere.
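
The "portal_transforms first, Tika only on failure" idea could look roughly like the sketch below. Everything in it is an assumption for illustration: `extract_text` is a hypothetical function, and the two callables stand in for the portal_transforms conversion and the Tika/Solr extraction, neither of which is called through its real API here.

```python
from typing import Callable, Optional


def extract_text(
    blob: bytes,
    transform: Callable[[bytes], Optional[str]],
    tika: Callable[[bytes], str],
) -> str:
    """Try the local transform first; fall back to Tika if it yields nothing."""
    try:
        text = transform(blob)
    except Exception:
        # Assumption: any failure of the local transform means
        # "could not convert", so we fall through to Tika.
        text = None
    if text and text.strip():
        return text
    return tika(blob)
```

Passing the converters in as callables keeps the fallback logic testable without a running Plone site or Solr server.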

## 2. Remote streaming is disabled by default

Starting with Solr 9.3, [remote streaming has been disabled by default](https://solr.apache.org/guide/solr/latest/upgrade-notes/major-changes-in-solr-9.html#security-2).

[You will have to explicitly enable the extraction module.](https://solr.apache.org/guide/solr/latest/indexing-guide/indexing-with-tika.html#module)

The fix for that is given in [this StackOverflow answer](https://stackoverflow.com/questions/77842844/solr-9-4-using-sys-prop-when-enabling-remote-stream-and-stream-body#77845415). You have to set both `SOLR_ENABLE_REMOTE_STREAMING` and `SOLR_ENABLE_STREAM_BODY`.

If you don't do that, trying to index a binary content object will error out with a 500 traceback from Solr. We should probably inspect that traceback and log a more helpful error message [here](https://github.com/collective/collective.solr/blob/1d65570fd9270d465317469796f17a7b351bd4d5/src/collective/solr/indexer.py#L189), to avoid other users having to reconstruct my findings from scratch.

## 3. Invalid Tika default field

If you have `use_tika = True`, indexing fails with the following error from Solr:
```
Client exception => org.apache.solr.common.SolrException: undefined field: "content"
	at org.apache.solr.schema.IndexSchema.getField(IndexSchema.java:1423)
```

Changing `tika_default_field` back from `content` to `SearchableText` fixes this. Apparently #316 has been obsoleted, so either the provided schema needs to be amended or the code needs to change in other ways.

## 4. Solr security policy by default disallows blobstorage access

If `use_tika = False`, indexing a file points Solr to the blobstorage, and Solr responds with `500 Exception => java.security.AccessControlException: access denied ("java.io.FilePermission"`. You'll think: hang on, the file permissions are world-readable for the whole path to my blobstorage, what's up? The culprit is the [Java security policy](https://stackoverflow.com/questions/10454037/java-security-accesscontrolexception-access-denied-java-io-filepermission).

For this case also, we should probably inspect that traceback and log a more helpful error message [here](https://github.com/collective/collective.solr/blob/1d65570fd9270d465317469796f17a7b351bd4d5/src/collective/solr/indexer.py#L189), to avoid other users having to reconstruct my findings from scratch.
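
A more helpful log message could be derived from the Solr response body along these lines. This is only a sketch: `explain_solr_error` and `HINTS` are hypothetical names, not part of collective.solr. The `AccessControlException` and `undefined field` markers are taken from the errors quoted in this issue; the remote streaming marker is an assumption about Solr's wording and should be checked against a real response.

```python
from typing import Optional

# (marker substring, hint) pairs, checked in order.
HINTS = (
    (
        # Assumed wording; verify against an actual Solr error response.
        "Remote Streaming is disabled",
        "Solr rejected the stream: set SOLR_ENABLE_REMOTE_STREAMING=true "
        "and SOLR_ENABLE_STREAM_BODY=true in the Solr environment.",
    ),
    (
        "java.security.AccessControlException",
        "Solr may not read the blobstorage: start Solr with "
        "-Dsolr.allowPaths pointing at the blobstorage directory.",
    ),
    (
        "undefined field",
        "The Solr schema lacks the configured field: check the "
        "tika_default_field setting against the schema.",
    ),
)


def explain_solr_error(body: str) -> Optional[str]:
    """Return a human-readable hint for a known Solr error, else None."""
    for marker, hint in HINTS:
        if marker in body:
            return hint
    return None
```

The indexer could log the returned hint next to the original traceback, and fall back to logging only the traceback when `None` is returned.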

Solr ships with its own security policy, in `/parts/solr/server/etc/security.policy`. There is no need to muck around with hardcoding changes there: [since Solr 9.4, `solr.allowPaths` can be set as a configuration option](https://issues.apache.org/jira/browse/SOLR-16905).

## Reproducing

Instead of whipping up a full Plone+Solr integration, you can easily reproduce these errors against Solr directly. Suppose you have a Solr server up and running with a core called `plone`, and a simple PDF file at `/path/to/testdocument.pdf`. Open the following URL in your browser to verify whether the extraction handler is working and has access to the file:
http://localhost:8983/solr/plone/update/extract?extractFormat=text&extractOnly=true&wt=xml&stream.file=/path/to/testdocument.pdf
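
If you prefer a script over the browser, the same check can be sketched in Python with only the standard library. The function names `build_extract_url` and `fetch_extract` are made up for this example; the host, core name and file path are the same placeholders as in the URL above.

```python
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen


def build_extract_url(base_url: str, core: str, file_path: str) -> str:
    """Build the extract-only URL for a file on Solr's local filesystem."""
    params = {
        "extractFormat": "text",
        "extractOnly": "true",
        "wt": "xml",
        "stream.file": file_path,
    }
    # Keep slashes unescaped so the URL stays readable.
    return f"{base_url}/solr/{core}/update/extract?{urlencode(params, safe='/')}"


def fetch_extract(url: str) -> str:
    """Return the response body; on an HTTP error, return Solr's error body."""
    try:
        with urlopen(url, timeout=30) as response:
            return response.read().decode("utf-8", errors="replace")
    except HTTPError as error:
        # Solr puts the interesting exception message in the error body.
        return error.read().decode("utf-8", errors="replace")
```

Calling `fetch_extract(build_extract_url("http://localhost:8983", "plone", "/path/to/testdocument.pdf"))` against a running Solr returns either the extracted text or the error described in the sections above.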

## Summary

For collective.solr indexing of File and Image content to work, the Solr process needs to start with the proper flags set in the environment, as in: `SOLR_ENABLE_REMOTE_STREAMING=true SOLR_ENABLE_STREAM_BODY=true SOLR_OPTS="-Dsolr.allowPaths=/path/to/my.buildout/var/blobstorage" bin/solr-foreground`.

To configure that via buildout:
```
[supervisor]
supervisord-environment = SOLR_ENABLE_REMOTE_STREAMING=true,SOLR_ENABLE_STREAM_BODY=true,SOLR_OPTS="-Dsolr.allowPaths=${instance:blob-storage}"
```

Note that this disables security features that were added for good reasons. With these options enabled, anybody who has network access to Solr also has access to all of the blobs stored in Plone, bypassing the Zope access controls completely. Also note that `-Dsolr.allowPaths=/path/to/my.buildout/var/blobstorage` [will give full write/delete access](https://github.com/apache/solr/blob/de7f11d0bfc0cba8ce227dcdc6add390da8f5c2d/solr/server/etc/security.policy#L205) to the blobstorage, when we actually only need read access.



