Skip to content

Conversation

@adam-vessey
Copy link
Contributor

@adam-vessey adam-vessey commented Nov 6, 2025

Supersedes #16

Summary by CodeRabbit

  • New Features

    • Nested document path support added for improved hierarchical search.
  • Improvements

    • Upgraded to Solr/Lucene 9 for better performance and compatibility.
    • Enhanced tokenization, analyzers and offsets handling for more accurate search and highlighting.
    • Modernized query caching with Caffeine and raised max boolean clauses for complex queries.
    • Root field now stored to improve retrieval and highlighting.
  • Infrastructure

    • Docker build parameterized by Solr version; OCR highlighting/plugin placement standardized.
    • Circuit breaker scaffold added and remote streaming enabled by default.
  • Behavioral

    • Suggester build-on-commit disabled and index path configured.

Also, accordingly change the version of the `solr-ocrhighlighting` plugin to _not_
use the Solr 7/8 version.
Would generally get suppressed in when storage is mounted into here, anyway.
@adam-vessey adam-vessey added the major Backwards incompatible API changes. label Nov 6, 2025
@coderabbitai
Copy link

coderabbitai bot commented Nov 6, 2025

Walkthrough

Parameterizes the Docker base image and module paths, relocates OCR highlighting into a computed Solr module path, updates many Solr schema fieldTypes/dynamicFields and analyzers for Solr 9, replaces legacy caches with CaffeineCache, adjusts solrconfig options (streaming, versioning, circuit-breaker scaffold), bumps core properties, and adds data/.gitignore.

Changes

Cohort / File(s) Summary
Docker Build & OCR plugin
Dockerfile, solr.in.sh
Parameterizes base image (ARG SOLR_VERSION, FROM solr:$SOLR_VERSION), adds ARG SOLR_PATH and derives SOLR_HOCR_PLUGIN_PATH, introduces SOLR_MODULES env, simplifies adding OCR jar into computed plugin path with stricter permissions and ownership, removes trailing-space COPY, and removes startup sysprop injection of -Dsolr.hocr.plugin.path from solr.in.sh.
Schema core updates
islandora8/conf/schema.xml
Adds _nest_path_ fieldType/field and makes _root_ stored, introduces multi-valued point/collection fieldTypes (e.g., strings, booleans, pfloats, pdoubles, plongs, pdates, date_ranges, etc.), and updates many dynamicField mappings to use the new named multi-valued types and adjusted docValues/termVectors/storage settings; text_ws now has storeOffsetsWithPositions="true".
Text analysis tweaks
islandora8/conf/schema_extra_types.xml
Adds storeOffsetsWithPositions="true" to many text fieldTypes, replaces WhitespaceTokenizerFactory with StandardTokenizerFactory in several analyzers, adds FlattenGraphFilterFactory where appropriate, and reorders SynonymGraphFilterFactory attributes for text_und query analyzer.
Solr core configuration
islandora8/conf/solrconfig.xml
Bumps luceneMatchVersion (LUCENE_80 → LUCENE_90), adds numVersionBuckets to updateLog sections, inserts a circuitBreaker scaffold (enabled by default, options commented), sets enableRemoteStreaming="true" and enableStreamBody="true", removes XSLT writer block, and updates documentation links.
Query/cache config
islandora8/conf/solrconfig_query.xml
Replaces LRU/FastLRU cache classes with solr.CaffeineCache for documentCache, fieldValueCache, filterCache, perSegFilter, and queryResultCache; adds <query name="maxBooleanClauses">1024</query>.
Solr extra config
islandora8/conf/solrconfig_extra.xml
Removes the <lib ... ocrhighlighting ... /> entry, sets suggester und indexPath="./und", and changes buildOnCommit="false" for suggester und.
Core properties
islandora8/conf/solrcore.properties
Bumps solr.luceneMatchVersion (8.11.4 → 9.12.2) and reorders replication config file entries.
Repository housekeeping
islandora8/data/.gitignore
Adds .gitignore to ignore all files in islandora8/data.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Builder as Docker build
    participant ImageFS as Image filesystem
    participant Container as Solr runtime

    rect rgb(230,248,255)
      note right of Builder: Build-time (Dockerfile)
      Builder->>ImageFS: FROM solr:$SOLR_VERSION
      Builder->>ImageFS: create $SOLR_MODULE_DIR and $SOLR_HOCR_PLUGIN_PATH
      Builder->>ImageFS: ADD/download ocrhighlighting.jar -> $SOLR_HOCR_PLUGIN_PATH
      Builder->>ImageFS: set permissions (chmod 444) and ownership (0:0)
      Builder->>ImageFS: set USER solr
    end

    rect rgb(255,249,230)
      note right of Container: Runtime (container start)
      Container->>Container: start without -Dsolr.hocr.plugin.path sysprop
      Container->>Container: solrconfig_extra has no OCR <lib/>
      Container->>Container: discover modules under Solr module path provided in image
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Files needing extra attention:
    • islandora8/conf/schema.xml — many interdependent dynamicField/type changes and new nest field.
    • islandora8/conf/schema_extra_types.xml — tokenizer and filter changes affecting indexing/query behavior.
    • islandora8/conf/solrconfig_query.xml — cache class substitutions and sizing implications.
    • Dockerfile & solr.in.sh — ensure build-time plugin placement aligns with runtime discovery after removing sysprop and <lib/>.

Possibly related PRs

Poem

🐇 I hopped in the build with a jar on my back,
Paths nested and tokenizers set on the track,
Caffeine for queries, versions raised high,
I pruned stray libs and watched configs sigh,
A tiny rabbit cheer for the Solr-packed stack.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly identifies the main change: updating Solr to version 9, which is reflected across all modified files (Dockerfile, schema updates, configuration bumps to Lucene/Solr 9.x).
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch update/solr-9

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Jira integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0e312b2 and ceef084.

📒 Files selected for processing (1)
  • Dockerfile (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • Dockerfile

Warning

Review ran into problems

🔥 Problems

Errors were encountered while retrieving linked issues.

Errors (1)
  • JIRA integration encountered authorization issues. Please disconnect and reconnect the integration in the CodeRabbit UI.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Jira integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between dec0039 and 78aa143.

⛔ Files ignored due to path filters (1)
  • islandora8/data/index/write.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • Dockerfile (1 hunks)
  • islandora8/conf/schema.xml (8 hunks)
  • islandora8/conf/schema_extra_types.xml (8 hunks)
  • islandora8/conf/solrconfig.xml (10 hunks)
  • islandora8/conf/solrconfig_extra.xml (2 hunks)
  • islandora8/conf/solrconfig_query.xml (1 hunks)
  • islandora8/conf/solrcore.properties (1 hunks)
  • islandora8/data/.gitignore (1 hunks)
  • solr.in.sh (0 hunks)
💤 Files with no reviewable changes (1)
  • solr.in.sh
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-29T19:08:13.952Z
Learnt from: adam-vessey
Repo: discoverygarden/solr-image PR: 15
File: Dockerfile:13-16
Timestamp: 2025-08-29T19:08:13.952Z
Learning: Solr automatically maps environment variables that start with SOLR_ to system properties by lowercasing the name and replacing underscores with dots. For example, SOLR_HOCR_PLUGIN_PATH becomes solr.hocr.plugin.path as a system property that can be referenced in Solr configuration files.

Applied to files:

  • islandora8/conf/solrconfig_extra.xml
  • islandora8/conf/solrconfig.xml
  • islandora8/conf/solrcore.properties
🔇 Additional comments (1)
islandora8/conf/schema_extra_types.xml (1)

25-25: Verify tokenizer changes don't break existing search behavior.

Multiple fieldTypes were updated to use StandardTokenizerFactory instead of WhitespaceTokenizerFactory. This is a functional change that affects how text is tokenized:

  • WhitespaceTokenizerFactory splits on whitespace only
  • StandardTokenizerFactory uses Unicode word boundaries and handles punctuation

This change improves tokenization quality but represents a behavioral shift. Affected fieldTypes: text_edge, text_phonetic_und, text_und, text_spell_und, text_ngramstring, text_ngram.

Verify that this change is intentional and won't negatively impact existing search behavior. Consider whether reindexing is needed for existing data to benefit from the improved tokenization.

Also applies to: 36-36, 105-105, 115-115, 195-195, 205-205, 221-221, 256-256, 267-267

Doesn't quite seem to be documented, so probably shouldn't depend on
it.
Comment on lines +19 to +21
# extraction,langid,ltr,analysis-extras are required by search_api_solr, so
# let's set 'em by default.
ENV SOLR_MODULES=extraction,langid,ltr,analysis-extras
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be moved over to be something set by the containers using the image, but... probably fine here, especially considering we already have some binding here with the presence of the config-set.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

major Backwards incompatible API changes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants