Fix search_repository: Docker URL, ecm:fulltext field, configurable highlights#21

Open
bdelbosc wants to merge 3 commits into main from supint-2477-2478-2479-fix-search-repository

@bdelbosc
Member

Summary

Three fixes and one improvement to search_repository and search_audit.

SUPINT-2477 — Fix search_repository/search_audit failing inside Docker

nuxeo.client.host stores the host as seen at connection time (http://localhost:8080), which does not resolve to the Nuxeo service from inside a Docker container.

Fix: use os.environ.get("NUXEO_URL", nuxeo.client.host) so the Docker service name can be injected via env var, with the client host as fallback. Also increases the ES probe timeout from 2s to 10s.
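A minimal sketch of that fallback, for illustration only. The helper name `resolve_nuxeo_url` and the `http://nuxeo:8080` service URL are assumptions and do not come from the PR; only the `os.environ.get("NUXEO_URL", nuxeo.client.host)` pattern does.

```python
import os
from typing import Mapping


def resolve_nuxeo_url(client_host: str, env: Mapping[str, str] = os.environ) -> str:
    """Return NUXEO_URL when it is set, otherwise the host recorded by the client.

    `client_host` stands in for nuxeo.client.host; `env` is injectable so the
    lookup can be exercised without touching the real process environment.
    """
    return env.get("NUXEO_URL", client_host)


# Inside Docker: the (hypothetical) service name is injected through the env var.
print(resolve_nuxeo_url("http://localhost:8080", {"NUXEO_URL": "http://nuxeo:8080"}))
# Outside Docker: nothing is set, so the client host is used as-is.
print(resolve_nuxeo_url("http://localhost:8080", {}))
```

Note that `dict.get` only falls back when the key is absent, so an empty `NUXEO_URL` would be used verbatim under this sketch.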

SUPINT-2478 — Fix fulltext search targeting non-existent ecm:fulltext field

ElasticsearchQueryBuilder was querying ecm:fulltext/ecm:fulltext.title and NaturalLanguageParser was requesting highlights on ecm:fulltext. These fields do not exist in the Nuxeo OpenSearch mapping — the catch-all field is all_field (a copy_to aggregate).

Fix:

  • Query field: ecm:fulltext → all_field
  • Highlight field: ecm:fulltext → ecm:binarytext with require_field_match: false (all_field has no store so cannot be highlighted directly)
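The resulting request body might look roughly like the sketch below. The function name `build_fulltext_request` and the choice of a `multi_match` clause are assumptions for illustration; the PR only states which fields are queried and highlighted, not the exact query type.

```python
import json


def build_fulltext_request(text: str) -> dict:
    """Build a search body that queries all_field and highlights ecm:binarytext."""
    return {
        "query": {
            # all_field is the copy_to aggregate, so it is the field to search.
            "multi_match": {"query": text, "fields": ["all_field"]}
        },
        "highlight": {
            # The query targets all_field, not ecm:binarytext, so highlighting
            # must not require the highlighted field to match the queried one.
            "require_field_match": False,
            "fields": {"ecm:binarytext": {}},
        },
    }


print(json.dumps(build_fulltext_request("quarterly report"), indent=2))
```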

SUPINT-2479 — Add configurable highlight fragment size and count

Fixed-size highlights (150 chars) were too small for meaningful content extraction from large PDFs. Added two optional parameters throughout the call chain (tools.py → es_passthrough.py → nl_parser.py):

  • highlight_fragment_size (default: 150, max: ~9,000,000)
  • highlight_number_of_fragments (default: 3; set to 0 for entire field)

Also adds source_fields passthrough for cases where raw _source content is needed.
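As a rough illustration of how the two highlight parameters could map onto the highlight section of the request, here is a sketch. The helper name `build_highlight` and the input validation are assumptions; the defaults (150 and 3) and the `ecm:binarytext` field come from the description above.

```python
def build_highlight(fragment_size: int = 150, number_of_fragments: int = 3) -> dict:
    """Build a highlight section with caller-controlled fragment sizing.

    number_of_fragments=0 asks the search engine to return the whole field
    instead of fragments, in which case fragment_size is ignored.
    """
    if fragment_size < 1:
        raise ValueError("fragment_size must be positive")
    if number_of_fragments < 0:
        raise ValueError("number_of_fragments must be >= 0")
    return {
        "require_field_match": False,
        "fields": {
            "ecm:binarytext": {
                "fragment_size": fragment_size,
                "number_of_fragments": number_of_fragments,
            }
        },
    }


# Defaults reproduce the previous fixed-size behaviour.
default_highlight = build_highlight()
# Larger fragments for extracting content from big PDFs.
large_highlight = build_highlight(fragment_size=2000, number_of_fragments=5)
# Entire field content.
whole_field = build_highlight(number_of_fragments=0)
```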

Docs

  • nuxeo_mcp_config.md: note on NUXEO_URL Docker service name requirement
  • USAGE.md: examples for new search_repository parameters

@bdelbosc force-pushed the supint-2477-2478-2479-fix-search-repository branch 2 times, most recently from d5aaa9d to 76e40f6 on May 21, 2026 16:34
@bdelbosc requested review from ataillefer and Copilot on May 21, 2026 16:35
Copilot AI left a comment

Pull request overview

This PR fixes Elasticsearch passthrough behavior for search_repository / search_audit in Dockerized deployments, aligns repository fulltext querying/highlighting with Nuxeo’s OpenSearch mapping, and adds configurable highlight/source-field controls to improve content extraction.

Changes:

  • Use NUXEO_URL env var (fallback to nuxeo.client.host) for Elasticsearch passthrough base URL, and increase the repository ES probe timeout.
  • Switch fulltext query target from ecm:fulltext to all_field, and update highlighting to target ecm:binarytext with require_field_match: false.
  • Add source_fields, highlight_fragment_size, and highlight_number_of_fragments parameters through the tool → passthrough → parser stack; update docs and one unit test accordingly.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Summary per file:

  • USAGE.md: Documents new search_repository parameters for highlight sizing and _source field inclusion.
  • tests/test_es_query_builder.py: Updates expected fulltext query fields to all_field.
  • src/nuxeo_mcp/tools.py: Adds new search_repository parameters; uses NUXEO_URL for passthrough; increases repository ES probe timeout.
  • src/nuxeo_mcp/nl_parser.py: Updates highlight configuration (field + configurable fragment sizing).
  • src/nuxeo_mcp/es_query_builder.py: Changes default fulltext search field list to ["all_field"].
  • src/nuxeo_mcp/es_passthrough.py: Threads highlight params; adds _source passthrough and modifies result formatting.
  • nuxeo_mcp_config.md: Adds Docker note about setting NUXEO_URL to a reachable service hostname.
  • Dockerfile: Uses explicit octal permissions (--chmod=0755) for the entrypoint copy.

Comments suppressed due to low confidence (1)

src/nuxeo_mcp/es_passthrough.py:96

  • source_fields is documented/used as “extra fields to include”, but it is passed straight into _source.includes. When a caller supplies only an extra field (e.g. ['ecm:binarytext']), Elasticsearch will omit the standard metadata fields (dc:title, ecm:path, etc.), and the formatted results will end up with empty strings for those base properties. Consider always including the required base fields plus any extras when building _source.includes (or treating source_fields as additive rather than replacing).
        # Parse natural language to Elasticsearch DSL
        es_request = self.nl_parser.parse_to_elasticsearch(
            query,
            index="repository",
            include_sort=True,
            include_pagination=True,
            include_highlight=True,
            highlight_fragment_size=highlight_fragment_size,
            highlight_number_of_fragments=highlight_number_of_fragments,
            apply_acl=True,
            user_principals=[principal] + groups,
            source_includes=source_fields,
        )
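
One way to make source_fields additive, as the comment above suggests, is sketched below. The helper name `merge_source_includes` and the `BASE_SOURCE_FIELDS` tuple are hypothetical; only dc:title and ecm:path are named in the review comment, and the real base list is presumably longer.

```python
from typing import Iterable, List, Optional

# Hypothetical base list: the review comment names only these two fields.
BASE_SOURCE_FIELDS = ("dc:title", "ecm:path")


def merge_source_includes(
    extra: Optional[Iterable[str]] = None,
    base: Iterable[str] = BASE_SOURCE_FIELDS,
) -> List[str]:
    """Return the base fields followed by any extras, without duplicates."""
    merged = list(base)
    for field in extra or ():
        if field not in merged:
            merged.append(field)
    return merged


# A caller asking only for ecm:binarytext still gets the base metadata fields.
print(merge_source_includes(["ecm:binarytext"]))
```

The merged list would then be passed as source_includes in place of the raw source_fields value, so result formatting always finds the base properties.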

Comment thread src/nuxeo_mcp/tools.py
Comment thread src/nuxeo_mcp/tools.py
Comment thread src/nuxeo_mcp/es_passthrough.py Outdated
Comment thread src/nuxeo_mcp/nl_parser.py
@bdelbosc force-pushed the supint-2477-2478-2479-fix-search-repository branch from 76e40f6 to 910c56e on May 26, 2026 06:34
@bdelbosc requested a review from guirenard on May 26, 2026 06:36
Comment thread src/nuxeo_mcp/nl_parser.py
Comment thread src/nuxeo_mcp/tools.py
bdelbosc added 3 commits May 29, 2026 14:05
…g inside Docker

Use os.environ.get("NUXEO_URL", nuxeo.client.host) so a Docker-aware URL
can be injected via environment variable. Also increases the ES connectivity
probe timeout from 2s to 10s.
Query field changed from ecm:fulltext to all_field (the copy_to aggregate
used by Nuxeo OpenSearch mapping). Highlight field changed from ecm:fulltext
to ecm:binarytext with require_field_match=false, since all_field has no
store setting and cannot be highlighted directly.
…ch_repository

Adds highlight_fragment_size and highlight_number_of_fragments parameters
to search_repository tool, ElasticsearchPassthrough.search_repository() and
NaturalLanguageParser.parse_to_elasticsearch(). Also adds source_fields
passthrough so callers can request extra _source fields when needed.
@bdelbosc force-pushed the supint-2477-2478-2479-fix-search-repository branch from 910c56e to ab6338d on May 29, 2026 12:07
@bdelbosc requested a review from ataillefer on May 29, 2026 12:13
4 participants