fix(sharepoint): Add secondary filter for embedded images by justin-tahara · Pull Request #5473 · onyx-dot-app/onyx

justin-tahara · 2025-09-23T01:08:43Z

Description

[Provide a brief description of the changes in this PR]
The sharepoint connector does some interesting things in the way it is extracting images. If the actual file being indexed is a TIFF file, we will gracefully skip over them. But for any images that are not valid image types, we would extract them from other files like a PDF and then try and summarize the information from the image.

This PR aims to do a second tier of filtering in order to ensure that if the file is a text file but has embedded images, we also skip over images that are not a valid image type.

How Has This Been Tested?

[Describe the tests you ran to verify your changes]
Ran the unit tests.

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

This PR should be backported (make sure to check that the backport attempt succeeds)
[Optional] Override Linear Check

vercel · 2025-09-23T01:08:47Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Preview	Comments	Updated (UTC)
internal-search	Ready	Preview	Comment	Sep 23, 2025 1:11am

greptile-apps

Greptile Summary

This PR adds secondary filtering for embedded images in the SharePoint connector to ensure consistent handling of unsupported image types. The change addresses an inconsistency where the connector would filter out unsupported image files (like TIFF) at the document level but would still process embedded images of unsupported types extracted from other documents like PDFs.

The implementation adds a new _store_embedded_image() helper function that validates embedded image data using get_image_type_from_bytes() before processing. This function checks for unknown formats and excluded types (particularly GIF images), gracefully skipping them with debug logging. The change ensures that both standalone image files and embedded images follow the same validation rules defined in EXCLUDED_IMAGE_TYPES.

This modification fits into the broader SharePoint connector architecture by extending the existing image validation logic that was already present for standalone files. The connector processes various SharePoint document types and extracts both text and embedded images - this change ensures the image extraction pipeline maintains consistency across different extraction scenarios. The defensive programming approach with proper error handling aligns with the connector's existing patterns for handling various file types and potential processing failures.

Confidence score: 4/5

This PR is safe to merge with minimal risk as it adds defensive filtering logic without changing existing functionality
Score reflects well-structured defensive code with proper error handling and logging, addressing a clear consistency gap
Pay close attention to the embedded image processing logic in the SharePoint connector file

_{1 file reviewed, no comments}

_{Edit Code Review Bot Settings | Greptile}

cubic-dev-ai

1 issue found across 1 file

Prompt for AI agents (all 1 issues)


Understand the root cause of the following 1 issues and fix them.


<file name="backend/onyx/connectors/sharepoint/connector.py">

<violation number="1" location="backend/onyx/connectors/sharepoint/connector.py:62">
Client-secret auth never sets sp_tenant_domain, causing RuntimeError when creating ClientContext.</violation>
</file>

_{React with 👍 or 👎 to teach cubic. Mention @cubic-dev-ai to give feedback, ask questions, or re-run the review.}

cubic-dev-ai · 2025-09-23T01:17:00Z

backend/onyx/connectors/sharepoint/connector.py

 from onyx.file_processing.extract_file_text import get_file_ext
 from onyx.file_processing.file_validation import EXCLUDED_IMAGE_TYPES
 from onyx.file_processing.image_utils import store_image_and_create_section
+from onyx.utils.b64 import get_image_type_from_bytes


Client-secret auth never sets sp_tenant_domain, causing RuntimeError when creating ClientContext.

Prompt for AI agents

Address the following comment on backend/onyx/connectors/sharepoint/connector.py at line 62: <comment>Client-secret auth never sets sp_tenant_domain, causing RuntimeError when creating ClientContext.</comment> <file context> @@ -59,6 +59,7 @@ from onyx.file_processing.extract_file_text import get_file_ext from onyx.file_processing.file_validation import EXCLUDED_IMAGE_TYPES from onyx.file_processing.image_utils import store_image_and_create_section +from onyx.utils.b64 import get_image_type_from_bytes from onyx.utils.logger import setup_logger </file context>

…pp#5473)

fix(sharepoint): Add secondary filter for embedded images

7372d48

justin-tahara requested a review from a team as a code owner September 23, 2025 01:08

greptile-apps bot reviewed Sep 23, 2025

View reviewed changes

vercel bot deployed to Preview September 23, 2025 01:11 View deployment

evan-onyx approved these changes Sep 23, 2025

View reviewed changes

cubic-dev-ai bot reviewed Sep 23, 2025

View reviewed changes

justin-tahara merged commit 8dd7934 into main Sep 23, 2025
56 of 58 checks passed

justin-tahara deleted the jtahara/add-secondary-mime-type-filter branch September 23, 2025 01:48

brijsiyag-meesho pushed a commit to brijsiyag-meesho/onyx that referenced this pull request Sep 23, 2025

fix(sharepoint): Add secondary filter for embedded images (onyx-dot-a…

78fd588

…pp#5473)

razvanMiu pushed a commit to eea/danswer that referenced this pull request Oct 16, 2025

fix(sharepoint): Add secondary filter for embedded images (onyx-dot-a…

0551a8b

…pp#5473)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sharepoint): Add secondary filter for embedded images#5473

fix(sharepoint): Add secondary filter for embedded images#5473
justin-tahara merged 1 commit intomainfrom
jtahara/add-secondary-mime-type-filter

justin-tahara commented Sep 23, 2025 •

edited

Loading

Uh oh!

vercel bot commented Sep 23, 2025 •

edited

Loading

Uh oh!

greptile-apps bot left a comment

Uh oh!

cubic-dev-ai bot left a comment

Uh oh!

cubic-dev-ai bot Sep 23, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

justin-tahara commented Sep 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

How Has This Been Tested?

Backporting (check the box to trigger backport action)

Uh oh!

vercel bot commented Sep 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Greptile Summary

Confidence score: 4/5

Uh oh!

cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai bot Sep 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

justin-tahara commented Sep 23, 2025 •

edited

Loading

vercel bot commented Sep 23, 2025 •

edited

Loading

cubic-dev-ai bot Sep 23, 2025 •

edited

Loading