Skip to content

fix(sharepoint): Add secondary filter for embedded images#5473

Merged
justin-tahara merged 1 commit intomainfrom
jtahara/add-secondary-mime-type-filter
Sep 23, 2025
Merged

fix(sharepoint): Add secondary filter for embedded images#5473
justin-tahara merged 1 commit intomainfrom
jtahara/add-secondary-mime-type-filter

Conversation

@justin-tahara
Copy link
Contributor

@justin-tahara justin-tahara commented Sep 23, 2025

Description

[Provide a brief description of the changes in this PR]
The sharepoint connector does some interesting things in the way it is extracting images. If the actual file being indexed is a TIFF file, we will gracefully skip over them. But for any images that are not valid image types, we would extract them from other files like a PDF and then try and summarize the information from the image.

This PR aims to do a second tier of filtering in order to ensure that if the file is a text file but has embedded images, we also skip over images that are not a valid image type.

How Has This Been Tested?

[Describe the tests you ran to verify your changes]
Ran the unit tests.

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)
  • [Optional] Override Linear Check

@justin-tahara justin-tahara requested a review from a team as a code owner September 23, 2025 01:08
@vercel
Copy link

vercel bot commented Sep 23, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
internal-search Ready Ready Preview Comment Sep 23, 2025 1:11am

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This PR adds secondary filtering for embedded images in the SharePoint connector to ensure consistent handling of unsupported image types. The change addresses an inconsistency where the connector would filter out unsupported image files (like TIFF) at the document level but would still process embedded images of unsupported types extracted from other documents like PDFs.

The implementation adds a new _store_embedded_image() helper function that validates embedded image data using get_image_type_from_bytes() before processing. This function checks for unknown formats and excluded types (particularly GIF images), gracefully skipping them with debug logging. The change ensures that both standalone image files and embedded images follow the same validation rules defined in EXCLUDED_IMAGE_TYPES.

This modification fits into the broader SharePoint connector architecture by extending the existing image validation logic that was already present for standalone files. The connector processes various SharePoint document types and extracts both text and embedded images - this change ensures the image extraction pipeline maintains consistency across different extraction scenarios. The defensive programming approach with proper error handling aligns with the connector's existing patterns for handling various file types and potential processing failures.

Confidence score: 4/5

  • This PR is safe to merge with minimal risk as it adds defensive filtering logic without changing existing functionality
  • Score reflects well-structured defensive code with proper error handling and logging, addressing a clear consistency gap
  • Pay close attention to the embedded image processing logic in the SharePoint connector file

1 file reviewed, no comments

Edit Code Review Bot Settings | Greptile

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 1 file

Prompt for AI agents (all 1 issues)

Understand the root cause of the following 1 issues and fix them.


<file name="backend/onyx/connectors/sharepoint/connector.py">

<violation number="1" location="backend/onyx/connectors/sharepoint/connector.py:62">
Client-secret auth never sets sp_tenant_domain, causing RuntimeError when creating ClientContext.</violation>
</file>

React with 👍 or 👎 to teach cubic. Mention @cubic-dev-ai to give feedback, ask questions, or re-run the review.

from onyx.file_processing.extract_file_text import get_file_ext
from onyx.file_processing.file_validation import EXCLUDED_IMAGE_TYPES
from onyx.file_processing.image_utils import store_image_and_create_section
from onyx.utils.b64 import get_image_type_from_bytes
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Client-secret auth never sets sp_tenant_domain, causing RuntimeError when creating ClientContext.

Prompt for AI agents
Address the following comment on backend/onyx/connectors/sharepoint/connector.py at line 62:

<comment>Client-secret auth never sets sp_tenant_domain, causing RuntimeError when creating ClientContext.</comment>

<file context>
@@ -59,6 +59,7 @@
 from onyx.file_processing.extract_file_text import get_file_ext
 from onyx.file_processing.file_validation import EXCLUDED_IMAGE_TYPES
 from onyx.file_processing.image_utils import store_image_and_create_section
+from onyx.utils.b64 import get_image_type_from_bytes
 from onyx.utils.logger import setup_logger
 
</file context>
Fix with Cubic

@justin-tahara justin-tahara merged commit 8dd7934 into main Sep 23, 2025
56 of 58 checks passed
@justin-tahara justin-tahara deleted the jtahara/add-secondary-mime-type-filter branch September 23, 2025 01:48
brijsiyag-meesho pushed a commit to brijsiyag-meesho/onyx that referenced this pull request Sep 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants