fix(sharepoint): Add secondary filter for embedded images#5473
fix(sharepoint): Add secondary filter for embedded images#5473justin-tahara merged 1 commit intomainfrom
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
Greptile Summary
This PR adds secondary filtering for embedded images in the SharePoint connector to ensure consistent handling of unsupported image types. The change addresses an inconsistency where the connector would filter out unsupported image files (like TIFF) at the document level but would still process embedded images of unsupported types extracted from other documents like PDFs.
The implementation adds a new _store_embedded_image() helper function that validates embedded image data using get_image_type_from_bytes() before processing. This function checks for unknown formats and excluded types (particularly GIF images), gracefully skipping them with debug logging. The change ensures that both standalone image files and embedded images follow the same validation rules defined in EXCLUDED_IMAGE_TYPES.
This modification fits into the broader SharePoint connector architecture by extending the existing image validation logic that was already present for standalone files. The connector processes various SharePoint document types and extracts both text and embedded images - this change ensures the image extraction pipeline maintains consistency across different extraction scenarios. The defensive programming approach with proper error handling aligns with the connector's existing patterns for handling various file types and potential processing failures.
Confidence score: 4/5
- This PR is safe to merge with minimal risk as it adds defensive filtering logic without changing existing functionality
- Score reflects well-structured defensive code with proper error handling and logging, addressing a clear consistency gap
- Pay close attention to the embedded image processing logic in the SharePoint connector file
1 file reviewed, no comments
There was a problem hiding this comment.
1 issue found across 1 file
Prompt for AI agents (all 1 issues)
Understand the root cause of the following 1 issues and fix them.
<file name="backend/onyx/connectors/sharepoint/connector.py">
<violation number="1" location="backend/onyx/connectors/sharepoint/connector.py:62">
Client-secret auth never sets sp_tenant_domain, causing RuntimeError when creating ClientContext.</violation>
</file>
React with 👍 or 👎 to teach cubic. Mention @cubic-dev-ai to give feedback, ask questions, or re-run the review.
| from onyx.file_processing.extract_file_text import get_file_ext | ||
| from onyx.file_processing.file_validation import EXCLUDED_IMAGE_TYPES | ||
| from onyx.file_processing.image_utils import store_image_and_create_section | ||
| from onyx.utils.b64 import get_image_type_from_bytes |
There was a problem hiding this comment.
Client-secret auth never sets sp_tenant_domain, causing RuntimeError when creating ClientContext.
Prompt for AI agents
Address the following comment on backend/onyx/connectors/sharepoint/connector.py at line 62:
<comment>Client-secret auth never sets sp_tenant_domain, causing RuntimeError when creating ClientContext.</comment>
<file context>
@@ -59,6 +59,7 @@
from onyx.file_processing.extract_file_text import get_file_ext
from onyx.file_processing.file_validation import EXCLUDED_IMAGE_TYPES
from onyx.file_processing.image_utils import store_image_and_create_section
+from onyx.utils.b64 import get_image_type_from_bytes
from onyx.utils.logger import setup_logger
</file context>
Description
[Provide a brief description of the changes in this PR]
The sharepoint connector does some interesting things in the way it is extracting images. If the actual file being indexed is a TIFF file, we will gracefully skip over them. But for any images that are not valid image types, we would extract them from other files like a PDF and then try and summarize the information from the image.
This PR aims to do a second tier of filtering in order to ensure that if the file is a text file but has embedded images, we also skip over images that are not a valid image type.
How Has This Been Tested?
[Describe the tests you ran to verify your changes]
Ran the unit tests.
Backporting (check the box to trigger backport action)
Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.