Add back MSG parsing #1993

NolanTrem · 2025-02-20T20:25:16Z

Important

Reintroduces MSG parsing using msg-parser library and updates dependencies and tests accordingly.

Parsers:
- Re-enable MSGParser in __init__.py files in core and parsers.
- Update MSGParser in msg_parser.py to use msg_parser library for parsing MSG files.
Dependencies:
- Add msg-parser>=1.2.0 to pyproject.toml and uv.lock.
Ingestion:
- Enable MSG file support in R2RIngestionProvider in base.py.
Tests:
- Re-enable MSG test case in test_ingestion.py.

^{This description was created by}^{for a4b9443. It will automatically update as commits are pushed.}

ellipsis-dev

❌ Changes requested. Reviewed everything up to 208e682 in 2 minutes and 9 seconds

More details

Looked at 249 lines of code in 8 files
Skipped 0 files when reviewing.
Skipped posting 9 drafted comments based on config settings.

1. py/core/parsers/structured/msg_parser.py:8

Draft comment:
Constructor is named 'init' instead of the conventional 'init'.
Reason this comment was not posted:
Marked as duplicate.

2. py/core/parsers/structured/msg_parser.py:37

Draft comment:
Consider removing the newline prefix ("\n") in the attachment yield for consistency.
Reason this comment was not posted:
Confidence changes required: 50% <= threshold 50%
None

3. py/core/parsers/structured/msg_parser.py:25

Draft comment:
Metadata, body, and attachments are yielded in separate steps. Verify that downstream consumers can correctly handle multiple yields or if a single concatenated string is expected.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%
The comment is asking the PR author to verify the behavior of downstream consumers, which violates the rule against asking for confirmation or verification of behavior. It does not provide a specific suggestion or point out a clear issue with the code.

4. py/pyproject.toml:54

Draft comment:
A new dependency 'msg-parser>=1.2.0' was added. Please ensure this version is compatible with the existing dependencies (e.g. 'olefile').
Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50%
None

5. py/tests/integration/test_ingestion.py:161

Draft comment:
The test for MSG ingestion has been re-enabled. Confirm that the sample MSG file ('msg.msg') meets expectations for the new parser.
Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50%
None

6. py/core/providers/ingestion/r2r/base.py:48

Draft comment:
MSGParser is now mapped to DocumentType.MSG. Double-check that this routing is consistent with MSG file ingestion requirements.
Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50%
None

7. py/core/parsers/__init__.py:15

Draft comment:
Duplicate entry 'VLMPDFParser' detected (appears at both line 12 and line 15). Please remove or consolidate the duplicate to avoid redundancy.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

8. py/core/providers/ingestion/r2r/base.py:198

Draft comment:
Typographical error: The variable 'text_spliiter' in the chunk() method is misspelled. Please rename it to 'text_splitter' for clarity and consistency.
Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.

9. py/pyproject.toml:57

Draft comment:
The dependency 'msg-parser>=1.2.0' is missing a space after the package name. For consistency with the other dependencies (e.g., 'aiofiles >=24.1.0,<25.0.0'), please update it to 'msg-parser >=1.2.0'.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50%
The comment points out an inconsistency in formatting, but: 1) Both formats are valid in Python packaging 2) The codebase already has mixed usage 3) This is a very minor stylistic issue that doesn't affect functionality 4) There's no strong convention being enforced here 5) The comment doesn't suggest any functional improvements.
I might be overlooking an established style guide or linting rule that requires consistent spacing. The inconsistency could make future automated processing more difficult.
If there was a strict style requirement, it would likely be enforced by automated tools. The presence of both styles in the existing codebase suggests this isn't a critical issue.
This comment should be deleted as it addresses a minor stylistic issue that isn't consistently enforced in the codebase and doesn't affect functionality.

Workflow ID: wflow_9yDsUQt6zFrRhhBJ

Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

ellipsis-dev · 2025-02-20T20:27:35Z

py/core/parsers/structured/msg_parser.py

-#         """Ingest MSG data and yield email content."""
-#         if isinstance(data, str):
-#             raise ValueError("MSG data must be in bytes format.")
+    def init(


The parser’s initializer is named 'init' instead of the conventional 'init'. Confirm if this is intentional—if not, rename to 'init' so dependencies are set at construction.

ellipsis-dev

❌ Changes requested. Incremental review on 494b3d7 in 1 minute and 5 seconds

More details

Looked at 13 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 1 drafted comments based on config settings.

1. py/core/parsers/structured/msg_parser.py:17

Draft comment:
Constructor renamed to init is correct.
Reason this comment was not posted:
Comment did not seem useful. Confidence is useful = 0% <= threshold 50%
This comment is purely informative and does not provide any actionable feedback or suggestions. It simply states that a change is correct, which is not useful for the PR author.

Workflow ID: wflow_xB63r3n8DR84GBhz

Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

ellipsis-dev · 2025-02-20T20:37:22Z

py/core/parsers/structured/msg_parser.py

-#         """Ingest MSG data and yield email content."""
-#         if isinstance(data, str):
-#             raise ValueError("MSG data must be in bytes format.")
+    def __init__(


Renaming 'init' to 'init' ensures that the constructor is correctly invoked when instantiating the class. Consider checking if a call to 'super().init()' is needed, if the base class has initialization logic.

ellipsis-dev

❌ Changes requested. Incremental review on a4b9443 in 1 minute and 23 seconds

More details

Looked at 75 lines of code in 1 files
Skipped 0 files when reviewing.
Skipped posting 5 drafted comments based on config settings.

1. py/core/parsers/structured/msg_parser.py:36

Draft comment:
Consider using a context manager for the temporary file to simplify cleanup, if the library supports reading via a file object.
Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50%
None

2. py/core/parsers/structured/msg_parser.py:65

Draft comment:
Consider handling potential errors in os.remove to avoid masking earlier exceptions.
Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50%
None

3. py/core/parsers/structured/msg_parser.py:58

Draft comment:
Verify that msg.attachments is iterable and that each attachment has a 'Filename' attribute as expected.
Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50%
None

4. py/core/parsers/structured/msg_parser.py:58

Draft comment:
Ensure msg.attachments is iterable (e.g., check for None or default to an empty iterable) to avoid potential runtime errors.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
The code was intentionally changed from dict-style to direct iteration, suggesting the author is aware of the API. Without seeing the msg_parser library's source code, we can't be certain if attachments can be None. The try/except block would catch any AttributeError anyway. The comment is speculative about potential runtime errors.
We don't have visibility into the msg_parser library's implementation - maybe attachments could be None in some cases?
The code already has error handling via try/except that would catch any such issues, and the intentional API change suggests the author understands the msg_parser library's behavior.
Delete the comment. It's speculative, and the code already has error handling in place.

5. py/core/parsers/structured/msg_parser.py:45

Draft comment:
Confirm that using attribute access (msg.subject, msg.sender, etc.) aligns with the msg_parser library interface, as opposed to method calls from previous versions.
Reason this comment was not posted:
Confidence changes required: 33% <= threshold 50%
None

Workflow ID: wflow_OtWa5wkCiaUmZu5c

Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

ellipsis-dev · 2025-02-20T21:34:51Z

py/core/parsers/structured/msg_parser.py


-#         file_obj = BytesIO(data)
+        tmp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".msg")


Consider using a context manager (or wrap os.remove in try/except) for the temporary file to ensure cleanup doesn’t mask prior errors.

Add back MSG parsing

208e682

ellipsis-dev bot reviewed Feb 20, 2025

View reviewed changes

Fix init

494b3d7

ellipsis-dev bot reviewed Feb 20, 2025

View reviewed changes

Fix msg parsing

a4b9443

ellipsis-dev bot reviewed Feb 20, 2025

View reviewed changes

NolanTrem merged commit ad001fa into main Feb 20, 2025
16 checks passed

NolanTrem deleted the Nolan/MsgParsing branch February 20, 2025 21:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add back MSG parsing #1993

Add back MSG parsing #1993

Uh oh!

NolanTrem commented Feb 20, 2025 •

edited by ellipsis-dev bot

Loading

Uh oh!

ellipsis-dev bot left a comment

Uh oh!

ellipsis-dev bot Feb 20, 2025

Uh oh!

ellipsis-dev bot left a comment

Uh oh!

ellipsis-dev bot Feb 20, 2025

Uh oh!

ellipsis-dev bot left a comment

Uh oh!

ellipsis-dev bot Feb 20, 2025

Uh oh!

Uh oh!

Uh oh!


		# file_obj = BytesIO(data)
		tmp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".msg")

Add back MSG parsing #1993

Add back MSG parsing #1993

Uh oh!

Conversation

NolanTrem commented Feb 20, 2025 • edited by ellipsis-dev bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

Uh oh!

ellipsis-dev bot Feb 20, 2025

Choose a reason for hiding this comment

Uh oh!

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

Uh oh!

ellipsis-dev bot Feb 20, 2025

Choose a reason for hiding this comment

Uh oh!

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

Uh oh!

ellipsis-dev bot Feb 20, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

NolanTrem commented Feb 20, 2025 •

edited by ellipsis-dev bot

Loading