-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add back MSG parsing #1993
Add back MSG parsing #1993
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested. Reviewed everything up to 208e682 in 2 minutes and 9 seconds
More details
- Looked at
249
lines of code in8
files - Skipped
0
files when reviewing. - Skipped posting
9
drafted comments based on config settings.
1. py/core/parsers/structured/msg_parser.py:8
- Draft comment:
Constructor is named 'init' instead of the conventional 'init'. - Reason this comment was not posted:
Marked as duplicate.
2. py/core/parsers/structured/msg_parser.py:37
- Draft comment:
Consider removing the newline prefix ("\n") in the attachment yield for consistency. - Reason this comment was not posted:
Confidence changes required:50%
<= threshold50%
None
3. py/core/parsers/structured/msg_parser.py:25
- Draft comment:
Metadata, body, and attachments are yielded in separate steps. Verify that downstream consumers can correctly handle multiple yields or if a single concatenated string is expected. - Reason this comment was not posted:
Comment did not seem useful. Confidence is useful =0%
<= threshold50%
The comment is asking the PR author to verify the behavior of downstream consumers, which violates the rule against asking for confirmation or verification of behavior. It does not provide a specific suggestion or point out a clear issue with the code.
4. py/pyproject.toml:54
- Draft comment:
A new dependency 'msg-parser>=1.2.0' was added. Please ensure this version is compatible with the existing dependencies (e.g. 'olefile'). - Reason this comment was not posted:
Confidence changes required:33%
<= threshold50%
None
5. py/tests/integration/test_ingestion.py:161
- Draft comment:
The test for MSG ingestion has been re-enabled. Confirm that the sample MSG file ('msg.msg') meets expectations for the new parser. - Reason this comment was not posted:
Confidence changes required:33%
<= threshold50%
None
6. py/core/providers/ingestion/r2r/base.py:48
- Draft comment:
MSGParser is now mapped to DocumentType.MSG. Double-check that this routing is consistent with MSG file ingestion requirements. - Reason this comment was not posted:
Confidence changes required:33%
<= threshold50%
None
7. py/core/parsers/__init__.py:15
- Draft comment:
Duplicate entry 'VLMPDFParser' detected (appears at both line 12 and line 15). Please remove or consolidate the duplicate to avoid redundancy. - Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.
8. py/core/providers/ingestion/r2r/base.py:198
- Draft comment:
Typographical error: The variable 'text_spliiter' in the chunk() method is misspelled. Please rename it to 'text_splitter' for clarity and consistency. - Reason this comment was not posted:
Comment was not on a location in the diff, so it can't be submitted as a review comment.
9. py/pyproject.toml:57
- Draft comment:
The dependency 'msg-parser>=1.2.0' is missing a space after the package name. For consistency with the other dependencies (e.g., 'aiofiles >=24.1.0,<25.0.0'), please update it to 'msg-parser >=1.2.0'. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50%
The comment points out an inconsistency in formatting, but: 1) Both formats are valid in Python packaging 2) The codebase already has mixed usage 3) This is a very minor stylistic issue that doesn't affect functionality 4) There's no strong convention being enforced here 5) The comment doesn't suggest any functional improvements.
I might be overlooking an established style guide or linting rule that requires consistent spacing. The inconsistency could make future automated processing more difficult.
If there was a strict style requirement, it would likely be enforced by automated tools. The presence of both styles in the existing codebase suggests this isn't a critical issue.
This comment should be deleted as it addresses a minor stylistic issue that isn't consistently enforced in the codebase and doesn't affect functionality.
Workflow ID: wflow_9yDsUQt6zFrRhhBJ
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
# """Ingest MSG data and yield email content.""" | ||
# if isinstance(data, str): | ||
# raise ValueError("MSG data must be in bytes format.") | ||
def init( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parser’s initializer is named 'init' instead of the conventional 'init'. Confirm if this is intentional—if not, rename to 'init' so dependencies are set at construction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested. Incremental review on 494b3d7 in 1 minute and 5 seconds
More details
- Looked at
13
lines of code in1
files - Skipped
0
files when reviewing. - Skipped posting
1
drafted comments based on config settings.
1. py/core/parsers/structured/msg_parser.py:17
- Draft comment:
Constructor renamed to init is correct. - Reason this comment was not posted:
Comment did not seem useful. Confidence is useful =0%
<= threshold50%
This comment is purely informative and does not provide any actionable feedback or suggestions. It simply states that a change is correct, which is not useful for the PR author.
Workflow ID: wflow_xB63r3n8DR84GBhz
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
# """Ingest MSG data and yield email content.""" | ||
# if isinstance(data, str): | ||
# raise ValueError("MSG data must be in bytes format.") | ||
def __init__( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renaming 'init' to 'init' ensures that the constructor is correctly invoked when instantiating the class. Consider checking if a call to 'super().init()' is needed, if the base class has initialization logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❌ Changes requested. Incremental review on a4b9443 in 1 minute and 23 seconds
More details
- Looked at
75
lines of code in1
files - Skipped
0
files when reviewing. - Skipped posting
5
drafted comments based on config settings.
1. py/core/parsers/structured/msg_parser.py:36
- Draft comment:
Consider using a context manager for the temporary file to simplify cleanup, if the library supports reading via a file object. - Reason this comment was not posted:
Confidence changes required:33%
<= threshold50%
None
2. py/core/parsers/structured/msg_parser.py:65
- Draft comment:
Consider handling potential errors in os.remove to avoid masking earlier exceptions. - Reason this comment was not posted:
Confidence changes required:33%
<= threshold50%
None
3. py/core/parsers/structured/msg_parser.py:58
- Draft comment:
Verify that msg.attachments is iterable and that each attachment has a 'Filename' attribute as expected. - Reason this comment was not posted:
Confidence changes required:33%
<= threshold50%
None
4. py/core/parsers/structured/msg_parser.py:58
- Draft comment:
Ensure msg.attachments is iterable (e.g., check for None or default to an empty iterable) to avoid potential runtime errors. - Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
The code was intentionally changed from dict-style to direct iteration, suggesting the author is aware of the API. Without seeing the msg_parser library's source code, we can't be certain if attachments can be None. The try/except block would catch any AttributeError anyway. The comment is speculative about potential runtime errors.
We don't have visibility into the msg_parser library's implementation - maybe attachments could be None in some cases?
The code already has error handling via try/except that would catch any such issues, and the intentional API change suggests the author understands the msg_parser library's behavior.
Delete the comment. It's speculative, and the code already has error handling in place.
5. py/core/parsers/structured/msg_parser.py:45
- Draft comment:
Confirm that using attribute access (msg.subject, msg.sender, etc.) aligns with the msg_parser library interface, as opposed to method calls from previous versions. - Reason this comment was not posted:
Confidence changes required:33%
<= threshold50%
None
Workflow ID: wflow_OtWa5wkCiaUmZu5c
Want Ellipsis to fix these issues? Tag @ellipsis-dev
in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
|
||
# file_obj = BytesIO(data) | ||
tmp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".msg") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using a context manager (or wrap os.remove in try/except) for the temporary file to ensure cleanup doesn’t mask prior errors.
Important
Reintroduces MSG parsing using
msg-parser
library and updates dependencies and tests accordingly.MSGParser
in__init__.py
files incore
andparsers
.MSGParser
inmsg_parser.py
to usemsg_parser
library for parsing MSG files.msg-parser>=1.2.0
topyproject.toml
anduv.lock
.R2RIngestionProvider
inbase.py
.test_ingestion.py
.This description was created by
for a4b9443. It will automatically update as commits are pushed.