Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add back MSG parsing #1993

Merged
merged 3 commits into from
Feb 20, 2025
Merged

Add back MSG parsing #1993

merged 3 commits into from
Feb 20, 2025

Conversation

NolanTrem
Copy link
Collaborator

@NolanTrem NolanTrem commented Feb 20, 2025

Important

Reintroduces MSG parsing using msg-parser library and updates dependencies and tests accordingly.

  • Parsers:
    • Re-enable MSGParser in __init__.py files in core and parsers.
    • Update MSGParser in msg_parser.py to use msg_parser library for parsing MSG files.
  • Dependencies:
    • Add msg-parser>=1.2.0 to pyproject.toml and uv.lock.
  • Ingestion:
    • Enable MSG file support in R2RIngestionProvider in base.py.
  • Tests:
    • Re-enable MSG test case in test_ingestion.py.

This description was created by Ellipsis for a4b9443. It will automatically update as commits are pushed.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Reviewed everything up to 208e682 in 2 minutes and 9 seconds

More details
  • Looked at 249 lines of code in 8 files
  • Skipped 0 files when reviewing.
  • Skipped posting 9 drafted comments based on config settings.
1. py/core/parsers/structured/msg_parser.py:8
  • Draft comment:
    Constructor is named 'init' instead of the conventional 'init'.
  • Reason this comment was not posted:
    Marked as duplicate.
2. py/core/parsers/structured/msg_parser.py:37
  • Draft comment:
    Consider removing the newline prefix ("\n") in the attachment yield for consistency.
  • Reason this comment was not posted:
    Confidence changes required: 50% <= threshold 50%
    None
3. py/core/parsers/structured/msg_parser.py:25
  • Draft comment:
    Metadata, body, and attachments are yielded in separate steps. Verify that downstream consumers can correctly handle multiple yields or if a single concatenated string is expected.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50%
    The comment is asking the PR author to verify the behavior of downstream consumers, which violates the rule against asking for confirmation or verification of behavior. It does not provide a specific suggestion or point out a clear issue with the code.
4. py/pyproject.toml:54
  • Draft comment:
    A new dependency 'msg-parser>=1.2.0' was added. Please ensure this version is compatible with the existing dependencies (e.g. 'olefile').
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50%
    None
5. py/tests/integration/test_ingestion.py:161
  • Draft comment:
    The test for MSG ingestion has been re-enabled. Confirm that the sample MSG file ('msg.msg') meets expectations for the new parser.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50%
    None
6. py/core/providers/ingestion/r2r/base.py:48
  • Draft comment:
    MSGParser is now mapped to DocumentType.MSG. Double-check that this routing is consistent with MSG file ingestion requirements.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50%
    None
7. py/core/parsers/__init__.py:15
  • Draft comment:
    Duplicate entry 'VLMPDFParser' detected (appears at both line 12 and line 15). Please remove or consolidate the duplicate to avoid redundancy.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
8. py/core/providers/ingestion/r2r/base.py:198
  • Draft comment:
    Typographical error: The variable 'text_spliiter' in the chunk() method is misspelled. Please rename it to 'text_splitter' for clarity and consistency.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
9. py/pyproject.toml:57
  • Draft comment:
    The dependency 'msg-parser>=1.2.0' is missing a space after the package name. For consistency with the other dependencies (e.g., 'aiofiles >=24.1.0,<25.0.0'), please update it to 'msg-parser >=1.2.0'.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50%
    The comment points out an inconsistency in formatting, but: 1) Both formats are valid in Python packaging 2) The codebase already has mixed usage 3) This is a very minor stylistic issue that doesn't affect functionality 4) There's no strong convention being enforced here 5) The comment doesn't suggest any functional improvements.
    I might be overlooking an established style guide or linting rule that requires consistent spacing. The inconsistency could make future automated processing more difficult.
    If there was a strict style requirement, it would likely be enforced by automated tools. The presence of both styles in the existing codebase suggests this isn't a critical issue.
    This comment should be deleted as it addresses a minor stylistic issue that isn't consistently enforced in the codebase and doesn't affect functionality.

Workflow ID: wflow_9yDsUQt6zFrRhhBJ


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

# """Ingest MSG data and yield email content."""
# if isinstance(data, str):
# raise ValueError("MSG data must be in bytes format.")
def init(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parser’s initializer is named 'init' instead of the conventional 'init'. Confirm if this is intentional—if not, rename to 'init' so dependencies are set at construction.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Incremental review on 494b3d7 in 1 minute and 5 seconds

More details
  • Looked at 13 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comments based on config settings.
1. py/core/parsers/structured/msg_parser.py:17
  • Draft comment:
    Constructor renamed to init is correct.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50%
    This comment is purely informative and does not provide any actionable feedback or suggestions. It simply states that a change is correct, which is not useful for the PR author.

Workflow ID: wflow_xB63r3n8DR84GBhz


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

# """Ingest MSG data and yield email content."""
# if isinstance(data, str):
# raise ValueError("MSG data must be in bytes format.")
def __init__(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renaming 'init' to 'init' ensures that the constructor is correctly invoked when instantiating the class. Consider checking if a call to 'super().init()' is needed, if the base class has initialization logic.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Changes requested. Incremental review on a4b9443 in 1 minute and 23 seconds

More details
  • Looked at 75 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 5 drafted comments based on config settings.
1. py/core/parsers/structured/msg_parser.py:36
  • Draft comment:
    Consider using a context manager for the temporary file to simplify cleanup, if the library supports reading via a file object.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50%
    None
2. py/core/parsers/structured/msg_parser.py:65
  • Draft comment:
    Consider handling potential errors in os.remove to avoid masking earlier exceptions.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50%
    None
3. py/core/parsers/structured/msg_parser.py:58
  • Draft comment:
    Verify that msg.attachments is iterable and that each attachment has a 'Filename' attribute as expected.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50%
    None
4. py/core/parsers/structured/msg_parser.py:58
  • Draft comment:
    Ensure msg.attachments is iterable (e.g., check for None or default to an empty iterable) to avoid potential runtime errors.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
    The code was intentionally changed from dict-style to direct iteration, suggesting the author is aware of the API. Without seeing the msg_parser library's source code, we can't be certain if attachments can be None. The try/except block would catch any AttributeError anyway. The comment is speculative about potential runtime errors.
    We don't have visibility into the msg_parser library's implementation - maybe attachments could be None in some cases?
    The code already has error handling via try/except that would catch any such issues, and the intentional API change suggests the author understands the msg_parser library's behavior.
    Delete the comment. It's speculative, and the code already has error handling in place.
5. py/core/parsers/structured/msg_parser.py:45
  • Draft comment:
    Confirm that using attribute access (msg.subject, msg.sender, etc.) aligns with the msg_parser library interface, as opposed to method calls from previous versions.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50%
    None

Workflow ID: wflow_OtWa5wkCiaUmZu5c


Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.


# file_obj = BytesIO(data)
tmp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".msg")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using a context manager (or wrap os.remove in try/except) for the temporary file to ensure cleanup doesn’t mask prior errors.

@NolanTrem NolanTrem merged commit ad001fa into main Feb 20, 2025
16 checks passed
@NolanTrem NolanTrem deleted the Nolan/MsgParsing branch February 20, 2025 21:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant