Commit

Merge pull request #28 from fleetingbytes/develop
Develop
fleetingbytes authored Jun 21, 2024
2 parents a15f1ea + 7d33010 commit 77ccd98
Showing 5 changed files with 59 additions and 7 deletions.
5 changes: 5 additions & 0 deletions .gitignore
@@ -1,3 +1,8 @@
# Test files
target.html
extract.py
test.msg

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,20 @@

<!-- towncrier release notes start -->

## 0.9.1 (2024-06-21)


### Documentation

- Fix old naming in readme [#22](https://github.com/fleetingbytes/rtfparse/issues/22)
- Add example how to programmatically extract HTML from MS Outlook message [#25](https://github.com/fleetingbytes/rtfparse/issues/25)


### Bugfixes

- Don't setup log if not using the CLI [#24](https://github.com/fleetingbytes/rtfparse/issues/24)
- Fix possible bug in error handling [#26](https://github.com/fleetingbytes/rtfparse/issues/26)

## 0.9.0 (2024-03-11)


43 changes: 38 additions & 5 deletions README.md
@@ -1,9 +1,9 @@
# rtfparse

-Parses Microsofts Rich Text Format (RTF) documents. It creates an in-memory object which represents the tree structure of the RTF document. This object can in turn be rendered by using one of the renderers.
-So far, rtfparse provides only one renderer (`Decapsulate_HTML`) which liberates the HTML code encapsulated in RTF. This will come handy, for examle, if you ever need to extract the HTML from a HTML-formatted email message saved by Microsoft Outlook.
+Parses Microsoft's Rich Text Format (RTF) documents. It creates an in-memory object which represents the tree structure of the RTF document. This object can in turn be rendered by using one of the renderers.
+So far, rtfparse provides only one renderer (`HTML_Decapsulator`) which liberates the HTML code encapsulated in RTF. This will come handy, for examle, if you ever need to extract the HTML from a HTML-formatted email message saved by Microsoft Outlook.

-MS Outlook also tends to use RTF compression, so the CLI of rtfparse can optionally do that, too.
+MS Outlook also tends to use RTF compression, so the CLI of rtfparse can optionally decompress that, too.

You can of course write your own renderers of parsed RTF documents and consider contributing them to this project.

@@ -56,7 +56,9 @@ In the current version the option `--embed-img` does nothing.

# Programatic usage in a Python module

```
## Decapsulate HTML from an uncompressed RTF file

```py
from pathlib import Path
from rtfparse.parser import Rtf_Parser
from rtfparse.renderers.html_decapsulator import HTML_Decapsulator
@@ -75,8 +77,39 @@ with open(target_path, mode="w", encoding="utf-8") as html_file:
    renderer.render(parsed, html_file)
```

## Decapsulate HTML from an MS Outlook msg file

```py
from pathlib import Path
from extract_msg import openMsg
from compressed_rtf import decompress
from io import BytesIO
from rtfparse.parser import Rtf_Parser
from rtfparse.renderers.html_decapsulator import HTML_Decapsulator


source_file = Path("path/to/your/source.msg")
target_file = Path(r"path/to/your/target.html")
# Create parent directory of `target_path` if it does not already exist:
target_file.parent.mkdir(parents=True, exist_ok=True)

# Get a decompressed RTF bytes buffer from the MS Outlook message
msg = openMsg(source_file)
decompressed_rtf = decompress(msg.compressedRtf)
rtf_buffer = BytesIO(decompressed_rtf)

# Parse the rtf buffer
parser = Rtf_Parser(rtf_file=rtf_buffer)
parsed = parser.parse_file()

# Decapsulate the HTML from the parsed RTF
decapsulator = HTML_Decapsulator()
with open(target_file, mode="w", encoding="utf-8") as html_file:
    decapsulator.render(parsed, html_file)
```

# RTF Specification Links

* [RTF Informative References](https://learn.microsoft.com/en-us/openspecs/exchange_server_protocols/ms-oxrtfcp/85c0b884-a960-4d1a-874e-53eeee527ca6)
-* [RTF Spec 1.9.1](https://go.microsoft.com/fwlink/?LinkId=120924)
+* [RTF Specification 1.9.1](https://go.microsoft.com/fwlink/?LinkId=120924)
* [RTF Extensions, MS-OXRTFEX](https://docs.microsoft.com/en-us/openspecs/exchange_server_protocols/ms-oxrtfex/411d0d58-49f7-496c-b8c3-5859b045f6cf)
2 changes: 1 addition & 1 deletion src/rtfparse/__about__.py
@@ -1,4 +1,4 @@
#!/usr/bin/env python


-__version__ = "0.9.0"
+__version__ = "0.9.1"
2 changes: 1 addition & 1 deletion src/rtfparse/parser.py
@@ -93,7 +93,7 @@ def parse_file(self) -> entities.Group:
             self.parsed = entities.Group(encoding, file)
         except Exception as err:
             logger.exception(err)
-            self.parsed == Namespace()
+            self.parsed = Namespace()
             self.parsed.structure = list()
         finally:
             if self.rtf_path is not None:
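
The `parser.py` hunk swaps a comparison (`==`) for an assignment (`=`) in the error-handling path. The following is a minimal sketch of why that matters, not the library's code: `Demo_Parser` is a hypothetical class, and `types.SimpleNamespace` is used as a stand-in for the `Namespace` referenced in the diff. With `==`, the fallback object is created, compared, and discarded, so `self.parsed` keeps whatever value it had before the exception.

```python
from types import SimpleNamespace


class Demo_Parser:
    """Hypothetical stand-in for the fallback path; not rtfparse's Rtf_Parser."""

    def __init__(self) -> None:
        self.parsed = None

    def fallback_buggy(self) -> None:
        # Comparison only: the new namespace is built, compared, and thrown away.
        self.parsed == SimpleNamespace()

    def fallback_fixed(self) -> None:
        # Assignment: self.parsed now refers to a fresh, empty namespace.
        self.parsed = SimpleNamespace()
        self.parsed.structure = list()


buggy = Demo_Parser()
buggy.fallback_buggy()  # leaves buggy.parsed untouched

fixed = Demo_Parser()
fixed.fallback_fixed()  # gives fixed.parsed an empty `structure` list
```

In the sketch the buggy variant leaves `parsed` at its initial value, so a following `self.parsed.structure = list()` would either fail or modify a stale object, depending on what `parsed` held before.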
