Commit 77ccd98

Merge pull request #28 from fleetingbytes/develop
Develop
2 parents a15f1ea + 7d33010 commit 77ccd98

5 files changed: 59 additions, 7 deletions
.gitignore

Lines changed: 5 additions & 0 deletions
@@ -1,3 +1,8 @@
+# Test files
+target.html
+extract.py
+test.msg
+
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]

CHANGELOG.md

Lines changed: 14 additions & 0 deletions
@@ -2,6 +2,20 @@

 <!-- towncrier release notes start -->

+## 0.9.1 (2024-06-21)
+
+
+### Documentation
+
+- Fix old naming in readme [#22](https://github.com/fleetingbytes/rtfparse/issues/22)
+- Add example how to programmatically extract HTML from MS Outlook message [#25](https://github.com/fleetingbytes/rtfparse/issues/25)
+
+
+### Bugfixes
+
+- Don't setup log if not using the CLI [#24](https://github.com/fleetingbytes/rtfparse/issues/24)
+- Fix possible bug in error handling [#26](https://github.com/fleetingbytes/rtfparse/issues/26)
+
 ## 0.9.0 (2024-03-11)

README.md

Lines changed: 38 additions & 5 deletions
@@ -1,9 +1,9 @@
 # rtfparse

-Parses Microsofts Rich Text Format (RTF) documents. It creates an in-memory object which represents the tree structure of the RTF document. This object can in turn be rendered by using one of the renderers.
-So far, rtfparse provides only one renderer (`Decapsulate_HTML`) which liberates the HTML code encapsulated in RTF. This will come handy, for examle, if you ever need to extract the HTML from a HTML-formatted email message saved by Microsoft Outlook.
+Parses Microsoft's Rich Text Format (RTF) documents. It creates an in-memory object which represents the tree structure of the RTF document. This object can in turn be rendered by using one of the renderers.
+So far, rtfparse provides only one renderer (`HTML_Decapsulator`) which liberates the HTML code encapsulated in RTF. This will come handy, for examle, if you ever need to extract the HTML from a HTML-formatted email message saved by Microsoft Outlook.

-MS Outlook also tends to use RTF compression, so the CLI of rtfparse can optionally do that, too.
+MS Outlook also tends to use RTF compression, so the CLI of rtfparse can optionally decompress that, too.

 You can of course write your own renderers of parsed RTF documents and consider contributing them to this project.

@@ -56,7 +56,9 @@ In the current version the option `--embed-img` does nothing.

 # Programatic usage in a Python module

-```
+## Decapsulate HTML from an uncompressed RTF file
+
+```py
 from pathlib import Path
 from rtfparse.parser import Rtf_Parser
 from rtfparse.renderers.html_decapsulator import HTML_Decapsulator
@@ -75,8 +77,39 @@ with open(target_path, mode="w", encoding="utf-8") as html_file:
     renderer.render(parsed, html_file)
 ```

+## Decapsulate HTML from an MS Outlook msg file
+
+```py
+from pathlib import Path
+from extract_msg import openMsg
+from compressed_rtf import decompress
+from io import BytesIO
+from rtfparse.parser import Rtf_Parser
+from rtfparse.renderers.html_decapsulator import HTML_Decapsulator
+
+
+source_file = Path("path/to/your/source.msg")
+target_file = Path(r"path/to/your/target.html")
+# Create parent directory of `target_path` if it does not already exist:
+target_file.parent.mkdir(parents=True, exist_ok=True)
+
+# Get a decompressed RTF bytes buffer from the MS Outlook message
+msg = openMsg(source_file)
+decompressed_rtf = decompress(msg.compressedRtf)
+rtf_buffer = BytesIO(decompressed_rtf)
+
+# Parse the rtf buffer
+parser = Rtf_Parser(rtf_file=rtf_buffer)
+parsed = parser.parse_file()
+
+# Decapsulate the HTML from the parsed RTF
+decapsulator = HTML_Decapsulator()
+with open(target_file, mode="w", encoding="utf-8") as html_file:
+    decapsulator.render(parsed, html_file)
+```
+
 # RTF Specification Links

 * [RTF Informative References](https://learn.microsoft.com/en-us/openspecs/exchange_server_protocols/ms-oxrtfcp/85c0b884-a960-4d1a-874e-53eeee527ca6)
-* [RTF Spec 1.9.1](https://go.microsoft.com/fwlink/?LinkId=120924)
+* [RTF Specification 1.9.1](https://go.microsoft.com/fwlink/?LinkId=120924)
 * [RTF Extensions, MS-OXRTFEX](https://docs.microsoft.com/en-us/openspecs/exchange_server_protocols/ms-oxrtfex/411d0d58-49f7-496c-b8c3-5859b045f6cf)

src/rtfparse/__about__.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 #!/usr/bin/env python


-__version__ = "0.9.0"
+__version__ = "0.9.1"

src/rtfparse/parser.py

Lines changed: 1 addition & 1 deletion
@@ -93,7 +93,7 @@ def parse_file(self) -> entities.Group:
             self.parsed = entities.Group(encoding, file)
         except Exception as err:
             logger.exception(err)
-            self.parsed == Namespace()
+            self.parsed = Namespace()
             self.parsed.structure = list()
         finally:
             if self.rtf_path is not None:
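
To see why the `==` to `=` change in `parse_file` matters, here is a minimal sketch of the fallback path. `ParserSketch` and its two methods are hypothetical stand-ins, not rtfparse code. The sketch assumes that `Namespace` is `argparse.Namespace` and that `parsed` is still `None` when the fallback runs. With `==` the comparison result is discarded, so the attribute assignment on the next line fails; with `=` a fresh `Namespace` is bound first.

```python
from argparse import Namespace  # assumption: the Namespace used in parser.py


class ParserSketch:
    """Hypothetical stand-in for the parser; only the fallback path is modeled."""

    def __init__(self):
        self.parsed = None  # assumption: no parse result exists yet

    def fallback_before(self):
        # `==` compares and discards the result, so `self.parsed` stays None
        self.parsed == Namespace()
        self.parsed.structure = list()  # raises AttributeError on None

    def fallback_after(self):
        # `=` binds a fresh Namespace, so the attribute assignment succeeds
        self.parsed = Namespace()
        self.parsed.structure = list()


sketch = ParserSketch()
try:
    sketch.fallback_before()
except AttributeError as err:
    print(f"before the fix: {err}")

sketch.fallback_after()
print(f"after the fix: {sketch.parsed.structure}")
```

If `parsed` had never been set at all, the `==` line itself would raise `AttributeError`. Either way, the error handler would raise a second exception instead of leaving an empty result behind.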
