Releases: microsoft/markitdown
v0.1.1
What's Changed
- convert_url renamed to convert_uri, and now handles data and file URIs by @afourney in #1153
NOTE: convert_url remains an alias to convert_uri, for backward compatibility.
Both now accept file URIs and data URIs. For example:
from markitdown import MarkItDown
markitdown = MarkItDown()
result = markitdown.convert_uri("file:///path/to/file.txt")
print(result.markdown)
And,
from markitdown import MarkItDown
markitdown = MarkItDown()
result = markitdown.convert_uri("data:text/plain;base64,SGVsbG8sIFdvcmxkIQ==")
print(result.markdown)
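For readers who want to build such URIs themselves, the following is a minimal sketch using only the Python standard library. The helper names text_to_data_uri and path_to_file_uri are hypothetical, introduced here for illustration, and are not part of MarkItDown.

```python
import base64
from pathlib import Path


def text_to_data_uri(text: str, mime_type: str = "text/plain") -> str:
    """Encode text as a base64 data URI (hypothetical helper, not part of MarkItDown)."""
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return f"data:{mime_type};base64,{payload}"


def path_to_file_uri(path: str) -> str:
    """Turn a local path into a file URI (hypothetical helper, not part of MarkItDown)."""
    # Path.as_uri() requires an absolute path, so resolve it first.
    return Path(path).resolve().as_uri()


data_uri = text_to_data_uri("Hello, World!")
print(data_uri)  # data:text/plain;base64,SGVsbG8sIFdvcmxkIQ==
print(path_to_file_uri("file.txt"))
```

The first printed value matches the data URI used in the example above, so either string could be passed to convert_uri.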
Full Changelog: v0.1.0...v0.1.1
v0.1.0
Overview
Version 0.1.0 (previously 0.1.0a6) is a large release, bringing many improvements over the previous 0.0.2 version.
High-level changes include:
- Organized dependencies into feature groups: install only the converters you need, or get everything with pip install markitdown[all]
- A new plugin-based architecture, allowing 3rd-party developers to add functionality to MarkItDown (see the sample plugin)
- All conversions are performed in-memory — no more temporary files
- Support for new formats including EPUB
- Option to keep data URIs in converted Markdown
- Option to override MIME type, extension, and charset in the command-line interface (useful when reading input from a pipe or stdin)
Breaking changes
- As noted above, dependencies are now organized into optional feature groups. Use pip install markitdown[all] for backward-compatible behavior.
- convert_stream() now requires a binary file-like object (e.g., a file opened in binary mode, or an io.BytesIO object). This is a breaking change from the previous version, which also accepted text file-like objects, like io.StringIO.
- The DocumentConverter class interface has changed to read from file-like streams rather than file paths. No temporary files are created anymore. If you are the maintainer of a plugin or custom DocumentConverter, you likely need to update your code. Otherwise, if you're only using the MarkItDown class or CLI (as in these examples), you should not need to change anything.
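If existing code passes text streams to convert_stream(), one way to adapt is to re-encode the text into an io.BytesIO object first. The sketch below uses only the standard library. The helper as_binary_stream is hypothetical (not part of MarkItDown), and UTF-8 is an assumed default encoding.

```python
import io
from typing import BinaryIO, Union


def as_binary_stream(
    stream: Union[io.TextIOBase, BinaryIO], encoding: str = "utf-8"
) -> BinaryIO:
    """Return a binary stream, re-encoding text streams (hypothetical helper)."""
    if isinstance(stream, io.TextIOBase):
        # Text streams (io.StringIO, files opened in mode "r") yield str,
        # so read everything and encode it into an in-memory bytes buffer.
        return io.BytesIO(stream.read().encode(encoding))
    # Already binary: hand it back unchanged.
    return stream


text_stream = io.StringIO("# Title\n\nSome text.")
binary_stream = as_binary_stream(text_stream)

# With MarkItDown installed, the binary stream could then be passed along, e.g.:
#   MarkItDown().convert_stream(binary_stream)
print(type(binary_stream).__name__)  # BytesIO
```

Note that this reads the whole text stream into memory, which is in keeping with the in-memory conversions described above but may be unsuitable for very large inputs.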
Detailed list of contributions
- Cleanup and refactor, in preparation for plugin support. by @afourney in #318
- Skip generating md links in 'pre' blocks by @t-kalinowski in #322
- Fix a typo in sample RTF plugin by @rickygao in #320
- Added priority argument to all converter constructors. by @afourney in #324
- Doc Intelligence fixes for refactored code by @KennyZhang1 in #325
- Added CLI tests. by @afourney in #327
- Fix UnboundLocalError in MarkItDown._convert by @menezesandre in #1038
- add necessary imports by @tanreinama in #861
- fix: Implement retry logic for YouTube transcript fetching and fix URL decoding issue by @iw4p in #1035
- Add Support For PPTX Shape Groups (Fix in code design to not miss out on slide content) by @C0dingMast3r in #331
- Make sure extensions are unique in MarkItDown's convert methods. by @afourney in #1076
- Don't have ZipConverter accept OOXML files. by @afourney in #1078
- Print and log better exceptions when file conversions fail. by @afourney in #1080
- Exceptions should subclass Exception not BaseException. by @afourney in #1082
- [Draft] Exploring ways to allow Optional dependencies by @afourney in #1079
- Fixed property name by @afourney in #1085
- Update converter API, user streams rather than filepaths by @afourney in #1088
- Bump version. by @afourney in #1094
- Fixed loading of plugins. by @afourney in #1096
- Fixed version. by @afourney in #1097
- fix(README): correct pip install command formatting by @Piero24 in #1090
- Fixed deepcopy failure when passing llm_client by @scalabreseGD in #1089
- Fixed formatting. by @afourney in #1098
- feat: sort pptx shapes to be parsed in top-to-bottom, left-to-right order by @richardye101 in #1104
- feat(docker): improve dockerfile build by @syaghoubi00 in #220
- Fix exiftool in well-known paths. by @afourney in #1106
- fix typo in well-known path list by @0xmohit in #1109
- Switch from puremagic to magika. by @afourney in #1108
- Minimize guesses when guesses are compatible. by @afourney in #1114
- Added CLI options for extension, mime-types, and charset. by @afourney in #1115
- Fix string formatting in FileConversionException error message by @yushihang in #1121
- Handle not supported plot type in pptx by @EmanueleMeazzo in #1122
- Small fixes for autogen integration. by @afourney in #1124
- Added epub test file. by @afourney in #1130
- Fix remaining mypy errors. by @afourney in #1132
- Have magika read from the stream. by @afourney in #1136
- EPub Support. Adapted #123 to not use epublib. by @afourney in #1131
- Consider anything with a charset as plain text-convertible. by @afourney in #1142
- Adjust warning filters and update dependencies by @afourney in #1143
- Add support for preserving base64 encoded images by @BetterAndBetterII in #1140
- Resolve a console encoding error. by @afourney in #1149
- Bump version to 0.1.0 by @afourney in #1150
New Contributors
- @t-kalinowski made their first contribution in #322
- @rickygao made their first contribution in #320
- @menezesandre made their first contribution in #1038
- @tanreinama made their first contribution in #861
- @iw4p made their first contribution in #1035
- @C0dingMast3r made their first contribution in #331
- @Piero24 made their first contribution in #1090
- @scalabreseGD made their first contribution in #1089
- @richardye101 made their first contribution in #1104
- @syaghoubi00 made their first contribution in #220
- @0xmohit made their first contribution in #1109
- @yushihang made their first contribution in #1121
- @EmanueleMeazzo made their first contribution in #1122
- @BetterAndBetterII made their first contribution in #1140
Full Changelog: v0.0.2...v0.1.0
v0.1.0a6
What's Changed
- Add support for preserving base64 encoded images by @BetterAndBetterII in #1140
- Bump version and resolve a console encoding error. by @afourney in #1149
New Contributors
- @BetterAndBetterII made their first contribution in #1140
Full Changelog: v0.1.0a5...v0.1.0a6
v0.1.0a5
v0.1.0a4
Features
- Basic EPub support from @0xRaduan, in collaboration with @afourney
- Switch from puremagic to magika. by @afourney in #1108
- Added CLI options for extension, mime-types, and charset. by @afourney in #1115
- Sort pptx shapes to be parsed in top-to-bottom, left-to-right order by @richardye101 in #1104
Bug fixes and enhancements
- fix(README): correct pip install command formatting by @Piero24 in #1090
- Fixed deepcopy failure when passing llm_client by @scalabreseGD in #1089
- feat(docker): improve dockerfile build by @syaghoubi00 in #220
- Fix exiftool in well-known paths. by @afourney in #1106
- fix typo in well-known path list by @0xmohit in #1109
- Minimize guesses when guesses are compatible. by @afourney in #1114
- Fix string formatting in FileConversionException error message by @yushihang in #1121
- Refactored tests. by @afourney in #1120
- Handle not supported plot type in pptx by @EmanueleMeazzo in #1122
- Fix remaining mypy errors. by @afourney in #1132
- Investigate and silence warnings. by @afourney in #1133
New Contributors
- @0xRaduan made their first contribution in #123
- @Piero24 made their first contribution in #1090
- @scalabreseGD made their first contribution in #1089
- @richardye101 made their first contribution in #1104
- @syaghoubi00 made their first contribution in #220
- @0xmohit made their first contribution in #1109
- @yushihang made their first contribution in #1121
- @EmanueleMeazzo made their first contribution in #1122
Full Changelog: v0.1.0a1...v0.1.0a4
v0.0.2
v0.1.0a1
What's Changed
This MarkItDown alpha introduces numerous bug-fixes, and the following major changes:
- Dependencies are now organized into optional feature-groups (further details below). Use pip install markitdown[all] to have backward-compatible behavior.
- The DocumentConverter class interface has changed to read from file-like streams rather than file paths. No temporary files are created anymore. If you are the maintainer of a DocumentConverter, you likely need to update your code. Otherwise, if only using the MarkItDown class or CLI, you should not need to change anything.
- MarkItDown now supports extension through 3rd-party plugins. See markitdown-sample-plugin for more details!
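To illustrate what reading from file-like streams rather than file paths can look like for a converter author, here is a self-contained sketch. The class and method names below are hypothetical and do not reproduce MarkItDown's actual DocumentConverter interface; see markitdown-sample-plugin for the real signatures.

```python
import io
from typing import BinaryIO


class PlainTextSketchConverter:
    """Hypothetical stream-based converter; not MarkItDown's real base class."""

    def accepts(self, stream: BinaryIO) -> bool:
        # Peek at the first bytes, then rewind so convert() sees the whole stream.
        start = stream.tell()
        head = stream.read(1024)
        stream.seek(start)
        # Treat content without NUL bytes as text (a crude, assumed heuristic).
        return b"\x00" not in head

    def convert(self, stream: BinaryIO, charset: str = "utf-8") -> str:
        # Everything happens in memory: no temporary file is written.
        return stream.read().decode(charset)


converter = PlainTextSketchConverter()
stream = io.BytesIO(b"Hello from a stream")
if converter.accepts(stream):
    print(converter.convert(stream))  # Hello from a stream
```

The point of the design is that a check such as accepts() has to restore the stream position after peeking, because there is no file path to reopen later.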
v0.0.1
Promoting v0.0.1a5 to a full release.
For more details see the prior Release Notes.
v0.0.1a5
What's Changed
- Fixed compatibility with markdownify v1.0.0
New Contributors
Full Changelog: v0.0.1a4...v0.0.1a5
MarkItDown version v0.0.1a4
Some of What's Changed
- feat: Add RSSConverter by @Soulter in #97
- feat: Add IpynbConverter by @AumGupta in #71
- feat(devcontainer): Add DevContainer Configuration for Easier Contribution Setup by @l-lumin in #64
- feat: add support for conversion via Document Intelligence by @KennyZhang1 in #303
- feat: add version option to markitdown CLI by @l-lumin in #172
- feat: enable Git support in devcontainer by @numekudi in #136
- feat: outlook ".msg" file converter by @muratcankurtulus in #196
- feat: Add xls support by @yeungadrian in #169
- feat: support image description with LLM for pptx files by @masquare in #306
- fix: Safeguard against path traversal for ZipConverter by @finchy in #129
- fix: support -o param to avoid encoding issues by @Soulter in #116
- fix(transcription): TRANSCRIPTION_CAPABLE should be iniztialized by @absadiki in #194
- fix: added a test for leading spaces. by @afourney in #258
- fix: If puremagic has no guesses, try again after ltrim. by @afourney in #260
- fix: Recognize json as plain text (if no other handlers are present). by @afourney in #261
- fix: Set exiftool path explicitly. by @afourney in #267
- fix: remove leading and trailing \n for HtmlConverter by @ZeyuTeng96 in #262
- fix: argparse CLI option ordering, fixes #268 by @slhck in #290
- fix: for mimetype issue with csv files on windows. by @wunde005 in #273
- docs: update README.md by @eltociear in #182
- docs: Add documentation for docintel by @KennyZhang1 in #312
New Contributors
- @AumGupta made their first contribution in #71
- @diya155 made their first contribution in #80
- @l-lumin made their first contribution in #64
- @waterimp made their first contribution in #98
- @finchy made their first contribution in #129
- @sugatoray made their first contribution in #130
- @PetrAPConsulting made their first contribution in #91
- @SigireddyBalasai made their first contribution in #93
- @dependabot made their first contribution in #177
- @numekudi made their first contribution in #136
- @eltociear made their first contribution in #182
- @absadiki made their first contribution in #194
- @muratcankurtulus made their first contribution in #196
- @yeungadrian made their first contribution in #169
- @KennyZhang1 made their first contribution in #303
- @ZeyuTeng96 made their first contribution in #262
- @jamesmh made their first contribution in #270
- @masquare made their first contribution in #306
- @slhck made their first contribution in #290
- @wunde005 made their first contribution in #273
Full Changelog: v0.0.1a3...v0.0.1a4