.NET Core library to convert Microsoft Office binary files to various formats. This fork focuses exclusively on Word (.doc) files and plain text extraction from legacy Microsoft Word documents (Word 97-2003, Word 95, and Word 6.0).
You can also use the Open XML SDK to manipulate the resulting Open XML files.
Forked from a .NET 2 Mono implementation under the BSD license.
- DOC to Plain Text Conversion: Robust extraction from Word 97-2003, Word 95, and Word 6.0 formats
- Enhanced Compatibility: Handles tables, headers/footers, embedded objects, and complex document structures
- Clean Output: Produces readable text while preserving document flow
- Edge Case Handling: Robust processing of corrupted or non-standard .doc files
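One inexpensive guard against non-standard input is to check the container signature before attempting extraction. Word 97-2003 `.doc` files are OLE2 compound files, and Word 6.0/95 documents generally use the same container, so a real `.doc` normally begins with the eight bytes `D0 CF 11 E0 A1 B1 1A E1`. The sketch below illustrates that check only. `DocSniffer` and `LooksLikeOle2` are hypothetical names, not part of this library's API, and nothing here describes how the library itself validates input.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of this library's API.
public static class DocSniffer
{
    // Signature found at offset 0 of every OLE2 compound file.
    private static readonly byte[] Ole2Magic =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Returns true when the stream starts with the OLE2 signature.
    public static bool LooksLikeOle2(Stream stream)
    {
        var header = new byte[Ole2Magic.Length];
        int read = 0;

        // Stream.Read may return fewer bytes than requested, so loop until full.
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0)
            {
                return false; // stream is shorter than the signature
            }
            read += n;
        }

        return header.AsSpan().SequenceEqual(Ole2Magic.AsSpan());
    }

    public static void Main(string[] args)
    {
        foreach (var path in args)
        {
            using var file = File.OpenRead(path);
            var verdict = LooksLikeOle2(file) ? "OLE2 container" : "not OLE2";
            Console.WriteLine($"{path}: {verdict}");
        }
    }
}
```

A matching signature only shows that the file is an OLE2 container. Legacy `.xls` and `.ppt` files use the same container, so the check rules out obviously wrong input (for example a `.docx` renamed to `.doc`) but does not prove that the file is a Word document.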
- PowerPoint (.ppt) to PPTX conversion
- Excel (.xls) to XLSX conversion
- Word (.doc) to DOCX conversion
Note: This fork maintains these legacy features but does not actively enhance them.
- Enhanced formatting support for lists (numbers, bullet points, indents) and tables
- Configurable text extraction options (--no-headers-footers, --no-textboxes, --no-comments, --no-bullets)
- Performance optimizations for large document processing
- Additional error handling and recovery mechanisms
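The four `--no-*` switches listed above each turn off one category of extracted text. As a rough illustration of how such switches could map onto settings, here is a minimal sketch. `ExtractionOptions`, its properties, and `Parse` are hypothetical names chosen for this example and are not taken from this library's code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical options type for illustration; not this library's API.
public sealed class ExtractionOptions
{
    // Everything is included unless a --no-* switch turns it off.
    public bool IncludeHeadersFooters { get; set; } = true;
    public bool IncludeTextBoxes { get; set; } = true;
    public bool IncludeComments { get; set; } = true;
    public bool IncludeBullets { get; set; } = true;

    // Applies the documented switches; any other argument (such as a
    // file path) is collected into 'rest' unchanged.
    public static ExtractionOptions Parse(IEnumerable<string> args, out List<string> rest)
    {
        var options = new ExtractionOptions();
        rest = new List<string>();

        foreach (var arg in args)
        {
            switch (arg)
            {
                case "--no-headers-footers": options.IncludeHeadersFooters = false; break;
                case "--no-textboxes": options.IncludeTextBoxes = false; break;
                case "--no-comments": options.IncludeComments = false; break;
                case "--no-bullets": options.IncludeBullets = false; break;
                default: rest.Add(arg); break;
            }
        }

        return options;
    }
}

public static class Program
{
    public static void Main(string[] args)
    {
        var options = ExtractionOptions.Parse(args, out var rest);
        Console.WriteLine($"Inputs: {string.Join(", ", rest)}");
        Console.WriteLine($"Headers/footers: {options.IncludeHeadersFooters}, " +
                          $"text boxes: {options.IncludeTextBoxes}, " +
                          $"comments: {options.IncludeComments}, " +
                          $"bullets: {options.IncludeBullets}");
    }
}
```

Defaulting every flag to `true` keeps the switches purely subtractive, which matches their `--no-` naming: an invocation without switches extracts everything.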
This project is inspired and informed by several existing open-source implementations of the Word Binary Format:
Name | Language | Description | Link |
---|---|---|---|
wvWare | C | Original GPL Word97 .doc text extractor | SourceForge |
OnlyOffice | C++ | Proprietary editor with open-source core, includes DOC parsing | GitHub |
Antiword | C | Lightweight Word .doc to text/postscript converter | GitHub Mirror |
Apache POI | Java | Java API for Microsoft Documents, includes Word97 support via HWPF | Apache POI - HWPF |
LibreOffice | C++ | Full office suite with robust support for legacy DOC files | GitHub |
Catdoc | C | Lightweight Word .doc to text converter | GitHub Mirror |
DocToText | C++ | Lightweight any document file to text converter | GitHub |
- Microsoft Office binary files documentation
- Open XML Standard
- Microsoft article on this implementation
- .NET 2 Mono implementation architecture
All code retained from the original .NET 2 Mono implementation ©2009 DIaLOGIKa http://www.dialogika.de/
.NET Core port work and move to System.IO.Compression ©2017 Evolution https://www.evolutionjobs.com/