.NET Core library to extract text from doc and convert Microsoft Office binary files (doc, xls and ppt) to Open XML (docx, xlsx and pptx).

GustavoHennig/b2xtranslator

 
 

b2xtranslator

.NET Core library to convert Microsoft Office binary files to various formats. This fork focuses exclusively on Word (.doc) files and plain text extraction from legacy Microsoft Word documents (Word 97-2003, Word 95, and Word 6.0).

You can also use the Open XML SDK to manipulate Open XML files.

Forked from a .NET 2 Mono implementation under the BSD license.

Key Features

Text Extraction (Primary Focus)

  • DOC to Plain Text Conversion: Robust extraction from Word 97-2003, Word 95, and Word 6.0 formats
  • Enhanced Compatibility: Handles tables, headers/footers, embedded objects, and complex document structures
  • Clean Output: Produces readable text while preserving document flow
  • Edge Case Handling: Robust processing of corrupted or non-standard .doc files

Legacy Format Support (Not Maintained)

  • PowerPoint (.ppt) to PPTX conversion
  • Excel (.xls) to XLSX conversion
  • Word (.doc) to DOCX conversion

Note: This fork retains these legacy features but does not actively maintain or enhance them.

Roadmap

Planned Enhancements

  • Enhanced formatting support for lists (numbers, bullet points, indents) and tables
  • Configurable text extraction options (--no-headers-footers, --no-textboxes, --no-comments, --no-bullets)
  • Performance optimizations for large document processing
  • Additional error handling and recovery mechanisms
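
The planned extraction switches listed above could map onto a small options type. The following is a minimal sketch only, assuming C# 9 or later: `TextExtractionOptions` and `OptionParser` are hypothetical names introduced for illustration and are not part of the library's API. Only the four switch names come from the roadmap.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical options type; the property names are illustrative assumptions,
// not the library's actual API. Everything is included by default.
public sealed record TextExtractionOptions(
    bool IncludeHeadersFooters = true,
    bool IncludeTextboxes = true,
    bool IncludeComments = true,
    bool IncludeBullets = true);

public static class OptionParser
{
    // Maps the planned command-line switches onto the options record.
    // Arguments that are not one of the four switches are returned unchanged
    // as positional arguments (for example, input and output paths).
    public static (TextExtractionOptions Options, List<string> Positional) Parse(string[] args)
    {
        var options = new TextExtractionOptions();
        var positional = new List<string>();

        foreach (var arg in args)
        {
            switch (arg)
            {
                case "--no-headers-footers":
                    options = options with { IncludeHeadersFooters = false };
                    break;
                case "--no-textboxes":
                    options = options with { IncludeTextboxes = false };
                    break;
                case "--no-comments":
                    options = options with { IncludeComments = false };
                    break;
                case "--no-bullets":
                    options = options with { IncludeBullets = false };
                    break;
                default:
                    positional.Add(arg);
                    break;
            }
        }

        return (options, positional);
    }
}
```

With this sketch, `OptionParser.Parse(new[] { "input.doc", "--no-comments" })` returns options with `IncludeComments` set to false, the other three flags left at true, and `input.doc` as the single positional argument. An immutable record keeps the defaults in one place, so a caller that passes no switches gets the full extraction.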

This project is inspired and informed by several existing open-source implementations of the Word Binary Format:

| Name | Language | Description | Link |
|---|---|---|---|
| wvWare | C | Original GPL Word97 .doc text extractor | SourceForge |
| OnlyOffice | C++ | Proprietary editor with open-source core, includes DOC parsing | GitHub |
| Antiword | C | Lightweight Word .doc to text/postscript converter | GitHub Mirror |
| Apache POI | Java | Java API for Microsoft Documents, includes Word97 support via HWPF | Apache POI - HWPF |
| LibreOffice | C++ | Full office suite with robust support for legacy DOC files | GitHub |
| Catdoc | C | Lightweight Word .doc to text converter | GitHub Mirror |
| DocToText | C++ | Lightweight any document file to text converter | GitHub |

References

All code retained from the original .NET 2 Mono implementation ©2009 DIaLOGIKa http://www.dialogika.de/
.NET Core port and move to System.IO.Compression ©2017 Evolution https://www.evolutionjobs.com/
