Adding support for legacy Word .doc files #1074
Open
Motivation and Context (Why the change? What's the scenario?)
This should fix issue #677.
Although .doc files are legacy, many organizations still keep much of their knowledge base in them. It'd be great to support that format in Kernel Memory, so that this knowledge is available to a ChatGPT-style solution.
High level description (Approach, Design)
This feature uses the existing open source library NPOI (https://github.com/nissl-lab/npoi), which seems to be the best solution for dealing with legacy Office documents (not only Word, but also Excel files). I use the library to extract all the text from the .doc file as plain text. I've also created a test for the new decoder, and updated the existing 002-dotnet-Serverless sample to index a legacy .doc file and retrieve an answer from it.
Note: to support some old encodings (like Windows-1255), I had to add the package System.Text.Encoding.CodePages and load the encoding in the decoder constructor.
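For reviewers unfamiliar with code-page encodings on modern .NET, here is a minimal sketch of what loading such an encoding typically looks like. It is not the code from this PR: the class and method names (`LegacyEncodingExample`, `GetWindows1255`) are hypothetical, and it assumes the encoding is made available through `CodePagesEncodingProvider`, which is the standard mechanism exposed by System.Text.Encoding.CodePages.

```csharp
using System.Text;

// Hypothetical helper for illustration only; the decoder in this PR may be structured differently.
public static class LegacyEncodingExample
{
    public static Encoding GetWindows1255()
    {
        // On .NET Core / .NET 5+, legacy code pages such as 1255 are not available
        // until the code-page provider is registered. Registering more than once is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Look up the Windows-1255 (Hebrew) encoding by its code page number.
        return Encoding.GetEncoding(1255);
    }
}
```

Doing the registration in a constructor, as the note describes, means the provider is in place before any document is read. Without it, asking for code page 1255 on modern .NET throws a `NotSupportedException`.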