luismanez
Contributor

Motivation and Context (Why the change? What's the scenario?)

This should fix issue #677.
Although .doc is a legacy format, many organizations still keep much of their knowledge base in .doc files. Supporting the format in Kernel Memory would make that knowledge available to ChatGPT-style solutions.

High level description (Approach, Design)

This feature uses the existing open source library NPOI (https://github.com/nissl-lab/npoi), which seems to be the best option for handling legacy Office documents (not only Word, but also Excel files). I use the library to extract the full content of the .doc file as plain text.

I've also created a test for the new decoder and updated the existing 002-dotnet-Serverless sample to index a legacy .doc file and retrieve an answer from it.

Note: to support some old encodings (like Windows-1255), I had to add the System.Text.Encoding.CodePages package and register its encoding provider in the decoder constructor.
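For reviewers unfamiliar with code page encodings on modern .NET, here is a minimal sketch of what the registration does. It is illustrative only and is not the decoder code from this PR: the sample bytes and the variable names are assumptions made for the example. Without the `RegisterProvider` call, `Encoding.GetEncoding(1255)` throws `NotSupportedException` on .NET Core and later.

```csharp
using System;
using System.Text;

// Make the legacy Windows code pages available to Encoding.GetEncoding.
// Registering the provider more than once is harmless.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Windows-1255 is the legacy Hebrew code page.
Encoding windows1255 = Encoding.GetEncoding(1255);

// Hypothetical sample input: four bytes in the Hebrew letter range of Windows-1255.
byte[] raw = { 0xF9, 0xEC, 0xE5, 0xED };

// Decode the bytes to a .NET string using the legacy encoding.
string text = windows1255.GetString(raw);
Console.WriteLine(text);
```

Doing the registration in the constructor means it happens before any document is parsed, so the library can resolve these code pages when it encounters them in a file.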

@luismanez luismanez requested a review from dluc as a code owner June 25, 2025 20:34
@luismanez
Contributor Author

Is this repo still alive @dluc ?? 😄
