Adding support for legacy Word .doc files #1074

luismanez · 2025-06-25T20:34:03Z

Motivation and Context (Why the change? What's the scenario?)

This should fix issue #677.
Although .doc is a legacy format, many organizations still keep a large part of their knowledge base in .doc files. Supporting the format in Kernel Memory would make that knowledge available to ChatGPT-style solutions.

High level description (Approach, Design)

This feature uses the existing open source library NPOI (https://github.com/nissl-lab/npoi), which seems to be the best option for handling legacy Office documents (Excel files as well as Word). I use the library to extract all the text from the .doc file as plain text.

I've also created a test for the new decoder, and updated the existing 002-dotnet-Serverless sample to index a legacy .doc file and retrieve an answer from it.

Note: to support some older encodings (such as Windows-1255), I had to add the System.Text.Encoding.CodePages package and load the encoding in the decoder constructor.
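
For readers unfamiliar with this step, here is a minimal sketch of how legacy code pages are typically made available in .NET. It is not the code from this PR: it only shows the standard `Encoding.RegisterProvider` call from System.Text.Encoding.CodePages, and assumes the decoder constructor does something along these lines.

```csharp
using System;
using System.Text;

// Register the code-pages provider once, before any legacy encoding is requested.
// Without this, .NET (Core) only knows a small set of encodings (UTF-8, UTF-16, ASCII, ...).
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// After registration, legacy encodings such as Windows-1255 can be resolved by name.
Encoding hebrew = Encoding.GetEncoding("windows-1255");

Console.WriteLine(hebrew.CodePage); // prints 1255
```

Registering the provider more than once is harmless, so doing it in a constructor (as the note describes) is safe, if slightly redundant when many decoder instances are created.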

luismanez · 2025-08-06T14:15:49Z

Is this repo still alive @dluc ?? 😄

Commit 847c49d: Adding support for legacy Word .doc files (using NPOI opensource library)

luismanez requested a review from dluc as a code owner June 25, 2025 20:34
