fix: add UTF-8 BOM to CSV for Excel compatibility by MoshiLL · Pull Request #575 · funstory-ai/BabelDOC

MoshiLL · 2026-03-10T09:26:50Z

PR Title

fix: add UTF-8 BOM to CSV for Excel compatibility

Related Issue(s)

Motivation and Context

CSV files with Chinese characters display garbled text when opened in Windows Excel
because Excel does not assume UTF-8 for CSV files unless they start with a byte order mark.
Adding the UTF-8 BOM (EF BB BF) lets Excel detect the encoding and solves this problem.
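As a minimal illustration of what the BOM is: Python's standard `utf-8-sig` codec prepends the three bytes EF BB BF when encoding, while plain `utf-8` does not. The sample string is made up for illustration.

```python
import codecs

text = "术语,翻译\n"  # sample row, made up for illustration

plain = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")

# The utf-8-sig output is the plain UTF-8 bytes preceded by the BOM.
assert with_bom == codecs.BOM_UTF8 + plain
print(with_bom[:3].hex(" "))  # ef bb bf
```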

Summary of Changes

- Added UTF-8 BOM to auto_extracted_glossary.csv output in automatic_term_extractor.py
- Added UTF-8 BOM to auto_extracted_glossary.csv output in result_merger.py
- Added UTF-8 BOM to auto_extracted_glossary.csv output in pdf_creater.py
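A change of this kind usually comes down to the encoding chosen when the CSV is opened for writing. The following is a minimal standard-library sketch, not the project's actual code: `write_glossary_csv`, the column names, and the sample row are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def write_glossary_csv(path: Path, rows: list[tuple[str, str]]) -> None:
    """Hypothetical helper: write glossary rows as UTF-8 with a leading BOM."""
    # newline="" is what the csv module documentation recommends for csv.writer.
    with path.open("w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "target"])  # assumed column names
        writer.writerows(rows)


out = Path(tempfile.mkdtemp()) / "auto_extracted_glossary.csv"
write_glossary_csv(out, [("neural network", "神经网络")])

# Reading back with utf-8-sig strips the BOM again, so the header is clean.
with out.open(encoding="utf-8-sig", newline="") as f:
    header = next(csv.reader(f))
```

Readers that open the file as plain `utf-8` would instead see U+FEFF glued to the first header cell, which is worth keeping in mind for any code that consumes these CSVs.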

PR Type

🐛 Bug Fix

Breaking Changes

No, this PR does not introduce breaking changes.

Testing Instructions

1. Translate a PDF with Chinese text using pdf2zh-next
2. Find the generated .glossary.csv file
3. Open the CSV in Windows Excel
4. Verify Chinese characters display correctly (not garbled)
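When Windows Excel is not at hand, the presence of the BOM can be checked directly. This is only a partial substitute for the manual check, since it does not show how Excel renders the file. In the sketch below, `has_utf8_bom` is a hypothetical helper and not part of BabelDOC; the demo files are temporary and made up.

```python
import codecs
import tempfile
from pathlib import Path


def has_utf8_bom(path: Path) -> bool:
    """Return True if the file starts with the UTF-8 BOM bytes EF BB BF."""
    with path.open("rb") as f:
        return f.read(3) == codecs.BOM_UTF8


# Self-contained demo on two temporary files.
tmp = Path(tempfile.mkdtemp())
with_bom = tmp / "with_bom.csv"
without_bom = tmp / "without_bom.csv"
with_bom.write_text("术语\n", encoding="utf-8-sig")
without_bom.write_text("术语\n", encoding="utf-8")

print(has_utf8_bom(with_bom), has_utf8_bom(without_bom))
```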

Contributor Checklist

I have fully read and understood the CONTRIBUTING.md guide.
I have performed a self-review of my own code.
My changes follow the project's code style and guidelines
All new and existing tests passed locally with my changes
My changes generate no new warnings or errors

Summary by cubic

Add a UTF-8 BOM to exported glossary CSVs so Windows Excel detects UTF-8 and displays Chinese correctly. Also save term_extractor_tracking.json and term_extractor_freq.json with UTF-8 BOM, and export glossary CSVs without the index.

Written for commit 3bcb1be. Summary will update on new commits.

## Problem

CSV files with Chinese characters show garbled text when opened in Windows Excel

## Solution

Add UTF-8 BOM (EF BB BF) to CSV files so Excel recognizes UTF-8 encoding

## Files Modified

- babeldoc/format/pdf/document_il/midend/automatic_term_extractor.py
- babeldoc/format/pdf/result_merger.py

## Testing

- [x] CSV opens correctly in Windows Excel with Chinese characters

cubic-dev-ai

1 issue found across 3 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="babeldoc/format/pdf/document_il/midend/automatic_term_extractor.py">

<violation number="1" location="babeldoc/format/pdf/document_il/midend/automatic_term_extractor.py:397">
P2: JSON debug artifacts are now written with UTF-8 BOM (`utf-8-sig`), reducing interoperability and potentially breaking downstream parsers expecting standard BOM-less UTF-8 JSON.</violation>
</file>


cubic-dev-ai · 2026-03-10T09:29:59Z

babeldoc/format/pdf/document_il/midend/automatic_term_extractor.py

            )
            logger.debug(f"save translate tracking to {path}")
-            with Path(path).open("w", encoding="utf-8") as f:
+            with Path(path).open("w", encoding="utf-8-sig") as f:


P2: JSON debug artifacts are now written with UTF-8 BOM (utf-8-sig), reducing interoperability and potentially breaking downstream parsers expecting standard BOM-less UTF-8 JSON.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At babeldoc/format/pdf/document_il/midend/automatic_term_extractor.py, line 397:

<comment>JSON debug artifacts are now written with UTF-8 BOM (`utf-8-sig`), reducing interoperability and potentially breaking downstream parsers expecting standard BOM-less UTF-8 JSON.</comment>

<file context>
@@ -394,14 +394,14 @@ def procress(self, doc_il: ILDocument):
             )
             logger.debug(f"save translate tracking to {path}")
-            with Path(path).open("w", encoding="utf-8") as f:
+            with Path(path).open("w", encoding="utf-8-sig") as f:
                 f.write(tracker.to_json())
</file context>

Suggested change

-            with Path(path).open("w", encoding="utf-8-sig") as f:
+            with Path(path).open("w", encoding="utf-8") as f:
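The interoperability concern raised in this review can be reproduced with Python's own `json` module: a reader that decodes the file as plain UTF-8 ends up with U+FEFF at the start of the string, which `json.loads` rejects. This is a small sketch with a made-up payload, not the tracker's real output.

```python
import json

payload = json.dumps({"term": "神经网络"}, ensure_ascii=False)
with_bom = payload.encode("utf-8-sig")  # bytes as written by a utf-8-sig file

# A reader that decodes as plain UTF-8 keeps U+FEFF at the front of the
# string, and json.loads refuses such a string.
try:
    json.loads(with_bom.decode("utf-8"))
    strict_reader_ok = True
except json.JSONDecodeError:
    strict_reader_ok = False

# Decoding with utf-8-sig strips the BOM, and parsing succeeds.
parsed = json.loads(with_bom.decode("utf-8-sig"))
```

Keeping the JSON artifacts on plain `utf-8`, as suggested, avoids requiring every consumer to opt into `utf-8-sig`.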

MoshiLL added 3 commits March 10, 2026 17:20

cubic-dev-ai bot reviewed Mar 10, 2026


awwaawwa added the Planned label Mar 10, 2026

corriuneasley1-pixel approved these changes Mar 10, 2026

MoshiLL wants to merge 3 commits into funstory-ai:main from MoshiLL:Fix/csv-bom


3 participants
