fix: add UTF-8 BOM to CSV for Excel compatibility #575

Open
MoshiLL wants to merge 3 commits into funstory-ai:main from MoshiLL:Fix/csv-bom
Conversation

@MoshiLL MoshiLL commented Mar 10, 2026

PR Title

fix: add UTF-8 BOM to CSV for Excel compatibility

Related Issue(s)

Motivation and Context

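CSV files containing Chinese characters display as garbled text when opened in Windows Excel,
because Excel does not detect UTF-8 in a CSV file that lacks a byte order mark and decodes it
with the system's legacy code page instead. Prepending the UTF-8 BOM (bytes EF BB BF) lets
Excel recognize the encoding.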
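
For reference, a minimal sketch of the approach, assuming the CSV is written through Python's standard `csv` module. The function name `write_glossary_csv`, the header row, and the demo file name are hypothetical and are not taken from the babeldoc code. The `utf-8-sig` codec emits the three BOM bytes before the first character and otherwise behaves like `utf-8`.

```python
import csv
import tempfile
from pathlib import Path


def write_glossary_csv(path: Path, rows: list[tuple[str, str]]) -> None:
    """Write rows as CSV, prefixed with a UTF-8 BOM via the utf-8-sig codec."""
    # newline="" is what the csv module documentation recommends for output files.
    with path.open("w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "target"])  # hypothetical header, for illustration
        writer.writerows(rows)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "glossary_demo.csv"
        write_glossary_csv(out, [("neural network", "神经网络")])
        # The first three bytes of the file are the BOM.
        print(out.read_bytes()[:3].hex(" "))  # ef bb bf
```

Readers that open the file with `encoding="utf-8-sig"` get the text back without the BOM; readers that use plain `utf-8` will see a leading `\ufeff` character in the first cell.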

Summary of Changes

  • Added UTF-8 BOM to auto_extracted_glossary.csv output in automatic_term_extractor.py
  • Added UTF-8 BOM to auto_extracted_glossary.csv output in result_merger.py
  • Added UTF-8 BOM to auto_extracted_glossary.csv output in pdf_creater.py

PR Type

  • 🐛 Bug Fix

Breaking Changes

  • No, this PR does not introduce breaking changes.

Testing Instructions

  1. Translate a PDF with Chinese text using pdf2zh-next
  2. Find the generated .glossary.csv file
  3. Open the CSV in Windows Excel
  4. Verify Chinese characters display correctly (not garbled)
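
On a machine without Excel, the BOM itself can still be checked. The helper below is a hypothetical sketch (the name `has_utf8_bom` is not part of the project); it only confirms that the file starts with the BOM bytes and does not replace the manual Excel check above.

```python
import codecs
from pathlib import Path


def has_utf8_bom(path: Path) -> bool:
    """Return True if the file starts with the UTF-8 BOM (EF BB BF)."""
    with path.open("rb") as f:
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8
```

For example, `has_utf8_bom(Path("example.glossary.csv"))` would be expected to return `True` for a file produced by this branch; the file name here is a placeholder.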

Contributor Checklist

  • I have fully read and understood the CONTRIBUTING.md guide.
  • I have performed a self-review of my own code.
  • My changes follow the project's code style and guidelines.
  • All new and existing tests passed locally with my changes.
  • My changes generate no new warnings or errors.

Summary by cubic

Add a UTF-8 BOM to exported glossary CSVs so Windows Excel detects UTF-8 and displays Chinese correctly. Also save term_extractor_tracking.json and term_extractor_freq.json with UTF-8 BOM, and export glossary CSVs without the index.

Written for commit 3bcb1be. Summary will update on new commits.

MoshiLL added 3 commits March 10, 2026 17:20
## Problem
  CSV files with Chinese characters show garbled text when opened in Windows Excel

  ## Solution
  Add UTF-8 BOM (EF BB BF) to CSV files so Excel recognizes UTF-8 encoding

  ## Files Modified
  - babeldoc/format/pdf/document_il/midend/automatic_term_extractor.py
  - babeldoc/format/pdf/result_merger.py

  ## Testing
  - [x] CSV opens correctly in Windows Excel with Chinese characters
@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 3 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="babeldoc/format/pdf/document_il/midend/automatic_term_extractor.py">

<violation number="1" location="babeldoc/format/pdf/document_il/midend/automatic_term_extractor.py:397">
P2: JSON debug artifacts are now written with UTF-8 BOM (`utf-8-sig`), reducing interoperability and potentially breaking downstream parsers expecting standard BOM-less UTF-8 JSON.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

             )
             logger.debug(f"save translate tracking to {path}")
-            with Path(path).open("w", encoding="utf-8") as f:
+            with Path(path).open("w", encoding="utf-8-sig") as f:
@cubic-dev-ai cubic-dev-ai bot Mar 10, 2026

P2: JSON debug artifacts are now written with UTF-8 BOM (utf-8-sig), reducing interoperability and potentially breaking downstream parsers expecting standard BOM-less UTF-8 JSON.
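
To illustrate the concern, here is a small sketch using only the Python standard library. The function name and the demo file name are hypothetical. It writes JSON with `utf-8-sig`, reads it back as plain `utf-8` the way a typical consumer might, and reports whether parsing fails. Python's own `json.loads` rejects a string that begins with the BOM character; other parsers may behave differently.

```python
import json
import tempfile
from pathlib import Path


def bom_breaks_plain_utf8_read(payload: dict) -> bool:
    """Return True if a utf-8-sig JSON file fails to parse when read as plain utf-8."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "tracking_demo.json"  # hypothetical file name
        path.write_text(json.dumps(payload), encoding="utf-8-sig")
        # Decoding as plain utf-8 keeps the BOM as a leading "\ufeff" character.
        text = path.read_text(encoding="utf-8")
        try:
            json.loads(text)
        except json.JSONDecodeError:
            return True
        return False
```

Reading the same file with `encoding="utf-8-sig"` parses fine, so the breakage depends on how the consumer opens the file.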

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At babeldoc/format/pdf/document_il/midend/automatic_term_extractor.py, line 397:

<comment>JSON debug artifacts are now written with UTF-8 BOM (`utf-8-sig`), reducing interoperability and potentially breaking downstream parsers expecting standard BOM-less UTF-8 JSON.</comment>

<file context>
@@ -394,14 +394,14 @@ def procress(self, doc_il: ILDocument):
             )
             logger.debug(f"save translate tracking to {path}")
-            with Path(path).open("w", encoding="utf-8") as f:
+            with Path(path).open("w", encoding="utf-8-sig") as f:
                 f.write(tracker.to_json())
 
</file context>
Suggested change
-with Path(path).open("w", encoding="utf-8-sig") as f:
+with Path(path).open("w", encoding="utf-8") as f:
