Skip to content

Fix #3171 : 去掉生成绝对路径函数中的窄字符WindowsAPI调用,改用宽字符API和C++17 std::filesystem来避免win平台中文路径问题#3255

Open
XiaoBaiCanDo wants to merge 9 commits intok2-fsa:masterfrom
XiaoBaiCanDo:fix/file-utils-resolve-win-path
Open

Fix #3171 : 去掉生成绝对路径函数中的窄字符WindowsAPI调用,改用宽字符API和C++17 std::filesystem来避免win平台中文路径问题#3255
XiaoBaiCanDo wants to merge 9 commits intok2-fsa:masterfrom
XiaoBaiCanDo:fix/file-utils-resolve-win-path

Conversation

@XiaoBaiCanDo
Copy link

@XiaoBaiCanDo XiaoBaiCanDo commented Mar 3, 2026

修复 #3171

此 PR 解决了 issue #3171 中提到的win平台中文路径解析问题。

修改内容:

  • 重构 ResolveAbsolutePath 函数,使用 std::filesystem API
  • 正确处理绝对路径和相对路径的转换
  • 使用 lexically_normal() 规范化路径格式
  • 添加异常处理,避免路径转换失败时程序崩溃

测试:

  • ✅ Windows MSVC 编译通过

Fixes #3171

Summary by CodeRabbit

  • Refactor
    • Reworked low-level file handling with explicit Windows and POSIX branches for clearer cross-platform behavior.
    • Improved file-existence checks, robust file reading, and absolute-path resolution with safer fallbacks on failure.
    • Introduced cross-platform string/path conversion utilities to support OS-specific operations.
    • Centralized error handling and removed single-platform dependencies for more reliable path normalization.

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Mar 3, 2026
@coderabbitai
Copy link

coderabbitai bot commented Mar 3, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Replaces ad-hoc cross-platform file handling in sherpa-onnx/csrc/file-utils.cc with explicit Windows and POSIX implementations. Adds ToWideString/ToString forward declarations and rewrites FileExists, ReadFile, and ResolveAbsolutePath to use platform APIs with try/catch guards and safe fallbacks on failure.

Changes

Cohort / File(s) Summary
File I/O (platform-specific)
sherpa-onnx/csrc/file-utils.cc
Rewrote file operations to use explicit Windows (wide-string APIs: GetFileAttributesW, CreateFileW, GetFullPathNameW, ReadFile) and POSIX (stat, realpath, std::ifstream) branches. Added forward declarations std::wstring ToWideString(const std::string&) and std::string ToString(const std::wstring&). Wrapped OS calls in try/catch and return safe defaults on error.
Platform headers
sherpa-onnx/csrc/file-utils.cc
Added POSIX headers (sys/stat.h, unistd.h) and removed single-platform assumptions.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • Remove filesystem header #2998: Modifies sherpa-onnx/csrc/file-utils.* and replaces ResolveAbsolutePath with OS-specific logic; closely related to these platform-path changes.

Suggested labels

size:M

Suggested reviewers

  • csukuangfj

Poem

🐰 I hopped through files, wide and bright,
Turned tiny paths from wrong to right.
Wide strings danced and bytes came through,
Now models find their homes anew. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title is in Chinese and references fixing #3171 regarding Chinese path handling on Windows using wide-character APIs.
Linked Issues check ✅ Passed The PR addresses the core issue #3171 by implementing wide-character Windows API calls and proper cross-platform path handling for non-ASCII characters.
Out of Scope Changes check ✅ Passed All changes are directly related to fixing the Windows Chinese path issue and improving file-utils.cc for cross-platform support.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can make CodeRabbit's review stricter and more nitpicky using the `assertive` profile, if that's what you prefer.

Change the reviews.profile setting to assertive to make CodeRabbit's nitpick more issues in your PRs.

@gemini-code-assist
Copy link

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

此拉取请求旨在解决 Windows 平台上因窄字符 API 导致的中文路径解析问题。通过引入 C++17 的 std::filesystem 库和宽字符处理,它重构了文件路径解析和文件存在性检查的逻辑,从而显著提升了跨平台兼容性和对包含多字节字符路径的支持。

Highlights

  • 路径解析重构: ResolveAbsolutePath 函数已重构,采用 C++17 std::filesystem API,以解决 Windows 平台中文路径问题。
  • 宽字符路径支持: FileExists 和 ReadFile 函数在 Windows 平台下已更新,以支持宽字符路径,确保正确处理包含非 ASCII 字符的文件名。
  • 路径规范化: 路径处理逻辑现在包含 lexically_normal() 调用,用于规范化路径格式,提高路径解析的健壮性。
  • 异常处理: 为路径转换操作添加了异常处理机制,防止因路径解析失败导致程序崩溃。
Changelog
  • sherpa-onnx/csrc/file-utils.cc
    • 移除了 Windows 特定的 windows.h 和 POSIX 特定的 limits.h, stdlib.h 头文件。
    • 新增了 头文件引用。
    • 声明了 ToWideString 辅助函数,用于将 std::string 转换为 std::wstring。
    • 修改了 FileExists 函数,在 Windows 平台下使用 ToWideString 转换文件名并使用 std::ifstream(wide_path)。
    • 修改了 ReadFile 函数,在 Windows 平台下使用 ToWideString 转换文件名并使用 std::ifstream(wide_path)。
    • 完全重写了 ResolveAbsolutePath 函数,现在它利用 std::filesystem::path 进行路径解析、判断绝对/相对路径、规范化 (lexically_normal()),并捕获 std::filesystem::filesystem_error 异常。
Activity
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

此 PR 通过使用 C++17 std::filesystem 解决了 Windows 平台上的中文路径问题,这是一个很好的改进。代码整体上是正确的,但我有两个建议可以进一步提高代码质量:一个是在 file-utils.cc 中合并多余的命名空间声明,以增强代码清晰度;另一个是在 ResolveAbsolutePath 函数中使用 std::filesystem::u8path 来显式处理 UTF-8 路径,从而提高代码的健壮性和可移植性。

// POSIX: absolute paths start with '/'
if (path[0] == '/') {
try {
std::filesystem::path fs_path(path);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

std::filesystem::pathstd::string 构造时的编码行为是实现定义的。为了明确地将输入字符串 path 解释为 UTF-8 编码,从而避免在处理中文等非 ASCII 字符时出现潜在的编码问题,建议使用 std::filesystem::u8path()。这能让代码的意图更清晰,也更健壮。

Suggested change
std::filesystem::path fs_path(path);
std::filesystem::path fs_path = std::filesystem::u8path(path);

Comment on lines 18 to 20
} // namespace sherpa_onnx

namespace sherpa_onnx {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

这两个 namespace sherpa_onnx 块是连续的。建议将它们合并为一个,以提高代码的可读性和维护性。您可以移除此处的命名空间闭合和下一个命名空间的开始部分。

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

u8path在C++20中不是被弃用了吗?长期看是不是不该在这里用u8path

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
sherpa-onnx/csrc/file-utils.cc (1)

24-26: Use std::filesystem::path instead of std::wstring for cross-platform portability

Lines 24-26 and 41-42 pass std::wstring directly to std::ifstream. While this works on Windows, it is not a standard C++17 portable constructor. std::ifstream provides a standard portable overload for std::filesystem::path that handles encoding correctly on both Windows and POSIX systems.

♻️ Proposed refactor
-  std::wstring wide_path = ToWideString(filename);
-  std::ifstream file(wide_path);
+  std::ifstream file(std::filesystem::path(filename));
...
-  std::wstring wide_path = ToWideString(filename);
-  std::ifstream file(wide_path, std::ios::binary | std::ios::ate);
+  std::ifstream file(std::filesystem::path(filename),
+                     std::ios::binary | std::ios::ate);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/file-utils.cc` around lines 24 - 26, Replace the ad-hoc wide
string conversion and direct wstring use when opening files with
std::filesystem::path: instead of calling ToWideString(filename) to create
wide_path and passing that to std::ifstream, construct a std::filesystem::path
p(filename) and open std::ifstream file(p); do the same for the second
occurrence (lines that create wide_path2 / open file there). Add `#include`
<filesystem> and ensure the code uses std::filesystem::path so the std::ifstream
overload handles platform encoding correctly; keep the original filename
variable and remove the need for ToWideString when opening streams.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@sherpa-onnx/csrc/file-utils.cc`:
- Around line 131-137: ResolveAbsolutePath constructs std::filesystem::path from
std::string which uses the ANSI code page on Windows and breaks non-ASCII paths;
change both constructions of fs_path (the one at the initial creation and the
later creation before returning) to use std::filesystem::u8path(path) so the
path is interpreted as UTF-8 on Windows (keep using
fs_path.lexically_normal().u8string() for the return value).

---

Nitpick comments:
In `@sherpa-onnx/csrc/file-utils.cc`:
- Around line 24-26: Replace the ad-hoc wide string conversion and direct
wstring use when opening files with std::filesystem::path: instead of calling
ToWideString(filename) to create wide_path and passing that to std::ifstream,
construct a std::filesystem::path p(filename) and open std::ifstream file(p); do
the same for the second occurrence (lines that create wide_path2 / open file
there). Add `#include` <filesystem> and ensure the code uses std::filesystem::path
so the std::ifstream overload handles platform encoding correctly; keep the
original filename variable and remove the need for ToWideString when opening
streams.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a2f8d7 and 391b62a.

📒 Files selected for processing (1)
  • sherpa-onnx/csrc/file-utils.cc

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@sherpa-onnx/csrc/file-utils.cc`:
- Around line 23-34: The current try/catch blocks that wrap the path conversion
and filesystem checks only catch std::filesystem::filesystem_error but must also
catch allocation and conversion failures from ToWideString()/Utf8ToUtf32()
(e.g., std::bad_alloc and other std::exception types); update the catch clauses
in file-utils.cc (the blocks that call ToWideString and construct
std::filesystem::path and the similar blocks around the filesystem checks) to
catch std::exception (or add an additional catch(std::exception&) alongside the
existing one) so conversion and memory errors are handled gracefully; ensure the
handlers return false like the existing filesystem_error handler and apply the
same change to the other two similar blocks referenced in the review.
- Around line 57-62: tellg() can return -1 on failure; validate the returned
std::streamsize before using it to size the buffer. After calling file.tellg()
(the variable size) check that size is >= 0 (and that file.good()) and return {}
on failure; only then construct the buffer (e.g., std::vector<char>
buffer(static_cast<std::size_t>(size))). This prevents negative tellg() values
from being interpreted as huge unsigned sizes and avoids
std::bad_alloc/std::length_error when reading in the subsequent file.read call.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 391b62a and 750f6a3.

📒 Files selected for processing (1)
  • sherpa-onnx/csrc/file-utils.cc

#include "sherpa-onnx/csrc/file-utils.h"

#include <fstream>
#include <filesystem>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

你好,我们现在避免使用 filesystem 这个头文件. 详见 #2998

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

好的谢谢

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

您好,看ai审阅还有超过4GB的大文件的读取问题,需要考虑循环分块读取吗?

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
sherpa-onnx/csrc/file-utils.cc (2)

8-8: ⚠️ Potential issue | 🟠 Major

<filesystem> reintroduction conflicts with prior maintainer direction

Line 8 adds <filesystem> again, but this PR thread already records maintainer guidance to avoid it (referencing PR #2998). This is a root-level conflict that should be resolved before merge.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/file-utils.cc` at line 8, Remove the reintroduced `#include`
<filesystem> in file-utils.cc (the added include on line 8) to comply with the
prior maintainer guidance; instead implement file operations using the project's
approved approach — e.g., use POSIX APIs (<sys/stat.h>) or std::ifstream for
existence/size checks or the repository's existing FileUtils helpers — and
update any functions in file-utils.cc that currently rely on std::filesystem to
call the replacement helpers so the file no longer depends on <filesystem>.

54-59: ⚠️ Potential issue | 🟠 Major

Validate tellg() before allocating the read buffer

Line 54 may return a failure sentinel (negative), and Line 57 uses it directly as vector size. That can trigger huge allocation attempts via signed→unsigned conversion before the exception path returns {}.

Suggested patch
-    std::streamsize size = file.tellg();
+    std::streamsize size = file.tellg();
+    if (size < 0) {
+      return {};
+    }
     file.seekg(0, std::ios::beg);

-    std::vector<char> buffer(size);
-    if (!file.read(buffer.data(), size)) {
+    std::vector<char> buffer(static_cast<std::size_t>(size));
+    if (size > 0 && !file.read(buffer.data(), size)) {
       return {};
     }
#!/bin/bash
# Verify that ReadFile() has a guard for negative tellg() before vector allocation.
# Expected: a check like `if (size < 0) return {};` appears between tellg() and buffer construction.

rg -n -C3 'std::streamsize size = file\.tellg\(\)|std::vector<char> buffer\(' sherpa-onnx/csrc/file-utils.cc
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/file-utils.cc` around lines 54 - 59, The code uses
std::streamsize size = file.tellg() then constructs std::vector<char>
buffer(size) without validating size; if tellg() failed it can be negative and
cause a huge unsigned allocation. In the ReadFile implementation in
file-utils.cc, add a guard immediately after the tellg() call to check if size <
0 (or file.fail()) and return {} on failure, then safely cast size to
std::size_t when creating std::vector<char> buffer; keep the existing seekg/read
logic (use buffer.data(), static_cast<std::size_t>(size)) so you don’t attempt
an allocation when tellg() fails.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@sherpa-onnx/csrc/file-utils.cc`:
- Line 8: Remove the reintroduced `#include` <filesystem> in file-utils.cc (the
added include on line 8) to comply with the prior maintainer guidance; instead
implement file operations using the project's approved approach — e.g., use
POSIX APIs (<sys/stat.h>) or std::ifstream for existence/size checks or the
repository's existing FileUtils helpers — and update any functions in
file-utils.cc that currently rely on std::filesystem to call the replacement
helpers so the file no longer depends on <filesystem>.
- Around line 54-59: The code uses std::streamsize size = file.tellg() then
constructs std::vector<char> buffer(size) without validating size; if tellg()
failed it can be negative and cause a huge unsigned allocation. In the ReadFile
implementation in file-utils.cc, add a guard immediately after the tellg() call
to check if size < 0 (or file.fail()) and return {} on failure, then safely cast
size to std::size_t when creating std::vector<char> buffer; keep the existing
seekg/read logic (use buffer.data(), static_cast<std::size_t>(size)) so you
don’t attempt an allocation when tellg() fails.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 750f6a3 and 3e6f259.

📒 Files selected for processing (1)
  • sherpa-onnx/csrc/file-utils.cc

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Mar 5, 2026
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
sherpa-onnx/csrc/file-utils.cc (1)

92-97: ⚠️ Potential issue | 🟠 Major

Guard tellg() failure before using it as vector size.

At Line 92, tellg() can return a failure sentinel; at Line 95 that value is used directly as std::vector<char> size. Please validate size before allocation and cast explicitly.

Proposed patch
-    std::streamsize size = file.tellg();
-    file.seekg(0, std::ios::beg);
-
-    std::vector<char> buffer(size);
-    if (!file.read(buffer.data(), size)) {
+    std::streamsize size = file.tellg();
+    if (size < 0 || !file.good()) {
+      return {};
+    }
+
+    file.seekg(0, std::ios::beg);
+    if (!file.good()) {
+      return {};
+    }
+
+    std::vector<char> buffer(static_cast<size_t>(size));
+    if (size > 0 && !file.read(buffer.data(), size)) {
       return {};
     }
#!/bin/bash
# Verify tellg() result is validated before vector allocation in POSIX branch.
nl -ba sherpa-onnx/csrc/file-utils.cc | sed -n '88,102p'
# Expected: an explicit guard like `if (size < 0)` before constructing std::vector<char>.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/file-utils.cc` around lines 92 - 97, tellg() can return a
failure sentinel which is currently used directly as the std::vector size; guard
the result before allocation by checking the std::streamsize variable size for
failure (e.g., if (size < 0) return {};), then seek back with file.seekg(0,
std::ios::beg) and allocate the buffer using an explicit cast (e.g.,
std::vector<char> buffer(static_cast<std::size_t>(size));) and keep the existing
file.read(buffer.data(), size) call; refer to the tellg() call, the variable
size, the std::vector<char> buffer allocation, and file.read(...) in
file-utils.cc when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@sherpa-onnx/csrc/file-utils.cc`:
- Around line 73-81: The code currently allocates buffer based on
file_size.QuadPart and calls ReadFile with a DWORD length which will overflow
for files > MAXDWORD; add a guard after GetFileSizeEx to check if
file_size.QuadPart > MAXDWORD and return {} (or handle error) before allocating
the std::vector and before casting to DWORD, ensuring ReadFile is only called
with a safe static_cast<DWORD>(buffer.size()) and avoiding silent truncation;
reference GetFileSizeEx, file_size, buffer, ReadFile and MAXDWORD when making
this change.

---

Duplicate comments:
In `@sherpa-onnx/csrc/file-utils.cc`:
- Around line 92-97: tellg() can return a failure sentinel which is currently
used directly as the std::vector size; guard the result before allocation by
checking the std::streamsize variable size for failure (e.g., if (size < 0)
return {};), then seek back with file.seekg(0, std::ios::beg) and allocate the
buffer using an explicit cast (e.g., std::vector<char>
buffer(static_cast<std::size_t>(size));) and keep the existing
file.read(buffer.data(), size) call; refer to the tellg() call, the variable
size, the std::vector<char> buffer allocation, and file.read(...) in
file-utils.cc when making the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b4695a0e-bf49-4252-8f85-548cd579f44c

📥 Commits

Reviewing files that changed from the base of the PR and between 3e6f259 and 25c5a4e.

📒 Files selected for processing (1)
  • sherpa-onnx/csrc/file-utils.cc

Comment on lines +73 to +81
if (!GetFileSizeEx(hFile, &file_size) || file_size.QuadPart > SIZE_MAX) {
return {};
}

std::vector<char> buffer(static_cast<size_t>(file_size.QuadPart));

DWORD bytes_read = 0;
if (!::ReadFile(hFile, buffer.data(), static_cast<DWORD>(buffer.size()), &bytes_read, nullptr) ||
bytes_read != buffer.size()) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Examine the file around lines 73-81
nl -ba sherpa-onnx/csrc/file-utils.cc | sed -n '65,90p'

Repository: k2-fsa/sherpa-onnx

Length of output: 102


🏁 Script executed:

#!/bin/bash
# Examine the file around lines 73-81 using cat
cat -n sherpa-onnx/csrc/file-utils.cc | sed -n '65,90p'

Repository: k2-fsa/sherpa-onnx

Length of output: 939


🏁 Script executed:

#!/bin/bash
# Look at the start of the file for includes and to understand context
head -n 64 sherpa-onnx/csrc/file-utils.cc

Repository: k2-fsa/sherpa-onnx

Length of output: 1509


🏁 Script executed:

#!/bin/bash
# Check the actual definitions of these constants in Windows
python3 << 'EOF'
# On 64-bit systems:
SIZE_MAX_64 = (1 << 64) - 1  # 2^64 - 1
MAXDWORD = (1 << 32) - 1     # 2^32 - 1 = 4,294,967,295

print(f"SIZE_MAX on 64-bit: {SIZE_MAX_64:,} bytes ({SIZE_MAX_64 / (1024**4):.1f} TB)")
print(f"MAXDWORD (ReadFile limit): {MAXDWORD:,} bytes ({MAXDWORD / (1024**3):.2f} GB)")
print(f"\nIf a file is 5GB:")
print(f"  - Passes SIZE_MAX check: YES")
print(f"  - Allocated buffer size: 5GB")
print(f"  - ReadFile call truncates to: {MAXDWORD:,} bytes")
print(f"  - bytes_read: {MAXDWORD:,} bytes")
print(f"  - buffer.size(): 5,368,709,120 bytes")
print(f"  - bytes_read != buffer.size(): TRUE (causes return {})")
print(f"\nResult: File silently fails to read, appearing as read failure")
EOF

Repository: k2-fsa/sherpa-onnx

Length of output: 281


🏁 Script executed:

#!/bin/bash
# Check the exact definitions of these constants in Windows
python3 << 'EOF'
# On 64-bit systems:
SIZE_MAX_64 = (1 << 64) - 1  # 2^64 - 1
MAXDWORD = (1 << 32) - 1     # 2^32 - 1 = 4,294,967,295

print(f"SIZE_MAX on 64-bit: {SIZE_MAX_64:,} bytes ({SIZE_MAX_64 / (1024**4):.1f} TB)")
print(f"MAXDWORD (ReadFile limit): {MAXDWORD:,} bytes ({MAXDWORD / (1024**3):.2f} GB)")
print(f"\nIf a file is 5GB:")
print(f"  - Passes SIZE_MAX check: YES")
print(f"  - Allocated buffer size: 5GB")
print(f"  - ReadFile call truncates to: {MAXDWORD:,} bytes")
print(f"  - bytes_read: {MAXDWORD:,} bytes")
print(f"  - buffer.size(): 5,368,709,120 bytes")
print(f"  - Comparison bytes_read != buffer.size(): TRUE")
print(f"\nResult: File silently fails to read, appearing as read failure")
EOF

Repository: k2-fsa/sherpa-onnx

Length of output: 508


Guard against ReadFile DWORD byte-count limit for large files.

Files larger than 4 GB (MAXDWORD) will silently fail to read. At line 80, ReadFile can read at most ~4.3 GB per call, but the code allocates buffers up to SIZE_MAX (~16 exabytes on 64-bit systems) without checking this limit. For files between 4 GB and SIZE_MAX, the cast static_cast<DWORD>(buffer.size()) truncates silently, bytes_read will be less than buffer.size(), and the check at line 81 will reject the file.

Add a guard to reject files exceeding MAXDWORD before allocation:

Proposed fix
-    if (!GetFileSizeEx(hFile, &file_size) || file_size.QuadPart > SIZE_MAX) {
+    if (!GetFileSizeEx(hFile, &file_size) || file_size.QuadPart < 0 ||
+        file_size.QuadPart > static_cast<LONGLONG>(SIZE_MAX) ||
+        file_size.QuadPart > static_cast<LONGLONG>(MAXDWORD)) {
       return {};
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/file-utils.cc` around lines 73 - 81, The code currently
allocates buffer based on file_size.QuadPart and calls ReadFile with a DWORD
length which will overflow for files > MAXDWORD; add a guard after GetFileSizeEx
to check if file_size.QuadPart > MAXDWORD and return {} (or handle error) before
allocating the std::vector and before casting to DWORD, ensuring ReadFile is
only called with a safe static_cast<DWORD>(buffer.size()) and avoiding silent
truncation; reference GetFileSizeEx, file_size, buffer, ReadFile and MAXDWORD
when making this change.

struct stat file_stat;
return stat(filename.c_str(), &file_stat) == 0 && S_ISREG(file_stat.st_mode);
#endif
} catch (const std::exception&) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe which function can throw in the code?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

谢谢,确实触发概率太低,我将去掉

std::vector<char> buffer(static_cast<size_t>(file_size.QuadPart));

DWORD bytes_read = 0;
if (!::ReadFile(hFile, buffer.data(), static_cast<DWORD>(buffer.size()), &bytes_read, nullptr) ||
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition is toooooo complicated. Please use multilines of code to simplify it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

好滴

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

该 PR 旨在修复 Windows 平台下包含中文/非 ASCII 路径时模型文件解析/加载失败的问题(#3171),通过将部分文件与路径相关操作切换到宽字符 Windows API,并改进跨平台文件存在性检测与读取逻辑。

Changes:

  • Windows:FileExists/ReadFile/ResolveAbsolutePath 改用宽字符 Win32 API(GetFileAttributesW/CreateFileW/GetFullPathNameW)。
  • POSIX:FileExists 改用 stat()ResolveAbsolutePath 使用 realpath()
  • 为上述路径/文件操作增加异常捕获并在失败时返回安全兜底值(如 {} 或原始 path)。

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 172 to +205
std::string ResolveAbsolutePath(const std::string &path) {
if (path.empty()) {
return path;
}

try {
#ifdef _WIN32
// Check if path is already absolute (drive letter or UNC path)
if ((path.size() > 1 && path[1] == ':') ||
(path.size() > 1 && path[0] == '\\' && path[1] == '\\')) {
return path;
}

char buffer[MAX_PATH];
if (GetFullPathNameA(path.c_str(), MAX_PATH, buffer, nullptr)) {
return std::string(buffer);
}

return path; // fallback on failure

std::wstring wide_path = ToWideString(path);
DWORD required_size = GetFullPathNameW(wide_path.c_str(), 0, nullptr, nullptr);
if (required_size == 0) {
return path;
}

std::vector<wchar_t> buffer(required_size);
DWORD actual_size = GetFullPathNameW(
wide_path.c_str(),
required_size,
buffer.data(),
nullptr
);

if (actual_size == 0 || actual_size >= required_size) {
return path;
}

std::wstring resolved_wide(buffer.data(), actual_size);
return ToString(resolved_wide);
#else
// POSIX: absolute paths start with '/'
if (path[0] == '/') {
char resolved_path[PATH_MAX];
if (realpath(path.c_str(), resolved_path) == nullptr) {
return path;
}
return std::string(resolved_path);
#endif
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description mentions refactoring ResolveAbsolutePath to use std::filesystem and lexically_normal(), but the implementation still uses GetFullPathNameW/realpath and does not do lexical normalization. Either update the PR description to match the actual approach, or implement the described std::filesystem-based normalization.

Copilot uses AI. Check for mistakes.
Comment on lines +34 to +38
struct stat file_stat;
return stat(filename.c_str(), &file_stat) == 0 && S_ISREG(file_stat.st_mode);
#endif
} catch (const std::exception&) {
return false;
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file now catches std::exception, but it does not include <exception>. Add the direct standard header to avoid relying on transitive includes.

Copilot uses AI. Check for mistakes.
Comment on lines 22 to +25

namespace sherpa_onnx {
std::wstring ToWideString(const std::string &s);
std::string ToString(const std::wstring &s);
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ToWideString/ToString are forward-declared here but defined in text-utils.h/.cc. Prefer including the header instead of duplicating declarations so the dependency is explicit and signatures stay in sync.

Suggested change
namespace sherpa_onnx {
std::wstring ToWideString(const std::string &s);
std::string ToString(const std::wstring &s);
#include "sherpa-onnx/csrc/text-utils.h"
namespace sherpa_onnx {

Copilot uses AI. Check for mistakes.
Comment on lines +79 to +84
DWORD bytes_read = 0;
if (!::ReadFile(hFile, buffer.data(), static_cast<DWORD>(buffer.size()), &bytes_read, nullptr) ||
bytes_read != buffer.size()) {
return {};
}

Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Windows ReadFile casts buffer.size() to DWORD for the ReadFile() call. For files >4GiB this overflows / truncates the requested read size; the function will then fail even though SIZE_MAX is larger. Consider rejecting sizes > DWORD max explicitly or reading in a loop/chunks.

Suggested change
DWORD bytes_read = 0;
if (!::ReadFile(hFile, buffer.data(), static_cast<DWORD>(buffer.size()), &bytes_read, nullptr) ||
bytes_read != buffer.size()) {
return {};
}
size_t total_read = 0;
while (total_read < buffer.size()) {
size_t remaining = buffer.size() - total_read;
DWORD to_read = remaining > static_cast<size_t>(MAXDWORD)
? MAXDWORD
: static_cast<DWORD>(remaining);
DWORD bytes_read = 0;
if (!::ReadFile(hFile, buffer.data() + total_read, to_read, &bytes_read, nullptr)) {
return {};
}
if (bytes_read == 0) {
// Unexpected end of file
return {};
}
total_read += bytes_read;
}

Copilot uses AI. Check for mistakes.
Comment on lines +92 to +97
std::streamsize size = file.tellg();
file.seekg(0, std::ios::beg);

std::vector<char> buffer(size);
if (!file.read(buffer.data(), size)) {
std::vector<char> buffer(size);
if (!file.read(buffer.data(), size)) {
return {};
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the non-Windows ReadFile path, tellg() can return -1 on failure; constructing std::vector<char> buffer(size) would then attempt a huge allocation due to signed-to-unsigned conversion. Check that size >= 0 (and that the stream is in a good state) before allocating.

Copilot uses AI. Check for mistakes.
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
sherpa-onnx/csrc/file-utils.cc (2)

94-99: ⚠️ Potential issue | 🟡 Minor

Validate tellg() result before using it as vector length.

At Line 94 and Line 97, tellg() failure (-1) is not checked before vector sizing. This can trigger a huge implicit unsigned conversion and exception path instead of a clean early return.

Proposed fix
-    std::streamsize size = file.tellg();
+    std::streamsize size = file.tellg();
+    if (size < 0) {
+      return {};
+    }
     file.seekg(0, std::ios::beg);
 
-    std::vector<char> buffer(size);
-    if (!file.read(buffer.data(), size)) {
+    std::vector<char> buffer(static_cast<size_t>(size));
+    if (size > 0 && !file.read(buffer.data(), size)) {
       return {};
     }
#!/bin/bash
# Verify tellg() guard and safe vector sizing in POSIX branch.
rg -n -C3 'tellg\(|size < 0|vector<char> buffer\(|file\.read\(' sherpa-onnx/csrc/file-utils.cc
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/file-utils.cc` around lines 94 - 99, The code uses
std::streamsize size = file.tellg() and then constructs std::vector<char>
buffer(size) without validating tellg(); first check that size != -1 (or size >=
0) after calling tellg() and return {} early on failure, then cast size to a
safe size_t when sizing the vector (e.g., static_cast<size_t>(size)); ensure the
validation happens before calling file.seekg and before constructing buffer and
before file.read, and keep references to tellg(), size, buffer
(std::vector<char> buffer(...)), and file.read(...) to locate and update the
logic.

68-84: ⚠️ Potential issue | 🟠 Major

Guard file size against Win32 DWORD read limit before allocation.

At Line 69/Line 79, size is bounded by SIZE_MAX but ReadFile still accepts only DWORD. Files larger than MAXDWORD can be mis-sized/truncated at the call boundary after allocating a large buffer.

Proposed fix
-    if (!GetFileSizeEx(hFile, &file_size) || file_size.QuadPart > SIZE_MAX) {
+    if (!GetFileSizeEx(hFile, &file_size) || file_size.QuadPart < 0 ||
+        file_size.QuadPart > static_cast<LONGLONG>(SIZE_MAX) ||
+        file_size.QuadPart > static_cast<LONGLONG>(MAXDWORD)) {
       return {};
     }
#!/bin/bash
# Verify whether the Windows branch guards MAXDWORD before ReadFile.
rg -n -C3 'GetFileSizeEx|MAXDWORD|ReadFile\(' sherpa-onnx/csrc/file-utils.cc
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/file-utils.cc` around lines 68 - 84, The code checks
file_size against SIZE_MAX but then passes buffer.size() to ReadFile which
accepts a DWORD; before allocating or calling ReadFile in the
GetFileSizeEx/ReadFile path, additionally guard that file_size.QuadPart <=
MAXDWORD (or handle large files by reading in chunks) so you never cast a size >
MAXDWORD into a DWORD; update the logic around file_size, buffer allocation
(vector<char> buffer(...)), and the ReadFile call (and the bytes_read
validation) to either return an error for files > MAXDWORD or implement chunked
reads for large files.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@sherpa-onnx/csrc/file-utils.cc`:
- Around line 94-99: The code uses std::streamsize size = file.tellg() and then
constructs std::vector<char> buffer(size) without validating tellg(); first
check that size != -1 (or size >= 0) after calling tellg() and return {} early
on failure, then cast size to a safe size_t when sizing the vector (e.g.,
static_cast<size_t>(size)); ensure the validation happens before calling
file.seekg and before constructing buffer and before file.read, and keep
references to tellg(), size, buffer (std::vector<char> buffer(...)), and
file.read(...) to locate and update the logic.
- Around line 68-84: The code checks file_size against SIZE_MAX but then passes
buffer.size() to ReadFile which accepts a DWORD; before allocating or calling
ReadFile in the GetFileSizeEx/ReadFile path, additionally guard that
file_size.QuadPart <= MAXDWORD (or handle large files by reading in chunks) so
you never cast a size > MAXDWORD into a DWORD; update the logic around
file_size, buffer allocation (vector<char> buffer(...)), and the ReadFile call
(and the bytes_read validation) to either return an error for files > MAXDWORD
or implement chunked reads for large files.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 66299578-a90c-464a-9882-f141d4ae609b

📥 Commits

Reviewing files that changed from the base of the PR and between 25c5a4e and 43db5d0.

📒 Files selected for processing (1)
  • sherpa-onnx/csrc/file-utils.cc

XiaoBaiCanDo and others added 3 commits March 17, 2026 17:14
包含 <stdlib.h>支持POSIX realpath()的使用

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug] Windows平台中文路径模型加载失败,希望支持非 ASCII 路径

3 participants