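Fix #3171: Remove narrow-character Windows API calls from the absolute-path function and switch to wide-character APIs and C++17 std::filesystem to avoid Chinese-path problems on Windows by XiaoBaiCanDo · Pull Request #3255 · k2-fsa/sherpa-onnx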

XiaoBaiCanDo · 2026-03-03T09:07:12Z

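Fixes #3171

This PR resolves the Chinese path resolution problem on Windows described in issue #3171.

Changes:

Refactor the ResolveAbsolutePath function to use the std::filesystem API
Correctly handle conversion of absolute and relative paths
Normalize the path format with lexically_normal()
Add exception handling so that a failed path conversion does not crash the program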
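
As a rough illustration of the approach this description lists, the following is a minimal sketch that assumes a C++17 toolchain with `<filesystem>` available. `ResolveAbsolutePathSketch` is a hypothetical name, not the function in this PR, and the thread later moves away from `<filesystem>` at the maintainer's request.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper, not the code from this PR.
// Returns an absolute, lexically normalized form of `path`,
// or `path` unchanged if the conversion fails.
std::string ResolveAbsolutePathSketch(const std::string &path) {
  if (path.empty()) {
    return path;
  }

  try {
    std::filesystem::path p(path);
    if (p.is_relative()) {
      // absolute() resolves the path against the current working directory.
      p = std::filesystem::absolute(p);
    }
    // lexically_normal() collapses "." and ".." without touching the disk.
    return p.lexically_normal().string();
  } catch (const std::filesystem::filesystem_error &) {
    return path;  // fall back to the input instead of crashing
  }
}
```

Note that constructing the path from a plain `std::string` uses the native narrow encoding, which is the very point the reviewers raise further down the thread.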

Testing:

✅ Compiles with Windows MSVC

Fixes #3171

Summary by CodeRabbit

Refactor
- Reworked low-level file handling with explicit Windows and POSIX branches for clearer cross-platform behavior.
- Improved file-existence checks, robust file reading, and absolute-path resolution with safer fallbacks on failure.
- Introduced cross-platform string/path conversion utilities to support OS-specific operations.
- Centralized error handling and removed single-platform dependencies for more reliable path normalization.

…wide-character adaptation for the Windows platform

coderabbitai · 2026-03-03T09:07:34Z

Walkthrough

Replaces ad-hoc cross-platform file handling in sherpa-onnx/csrc/file-utils.cc with explicit Windows and POSIX implementations. Adds ToWideString/ToString forward declarations and rewrites FileExists, ReadFile, and ResolveAbsolutePath to use platform APIs with try/catch guards and safe fallbacks on failure.

Changes

Cohort / File(s)	Summary
File I/O (platform-specific) `sherpa-onnx/csrc/file-utils.cc`	Rewrote file operations to use explicit Windows (wide-string APIs: `GetFileAttributesW`, `CreateFileW`, `GetFullPathNameW`, `ReadFile`) and POSIX (`stat`, `realpath`, `std::ifstream`) branches. Added forward declarations `std::wstring ToWideString(const std::string&)` and `std::string ToString(const std::wstring&)`. Wrapped OS calls in try/catch and return safe defaults on error.
Platform headers `sherpa-onnx/csrc/file-utils.cc`	Added POSIX headers (`sys/stat.h`, `unistd.h`) and removed single-platform assumptions.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Remove filesystem header #2998: Modifies sherpa-onnx/csrc/file-utils.* and replaces ResolveAbsolutePath with OS-specific logic; closely related to these platform-path changes.

Suggested labels

size:M

Suggested reviewers

csukuangfj

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title is in Chinese and references fixing `#3171` regarding Chinese path handling on Windows using wide-character APIs.
Linked Issues check	✅ Passed	The PR addresses the core issue `#3171` by implementing wide-character Windows API calls and proper cross-platform path handling for non-ASCII characters.
Out of Scope Changes check	✅ Passed	All changes are directly related to fixing the Windows Chinese path issue and improving file-utils.cc for cross-platform support.

gemini-code-assist · 2026-03-03T09:07:35Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

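This pull request aims to resolve Chinese path resolution problems on Windows caused by narrow-character APIs. By introducing the C++17 std::filesystem library and wide-character handling, it refactors the file path resolution and file existence check logic, significantly improving cross-platform compatibility and support for paths containing multibyte characters.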

Highlights

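Path resolution refactor: The ResolveAbsolutePath function has been refactored to use the C++17 std::filesystem API to resolve Chinese path problems on Windows.
Wide-character path support: The FileExists and ReadFile functions have been updated on Windows to support wide-character paths, so that file names containing non-ASCII characters are handled correctly.
Path normalization: The path handling logic now calls lexically_normal() to normalize the path format, making path resolution more robust.
Exception handling: Exception handling has been added to path conversion operations to prevent crashes when path resolution fails.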

Changelog

sherpa-onnx/csrc/file-utils.cc
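- Removed the Windows-specific windows.h header and the POSIX-specific limits.h and stdlib.h headers.
- Added a new header include.
- Declared the ToWideString helper function, which converts std::string to std::wstring.
- Modified FileExists to convert the file name with ToWideString on Windows and open it with std::ifstream(wide_path).
- Modified ReadFile to convert the file name with ToWideString on Windows and open it with std::ifstream(wide_path).
- Completely rewrote ResolveAbsolutePath, which now uses std::filesystem::path to parse the path, distinguish absolute from relative paths, normalize the result (lexically_normal()), and catch std::filesystem::filesystem_error exceptions.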

Activity

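Resolves the Chinese path resolution problem on Windows reported in issue "[bug] Model loading fails for Chinese paths on Windows; please support non-ASCII paths" #3171.
Passed a compilation test in a Windows MSVC environment.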

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

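This PR fixes the Chinese path problem on Windows by using C++17 std::filesystem, which is a good improvement. The code is correct overall, but I have two suggestions to further improve code quality: first, merge the redundant namespace declarations in file-utils.cc for clarity; second, use std::filesystem::u8path in ResolveAbsolutePath to handle UTF-8 paths explicitly, which improves robustness and portability.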

gemini-code-assist · 2026-03-03T09:11:01Z

sherpa-onnx/csrc/file-utils.cc

-  // POSIX: absolute paths start with '/'
-  if (path[0] == '/') {
+  try {
+    std::filesystem::path fs_path(path);


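The encoding assumed when constructing std::filesystem::path from a std::string is implementation-defined. To interpret the input string path explicitly as UTF-8, and so avoid potential encoding problems with non-ASCII characters such as Chinese, consider using std::filesystem::u8path(). This makes the intent clearer and the code more robust.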

Suggested change

-    std::filesystem::path fs_path(path);
+    std::filesystem::path fs_path = std::filesystem::u8path(path);

gemini-code-assist · 2026-03-03T09:11:02Z

sherpa-onnx/csrc/file-utils.cc

+}  // namespace sherpa_onnx
+
 namespace sherpa_onnx {


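These two namespace sherpa_onnx blocks are consecutive. Consider merging them into one to improve readability and maintainability. You can remove the namespace closing brace here and the opening of the next namespace.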

Wasn't u8path deprecated in C++20? In the long run, shouldn't we avoid using u8path here?
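
One possible way to reconcile the two positions, sketched here under the assumption that `<filesystem>` is acceptable at all, is to pick the construction based on whether `char8_t` library support is present. `PathFromUtf8` is a hypothetical helper name introduced only for illustration.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper: build a path from a UTF-8 encoded std::string.
std::filesystem::path PathFromUtf8(const std::string &utf8) {
#if defined(__cpp_lib_char8_t)
  // With char8_t support (C++20), a char8_t source is always treated as
  // UTF-8, so the deprecated u8path() is not needed.
  return std::filesystem::path(std::u8string(utf8.begin(), utf8.end()));
#else
  // C++17: u8path() interprets the bytes as UTF-8.
  return std::filesystem::u8path(utf8);
#endif
}
```

The trade-off is an extra copy of the string in the `char8_t` branch in exchange for not calling an API that C++20 marks as deprecated.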

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

sherpa-onnx/csrc/file-utils.cc (1)

24-26: Use std::filesystem::path instead of std::wstring for cross-platform portability

Lines 24-26 and 41-42 pass std::wstring directly to std::ifstream. While this works on Windows, it is not a standard C++17 portable constructor. std::ifstream provides a standard portable overload for std::filesystem::path that handles encoding correctly on both Windows and POSIX systems.

♻️ Proposed refactor

-  std::wstring wide_path = ToWideString(filename);
-  std::ifstream file(wide_path);
+  std::ifstream file(std::filesystem::path(filename));
...
-  std::wstring wide_path = ToWideString(filename);
-  std::ifstream file(wide_path, std::ios::binary | std::ios::ate);
+  std::ifstream file(std::filesystem::path(filename),
+                     std::ios::binary | std::ios::ate);

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/file-utils.cc` around lines 24 - 26, Replace the ad-hoc wide
string conversion and direct wstring use when opening files with
std::filesystem::path: instead of calling ToWideString(filename) to create
wide_path and passing that to std::ifstream, construct a std::filesystem::path
p(filename) and open std::ifstream file(p); do the same for the second
occurrence (lines that create wide_path2 / open file there). Add `#include`
<filesystem> and ensure the code uses std::filesystem::path so the std::ifstream
overload handles platform encoding correctly; keep the original filename
variable and remove the need for ToWideString when opening streams.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@sherpa-onnx/csrc/file-utils.cc`:
- Around line 131-137: ResolveAbsolutePath constructs std::filesystem::path from
std::string which uses the ANSI code page on Windows and breaks non-ASCII paths;
change both constructions of fs_path (the one at the initial creation and the
later creation before returning) to use std::filesystem::u8path(path) so the
path is interpreted as UTF-8 on Windows (keep using
fs_path.lexically_normal().u8string() for the return value).

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a2f8d7 and 391b62a.

📒 Files selected for processing (1)

sherpa-onnx/csrc/file-utils.cc

sherpa-onnx/csrc/file-utils.cc

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@sherpa-onnx/csrc/file-utils.cc`:
- Around line 23-34: The current try/catch blocks that wrap the path conversion
and filesystem checks only catch std::filesystem::filesystem_error but must also
catch allocation and conversion failures from ToWideString()/Utf8ToUtf32()
(e.g., std::bad_alloc and other std::exception types); update the catch clauses
in file-utils.cc (the blocks that call ToWideString and construct
std::filesystem::path and the similar blocks around the filesystem checks) to
catch std::exception (or add an additional catch(std::exception&) alongside the
existing one) so conversion and memory errors are handled gracefully; ensure the
handlers return false like the existing filesystem_error handler and apply the
same change to the other two similar blocks referenced in the review.
- Around line 57-62: tellg() can return -1 on failure; validate the returned
std::streamsize before using it to size the buffer. After calling file.tellg()
(the variable size) check that size is >= 0 (and that file.good()) and return {}
on failure; only then construct the buffer (e.g., std::vector<char>
buffer(static_cast<std::size_t>(size))). This prevents negative tellg() values
from being interpreted as huge unsigned sizes and avoids
std::bad_alloc/std::length_error when reading in the subsequent file.read call.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 391b62a and 750f6a3.

📒 Files selected for processing (1)

sherpa-onnx/csrc/file-utils.cc

sherpa-onnx/csrc/file-utils.cc

csukuangfj · 2026-03-03T09:49:18Z

sherpa-onnx/csrc/file-utils.cc

 #include "sherpa-onnx/csrc/file-utils.h"

 #include <fstream>
+#include <filesystem>


Hello, we now avoid using the filesystem header. See #2998 for details.

OK, thanks.

Hello, the AI review also flags a problem with reading large files over 4 GB. Should we consider reading in chunks in a loop?
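
For reference, a chunked read loop can be sketched portably with `std::ifstream`. This is only an illustration of the looping idea, not the Win32 code path under review: `ReadFileInChunks` and its `chunk_size` parameter are hypothetical, and opening the stream from a narrow `std::string` does not address the Windows path-encoding problem this PR targets.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: read a whole file in pieces of at most `chunk_size`
// bytes. Returns an empty vector on failure (and also for an empty file).
std::vector<char> ReadFileInChunks(const std::string &filename,
                                   std::size_t chunk_size = 1 << 20) {
  std::ifstream file(filename, std::ios::binary);
  if (!file || chunk_size == 0) {
    return {};
  }

  std::vector<char> data;
  std::vector<char> chunk(chunk_size);
  while (file) {
    file.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
    std::streamsize got = file.gcount();  // bytes actually read this round
    if (got > 0) {
      data.insert(data.end(), chunk.begin(), chunk.begin() + got);
    }
  }

  // Hitting end-of-file is expected; bad() signals a real I/O error.
  if (file.bad()) {
    return {};
  }
  return data;
}
```

Each request stays below any per-call limit, at the cost of one extra copy from the chunk buffer into the result.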

coderabbitai

♻️ Duplicate comments (2)

sherpa-onnx/csrc/file-utils.cc (2)

8-8: ⚠️ Potential issue | 🟠 Major

<filesystem> reintroduction conflicts with prior maintainer direction

Line 8 adds <filesystem> again, but this PR thread already records maintainer guidance to avoid it (referencing PR #2998). This is a root-level conflict that should be resolved before merge.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/file-utils.cc` at line 8, Remove the reintroduced `#include`
<filesystem> in file-utils.cc (the added include on line 8) to comply with the
prior maintainer guidance; instead implement file operations using the project's
approved approach — e.g., use POSIX APIs (<sys/stat.h>) or std::ifstream for
existence/size checks or the repository's existing FileUtils helpers — and
update any functions in file-utils.cc that currently rely on std::filesystem to
call the replacement helpers so the file no longer depends on <filesystem>.

54-59: ⚠️ Potential issue | 🟠 Major

Validate tellg() before allocating the read buffer

Line 54 may return a failure sentinel (negative), and Line 57 uses it directly as vector size. That can trigger huge allocation attempts via signed→unsigned conversion before the exception path returns {}.

Suggested patch

-    std::streamsize size = file.tellg();
+    std::streamsize size = file.tellg();
+    if (size < 0) {
+      return {};
+    }
     file.seekg(0, std::ios::beg);

-    std::vector<char> buffer(size);
-    if (!file.read(buffer.data(), size)) {
+    std::vector<char> buffer(static_cast<std::size_t>(size));
+    if (size > 0 && !file.read(buffer.data(), size)) {
       return {};
     }
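
Put together as a stand-alone function, the guarded read might look like the sketch below. `ReadFileGuarded` is a hypothetical name; the surrounding try/catch and Windows branch of the real file are left out on purpose.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical stand-alone version of the POSIX read path with the
// tellg() guard applied. Returns an empty vector on failure.
std::vector<char> ReadFileGuarded(const std::string &filename) {
  std::ifstream file(filename, std::ios::binary | std::ios::ate);
  if (!file.is_open()) {
    return {};
  }

  // tellg() reports failure as -1, so check before using it as a size.
  std::streamsize size = file.tellg();
  if (size < 0) {
    return {};
  }

  file.seekg(0, std::ios::beg);

  std::vector<char> buffer(static_cast<std::size_t>(size));
  if (size > 0 && !file.read(buffer.data(), size)) {
    return {};
  }
  return buffer;
}
```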

#!/bin/bash
# Verify that ReadFile() has a guard for negative tellg() before vector allocation.
# Expected: a check like `if (size < 0) return {};` appears between tellg() and buffer construction.

rg -n -C3 'std::streamsize size = file\.tellg\(\)|std::vector<char> buffer\(' sherpa-onnx/csrc/file-utils.cc

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/file-utils.cc` around lines 54 - 59, The code uses
std::streamsize size = file.tellg() then constructs std::vector<char>
buffer(size) without validating size; if tellg() failed it can be negative and
cause a huge unsigned allocation. In the ReadFile implementation in
file-utils.cc, add a guard immediately after the tellg() call to check if size <
0 (or file.fail()) and return {} on failure, then safely cast size to
std::size_t when creating std::vector<char> buffer; keep the existing seekg/read
logic (use buffer.data(), static_cast<std::size_t>(size)) so you don’t attempt
an allocation when tellg() fails.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 750f6a3 and 3e6f259.

📒 Files selected for processing (1)

sherpa-onnx/csrc/file-utils.cc

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

sherpa-onnx/csrc/file-utils.cc (1)

92-97: ⚠️ Potential issue | 🟠 Major

Guard tellg() failure before using it as vector size.

At Line 92, tellg() can return a failure sentinel; at Line 95 that value is used directly as std::vector<char> size. Please validate size before allocation and cast explicitly.

Proposed patch

-    std::streamsize size = file.tellg();
-    file.seekg(0, std::ios::beg);
-
-    std::vector<char> buffer(size);
-    if (!file.read(buffer.data(), size)) {
+    std::streamsize size = file.tellg();
+    if (size < 0 || !file.good()) {
+      return {};
+    }
+
+    file.seekg(0, std::ios::beg);
+    if (!file.good()) {
+      return {};
+    }
+
+    std::vector<char> buffer(static_cast<size_t>(size));
+    if (size > 0 && !file.read(buffer.data(), size)) {
       return {};
     }

#!/bin/bash
# Verify tellg() result is validated before vector allocation in POSIX branch.
nl -ba sherpa-onnx/csrc/file-utils.cc | sed -n '88,102p'
# Expected: an explicit guard like `if (size < 0)` before constructing std::vector<char>.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/file-utils.cc` around lines 92 - 97, tellg() can return a
failure sentinel which is currently used directly as the std::vector size; guard
the result before allocation by checking the std::streamsize variable size for
failure (e.g., if (size < 0) return {};), then seek back with file.seekg(0,
std::ios::beg) and allocate the buffer using an explicit cast (e.g.,
std::vector<char> buffer(static_cast<std::size_t>(size));) and keep the existing
file.read(buffer.data(), size) call; refer to the tellg() call, the variable
size, the std::vector<char> buffer allocation, and file.read(...) in
file-utils.cc when making the change.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@sherpa-onnx/csrc/file-utils.cc`:
- Around line 73-81: The code currently allocates buffer based on
file_size.QuadPart and calls ReadFile with a DWORD length which will overflow
for files > MAXDWORD; add a guard after GetFileSizeEx to check if
file_size.QuadPart > MAXDWORD and return {} (or handle error) before allocating
the std::vector and before casting to DWORD, ensuring ReadFile is only called
with a safe static_cast<DWORD>(buffer.size()) and avoiding silent truncation;
reference GetFileSizeEx, file_size, buffer, ReadFile and MAXDWORD when making
this change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b4695a0e-bf49-4252-8f85-548cd579f44c

📥 Commits

Reviewing files that changed from the base of the PR and between 3e6f259 and 25c5a4e.

📒 Files selected for processing (1)

sherpa-onnx/csrc/file-utils.cc

coderabbitai · 2026-03-05T04:08:06Z

sherpa-onnx/csrc/file-utils.cc

+    if (!GetFileSizeEx(hFile, &file_size) || file_size.QuadPart > SIZE_MAX) {
+      return {};
+    }
+
+    std::vector<char> buffer(static_cast<size_t>(file_size.QuadPart));
+
+    DWORD bytes_read = 0;
+    if (!::ReadFile(hFile, buffer.data(), static_cast<DWORD>(buffer.size()), &bytes_read, nullptr) ||
+        bytes_read != buffer.size()) {


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Examine the file around lines 73-81
nl -ba sherpa-onnx/csrc/file-utils.cc | sed -n '65,90p'

Repository: k2-fsa/sherpa-onnx

Length of output: 102

🏁 Script executed:

#!/bin/bash
# Examine the file around lines 73-81 using cat
cat -n sherpa-onnx/csrc/file-utils.cc | sed -n '65,90p'

Repository: k2-fsa/sherpa-onnx

Length of output: 939

🏁 Script executed:

#!/bin/bash
# Look at the start of the file for includes and to understand context
head -n 64 sherpa-onnx/csrc/file-utils.cc

Repository: k2-fsa/sherpa-onnx

Length of output: 1509

🏁 Script executed:

#!/bin/bash
# Check the actual definitions of these constants in Windows
python3 << 'EOF'
# On 64-bit systems:
SIZE_MAX_64 = (1 << 64) - 1  # 2^64 - 1
MAXDWORD = (1 << 32) - 1  # 2^32 - 1 = 4,294,967,295

print(f"SIZE_MAX on 64-bit: {SIZE_MAX_64:,} bytes ({SIZE_MAX_64 / (1024**4):.1f} TB)")
print(f"MAXDWORD (ReadFile limit): {MAXDWORD:,} bytes ({MAXDWORD / (1024**3):.2f} GB)")
print(f"\nIf a file is 5GB:")
print(f" - Passes SIZE_MAX check: YES")
print(f" - Allocated buffer size: 5GB")
print(f" - ReadFile call truncates to: {MAXDWORD:,} bytes")
print(f" - bytes_read: {MAXDWORD:,} bytes")
print(f" - buffer.size(): 5,368,709,120 bytes")
print(f" - bytes_read != buffer.size(): TRUE (causes return {})")
print(f"\nResult: File silently fails to read, appearing as read failure")
EOF

Repository: k2-fsa/sherpa-onnx

Length of output: 281

🏁 Script executed:

#!/bin/bash
# Check the exact definitions of these constants in Windows
python3 << 'EOF'
# On 64-bit systems:
SIZE_MAX_64 = (1 << 64) - 1  # 2^64 - 1
MAXDWORD = (1 << 32) - 1  # 2^32 - 1 = 4,294,967,295

print(f"SIZE_MAX on 64-bit: {SIZE_MAX_64:,} bytes ({SIZE_MAX_64 / (1024**4):.1f} TB)")
print(f"MAXDWORD (ReadFile limit): {MAXDWORD:,} bytes ({MAXDWORD / (1024**3):.2f} GB)")
print(f"\nIf a file is 5GB:")
print(f" - Passes SIZE_MAX check: YES")
print(f" - Allocated buffer size: 5GB")
print(f" - ReadFile call truncates to: {MAXDWORD:,} bytes")
print(f" - bytes_read: {MAXDWORD:,} bytes")
print(f" - buffer.size(): 5,368,709,120 bytes")
print(f" - Comparison bytes_read != buffer.size(): TRUE")
print(f"\nResult: File silently fails to read, appearing as read failure")
EOF

Repository: k2-fsa/sherpa-onnx

Length of output: 508

Guard against ReadFile DWORD byte-count limit for large files.

Files larger than 4 GB (MAXDWORD) will silently fail to read. At line 80, ReadFile can read at most ~4.3 GB per call, but the code allocates buffers up to SIZE_MAX (~16 exabytes on 64-bit systems) without checking this limit. For files between 4 GB and SIZE_MAX, the cast static_cast<DWORD>(buffer.size()) truncates silently, bytes_read will be less than buffer.size(), and the check at line 81 will reject the file.

Add a guard to reject files exceeding MAXDWORD before allocation:

Proposed fix

-    if (!GetFileSizeEx(hFile, &file_size) || file_size.QuadPart > SIZE_MAX) {
+    if (!GetFileSizeEx(hFile, &file_size) || file_size.QuadPart < 0 ||
+        file_size.QuadPart > static_cast<LONGLONG>(SIZE_MAX) ||
+        file_size.QuadPart > static_cast<LONGLONG>(MAXDWORD)) {
       return {};
     }
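
To make the narrowing behavior concrete, here is a small portable sketch in which `std::uint32_t` stands in for the Win32 `DWORD` type (an assumption made so the example builds off Windows). It shows that a 64-bit size cast to 32 bits keeps only the low 32 bits, so a 5 GiB request wraps to 1 GiB rather than being clamped to MAXDWORD. `TruncateToDword` and `FitsInDword` are hypothetical names.

```cpp
#include <cstdint>

// std::uint32_t stands in for the Win32 DWORD type (32-bit unsigned).
constexpr std::uint32_t TruncateToDword(std::uint64_t size) {
  return static_cast<std::uint32_t>(size);  // keeps only the low 32 bits
}

constexpr std::uint64_t kFiveGiB = 5ULL << 30;  // 5,368,709,120 bytes
constexpr std::uint64_t kOneGiB = 1ULL << 30;   // 1,073,741,824 bytes

// 5 GiB = 2^32 + 2^30, so the cast leaves 2^30 (1 GiB).
static_assert(TruncateToDword(kFiveGiB) == kOneGiB,
              "a 5 GiB size wraps to 1 GiB when narrowed to 32 bits");

// Hypothetical guard in the spirit of the proposed fix: reject sizes
// that cannot be passed to a single 32-bit read request.
constexpr bool FitsInDword(std::uint64_t size) {
  return size <= UINT32_MAX;
}
```

Either way the existing `bytes_read != buffer.size()` comparison fails for such a file, which is why an explicit size guard or a chunked loop is being discussed.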

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/file-utils.cc` around lines 73 - 81, The code currently allocates buffer based on file_size.QuadPart and calls ReadFile with a DWORD length which will overflow for files > MAXDWORD; add a guard after GetFileSizeEx to check if file_size.QuadPart > MAXDWORD and return {} (or handle error) before allocating the std::vector and before casting to DWORD, ensuring ReadFile is only called with a safe static_cast<DWORD>(buffer.size()) and avoiding silent truncation; reference GetFileSizeEx, file_size, buffer, ReadFile and MAXDWORD when making this change.

…ix/file-utils-resolve-win-path

csukuangfj · 2026-03-05T08:33:17Z

sherpa-onnx/csrc/file-utils.cc

+    struct stat file_stat;
+    return stat(filename.c_str(), &file_stat) == 0 && S_ISREG(file_stat.st_mode);
+#endif
+  } catch (const std::exception&) {


Can you describe which function can throw in the code?

Thanks. The chance of that being triggered is indeed too low, so I will remove it.

csukuangfj · 2026-03-05T08:35:18Z

sherpa-onnx/csrc/file-utils.cc

+    std::vector<char> buffer(static_cast<size_t>(file_size.QuadPart));
+
+    DWORD bytes_read = 0;
+    if (!::ReadFile(hFile, buffer.data(), static_cast<DWORD>(buffer.size()), &bytes_read, nullptr) ||


The condition is too complicated. Please split it across multiple lines of code to simplify it.

Copilot

Pull request overview

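This PR aims to fix the failure to resolve or load model files on Windows when the path contains Chinese or other non-ASCII characters (#3171). It switches some file and path operations to the wide-character Windows API and improves the cross-platform file existence check and read logic.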

Changes:

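Windows: FileExists/ReadFile/ResolveAbsolutePath now use wide-character Win32 APIs (GetFileAttributesW/CreateFileW/GetFullPathNameW).
POSIX: FileExists now uses stat(), and ResolveAbsolutePath uses realpath().
Added exception handling around these path and file operations, returning a safe fallback value (such as {} or the original path) on failure.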
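
The POSIX side of this summary can be sketched as follows. These are hypothetical stand-alone helpers (`FileExistsPosix`, `ResolveAbsolutePathPosix`), not the code in the PR: the diff in this thread uses a fixed `PATH_MAX` buffer, whereas this sketch assumes a POSIX.1-2008 `realpath()` that allocates the result when passed a null buffer.

```cpp
#include <stdlib.h>    // realpath, free
#include <sys/stat.h>  // stat, S_ISREG

#include <string>

// Hypothetical POSIX-only helper: true only for an existing regular file.
bool FileExistsPosix(const std::string &filename) {
  struct stat file_stat;
  return stat(filename.c_str(), &file_stat) == 0 &&
         S_ISREG(file_stat.st_mode);
}

// Hypothetical POSIX-only helper: returns the canonical absolute path,
// or the input unchanged if it cannot be resolved.
std::string ResolveAbsolutePathPosix(const std::string &path) {
  if (path.empty()) {
    return path;
  }

  // A null second argument asks realpath() to allocate the result.
  char *resolved = realpath(path.c_str(), nullptr);
  if (resolved == nullptr) {
    return path;  // for example, the path does not exist
  }

  std::string result(resolved);
  free(resolved);
  return result;
}
```

Unlike lexical normalization, `realpath()` consults the file system, so it resolves symbolic links and fails for paths that do not exist.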

Copilot · 2026-03-05T08:36:14Z

sherpa-onnx/csrc/file-utils.cc

 std::string ResolveAbsolutePath(const std::string &path) {
  if (path.empty()) {
    return path;
  }

+  try {
 #ifdef _WIN32
-  // Check if path is already absolute (drive letter or UNC path)
-  if ((path.size() > 1 && path[1] == ':') ||
-      (path.size() > 1 && path[0] == '\\' && path[1] == '\\')) {
-    return path;
-  }
-
-  char buffer[MAX_PATH];
-  if (GetFullPathNameA(path.c_str(), MAX_PATH, buffer, nullptr)) {
-    return std::string(buffer);
-  }
-
-  return path;  // fallback on failure
-
+    std::wstring wide_path = ToWideString(path);
+    DWORD required_size = GetFullPathNameW(wide_path.c_str(), 0, nullptr, nullptr);
+    if (required_size == 0) {
+      return path;
+    }
+
+    std::vector<wchar_t> buffer(required_size);
+    DWORD actual_size = GetFullPathNameW(
+        wide_path.c_str(),
+        required_size,
+        buffer.data(),
+        nullptr
+    );
+
+    if (actual_size == 0 || actual_size >= required_size) {
+      return path;
+    }
+
+    std::wstring resolved_wide(buffer.data(), actual_size);
+    return ToString(resolved_wide);
 #else
-  // POSIX: absolute paths start with '/'
-  if (path[0] == '/') {
+    char resolved_path[PATH_MAX];
+    if (realpath(path.c_str(), resolved_path) == nullptr) {
+      return path;
+    }
+    return std::string(resolved_path);
+#endif


The PR description mentions refactoring ResolveAbsolutePath to use std::filesystem and lexically_normal(), but the implementation still uses GetFullPathNameW/realpath and does not do lexical normalization. Either update the PR description to match the actual approach, or implement the described std::filesystem-based normalization.

Copilot · 2026-03-05T08:36:15Z

sherpa-onnx/csrc/file-utils.cc

+    struct stat file_stat;
+    return stat(filename.c_str(), &file_stat) == 0 && S_ISREG(file_stat.st_mode);
+#endif
+  } catch (const std::exception&) {
+    return false;


This file now catches std::exception, but it does not include <exception>. Add the direct standard header to avoid relying on transitive includes.

sherpa-onnx/csrc/file-utils.cc

Copilot · 2026-03-05T08:36:15Z

sherpa-onnx/csrc/file-utils.cc


 namespace sherpa_onnx {
+std::wstring ToWideString(const std::string &s);
+std::string ToString(const std::wstring &s);


ToWideString/ToString are forward-declared here but defined in text-utils.h/.cc. Prefer including the header instead of duplicating declarations so the dependency is explicit and signatures stay in sync.

Suggested change

-namespace sherpa_onnx {
-std::wstring ToWideString(const std::string &s);
-std::string ToString(const std::wstring &s);
+#include "sherpa-onnx/csrc/text-utils.h"
+
+namespace sherpa_onnx {

Copilot · 2026-03-05T08:36:15Z

sherpa-onnx/csrc/file-utils.cc

+    DWORD bytes_read = 0;
+    if (!::ReadFile(hFile, buffer.data(), static_cast<DWORD>(buffer.size()), &bytes_read, nullptr) ||
+        bytes_read != buffer.size()) {
+      return {};
+    }
+


Windows ReadFile casts buffer.size() to DWORD for the ReadFile() call. For files >4GiB this overflows / truncates the requested read size; the function will then fail even though SIZE_MAX is larger. Consider rejecting sizes > DWORD max explicitly or reading in a loop/chunks.

Suggested change

-    DWORD bytes_read = 0;
-    if (!::ReadFile(hFile, buffer.data(), static_cast<DWORD>(buffer.size()), &bytes_read, nullptr) ||
-        bytes_read != buffer.size()) {
-      return {};
-    }
+    size_t total_read = 0;
+    while (total_read < buffer.size()) {
+      size_t remaining = buffer.size() - total_read;
+      DWORD to_read = remaining > static_cast<size_t>(MAXDWORD)
+                          ? MAXDWORD
+                          : static_cast<DWORD>(remaining);
+      DWORD bytes_read = 0;
+      if (!::ReadFile(hFile, buffer.data() + total_read, to_read, &bytes_read, nullptr)) {
+        return {};
+      }
+      if (bytes_read == 0) {
+        // Unexpected end of file
+        return {};
+      }
+      total_read += bytes_read;
+    }

Copilot · 2026-03-05T08:36:16Z

sherpa-onnx/csrc/file-utils.cc

+    std::streamsize size = file.tellg();
+    file.seekg(0, std::ios::beg);

-  std::vector<char> buffer(size);
-  if (!file.read(buffer.data(), size)) {
+    std::vector<char> buffer(size);
+    if (!file.read(buffer.data(), size)) {
+      return {};


In the non-Windows ReadFile path, tellg() can return -1 on failure; constructing std::vector<char> buffer(size) would then attempt a huge allocation due to signed-to-unsigned conversion. Check that size >= 0 (and that the stream is in a good state) before allocating.

coderabbitai

♻️ Duplicate comments (2)

sherpa-onnx/csrc/file-utils.cc (2)

94-99: ⚠️ Potential issue | 🟡 Minor

Validate tellg() result before using it as vector length.

At Line 94 and Line 97, tellg() failure (-1) is not checked before vector sizing. This can trigger a huge implicit unsigned conversion and exception path instead of a clean early return.

Proposed fix

-    std::streamsize size = file.tellg();
+    std::streamsize size = file.tellg();
+    if (size < 0) {
+      return {};
+    }
     file.seekg(0, std::ios::beg);
 
-    std::vector<char> buffer(size);
-    if (!file.read(buffer.data(), size)) {
+    std::vector<char> buffer(static_cast<size_t>(size));
+    if (size > 0 && !file.read(buffer.data(), size)) {
       return {};
     }
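
As a self-contained illustration of the guard above, here is a minimal sketch of a whole-file reader. `ReadWholeFile` is a hypothetical name and is not the function in file-utils.cc; it only shows the order of the checks (open, tellg, allocate, read) under the assumption that an empty vector signals failure.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical stand-alone reader; returns an empty vector on any failure.
inline std::vector<char> ReadWholeFile(const std::string &path) {
  // Open at the end so tellg() reports the file size.
  std::ifstream file(path, std::ios::binary | std::ios::ate);
  if (!file) {
    return {};
  }

  const std::streamoff end = file.tellg();  // -1 signals failure
  if (end < 0) {
    return {};
  }
  file.seekg(0, std::ios::beg);

  // The cast is safe only because `end` was checked to be non-negative.
  std::vector<char> buffer(static_cast<std::size_t>(end));
  if (end > 0 &&
      !file.read(buffer.data(), static_cast<std::streamsize>(end))) {
    return {};
  }
  return buffer;
}
```

One consequence of this design is that an empty file and a failed read are indistinguishable to the caller, which matches the `return {}` convention in the diff.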

#!/bin/bash
# Verify tellg() guard and safe vector sizing in POSIX branch.
rg -n -C3 'tellg\(|size < 0|vector<char> buffer\(|file\.read\(' sherpa-onnx/csrc/file-utils.cc

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/file-utils.cc` around lines 94 - 99, The code uses
std::streamsize size = file.tellg() and then constructs std::vector<char>
buffer(size) without validating tellg(); first check that size != -1 (or size >=
0) after calling tellg() and return {} early on failure, then cast size to a
safe size_t when sizing the vector (e.g., static_cast<size_t>(size)); ensure the
validation happens before calling file.seekg and before constructing buffer and
before file.read, and keep references to tellg(), size, buffer
(std::vector<char> buffer(...)), and file.read(...) to locate and update the
logic.

68-84: ⚠️ Potential issue | 🟠 Major

Guard file size against Win32 DWORD read limit before allocation.

At Line 69/Line 79, size is bounded by SIZE_MAX but ReadFile still accepts only DWORD. Files larger than MAXDWORD can be mis-sized/truncated at the call boundary after allocating a large buffer.

Proposed fix

-    if (!GetFileSizeEx(hFile, &file_size) || file_size.QuadPart > SIZE_MAX) {
+    if (!GetFileSizeEx(hFile, &file_size) || file_size.QuadPart < 0 ||
+        static_cast<ULONGLONG>(file_size.QuadPart) > SIZE_MAX ||
+        file_size.QuadPart > static_cast<LONGLONG>(MAXDWORD)) {
       return {};
     }
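
The range check can be tried in isolation with a small portable sketch. `FitsSingleRead` is a hypothetical helper, and `std::uint32_t` stands in for DWORD. It compares in an unsigned 64-bit type, which avoids casting SIZE_MAX to a signed type (on 64-bit targets that cast would wrap to -1).

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper: true when a 64-bit file size can be allocated as one
// buffer and requested in a single 32-bit read call.
inline bool FitsSingleRead(std::int64_t file_size) {
  // Stand-in for MAXDWORD.
  constexpr std::uint64_t kMaxRead =
      std::numeric_limits<std::uint32_t>::max();
  // Equivalent to SIZE_MAX.
  constexpr std::uint64_t kMaxAlloc =
      std::numeric_limits<std::size_t>::max();

  if (file_size < 0) {
    return false;
  }
  const std::uint64_t n = static_cast<std::uint64_t>(file_size);
  return n <= kMaxAlloc && n <= kMaxRead;
}
```

Rejecting large files this way is the simpler of the two options named in the comment; the chunked-read alternative keeps support for files above the 32-bit limit at the cost of a loop.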

#!/bin/bash
# Verify whether the Windows branch guards MAXDWORD before ReadFile.
rg -n -C3 'GetFileSizeEx|MAXDWORD|ReadFile\(' sherpa-onnx/csrc/file-utils.cc

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@sherpa-onnx/csrc/file-utils.cc` around lines 68 - 84, The code checks
file_size against SIZE_MAX but then passes buffer.size() to ReadFile which
accepts a DWORD; before allocating or calling ReadFile in the
GetFileSizeEx/ReadFile path, additionally guard that file_size.QuadPart <=
MAXDWORD (or handle large files by reading in chunks) so you never cast a size >
MAXDWORD into a DWORD; update the logic around file_size, buffer allocation
(vector<char> buffer(...)), and the ReadFile call (and the bytes_read
validation) to either return an error for files > MAXDWORD or implement chunked
reads for large files.

📥 Commits

Reviewing files that changed from the base of the PR and between 25c5a4e and 43db5d0.

📒 Files selected for processing (1)

sherpa-onnx/csrc/file-utils.cc

Fix: improve the ResolveAbsolutePath function in file-utils.cc and adapt the file-existence check to wide characters on Windows

391b62a

dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Mar 3, 2026

gemini-code-assist bot reviewed Mar 3, 2026

coderabbitai bot reviewed Mar 3, 2026

Optimization: FileExists and ReadFile now both use the newer filesystem API; addition: wide-character API for ResolveAbsolutePath

750f6a3

coderabbitai bot reviewed Mar 3, 2026

csukuangfj reviewed Mar 3, 2026

Optimization: widen the catch scope and merge the forward declarations

3e6f259

coderabbitai bot reviewed Mar 3, 2026

Switch to the Windows API instead of filesystem

25c5a4e

dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Mar 5, 2026

coderabbitai bot reviewed Mar 5, 2026

Merge branch 'master' of https://github.com/k2-fsa/sherpa-onnx into fix/file-utils-resolve-win-path

f27c320

csukuangfj requested a review from Copilot March 5, 2026 08:32

Copilot started reviewing on behalf of csukuangfj March 5, 2026 08:32

csukuangfj requested changes Mar 5, 2026

Copilot AI reviewed Mar 5, 2026

Remove unnecessary try-catch and split up overly complex conditions

43db5d0

coderabbitai bot reviewed Mar 5, 2026

XiaoBaiCanDo and others added 3 commits March 17, 2026 17:14

Apply suggestion from @Copilot

01881d3

Include <stdlib.h> to support the use of POSIX realpath() Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

file.tellg() may return -1 and must be checked; explicitly convert size to an unsigned integer

989488c

Merge branch 'fix/file-utils-resolve-win-path' of https://github.com/XiaoBaiCanDo/sherpa-onnx into fix/file-utils-resolve-win-path

853d5fa

-	std::filesystem::path fs_path(path);
+	std::filesystem::path fs_path = std::filesystem::u8path(path);

-    DWORD bytes_read = 0;
-    if (!::ReadFile(hFile, buffer.data(), static_cast<DWORD>(buffer.size()), &bytes_read, nullptr) ||
-        bytes_read != buffer.size()) {
-      return {};
-    }
+    size_t total_read = 0;
+    while (total_read < buffer.size()) {
+      size_t remaining = buffer.size() - total_read;
+      DWORD to_read = remaining > static_cast<size_t>(MAXDWORD)
+                          ? MAXDWORD
+                          : static_cast<DWORD>(remaining);
+      DWORD bytes_read = 0;
+      if (!::ReadFile(hFile, buffer.data() + total_read, to_read, &bytes_read, nullptr)) {
+        return {};
+      }
+      if (bytes_read == 0) {
+        // Unexpected end of file
+        return {};
+      }
+      total_read += bytes_read;
+    }
