Fix #3171 : 去掉生成绝对路径函数中的窄字符WindowsAPI调用,改用宽字符API和C++17 std::filesystem来避免win平台中文路径问题#3255
Fix #3171 : 去掉生成绝对路径函数中的窄字符WindowsAPI调用,改用宽字符API和C++17 std::filesystem来避免win平台中文路径问题#3255XiaoBaiCanDo wants to merge 9 commits intok2-fsa:masterfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughReplaces ad-hoc cross-platform file handling in Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment Tip You can make CodeRabbit's review stricter and more nitpicky using the `assertive` profile, if that's what you prefer.Change the |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! 此拉取请求旨在解决 Windows 平台上因窄字符 API 导致的中文路径解析问题。通过引入 C++17 的 std::filesystem 库和宽字符处理,它重构了文件路径解析和文件存在性检查的逻辑,从而显著提升了跨平台兼容性和对包含多字节字符路径的支持。 Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
sherpa-onnx/csrc/file-utils.cc
Outdated
| // POSIX: absolute paths start with '/' | ||
| if (path[0] == '/') { | ||
| try { | ||
| std::filesystem::path fs_path(path); |
There was a problem hiding this comment.
sherpa-onnx/csrc/file-utils.cc
Outdated
| } // namespace sherpa_onnx | ||
|
|
||
| namespace sherpa_onnx { |
There was a problem hiding this comment.
u8path在C++20中不是被弃用了吗?长期看是不是不该在这里用u8path
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
sherpa-onnx/csrc/file-utils.cc (1)
24-26: Usestd::filesystem::pathinstead ofstd::wstringfor cross-platform portabilityLines 24-26 and 41-42 pass
std::wstringdirectly tostd::ifstream. While this works on Windows, it is not a standard C++17 portable constructor.std::ifstreamprovides a standard portable overload forstd::filesystem::paththat handles encoding correctly on both Windows and POSIX systems.♻️ Proposed refactor
- std::wstring wide_path = ToWideString(filename); - std::ifstream file(wide_path); + std::ifstream file(std::filesystem::path(filename)); ... - std::wstring wide_path = ToWideString(filename); - std::ifstream file(wide_path, std::ios::binary | std::ios::ate); + std::ifstream file(std::filesystem::path(filename), + std::ios::binary | std::ios::ate);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sherpa-onnx/csrc/file-utils.cc` around lines 24 - 26, Replace the ad-hoc wide string conversion and direct wstring use when opening files with std::filesystem::path: instead of calling ToWideString(filename) to create wide_path and passing that to std::ifstream, construct a std::filesystem::path p(filename) and open std::ifstream file(p); do the same for the second occurrence (lines that create wide_path2 / open file there). Add `#include` <filesystem> and ensure the code uses std::filesystem::path so the std::ifstream overload handles platform encoding correctly; keep the original filename variable and remove the need for ToWideString when opening streams.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@sherpa-onnx/csrc/file-utils.cc`:
- Around line 131-137: ResolveAbsolutePath constructs std::filesystem::path from
std::string which uses the ANSI code page on Windows and breaks non-ASCII paths;
change both constructions of fs_path (the one at the initial creation and the
later creation before returning) to use std::filesystem::u8path(path) so the
path is interpreted as UTF-8 on Windows (keep using
fs_path.lexically_normal().u8string() for the return value).
---
Nitpick comments:
In `@sherpa-onnx/csrc/file-utils.cc`:
- Around line 24-26: Replace the ad-hoc wide string conversion and direct
wstring use when opening files with std::filesystem::path: instead of calling
ToWideString(filename) to create wide_path and passing that to std::ifstream,
construct a std::filesystem::path p(filename) and open std::ifstream file(p); do
the same for the second occurrence (lines that create wide_path2 / open file
there). Add `#include` <filesystem> and ensure the code uses std::filesystem::path
so the std::ifstream overload handles platform encoding correctly; keep the
original filename variable and remove the need for ToWideString when opening
streams.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@sherpa-onnx/csrc/file-utils.cc`:
- Around line 23-34: The current try/catch blocks that wrap the path conversion
and filesystem checks only catch std::filesystem::filesystem_error but must also
catch allocation and conversion failures from ToWideString()/Utf8ToUtf32()
(e.g., std::bad_alloc and other std::exception types); update the catch clauses
in file-utils.cc (the blocks that call ToWideString and construct
std::filesystem::path and the similar blocks around the filesystem checks) to
catch std::exception (or add an additional catch(std::exception&) alongside the
existing one) so conversion and memory errors are handled gracefully; ensure the
handlers return false like the existing filesystem_error handler and apply the
same change to the other two similar blocks referenced in the review.
- Around line 57-62: tellg() can return -1 on failure; validate the returned
std::streamsize before using it to size the buffer. After calling file.tellg()
(the variable size) check that size is >= 0 (and that file.good()) and return {}
on failure; only then construct the buffer (e.g., std::vector<char>
buffer(static_cast<std::size_t>(size))). This prevents negative tellg() values
from being interpreted as huge unsigned sizes and avoids
std::bad_alloc/std::length_error when reading in the subsequent file.read call.
sherpa-onnx/csrc/file-utils.cc
Outdated
| #include "sherpa-onnx/csrc/file-utils.h" | ||
|
|
||
| #include <fstream> | ||
| #include <filesystem> |
There was a problem hiding this comment.
您好,看ai审阅还有超过4GB的大文件的读取问题,需要考虑循环分块读取吗?
There was a problem hiding this comment.
♻️ Duplicate comments (2)
sherpa-onnx/csrc/file-utils.cc (2)
8-8:⚠️ Potential issue | 🟠 Major
<filesystem>reintroduction conflicts with prior maintainer directionLine 8 adds
<filesystem>again, but this PR thread already records maintainer guidance to avoid it (referencing PR#2998). This is a root-level conflict that should be resolved before merge.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sherpa-onnx/csrc/file-utils.cc` at line 8, Remove the reintroduced `#include` <filesystem> in file-utils.cc (the added include on line 8) to comply with the prior maintainer guidance; instead implement file operations using the project's approved approach — e.g., use POSIX APIs (<sys/stat.h>) or std::ifstream for existence/size checks or the repository's existing FileUtils helpers — and update any functions in file-utils.cc that currently rely on std::filesystem to call the replacement helpers so the file no longer depends on <filesystem>.
54-59:⚠️ Potential issue | 🟠 MajorValidate
tellg()before allocating the read bufferLine 54 may return a failure sentinel (negative), and Line 57 uses it directly as vector size. That can trigger huge allocation attempts via signed→unsigned conversion before the exception path returns
{}.Suggested patch
- std::streamsize size = file.tellg(); + std::streamsize size = file.tellg(); + if (size < 0) { + return {}; + } file.seekg(0, std::ios::beg); - std::vector<char> buffer(size); - if (!file.read(buffer.data(), size)) { + std::vector<char> buffer(static_cast<std::size_t>(size)); + if (size > 0 && !file.read(buffer.data(), size)) { return {}; }#!/bin/bash # Verify that ReadFile() has a guard for negative tellg() before vector allocation. # Expected: a check like `if (size < 0) return {};` appears between tellg() and buffer construction. rg -n -C3 'std::streamsize size = file\.tellg\(\)|std::vector<char> buffer\(' sherpa-onnx/csrc/file-utils.cc🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sherpa-onnx/csrc/file-utils.cc` around lines 54 - 59, The code uses std::streamsize size = file.tellg() then constructs std::vector<char> buffer(size) without validating size; if tellg() failed it can be negative and cause a huge unsigned allocation. In the ReadFile implementation in file-utils.cc, add a guard immediately after the tellg() call to check if size < 0 (or file.fail()) and return {} on failure, then safely cast size to std::size_t when creating std::vector<char> buffer; keep the existing seekg/read logic (use buffer.data(), static_cast<std::size_t>(size)) so you don’t attempt an allocation when tellg() fails.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@sherpa-onnx/csrc/file-utils.cc`:
- Line 8: Remove the reintroduced `#include` <filesystem> in file-utils.cc (the
added include on line 8) to comply with the prior maintainer guidance; instead
implement file operations using the project's approved approach — e.g., use
POSIX APIs (<sys/stat.h>) or std::ifstream for existence/size checks or the
repository's existing FileUtils helpers — and update any functions in
file-utils.cc that currently rely on std::filesystem to call the replacement
helpers so the file no longer depends on <filesystem>.
- Around line 54-59: The code uses std::streamsize size = file.tellg() then
constructs std::vector<char> buffer(size) without validating size; if tellg()
failed it can be negative and cause a huge unsigned allocation. In the ReadFile
implementation in file-utils.cc, add a guard immediately after the tellg() call
to check if size < 0 (or file.fail()) and return {} on failure, then safely cast
size to std::size_t when creating std::vector<char> buffer; keep the existing
seekg/read logic (use buffer.data(), static_cast<std::size_t>(size)) so you
don’t attempt an allocation when tellg() fails.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
sherpa-onnx/csrc/file-utils.cc (1)
92-97:⚠️ Potential issue | 🟠 MajorGuard
tellg()failure before using it as vector size.At Line 92,
tellg()can return a failure sentinel; at Line 95 that value is used directly asstd::vector<char>size. Please validatesizebefore allocation and cast explicitly.Proposed patch
- std::streamsize size = file.tellg(); - file.seekg(0, std::ios::beg); - - std::vector<char> buffer(size); - if (!file.read(buffer.data(), size)) { + std::streamsize size = file.tellg(); + if (size < 0 || !file.good()) { + return {}; + } + + file.seekg(0, std::ios::beg); + if (!file.good()) { + return {}; + } + + std::vector<char> buffer(static_cast<size_t>(size)); + if (size > 0 && !file.read(buffer.data(), size)) { return {}; }#!/bin/bash # Verify tellg() result is validated before vector allocation in POSIX branch. nl -ba sherpa-onnx/csrc/file-utils.cc | sed -n '88,102p' # Expected: an explicit guard like `if (size < 0)` before constructing std::vector<char>.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sherpa-onnx/csrc/file-utils.cc` around lines 92 - 97, tellg() can return a failure sentinel which is currently used directly as the std::vector size; guard the result before allocation by checking the std::streamsize variable size for failure (e.g., if (size < 0) return {};), then seek back with file.seekg(0, std::ios::beg) and allocate the buffer using an explicit cast (e.g., std::vector<char> buffer(static_cast<std::size_t>(size));) and keep the existing file.read(buffer.data(), size) call; refer to the tellg() call, the variable size, the std::vector<char> buffer allocation, and file.read(...) in file-utils.cc when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@sherpa-onnx/csrc/file-utils.cc`:
- Around line 73-81: The code currently allocates buffer based on
file_size.QuadPart and calls ReadFile with a DWORD length which will overflow
for files > MAXDWORD; add a guard after GetFileSizeEx to check if
file_size.QuadPart > MAXDWORD and return {} (or handle error) before allocating
the std::vector and before casting to DWORD, ensuring ReadFile is only called
with a safe static_cast<DWORD>(buffer.size()) and avoiding silent truncation;
reference GetFileSizeEx, file_size, buffer, ReadFile and MAXDWORD when making
this change.
---
Duplicate comments:
In `@sherpa-onnx/csrc/file-utils.cc`:
- Around line 92-97: tellg() can return a failure sentinel which is currently
used directly as the std::vector size; guard the result before allocation by
checking the std::streamsize variable size for failure (e.g., if (size < 0)
return {};), then seek back with file.seekg(0, std::ios::beg) and allocate the
buffer using an explicit cast (e.g., std::vector<char>
buffer(static_cast<std::size_t>(size));) and keep the existing
file.read(buffer.data(), size) call; refer to the tellg() call, the variable
size, the std::vector<char> buffer allocation, and file.read(...) in
file-utils.cc when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b4695a0e-bf49-4252-8f85-548cd579f44c
📒 Files selected for processing (1)
sherpa-onnx/csrc/file-utils.cc
sherpa-onnx/csrc/file-utils.cc
Outdated
| if (!GetFileSizeEx(hFile, &file_size) || file_size.QuadPart > SIZE_MAX) { | ||
| return {}; | ||
| } | ||
|
|
||
| std::vector<char> buffer(static_cast<size_t>(file_size.QuadPart)); | ||
|
|
||
| DWORD bytes_read = 0; | ||
| if (!::ReadFile(hFile, buffer.data(), static_cast<DWORD>(buffer.size()), &bytes_read, nullptr) || | ||
| bytes_read != buffer.size()) { |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Examine the file around lines 73-81
nl -ba sherpa-onnx/csrc/file-utils.cc | sed -n '65,90p'Repository: k2-fsa/sherpa-onnx
Length of output: 102
🏁 Script executed:
#!/bin/bash
# Examine the file around lines 73-81 using cat
cat -n sherpa-onnx/csrc/file-utils.cc | sed -n '65,90p'Repository: k2-fsa/sherpa-onnx
Length of output: 939
🏁 Script executed:
#!/bin/bash
# Look at the start of the file for includes and to understand context
head -n 64 sherpa-onnx/csrc/file-utils.ccRepository: k2-fsa/sherpa-onnx
Length of output: 1509
🏁 Script executed:
#!/bin/bash
# Check the actual definitions of these constants in Windows
python3 << 'EOF'
# On 64-bit systems:
SIZE_MAX_64 = (1 << 64) - 1 # 2^64 - 1
MAXDWORD = (1 << 32) - 1 # 2^32 - 1 = 4,294,967,295
print(f"SIZE_MAX on 64-bit: {SIZE_MAX_64:,} bytes ({SIZE_MAX_64 / (1024**4):.1f} TB)")
print(f"MAXDWORD (ReadFile limit): {MAXDWORD:,} bytes ({MAXDWORD / (1024**3):.2f} GB)")
print(f"\nIf a file is 5GB:")
print(f" - Passes SIZE_MAX check: YES")
print(f" - Allocated buffer size: 5GB")
print(f" - ReadFile call truncates to: {MAXDWORD:,} bytes")
print(f" - bytes_read: {MAXDWORD:,} bytes")
print(f" - buffer.size(): 5,368,709,120 bytes")
print(f" - bytes_read != buffer.size(): TRUE (causes return {})")
print(f"\nResult: File silently fails to read, appearing as read failure")
EOFRepository: k2-fsa/sherpa-onnx
Length of output: 281
🏁 Script executed:
#!/bin/bash
# Check the exact definitions of these constants in Windows
python3 << 'EOF'
# On 64-bit systems:
SIZE_MAX_64 = (1 << 64) - 1 # 2^64 - 1
MAXDWORD = (1 << 32) - 1 # 2^32 - 1 = 4,294,967,295
print(f"SIZE_MAX on 64-bit: {SIZE_MAX_64:,} bytes ({SIZE_MAX_64 / (1024**4):.1f} TB)")
print(f"MAXDWORD (ReadFile limit): {MAXDWORD:,} bytes ({MAXDWORD / (1024**3):.2f} GB)")
print(f"\nIf a file is 5GB:")
print(f" - Passes SIZE_MAX check: YES")
print(f" - Allocated buffer size: 5GB")
print(f" - ReadFile call truncates to: {MAXDWORD:,} bytes")
print(f" - bytes_read: {MAXDWORD:,} bytes")
print(f" - buffer.size(): 5,368,709,120 bytes")
print(f" - Comparison bytes_read != buffer.size(): TRUE")
print(f"\nResult: File silently fails to read, appearing as read failure")
EOFRepository: k2-fsa/sherpa-onnx
Length of output: 508
Guard against ReadFile DWORD byte-count limit for large files.
Files larger than 4 GB (MAXDWORD) will silently fail to read. At line 80, ReadFile can read at most ~4.3 GB per call, but the code allocates buffers up to SIZE_MAX (~16 exabytes on 64-bit systems) without checking this limit. For files between 4 GB and SIZE_MAX, the cast static_cast<DWORD>(buffer.size()) truncates silently, bytes_read will be less than buffer.size(), and the check at line 81 will reject the file.
Add a guard to reject files exceeding MAXDWORD before allocation:
Proposed fix
- if (!GetFileSizeEx(hFile, &file_size) || file_size.QuadPart > SIZE_MAX) {
+ if (!GetFileSizeEx(hFile, &file_size) || file_size.QuadPart < 0 ||
+ file_size.QuadPart > static_cast<LONGLONG>(SIZE_MAX) ||
+ file_size.QuadPart > static_cast<LONGLONG>(MAXDWORD)) {
return {};
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sherpa-onnx/csrc/file-utils.cc` around lines 73 - 81, The code currently
allocates buffer based on file_size.QuadPart and calls ReadFile with a DWORD
length which will overflow for files > MAXDWORD; add a guard after GetFileSizeEx
to check if file_size.QuadPart > MAXDWORD and return {} (or handle error) before
allocating the std::vector and before casting to DWORD, ensuring ReadFile is
only called with a safe static_cast<DWORD>(buffer.size()) and avoiding silent
truncation; reference GetFileSizeEx, file_size, buffer, ReadFile and MAXDWORD
when making this change.
…ix/file-utils-resolve-win-path
sherpa-onnx/csrc/file-utils.cc
Outdated
| struct stat file_stat; | ||
| return stat(filename.c_str(), &file_stat) == 0 && S_ISREG(file_stat.st_mode); | ||
| #endif | ||
| } catch (const std::exception&) { |
There was a problem hiding this comment.
Can you describe which function can throw in the code?
sherpa-onnx/csrc/file-utils.cc
Outdated
| std::vector<char> buffer(static_cast<size_t>(file_size.QuadPart)); | ||
|
|
||
| DWORD bytes_read = 0; | ||
| if (!::ReadFile(hFile, buffer.data(), static_cast<DWORD>(buffer.size()), &bytes_read, nullptr) || |
There was a problem hiding this comment.
The condition is toooooo complicated. Please use multilines of code to simplify it.
There was a problem hiding this comment.
Pull request overview
该 PR 旨在修复 Windows 平台下包含中文/非 ASCII 路径时模型文件解析/加载失败的问题(#3171),通过将部分文件与路径相关操作切换到宽字符 Windows API,并改进跨平台文件存在性检测与读取逻辑。
Changes:
- Windows:
FileExists/ReadFile/ResolveAbsolutePath改用宽字符 Win32 API(GetFileAttributesW/CreateFileW/GetFullPathNameW)。 - POSIX:
FileExists改用stat(),ResolveAbsolutePath使用realpath()。 - 为上述路径/文件操作增加异常捕获并在失败时返回安全兜底值(如
{}或原始path)。
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| std::string ResolveAbsolutePath(const std::string &path) { | ||
| if (path.empty()) { | ||
| return path; | ||
| } | ||
|
|
||
| try { | ||
| #ifdef _WIN32 | ||
| // Check if path is already absolute (drive letter or UNC path) | ||
| if ((path.size() > 1 && path[1] == ':') || | ||
| (path.size() > 1 && path[0] == '\\' && path[1] == '\\')) { | ||
| return path; | ||
| } | ||
|
|
||
| char buffer[MAX_PATH]; | ||
| if (GetFullPathNameA(path.c_str(), MAX_PATH, buffer, nullptr)) { | ||
| return std::string(buffer); | ||
| } | ||
|
|
||
| return path; // fallback on failure | ||
|
|
||
| std::wstring wide_path = ToWideString(path); | ||
| DWORD required_size = GetFullPathNameW(wide_path.c_str(), 0, nullptr, nullptr); | ||
| if (required_size == 0) { | ||
| return path; | ||
| } | ||
|
|
||
| std::vector<wchar_t> buffer(required_size); | ||
| DWORD actual_size = GetFullPathNameW( | ||
| wide_path.c_str(), | ||
| required_size, | ||
| buffer.data(), | ||
| nullptr | ||
| ); | ||
|
|
||
| if (actual_size == 0 || actual_size >= required_size) { | ||
| return path; | ||
| } | ||
|
|
||
| std::wstring resolved_wide(buffer.data(), actual_size); | ||
| return ToString(resolved_wide); | ||
| #else | ||
| // POSIX: absolute paths start with '/' | ||
| if (path[0] == '/') { | ||
| char resolved_path[PATH_MAX]; | ||
| if (realpath(path.c_str(), resolved_path) == nullptr) { | ||
| return path; | ||
| } | ||
| return std::string(resolved_path); | ||
| #endif |
There was a problem hiding this comment.
The PR description mentions refactoring ResolveAbsolutePath to use std::filesystem and lexically_normal(), but the implementation still uses GetFullPathNameW/realpath and does not do lexical normalization. Either update the PR description to match the actual approach, or implement the described std::filesystem-based normalization.
sherpa-onnx/csrc/file-utils.cc
Outdated
| struct stat file_stat; | ||
| return stat(filename.c_str(), &file_stat) == 0 && S_ISREG(file_stat.st_mode); | ||
| #endif | ||
| } catch (const std::exception&) { | ||
| return false; |
There was a problem hiding this comment.
This file now catches std::exception, but it does not include <exception>. Add the direct standard header to avoid relying on transitive includes.
|
|
||
| namespace sherpa_onnx { | ||
| std::wstring ToWideString(const std::string &s); | ||
| std::string ToString(const std::wstring &s); |
There was a problem hiding this comment.
ToWideString/ToString are forward-declared here but defined in text-utils.h/.cc. Prefer including the header instead of duplicating declarations so the dependency is explicit and signatures stay in sync.
| namespace sherpa_onnx { | |
| std::wstring ToWideString(const std::string &s); | |
| std::string ToString(const std::wstring &s); | |
| #include "sherpa-onnx/csrc/text-utils.h" | |
| namespace sherpa_onnx { |
| DWORD bytes_read = 0; | ||
| if (!::ReadFile(hFile, buffer.data(), static_cast<DWORD>(buffer.size()), &bytes_read, nullptr) || | ||
| bytes_read != buffer.size()) { | ||
| return {}; | ||
| } | ||
|
|
There was a problem hiding this comment.
Windows ReadFile casts buffer.size() to DWORD for the ReadFile() call. For files >4GiB this overflows / truncates the requested read size; the function will then fail even though SIZE_MAX is larger. Consider rejecting sizes > DWORD max explicitly or reading in a loop/chunks.
| DWORD bytes_read = 0; | |
| if (!::ReadFile(hFile, buffer.data(), static_cast<DWORD>(buffer.size()), &bytes_read, nullptr) || | |
| bytes_read != buffer.size()) { | |
| return {}; | |
| } | |
| size_t total_read = 0; | |
| while (total_read < buffer.size()) { | |
| size_t remaining = buffer.size() - total_read; | |
| DWORD to_read = remaining > static_cast<size_t>(MAXDWORD) | |
| ? MAXDWORD | |
| : static_cast<DWORD>(remaining); | |
| DWORD bytes_read = 0; | |
| if (!::ReadFile(hFile, buffer.data() + total_read, to_read, &bytes_read, nullptr)) { | |
| return {}; | |
| } | |
| if (bytes_read == 0) { | |
| // Unexpected end of file | |
| return {}; | |
| } | |
| total_read += bytes_read; | |
| } |
| std::streamsize size = file.tellg(); | ||
| file.seekg(0, std::ios::beg); | ||
|
|
||
| std::vector<char> buffer(size); | ||
| if (!file.read(buffer.data(), size)) { | ||
| std::vector<char> buffer(size); | ||
| if (!file.read(buffer.data(), size)) { | ||
| return {}; |
There was a problem hiding this comment.
In the non-Windows ReadFile path, tellg() can return -1 on failure; constructing std::vector<char> buffer(size) would then attempt a huge allocation due to signed-to-unsigned conversion. Check that size >= 0 (and that the stream is in a good state) before allocating.
There was a problem hiding this comment.
♻️ Duplicate comments (2)
sherpa-onnx/csrc/file-utils.cc (2)
94-99:⚠️ Potential issue | 🟡 MinorValidate
tellg()result before using it as vector length.At Line 94 and Line 97,
tellg()failure (-1) is not checked before vector sizing. This can trigger a huge implicit unsigned conversion and exception path instead of a clean early return.Proposed fix
- std::streamsize size = file.tellg(); + std::streamsize size = file.tellg(); + if (size < 0) { + return {}; + } file.seekg(0, std::ios::beg); - std::vector<char> buffer(size); - if (!file.read(buffer.data(), size)) { + std::vector<char> buffer(static_cast<size_t>(size)); + if (size > 0 && !file.read(buffer.data(), size)) { return {}; }#!/bin/bash # Verify tellg() guard and safe vector sizing in POSIX branch. rg -n -C3 'tellg\(|size < 0|vector<char> buffer\(|file\.read\(' sherpa-onnx/csrc/file-utils.cc🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sherpa-onnx/csrc/file-utils.cc` around lines 94 - 99, The code uses std::streamsize size = file.tellg() and then constructs std::vector<char> buffer(size) without validating tellg(); first check that size != -1 (or size >= 0) after calling tellg() and return {} early on failure, then cast size to a safe size_t when sizing the vector (e.g., static_cast<size_t>(size)); ensure the validation happens before calling file.seekg and before constructing buffer and before file.read, and keep references to tellg(), size, buffer (std::vector<char> buffer(...)), and file.read(...) to locate and update the logic.
68-84:⚠️ Potential issue | 🟠 MajorGuard file size against Win32
DWORDread limit before allocation.At Line 69/Line 79, size is bounded by
SIZE_MAXbutReadFilestill accepts onlyDWORD. Files larger thanMAXDWORDcan be mis-sized/truncated at the call boundary after allocating a large buffer.Proposed fix
- if (!GetFileSizeEx(hFile, &file_size) || file_size.QuadPart > SIZE_MAX) { + if (!GetFileSizeEx(hFile, &file_size) || file_size.QuadPart < 0 || + file_size.QuadPart > static_cast<LONGLONG>(SIZE_MAX) || + file_size.QuadPart > static_cast<LONGLONG>(MAXDWORD)) { return {}; }#!/bin/bash # Verify whether the Windows branch guards MAXDWORD before ReadFile. rg -n -C3 'GetFileSizeEx|MAXDWORD|ReadFile\(' sherpa-onnx/csrc/file-utils.cc🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sherpa-onnx/csrc/file-utils.cc` around lines 68 - 84, The code checks file_size against SIZE_MAX but then passes buffer.size() to ReadFile which accepts a DWORD; before allocating or calling ReadFile in the GetFileSizeEx/ReadFile path, additionally guard that file_size.QuadPart <= MAXDWORD (or handle large files by reading in chunks) so you never cast a size > MAXDWORD into a DWORD; update the logic around file_size, buffer allocation (vector<char> buffer(...)), and the ReadFile call (and the bytes_read validation) to either return an error for files > MAXDWORD or implement chunked reads for large files.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@sherpa-onnx/csrc/file-utils.cc`:
- Around line 94-99: The code uses std::streamsize size = file.tellg() and then
constructs std::vector<char> buffer(size) without validating tellg(); first
check that size != -1 (or size >= 0) after calling tellg() and return {} early
on failure, then cast size to a safe size_t when sizing the vector (e.g.,
static_cast<size_t>(size)); ensure the validation happens before calling
file.seekg and before constructing buffer and before file.read, and keep
references to tellg(), size, buffer (std::vector<char> buffer(...)), and
file.read(...) to locate and update the logic.
- Around line 68-84: The code checks file_size against SIZE_MAX but then passes
buffer.size() to ReadFile which accepts a DWORD; before allocating or calling
ReadFile in the GetFileSizeEx/ReadFile path, additionally guard that
file_size.QuadPart <= MAXDWORD (or handle large files by reading in chunks) so
you never cast a size > MAXDWORD into a DWORD; update the logic around
file_size, buffer allocation (vector<char> buffer(...)), and the ReadFile call
(and the bytes_read validation) to either return an error for files > MAXDWORD
or implement chunked reads for large files.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 66299578-a90c-464a-9882-f141d4ae609b
📒 Files selected for processing (1)
sherpa-onnx/csrc/file-utils.cc
包含 <stdlib.h>支持POSIX realpath()的使用 Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…XiaoBaiCanDo/sherpa-onnx into fix/file-utils-resolve-win-path
修复 #3171
此 PR 解决了 issue #3171 中提到的win平台中文路径解析问题。
修改内容:
ResolveAbsolutePath函数,使用std::filesystemAPIlexically_normal()规范化路径格式测试:
Fixes #3171
Summary by CodeRabbit