[bugfix] fix direct io bad address#174
Conversation
Summary of ChangesHello @chloroethylene, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a bug related to "direct io bad address" by refining the host pinned memory allocation mechanisms. It ensures that memory intended for direct I/O operations is correctly pinned and initialized, particularly for compatibility with Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request aims to fix a "bad address" error with direct I/O by changing the memory allocation strategy for pinned memory from aclrtMallocHost to mmap and mlock in alloc_pinned_ptr and alloc_pinned_numa_ptr. However, a critical memory management vulnerability has been introduced: the deallocation function free_pinned_ptr is not updated to match the new mmap allocation, still using aclrtFreeHost. This mismatch will cause severe runtime problems, including crashes or undefined behavior during memory deallocation. The alloc_pinned_numa_ptr correctly uses munmap for deallocation, highlighting this inconsistency.
|
maybe |
csrc/common/mem_alloc.cpp
Outdated
| strerror(errno)); | ||
| } | ||
|
|
||
| memset(ptr, 0, size); |
There was a problem hiding this comment.
I think you should memset before mlock.
| } | ||
|
|
||
| // In kernels 5.10 and earlier, the aclrtHostRegister requires pinned memory | ||
| if (mlock(ptr, size) != 0) { |
csrc/common/mem_alloc.cpp
Outdated
| aclError err = aclrtMallocHost(&ptr, size); | ||
| if (err != ACL_SUCCESS) { | ||
| throw std::runtime_error("aclrtMallocHost failed: " + std::to_string(err)); | ||
| void *ptr = mmap(nullptr, size, PROT_READ | PROT_WRITE, |
There was a problem hiding this comment.
yeah, after release, the size also should be passed in.
There was a problem hiding this comment.
it is possible to restructure the mem_alloc to be more inline with managed_mem to have an internal map, therefore we can avoid changing the pybind api which would break compatibility.
For example when we have alloc_xxx_ptr(xx)
we have a pinned_manager that stored the ptr, and sizes and its configuration. once free is being invoked, we obtain the sizes and anything relevant if we need to. @chloroethylene
e40bb93 to
dea3e31
Compare
0b19bb9 to
8ab4215
Compare
|
|
||
| // Lock the memory to ensure it's pinned | ||
| if (mlock(ptr, size) != 0) { | ||
| std::cerr << "[allocMem] mlock failed: " << strerror(errno) |
There was a problem hiding this comment.
fall through while failed. because unit test always get this error, maybe due to CAP_IPC_LOCK or other reason. we get ENOMEM when exec mlock:
RuntimeError: [allocMem] mlock failed: Cannot allocate memory
226d6bf to
93adc72
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request aims to address a 'direct IO bad address' bug by replacing aclrtMallocHost with mmap for memory allocation, which is a sound approach for obtaining memory suitable for direct I/O. However, the new memory management scheme introduces security and stability concerns. Specifically, the allocMem function fails to treat mlock failure as a fatal error, which could lead to unsafe DMA operations on unpinned memory, potentially causing data corruption or system instability. Additionally, there's a critical memory leak vulnerability if a newly allocated pointer is already tracked, and the error handling in free_pinned_numa_ptr could lead to double munmap calls. The related refactoring to introduce a separate allocatedMap and its own mutex in HostRegisteredMemoryManager improves code clarity and reduces potential lock contention.
| auto unRegErr = unregister_ptr(ptr); | ||
|
|
||
| // Unmap the memory | ||
| auto unMapErr = munmap(ptr, size); | ||
|
|
||
| if (unRegErr) { | ||
| throw std::runtime_error("unregister_ptr failed: " + | ||
| std::to_string(unRegErr)); |
There was a problem hiding this comment.
In free_pinned_numa_ptr, munmap is called to unmap memory after attempting to unregister the pointer. If unregister_ptr fails, the function throws a std::runtime_error after munmap has already been executed. If the caller of free_pinned_numa_ptr attempts to retry the operation upon failure, it will result in a double munmap on the same pointer. Double munmap is a security vulnerability that can lead to memory corruption by unmapping memory that might have been re-allocated to another part of the process. The logic should ensure that the exception is thrown in a way that doesn't lead to unsafe retries, or that the memory is only unmapped if unregistration succeeds (if that's the intended semantics).
…ed memory - Split single mux into regMux (for registeredMap) and allocMux (for allocatedMap) - Move allocMem/freeMem logic into class methods with consolidated mmap/mlock/memset
93adc72 to
781da7e
Compare
|
Also it might be possible to not use mlock, as halhostregister automatically pinned. But its worth checking @chloroethylene |
[bugfix] fix direct io bad address
Fixes #171
Changes:
aclrtMallocHosttommapfor allocating memory