[bugfix] fix direct io bad address #174

Open
chloroethylene wants to merge 1 commit into LMCache:main from chloroethylene:bugfix/fix_dio_bad_address_bug

Conversation

@chloroethylene (Collaborator) commented Feb 13, 2026

[bugfix] fix direct io bad address

Fixes #171

Changes:

  1. Replace aclrtMallocHost with mmap for allocating pinned host memory

@gemini-code-assist (Contributor)

Summary of Changes

Hello @chloroethylene, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a bug related to "direct io bad address" by refining the host pinned memory allocation mechanisms. It ensures that memory intended for direct I/O operations is correctly pinned and initialized, particularly for compatibility with aclrtHostRegister on specific kernel versions, thereby preventing potential memory access issues.

Highlights

  • Pinned Memory Allocation Refactor: The alloc_pinned_ptr function was refactored to use mmap and mlock for allocating pinned host memory, replacing the aclrtMallocHost call to ensure proper memory pinning.
  • NUMA Pinned Memory Enhancement: The alloc_pinned_numa_ptr function was updated to explicitly call mlock after mbind to ensure memory is properly pinned, addressing compatibility requirements for aclrtHostRegister on specific kernel versions.
  • Memory Initialization: Both alloc_pinned_ptr and alloc_pinned_numa_ptr now initialize the allocated memory to zero using memset after successful allocation.
  • Improved Error Reporting: Error messages for mmap, mlock, and mbind failures were enhanced to include the specific function name where the error occurred, aiding in debugging.

@gemini-code-assist (Contributor, bot) left a comment

Code Review

This pull request aims to fix a "bad address" error with direct I/O by changing the memory allocation strategy for pinned memory from aclrtMallocHost to mmap and mlock in alloc_pinned_ptr and alloc_pinned_numa_ptr. However, a critical memory management vulnerability has been introduced: the deallocation function free_pinned_ptr is not updated to match the new mmap allocation, still using aclrtFreeHost. This mismatch will cause severe runtime problems, including crashes or undefined behavior during memory deallocation. The alloc_pinned_numa_ptr correctly uses munmap for deallocation, highlighting this inconsistency.
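The mismatch described above comes down to pairing: memory obtained from mmap has to be released with munmap using the same length, not with aclrtFreeHost. The following is a minimal sketch of a matched pair using only POSIX calls. The names `alloc_host_buffer` and `free_host_buffer` are hypothetical and are not the PR's functions, and the ACL registration step is left out.

```cpp
#include <sys/mman.h>

#include <cerrno>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper: anonymous, private mapping of `size` bytes.
void *alloc_host_buffer(std::size_t size) {
  void *ptr = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (ptr == MAP_FAILED) {
    throw std::runtime_error(std::string("mmap failed: ") +
                             std::strerror(errno));
  }
  return ptr;
}

// Hypothetical helper: the matching release. munmap needs the mapping length,
// which is why the free path has to know `size`.
void free_host_buffer(void *ptr, std::size_t size) {
  if (munmap(ptr, size) != 0) {
    throw std::runtime_error(std::string("munmap failed: ") +
                             std::strerror(errno));
  }
}
```

Because munmap takes a length, the free side must either receive the size from the caller or look it up somewhere.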

@chloroethylene chloroethylene marked this pull request as draft February 13, 2026 07:15
@chloroethylene (Collaborator, Author)

Maybe free_pinned_ptr needs to take a size parameter.

strerror(errno));
}

memset(ptr, 0, size);

I think you should memset before mlock.

}

// In kernels 5.10 and earlier, the aclrtHostRegister requires pinned memory
if (mlock(ptr, size) != 0) {

This mlock call should come after the memset.

Collaborator

Agreed. @chloroethylene
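The ordering the reviewers ask for can be sketched as follows: map the memory, write to every page with memset, and only then call mlock. This is a sketch under assumptions, not the PR's code. The helper name `alloc_zeroed_then_locked` is hypothetical, and reporting the mlock result through a flag instead of throwing is a choice made only for this illustration.

```cpp
#include <sys/mman.h>

#include <cerrno>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper, not the PR's code. Maps `size` bytes, zeroes them, then
// tries to lock them. `*locked` reports whether mlock succeeded.
void *alloc_zeroed_then_locked(std::size_t size, bool *locked) {
  void *ptr = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (ptr == MAP_FAILED) {
    throw std::runtime_error(std::string("mmap failed: ") +
                             std::strerror(errno));
  }

  // Step 1: write to every page so each one is backed by physical memory.
  std::memset(ptr, 0, size);

  // Step 2: only then ask the kernel to keep those pages resident.
  *locked = (mlock(ptr, size) == 0);
  return ptr;
}
```

The caller still owns the mapping and has to release it with munmap and the same size.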

aclError err = aclrtMallocHost(&ptr, size);
if (err != ACL_SUCCESS) {
throw std::runtime_error("aclrtMallocHost failed: " + std::to_string(err));
void *ptr = mmap(nullptr, size, PROT_READ | PROT_WRITE,

Yes, the size also needs to be passed in on release.

@matthewygf (Collaborator) commented Feb 15, 2026

It is possible to restructure mem_alloc to be more in line with managed_mem and keep an internal map, so that we can avoid changing the pybind API, which would break compatibility.

For example, when alloc_xxx_ptr(xx) is called, a pinned_manager stores the ptr, its size, and its configuration. Once free is invoked, we look up the size and anything else relevant. @chloroethylene
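A minimal sketch of that idea, assuming a plain pointer-to-size map guarded by one mutex. The class name `PinnedManager` and its methods are hypothetical and only illustrate the suggestion. The real manager would also hold the registration and NUMA configuration mentioned above.

```cpp
#include <sys/mman.h>

#include <cstddef>
#include <mutex>
#include <stdexcept>
#include <unordered_map>

// Hypothetical manager: remembers the size of every mapping it hands out, so
// release(ptr) keeps a pointer-only signature.
class PinnedManager {
 public:
  void *alloc(std::size_t size) {
    void *ptr = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ptr == MAP_FAILED) {
      throw std::runtime_error("mmap failed");
    }
    std::lock_guard<std::mutex> guard(mux_);
    sizes_[ptr] = size;  // record the length that munmap will need later
    return ptr;
  }

  void release(void *ptr) {
    std::size_t size = 0;
    {
      std::lock_guard<std::mutex> guard(mux_);
      auto it = sizes_.find(ptr);
      if (it == sizes_.end()) {
        throw std::invalid_argument("pointer is not tracked");
      }
      size = it->second;
      sizes_.erase(it);  // forget the entry before unmapping
    }
    if (munmap(ptr, size) != 0) {
      throw std::runtime_error("munmap failed");
    }
  }

  std::size_t tracked() const {
    std::lock_guard<std::mutex> guard(mux_);
    return sizes_.size();
  }

 private:
  mutable std::mutex mux_;
  std::unordered_map<void *, std::size_t> sizes_;
};
```

With this shape the free function exposed through pybind can keep taking only a pointer, and a second release of the same pointer is rejected instead of reaching munmap again.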

@chloroethylene chloroethylene force-pushed the bugfix/fix_dio_bad_address_bug branch from e40bb93 to dea3e31 Compare February 15, 2026 16:17
@chloroethylene chloroethylene marked this pull request as ready for review February 16, 2026 13:24
@chloroethylene chloroethylene force-pushed the bugfix/fix_dio_bad_address_bug branch 2 times, most recently from 0b19bb9 to 8ab4215 Compare February 16, 2026 15:01

// Lock the memory to ensure it's pinned
if (mlock(ptr, size) != 0) {
std::cerr << "[allocMem] mlock failed: " << strerror(errno)
Collaborator Author

Fall through when mlock fails, because the unit tests always hit this error, maybe due to CAP_IPC_LOCK or some other reason. mlock returns ENOMEM:
RuntimeError: [allocMem] mlock failed: Cannot allocate memory


The locked-memory ulimit (ulimit -l) may be different.
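On Linux, mlock reports ENOMEM when an unprivileged caller asks to lock more than the soft RLIMIT_MEMLOCK limit allows, which matches the error above. A small sketch for checking the limit before locking follows. The helper name `fits_memlock_limit` is hypothetical, and the check is deliberately rough: it ignores memory the process has already locked and does not account for CAP_IPC_LOCK.

```cpp
#include <sys/resource.h>

#include <cstddef>
#include <stdexcept>

// Hypothetical helper: true if `size` bytes fit under the soft
// RLIMIT_MEMLOCK limit of the current process.
bool fits_memlock_limit(std::size_t size) {
  struct rlimit lim {};
  if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0) {
    throw std::runtime_error("getrlimit(RLIMIT_MEMLOCK) failed");
  }
  if (lim.rlim_cur == RLIM_INFINITY) {
    return true;  // no limit configured
  }
  return size <= lim.rlim_cur;
}
```

A test environment could log this result next to the mlock error to tell a low limit apart from other causes.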

@chloroethylene chloroethylene force-pushed the bugfix/fix_dio_bad_address_bug branch 3 times, most recently from 226d6bf to 93adc72 Compare February 16, 2026 16:58
@matthewygf (Collaborator)

/gemini review

@gemini-code-assist (Contributor, bot) left a comment

Code Review

This pull request aims to address a 'direct IO bad address' bug by replacing aclrtMallocHost with mmap for memory allocation, which is a sound approach for obtaining memory suitable for direct I/O. However, the new memory management scheme introduces security and stability concerns. Specifically, the allocMem function fails to treat mlock failure as a fatal error, which could lead to unsafe DMA operations on unpinned memory, potentially causing data corruption or system instability. Additionally, there's a critical memory leak vulnerability if a newly allocated pointer is already tracked, and the error handling in free_pinned_numa_ptr could lead to double munmap calls. The related refactoring to introduce a separate allocatedMap and its own mutex in HostRegisteredMemoryManager improves code clarity and reduces potential lock contention.

Comment on lines 94 to 101
auto unRegErr = unregister_ptr(ptr);

// Unmap the memory
auto unMapErr = munmap(ptr, size);

if (unRegErr) {
throw std::runtime_error("unregister_ptr failed: " +
std::to_string(unRegErr));
Contributor

security-medium medium

In free_pinned_numa_ptr, munmap is called to unmap memory after attempting to unregister the pointer. If unregister_ptr fails, the function throws a std::runtime_error after munmap has already been executed. If the caller of free_pinned_numa_ptr attempts to retry the operation upon failure, it will result in a double munmap on the same pointer. Double munmap is a security vulnerability that can lead to memory corruption by unmapping memory that might have been re-allocated to another part of the process. The logic should ensure that the exception is thrown in a way that doesn't lead to unsafe retries, or that the memory is only unmapped if unregistration succeeds (if that's the intended semantics).
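One of the two options named above, unmapping only if unregistration succeeds, can be sketched like this. It is an assumption about the intended semantics, not the PR's implementation. The function `free_registered` is hypothetical, and the `unregister` callable stands in for unregister_ptr, returning 0 on success.

```cpp
#include <sys/mman.h>

#include <cerrno>
#include <cstddef>
#include <cstring>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical free routine. `unregister` stands in for unregister_ptr and
// returns 0 on success.
void free_registered(void *ptr, std::size_t size,
                     const std::function<int(void *)> &unregister) {
  int unreg_err = unregister(ptr);
  if (unreg_err != 0) {
    // The mapping is left intact, so a retry by the caller cannot turn into
    // a second munmap of the same address.
    throw std::runtime_error("unregister_ptr failed: " +
                             std::to_string(unreg_err));
  }
  if (munmap(ptr, size) != 0) {
    throw std::runtime_error(std::string("munmap failed: ") +
                             std::strerror(errno));
  }
}
```

The trade-off is that a persistent unregistration failure leaves the mapping allocated, so the caller has to decide how to handle that case.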

…ed memory

- Split single mux into regMux (for registeredMap) and allocMux (for allocatedMap)
- Move allocMem/freeMem logic into class methods with consolidated mmap/mlock/memset
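The mutex split in this commit message can be pictured with a simplified stand-in. The class below is hypothetical. Only the names regMux, allocMux, registeredMap, and allocatedMap come from the message, and the key and value types of both maps are assumptions.

```cpp
#include <cstddef>
#include <mutex>
#include <unordered_map>

// Hypothetical, simplified stand-in for the manager described above.
class TwoMapManager {
 public:
  // Touches only allocatedMap_, so it takes only allocMux_.
  void track_allocation(void *ptr, std::size_t size) {
    std::lock_guard<std::mutex> guard(allocMux_);
    allocatedMap_[ptr] = size;
  }

  // Touches only registeredMap_, so it takes only regMux_.
  void track_registration(void *host_ptr, void *device_ptr) {
    std::lock_guard<std::mutex> guard(regMux_);
    registeredMap_[host_ptr] = device_ptr;
  }

  std::size_t allocated_count() const {
    std::lock_guard<std::mutex> guard(allocMux_);
    return allocatedMap_.size();
  }

  std::size_t registered_count() const {
    std::lock_guard<std::mutex> guard(regMux_);
    return registeredMap_.size();
  }

 private:
  mutable std::mutex regMux_;
  mutable std::mutex allocMux_;
  std::unordered_map<void *, void *> registeredMap_;      // assumed types
  std::unordered_map<void *, std::size_t> allocatedMap_;  // assumed types
};
```

Each method takes exactly one lock, so allocation bookkeeping and registration bookkeeping do not block each other.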
@chloroethylene chloroethylene force-pushed the bugfix/fix_dio_bad_address_bug branch from 93adc72 to 781da7e Compare March 13, 2026 03:44
@matthewygf (Collaborator)

Also, it might be possible to drop mlock altogether, since halhostregister pins the memory automatically. It's worth checking, though. @chloroethylene

Development

Successfully merging this pull request may close these issues.

[BUG] direct io fails in latest version

3 participants