perf: Implement BufferPool for efficient memory reuse #61

Open
Leahlijuan wants to merge 9 commits into main from perf/bufferpool

@Leahlijuan (Collaborator) commented Feb 25, 2026

Introduces BufferPool to manage and reuse BufferIO objects across checkpoint steps, significantly reducing memory allocation overhead and filesystem churn.

Key Changes:

  • Core: Added BufferPool class for buffer acquisition, lifecycle management, and reuse via symbolic links.
  • Manager: Integrated BufferPool into CheckpointObjectManager.
  • Adapter: Updated NeMo CheckpointIO to leverage BufferPool.
  • BufferIOProxy: Introduced BufferIOProxy to wrap BufferIO operations, preventing premature closure of the underlying buffer object.

BufferPool Lifecycle:

  • Initialization: Lazily initialized as a per-process singleton upon the first checkpoint save.
  • Acquisition: Acquires buffers from the pool when available, creating symlinks to the underlying buffer objects for the current checkpoint step.
  • Cleanup: External checkpoint removal deletes the symlinks.
  • Reuse: The BufferPool dynamically checks for symlink existence; if a symlink is missing, the corresponding buffer is reclaimed and returned to the pool for reuse.
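
The lifecycle above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PR's implementation: the class name `SymlinkBufferPool`, its constructor parameters, and the file naming are hypothetical, plain files stand in for BufferIO objects, and POSIX symlink support is assumed.

```python
import os
import threading


class SymlinkBufferPool:
    """Hypothetical pool: a backing file is in use while its symlink exists."""

    def __init__(self, pool_dir, num_buffers, buffer_size):
        self._lock = threading.Lock()
        self._free = []      # backing files with no live symlink
        self._in_use = {}    # backing file -> symlink path handed out
        os.makedirs(pool_dir, exist_ok=True)
        for i in range(num_buffers):
            path = os.path.join(pool_dir, f"buffer_{i}.bin")
            with open(path, "wb") as f:
                f.truncate(buffer_size)  # pre-allocate the backing file
            self._free.append(path)

    def _reclaim(self):
        # A symlink deleted by external checkpoint cleanup frees its buffer.
        for backing, link in list(self._in_use.items()):
            if not os.path.lexists(link):
                del self._in_use[backing]
                self._free.append(backing)

    def acquire(self, link_path):
        """Link a free backing file at link_path; return None if exhausted."""
        with self._lock:
            self._reclaim()
            if not self._free:
                return None
            backing = self._free.pop()
            parent = os.path.dirname(link_path)
            if parent:
                os.makedirs(parent, exist_ok=True)
            os.symlink(backing, link_path)
            self._in_use[backing] = link_path
            return backing
```

In this sketch, reclamation happens lazily inside `acquire`, so no background thread is needed: deleting a checkpoint directory (and with it the symlink) is enough for the next `acquire` call to reuse the backing file.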

Fixes #40

It's a good idea to open an issue first for discussion.

  • Tests pass
  • Appropriate changes to documentation are included in the PR

@gemini-code-assist (Contributor)

Summary of Changes

Hello @Leahlijuan, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances performance and resource management by introducing a buffer pooling mechanism. The new BufferPool allows for the reuse of memory-mapped buffers, minimizing the overhead of file system operations during checkpointing. By integrating dynamic resizing capabilities into the buffer objects and centralizing their management, the system can more efficiently handle varying data sizes and reduce memory fragmentation, leading to a more robust and performant checkpointing process.

Highlights

  • Buffer Pool Implementation: Introduced a new BufferPool class and BufferIOProxy to manage and reuse BufferIO objects efficiently, reducing overhead associated with frequent buffer creation and destruction.
  • Buffer Resizing Capability: Added resize functionality to BufferIO and its underlying C++ BufferObject, allowing buffers to dynamically adjust their capacity as needed during write operations.
  • Integrated Buffer Management: Modified MLFlashpointCheckpointIO to initialize and tear down the BufferPool in worker processes, and updated CheckpointObjectManager to acquire buffers from this pool, falling back to standalone creation if the pool is unavailable.
  • Streamlined Buffer Lifecycle: Refactored ReplicationManager to remove explicit buffer closing, as the BufferIOProxy now handles the lifecycle, including truncation and returning buffers to the pool.
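
The proxy idea in the highlights above can be illustrated with a small sketch. It assumes nothing about the real BufferIOProxy API: `PooledBufferProxy` and its `on_release` callback are hypothetical names, an `io.BytesIO` stands in for a BufferIO object, and the truncation step mentioned above is omitted for brevity.

```python
import io


class PooledBufferProxy:
    """Hypothetical proxy: close() releases the buffer to its owner."""

    def __init__(self, buffer, on_release):
        self._buffer = buffer          # long-lived underlying buffer
        self._on_release = on_release  # callback that returns it to a pool
        self._closed = False

    @property
    def closed(self):
        return self._closed

    def write(self, data):
        if self._closed:
            raise ValueError("I/O operation on closed proxy")
        return self._buffer.write(data)

    def close(self):
        # Idempotent: the underlying buffer stays open for the next user.
        if self._closed:
            return
        self._closed = True
        self._on_release(self._buffer)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()


# Usage: the caller "closes" the proxy, but the underlying buffer survives.
released = []
underlying = io.BytesIO()
with PooledBufferProxy(underlying, released.append) as proxy:
    proxy.write(b"payload")
```

Because callers only ever hold the proxy, code that closes its buffer after writing or replicating cannot invalidate the pooled object; only the owner of the underlying buffer decides when it is really closed.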


Changelog
  • src/ml_flashpoint/adapter/nemo/checkpoint_io.py
    • Imported BufferPool and DEFAULT_INITIAL_BUFFER_SIZE_BYTES.
    • Added _init_pool_in_worker and _teardown_pool_in_worker functions for BufferPool management.
    • Modified MLFlashpointAsyncFinalizableCheckpointIO to initialize the BufferPool on the first save and schedule its teardown.
  • src/ml_flashpoint/checkpoint_object_manager/buffer_io.py
    • Added a resize method to BufferIO to change the buffer's capacity, handling memoryview release and recreation.
  • src/ml_flashpoint/checkpoint_object_manager/buffer_object/bindings.cpp
    • Exposed the resize method of BufferObject to Python bindings.
  • src/ml_flashpoint/checkpoint_object_manager/buffer_object/buffer_helper.cpp
    • Implemented resize_mmap function to safely unmap, ftruncate, and remap memory-mapped files to a new size.
  • src/ml_flashpoint/checkpoint_object_manager/buffer_object/buffer_helper.h
    • Declared the resize_mmap function.
  • src/ml_flashpoint/checkpoint_object_manager/buffer_object/buffer_object.cpp
    • Implemented the resize method for BufferObject, including checks for closed or read-only states.
  • src/ml_flashpoint/checkpoint_object_manager/buffer_object/buffer_object.h
    • Declared the resize method in the BufferObject class.
  • src/ml_flashpoint/checkpoint_object_manager/checkpoint_object_manager.py
    • Renamed create_buffer to acquire_buffer.
    • Modified acquire_buffer to first attempt to acquire a buffer from the BufferPool and fall back to creating a standalone BufferObject if the pool is unavailable.
  • src/ml_flashpoint/core/buffer_pool.py
    • Added new file buffer_pool.py.
    • Implemented BufferPool as a singleton for managing reusable BufferIO objects.
    • Implemented BufferIOProxy to wrap BufferIO instances, preventing premature closing and providing auto-resizing logic.
    • Included logic for pre-allocating buffers, garbage collection of orphaned buffers, and thread-safe acquisition.
  • src/ml_flashpoint/core/checkpoint_saver.py
    • Updated calls from create_buffer to acquire_buffer in _write_to_buffer_from_queue_worker.
  • src/ml_flashpoint/replication/replication_manager.py
    • Removed explicit logging import.
    • Removed close_buffer calls from async_replicate and _final_replication_callback, as buffer lifecycle is now managed by BufferIOProxy.
  • tests/adapter/nemo/test_checkpoint_io.py
    • Updated imports to include new functions and classes.
    • Modified test fixtures and mock setups to accommodate BufferPool initialization and teardown.
    • Adjusted assertions for schedule_async_request to account for BufferPool initialization calls.
    • Added tests for BufferPool initialization and teardown scheduling.
  • tests/adapter/nemo/test_wrapper_util.py
    • Added MockMLFlashpointCheckpointIO class.
    • Updated mock setups for MLFlashpointCheckpointIO to include flashpoint_base_dir, trainer, and save_strategy attributes for BufferPool related logic.
  • tests/adapter/pytorch/test_memory_storage_reader.py
    • Updated create_buffer calls to acquire_buffer.
  • tests/adapter/pytorch/test_memory_storage_writer.py
    • Removed DummySaver class.
  • tests/checkpoint_object_manager/buffer_object/buffer_helper_test.cpp
    • Added comprehensive tests for the new resize_mmap function, covering success cases for larger/smaller sizes and failure cases for invalid file descriptors or read-only files.
  • tests/checkpoint_object_manager/buffer_object/buffer_object_test.cpp
    • Added tests for the BufferObject::resize method, verifying capacity changes, data preservation, and error handling for closed/read-only buffers or zero capacity.
  • tests/checkpoint_object_manager/test_buffer_io.py
    • Added TestResizeOperations class with tests for BufferIO.resize, verifying capacity increase, memoryview updates, and error handling for closed buffers.
  • tests/checkpoint_object_manager/test_checkpoint_object_manager.py
    • Updated manager_setup and mock_buffer_manager fixtures to mock BufferPool and simulate its initialization state.
    • Renamed TestCreateBuffer to TestAcquireBuffer and updated all create_buffer calls to acquire_buffer.
    • Adjusted error messages and expected exceptions in tests to reflect the new acquire_buffer logic, especially for existing files and directory paths.
  • tests/core/test_buffer_pool.py
    • Added new file test_buffer_pool.py.
    • Implemented extensive tests for BufferPool and BufferIOProxy, covering buffer acquisition, reuse, garbage collection, pool exhaustion, auto-resizing, concurrent access, and error handling.
  • tests/core/test_checkpoint_loader.py
    • Updated create_buffer calls to acquire_buffer.
  • tests/core/test_checkpoint_saver.py
    • Updated the chkpt_object_manager fixture to initialize and tear down the BufferPool.
    • Changed the mocked method from create_buffer to acquire_buffer in test_write_data_io_error.
    • Updated create_buffer to acquire_buffer in test_async_replicate_object.
  • tests/replication/test_replication_manager.py
    • Updated assertions to reflect the removal of explicit close_buffer calls.
  • tests/replication/test_replication_manager_e2e.py
    • Removed the explicit time.sleep and the buffer_io.closed assertion after replication, replacing them with a direct buffer_io.close() call.
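
The unmap, ftruncate, remap sequence that the changelog attributes to resize_mmap can be sketched with Python's standard `mmap` module. The PR implements this in C++; the function below, `resize_mapping`, is a hypothetical Python analogue for illustration only and assumes a file descriptor opened for reading and writing.

```python
import mmap
import os


def resize_mapping(fd, mapping, new_size):
    """Unmap, ftruncate, then remap fd at new_size; returns the new mapping."""
    if new_size <= 0:
        raise ValueError("new_size must be positive")
    mapping.flush()
    # Unmap first so no stale view of the old region survives.
    # close() raises BufferError if memoryviews of the mapping are still
    # exported, so callers must release those beforehand.
    mapping.close()
    os.ftruncate(fd, new_size)  # grow (zero-filled) or shrink the file
    return mmap.mmap(fd, new_size)
```

The ordering matters: truncating a file below the length of a live mapping can make later accesses to the mapped tail fault, which is why the old mapping is released before the file size changes.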

Activity

  • The pull request introduces a new BufferPool and BufferIOProxy for managing buffer objects.
  • It modifies the MLFlashpointCheckpointIO to integrate the BufferPool for efficient checkpoint saving.
  • The underlying C++ BufferObject and Python BufferIO now support dynamic resizing.
  • Explicit buffer closing logic was removed from ReplicationManager as the BufferIOProxy handles it.
  • Numerous tests were added and updated to cover the new buffer pooling and resizing functionalities.
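
As a rough picture of the dynamic resizing mentioned above, a write path can grow the buffer before copying data into it. Everything in this sketch is an assumption for illustration: `GrowableBuffer` is a hypothetical class backed by a `bytearray` rather than a memory-mapped file, and the doubling growth policy is not taken from the PR.

```python
class GrowableBuffer:
    """Hypothetical fixed-capacity buffer whose writes grow it on demand."""

    def __init__(self, capacity):
        self._data = bytearray(capacity)
        self._pos = 0

    @property
    def capacity(self):
        return len(self._data)

    def resize(self, new_capacity):
        if new_capacity <= 0:
            raise ValueError("capacity must be positive")
        if new_capacity > len(self._data):
            # Grow with zero bytes, keeping existing contents.
            self._data.extend(bytes(new_capacity - len(self._data)))
        else:
            del self._data[new_capacity:]
            self._pos = min(self._pos, new_capacity)

    def write(self, payload):
        needed = self._pos + len(payload)
        if needed > self.capacity:
            # Doubling is an assumed growth policy, not taken from the PR.
            self.resize(max(needed, 2 * self.capacity))
        self._data[self._pos:needed] = payload
        self._pos = needed
        return len(payload)

    def getvalue(self):
        return bytes(self._data[:self._pos])
```

Growing geometrically keeps the number of resize calls logarithmic in the final size, which matters when each resize involves remapping a file.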

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request introduces a BufferPool to manage and reuse memory-mapped buffers, aiming to improve performance by reducing the overhead of buffer creation and destruction. The changes include a new BufferPool singleton class with a proxy (BufferIOProxy) to manage buffer lifecycle and provide auto-resizing capabilities, C++ layer extensions to support resizing of BufferObjects, and refactoring of CheckpointObjectManager to use the pool. The implementation is robust and well-tested. My feedback focuses on improving maintainability by removing dependencies on private members and clarifying constants.

@Leahlijuan Leahlijuan changed the title from "Perf/bufferpool" to "perf: Implement BufferPool for efficient memory reuse" Feb 25, 2026
@Leahlijuan Leahlijuan requested review from g-husam and kkkapu February 25, 2026 18:25
Introduces BufferPool to manage and reuse BufferIO objects across checkpoint
steps, significantly reducing memory allocation overhead and filesystem churn.

Key Changes:
- Core: Added BufferPool class for buffer acquisition, lifecycle management,
  and reuse via symbolic links.
- Manager: Integrated BufferPool into CheckpointObjectManager.
- Adapter: Updated NeMo CheckpointIO to leverage BufferPool.
- BufferIO: Updated buffer_io and buffer_object to support reusable buffers.
- BufferIOProxy: Introduced BufferIOProxy to wrap BufferIO operations,
  preventing premature closure of the underlying buffer object.

BufferPool Lifecycle:
- Initialization: Lazily initialized as a per-process singleton upon the
  first checkpoint save.
- Acquisition: Acquires buffers from the pool when available, creating
  symlinks to the underlying buffer objects for the current checkpoint step.
- Cleanup: External checkpoint removal deletes the symlinks.
- Reuse: The BufferPool dynamically checks for symlink existence; if a
  symlink is missing, the corresponding buffer is reclaimed and returned to
  the pool for reuse.
@github-actions

Python Code Coverage Summary

Code Coverage

| Package | Line Rate | Branch Rate |
| --- | --- | --- |
| src.ml_flashpoint | 100% | 100% |
| src.ml_flashpoint.adapter | 100% | 100% |
| src.ml_flashpoint.adapter.megatron | 97% | 94% |
| src.ml_flashpoint.adapter.nemo | 98% | 94% |
| src.ml_flashpoint.adapter.pytorch | 99% | 88% |
| src.ml_flashpoint.checkpoint_object_manager | 91% | 88% |
| src.ml_flashpoint.core | 95% | 91% |
| src.ml_flashpoint.replication | 81% | 82% |
| Summary | 94% (2267 / 2405) | 90% (519 / 578) |

Minimum allowed line rate is 90%

@github-actions

C++ Code Coverage Summary

Code Coverage

| Package | Line Rate | Branch Rate |
| --- | --- | --- |
| src.ml_flashpoint.checkpoint_object_manager.buffer_object | 93% | 54% |
| src.ml_flashpoint.checkpoint_object_manager.object_manager | 70% | 37% |
| src.ml_flashpoint.replication.transfer_service | 79% | 40% |
| Summary | 81% (916 / 1126) | 43% (687 / 1604) |

Minimum allowed line rate is 80%

Development

Successfully merging this pull request may close these issues.

Reuse mmap buffers when saving checkpoint objects
