Jemalloc Mempool and Adaptation for CPU HASHTABLE #4154
Conversation
@TroyGarden has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Do we plan to disable page swap in a future diff? I saw we are using std::pmr::new_delete_resource(); its allocations don't lock pages in memory.
Jemalloc Mempool and Adaptation for CPU HASHTABLE
1 Current Status
Code: https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/src/dram_kv_embedding_cache
In the current implementation, the value structure of the CPU hashtable uses std::vector, which directly requests memory from the system via std::allocator. Without memory pool management, frequent allocations incur significant system call overhead.
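As a rough stand-in for that pattern (placeholder types, not the actual dram_kv_embedding_cache code), every inserted row below triggers its own heap allocation through std::allocator:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Placeholder for a per-shard hashtable: each inserted row creates a
// std::vector<float>, whose buffer comes from std::allocator (i.e. operator
// new), so N inserted IDs imply N separate heap allocations.
int main() {
  std::unordered_map<int64_t, std::vector<float>> shard;
  constexpr int kDim = 128;  // example embedding dimension
  for (int64_t id = 0; id < 1000; ++id) {
    shard.emplace(id, std::vector<float>(kDim, 0.0f));  // one allocation per row
  }
}
```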
2 Proposed Solution
2.1 Overview
Given that the memory requirements for embedding allocations are known upfront, we can leverage this information to customize bin sizes more appropriately.
Our scenario (embedding hashtable) does not involve frequent memory deallocations (only during ID eviction) but requires frequent allocations. Freed memory can be directly reused for new allocations, minimizing fragmentation concerns.
Compared to jemalloc, our approach avoids complex Buddy/Slob algorithms, memory block merging, and multi-scale slot designs. The core idea is:
- Implement dedicated memory pool management for each shard of fixed emb_dim hashtables (including merged tables).
- Reuse the existing lock mechanisms from SynchronizedShardedMap for concurrency control.
- Create a "lock-free per-table memory pool" design that minimizes code invasiveness.
Design features:
- Leverages the existing SharedMutexWritePriority from SynchronizedShardedMap.
- Memory pool operations share the same critical section as hashtable insertions (see the sketch after this list).
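A minimal sketch of that lock-sharing idea, using std::shared_mutex as a stand-in for SharedMutexWritePriority and hypothetical Shard/insert_row names (the real SynchronizedShardedMap API may differ): the pool is only touched while the shard's write lock is already held for the hashtable insert, so it needs no synchronization of its own.

```cpp
#include <algorithm>
#include <cstdint>
#include <shared_mutex>
#include <unordered_map>

// Hypothetical per-shard state: the hashtable and its memory pool are guarded
// by the same mutex that SynchronizedShardedMap already provides.
struct Shard {
  std::shared_mutex mutex;                    // stand-in for SharedMutexWritePriority
  std::unordered_map<int64_t, float*> table;  // id -> block holding one embedding row
  // FixedBlockPool pool;                     // per-shard pool, sketched in section 2.2
};

// Insert path: a single write lock covers both the pool allocation and the
// hashtable update, so the pool itself needs no internal synchronization.
void insert_row(Shard& shard, int64_t id, const float* src, int dim) {
  std::unique_lock lock(shard.mutex);
  float* block = new float[dim];  // placeholder for shard.pool.allocate()
  std::copy(src, src + dim, block);
  shard.table[id] = block;
}
```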
2.2 FixedBlockPool Design
Three-Level Structure Model
Core Data Structures
Workflow
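The original diagrams for the structure model, data structures, and workflow are not reproduced here; the sketch below uses assumed member names (not the actual FBGEMM code) to capture the three-level idea, with the allocate/deallocate workflow summarized in comments.

```cpp
#include <cstddef>
#include <vector>

// Level 1: the pool, one per table shard, owning all chunks for that shard.
// Level 2: a chunk, a large contiguous slab obtained from the system.
// Level 3: a block, a fixed-size slice of a chunk holding one embedding row.
struct FixedBlockPool {
  std::size_t block_size;        // bytes per block, derived from emb_dim
  std::size_t blocks_per_chunk;  // how many blocks each chunk is carved into
  std::vector<char*> chunks;     // all chunks owned by this pool
  std::size_t bump_offset = 0;   // next unused byte in the newest chunk
  void* free_head = nullptr;     // intrusive free list of returned blocks
};

// Workflow (a working sketch is given in section 2.3):
//   allocate:   1) pop free_head if non-null (reuse an evicted row's block);
//               2) otherwise bump-allocate from the newest chunk;
//               3) if the chunk is exhausted, request a new chunk and retry.
//   deallocate: write the current free_head into the block's first
//               sizeof(void*) bytes and make the block the new free_head.
```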
Alignment
Ensure block addresses meet the alignment requirements: the alignment is a power of two, and the block size is a multiple of that alignment.
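A small illustration of the implied arithmetic (the 64-byte alignment and 128-dim row are example values only, not taken from the implementation):

```cpp
#include <cstddef>
#include <new>

// Round a size up to the next multiple of a power-of-two alignment.
constexpr std::size_t align_up(std::size_t n, std::size_t alignment) {
  return (n + alignment - 1) & ~(alignment - 1);
}

// Example: a 128-float embedding row with 64-byte (cache-line) alignment.
constexpr std::size_t kAlignment = 64;  // power of two
constexpr std::size_t kBlockSize = align_up(128 * sizeof(float), kAlignment);
static_assert(kBlockSize % kAlignment == 0, "block size is a multiple of the alignment");

// Requesting the chunk itself with the same alignment keeps every block
// address (chunk_base + i * kBlockSize) aligned.
inline void* allocate_chunk(std::size_t blocks_per_chunk) {
  return ::operator new(blocks_per_chunk * kBlockSize, std::align_val_t{kAlignment});
}
```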
2.3 Implementation Details
Chunk Handling
Custom memory_resource Class
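A simplified, self-contained sketch of what such a class could look like; the override set comes from std::pmr::memory_resource, while the members and their logic are assumptions rather than the actual implementation.

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

// Sketch of a fixed-block pool exposed through the standard std::pmr
// interface: fixed-size blocks, an intrusive free list, and chunks obtained
// from an upstream std::pmr resource.
class FixedBlockPoolResource final : public std::pmr::memory_resource {
 public:
  FixedBlockPoolResource(std::size_t block_size, std::size_t blocks_per_chunk)
      : block_size_(block_size), blocks_per_chunk_(blocks_per_chunk) {}

  ~FixedBlockPoolResource() override {
    for (void* chunk : chunks_) {
      upstream_->deallocate(chunk, blocks_per_chunk_ * block_size_);
    }
  }

 protected:
  // Reuse a freed block if possible, otherwise bump-allocate from the newest
  // chunk, fetching a new chunk from upstream when the current one is full.
  void* do_allocate(std::size_t /*bytes*/, std::size_t /*alignment*/) override {
    if (free_head_ != nullptr) {
      void* block = free_head_;
      free_head_ = *static_cast<void**>(block);
      return block;
    }
    if (chunks_.empty() || used_in_chunk_ == blocks_per_chunk_) {
      chunks_.push_back(upstream_->allocate(blocks_per_chunk_ * block_size_));
      used_in_chunk_ = 0;
    }
    return static_cast<char*>(chunks_.back()) + block_size_ * used_in_chunk_++;
  }

  // Thread the block onto the intrusive free list; the pointer lives in the
  // block itself, which is why block_size_ must be at least sizeof(void*).
  void do_deallocate(void* p, std::size_t, std::size_t) override {
    *static_cast<void**>(p) = free_head_;
    free_head_ = p;
  }

  bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
    return this == &other;  // each table shard owns exactly one resource
  }

 private:
  std::size_t block_size_;
  std::size_t blocks_per_chunk_;
  std::size_t used_in_chunk_ = 0;
  std::vector<void*> chunks_;
  void* free_head_ = nullptr;
  std::pmr::memory_resource* upstream_ = std::pmr::new_delete_resource();
};
```

Because the class derives from std::pmr::memory_resource, pmr-aware containers can be pointed at a per-shard pool directly, e.g. `std::pmr::vector<float> row{&resource};`.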
Special Case Handling
Allocations of ≤8 bytes (sizeof(void*)) require additional handling to prevent metadata from being overwritten, since a freed block must be large enough to hold the free list's in-place next pointer.
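One plausible handling (an assumption, not necessarily the actual code) is to pad the block size so a freed block can always hold the free-list pointer:

```cpp
#include <algorithm>
#include <cstddef>

// If the requested value size is smaller than a pointer, the in-block
// free-list "next" pointer would spill past the block's end, so the block is
// padded up to at least sizeof(void*).
constexpr std::size_t effective_block_size(std::size_t value_bytes) {
  return std::max(value_bytes, sizeof(void*));
}

static_assert(effective_block_size(4) == sizeof(void*), "small values get padded");
static_assert(effective_block_size(512) == 512, "normal embedding rows are unchanged");
```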
3 Benefit Analysis
Comparative advantages vs baseline & jemalloc:
vs Jemalloc:
- Locking: reuses SynchronizedShardedMap's existing locks through the lock-free per-table memory pool design
- Free Block Management
- std::pmr Advantages