Jemalloc Mempool and Adaptation for CPU HASHTABLE #4154
Conversation
@TroyGarden has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Do we plan to disable page swap in a future diff? I saw we are using std::pmr::new_delete_resource(); its allocations don't lock pages in memory.
Jemalloc Mempool and Adaptation for CPU HASHTABLE
1 Current Status
Code: https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/src/dram_kv_embedding_cache
In the current implementation, the value structure of the CPU hashtable uses std::vector, which directly requests memory from the system via std::allocator. Without memory pool management, frequent allocations incur significant system call overhead.
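As a rough stand-in for that pattern (placeholder types, not the actual dram_kv_embedding_cache code), every inserted row below triggers its own heap allocation through std::allocator:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Placeholder for a per-shard hashtable: each inserted row creates a
// std::vector<float>, whose buffer comes from std::allocator (i.e. operator
// new), so N inserted IDs imply N separate heap allocations.
int main() {
  std::unordered_map<int64_t, std::vector<float>> shard;
  constexpr int kDim = 128;  // example embedding dimension
  for (int64_t id = 0; id < 1000; ++id) {
    shard.emplace(id, std::vector<float>(kDim, 0.0f));  // one allocation per row
  }
}
```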
2 Proposed Solution
2.1 Overview
Given that the memory requirements for embedding allocations are known upfront, we can leverage this information to customize bin sizes more appropriately.
Our scenario (embedding hashtable) does not involve frequent memory deallocations (only during ID eviction) but requires frequent allocations. Freed memory can be directly reused for new allocations, minimizing fragmentation concerns.
Compared to jemalloc, our approach avoids complex Buddy/Slob algorithms, memory block merging, and multi-scale slot designs. The core idea is:
- Implement dedicated memory pool management for each shard of fixed emb_dim hashtables (including merged tables).
- Reuse the existing lock mechanisms from SynchronizedShardedMap for concurrency control.
- Create a "lock-free per-table memory pool" design that minimizes code invasiveness.
Design features:
- Leverages the existing SharedMutexWritePriority from SynchronizedShardedMap.
- Memory pool operations share the same critical section as hashtable insertions (see the sketch after this list).
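A minimal sketch of that lock-sharing idea, using std::shared_mutex as a stand-in for SharedMutexWritePriority and hypothetical Shard/insert_row names (the real SynchronizedShardedMap API may differ): the pool is only touched while the shard's write lock is already held for the hashtable insert, so it needs no synchronization of its own.

```cpp
#include <algorithm>
#include <cstdint>
#include <shared_mutex>
#include <unordered_map>

// Hypothetical per-shard state: the hashtable and its memory pool are guarded
// by the same mutex that SynchronizedShardedMap already provides.
struct Shard {
  std::shared_mutex mutex;                    // stand-in for SharedMutexWritePriority
  std::unordered_map<int64_t, float*> table;  // id -> block holding one embedding row
  // FixedBlockPool pool;                     // per-shard pool, sketched in section 2.2
};

// Insert path: a single write lock covers both the pool allocation and the
// hashtable update, so the pool itself needs no internal synchronization.
void insert_row(Shard& shard, int64_t id, const float* src, int dim) {
  std::unique_lock lock(shard.mutex);
  float* block = new float[dim];  // placeholder for shard.pool.allocate()
  std::copy(src, src + dim, block);
  shard.table[id] = block;
}
```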
2.2 FixedBlockPool Design
Three-Level Structure Model
Core Data Structures
Workflow
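The original diagrams for the structure model, data structures, and workflow are not reproduced here; the sketch below uses assumed member names (not the actual FBGEMM code) to capture the three-level idea, with the allocate/deallocate workflow summarized in comments.

```cpp
#include <cstddef>
#include <vector>

// Level 1: the pool, one per table shard, owning all chunks for that shard.
// Level 2: a chunk, a large contiguous slab obtained from the system.
// Level 3: a block, a fixed-size slice of a chunk holding one embedding row.
struct FixedBlockPool {
  std::size_t block_size;        // bytes per block, derived from emb_dim
  std::size_t blocks_per_chunk;  // how many blocks each chunk is carved into
  std::vector<char*> chunks;     // all chunks owned by this pool
  std::size_t bump_offset = 0;   // next unused byte in the newest chunk
  void* free_head = nullptr;     // intrusive free list of returned blocks
};

// Workflow (a working sketch is given in section 2.3):
//   allocate:   1) pop free_head if non-null (reuse an evicted row's block);
//               2) otherwise bump-allocate from the newest chunk;
//               3) if the chunk is exhausted, request a new chunk and retry.
//   deallocate: write the current free_head into the block's first
//               sizeof(void*) bytes and make the block the new free_head.
```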
Alignment
Ensure block addresses meet the alignment requirements: the alignment is a power of two, and the block size is a multiple of that alignment.
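A small illustration of the implied arithmetic (the 64-byte alignment and 128-dim row are example values only, not taken from the implementation):

```cpp
#include <cstddef>
#include <new>

// Round a size up to the next multiple of a power-of-two alignment.
constexpr std::size_t align_up(std::size_t n, std::size_t alignment) {
  return (n + alignment - 1) & ~(alignment - 1);
}

// Example: a 128-float embedding row with 64-byte (cache-line) alignment.
constexpr std::size_t kAlignment = 64;  // power of two
constexpr std::size_t kBlockSize = align_up(128 * sizeof(float), kAlignment);
static_assert(kBlockSize % kAlignment == 0, "block size is a multiple of the alignment");

// Requesting the chunk itself with the same alignment keeps every block
// address (chunk_base + i * kBlockSize) aligned.
inline void* allocate_chunk(std::size_t blocks_per_chunk) {
  return ::operator new(blocks_per_chunk * kBlockSize, std::align_val_t{kAlignment});
}
```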
2.3 Implementation Details
Chunk Handling
Custom memory_resource Class
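A simplified, self-contained sketch of what such a class could look like; the override set comes from std::pmr::memory_resource, while the members and their logic are assumptions rather than the actual implementation.

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

// Sketch of a fixed-block pool exposed through the standard std::pmr
// interface: fixed-size blocks, an intrusive free list, and chunks obtained
// from an upstream std::pmr resource.
class FixedBlockPoolResource final : public std::pmr::memory_resource {
 public:
  FixedBlockPoolResource(std::size_t block_size, std::size_t blocks_per_chunk)
      : block_size_(block_size), blocks_per_chunk_(blocks_per_chunk) {}

  ~FixedBlockPoolResource() override {
    for (void* chunk : chunks_) {
      upstream_->deallocate(chunk, blocks_per_chunk_ * block_size_);
    }
  }

 protected:
  // Reuse a freed block if possible, otherwise bump-allocate from the newest
  // chunk, fetching a new chunk from upstream when the current one is full.
  void* do_allocate(std::size_t /*bytes*/, std::size_t /*alignment*/) override {
    if (free_head_ != nullptr) {
      void* block = free_head_;
      free_head_ = *static_cast<void**>(block);
      return block;
    }
    if (chunks_.empty() || used_in_chunk_ == blocks_per_chunk_) {
      chunks_.push_back(upstream_->allocate(blocks_per_chunk_ * block_size_));
      used_in_chunk_ = 0;
    }
    return static_cast<char*>(chunks_.back()) + block_size_ * used_in_chunk_++;
  }

  // Thread the block onto the intrusive free list; the pointer lives in the
  // block itself, which is why block_size_ must be at least sizeof(void*).
  void do_deallocate(void* p, std::size_t, std::size_t) override {
    *static_cast<void**>(p) = free_head_;
    free_head_ = p;
  }

  bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
    return this == &other;  // each table shard owns exactly one resource
  }

 private:
  std::size_t block_size_;
  std::size_t blocks_per_chunk_;
  std::size_t used_in_chunk_ = 0;
  std::vector<void*> chunks_;
  void* free_head_ = nullptr;
  std::pmr::memory_resource* upstream_ = std::pmr::new_delete_resource();
};
```

Because the class derives from std::pmr::memory_resource, pmr-aware containers can be pointed at a per-shard pool directly, e.g. `std::pmr::vector<float> row{&resource};`.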
Special Case Handling
Allocations of ≤8 bytes (sizeof(void*)) require additional handling to prevent metadata from being overwritten, since a freed block must be large enough to hold the free list's in-place next pointer.
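One plausible handling (an assumption, not necessarily the actual code) is to pad the block size so a freed block can always hold the free-list pointer:

```cpp
#include <algorithm>
#include <cstddef>

// If the requested value size is smaller than a pointer, the in-block
// free-list "next" pointer would spill past the block's end, so the block is
// padded up to at least sizeof(void*).
constexpr std::size_t effective_block_size(std::size_t value_bytes) {
  return std::max(value_bytes, sizeof(void*));
}

static_assert(effective_block_size(4) == sizeof(void*), "small values get padded");
static_assert(effective_block_size(512) == 512, "normal embedding rows are unchanged");
```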
3 Benefit Analysis
Comparative advantages vs baseline & jemalloc:
vs Jemalloc:
- Locking: reuses SynchronizedShardedMap's existing locks through the lock-free per-table memory pool design
- Free Block Management
- std::pmr Advantages