
FBGEMM_GPU v0.8.0 Release Notes

Released by @spcyppt on 29 Jul 22:40

Release Notes

Highlights

Table Batched Embedding

For GPU

  • New Table Batched Embedding (TBE) operators and momentum type support
  • New In-Training Embedding Pruning (ITEP) operators
  • VBE support for Dense TBE
  • Global weight decay support in TBE
  • New type support and improvements to SSD TBE
  • Improvements and bug fixes for TBE training and inference modules and sparse operators

For MTIA

  • MTIA support for Dense TBE

Generative AI

  • GenAI ops integration
  • Support for Triton-based and CUTLASS-based operators (#2552, #2537)
  • New FP8 GEMM and quantization operators
  • New query attention operators
  • New Car and All-To-All (NCCL-based) communication operators
  • AMD Support for FP8

Others

  • New MX4 quantization operators
  • Support for CUDA 12.4

Better engineering

  • Code refactoring and reorganization for faster builds
  • New tests and benchmarks
  • Improved AMD support

Software Requirements

FBGEMM_GPU v0.8.0 has been tested and is known to work on the following setups:

  • PyTorch: v2.4
  • CUDA: v11.8, 12.1, 12.4
  • Python: v3.8, 3.9, 3.10, 3.11, 3.12

It is recommended to install and run FBGEMM_GPU in an isolated environment, such as a Conda environment or a Docker container.

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant (only the CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.8.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.8.0

Alternatively, it can be fetched from the PyTorch PIP index:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.8.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.8.0 --index-url https://download.pytorch.org/whl/cu121/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.8.0 --index-url https://download.pytorch.org/whl/cpu
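
After installation, importing fbgemm_gpu is enough to register its operators with PyTorch. A minimal sanity check (a sketch; the op shown is a long-standing FBGEMM sparse op, and this works with both the CPU and CUDA variants):

# Verify that the fbgemm ops are registered
import torch
import fbgemm_gpu  # noqa: F401  (importing registers the torch.ops.fbgemm.* operators)

lengths = torch.tensor([2, 0, 3], dtype=torch.int64)
offsets = torch.ops.fbgemm.asynchronous_complete_cumsum(lengths)
print(offsets)  # tensor([0, 2, 2, 5])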

Changes

Table batched embedding (TBE) operators

For GPU

  • [New] VBE support for Dense TBE (#2628, #2620, #2641)
  • [New] BF16 momentum support in PARTIAL_ROWWISE_ADAM (#2524, #2522, #2518)
  • [New] Global weight decay support (#2516, #2507, #2506)
  • [New] Multi-pass prefetch for memory efficiency (#2566)
  • [Improvement] Work around masked_select for numel > MAX_INT (#2648)
  • [Improvement] Support for fusing the optimizer into the backward pass with aot_autograd (#2651)
  • [Improvement] Declared weights mutations in TBE backward op schemas (#2698)
  • [Improvement] Helper ops to support cache conflict misses (#2571)
  • [Improvement] Fixed the hang issue in some TBE GPU optimizers (#2509)
  • [Improvement] Misc TBE fixes and refactoring (#2583, #2597, #2529)
  • [Improvement] Cache prefetch and conflict miss improvements (#2596, #2514)
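
The GPU items above all land in the TBE training module, SplitTableBatchedEmbeddingBagsCodegen. A minimal usage sketch, assuming a CUDA device; the table shapes, optimizer choice, and hyperparameters are illustrative:

import torch
from fbgemm_gpu.split_embedding_configs import EmbOptimType
from fbgemm_gpu.split_table_batched_embeddings_ops_common import EmbeddingLocation
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    ComputeDevice,
    SplitTableBatchedEmbeddingBagsCodegen,
)

# Two embedding tables batched into one module:
# each spec is (num_embeddings, embedding_dim, location, compute_device).
emb = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[
        (1000, 64, EmbeddingLocation.DEVICE, ComputeDevice.CUDA),
        (2000, 32, EmbeddingLocation.DEVICE, ComputeDevice.CUDA),
    ],
    optimizer=EmbOptimType.EXACT_ROWWISE_ADAGRAD,
    learning_rate=0.01,
)

# Inputs use a CSR-style layout: one flat indices tensor plus offsets marking
# bag boundaries, ordered as (table 0: batch..., table 1: batch...).
indices = torch.tensor([3, 7, 1, 9, 4], dtype=torch.int64, device="cuda")
offsets = torch.tensor([0, 2, 3, 4, 5], dtype=torch.int64, device="cuda")  # 2 tables x batch size 2

out = emb(indices=indices, offsets=offsets)  # shape: (2, 64 + 32)
out.sum().backward()  # the optimizer update is fused into the backward pass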

For MTIA

  • [New] Support MTIA in DenseTableBatchedEmbeddingBagsCodegen (#2680)
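
DenseTableBatchedEmbeddingBagsCodegen is the dense counterpart (weights in one ordinary parameter, no fused optimizer), which this release extends to MTIA devices. A device-agnostic sketch on CPU; the sizes are illustrative, and the use_cpu flag is an assumption about the constructor's device-placement path:

import torch
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    DenseTableBatchedEmbeddingBagsCodegen,
)

# Dense TBE specs are (num_embeddings, embedding_dim) pairs; gradients flow to
# a regular dense parameter, so any standard PyTorch optimizer can be used.
emb = DenseTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[(100, 16), (200, 16)],
    use_cpu=True,  # assumption for portability; CUDA/MTIA builds place the module on device instead
)

indices = torch.tensor([1, 2, 50, 150], dtype=torch.int64)
offsets = torch.tensor([0, 2, 3, 3, 4], dtype=torch.int64)  # 2 tables x batch size 2
out = emb(indices=indices, offsets=offsets)  # shape: (2, 16 + 16)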

SSD Table batched embedding (TBE) operators

  • [New] Added FP16 weight and output support to SSD TBE (#2638)
  • [New] Implementation of a parameter server (PS) KV DB for the FBGEMM TBE operator (#2664, #2642)
  • [Improvement] Removal of D->H sync when calling lxu_cache_lookup (#2672)
  • [Improvement] Recording of functions in SSD TBE (#2670)
  • [Improvement] Added options, assertions and logs for training and inference SSD TBE (#2689, #2657)
  • [Improvement] SSD TBE backend fixes (#2645, #2671)

New Operator Groups

GenAI Support and Operators

  • [New] Integrated GenAI ops into the build (#2512)
  • [New] Support for Triton-based operators (#2570, #2618)
  • [New] Support for CUTLASS-based operators (#2552, #2537)
  • [New] Car and All-To-All (NCCL-based) communication ops (#2606, #2667, #2631, #2624)
  • [New] Grouped query attention ops (#2673, #2504)
  • [New] CK BF16 GEMM (#2617)
  • [New] W4A8 GEMM kernels (#2558, #2607)
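
As an example of the FP8 path, the rowwise-scaled FP8 GEMM takes FP8-quantized operands plus per-row scales and returns BF16. A sketch, assuming an FP8-capable GPU (e.g., H100) and that the GenAI ops are loaded via the experimental gen_ai package; the hand-rolled quantizer below is illustrative only (FBGEMM also ships dedicated quantize kernels):

import torch
import fbgemm_gpu.experimental.gen_ai  # noqa: F401  (assumed entry point for the GenAI ops)

def fp8_rowwise_quantize(t: torch.Tensor):
    # Illustrative per-row quantization to FP8 e4m3 (max representable ~448).
    scale = t.abs().amax(dim=1, keepdim=True).float().clamp(min=1e-12) / 448.0
    tq = (t.float() / scale).to(torch.float8_e4m3fn)
    return tq, scale.squeeze(1)

x = torch.randn(16, 128, device="cuda", dtype=torch.bfloat16)
w = torch.randn(32, 128, device="cuda", dtype=torch.bfloat16)
xq, x_scale = fp8_rowwise_quantize(x)
wq, w_scale = fp8_rowwise_quantize(w)

y = torch.ops.fbgemm.f8f8bf16_rowwise(xq, wq, x_scale, w_scale)  # (16, 32), BF16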

Pooled Embeddings

  • [Improvement] Clean up unused pooled embedding ops (#2626)
  • [Improvement] PyTorch compatibility fixes (#2619, #2629)

Sparse Operators

  • [Improvement] Increased dynamic shared memory size to support larger bucket sizes (#2500)
  • [Improvement] UINT8 support for reorder sequence embedding operator (#2531)
  • [Improvement] Fixed CPU blocking D2H in JaggedIndexSelect2dOp backward (#2510)
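
The sparse ops in this family share the lengths-plus-flat-values layout used by TBE. For illustration, here is one long-standing op from the same group, permute_2D_sparse_data, which reorders features in that layout (a sketch; tensor contents are illustrative):

import torch
import fbgemm_gpu  # noqa: F401  (registers the torch.ops.fbgemm.* operators)

# lengths is (num_features, batch_size); values are flattened in the same order.
lengths = torch.tensor([[1, 2], [2, 1]], dtype=torch.int64)
indices = torch.tensor([10, 20, 21, 30, 31, 40], dtype=torch.int64)
permute = torch.tensor([1, 0], dtype=torch.int64)  # swap the two features

out_lengths, out_indices, _ = torch.ops.fbgemm.permute_2D_sparse_data(
    permute, lengths, indices, None, None
)
print(out_lengths)  # tensor([[2, 1], [1, 2]])
print(out_indices)  # tensor([30, 31, 40, 10, 20, 21])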

Benchmarks / Tests

  • [New] Unified benchmarks and unit tests for FP8 (#2609, #2699, #2666)
  • [Improvement] SSD TBE benchmarks (#2579, #2580)
  • [Improvement] SSD TBE tests (#2665, #2647)
  • [Improvement] Fixes for TBE tests and benchmarks (#2632)
  • [Improvement] Fixed bandwidth calculation in the nbit_cache benchmark (#2511)

Build / CI Improvements and Fixes

  • [New] Support for CUDA 12.4 (#2565)
  • [Improvement] Improved AMD support (#2541, #2679)
  • [Improvement] Strengthened artifact installation process (#2491)
  • [Improvement] Memcheck added across operators (#2576, #2574, #2572, #2612, #2594, #2589, #2578)
  • [Improvement] Refactoring of large header files (#2650)
  • [Improvement] Improved build scripts to support debug flags and custom (e.g., GenAI) variants (#2702)