
FBGEMM_GPU v0.8.0 Release Notes

Released by @spcyppt on 29 Jul 22:40

Release Notes

Highlights

Table Batched Embedding

For GPU

  • New Table Batched Embedding (TBE) operators and momentum type support
  • New In-Training Embedding Pruning (ITEP) operators
  • VBE support for Dense TBE
  • Global weight decay support in TBE
  • New type support and improvements to SSD TBE
  • Improvements and bug fixes for TBE training and inference modules and sparse operators

For MTIA

  • MTIA support for Dense TBE

Generative AI

  • GenAI ops integration
  • Support for Triton-based and CUTLASS-based operators (#2552, #2537)
  • New FP8 GEMM and quantization operators
  • New query attention operators
  • New Car and All-To-All (NCCL-based) communication operators
  • AMD Support for FP8

Others

  • New MX4 quantization operators
  • Support for CUDA 12.4

Better engineering

  • Code refactoring and reorganization for faster builds
  • New tests and benchmarks
  • Improved AMD support

Software Requirements

FBGEMM_GPU v0.8.0 has been tested and is known to work on the following setups:

  • PyTorch: v2.4
  • CUDA: v11.8, 12.1, 12.4
  • Python: v3.8, 3.9, 3.10, 3.11, 3.12

It is recommended to install and run FBGEMM_GPU in an isolated environment, such as a Conda environment or a Docker container.

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant (only the CUDA 12.1 variant is available)
pip install fbgemm-gpu==0.8.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==0.8.0

Alternatively, it can be fetched from the PyTorch PIP index:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==0.8.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==0.8.0 --index-url https://download.pytorch.org/whl/cu121/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==0.8.0 --index-url https://download.pytorch.org/whl/cpu
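
After installation, importing fbgemm_gpu is enough to register its operators with PyTorch. A minimal sanity check (a sketch; the op shown is a long-standing FBGEMM sparse op, and this works with both the CPU and CUDA variants):

# Verify that the fbgemm ops are registered
import torch
import fbgemm_gpu  # noqa: F401  (importing registers the torch.ops.fbgemm.* operators)

lengths = torch.tensor([2, 0, 3], dtype=torch.int64)
offsets = torch.ops.fbgemm.asynchronous_complete_cumsum(lengths)
print(offsets)  # tensor([0, 2, 2, 5])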

Changes

Table batched embedding (TBE) operators

For GPU

  • [New] VBE support for Dense TBE (#2628, #2620, #2641)
  • [New] BF16 momentum support in PARTIAL_ROWWISE_ADAM (#2524, #2522, #2518)
  • [New] Global weight decay support (#2516, #2507, #2506)
  • [New] Multi-pass prefetch for memory efficiency (#2566)
  • [Improvement] Work around masked_select for numel > MAX_INT (#2648)
  • [Improvement] Support for fusing the optimizer into the backward pass with aot_autograd (#2651)
  • [Improvement] Declared weights mutations in TBE backward op schemas (#2698)
  • [Improvement] Helper ops to support cache conflict misses (#2571)
  • [Improvement] Fixed the hang issue in some TBE GPU optimizers (#2509)
  • [Improvement] Misc TBE fixes and refactoring (#2583, #2597, #2529)
  • [Improvement] Cache prefetch and conflict miss improvements (#2596, #2514)
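
The GPU items above all land in the TBE training module, SplitTableBatchedEmbeddingBagsCodegen. A minimal usage sketch, assuming a CUDA device; the table shapes, optimizer choice, and hyperparameters are illustrative:

import torch
from fbgemm_gpu.split_embedding_configs import EmbOptimType
from fbgemm_gpu.split_table_batched_embeddings_ops_common import EmbeddingLocation
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    ComputeDevice,
    SplitTableBatchedEmbeddingBagsCodegen,
)

# Two embedding tables batched into one module:
# each spec is (num_embeddings, embedding_dim, location, compute_device).
emb = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[
        (1000, 64, EmbeddingLocation.DEVICE, ComputeDevice.CUDA),
        (2000, 32, EmbeddingLocation.DEVICE, ComputeDevice.CUDA),
    ],
    optimizer=EmbOptimType.EXACT_ROWWISE_ADAGRAD,
    learning_rate=0.01,
)

# Inputs use a CSR-style layout: one flat indices tensor plus offsets marking
# bag boundaries, ordered as (table 0: batch..., table 1: batch...).
indices = torch.tensor([3, 7, 1, 9, 4], dtype=torch.int64, device="cuda")
offsets = torch.tensor([0, 2, 3, 4, 5], dtype=torch.int64, device="cuda")  # 2 tables x batch size 2

out = emb(indices=indices, offsets=offsets)  # shape: (2, 64 + 32)
out.sum().backward()  # the optimizer update is fused into the backward pass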

For MTIA

  • [New] Support MTIA in DenseTableBatchedEmbeddingBagsCodegen (#2680)
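
DenseTableBatchedEmbeddingBagsCodegen is the dense counterpart (weights in one ordinary parameter, no fused optimizer), which this release extends to MTIA devices. A device-agnostic sketch on CPU; the sizes are illustrative, and the use_cpu flag is an assumption about the constructor's device-placement path:

import torch
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    DenseTableBatchedEmbeddingBagsCodegen,
)

# Dense TBE specs are (num_embeddings, embedding_dim) pairs; gradients flow to
# a regular dense parameter, so any standard PyTorch optimizer can be used.
emb = DenseTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[(100, 16), (200, 16)],
    use_cpu=True,  # assumption for portability; CUDA/MTIA builds place the module on device instead
)

indices = torch.tensor([1, 2, 50, 150], dtype=torch.int64)
offsets = torch.tensor([0, 2, 3, 3, 4], dtype=torch.int64)  # 2 tables x batch size 2
out = emb(indices=indices, offsets=offsets)  # shape: (2, 16 + 16)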

SSD Table batched embedding (TBE) operators

  • [New] Added FP16 weight and output support to SSD TBE (#2638)
  • [New] Implementation of a parameter server (PS) KV DB for the FBGEMM TBE operator (#2664, #2642)
  • [Improvement] Removal of D->H sync when calling lxu_cache_lookup (#2672)
  • [Improvement] Recording of functions in SSD TBE (#2670)
  • [Improvement] Added options, assertions and logs for training and inference SSD TBE (#2689, #2657)
  • [Improvement] SSD TBE backend fixes (#2645, #2671)

New Operator Groups

GenAI Support and Operators

  • [New] Integrated GenAI ops into the build (#2512)
  • [New] Support for Triton-based operators (#2570, #2618)
  • [New] Support for CUTLASS-based operators (#2552, #2537)
  • [New] Car and All-To-All (NCCL-based) communication ops (#2606, #2667, #2631, #2624)
  • [New] Grouped query attention ops (#2673, #2504)
  • [New] CK BF16 GEMM (#2617)
  • [New] W4A8 GEMM kernels (#2558, #2607)
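
As an example of the FP8 path, the rowwise-scaled FP8 GEMM takes FP8-quantized operands plus per-row scales and returns BF16. A sketch, assuming an FP8-capable GPU (e.g., H100) and that the GenAI ops are loaded via the experimental gen_ai package; the hand-rolled quantizer below is illustrative only (FBGEMM also ships dedicated quantize kernels):

import torch
import fbgemm_gpu.experimental.gen_ai  # noqa: F401  (assumed entry point for the GenAI ops)

def fp8_rowwise_quantize(t: torch.Tensor):
    # Illustrative per-row quantization to FP8 e4m3 (max representable ~448).
    scale = t.abs().amax(dim=1, keepdim=True).float().clamp(min=1e-12) / 448.0
    tq = (t.float() / scale).to(torch.float8_e4m3fn)
    return tq, scale.squeeze(1)

x = torch.randn(16, 128, device="cuda", dtype=torch.bfloat16)
w = torch.randn(32, 128, device="cuda", dtype=torch.bfloat16)
xq, x_scale = fp8_rowwise_quantize(x)
wq, w_scale = fp8_rowwise_quantize(w)

y = torch.ops.fbgemm.f8f8bf16_rowwise(xq, wq, x_scale, w_scale)  # (16, 32), BF16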

Pooled Embeddings

  • [Improvement] Clean up unused pooled embedding ops (#2626)
  • [Improvement] PyTorch compatibility fixes (#2619, #2629)

Sparse Operators

  • [Improvement] Increased dynamic shared memory size to support larger bucket sizes (#2500)
  • [Improvement] UINT8 support for reorder sequence embedding operator (#2531)
  • [Improvement] Fixed CPU blocking D2H in JaggedIndexSelect2dOp backward (#2510)
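
The sparse ops in this family share the lengths-plus-flat-values layout used by TBE. For illustration, here is one long-standing op from the same group, permute_2D_sparse_data, which reorders features in that layout (a sketch; tensor contents are illustrative):

import torch
import fbgemm_gpu  # noqa: F401  (registers the torch.ops.fbgemm.* operators)

# lengths is (num_features, batch_size); values are flattened in the same order.
lengths = torch.tensor([[1, 2], [2, 1]], dtype=torch.int64)
indices = torch.tensor([10, 20, 21, 30, 31, 40], dtype=torch.int64)
permute = torch.tensor([1, 0], dtype=torch.int64)  # swap the two features

out_lengths, out_indices, _ = torch.ops.fbgemm.permute_2D_sparse_data(
    permute, lengths, indices, None, None
)
print(out_lengths)  # tensor([[2, 1], [1, 2]])
print(out_indices)  # tensor([30, 31, 40, 10, 20, 21])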

Benchmarks / Tests

  • [New] Unified benchmarks and unit tests for FP8 (#2609, #2699, #2666)
  • [Improvement] SSD TBE benchmarks (#2579, #2580)
  • [Improvement] SSD TBE tests (#2665, #2647)
  • [Improvement] Fixes for TBE tests and benchmarks (#2632)
  • [Improvement] Fixed bandwidth calculation in the nbit_cache benchmark (#2511)

Build / CI Improvements and Fixes

  • [New] Support for CUDA 12.4 (#2565)
  • [Improvement] Improved AMD support (#2541, #2679)
  • [Improvement] Strengthened artifact installation process (#2491)
  • [Improvement] Memcheck added across operators (#2576, #2574, #2572, #2612, #2594, #2589, #2578)
  • [Improvement] Refactoring of large header files (#2650)
  • [Improvement] Improved build scripts to support debug flags and custom (e.g., GenAI) variants (#2702)