
FBGEMM_GPU v1.1.0 Release Notes

Released by @q10 on 29 Jan 20:54

Highlights

TBE GPU

  • Added support for int64_t table indices and offsets in TBE inference
  • Added support for int32_t indices in TBE training
  • Extended TBE support for larger embedding dimensions
  • Made the learning rate a tensor value (see the sketch after this list)
  • Improved index bounds checking
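
A minimal sketch of the training-side changes above: int32 indices fed to a TBE module, with the learning rate now carried as a tensor. This assumes the existing SplitTableBatchedEmbeddingBagsCodegen API and default optimizer; module paths and argument support may differ slightly across versions.

import torch
from fbgemm_gpu.split_table_batched_embeddings_ops_common import EmbeddingLocation
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    ComputeDevice,
    SplitTableBatchedEmbeddingBagsCodegen,
)

# One table: 10,000 rows of dimension 128, stored in GPU memory.
emb = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[(10_000, 128, EmbeddingLocation.DEVICE, ComputeDevice.CUDA)],
    learning_rate=0.01,  # held internally as a tensor value as of this release
)

# int32 indices and offsets are now accepted in training (previously int64 only).
indices = torch.tensor([1, 5, 7, 42], dtype=torch.int32, device="cuda")
offsets = torch.tensor([0, 2, 4], dtype=torch.int32, device="cuda")  # two bags
out = emb(indices=indices, offsets=offsets)
out.sum().backward()  # the fused optimizer applies the update in the backward pass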

TBE CPU

  • Improved ARM support with SVE implementations for matrix multiplication and float matrix transpose
  • Improved the EmbeddingSpMDMAutovec API
  • Migrated FP32 ops to OSS

TBE SSD

  • Enabled VBE in SSD-TBE
  • Added async initialization of RocksDB SSD tensors, with padding before writes to RocksDB
  • Improved index bounds and other constraint checking

Gen AI Ops

  • Added nccl_alltoall function
  • Custom allgather now supports multiple dtypes, with dtype checking to prevent silent failures

ROCm

  • Added CK FP8 Batched GEMM and Rowwise GEMM kernels, along with heuristic tuning
  • Fixed CK FP8 rowwise quantization for some GEMM shapes
  • Introduced HIP-specific optimizations to the TBE forward and backward passes

SLL ops

  • Migrated Sequence Learning Library (SLL) ops to OSS

Better Engineering

  • Restructured the build to produce multiple smaller shared libraries instead of a single large binary
  • New and improved tests and benchmarks
  • Improved ROCm build variant support
  • Added build support for CUDA 12.6 and Python 3.13

Software Requirements

FBGEMM_GPU v1.1.0 has been tested and is known to work on the following setups:

  • PyTorch: v2.6
  • CUDA: v11.8, 12.4, 12.6
  • Python: v3.9, 3.10, 3.11, 3.12, 3.13

It is recommended to install and run FBGEMM_GPU in an isolated environment, such as a Conda environment and/or a Docker container.

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant (only the CUDA 12.4 variant is available)
pip install fbgemm-gpu==1.1.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==1.1.0

Alternatively, it can be fetched from the PyTorch PIP registry:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cu124/
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cu126/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==1.1.0 --index-url https://download.pytorch.org/whl/cpu

Changes

Table Batched Embedding (TBE) operators

For GPU

  • [New] Add support for int32_t indices in TBE training (#3377, #3375, #3374, #3372, #3371, #3324, #3267, #3264, #3263 #3257)
  • [New] Add support for int64_t indices and offsets in TBE inference (#3254, #3233)
  • [New] Extend TBE support for larger embedding dimensions (#3462, #3467)
  • [New] Make the learning rate a tensor (Backend) (#3287, #3310, #3332)
  • [New] Add PTA checks to embedding_bounds_check kernels (#3318)
  • [Fix] Fix PackedTensorAccessor for batch_index_select (#3281)
  • [Fix] Set cache_precision = weights_precision in TBE if it is not explicitly set (#3370)
  • [Fix] Fix pt2_wrapper registration for unified TBE interface (#3238)
  • [Fix] Fix PT2 compliant opcheck tests (#3404)
  • [Fix] Fix FBGEMM_GPU_MEMCHECK in Split optimizers (#3416)
  • [Fix] Fix learning rate as tensor for PT2 compile (#3407)
  • [New] Add new optimizer state row_counter for Adam [Frontend] (#3558)
  • [New] Add new optimizer state row_counter for Adam [Backend] (#3342)
  • [Fix] Back out "Add support for int64_t indices and offsets in TBE inference [7C/N]" (#3258)
  • [Fix] Back out "Add support for int64_t indices and offsets in TBE inference [8/N]" (#3255)
  • [Fix] Fix global weight decay Faketensor test (#3341)
  • [Fix] Fix pt2_wrapper registration for unified TBE interface (#3237)
  • [Fix] Fix ""Cannot call numel() on tensor with symbolic sizes/strides"" (#3368)
  • [Fix] Fix grid size overflow in generate_vbe_metadata (#3484)
  • [Fix] Fix an integer overflow in permute_multi_embedding() (#3465)
  • [Fix] Fix the sync point caused by iter_cpu.item() (#3401)
  • [Fix] Hot fix to skip VBE CPU reshaping for MTIA (#3466)
  • [Improvement] Add iter singular value into TBE optimizer state (#3228)
  • [Improvement] V2 fwd modified warps (#3570)
  • [Improvement] Add enable_async_update into TBE signature and config (#3431, #3461)
  • [Improvement] Adjust kNumThreads for bounds_check_indices_kernel (#3299)
  • [Improvement] Reduce registers in bounds_check_indices (#3298)
  • [Improvement] Mark unified autograd function traceable (#3378)
  • [Improvement] Improve bounds_check_indices for VBE (#3388, #3386)
  • [Improvement] Do not call scalar_type (#3394)
  • [Improvement] Optimizer 1d -- EMA in place (fbgemm part) (#3402)
  • [Improvement] Clean up nbit_forward tests (#3286)
  • [Improvement] Remove unused-variable in some generated code (#3327)
  • [Improvement] Limit grid size of bounds_check_indices (#3282)
  • [Improvement] Support config based bound check version via extended modes (#3418, #3454)
  • [Improvement] Use int64_t index for SplitOptimizer grad (#3447)
  • [Improvement] Remove unused arg from generate_vbe_metadata frontend (#3453)
  • [Improvement] Add generate_vbe_metadata test (#3483)
  • [Improvement] Make iter PT2 compatible (#3253)
  • [Improvement] Add meta function for PT2 wrappers (#3240)
  • [Improvement] Nesterov (#3232)

For CPU

  • [New] Introduce SVE function for matrix multiplication (#3348)
  • [New] Add SVE implementation for float matrix transpose (#3421)
  • [New] Autovec specialization framework (#3393)
  • [New] Move FP32 kernels to OSS (#3568)
  • [Improvement] Pull in PR for Kleidi-based FP16 kernel (#3507)
  • [Improvement] Use local buffer where possible (#3304)
  • [Improvement] Refactor GenerateEmbeddingXXX functions (#3307)
  • [Improvement] Increase local_storage size to 512 floats (#3357)
  • [Improvement] Adjust EmbeddingSpMDMAutovec API (#3366)
  • [Improvement] Split loops to work around loop vectorizer weakness (#3406)
  • [Improvement] Do an early check that data_size is not negative (#3305)
  • [Improvement] Fix strict aliasing violation, code cleanup (#3306)

SSD TBE Operators

  • [New] Enable VBE in SSD-TBE (#3247)
  • [Improvement] Put KVTensorWrapper in its own header (#3575)
  • [Improvement] Move KVTensorWrapper to a header file to be used in ModelStore checkpointing code (#3276)
  • [Improvement] Async initialization of RocksDB SSD tensors (#3520)
  • [Improvement] Pad before writing to RocksDB (#3245)
  • [Improvement] Use RocksDB iterator to read key range from SSD embedding (#3495)
  • [Improvement] Log total duration spent prefetching (#3487)
  • [Improvement] Address memory overuse during flushing (#3460)
  • [Improvement] Create methods to move TBE to the right device and set cache load in the TBE class (#3438)
  • [Improvement] Unit test for the new move-TBE-to-device / cache-load methods (#3437)
  • [Improvement] Make L2/RocksDB update optionally async (#3429)
  • [Improvement] Drop RoPE when filling KV cache (#3346)
  • [Improvement] Remove setting total_cache_hash_size as buffer (#3441)
  • [Improvement] Add meta registrations for kv_cache operators (#3442)
  • [Improvement] Remove output dtype restriction in SSD TBE (#3524)
  • [Improvement] Change PMT requires_grad to false when detached (#3525)
  • [Improvement] Add more attributes to PartiallyMaterializedTensor (#3300)
  • [Improvement] Skip broken inference test that uses SSD TBE (#3494)
  • [Improvement] coro => fut (#3430)
  • [Improvement] Reland D65489998: Optimize sharding performance of embeddings (#3549)
  • [Improvement] Remove torch.jit.script (#3562)

GenAI Support and Operators

  • [New] Add nccl_alltoall function (#3551)
  • [New] Custom allgather supports multiple dtypes (#3498)
  • [Improvement] Make sure fake tensor functions return on proper device (#3258)
  • [Improvement] Add CPU registrations to custom operators (#3262)
  • [Improvement] Check src & dst dtypes in allgather to prevent silent failures (#3523)
  • [Improvement] Better shape function registration (#3237, #3340)
  • [Improvement] Package re-organization improvements (#3546, #3251, #3419, #3268, #3512)

FP8 and other Quantization support

  • [New] New autotune config for M=4 (#3277)
  • [New] MoE FP8 grouped GEMM (#3321)
  • [New] Add shape check on GroupedGEMM kernel (#3449)
  • [New] Tuning for FP8 GEMM with emu1.7 shapes (#3436)
  • [Improvement] More FP8 tuning for decode shapes, without needing to pad (#3576)
  • [Improvement] FP8 rowwise GEMM tuning for LLM decode shapes (#3565)
  • [Improvement] Split FP8 Grouped Gemm into dynamic and static versions (#3543)
  • [Improvement] Warp-specialized FP8 rowwise GEMM kernel (#3532; see the usage sketch after this list)
  • [Improvement] Add Cutlass FP8 Grouped Gemm to Quantize Bench (#3530)
  • [Improvement] Fixed FBGEMM fp8 rowwise for irregular shapes (#3491)
  • [Improvement] Properly define preallocated output as mutable in fp8 rowwise gemm (#3476)
  • [Improvement] Fix FP8 Rowwise Gemm Compilation with Auto-functionalize V2 (#3457)
  • [Improvement] Support zero-size inputs in FP8 cuda quantize kernel (#3448)
  • [Improvement] Update FP8 GEMM tuning for emu1.7 7B shapes (#3391)
  • [Improvement] Customize FP8 grouped GEMM for non-zero calculation for token choice MoE (#3383)
  • [Improvement] Support FP8 grouped GEMM with cudagraph (#3373)
  • [Improvement] Refactor FP8 grouped GEMM to prepare cudagraph support (#3369)
  • [Improvement] Improve FP8 BMM heuristic for large shapes and MoE E2E performance (#3344)
  • [Improvement] Retune some of the EMU1.6 7B FP8 GEMM shapes (#3328)
  • [Improvement] Make FP8 BMM output contiguous (#3270)
  • [Improvement] Tune FP8 rowwise BMM tile heuristic (#3256)
  • [Improvement] More FP8 GEMM tuning for LDM shapes (#3414)
  • [Improvement] Split up f8f8bf16_rowwise_batched.cu (#3381)
  • [Improvement] Use SymInt in quantize.cpp for f8f8bf16_rowwise_meta (#3410)
  • [Improvement] Remove triton.ops dependency from fbgemm (#3329)
  • [Improvement] Improve performance of prefill mode FP8 Grouped Gemm (#3522)
  • [Improvement] Support quantize_fp8_row for up to 4D non-contiguous tensors (#3508)
  • [Improvement] Back out "support quantize_fp8_row for up to 4d non contiguous tensor" (#3505)
  • [Improvement] Make the scale match the shape of quantized value with N-D tensors (#3396)
  • [Improvement] Fix out-of-bound load in row scaling (#3527)
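
To make the rowwise items above concrete, here is a hedged sketch of calling the FP8 rowwise GEMM from Python. The op name torch.ops.fbgemm.f8f8bf16_rowwise matches the kernel sources referenced in these notes (f8f8bf16_rowwise*.cu); the scale computation below is a hand-rolled stand-in rather than the library's quantize_fp8_row, and exact signatures may vary by version.

import torch
import fbgemm_gpu.experimental.gen_ai  # registers the GenAI quantize ops (assumed module path)

def rowwise_fp8_quantize(t: torch.Tensor):
    # One scale per row so each row fills the fp8 e4m3 representable range.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = t.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / fp8_max
    return (t / scale).to(torch.float8_e4m3fn), scale.squeeze(-1).float()

x = torch.randn(16, 512, device="cuda", dtype=torch.bfloat16)
w = torch.randn(1024, 512, device="cuda", dtype=torch.bfloat16)  # output dim 1024
xq, x_scale = rowwise_fp8_quantize(x)
wq, w_scale = rowwise_fp8_quantize(w)
y = torch.ops.fbgemm.f8f8bf16_rowwise(xq, wq, x_scale, w_scale)  # BF16 output, shape (16, 1024)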

ROCm

  • [New] More CK FP8 rowwise GEMM instances and tuning (#3455)
  • [New] Setup for CK FP8 batched GEMM (#3322)
  • [New] CK FP8 Batched Gemm Heuristic Tuning (#3336)
  • [New] CK FP8 Grouped Gemm Support (#3316)
  • [New] Enable v2 forward test for ROCm (#3573)
  • [New] Add fused_moe kernel to ck_extension (#3518)
  • [Improvement] Implement Vec2 load/store for ROCm devices (#3413, #3475)
  • [Improvement] Manual loop unroll for ROCm inference (#3439, #3405)
  • [Improvement] Optimized backward pass for ROCm devices (#3367, #3511, #3488)
  • [Improvement] Add manual loop unroll for ROCm devices in fwd pass (#3309, #3345)
  • [Improvement] Debug v2 kernel for ROCm (#3266)
  • [Improvement] FP8 Rowwise compile fix followup for AMD (#3478)
  • [Improvement] Use output zero-fill in grouped GEMM kernel setup (#3537)
  • [Improvement] Remove the duplicated ROCm version print, as it is already printed by PyTorch (#3330)
  • [Improvement] Small cleanup of CK kernels (#3278)
  • [Improvement] Cherry-pick CK PR #1636 for fp8 GEMM rowwise for 70B Prefill (#3517)
  • [Improvement] Heuristic Tuning for CK FP8 Grouped Gemm (#3356)
  • [Improvement] Temporarily disable nbit_forward_test on OSS ROCm clang (#3445)
  • [Fix] Fix CK FP8 rowwise quantization for some GEMM shapes (#3486)

SLL

  • [New] Migrated Sequence Learning Library (SLL) ops to OSS

Sparse Operators

Sparse Ops

  • [Improvement] Register fake tensor impl for fbgemm::all_to_one_device (#3320)
  • [Improvement] Code cleanups to sparse bucketize and sparse block bucketize kernels (#3296, #3295, #3302)
  • [Improvement] Update impl_abstract in sparse ops (#3311)
  • [Improvement] Cleanup stray testing line (#3353)
  • [Improvement] Print the node infos when CUDA p2p init fails (#3390)
  • [Improvement] Add large my_size support in _block_bucketize_pooled_sparse_features_cuda_kernel2 (#3294)
  • [Improvement] Kernel support for multiple buckets per rank (#3323)
  • [Improvement] Add CPU group_index_select fwd and bwd impl (#3273; see the sketch after this list)
  • [Improvement] Skip check_all_same_device if only CPU and meta tensors appear (#3241)
  • [Improvement] Create pack_segments_v2 with additional pad_minf and presence_mask functionality (#3427)
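
A hedged sketch of the grouped index-select mentioned above: group_index_select_dim0 batches several index_select(dim=0) calls into one op, and this release adds a CPU implementation. The op name and signature (a list of inputs plus a list of index tensors) are assumed from common usage; check your installed version.

import torch
import fbgemm_gpu  # registers the fbgemm ops with torch

inputs = [torch.randn(10, 4), torch.randn(10, 4)]
indices = [torch.tensor([0, 2, 4]), torch.tensor([1, 3, 5])]
# Selects rows from each input tensor in a single grouped call.
outs = torch.ops.fbgemm.group_index_select_dim0(inputs, indices)
assert outs[0].shape == (3, 4) and outs[1].shape == (3, 4)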

Quantization Operators

Quantize Ops

  • [Improvement] Add meta dispatch for FusedNBitRowwiseQuantizedSBHalfToFloatOrHalf (#3248, #3231)
  • [Improvement] Add torch checks for QuantizedCommCodec (#3260, #3389)
  • [Fix] Fix index overflow for superlarge inputs (#3519)

MX4 Ops

  • [Improvement] MX4 group size configuration for Pyper (#3516; see the sketch after this list)
  • [Fix] Various illegal memory access fixes (#3229, #3509, #3349)
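
As a sketch of the group-size knob, the snippet below round-trips a tensor through MX4. The helpers fp32_to_mx4 and mx4_to_fp32 are assumed to live in fbgemm_gpu.quantize_utils; verify the names and signatures against your installed version.

import torch
from fbgemm_gpu.quantize_utils import fp32_to_mx4, mx4_to_fp32  # assumed helper names

x = torch.randn(4, 256, device="cuda")
# group_size sets how many consecutive elements share one MX4 scale;
# the flattened element count should be divisible by it.
packed = fp32_to_mx4(x.flatten(), group_size=32)
restored = mx4_to_fp32(packed, group_size=32).reshape(x.shape)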

Better Engineering

Benchmarks and tests

  • [New] Add a benchmark for VBE (#3464)
  • [New] Add Machete to fbgemm quantize bench (#3259)
  • [Improvement] Improve bounds check indices benchmark (#3283)
  • [Improvement] Add trace for nbit_device (#3292)
  • [Improvement] Use cudagraph for autotune (#3291)
  • [Improvement] Improve benchmark accuracy with warmups and kineto profiling (#3585, #3580)
  • [Fix] Fix test error (#3480)
  • [Fix] Disable SLL test in OSS (#3545)

Build / CI improvements

Misc Cleanups