Major Features and Improvements
Train/Eval/Predict/Export
- Enhance HSTU export in #443
- Support unified one-stage AOTI export with torch.export compatibility fixes in #475
- Support generic `--additional_export_config` JSON for export in #481
- Reduce AOTI compile memory usage by releasing verify-forward activations before compile in #491
Model
- DlrmHSTU:
- ULTRA-HSTU:
- Add label smoothing support to BinaryCrossEntropy loss in #455
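To illustrate what label smoothing does to a binary cross-entropy loss, here is a minimal plain-Python sketch. It assumes the common formulation that pulls hard 0/1 targets toward 0.5; the function names and the `smoothing` parameter are hypothetical and are not taken from TorchEasyRec, whose implementation in #455 may differ.

```python
import math


def smooth_binary_labels(labels, smoothing):
    """Pull hard 0/1 targets toward 0.5 (assumed formulation; 0 disables it)."""
    if not 0.0 <= smoothing < 1.0:
        raise ValueError("smoothing must be in [0, 1)")
    return [y * (1.0 - smoothing) + 0.5 * smoothing for y in labels]


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over probabilities and (possibly soft) targets."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(probs)


probs = [0.99, 0.01]   # very confident, correct predictions
hard = [1.0, 0.0]
soft = smooth_binary_labels(hard, smoothing=0.1)

hard_loss = binary_cross_entropy(probs, hard)
soft_loss = binary_cross_entropy(probs, soft)
```

With smoothed targets, over-confident predictions are penalized more than with hard targets, which is the regularizing effect label smoothing is generally used for.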
Embedding
- Update DynamicEmbedding to use `align_to_table_size` in #460
- Integrate DynamicEmbedding table fusion in #466
Feature
Dataset
- Add `start.timestamp.ms` support to KafkaDataset in #446
- Add heartbeat thread to prevent Kafka MAX_POLL_EXCEEDED in #471
Optimizer
- Add CosineAnnealingLR and CosineAnnealingWarmRestartsLR schedules in #454
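For readers unfamiliar with these schedules, the sketch below computes the standard closed-form cosine annealing curve and a fixed-period warm-restart variant in plain Python. The function and parameter names (`base_lr`, `t_max`, `t_0`, `eta_min`) are illustrative assumptions and do not reflect the TorchEasyRec config fields added in #454.

```python
import math


def cosine_annealing_lr(step, base_lr, t_max, eta_min=0.0):
    """Closed-form cosine annealing: base_lr at step 0, eta_min at step t_max."""
    cos_term = 1.0 + math.cos(math.pi * step / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * cos_term


def cosine_warm_restarts_lr(step, base_lr, t_0, eta_min=0.0):
    """Warm restarts with a fixed period t_0: the cosine cycle restarts every t_0 steps."""
    return cosine_annealing_lr(step % t_0, base_lr, t_0, eta_min)


# Sample the two schedules at a few steps.
plain = [cosine_annealing_lr(s, base_lr=0.1, t_max=100) for s in (0, 50, 100)]
restarts = [cosine_warm_restarts_lr(s, base_lr=0.1, t_0=100) for s in (0, 50, 100)]
```

The only difference between the two is at the period boundary: the plain schedule reaches `eta_min` at step `t_max`, while the warm-restart variant jumps back to `base_lr` and begins a new cycle.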
Upgrade
- Upgrade PyTorch to v2.11, TorchRec to v1.6.0, and FBGEMM to v1.6.0 in #479
Note
For TorchEasyRec 1.2.x, you should use Docker image version 1.2.
- For the GPU version (CUDA 12.9) with tensorrt: `mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/tzrec-devel:1.2-cu129`
  - PyTorch: v2.11, CUDA: v12.9, FBGEMM: v1.6.0, TorchRec: v1.6.0, Python: v3.11
  - Supported GPUs: `sm_75 / 80 / 86 / 90 / 100 / 120`. It supports Turing (T4), Ampere/Ada (A10/A30/A100/L4/L20), Hopper (H100/H200/H20), Blackwell (B100/B200), and other GPUs with CC 7.5-12.0.
- For the GPU version (CUDA 12.6) with tensorrt: `mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/tzrec-devel:1.2-cu126`
  - PyTorch: v2.11, CUDA: v12.6, FBGEMM: v1.6.0, TorchRec: v1.6.0, Python: v3.11
  - Supported GPUs: `sm_70 / 75 / 80 / 86 / 90`. It supports Volta (V100), Turing (T4), Ampere/Ada (A10/A30/A100/L4/L20), Hopper (H100/H20), and other GPUs with CC 7.0-9.0. It does not support Blackwell GPUs.
- For the CPU version: `mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/tzrec-devel:1.2-cpu`
  - PyTorch: v2.11, FBGEMM: v1.6.0, TorchRec: v1.6.0, Python: v3.11
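
The image choice above can be expressed as a small lookup. This is only a sketch: the helper `pick_image` is hypothetical, the compute-capability ranges are copied from the note, and preferring the CUDA 12.9 image when both GPU images apply is an assumption, not a project recommendation.

```python
IMAGE_REPO = "mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/tzrec-devel"

# (tag, lowest supported CC, highest supported CC), as listed in the note.
# Order encodes an assumed preference for the newer CUDA image.
GPU_IMAGES = [
    ("1.2-cu129", (7, 5), (12, 0)),
    ("1.2-cu126", (7, 0), (9, 0)),
]


def pick_image(compute_capability=None):
    """Return an image reference for a (major, minor) compute capability.

    Passing None selects the CPU image.
    """
    if compute_capability is None:
        return f"{IMAGE_REPO}:1.2-cpu"
    for tag, low, high in GPU_IMAGES:
        if low <= compute_capability <= high:
            return f"{IMAGE_REPO}:{tag}"
    raise ValueError(f"no 1.2 image lists compute capability {compute_capability}")


v100_image = pick_image((7, 0))    # Volta is only covered by the CUDA 12.6 image
b200_image = pick_image((10, 0))   # Blackwell is only covered by the CUDA 12.9 image
cpu_image = pick_image()
```

On a machine with PyTorch installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that could be passed to this helper.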
Bug Fixes and Other Changes
- [bugfix] fix readthedocs build failure by @tiankongdeguiji in #439
- [bugfix] fix list-to-integer comparison in embedding sequence encoder validation by @tiankongdeguiji in #429
- [bugfix] remove redundant .data access in pe_mtl_loss by @tiankongdeguiji in #430
- [bugfix] clarify sample_weight fallback value in match and rank models by @tiankongdeguiji in #435
- [feat] bump up pyfg to 1.0.2 by @tiankongdeguiji in #427
- [bugfix] fix data_config mutation during model export by @tiankongdeguiji in #441
- [feat] replace claude-code-action with direct claude -p for code review by @tiankongdeguiji in #444
- [bugfix] rename loop variable to avoid shadowing builtin input() by @tiankongdeguiji in #438
- [bugfix] replace deprecated torch.autograd.Variable in optimizer test by @tiankongdeguiji in #437
- [bugfix] fix unclosed file handle in benchmark by @tiankongdeguiji in #431
- [bugfix] fix potential socket resource leak in get_free_port by @tiankongdeguiji in #432
- [bugfix] strengthen doc reviewer to cross-reference existing user-facing docs by @tiankongdeguiji in #449
- [bugfix] fix contextual_seq_len not passed from preprocessor to STULayer by @tiankongdeguiji in #450
- filter non grad when adding to summaries by @eric-gecheng in #448
- [bugfix] fix flaky TRT test by adding allow_tf32 to predict() by @tiankongdeguiji in #456
- [bugfix] suppress false-positive range validation warnings for dynamicemb features by @tiankongdeguiji in #458
- [bugfix] fix sequence feature default_value inconsistency by @tiankongdeguiji in #461
- [docs] add FAQ for Triton v3.6.0 WGMMA crash on Hopper GPUs by @tiankongdeguiji in #452
- [bugfix] ensure predict threads are joined on exception by @tiankongdeguiji in #433
- [bugfix] fix ZCH finetune from checkpoint with different world size by @tiankongdeguiji in #467
- [bugfix] accept ChunkedArray in Parquet/Odps/Csv writers and ensure TDM writer close by @tiankongdeguiji in #469
- [bugfix] fix NameError on `sampled` when TDMSampler is combined with sample_mask by @tiankongdeguiji in #468
- [doc] fix dynamicemb pip install command by @tiankongdeguiji in #473
- [bugfix] fix fbgemm int32 overflow during embedding quantization by @tiankongdeguiji in #472
- [chore] bump pyfg to 1.0.4 by @tiankongdeguiji in #482
- [bugfix] fix two-stage AOTI predict hang under multi-thread workers by @tiankongdeguiji in #484
- [bugfix] share Dim across grouped-sequence tensors in legacy AOT export by @tiankongdeguiji in #485
- [bugfix] add CombineFeature to SINGLE_INPUT_FEATURE_CLASSES by @tiankongdeguiji in #487
- [bump] pyfg 1.0.4 -> 1.0.5; doc updates and TokenizeFeature fix by @tiankongdeguiji in #489
Full Changelog: v1.1.0...v1.2.0