Release v1.2.0 · alibaba/TorchEasyRec

Major Features and Improvements

Train/Eval/Predict/Export

Enhance HSTU export in #443
Support unified one-stage AOTI export with torch.export compatibility fixes in #475
Support generic --additional_export_config JSON for export in #481
Reduce AOTI compile memory usage by releasing verify-forward activations before compile in #491

Model

DlrmHSTU:
- Add CUTLASS kernel backend for HSTU attention in #465
- Add concat_contextual_features option in #459
- Support scaling_seqlen in HSTU attention stack in #480
- Support per-task loss weight in FusionSubTaskConfig in #453
ULTRA-HSTU:
- Add Semi-Local Attention and selective activation rematerialization in #486
- Add mid-stack attention truncation in #488
- Add Mixture of Transducers in #492
Add label smoothing support to BinaryCrossEntropy loss in #455
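
As a rough illustration of what label smoothing does to a binary cross-entropy target, here is a minimal pure-Python sketch. It assumes the common formulation `y_smooth = y * (1 - eps) + 0.5 * eps`; the release notes do not state which formulation TorchEasyRec uses, and the function names `smooth_binary_label` and `bce` are hypothetical, not part of the library.

```python
import math


def smooth_binary_label(y: float, eps: float) -> float:
    """Pull a hard 0/1 label toward 0.5 by eps (assumed formulation)."""
    return y * (1.0 - eps) + 0.5 * eps


def bce(p: float, y: float) -> float:
    """Binary cross-entropy for one predicted probability p and target y."""
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


# With eps = 0.1, a positive label becomes 0.95 and a negative label 0.05,
# so an over-confident correct prediction is still penalized slightly.
hard_loss = bce(0.999, 1.0)
soft_loss = bce(0.999, smooth_binary_label(1.0, 0.1))
print(hard_loss < soft_loss)
```

The smoothed target keeps the loss from rewarding predictions that saturate at 0 or 1, which is the usual motivation for the technique.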

Embedding

Update DynamicEmbedding to use align_to_table_size in #460
Integrate DynamicEmbedding table fusion in #466

Feature

Add CombineFeature support in #447
Support TokenizeFeature as token-level sequence input in #470

Dataset

Add start.timestamp.ms support to KafkaDataset in #446
Add heartbeat thread to prevent Kafka MAX_POLL_EXCEEDED in #471

Optimizer

Add CosineAnnealingLR and CosineAnnealingWarmRestartsLR schedules in #454
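
For readers unfamiliar with these schedules, the sketch below computes the textbook closed-form cosine annealing curve and its warm-restart variant in pure Python. It is an illustration of the general technique only: the function and parameter names (`cosine_annealing_lr`, `cosine_warm_restarts_lr`, `t_max`, `t_0`, `t_mult`, `eta_min`) are assumptions chosen for this example, and the actual TorchEasyRec configuration fields are not shown in these notes.

```python
import math


def cosine_annealing_lr(step, base_lr, t_max, eta_min=0.0):
    """Learning rate after `step` steps of a single cosine decay over t_max steps."""
    cos_term = 1.0 + math.cos(math.pi * step / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * cos_term


def cosine_warm_restarts_lr(step, base_lr, t_0, t_mult=1, eta_min=0.0):
    """Cosine decay that restarts at base_lr; each cycle is t_mult times longer."""
    t_cur, t_i = step, t_0
    # Walk forward through completed cycles to find the position in the current one.
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    cos_term = 1.0 + math.cos(math.pi * t_cur / t_i)
    return eta_min + 0.5 * (base_lr - eta_min) * cos_term


# Print a few points of each schedule for comparison.
for s in (0, 5, 10, 15, 20):
    print(s, cosine_annealing_lr(s, 0.1, 20), cosine_warm_restarts_lr(s, 0.1, 10))
```

The plain schedule decays once from `base_lr` to `eta_min`, while the warm-restart variant jumps back to `base_lr` at the start of every cycle.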

Upgrade

Upgrade PyTorch to v2.11, TorchRec to v1.6.0, and FBGEMM to v1.6.0 in #479

Note

Use Docker image version 1.2 with TorchEasyRec 1.2.x.

For the GPU version (CUDA 12.9) with TensorRT:
- mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/tzrec-devel:1.2-cu129
- PyTorch: v2.11, CUDA: v12.9, FBGEMM: v1.6.0, TorchRec: v1.6.0, Python: v3.11
- Supported GPU architectures: sm_75 / 80 / 86 / 90 / 100 / 120. This covers Turing (T4), Ampere/Ada (A10/A30/A100/L4/L20), Hopper (H100/H200/H20), Blackwell (B100/B200), and other GPUs with compute capability 7.5-12.0.
For the GPU version (CUDA 12.6) with TensorRT:
- mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/tzrec-devel:1.2-cu126
- PyTorch: v2.11, CUDA: v12.6, FBGEMM: v1.6.0, TorchRec: v1.6.0, Python: v3.11
- Supported GPU architectures: sm_70 / 75 / 80 / 86 / 90. This covers Volta (V100), Turing (T4), Ampere/Ada (A10/A30/A100/L4/L20), Hopper (H100/H20), and other GPUs with compute capability 7.0-9.0. Blackwell GPUs are not supported.
For the CPU version:
- mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/tzrec-devel:1.2-cpu
- PyTorch: v2.11, FBGEMM: v1.6.0, TorchRec: v1.6.0, Python: v3.11
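
The compute capability ranges above can be turned into a small lookup for picking an image. The sketch below is a convenience illustration, not an official tool: the helper `compatible_images` and the table `IMAGE_CC_RANGES` are hypothetical names, and the ranges are copied from the notes above. On a machine with PyTorch and a GPU, a `(major, minor)` tuple of this shape can be obtained from `torch.cuda.get_device_capability()`.

```python
REGISTRY = "mybigpai-public-registry.cn-beijing.cr.aliyuncs.com/easyrec/tzrec-devel"

# Inclusive (min, max) compute capability per image tag, taken from the notes above.
IMAGE_CC_RANGES = {
    "1.2-cu129": ((7, 5), (12, 0)),
    "1.2-cu126": ((7, 0), (9, 0)),
}


def compatible_images(cc):
    """Return image names usable for a (major, minor) capability, or the CPU image for None."""
    if cc is None:
        return [f"{REGISTRY}:1.2-cpu"]
    return [
        f"{REGISTRY}:{tag}"
        for tag, (low, high) in IMAGE_CC_RANGES.items()
        if low <= cc <= high
    ]


print(compatible_images((7, 0)))   # Volta: only the CUDA 12.6 image qualifies
print(compatible_images((10, 0)))  # Blackwell: only the CUDA 12.9 image qualifies
```

GPUs in the overlap (compute capability 7.5-9.0) match both GPU images, so the choice there depends on the CUDA version preferred for the host driver.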

Bug Fixes and Other Changes

[bugfix] fix readthedocs build failure by @tiankongdeguiji in #439
[bugfix] fix list-to-integer comparison in embedding sequence encoder validation by @tiankongdeguiji in #429
[bugfix] remove redundant .data access in pe_mtl_loss by @tiankongdeguiji in #430
[bugfix] clarify sample_weight fallback value in match and rank models by @tiankongdeguiji in #435
[feat] bump up pyfg to 1.0.2 by @tiankongdeguiji in #427
[bugfix] fix data_config mutation during model export by @tiankongdeguiji in #441
[feat] replace claude-code-action with direct claude -p for code review by @tiankongdeguiji in #444
[bugfix] rename loop variable to avoid shadowing builtin input() by @tiankongdeguiji in #438
[bugfix] replace deprecated torch.autograd.Variable in optimizer test by @tiankongdeguiji in #437
[bugfix] fix unclosed file handle in benchmark by @tiankongdeguiji in #431
[bugfix] fix potential socket resource leak in get_free_port by @tiankongdeguiji in #432
[bugfix] strengthen doc reviewer to cross-reference existing user-facing docs by @tiankongdeguiji in #449
[bugfix] fix contextual_seq_len not passed from preprocessor to STULayer by @tiankongdeguiji in #450
filter non grad when adding to summaries by @eric-gecheng in #448
[bugfix] fix flaky TRT test by adding allow_tf32 to predict() by @tiankongdeguiji in #456
[bugfix] suppress false-positive range validation warnings for dynamicemb features by @tiankongdeguiji in #458
[bugfix] fix sequence feature default_value inconsistency by @tiankongdeguiji in #461
[docs] add FAQ for Triton v3.6.0 WGMMA crash on Hopper GPUs by @tiankongdeguiji in #452
[bugfix] ensure predict threads are joined on exception by @tiankongdeguiji in #433
[bugfix] fix ZCH finetune from checkpoint with different world size by @tiankongdeguiji in #467
[bugfix] accept ChunkedArray in Parquet/Odps/Csv writers and ensure TDM writer close by @tiankongdeguiji in #469
[bugfix] fix NameError on sampled when TDMSampler is combined with sample_mask by @tiankongdeguiji in #468
[doc] fix dynamicemb pip install command by @tiankongdeguiji in #473
[bugfix] fix fbgemm int32 overflow during embedding quantization by @tiankongdeguiji in #472
[chore] bump pyfg to 1.0.4 by @tiankongdeguiji in #482
[bugfix] fix two-stage AOTI predict hang under multi-thread workers by @tiankongdeguiji in #484
[bugfix] share Dim across grouped-sequence tensors in legacy AOT export by @tiankongdeguiji in #485
[bugfix] add CombineFeature to SINGLE_INPUT_FEATURE_CLASSES by @tiankongdeguiji in #487
[bump] pyfg 1.0.4 -> 1.0.5; doc updates and TokenizeFeature fix by @tiankongdeguiji in #489

Full Changelog: v1.1.0...v1.2.0
