🚀 The feature
Note: To track the progress of the project check out this board.
This is the 3rd phase of TorchVision's modernization project (see phase 1 and 2). We aim to keep TorchVision relevant by ensuring it provides, off the shelf, all the primitives, model architectures and recipe utilities needed to produce SOTA results for the supported Computer Vision tasks.
1. New Primitives
To enable our users to reproduce the latest state-of-the-art research we will enhance TorchVision with the following data augmentations, layers, losses and other operators:
Data Augmentations
- AutoAugment for Detection [1, 2] - Implement AutoAugment for Detection #6224 Implement AutoAugment for Detection #6609
- Mosaic [1, 2] - Mosaic Transform #6534
- Mixup for Detection [1, 2] - New Feature: Mixup Transform for Object Detection #6720 NEW Feature: Mixup transform for Object Detection #6721
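To give a feel for what the detection variant of Mixup involves (this is an illustrative sketch, not the TorchVision transform API): two images are blended pixel-wise with a mixing ratio `lam` (usually sampled from a Beta distribution), and the box targets of both images are simply concatenated. A minimal pure-Python sketch, with images modeled as 2D lists of floats:

```python
def mixup_detection(img1, boxes1, img2, boxes2, lam):
    """Illustrative Mixup for detection (hypothetical helper, not the
    torchvision API): blend the two images pixel-wise with ratio `lam`
    and keep the union of both sets of target boxes.

    img1, img2: images as 2D lists of pixel values (same shape).
    boxes1, boxes2: lists of (x1, y1, x2, y2) tuples.
    lam: mixing ratio in [0, 1]; in practice sampled from Beta(alpha, alpha).
    """
    mixed = [
        [lam * p1 + (1 - lam) * p2 for p1, p2 in zip(row1, row2)]
        for row1, row2 in zip(img1, img2)
    ]
    # Detection targets are not "mixed": both box sets remain valid
    # annotations for the blended image, so they are concatenated.
    return mixed, boxes1 + boxes2
```

Some recipes additionally attach `lam` (or `1 - lam`) as a per-box score or loss weight; the linked issues discuss which convention to adopt.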
Losses
- Dice Loss [1, 2] - New Feature: Dice Loss #6435 Add Dice Loss #6960
- Poly Loss [1, 2] - Add support for PolyLoss in torchvision #6439 feat: Added support of Poly Loss #6457
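For reference, the soft Dice loss being proposed above reduces to `1 - 2·|P∩T| / (|P| + |T|)` over per-pixel foreground probabilities. A minimal pure-Python sketch of that formula (an illustration of the math, not the implementation from the linked PR):

```python
def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss over a flattened binary segmentation mask.

    probs: predicted foreground probabilities in [0, 1].
    targets: ground-truth mask values (0.0 or 1.0), same length as probs.
    eps: smoothing term to avoid division by zero on empty masks.
    """
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    dice = (2.0 * intersection + eps) / (denom + eps)
    # Dice is a similarity in [0, 1]; the loss is its complement.
    return 1.0 - dice
```

A perfect prediction yields a loss of (approximately) 0, while a prediction with no overlap approaches 1.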
Operators added in PyTorch Core
- LARS Optimizer [1, 2] - WIP: feat: LARS optimizer pytorch#88106
- LAMB Optimizer [1, 2] - Implementation of LAMB optimizer #6868
- Polynomial LR Scheduler [1, 2] - code - feat: add PolynomialLR scheduler pytorch#82769
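The polynomial LR schedule referenced above decays the learning rate from its base value toward an end value as `(1 - t/T)^power`. A small pure-Python sketch of that decay rule (mirroring the behavior of `torch.optim.lr_scheduler.PolynomialLR` in spirit, though the actual class operates on an optimizer rather than returning values):

```python
def polynomial_lr(base_lr, step, total_steps, power=1.0, end_lr=0.0):
    """Polynomial decay from base_lr to end_lr over total_steps.

    With power=1.0 this is plain linear decay; higher powers decay
    faster early on and flatten toward the end.
    """
    if step >= total_steps:
        return end_lr
    decay = (1.0 - step / total_steps) ** power
    return (base_lr - end_lr) * decay + end_lr
```

For example, with `base_lr=0.1`, `total_steps=100` and `power=1.0`, the rate is 0.1 at step 0, 0.05 at step 50, and 0.0 at step 100.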
2. New Architectures & Model Iterations
To ensure that our users have access to the most popular SOTA models, we will add the following architectures along with pre-trained weights:
Image Classification
- Swin Transformer V2 - Add SwinV2 in TorchVision #6242 Add SwinV2 #6246
- MobileViT v1 & v2 [1, 2] - [FEAT] Add MobileViT v1 & v2 #6404
- MaxViT - MaxVit model #6342
Video Classification
- MViTv2 [1] - Add support of MViTv2 video variants #6373
- Swin3d [1] - Port SwinTransformer3d from torchmultimodal #6499 Add Video SwinTransformer #6521
- S3D [1] - S3D feature request #6402 Add the S3D architecture to TorchVision #6412 Update S3D weights #6537
3. Improved Training Recipes & Pre-trained models
To ensure that our users have access to strong baselines and SOTA weights, we will improve our training recipes to incorporate the newly released primitives and offer improved pre-trained models:
Reference Scripts
- Update the Reference Scripts to use the latest primitives - refactor: replace LambdaLR with PolynomialLR in segmentation training script #6405 Prototype references #6433
Pre-trained weights
- Improve the accuracy of Video models
Other Candidates
There are several other Operators (#5414), Losses (#2980), Augmentations (#3817) and Models (#2707) proposed by the community. Here are some potential candidates that we could implement depending on bandwidth. Contributions are welcome for any of the following:
- YOLOX [1] - [RFC] Support YOLOX detection model #6341
- DeTR - Add DETR model #5922 NEW Feature: DeTR Model to torchvision #6922
- U-Net - [RFC] U-Net framework #6610 [DRAFT][DONT MERGE] U-net proposal #6611
- MViTv2 for Images [1]
- Video Transformer Network [1]
- MTV
- Deformable DeTR
- Shortcut Regularizer (FX-based)
- Hide-and-Seek - ‘Hide-and-Seek’ Random Masking Transform #6796