
Commit 4abeb50

Add D-FINE Model into Transformers (#36261)
* copy the last changes from broken PR
* small format
* some fixes and refactoring after review
* format
* add config attr for loss
* some fixes and refactoring
* fix copies
* fix style
* add test for d-fine resnet
* fix decoder layer prop
* fix dummies
* format init
* remove extra print
* refactor modeling, move resnet into separate folder
* fix resnet config
* change resnet to hgnet_v2, add clamp into decoder
* fix init
* fix config doc
* fix init
* fix dummies
* fix config docs
* fix hgnet_v2 config typo
* format modular
* add image classification for hgnet, some refactoring
* format tests
* fix dummies
* fix init
* fix style
* fix init for hgnet v2
* fix index.md, add init range for hgnet
* fix conversion
* add missing attr to encoder
* add loss for d-fine, add additional output for rt-detr decoder
* tests and docs fixes
* fix rt_detr v2 conversion
* some fixes for loss and decoder output
* some fixes for loss
* small fix for converted modeling
* add n model config, some todo comments for modular
* convert script adjustments and fixes, small refactor
* remove extra output for rt_detr
* make some outputs optional, fix conversion
* some post merge fixes
* small fix
* last field fix
* fix not split for hgnet_v2
* disable parallelism test for hgnet_v2 image classification
* skip multi gpu for d-fine
* adjust after merge init
* remove extra comment
* fix repo name references
* small fixes for tests
* Fix checkpoint path
* Fix consistency
* Fixing docs

---------

Co-authored-by: Pavel Iakubovskii <[email protected]>
1 parent 4602059 commit 4abeb50

24 files changed: +7711 −4 lines

docs/source/en/_toctree.yml (+4)

```diff
@@ -499,6 +499,8 @@
         title: Helium
       - local: model_doc/herbert
         title: HerBERT
+      - local: model_doc/hgnet_v2
+        title: HGNet-V2
       - local: model_doc/ibert
         title: I-BERT
       - local: model_doc/jamba
@@ -691,6 +693,8 @@
         title: ConvNeXTV2
       - local: model_doc/cvt
         title: CvT
+      - local: model_doc/d_fine
+        title: D-FINE
       - local: model_doc/dab-detr
         title: DAB-DETR
       - local: model_doc/deformable_detr
```

docs/source/en/model_doc/d_fine.md (+76)
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# D-FINE

## Overview

The D-FINE model was proposed in [D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement](https://arxiv.org/abs/2410.13842) by
Yansong Peng, Hebei Li, Peixi Wu, Yueyi Zhang, Xiaoyan Sun, and Feng Wu.

The abstract from the paper is the following:

*We introduce D-FINE, a powerful real-time object detector that achieves outstanding localization precision by redefining the bounding box regression task in DETR models. D-FINE comprises two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD).
FDR transforms the regression process from predicting fixed coordinates to iteratively refining probability distributions, providing a fine-grained intermediate representation that significantly enhances localization accuracy. GO-LSD is a bidirectional optimization strategy that transfers localization knowledge from refined distributions to shallower layers through self-distillation, while also simplifying the residual prediction tasks for deeper layers. Additionally, D-FINE incorporates lightweight optimizations in computationally intensive modules and operations, achieving a better balance between speed and accuracy. Specifically, D-FINE-L / X achieves 54.0% / 55.8% AP on the COCO dataset at 124 / 78 FPS on an NVIDIA T4 GPU. When pretrained on Objects365, D-FINE-L / X attains 57.1% / 59.3% AP, surpassing all existing real-time detectors. Furthermore, our method significantly enhances the performance of a wide range of DETR models by up to 5.3% AP with negligible extra parameters and training costs. Our code and pretrained models: this https URL.*
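To make the FDR idea concrete, here is a minimal, self-contained sketch (not the library's internal code) of how a per-edge probability distribution over discrete offset bins can replace direct coordinate regression. The bin count and offset range below are illustrative assumptions, not D-FINE's actual hyperparameters:

```python
import torch

# Illustrative FDR-style decoding: each box edge gets a distribution over
# candidate offset bins, and the refined offset is the expectation of that
# distribution rather than a single regressed coordinate.
num_bins = 32                                      # assumed bin count (illustrative)
bin_values = torch.linspace(-0.5, 0.5, num_bins)   # assumed candidate offsets per bin

logits = torch.randn(4, num_bins)                  # one row per edge: left, top, right, bottom
probs = logits.softmax(dim=-1)                     # probability distribution per edge
edge_offsets = (probs * bin_values).sum(dim=-1)    # expectation -> fine-grained offset, shape (4,)
```

Because the offsets come from full distributions rather than point estimates, deeper decoder layers can refine the distributions themselves, which is the localization knowledge GO-LSD distills back into shallower layers.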
This model was contributed by [VladOS95-cyber](https://github.com/VladOS95-cyber).
The original code can be found [here](https://github.com/Peterande/D-FINE).

## Usage tips

```python
>>> import torch
>>> from transformers.image_utils import load_image
>>> from transformers import DFineForObjectDetection, AutoImageProcessor

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = load_image(url)

>>> image_processor = AutoImageProcessor.from_pretrained("ustc-community/dfine_x_coco")
>>> model = DFineForObjectDetection.from_pretrained("ustc-community/dfine_x_coco")

>>> inputs = image_processor(images=image, return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> results = image_processor.post_process_object_detection(outputs, target_sizes=[(image.height, image.width)], threshold=0.5)

>>> for result in results:
...     for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
...         score, label = score.item(), label_id.item()
...         box = [round(i, 2) for i in box.tolist()]
...         print(f"{model.config.id2label[label]}: {score:.2f} {box}")
cat: 0.96 [344.49, 23.4, 639.84, 374.27]
cat: 0.96 [11.71, 53.52, 316.64, 472.33]
remote: 0.95 [40.46, 73.7, 175.62, 117.57]
sofa: 0.92 [0.59, 1.88, 640.25, 474.74]
remote: 0.89 [333.48, 77.04, 370.77, 187.3]
```
## DFineConfig

[[autodoc]] DFineConfig

## DFineModel

[[autodoc]] DFineModel
    - forward

## DFineForObjectDetection

[[autodoc]] DFineForObjectDetection
    - forward
docs/source/en/model_doc/hgnet_v2.md (+46)
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,46 @@
1+
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
2+
3+
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4+
the License. You may obtain a copy of the License at
5+
6+
http://www.apache.org/licenses/LICENSE-2.0
7+
8+
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9+
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10+
specific language governing permissions and limitations under the License.
11+
12+
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
13+
rendered properly in your Markdown viewer.
14+
15+
-->
16+
17+
# HGNet-V2
18+
19+
## Overview
20+
21+
A HGNet-V2 (High Performance GPU Net) image classification model.
22+
HGNet arhtictecture was proposed in [HGNET: A Hierarchical Feature Guided Network for Occupancy Flow Field Prediction](https://arxiv.org/abs/2407.01097) by
23+
Zhan Chen, Chen Tang, Lu Xiong
24+
25+
The abstract from the HGNET paper is the following:
26+
27+
*Predicting the motion of multiple traffic participants has always been one of the most challenging tasks in autonomous driving. The recently proposed occupancy flow field prediction method has shown to be a more effective and scalable representation compared to general trajectory prediction methods. However, in complex multi-agent traffic scenarios, it remains difficult to model the interactions among various factors and the dependencies among prediction outputs at different time steps. In view of this, we propose a transformer-based hierarchical feature guided network (HGNET), which can efficiently extract features of agents and map information from visual and vectorized inputs, modeling multimodal interaction relationships. Second, we design the Feature-Guided Attention (FGAT) module to leverage the potential guiding effects between different prediction targets, thereby improving prediction accuracy. Additionally, to enhance the temporal consistency and causal relationships of the predictions, we propose a Time Series Memory framework to learn the conditional distribution models of the prediction outputs at future time steps from multivariate time series. The results demonstrate that our model exhibits competitive performance, which ranks 3rd in the 2024 Waymo Occupancy and Flow Prediction Challenge.*
28+
29+
This model was contributed by [VladOS95-cyber](https://github.com/VladOS95-cyber).
30+
The original code can be found [here](https://github.com/PaddlePaddle/PaddleDetection/blob/develop/ppdet/modeling/backbones/hgnet_v2.py).
31+
32+
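A minimal usage sketch for image classification, mirroring the D-FINE example above. The checkpoint identifier below is a placeholder assumption, not a confirmed published checkpoint name:

```python
import torch
from transformers import AutoImageProcessor, HGNetV2ForImageClassification
from transformers.image_utils import load_image

# Hypothetical checkpoint name used for illustration; substitute a real HGNet-V2 checkpoint.
checkpoint = "ustc-community/hgnet-v2"

image = load_image("http://images.cocodataset.org/val2017/000000039769.jpg")

image_processor = AutoImageProcessor.from_pretrained(checkpoint)
model = HGNetV2ForImageClassification.from_pretrained(checkpoint)

inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring logit back to a class label.
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
```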
## HGNetV2Config

[[autodoc]] HGNetV2Config

## HGNetV2Backbone

[[autodoc]] HGNetV2Backbone
    - forward

## HGNetV2ForImageClassification

[[autodoc]] HGNetV2ForImageClassification
    - forward
