Integration of dinov3, including optional train and inference scripts #324
S-Mahoney wants to merge 7 commits into roboflow:develop
Conversation
I have read the CLA Document and I sign the CLA. |
Changing the backbone to DINOv3 gives much better object detection, though there might be a latency issue...
I'd add the option to use the ConvNeXt variant. Still, I think the main issue is the license of the DINOv3 model.
Hey there, I wonder: do you just load the weights of dinov3_convnext_tiny, or do you load a different model's weights?
Hi @S-Mahoney, thank you for the work! Could you sync the pull request to include the newest changes from the main repo? |
Pull request overview
This pull request integrates DINOv3 (the latest version of Meta's DINO vision transformer) as an alternative backbone encoder into the RF-DETR object detection framework. The integration allows users to choose between DINOv2 and DINOv3 encoders through configuration, with support for loading weights from either HuggingFace Hub or local PyTorch Hub repositories.
Changes:
- Added DINOv3 wrapper class with flexible weight loading (HuggingFace or local repo)
- Extended configuration system to support DINOv3 encoder variants (small, base, large) with automatic parameter validation and adjustment
- Improved training engine with better AMP handling and gradient context management
- Added example training and inference scripts demonstrating v2/v3 encoder selection
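As a quick usage sketch of the encoder selection (the parameter and alias names here are assumptions based on the PR description, not a confirmed API):

```python
# Hypothetical usage sketch: choosing the encoder through configuration.
# The exact encoder strings accepted by the final API are not confirmed here.
from rfdetr import RFDETRBase

model_v2 = RFDETRBase(encoder="dinov2_windowed_small")  # existing DINOv2 path
model_v3 = RFDETRBase(encoder="dinov3_small")           # new DINOv3 path from this PR
```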
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 31 comments.
| File | Description |
|---|---|
| rfdetr/models/backbone/dinov3.py | New DINOv3 wrapper implementing multi-path forward logic with HF and PyTorch Hub support |
| rfdetr/models/backbone/backbone.py | Extended to branch between dinov2 and dinov3 encoders based on the model name prefix (sketched below) |
| rfdetr/models/backbone/dinov3_configs/*.json | Configuration files for the dinov3 small/base/large model architectures |
| rfdetr/config.py | Added the EncoderName type, DINOv3 config fields, and validators for automatic parameter adjustment |
| rfdetr/engine.py | Updated AMP context manager logic and replaced inference_mode with no_grad for interpolation |
| rfdetr/train_v2_or_v3.py | New training script with encoder aliasing and environment-based configuration |
| rfdetr/inference_test.py | New demo script for testing DINOv2/v3 inference with URL-based image input |
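For orientation, the prefix-based branching described for backbone.py might look roughly like this (a minimal sketch; the function, module, and class names are illustrative, not the PR's actual code):

```python
# Minimal sketch of prefix-based encoder branching (names are illustrative).
def build_encoder(encoder_name: str, **kwargs):
    if encoder_name.startswith("dinov3"):
        from rfdetr.models.backbone.dinov3 import DinoV3  # new wrapper in this PR
        return DinoV3(name=encoder_name, **kwargs)
    from rfdetr.models.backbone.dinov2 import DinoV2  # existing DINOv2 path
    return DinoV2(name=encoder_name, **kwargs)
```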
```python
torch.set_grad_enabled(True)  # safety
with torch.inference_mode(False):
    with autocast(**get_autocast_args(args)):
        outputs = model(new_samples, new_targets)
        loss_dict = criterion(outputs, new_targets)
        weight_dict = criterion.weight_dict
        losses = sum(
            (1 / args.grad_accum_steps) * loss_dict[k] * weight_dict[k]
            for k in loss_dict.keys()
            if k in weight_dict
        )
```
The indentation of the autocast block and its contents appears to be changed. While the torch.inference_mode(False) wrapper was added for safety, the call to torch.set_grad_enabled(True) at line 133 is redundant since torch.inference_mode(False) already ensures gradients are enabled. Consider removing the torch.set_grad_enabled(True) call to simplify the code.
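The simplification the review proposes would look roughly like this (a sketch reusing the names from the snippet above):

```python
# Per the review: inference_mode(False) already guarantees gradient tracking
# here, so the separate set_grad_enabled(True) call can be dropped.
with torch.inference_mode(False):
    with autocast(**get_autocast_args(args)):
        outputs = model(new_samples, new_targets)
        loss_dict = criterion(outputs, new_targets)
```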
```diff
 num_select: int = 300
 projector_scale: List[Literal["P3", "P4", "P5"]] = ["P4"]
-out_feature_indexes: List[int] = [2, 5, 8, 11]
+out_feature_indexes: List[int] = [2, 4, 5, 9]
```
The default out_feature_indexes is changed from [2, 5, 8, 11] to [2, 4, 5, 9], but there's a validator at lines 101-108 that forces it to [8, 11] when using dinov3 encoders. This creates inconsistency and makes it unclear what the actual indexes will be. Consider documenting why these specific indexes were chosen and whether the default should be different for v2 vs v3.
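One way to resolve that ambiguity, sketched with illustrative names, is to keep explicit per-encoder defaults rather than a single default that a validator later overrides:

```python
# Illustrative sketch: per-encoder defaults make the effective indexes
# visible at the definition site instead of inside a validator.
DEFAULT_OUT_FEATURE_INDEXES = {
    "dinov2": [2, 5, 8, 11],  # existing default
    "dinov3": [8, 11],        # what the validator currently forces
}

def default_out_feature_indexes(encoder_name: str) -> list[int]:
    family = "dinov3" if encoder_name.startswith("dinov3") else "dinov2"
    return list(DEFAULT_OUT_FEATURE_INDEXES[family])
```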
```python
if cand is not None:
    if torch.is_tensor(cand) and cand.dim() == 4:
        # Already a spatial map [B, C, Hp, Wp]
        # Repeat to match requested out_feature_indexes count
        C = cand.shape[1]
        if C != self.hidden_size:
            self.hidden_size = int(C)
            self._out_feature_channels = [self.hidden_size] * len(self.out_feature_indexes)
        return [cand for _ in self.out_feature_indexes]
    # Otherwise assume tokens
    tokens = cand
    # If [HW, C] or [B*HW, C], _tokens_to_map will handle reshape
    feats = [self._tokens_to_map(tokens, B, H, W) for _ in self.out_feature_indexes]
    return feats
```
In the forward_features fallback path, when no suitable candidate is found in the dictionary (line 179), the code falls through without raising an error. This could lead to unexpected behavior. Consider raising a more informative error if cand remains None after attempting to extract features from the dictionary.
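A guard along the lines the review suggests might look like this (a sketch; the output-dict variable name is an assumption):

```python
# Hypothetical guard for the fallback path: fail loudly instead of silently
# falling through when no usable tensor is found in the backbone output dict.
if cand is None:
    raise RuntimeError(
        "forward_features: no spatial map or token tensor found in backbone "
        f"output; available keys: {list(out_dict.keys())}"
    )
```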
```python
# auto-fit out_feature_indexes to avoid projector shape mismatches
@field_validator("out_feature_indexes", mode="after")
def _coerce_out_feats_for_backbone(cls, v, info: ValidationInfo):
```
Normal methods should have 'self', rather than 'cls', as their first parameter.
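Worth noting: pydantic v2's field_validator implicitly turns the decorated function into a classmethod, so cls is in fact the conventional first parameter here. The usual way to make that explicit (and quiet linters like this one) is an explicit @classmethod decorator, as in this minimal sketch (the class and field shown are illustrative):

```python
from typing import List
from pydantic import BaseModel, ValidationInfo, field_validator

class ModelConfig(BaseModel):
    out_feature_indexes: List[int] = [2, 5, 8, 11]

    # auto-fit out_feature_indexes to avoid projector shape mismatches
    @field_validator("out_feature_indexes", mode="after")
    @classmethod
    def _coerce_out_feats_for_backbone(cls, v: List[int], info: ValidationInfo) -> List[int]:
        return v  # adjustment logic elided in this sketch
```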
- Remove leftover ChatGPT appendix (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
- Remove unused import 'Field' (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
- Remove unused import 'List' (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
- Remove unused import 'platform' (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
- Remove unused import 'nullcontext' (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
- Remove commented-out code clutter (Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>)
Force-pushed from 60b16c1 to 523f9df.
Description
Integration of the DINOv3 wrapper into the RF-DETR pipeline.
Included are scripts that allow training with both the v2 and v3 encoders using rfdetr, as well as inference test scripts.
Because DINOv3 was only recently released, access to its weights is gated: set your HUGGINGFACE_HUB_TOKEN for private Hub access (requires permission), or request access to the DINOv3 weights from Meta and clone the DINOv3 repo locally.
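The two access paths might be exercised roughly like this (a sketch; the entrypoint name and file paths are illustrative, and the DINOv3 hub entrypoint signature should be checked against Meta's repo):

```python
import os
import torch

# Path 1: gated HuggingFace Hub access (token must have DINOv3 permission).
os.environ["HUGGINGFACE_HUB_TOKEN"] = "hf_..."  # do not hard-code in real use

# Path 2: local clone of Meta's DINOv3 repo, loaded via torch.hub.
model = torch.hub.load(
    "/path/to/dinov3",                  # local clone of the DINOv3 repo
    "dinov3_vits16",                    # hub entrypoint name (illustrative)
    source="local",
    weights="/path/to/checkpoint.pth",  # checkpoint obtained from Meta
)
```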
Type of change
How has this change been tested? Please provide a test case or example of how you tested the change.
Tested via the supplied training and inference scripts. The change still allows activation and use of the v2 encoder while also enabling v3. Further training of the pre-trained weights is required for inference quality to be on par with the DINOv2 pre-trained RF-DETR.
Any specific deployment considerations
Licensing requirements need to be checked by the RF-DETR owners before this branch is deployed.
I have read the CLA Document and I sign the CLA.