Soft-RFDETR: An optional Mixture of Experts Module for RF-DETR #245
LeosCtrt wants to merge 5 commits into roboflow:develop
Conversation
We are leveraging a pretrained backbone, so without robust pretrained weights we can't use a different architecture.

Additionally, have you looked at TensorRT (TRT) speed for this? Many of these types of architectures actually end up being meaningfully slower in terms of latency when compiled, due to memory bottlenecks.
Pull request overview
This PR adds an optional Soft Mixture of Experts (MoE) architecture to RF-DETR, based on the paper "From Sparse to Soft Mixtures of Experts". The implementation allows users to optionally enable MoE in the transformer decoder layers to potentially improve model accuracy, though with increased computational cost.
Changes:
- Added a new `FFNBlock` class to encapsulate the feedforward network logic needed by the MoE layer
- Modified `TransformerDecoderLayer` to conditionally use either a standard FFN or a MoE-wrapped FFN based on configuration
- Added `MoE` and `MoE_params` configuration fields to `ModelConfig`
- Added `soft_moe` as a new required dependency
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 19 comments.
| File | Description |
|---|---|
| rfdetr/models/transformer.py | Adds FFNBlock class and modifies TransformerDecoderLayer to support optional MoE architecture with conditional FFN replacement |
| rfdetr/config.py | Adds MoE configuration fields (MoE boolean flag and MoE_params list) to ModelConfig |
| pyproject.toml | Adds soft_moe library as a required dependency |
```python
from soft_moe import SoftMoELayerWrapper
import torch.nn.functional as F
from torch import nn, Tensor
```
The import statement for soft_moe should be placed after the standard library imports and before the local imports, following PEP 8 import order conventions. Currently it's between torch and torch.nn.functional imports. Additionally, since this is a conditional dependency (only used when MoE=True), consider either: 1) Making it an optional dependency in pyproject.toml and importing it conditionally within the init method where it's used, or 2) Documenting that soft_moe is required for MoE functionality.
Suggested change:
```diff
-from soft_moe import SoftMoELayerWrapper
 import torch.nn.functional as F
 from torch import nn, Tensor
+from soft_moe import SoftMoELayerWrapper
```
```python
class FFNBlock(nn.Module):
    def __init__(self, d_model, dim_feedforward, dropout):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(dim_feedforward, d_model),
        )

    def forward(self, x):
        return self.net(x)
```
The FFNBlock class is missing mandatory type hints for all parameters and the return type, as required by the coding guidelines. Additionally, it lacks a Google-style docstring explaining its purpose, parameters, and return value. This is required for all new classes according to CONTRIBUTING.md.
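A minimal sketch of what that could look like, with type hints and a Google-style docstring added to the FFNBlock shown above (this rewrite is illustrative, not part of the PR):

```python
from torch import nn, Tensor


class FFNBlock(nn.Module):
    """Feedforward block used as the expert layer inside the Soft MoE wrapper.

    Args:
        d_model: Dimension of the input and output embeddings.
        dim_feedforward: Hidden dimension of the feedforward network.
        dropout: Dropout probability applied after the activation.
    """

    def __init__(self, d_model: int, dim_feedforward: int, dropout: float) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(dim_feedforward, d_model),
        )

    def forward(self, x: Tensor) -> Tensor:
        """Apply the feedforward network and return a tensor of the same shape as `x`."""
        return self.net(x)
```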
```python
                 skip_self_attn=False, MoE=False, MoE_params=[32,1]):
        super().__init__()
```
Using mutable default arguments (list) is a Python anti-pattern that can lead to unexpected behavior. The default value MoE_params=[32,1] should be replaced with None and then initialized inside the function.
Suggested change:
```diff
-                 skip_self_attn=False, MoE=False, MoE_params=[32,1]):
+                 skip_self_attn=False, MoE=False, MoE_params=None):
         super().__init__()
+        if MoE_params is None:
+            MoE_params = [32, 1]
```
```python
        tgt = tgt + self.dropout2(tgt2)
        tgt = self.norm2(tgt)
        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
        if self.MoE == True:
```
Comparing boolean values with == True is not Pythonic. Use if self.MoE: instead of if self.MoE == True:.
Suggested change:
```diff
-        if self.MoE == True:
+        if self.MoE:
```
```python
    group_detr: int = 13
    gradient_checkpointing: bool = False
    MoE: bool = False
    MoE_params: List[int] = [32, 1]
```
Using a mutable default value (list) in Pydantic model fields can lead to unexpected behavior. Use Field(default_factory=lambda: [32, 1]) instead of MoE_params: List[int] = [32, 1] to ensure each instance gets a fresh list.
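A minimal sketch of the relevant fields under that suggestion, assuming ModelConfig is a Pydantic BaseModel as the PR overview indicates (other fields omitted):

```python
from typing import List

from pydantic import BaseModel, Field


class ModelConfig(BaseModel):
    group_detr: int = 13
    gradient_checkpointing: bool = False
    MoE: bool = False
    # default_factory ensures every ModelConfig instance gets its own fresh list
    MoE_params: List[int] = Field(default_factory=lambda: [32, 1])
```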
```python
from typing import Optional

import torch
from soft_moe import SoftMoELayerWrapper
```
The soft_moe dependency is imported unconditionally at the module level, but it's only used when MoE is enabled. This means the library will be required even for users who don't use the MoE feature. Consider making this import conditional to avoid forcing all users to install this dependency. One approach is to import it inside the __init__ method of TransformerDecoderLayer when MoE=True, with a helpful error message if the import fails directing users to install it.
Suggested change:
```diff
-from soft_moe import SoftMoELayerWrapper
+try:
+    from soft_moe import SoftMoELayerWrapper
+except ImportError:  # soft_moe is optional and only needed for MoE
+    class SoftMoELayerWrapper:  # type: ignore[no-redef]
+        """Placeholder wrapper used when soft_moe is not installed.
+
+        This is instantiated only if MoE functionality is enabled. If you hit this
+        error, install the optional `soft_moe` dependency:
+
+            pip install soft-moe
+        """
+
+        def __init__(self, *args, **kwargs) -> None:
+            raise ImportError(
+                "soft_moe is required for MoE support in TransformerDecoderLayer. "
+                "Please install the optional dependency with:\n\n"
+                "    pip install soft-moe\n"
+            )
```
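An alternative sketch of the lazy-import approach mentioned in the comment above, as a hypothetical helper (not part of the PR) that defers the import until MoE is actually requested:

```python
def _import_soft_moe_wrapper():
    """Hypothetical helper: import SoftMoELayerWrapper only when MoE is enabled."""
    try:
        from soft_moe import SoftMoELayerWrapper
    except ImportError as exc:
        raise ImportError(
            "MoE=True requires the optional `soft_moe` package. "
            "Install it with: pip install soft-moe"
        ) from exc
    return SoftMoELayerWrapper
```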
```python
        if self.MoE == True:
            print("\n" + "="*80)
            print("Loading Mixture of Expert (MoE) Architecture")
            print("="*80)
            print(f"Experts Count : {MoE_params[0]}")
            print(f"Slots per Expert : {MoE_params[1]}")
            print("-"*80)
            print("Warning: This custom architecture prevents loading full pretrained weights.")
            print("Note : It may be slightly slower but could improve accuracy.")
            print("="*80 + "\n")

            self.moe_layer = SoftMoELayerWrapper(
                dim=d_model,
                num_experts=MoE_params[0],
                slots_per_expert=MoE_params[1],
                layer=FFNBlock,
                d_model=d_model,
                dim_feedforward=dim_feedforward,
                dropout=dropout
            )
```
There's no validation for the MoE_params parameter. The code assumes it's a list with exactly 2 elements and directly accesses MoE_params[0] and MoE_params[1]. If the user provides an invalid value (e.g., empty list, single element, non-list), this will raise an IndexError. Add validation to ensure MoE_params contains exactly 2 positive integers before using them.
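A minimal validation sketch along those lines, as a hypothetical helper (not part of the PR):

```python
def _validate_moe_params(MoE_params) -> None:
    """Hypothetical check: MoE_params must be [num_experts, slots_per_expert]."""
    if (
        not isinstance(MoE_params, (list, tuple))
        or len(MoE_params) != 2
        or not all(isinstance(v, int) and v > 0 for v in MoE_params)
    ):
        raise ValueError(
            "MoE_params must contain exactly two positive integers "
            f"[num_experts, slots_per_expert], e.g. [32, 1]; got {MoE_params!r}"
        )
```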
```diff
                 decoder_norm_type='LN',
-                 bbox_reparam=False):
+                 bbox_reparam=False,
+                 MoE=False, MoE_params=[32,1]):
```
Inconsistent spacing in list literal. The default value [32,1] should have spaces after commas as per PEP 8 style guide: [32, 1]. This applies to all occurrences in the code.
Suggested change:
```diff
-                 MoE=False, MoE_params=[32,1]):
+                 MoE=False, MoE_params=[32, 1]):
```
```python
            self.moe_layer = SoftMoELayerWrapper(
                dim=d_model,
                num_experts=MoE_params[0],
                slots_per_expert=MoE_params[1],
                layer=FFNBlock,
                d_model=d_model,
                dim_feedforward=dim_feedforward,
                dropout=dropout
            )
```
The MoE implementation will significantly increase memory usage and model parameters. With the default num_experts=32, the FFN layer is replicated 32 times per decoder layer. For a 3-layer decoder, this means 96 expert FFN blocks instead of 3 regular FFN blocks. This should be documented in the docstring or config to help users understand the memory/compute tradeoffs. Consider adding a warning or documentation about the expected memory increase based on the number of experts.
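A rough sketch of the kind of warning that could accompany this, as a hypothetical helper (parameter counts cover only the two Linear layers of one FFNBlock, ignoring the Soft MoE routing parameters):

```python
import warnings


def warn_moe_param_cost(d_model: int, dim_feedforward: int, num_experts: int) -> None:
    """Hypothetical helper: estimate the per-decoder-layer parameter cost of MoE."""
    # One FFNBlock = Linear(d_model, dim_feedforward) + Linear(dim_feedforward, d_model),
    # counting both weights and biases.
    ffn_params = (d_model * dim_feedforward + dim_feedforward) + (dim_feedforward * d_model + d_model)
    warnings.warn(
        f"MoE enabled: each decoder layer holds {num_experts} expert FFNs "
        f"(~{num_experts * ffn_params:,} FFN parameters vs ~{ffn_params:,} for a single FFN)."
    )
```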
| "pydantic", | ||
| "supervision", | ||
| "matplotlib", | ||
| "soft_moe", |
The new dependency soft_moe is added without any version pinning, meaning builds will always pull the latest mutable release from an external third-party package, which increases the risk of a malicious or compromised update being pulled into your supply chain. An attacker who gains control over the soft_moe package or its distribution channel could execute arbitrary code in your environment with the application's privileges. To reduce this risk, pin soft_moe to a specific, vetted version (or hash) and update it intentionally after review.
Description
Hello, I propose here an optional Soft Mixture of Experts implementation for RF-DETR. The idea comes from the article From Sparse to Soft Mixtures of Experts, and it usually seems to improve the overall performance of a model, so I thought it would be a nice option to add.
As the article suggests, I set the default number of slots per expert to 1.
In order to add this option we would need to add the Python library `soft_moe`. This library offers a good Soft MoE wrapper; the source code is available here.
Type of change
How has this change been tested? Please provide a test case or an example of how you tested the change.
I tested it on a small private dataset and it improved accuracy a bit (roughly a 1% mAP gain, so a small gain but still a gain). I am not able to test it on the full COCO dataset, but I expect it would also improve the overall accuracy somewhat.
As expected, the MoE RF-DETR model is slightly slower than the basic one, but still very fast.